You can learn a function that embeds diffs containing vulnerability A near each other, diffs containing vulnerability B near each other, and so on, which is much more efficient than asking an LLM about hundreds of chunks one at a time.
Maybe you even use the LLM to find vulnerable snippets at the beginning, but a multi-class classifier or embedding model will be way better at runtime.
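As a rough sketch of what that runtime path could look like, assuming you already have a trained embedding model and a pre-built index of labeled vulnerable diffs (the `embed_diff` function and the similarity threshold here are hypothetical), classification becomes a nearest-neighbor lookup instead of a per-chunk LLM call:

```python
import numpy as np

# Assumptions: embed_diff(text) -> np.ndarray is your trained embedding model;
# index_vecs is an (N, d) array of L2-normalized embeddings of known-vulnerable
# diffs; index_labels is a length-N list of vulnerability classes aligned with it.

def classify_diff(diff_text, index_vecs, index_labels, embed_diff, threshold=0.8):
    """Label a diff by its nearest labeled neighbor in embedding space."""
    q = embed_diff(diff_text)
    q = q / np.linalg.norm(q)     # normalize so dot product = cosine similarity
    sims = index_vecs @ q         # (N,) similarity to every known example
    best = int(np.argmax(sims))
    if sims[best] < threshold:    # nothing close enough: treat as benign/unknown
        return None
    return index_labels[best]
```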
Perhaps you can learn such a function, but it may be hard to learn a suitable embedding space directly, so it makes sense to lean on the more general capabilities of an LLM (perhaps fine-tuned and distilled for more efficiency).
In principle, there is no reason an LLM should do better than a more focused model, and plenty of reasons it will do worse. You're wasting a ton of parameters memorizing the capital of France and what the powerhouse of the cell is.
If data is the issue, you can probably even generate vulnerabilities to create a synthetic dataset.
I've thought about this and am very interested in this problem. Specifically: how can you efficiently come up with a kernel function that adapts a "classic" embedding space to a specific ranking problem?
With enough data you could train a classic ML model, or you could keep the LLM in the inference pipeline, but is there another way?
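One possible middle ground, sketched below under assumptions: keep the off-the-shelf embedding model frozen and learn only a small bilinear kernel k(x, y) = (Wx)·(Wy) on top of it with a margin ranking loss, so W is the only thing trained and the LLM drops out of the inference path. All names and sizes here are illustrative:

```python
import torch
import torch.nn as nn

class BilinearKernel(nn.Module):
    """Learned kernel k(x, y) = (Wx) . (Wy) over a frozen embedding space."""
    def __init__(self, dim, rank=64):
        super().__init__()
        self.proj = nn.Linear(dim, rank, bias=False)  # W is the only trainable part

    def forward(self, x, y):
        return (self.proj(x) * self.proj(y)).sum(dim=-1)

kernel = BilinearKernel(dim=768)                  # 768: assumed base embedding size
opt = torch.optim.Adam(kernel.parameters(), lr=1e-3)

def ranking_step(query, pos, neg, margin=1.0):
    """Margin ranking step: relevant `pos` should outscore irrelevant `neg` for `query`."""
    loss = torch.clamp(margin - (kernel(query, pos) - kernel(query, neg)), min=0).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because `query`, `pos`, and `neg` are precomputed frozen embeddings, this needs far less data and compute than fine-tuning the base model.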
1. Train an embedding model that forces "similar" inputs close together using triplet loss (first sketch after this list). Here "similar" can mean anything, but you would probably want to mark diffs containing the same vulnerability as similar.
2. If you have a fixed set of N vulnerabilities, you can train a multi-class classifier (second sketch below). Of course, it's a pain in the ass to add a new class later on.
3. For any particular vulnerability, you could train a ranking model using hinge loss (third sketch below). This is what most industrial ranking and recommendation systems do.
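A minimal PyTorch sketch of option 1, assuming some `base_encoder` that maps a batch of diffs to vectors (everything here is illustrative, not a drop-in implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffEmbedder(nn.Module):
    """Projection head over a base encoder, trained so same-vulnerability diffs cluster."""
    def __init__(self, base_encoder, base_dim, out_dim=128):
        super().__init__()
        self.base = base_encoder             # assumed: maps a batch of diffs to (B, base_dim)
        self.head = nn.Linear(base_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.head(self.base(x)), dim=-1)  # unit sphere: cosine geometry

triplet_loss = nn.TripletMarginLoss(margin=0.2)

def triplet_step(model, opt, anchor, positive, negative):
    # anchor and positive share a vulnerability class; negative does not
    loss = triplet_loss(model(anchor), model(positive), model(negative))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```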
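Option 2 is mostly just a linear softmax head over the same embeddings; sizes are assumed for illustration, and note that adding class N+1 later means resizing and retraining this head:

```python
import torch
import torch.nn as nn

EMB_DIM, NUM_CLASSES = 768, 12   # assumed sizes for illustration

# Softmax head over a fixed set of vulnerability classes, on top of frozen embeddings.
classifier = nn.Linear(EMB_DIM, NUM_CLASSES)
criterion = nn.CrossEntropyLoss()
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def clf_step(embeddings, labels):
    """One training step: embeddings (B, EMB_DIM), labels (B,) in [0, NUM_CLASSES)."""
    loss = criterion(classifier(embeddings), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```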
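And a sketch of option 3: one per-vulnerability scorer trained with a pairwise hinge loss, so diffs containing that vulnerability rank above clean ones. Again, dimensions are assumed:

```python
import torch
import torch.nn as nn

EMB_DIM = 768   # assumed embedding size

# One scorer per vulnerability: s(x) should rank affected diffs above clean ones.
scorer = nn.Sequential(nn.Linear(EMB_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

def hinge_step(vuln_emb, clean_emb, margin=1.0):
    """Pairwise hinge: penalize clean diffs scoring within `margin` of vulnerable ones."""
    s_pos = scorer(vuln_emb).squeeze(-1)
    s_neg = scorer(clean_emb).squeeze(-1)
    loss = torch.clamp(margin - (s_pos - s_neg), min=0).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```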