You can learn a function that embeds diffs containing vulnerability A near each other, diffs containing vulnerability B near each other, and so on, which is much more efficient than asking an LLM about hundreds of chunks one at a time.
Maybe you even use the LLM to find vulnerable snippets at the beginning, but a multi-class classifier or embedding model will be way better at runtime.
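As a rough sketch of what that runtime path could look like, assuming you already have a trained embedding model and a pre-built index of labeled vulnerable diffs (the `embed_diff` function and the similarity threshold here are hypothetical), classification becomes a nearest-neighbor lookup instead of a per-chunk LLM call:

```python
import numpy as np

# Assumptions: embed_diff(text) -> np.ndarray is your trained embedding model;
# index_vecs is an (N, d) array of L2-normalized embeddings of known-vulnerable
# diffs; index_labels is a length-N list of vulnerability classes aligned with it.

def classify_diff(diff_text, index_vecs, index_labels, embed_diff, threshold=0.8):
    """Label a diff by its nearest labeled neighbor in embedding space."""
    q = embed_diff(diff_text)
    q = q / np.linalg.norm(q)     # normalize so dot product = cosine similarity
    sims = index_vecs @ q         # (N,) similarity to every known example
    best = int(np.argmax(sims))
    if sims[best] < threshold:    # nothing close enough: treat as benign/unknown
        return None
    return index_labels[best]
```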
Perhaps you can learn such a function, but it may be hard to learn a suitable embedding space directly, so it makes sense to lean on the more general capabilities of an LLM (perhaps fine-tuned and distilled for more efficiency).
In principle, there is no reason an LLM should do better than a more focused model, and plenty of reasons it will do worse. You're wasting a ton of parameters memorizing the capital of France and what the powerhouse of the cell is.
If data is the issue, you can probably even generate vulnerabilities to create a synthetic dataset.
I've thought about this and am very interested in this problem. Specifically: how can you efficiently come up with a kernel function that adapts a "classic" embedding space to a specific ranking problem?
With enough data you could train a classic ML model, or you could keep the LLM in the inference pipeline, but is there another way?
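One possible middle ground, sketched below under assumptions: keep the off-the-shelf embedding model frozen and learn only a small bilinear kernel k(x, y) = (Wx)·(Wy) on top of it with a margin ranking loss, so W is the only thing trained and the LLM drops out of the inference path. All names and sizes here are illustrative:

```python
import torch
import torch.nn as nn

class BilinearKernel(nn.Module):
    """Learned kernel k(x, y) = (Wx) . (Wy) over a frozen embedding space."""
    def __init__(self, dim, rank=64):
        super().__init__()
        self.proj = nn.Linear(dim, rank, bias=False)  # W is the only trainable part

    def forward(self, x, y):
        return (self.proj(x) * self.proj(y)).sum(dim=-1)

kernel = BilinearKernel(dim=768)                  # 768: assumed base embedding size
opt = torch.optim.Adam(kernel.parameters(), lr=1e-3)

def ranking_step(query, pos, neg, margin=1.0):
    """Margin ranking step: relevant `pos` should outscore irrelevant `neg` for `query`."""
    loss = torch.clamp(margin - (kernel(query, pos) - kernel(query, neg)), min=0).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because `query`, `pos`, and `neg` are precomputed frozen embeddings, this needs far less data and compute than fine-tuning the base model.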
1. Train an embedding model that forces "similar" inputs close together using triplet loss (first sketch after this list). Here "similar" can mean anything, but you would probably want to mark diffs containing the same vulnerability as similar.
2. If you have a fixed set of N vulnerabilities, you can train a multi-class classifier (second sketch below). Of course, it's a pain in the ass to add a new class later on.
3. For any particular vulnerability, you could train a ranking model using hinge loss (third sketch below). This is what most industrial ranking and recommendation systems do.
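A minimal PyTorch sketch of option 1, assuming some `base_encoder` that maps a batch of diffs to vectors (everything here is illustrative, not a drop-in implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffEmbedder(nn.Module):
    """Projection head over a base encoder, trained so same-vulnerability diffs cluster."""
    def __init__(self, base_encoder, base_dim, out_dim=128):
        super().__init__()
        self.base = base_encoder             # assumed: maps a batch of diffs to (B, base_dim)
        self.head = nn.Linear(base_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.head(self.base(x)), dim=-1)  # unit sphere: cosine geometry

triplet_loss = nn.TripletMarginLoss(margin=0.2)

def triplet_step(model, opt, anchor, positive, negative):
    # anchor and positive share a vulnerability class; negative does not
    loss = triplet_loss(model(anchor), model(positive), model(negative))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```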
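Option 2 is mostly just a linear softmax head over the same embeddings; sizes are assumed for illustration, and note that adding class N+1 later means resizing and retraining this head:

```python
import torch
import torch.nn as nn

EMB_DIM, NUM_CLASSES = 768, 12   # assumed sizes for illustration

# Softmax head over a fixed set of vulnerability classes, on top of frozen embeddings.
classifier = nn.Linear(EMB_DIM, NUM_CLASSES)
criterion = nn.CrossEntropyLoss()
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def clf_step(embeddings, labels):
    """One training step: embeddings (B, EMB_DIM), labels (B,) in [0, NUM_CLASSES)."""
    loss = criterion(classifier(embeddings), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```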
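And a sketch of option 3: one per-vulnerability scorer trained with a pairwise hinge loss, so diffs containing that vulnerability rank above clean ones. Again, dimensions are assumed:

```python
import torch
import torch.nn as nn

EMB_DIM = 768   # assumed embedding size

# One scorer per vulnerability: s(x) should rank affected diffs above clean ones.
scorer = nn.Sequential(nn.Linear(EMB_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

def hinge_step(vuln_emb, clean_emb, margin=1.0):
    """Pairwise hinge: penalize clean diffs scoring within `margin` of vulnerable ones."""
    s_pos = scorer(vuln_emb).squeeze(-1)
    s_neg = scorer(clean_emb).squeeze(-1)
    loss = torch.clamp(margin - (s_pos - s_neg), min=0).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```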