The reason you would insert an LLM into the vision stack is to deal with the weird and unexpected. Tesla’s current stop sign approach is to train a classifier from scratch on thousands of stop sign images. It’s not surprising that such an architecture can’t deal with stop signs that fall outside the training distribution.

LLMs with vision work completely differently. You’re leveraging a world model, built from terabytes of text, to aid the classification. The classic example of an image they handle well is a man ironing clothes on the back of a taxi: a traditional image classifier wouldn’t have a hope of making sense of that scene, but vision LLMs describe it with ease.

https://llava.hliu.cc/
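
For a concrete sense of what that looks like, here’s a minimal sketch that queries an open LLaVA checkpoint through Hugging Face transformers and asks it to reason about a road scene. The model ID, prompt wording, and image URL are illustrative assumptions, not part of the demo linked above.

    # Minimal sketch: ask a LLaVA-style vision LLM to reason about a road scene.
    # Assumes the llava-hf/llava-1.5-7b-hf checkpoint and its USER/ASSISTANT prompt format.
    import requests
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id)

    # Any dashcam-style frame would do; this URL is a placeholder.
    image = Image.open(requests.get("https://example.com/frame.jpg", stream=True).raw)

    prompt = ("USER: <image>\n"
              "Is there a stop sign in this scene? Note anything unusual about it "
              "(obscured, hand-painted, on a billboard, held by a person). ASSISTANT:")

    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=120)
    print(processor.decode(output[0], skip_special_tokens=True))

The point isn’t the specific checkpoint: it’s that the answer comes back as a free-form description grounded in the model’s world knowledge, which is what lets it cope with signs a purpose-built classifier has never seen.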



