I like Qwen2-VL 7B because it outputs shorter captions with less fluff. But if y...

I like Qwen2-VL 7B because it outputs shorter captions with less fluff. But if you need to do anything advanced that relies on reasoning and instruction following the model completely falls flat on it's face.

For example, I have a couple way-too-wordy captions made with another captioner, which I'd like to cut down to the essentials while correcting any mistakes. Qwen2 is completely ignoring images with this approach, and decides to only focus on the given caption, which makes it unable to even remotely fix issues in said caption.

I am really hoping Pixtral will be better for instruction following. But I haven't been able to run it because they didn't prioritize transformers support, which in turn has hindered the release of any quantized versions to make it fit on consumer hardware.