
Just for fun I created a new personal benchmark for vision-enabled LLMs: playing Minecraft. I used JSON structured output in LM Studio to create basic controls for the game. Unfortunately, no matter how hard I prompted, gemma-3-27b QAT is not really able to understand simple Minecraft scenarios. It would say things like "I'm now looking at a stone block. I need to break it" while it was actually looking out at the horizon in the desert.

Here is the JSON schema: https://pastebin.com/SiEJ6LEz System prompt: https://pastebin.com/R68QkfQu
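
For anyone curious about the wiring, here's a minimal sketch of that kind of setup. It assumes LM Studio's OpenAI-compatible local server on its default port, and uses a cut-down stand-in schema rather than the actual one in the pastebin; the model identifier and action list are illustrative guesses, not the ones used above:

    # Sketch: send a screenshot to a vision model served by LM Studio and get
    # back a structured action. Assumes the OpenAI-compatible server is running
    # on the default port (1234). Schema and model name are illustrative only.
    import base64
    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    # Hypothetical, simplified action schema (the real one is in the pastebin).
    action_schema = {
        "name": "minecraft_action",
        "schema": {
            "type": "object",
            "properties": {
                "observation": {"type": "string"},
                "action": {
                    "type": "string",
                    "enum": ["move_forward", "turn_left", "turn_right",
                             "break_block", "jump"],
                },
            },
            "required": ["observation", "action"],
        },
    }

    def next_action(screenshot_path: str) -> dict:
        """Send a game screenshot and parse the structured action the model returns."""
        with open(screenshot_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()

        response = client.chat.completions.create(
            model="gemma-3-27b-it-qat",  # name may differ in your LM Studio install
            messages=[
                {"role": "system",
                 "content": "You are playing Minecraft. Decide the next action."},
                {"role": "user",
                 "content": [
                     {"type": "text", "text": "Here is the current game view."},
                     {"type": "image_url",
                      "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                 ]},
            ],
            response_format={"type": "json_schema", "json_schema": action_schema},
        )
        return json.loads(response.choices[0].message.content)

The constrained output guarantees you get a parseable action every turn; it just doesn't guarantee the model's description of the scene matches what's actually on screen.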



I've found the vision capabilities are very bad at spatial awareness/reasoning. The models seem to know that certain things are in the image, but not where they are relative to each other, their relative sizes, etc.



