Hacker News

LLMs clearly struggle when presented with JSON, especially large amounts of it.

There's nothing stopping your endpoints from returning data in some other format. LLMs actually seem to excel with XML, for instance. Or you could just use a template to render the data as narrative text.
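For instance, a minimal sketch of both options in Python (the `employee` record and the template wording are made up for illustration):

```python
from xml.sax.saxutils import escape

# Hypothetical endpoint payload; field names are invented for the example.
employee = {"name": "Ada Lovelace", "role": "programmer", "tenure_years": 3}

# Option 1: serialize the record as XML instead of JSON.
def to_xml(record: dict, root: str = "employee") -> str:
    fields = "".join(f"<{k}>{escape(str(v))}</{k}>" for k, v in record.items())
    return f"<{root}>{fields}</{root}>"

# Option 2: render the same record as narrative text via a template.
TEMPLATE = "{name} is a {role} who has been with the company for {tenure_years} years."

def to_narrative(record: dict) -> str:
    return TEMPLATE.format(**record)

print(to_xml(employee))
print(to_narrative(employee))
```

Either string can be dropped straight into the prompt in place of the raw JSON payload.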




I'm consistently surprised that people don't default to XML for LLMs, given that XML comes with built-in semantic context. Convert the XML to JSON deterministically when you need to feed it into other pipelines.
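A minimal sketch of that deterministic XML-to-JSON step using only the Python standard library (simplified: it ignores attributes and repeated sibling tags, and the payload is made up):

```python
import json
import xml.etree.ElementTree as ET

def xml_to_dict(elem):
    """Recursively convert an element tree into plain dicts/strings.
    Simplified: ignores attributes and repeated sibling tags."""
    children = list(elem)
    if not children:
        return elem.text or ""
    return {child.tag: xml_to_dict(child) for child in children}

xml_payload = "<employee><name>Ada</name><dept><id>7</id><name>R&amp;D</name></dept></employee>"
root = ET.fromstring(xml_payload)
print(json.dumps({root.tag: xml_to_dict(root)}))
# → {"employee": {"name": "Ada", "dept": {"id": "7", "name": "R&D"}}}
```

So the model works with the XML, and downstream consumers still get JSON.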


Any reason for this, for my own learning? Was XML more prevalent in the training data? Is there something about XML that makes it easier for the LLM to work with?

XML seems more text-heavy, i.e. more tokens. However, maybe the extra context helps?


It's in the official OpenAI prompting guidelines: https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...

But it's also evident to anyone who has used these models. It's not unique to OpenAI either; this bias is prevalent in every model I've tested, from GPT-3 to the latest offerings from every single frontier model provider.

As to why, I'd guess it's because XML bakes semantic meaning into its tags, so it's easier for the model to understand the structure of the data. <employee>...</employee> is a lot easier to understand than { "employee": { ... } }.

I would guess that the models largely ignore the angle brackets and focus on the tag names, which have unique tokens and are thus easier to pair up than the curly braces that look the same throughout JSON. Just speculation on my part, though.

And this only applies to the input. Earlier models struggled to reliably output JSON, so they've since been both fine-tuned and wrapped in constrained-decoding formatters that reliably force clean JSON output.


I've seen the suggestion that it's because they've been trained on a lot of HTML, but the GPT docs suggest Markdown as the default choice, and Markdown is relatively less common in the wild.


We've been using Markdown tables to return data to the LLM, with some success.
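Roughly like this, a small sketch of the kind of helper involved (the records and field names are invented for the example):

```python
def to_markdown_table(rows: list[dict]) -> str:
    """Render a list of dicts as a Markdown table.
    The keys of the first row define the columns."""
    cols = list(rows[0])
    lines = [
        "| " + " | ".join(cols) + " |",
        "| " + " | ".join("---" for _ in cols) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row.get(c, "")) for c in cols) + " |")
    return "\n".join(lines)

data = [
    {"name": "Ada", "role": "programmer"},
    {"name": "Grace", "role": "admiral"},
]
print(to_markdown_table(data))
```

The resulting table goes into the prompt in place of the raw JSON list.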



