I swim around a lot in the "XML High Priesthood" pool, and the latest new thing is this: AI (sucking down unstructured documents) can't function efficiently without a knowledge graph, and donchaknow a complex XML schema and a knowledge graph are practically the same thing.
So they're gluing on some new functionality to try and get writer teams to take the plunge and - same old same old - buy multimillion-dollar tools to make PDFs with. One sign of a terminal bagholder is seeing the same tech come up every few years with the latest fashionable thing stapled on its face. They went through a "blockchain" phase too, where all the individual document elements would be addressable "through the chain".
Anyway, thing is, there's a teensy shred of truth in what they're saying, but everything else they're suggesting would, I think, either not work at all or make retrieval even less dependable. And to do what they're trying to do, you don't actually need a gigantic full-on XML schema. Using AsciiDoc roles consistently would get you the same benefit, and would save a hell of a lot of space in a very limited context window.
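To put a rough number on the space claim, here's a minimal sketch. The snippets, the attribute names, and the cl100k_base tokenizer choice are all my own illustrative assumptions (not anyone's real doc set); it just compares token counts for the same semantic tagging expressed as DITA-flavored XML versus AsciiDoc roles:

```python
# Compare token cost of equivalent semantic tagging in XML vs. AsciiDoc.
# Hypothetical snippets; requires `pip install tiktoken` to run.
import tiktoken

xml_version = """<section id="rotor-svc">
  <title>Rotor service</title>
  <p audience="field-tech" product="mk3">Torque the hub bolts
  to <ph conref="specs.dita#specs/hub-torque"/> before reassembly.</p>
</section>"""

asciidoc_version = """[#rotor-svc]
== Rotor service

[.field-tech.mk3]
Torque the hub bolts to {hub-torque} before reassembly."""

enc = tiktoken.get_encoding("cl100k_base")
for label, text in [("XML", xml_version), ("AsciiDoc", asciidoc_version)]:
    print(f"{label}: {len(enc.encode(text))} tokens")
```

Same id, same audience/product metadata, same content reference; the XML version just spends more tokens saying so, and that overhead repeats on every element.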
Additionally, when you have strict input token limits, it's way easier to chunk Markdown while keeping track of context than it is to chunk HTML at all.
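Here's the kind of bookkeeping I mean, as a sketch: a heading-aware Markdown chunker that carries the trail of headings along with each chunk. The function name and character budget are made up for illustration; the point is that this is a few lines for Markdown and a genuinely hard problem for arbitrary HTML:

```python
import re

def chunk_markdown(text: str, max_chars: int = 1000):
    """Split Markdown into chunks, each paired with its heading trail."""
    heading_re = re.compile(r"^(#{1,6})\s+(.*)$")
    trail = {}          # heading level -> heading text
    chunks, buf = [], []

    def flush():
        # Emit the buffered lines (if any are non-blank) under the
        # heading trail that was in effect while they were buffered.
        if any(l.strip() for l in buf):
            context = " > ".join(trail[k] for k in sorted(trail))
            chunks.append((context, "\n".join(buf).strip()))
        buf.clear()

    for line in text.splitlines():
        m = heading_re.match(line)
        if m:
            flush()
            level = len(m.group(1))
            # Drop deeper headings; a new H2 invalidates the old H3s.
            trail = {k: v for k, v in trail.items() if k < level}
            trail[level] = m.group(2)
        elif sum(len(l) for l in buf) + len(line) > max_chars:
            flush()
            buf.append(line)
        else:
            buf.append(line)
    flush()
    return chunks

doc = "# API\n\n## Auth\n\nUse the token header.\n\n## Errors\n\n429 means back off."
for ctx, body in chunk_markdown(doc, max_chars=80):
    print(f"[{ctx}] {body!r}")
# [API > Auth] 'Use the token header.'
# [API > Errors] '429 means back off.'
```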
Let's see... the linked arXiv article has been withdrawn by the author with the following comment:
> Contains inappropriately sourced conjecture of OpenAI's ChatGPT parameter count from this http URL, a citation which was omitted. The authors do not have direct knowledge or verification of this information, and relied solely on this article, which may lead to public confusion
I think YAML actually uses more tokens than unindented JSON, especially with deeply nested data. For example, "," being a single token makes JSON quite compact.
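A quick way to check, assuming tiktoken and PyYAML are installed; the nested test object is invented:

```python
# Compare token counts for the same data as compact JSON vs. YAML.
import json
import tiktoken
import yaml

data = {"a": {"b": {"c": {"d": [1, 2, 3], "e": "deep"}}}}

enc = tiktoken.get_encoding("cl100k_base")
# Tight separators: no spaces after "," or ":".
as_json = json.dumps(data, separators=(",", ":"))
as_yaml = yaml.dump(data, default_flow_style=False)

print("JSON:", len(enc.encode(as_json)), "tokens ->", as_json)
print("YAML:", len(enc.encode(as_yaml)), "tokens")
```

Compact JSON tends to win because "," and "{" are single tokens, while block-style YAML pays for a newline plus indentation at every level of nesting.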