
To feed content to LLMs


"But it's not structured!"

I swim around a lot in the "XML High Priesthood" pool, and the latest new thing is this: AI (sucking down unstructured documents) isn't capable of functioning efficiently without a Knowledge Graph, and donchaknow a complex XML schema and a knowledge graph are practically the same thing.

So they're gluing on some new functionality to try to get writer teams to take the plunge and, same old same old, buy multimillion-dollar tools to make PDFs with. One sign of a terminal bagholder is seeing the same tech come up every few years with the latest fashionable thing stapled on its face. They went through a "blockchain" phase too, where all the individual document elements would be addressable "through the chain".

Anyway . . .

Thing is, there's a teensy shred of truth in what they're saying, but everything else they're suggesting would, I think, either not work at all or make retrieval even less dependable. Also, to do what they're trying to do, you don't actually need a gigantic, full-on XML schema. Using Asciidoc roles consistently would get you the same benefit, and would save a hell of a lot of space in a very limited context window.
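
A rough Python sketch of what I mean (the roles, the document, and the regex are all invented for illustration): role-tagged Asciidoc blocks can be pulled out as (role, text) pairs for retrieval, which gets you the semantic tagging an XML schema promises at a fraction of the markup.

    import re

    # Hypothetical Asciidoc source: the [.role] shorthand tags a block
    # with semantics, much like an XML element would, in far fewer tokens.
    doc = """\
    [.definition]
    A knowledge graph stores entities and the relations between them.

    [.warning]
    Do not run this command as root.
    """

    # Extract (role, paragraph) pairs for retrieval/indexing.
    ROLE_BLOCK = re.compile(
        r"^\[\.(?P<role>[\w-]+)\]\n(?P<body>.+?)(?:\n\n|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    for m in ROLE_BLOCK.finditer(doc):
        print(m["role"], "->", m["body"].strip())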


Yeah, this. Markdown uses fewer tokens than HTML, and most LLMs have been trained on large amounts of Markdown.

That's why tools like this exist: https://jina.ai/reader/

Demo: https://r.jina.ai/https://news.ycombinator.com/item?id=40695...
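
You can also sanity-check the token savings locally; here's a quick sketch using OpenAI's tiktoken library (the sample strings are invented):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")

    html = "<ul><li><strong>Fast</strong></li><li><em>Cheap</em></li></ul>"
    markdown = "- **Fast**\n- *Cheap*"

    # The HTML version spends tokens on tags that Markdown expresses
    # with a character or two.
    print("html:", len(enc.encode(html)))
    print("markdown:", len(enc.encode(markdown)))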


Additionally, when you have strict input token limits, it's way easier to chunk Markdown while keeping track of context than it is to chunk HTML at all.
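
Roughly like this, as an illustrative Python sketch (the function and its heuristics are mine, not from any particular library): carry the heading trail along with each chunk, so context survives the split.

    def chunk_markdown(text: str, max_chars: int = 800):
        path = []                  # (level, heading) pairs: the current trail
        chunks, buf, size = [], [], 0

        def flush():
            nonlocal size
            if buf:
                # Prepend the heading trail so the chunk keeps its context.
                chunks.append("\n".join([h for _, h in path] + buf))
            buf.clear()
            size = 0

        for line in text.splitlines():
            if line.startswith("#"):
                flush()
                level = len(line) - len(line.lstrip("#"))
                # Drop headings at the same or deeper level, then descend.
                while path and path[-1][0] >= level:
                    path.pop()
                path.append((level, line))
            else:
                if size + len(line) > max_chars:
                    flush()
                buf.append(line)
                size += len(line)
        flush()
        return chunks

Doing the same for HTML means parsing the DOM and deciding which ancestors count as "context" — much messier.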


Gemini Ultra is not available in France, even though it is in all neighboring countries: Germany, Spain, Belgium, Luxembourg, Switzerland, and Italy.

Is that because of French legislation, or Mistral? ;-)


I'm like 98% sure it's the former. Geofencing would only be a minor inconvenience to the latter.


Probably much better than Alexa. GPT-3.5 is miles ahead of Alexa.


Sorry, that was a bad joke.


Production ready



Let's see... the linked arXiv article has been withdrawn by the author with the following comment:

> Contains inappropriately sourced conjecture of OpenAI's ChatGPT parameter count from this http URL, a citation which was omitted. The authors do not have direct knowledge or verification of this information, and relied solely on this article, which may lead to public confusion

The URL in question: https://www.forbes.com/sites/forbestechcouncil/2023/02/17/is...

This article was written by Aleks Farseev, the CEO of SoMonitor.ai, who makes the claim with no source or explanation:

> ChatGPT is not just smaller (20 billion vs. 175 billion parameters) and therefore faster than GPT-3


Hmm, right. The ~300B figure may have been for the non-turbo 3.5.


I think YAML actually uses more tokens than indent-free JSON, especially with deeply nested data. For example, "," being a single token makes JSON quite compact.

You can compare JSON and YAML on https://platform.openai.com/tokenizer
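
Or script the comparison, e.g. with tiktoken and PyYAML (assuming both are installed; the sample data is invented):

    import json

    import tiktoken
    import yaml

    enc = tiktoken.encoding_for_model("gpt-4")
    data = {"users": [{"id": i, "tags": ["a", "b"], "meta": {"active": True}}
                      for i in range(3)]}

    # Compact JSON: "," and ":" are single tokens and there's no indentation.
    print("json:", len(enc.encode(json.dumps(data, separators=(",", ":")))))
    # YAML spends tokens on newlines and indentation at every nesting level.
    print("yaml:", len(enc.encode(yaml.dump(data))))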


Not "HN-like", but I have found Simon Willison's blog/newsletter very helpful: - https://simonwillison.net - https://simonw.substack.com


I'm hopeful that https://dagger.io will help CI "shift left"! Especially the ability to run it locally, and the built-in caching/DAG.


That feels like a change that will help misinformation


And crypto scams.


And hide the White House's like-to-dislike ratio. Check out some videos; the ratio is horrible: https://www.youtube.com/c/WhiteHouse/videos

