Hacker News | simonw's comments

I've had great results against PDFs from recent vision models. Gemini, OpenAI and Claude can all accept PDFs directly now and treat them as image input.

For longer PDFs I've found that breaking them up into images per page and treating each page separately works well - feeding a thousand page PDF to even a long context model like Gemini 2.5 Pro or Flash still isn't reliable enough that I trust it.
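The per-page approach can be sketched like this. This is a minimal sketch, not any particular provider's API: `ocr_page` is a hypothetical stand-in for a single vision-model call, and `page_images` is assumed to be pre-rendered page images (e.g. PNG bytes from a PDF renderer).

```python
from typing import Callable, Iterable


def ocr_pdf_pages(
    page_images: Iterable[bytes],
    ocr_page: Callable[[bytes], str],
) -> list[str]:
    """OCR a long PDF one page at a time instead of one giant request.

    `page_images` is an iterable of rendered page images; `ocr_page`
    wraps a single vision-model call for one page. Keeping each request
    to one page makes failures small and retryable, rather than risking
    an unreliable thousand-page prompt.
    """
    results = []
    for image in page_images:
        try:
            results.append(ocr_page(image))
        except Exception:
            # Retry once; a thousand-page run shouldn't die on one flaky call.
            results.append(ocr_page(image))
    return results
```

You'd then join or index the per-page results yourself, which also gives you page-level provenance for free.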

As always though, the big challenge of using vision LLMs for OCR (or audio transcription) tasks is the risk of accidental instruction following - even more so if there's a risk of deliberately malicious instructions in the documents you are processing.


"kinda like BBC Radio 3 if it were a neighborhood"

Thanks for that, put a smile on my face.


Related documents aside, technical documentation benefits from really great search.

Embeddings are a _very_ useful tool for building better search - they can handle "fuzzy" matches, where a user can say things like "that feature that lets me run a function against every column of data" because they can't remember the name of the feature.

With embeddings you can implement a hybrid approach, where you mix both keyword search (still necessary because embeddings can miss things that use jargon they weren't trained on) and vector similarity search.
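One common way to combine the two result lists is reciprocal rank fusion. A sketch, assuming you already have ranked lists of document IDs from your keyword engine and your vector index:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) in every list it appears in;
    k=60 is the constant from the original RRF paper. Documents that
    rank well in either keyword or vector search float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The nice property of RRF is that it only needs ranks, not scores, so you never have to normalize BM25 scores against cosine similarities.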

I wish I had good examples to point to for this!


In-site search is super important. I suspect that many docs maintainers don't realize how heavily it's used. Many docs sites don't even track in-site search queries!

One of the things I love about Sphinx is that it has a decent, client-side, JS-powered offline search. I recently hacked together a workflow for making it search-as-you-type [1]. jasonjmcghee's comment [2] has got me pondering whether we can augment it with transformer.js embeddings.

[1] https://github.com/orgs/sphinx-doc/discussions/13222

[2] https://news.ycombinator.com/item?id=43964913


I love all-MiniLM-L6-v2 - 87MB is tiny enough that you could just load it into RAM in a web application process on a small VM. From my experiments with it the results are Good Enough for a lot of purposes. https://simonwillison.net/2023/Sep/4/llm-embeddings/#embeddi...
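Once you have the vectors (all-MiniLM-L6-v2 produces 384-dimensional ones), similarity search is just cosine similarity, which needs nothing beyond the stdlib. A sketch with hand-made vectors standing in for real model output:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors.

    Returns 1.0 for identical directions, 0.0 for orthogonal ones.
    For a tiny corpus you can just compute this against every stored
    vector and sort - no vector database required.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Brute-force scoring like this stays fast well into the tens of thousands of documents, which covers most documentation sites.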

87MB is still quite big, though. Think of all the comments here on HN where people were appalled at a certain site loading 10-50 MB of images. Hopefully browser vendors will figure out a secure way to download a model once and re-use that single model on any website that requests it. Rather than potentially downloading a separate instance of all-MiniLM-L6-v2 for each site. I know that Chrome has an AI initiative but I didn't see any docs about this particular problem: https://developer.chrome.com/docs/ai

It's crazy because Chrome ships an embedding model - it's just not accessible to users / developers (AFAIK)

https://dejan.ai/blog/chromes-new-embedding-model/


Personally I hate it because it has a very short context length and *silently* truncates the text after roughly a tweet's worth of input. I've been on a crusade about this on GitHub and nobody seems to know about it.

My go to right now is on ollama: snowflake-arctic-embed2
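One defence against that silent truncation is to chunk before embedding. A naive sketch - it counts words as a crude proxy for tokens; a real pipeline should measure length with the model's own tokenizer:

```python
def chunk_for_embedding(text: str, max_words: int = 200) -> list[str]:
    """Split text into chunks small enough that an embedding model with
    a short context window won't silently drop the tail.

    Word counts only approximate tokens, so keep max_words comfortably
    below the model's real limit.
    """
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Embedding each chunk separately (and storing which document it came from) means no text ever exceeds the window, so nothing gets cropped without you noticing.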


TIL (from "Choose a week to question all your life choices") about:

  <input type="week">

https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/...

Weird that it's supported by mobile Safari but not desktop Safari (according to the support table). And not in Firefox yet.


Note that these are ISO 8601 weeks, which may not match the expectations of users everywhere. E.g. the US has a different system.

https://en.wikipedia.org/wiki/Week#Other_week_numbering_syst...
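The mismatch is easy to demonstrate with Python's stdlib: the same calendar day lands in different weeks (and even different years) depending on the convention.

```python
from datetime import date

# 1 January 2023 was a Sunday.
d = date(2023, 1, 1)

# ISO 8601 (what <input type="week"> uses): weeks start on Monday,
# so this day still belongs to week 52 of 2022.
iso_year, iso_week, iso_weekday = d.isocalendar()

# The common US convention: weeks start on Sunday, so the same day
# opens week 1 of 2023 (%U counts Sunday-started weeks).
us_week = d.strftime("%U")

print(iso_year, iso_week)  # 2022 52
print(us_week)             # 01
```

So a US user picking "the first week of January 2023" may get a value labelled week 52 of 2022.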


Same is true on Firefox: works fine on mobile, but doesn't work at all on desktop. Especially weird as the mobile week picker is just a date picker that ignores the picked day and returns a week it seems; surely the same could've been implemented for desktop.

There's a larger context here which is really interesting: as far as I can tell (and I'd love to hear confirmation from people more credible than me) the way LLMs and other models train on unlicensed data is NOT legal under current UK copyright law.

The UK government is trying to make it legal, presumably out of concern for staying competitive in this rapidly growing space.

Baroness Kidron, mentioned in this story, is the leading figure in UK parliament who is pushing back against this.


Should an AI model be able to answer the question "which team won the superbowl in 2023" if there are thousands of articles out there containing that information but not a single one of them has been licensed for use by AI?

If you could separate the information from the intellectual property, sure; but if the model is also capable of generating a similar article, that's the point where it starts infringing on the IP of all the authors whose articles were fed into the model.

So in practice, no, it shouldn't. Not because that information itself is bad, but because it probably isn't limited to just that answer.

In summary, I think it is definitely a problem when:

1. The model is trained on a certain type of intellectual property

2. The model is then asked to produce content of the same type

3. The authors of the training data did not consent

And slightly less so, but still questionable when instead:

2. The IP becomes an integral part of the new product

which, arguably, is the case for any and all AI training data; individually you could take any of them out and not much would happen, but remove them all and the entire product is gone.


No.

That's a funny example since broadcasters have to pay a fee to say "The Super Bowl" in the first place. If they don't, they have to use some euphemism like "the big game."

The answer is definitely no. You cannot use something that you don't have a license for unless it belongs to you.


I didn't know that about euphemisms, that's a great little detail - makes this hypothetical question even more interesting!

(For what it's worth, Claude disagrees and claims that news organizations ARE allowed to use the term Super Bowl, but companies that aren't official sponsors can't use it in their ads. But Claude is not a lawyer so <shrug>)


Right. All of these "finance and government would be so much better with smart contracts" suggestions seem predicated on the idea that human beings can design a system correctly on the first attempt and deploy an immutable version of that system they can then run independently forever without any bugs or exploits that need to be fixed in the future.

Human beings cannot do that.


Good point.

It seems the common approach now is to make a v2/v3/etc. of your protocol and let your own users migrate. Previous versions will still run forever, but your frontend can push migration paths.


"I'm just not sure I see where AI has made my search results better or more reliable."

Did you prefer Google search results ten years ago? Those were still using all manner of machine learning algorithms, which is what we used to call "AI".


Even 20 years ago, it wasn't using AI for the core algorithm, but for plenty of subsystems, like (IIRC) spellchecking, language classification, and spam detection.

Came here to say exactly that. The use of "AI" as a weird, all-encompassing boogeyman is a big part of the problem here - it's quickly coming to mean "any form of technology that I don't like or don't understand" for a growing number of people.

The author of this piece made no attempt at all to define what "AI" they were talking about here, which I think was irresponsible of them.

