So here's something I've been wanting to do for a while, but have kinda been struggling to figure out _how_ to do it. txtai looks like it has all the tools necessary to do the job, I'm just not sure which tool(s), and how I'd use them.
Basically, I'd like to be able to take PDFs of, say, D&D books, extract that data (this step is, at least, something I can already do), and load it into an LLM to be able to ask questions like:
* What does the feat "Sentinel" do?
* Who is Elminster?
* Which God(s) do Elves worship in Faerûn?
* Where can I find the spell "Crusader's Mantle"?
And so on. Given this data is all under copyright, I'd probably have to stick to using a local LLM to avoid problems. And, while I wouldn't expect it to have good answers to all (or possibly any!) of those questions, I'd nevertheless love to be able to give it a try.
I'm just not sure where to start - I think I'd want to fine-tune an existing model since this is all natural language content, but I get a bit lost after that. Do I need to pre-process the content to add extra information that I can't fetch relatively automatically? E.g., page numbers are simple to add in, but would I need to mark out things like chapter/section headings, or in-character vs out-of-character text? Do I need to add all the content in as a series of questions and answers, like "What information is on page 52 of the Player's Handbook? => <text of page>"?
Fine-tuning will bias a model towards returning specific answers. It's great for tone and classification. It's terrible for information. If you get info out of it, it's because it's a consistent hallucination.
Embeddings will turn the whole thing into a bunch of numbers. So something like Sentinel will probably match with similar feats. Embeddings are perfect for searching. You can convert images and sound to these numbers too.
But these numbers can't be stored in any regular DB. Most of the time they sit somewhere in memory, then get thrown out. I haven't looked deeply into txtai, but that looks like what it does. This is okay, but it's a little slow and wasteful as you're running the embeddings each time. So that's what vector DBs are for. But unless you're running this at scale where every cent adds up, you don't really need one.
As for preprocessing, many embedding models are already good enough. I'd say try it first, try different models, then tweak as needed. Generally proprietary models do better than open source, but there's likely an open source one designed for game books, which would do best on an unprocessed D&D book.
However it's likely to be poor at matching pages afaik, unless you attach that info.
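To make the "bunch of numbers" idea a bit more concrete, here's a toy sketch. It is not txtai and not a real embedding model: `embed` is just a word-count vector standing in for one, and the passages and page numbers are made up for illustration. The point is only that search becomes "find the closest vectors", and that a page number comes back with a match if you attach it to each passage up front.

```python
import math
import re
from collections import Counter

# Made-up passages with made-up page numbers, purely for illustration.
PASSAGES = [
    {"page": 1, "text": "Sentinel is a feat that lets you punish enemies who move away from you."},
    {"page": 2, "text": "Elminster is a famous wizard who lives in Shadowdale."},
    {"page": 3, "text": "Crusader's Mantle is a spell that makes allies deal extra radiant damage."},
]


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real model returns a dense float vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, passages: list, limit: int = 2) -> list:
    """Return the `limit` most similar passages as (score, passage) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(p["text"])), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


for score, passage in search("What does the feat Sentinel do?", PASSAGES):
    print(f"p. {passage['page']} ({score:.2f}): {passage['text']}")
```

Swapping `embed` for a real model changes the quality of the matches, not the shape of the code.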
RAG sounds sophisticated but it's actually quite simple. For each question, a database (vector database, keyword, relational etc) is first searched. The top n results are then inserted into a prompt and that is what is run with the LLM.
Before fine-tuning, I'd try that out first. I'm planning to have another example notebook out soon building on this.
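For anyone who wants to see the shape of it, the search-then-prompt flow described above can be sketched in a few lines. This is a minimal illustration, not txtai's API: `retrieve` is a toy keyword-overlap search, `fake_llm` is a stand-in for whatever local model you run, and the documents are invented.

```python
import re

# Invented documents, standing in for chunks extracted from the books.
DOCS = [
    "Sentinel is a feat for guarding allies.",
    "Elminster is a wizard of Shadowdale.",
    "Crusader's Mantle is a paladin spell.",
]


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z']+", text.lower()))


def retrieve(question: str, documents: list, n: int = 2) -> list:
    """Step 1: search. Here: rank by number of words shared with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:n]


def build_prompt(question: str, context: list) -> str:
    """Step 2: insert the top n results into a prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{joined}\n"
        f"Question: {question}\n"
    )


def rag(question: str, documents: list, llm, n: int = 2) -> str:
    """Step 3: run the assembled prompt through the LLM (any callable)."""
    return llm(build_prompt(question, retrieve(question, documents, n)))


def fake_llm(prompt: str) -> str:
    """Placeholder model that just echoes the prompt it was given."""
    return prompt


print(rag("Who is Elminster?", DOCS, fake_llm, n=1))
```

The search step is the part you swap out (vector, keyword, relational); the prompt assembly stays the same.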
Ah, that's very helpful, thanks! I'll have a dig into this at some point relatively soon.
An example of how I might provide references with page numbers or chapter names would be great (even if this means a more complex text-extraction pipeline). As would examples showing anything I can do to indicate differences that are obvious to me but that an LLM would be unlikely to pick up, such as the previously mentioned in-character vs out-of-character distinction. This is mostly relevant for asking questions about the setting, where in-character information might be suspect ("unreliable narrator"), while out-of-character information is generally fully accurate.
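To sketch what I mean (everything here is hypothetical: the `Chunk` fields, the book title, the page numbers and the prompt wording), I'm imagining each extracted chunk carrying its own metadata, which is then written into the prompt so the model can cite it and can be told which text is in-character:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    book: str
    page: int
    chapter: str
    in_character: bool  # True for in-world narration that may be unreliable


def format_context(chunks: list) -> str:
    """Render each chunk with a numbered, citable header."""
    blocks = []
    for i, c in enumerate(chunks, start=1):
        voice = (
            "in-character, may be unreliable"
            if c.in_character
            else "out-of-character, authoritative"
        )
        blocks.append(f"[{i}] ({c.book}, {c.chapter}, p. {c.page}; {voice})\n{c.text}")
    return "\n\n".join(blocks)


def build_prompt(question: str, chunks: list) -> str:
    return (
        "Answer using the numbered sources below and cite them like [1].\n"
        "Treat in-character sources as claims made by a narrator, not as fact.\n\n"
        f"{format_context(chunks)}\n\n"
        f"Question: {question}\n"
    )


# Invented sample data.
SOURCES = [
    Chunk("They say the old sage has lived a thousand years.",
          "Example Setting Guide", 12, "Chapter 1", in_character=True),
    Chunk("Elminster is a wizard who lives in Shadowdale.",
          "Example Setting Guide", 40, "Chapter 3", in_character=False),
]

print(build_prompt("Who is Elminster?", SOURCES))
```

Whether a model actually respects the in-character instruction is something I'd have to test; the sketch only shows where the information would go.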
Tangentially, is this something that I could reasonably experiment with without a GPU? While I do have a 4090, it's in my Windows gaming machine, which isn't really set up for AI/LLM/etc development.
Will do, I'll have the new notebooks published within the next couple weeks.
In terms of a no GPU setup, yes it's possible but it will be slow. As long as you're OK with slow response times, then it will eventually come back with answers.
Thanks, I'd really appreciate it! The blog post you linked earlier was what finally made RAG "click" for me, making it very clear how it works, at least for the relatively simple tasks I want to do.
All the people saying "don't use fine-tuning" don't realize that most of traditional fine-tuning's issues are due to modifying all of the weights in your model, which causes catastrophic forgetting.
There are tons of parameter-efficient fine-tuning methods, e.g. LoRA, "soft prompts", ReFT, etc., which are actually good to use alongside RAG and will likely supercharge your solution compared to "simply using RAG". The fewer parameters you modify, the more knowledge is "preserved".
Also, look into the Graph-RAG/Semantic Graph stuff in txtai. As usual, David (author of txtai) was implementing, years ago, code for things that the market only just now cares about.
For now it still uses OpenAI for embeddings generation by default, and we are updating that in the next couple of releases to be able to use a local model for embedding generation before writing to a vector DB.
Disclosure: I'm the maintainer of the LLMStack project.
I did something similar to this using RAG except for Vampire rather than D&D. It wasn't overwhelmingly difficult, but I found that the system was quite sensitive to how I chunked up the books. Just letting an automated system prepare the PDFs for me gave very poor results all around. I had to ensure that individual chunks had logical start/end positions, that tables weren't cut off, and so on.
I wouldn't fine-tune, that's too much cost/effort.
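As a rough illustration of the chunking point above (a simplified sketch, not the code I actually used): pack whole paragraphs into chunks and never cut one in the middle. It assumes the extracted text separates paragraphs, and tables, with blank lines, which real PDF output often doesn't do without cleanup.

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list:
    """Pack whole paragraphs into chunks of at most max_chars.

    A paragraph is never split; one that is longer than max_chars
    becomes its own oversized chunk instead of being cut off.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "\n\n".join(["A" * 30, "B" * 30, "C" * 30])
for chunk in chunk_paragraphs(sample, max_chars=70):
    print(repr(chunk))
```

In practice the fiddly part was getting clean paragraph and table boundaries out of the PDFs in the first place; the packing itself is simple.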
Yeah, that's about what I'd expected (and WoD books would be a priority for me to index). Another commentator mentioned that Knowledge Graphs might be useful for dealing with the limitations imposed by RAG (e.g., have to limit results because context window is relatively small), which might be worth looking into as well. That said, properly preparing this data for a KG, ontologies and all, might be too much work.
RAG is all you need*. This is a pretty DIY setup, but I use a private instance of Dify for this. I have a private Git repository where I commit my "knowledge", a Git hook syncs the changes with the Dify knowledge API, and then I use the Dify API/chat for querying.
*it would probably be better to add a knowledge graph as an extra step, which first tells the system where to search. RAG by itself is pretty bad at summarizing and combining many different docs due to the limited LLM context sizes, and I find that many questions require this global overview. A knowledge graph or other form of index/meta-layer probably solves that.
From a quick search, it seems like Knowledge Graphs are particularly new, even by AI standards, so it's harder to get one up off the ground if you haven't been following AI extremely closely. Is that accurate, or is it just the integration points with AI that are new?
First I would calculate the number of tokens you actually need. If it's less than 32k there are plenty of ways to pull this off without RAG. If more (millions), you should understand RAG is an approximation technique and results may not be as high quality. If wayyyy more (billions), you might actually want to finetune.
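A back-of-the-envelope version of that calculation might look like the sketch below. The 4-characters-per-token figure is only a rough rule of thumb for English text (use your model's real tokenizer for an actual count), and the billion-token cutoff is just my reading of "wayyyy more", not a hard rule.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; a real tokenizer will give different numbers."""
    return int(len(text) / chars_per_token)


def suggest_approach(token_count: int, context_limit: int = 32_000) -> str:
    """Map a token count onto the three rough regimes described above."""
    if token_count < context_limit:
        return "put everything in the prompt"
    if token_count < 1_000_000_000:  # assumed cutoff for "billions"
        return "RAG"
    return "consider fine-tuning"


book_text = "word " * 50_000  # placeholder for your extracted book text
tokens = estimate_tokens(book_text)
print(tokens, "->", suggest_approach(tokens))
```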
Fine-tuning is almost certainly the wrong way to go about this. It's not a good way of adding small amounts of new knowledge to a model because the existing knowledge tends to overwhelm anything you attempt to add in the fine-tuning steps.
Look into different RAG and tool usage mechanisms instead. You might even be able to get good results from dumping large amounts of information into a long context model like Gemini Flash.
No fine-tuning is necessary. You can use something reasonably good at RAG that's small enough to run locally like the Command-R model run by Ollama and a small embedding model like Nomic. There are dozens of simple interfaces that will let you import files to create a RAG knowledgebase to interact with as you describe; AnythingLLM is a popular one. Just point it at your locally-running LLM or tell it to download one using the interface. Behind the scenes these tools store everything in LanceDB or similar and perform the searching for you when you submit a prompt in the simple chat interface.
Very easy to do with Milvus and LangChain. I built a private Slack bot that takes PDFs, chunks them into Milvus using PyMuPDF, then uses LangChain for recall. It's surprisingly good for what you describe and took maybe 2 hours to build and run locally.
From my limited experience, Staff+ seems to have a lot of the same responsibilities as a manager, but without the direct reports—they're both “leadership” positions and focus on long(er)-term planning, business needs, cross-team communication, and enabling others rather than doing the work themselves. Though in lieu of people management, Staff+ engineers do get to spend some time coding, but it's pretty rarely the majority of their job.
So to that extent, I think there's quite a lot in common between engineering and management tracks after a certain point, both because there's a genuine need for that, and because direct code contributions just don't scale in the same way that helping others does.
I think you can set it to be the other way around and have it be "never mark as spam when via team@example.com" - of course, depending on how much spam it gets that might be worse.
That... pretty much already exists, in the form of Home Assistant + Zigbee and/or Thread? Though that's still wireless, and I haven't seen any focus on trying to connect everything with wires (not something I'd be keen on, personally, I'm quite happy with the wireless protocols).
FWIW, English Wiktionary (apparently!) has fewer words than German Wiktionary. I've run into this trying to extract words from eBooks (then converting to the "base" form, to essentially de-duplicate). I think it's mostly compound or more niche words, but I imagine you'd still run into them at least occasionally with most written works.
There's a nice project for converting and extracting the data from English Wiktionary into JSON but it doesn't support any other languages, AFAIK, which is a bit of a shame but also not very surprising - Wiktionary is a lot more complex, technically, than I expected!
The latter. I'm very definitely not at that level either, but looking at German words from books that couldn't be found on English Wiktionary, I was able to find them on German Wiktionary. One example would be "Weihnachtsfest" - not sure it's "officially" a compound word, though if you know "Weihnacht" and "Fest", then the meaning should be clear. In any case, it shows up as a single word and trying to "split" words made up of other words is an exercise in insanity.
Another example is "krächzender", which might also serve to give some idea of the particular pains in processing German text. It's not in English Wiktionary, but krächzen is, and is a verb. So "krächzender" is the adjectival form of the verb, and if you know "krächzen" and the general rules around adjective formation it would probably be obvious. But would you rely on a computer to parse those rules, or would you want a table with all the declensions laid out? And if you're building a vocab list for a book, is it a separate entry in the list, or does it fall under the verb?
Obviously, German Wiktionary only has definitions & explanations in German so it's not great for beginners, but any tool that's trying to automatically do stuff with German text would likely benefit from using German Wiktionary.
I have no idea if it's true for other languages, but I wouldn't be surprised if it's also true for other major languages spoken by Wikipedia users (e.g., French, Spanish, but maybe not Chinese).
I think the Jellyfin integration could be more than just a niche feature. I've used https://www.languagereactor.com/, but that only supports Netflix & YouTube, which is a bit limiting.
Reasons it's useful:
* If you've got both Native & Target Language subtitles, you can see a natural translation if you're struggling to understand something
* If there isn't a Native translation, then you can machine-translate one - especially useful early on to catch common idioms/etc that aren't just the sum of each individual word.
* Jellyfin also supports eBooks, although its reader isn't great - but if someone has already built their library, it would be nice to be able to re-use it somehow.
I would be very interested in seeing that particular feature expand, but I don't imagine it's at all simple!
Tangentially related, but I could see some desire for Calibre support as well, somehow. Calibre was very much designed to be completely stand-alone and it doesn't really support other apps trying to read its database, but it is possible.
I'd also really like some language-specific features, like separable-verb handling for German (see this comment: https://news.ycombinator.com/item?id=38915786) - it's relatively important and lacking support really limits the usefulness of vocab tools. It would also be a nightmare to handle for subtitles, since it's not always clear where a sentence ends, but such is life - subtitles are sadly not aimed at language learners. For books and not-terrible podcast transcripts, though, it wouldn't be so bad.
I thought of it as a niche feature because I thought most of the users would come from language learning communities, where most people are not into self-hosting. So even if someone would set up a server just for this, chances are they don't have Jellyfin or aren't interested in it either. But I've seen several comments about it, and it seems like a lot of people are from the self-hosting community, so maybe it's more popular.
I'm also planning to support YouTube and improve on Jellyfin support, but I'll work on other issues and features first.
Well, part of it is being on Hacker News, which will definitely skew towards "self-host everything!", and on top of that Jellyfin is genuinely free and open-source while the more popular alternative (Plex) isn't, so probably more popular here again, and not necessarily reflective of the popularity amongst self-hosters in general!
I definitely wouldn't expect it to be high on the list of priorities, but I do appreciate that it's under consideration at the very least.
Interesting! I have a partially-built, related, tool, to extract "words" from e-books, so I could build flashcard lists and make sure I knew the majority of words that were used - most of them would be common words but every book has a decently-sized selection of specialised vocabulary. I did think about trying to get something fancy done with an LLM or an NLP for figuring out the separable verbs, but in the end, I took a very... brute-force approach, basically grabbing the final word in the "phrase", then prepending that to every word in the phrase one by one and asking "is this a known separable verb?" - I'm not sure how well it worked, but that's a different story.
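Roughly, the brute-force check looks like the sketch below. It is simplified from memory rather than my actual code: the verb list and the lemma table are tiny hypothetical stand-ins for real dictionary data, and it assumes each word has already been reduced to its base form.

```python
# Tiny hypothetical stand-ins for real dictionary data.
KNOWN_SEPARABLE = {"anrufen", "aufstehen", "einkaufen"}
LEMMAS = {"rufe": "rufen", "steht": "stehen", "kaufe": "kaufen"}


def find_separable_verbs(phrase: str, known=KNOWN_SEPARABLE, lemmas=LEMMAS) -> list:
    """Prepend the phrase's final word to every other word and keep known verbs."""
    words = [w.strip(".,!?;:").lower() for w in phrase.split()]
    words = [lemmas.get(w, w) for w in words if w]
    if len(words) < 2:
        return []
    prefix = words[-1]  # candidate separated prefix, e.g. "an" or "auf"
    return [prefix + w for w in words[:-1] if prefix + w in known]


for phrase in ["Ich rufe dich an.", "Er steht früh auf.", "Ich kaufe Brot."]:
    print(phrase, "->", find_separable_verbs(phrase))
```

It only looks at the very end of the phrase, so it depends entirely on the text being split at the right boundaries first.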
Which looks quite interesting as a way to have HTTPS for my internal-only pages without needing to deal with an external service, although you have to be very careful to set up your certs correctly with "Name Constraints" (https://www.rfc-editor.org/rfc/rfc5280#section-4.2.1.10) to avoid the risk of someone being able to MitM everything if they're able to get in and start issuing themselves certificates.
It's pretty clear that it's "Based on the Farnsworth Munsell 100 Hue Test," and "this is not a replacement for the full test!" I think if you did particularly poorly on the online test then it's worth looking at whether you need to do the full test, as you might have colour blindness (or a terrible monitor). But a perfect score isn't super-meaningful.
Looking at this test on my monitors, I would bet it's the 'terrible monitor' before color blindness by a long shot: one shows pretty good color, the other is entirely washed out.