
Does anyone have real-life experience (preferably verified in a production environment) of fine-tuning actually adding new knowledge to an existing LLM in a reliable and consistent manner? I've seen claims that fine-tuning only adapts the "forms" but can't add new knowledge, while some claim otherwise. I couldn't convince myself either way with my limited ad hoc/anecdotal experiments.


I’ve taught LLMs imaginary words and their meanings with minute amounts of data (two or three examples) via full fine-tuning, LoRA and QLoRA.

I have no idea where the myth of ‘can’t add new knowledge via fine-tuning’ came from. It’s a sticky meme that makes no sense.

Pretraining obviously adds knowledge to a model. The difference between pretraining and fine-tuning is the number of tokens and learning rate. That’s it.
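
Just to illustrate the kind of experiment I mean, here's a rough sketch using Hugging Face transformers + PEFT. The base model name, the imaginary word, the two training examples, and every hyperparameter are made up for illustration; it's not my exact setup.

    # LoRA sketch: teach a causal LM an imaginary word from a handful of examples.
    # Assumptions: transformers + peft installed; the model name, the word
    # "florgle", the examples, and the hyperparameters are all illustrative.
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments)
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-2-7b-hf"  # any small causal LM works for this test
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # Wrap the base model with low-rank adapters; only these weights get trained.
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM"))

    # Two examples defining the imaginary word, repeated to get a few more steps.
    texts = [
        "Q: What does 'florgle' mean? A: To tidy your desk instead of working.",
        "Q: Use 'florgle' in a sentence. A: I florgled all morning before the deadline.",
    ] * 20

    def encode(t):
        enc = tok(t, truncation=True, max_length=64, padding="max_length")
        enc["labels"] = enc["input_ids"].copy()  # causal-LM objective
        # (a real run would mask padding tokens in the labels)
        return enc

    train_data = [encode(t) for t in texts]

    # Same objective as pretraining, just far fewer tokens and a higher LR.
    args = TrainingArguments(output_dir="florgle-lora", num_train_epochs=3,
                             per_device_train_batch_size=2, learning_rate=2e-4,
                             logging_steps=5)
    Trainer(model=model, args=args, train_dataset=train_data).train()

After training, asking the model what 'florgle' means with no hint in the prompt is the check that the new fact actually landed in the weights.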


It seems like few-shot prompting and providing some examples to LLMs with large context windows vastly outperforms any amount of RAG or fine-tuning.

Aren't RAG and fine-tuning fundamentally flawed, because they only play at the surface of the model? Like sprinkles on top of the cake, expecting them to completely change the flavor. I know LoRA is supposed to appropriately weight the data, but the results say that's not the solution.

Also anecdotal, but way less work!
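
In case it's useful, this is roughly all the few-shot approach amounts to. The sketch assumes an OpenAI-compatible chat API; the model name and the examples are placeholders.

    # Few-shot prompting sketch: no retrieval, no fine-tuning, just worked
    # examples packed into the context. Model name and examples are placeholders.
    from openai import OpenAI

    client = OpenAI()

    examples = [
        ("Classify the ticket: 'App crashes on login'", "bug"),
        ("Classify the ticket: 'Please add dark mode'", "feature-request"),
        ("Classify the ticket: 'How do I reset my password?'", "question"),
    ]

    messages = [{"role": "system", "content": "You label support tickets."}]
    for question, answer in examples:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user",
                     "content": "Classify the ticket: 'Export to CSV is broken'"})

    resp = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
    print(resp.choices[0].message.content)  # expected: "bug"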


Models get confused by long context windows, so shorter is better, and they can't fit everything in general anyway. I'm not sure where you are seeing results that say otherwise.

RAG is effectively prompt context optimization, so categorically rejecting doing that doesn't make sense to me. Maybe if models internalized that or scaled... But they don't.


Totally agree. Every decision on what context to put in a context window is “RAG”. Somehow the term was co-opted to refer to “context selected by vector similarity”, so presumably when people say “is RAG hanging around”, what they mean is “are vectors a complete solution”, to which the answer is obviously “no”. But you still need some sort of _relevance function_ to pick your context - even if it’s pin-the-tail-on-the-donkey. That’s “RAG”.

Doesn’t make sense to ask “will we still have to curate our context?” The answer is of course you will.
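
For what it's worth, here's a minimal sketch of that "relevance function" framing, with vector similarity as just one possible scoring function. It assumes sentence-transformers; the documents, query, and model name are toy placeholders.

    # RAG as "pick context with a relevance function, then prompt".
    # Cosine similarity over embeddings is one relevance function among many.
    from sentence_transformers import SentenceTransformer, util

    documents = [
        "Invoices are archived after 90 days.",
        "The API rate limit is 600 requests per minute.",
        "Support hours are 9am-5pm UTC on weekdays.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(documents, convert_to_tensor=True)

    def retrieve(query, k=2):
        # Relevance function: cosine similarity between query and each document.
        q_vec = embedder.encode(query, convert_to_tensor=True)
        scores = util.cos_sim(q_vec, doc_vecs)[0]
        top = scores.argsort(descending=True)[:k]
        return [documents[int(i)] for i in top]

    query = "How many API calls can I make per minute?"
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    print(prompt)  # hand this to whichever model you like

Swap the retrieve() body for BM25, recency, or hand-picked files and the rest of the pipeline doesn't change; that's the sense in which it's all "RAG".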


RAG and fine-tuning are very different. Few-shot prompting and RAG are both variants of in-context learning.


That's definitely my experience as well: a sufficiently large context window with a capable enough general-purpose LLM solves many, if not all, of the problems RAG/fine-tuning claim to solve.


I've also found (anecdotally) significant success in just throwing in the available context before prompting. I've written multiple automations this way as well.


I asked this on Twitter a few weeks ago and didn't manage to dig out any examples: https://twitter.com/simonw/status/1786163920388177950


AFAICT Gorilla, as in that thread ;-)

Nexusflow probably does too, as it also does function calling and would either need that baked in or need explicit fine-tuning for RAG use, which I don't recall seeing.

I haven't looked recently, but there is also a cool category of models that provide GIS inferencing via LLM.


This blog post I saw recently might be relevant: https://refact.ai/blog/2024/fine-tuning-on-htmlx-making-web-...


Yeah... So it looks like it's still an open question, at least. I guess until we can definitively know how "knowledge" is collectively represented among the weights, it's hard to say either way. The other part of the question is how to evaluate the existence of "knowledge" in an LLM. TFA suggests a way, but I'm still not 100% convinced that's THE way...


TFA says you can teach it new facts, but it's very slow and makes the model hallucinate more.


A new dark age incoming


Ice age


Not really answering your question, but all the "alignment" of the big models is done through a combination of supervised fine-tuning and RLHF. So all the chat and censorship and other specific behaviors are at least in part fine-tuned in. Maybe that is closer to forms rather than actually knowing more...


Would be curious to see if anyone finds it really useful. I've tried both Copilot and CodeWhisperer (now Amazon Q) before, wasn't impressed, and uninstalled both. Just tried Q in VSCode again; I can't figure out how to ask questions relevant to the specific workspace that are useful to me. It seems like a bolt-on chat interface to your IDE with a bad UX. Feels like even "clippy" was more useful back in the day...


I tried out Copilot Workspaces before they fixed the waitlist access check yesterday and it seems to work a lot better by running a multistep process and allowing the user to incrementally modify the plan and rerun code generation.

The UI is similar to the code review interface, except the file list on the left is generated from a plan and there's a bullet-point list for each file of the changes the plan generates. It enables a REPL-like loop where the AI generates code, the user tests the changes, then updates the file plans and reruns code generation, creating and adding files as necessary. The end result is a PR with a generated description, or a commit directly to main.

I'm excited for this next wave of AI coding agents, but Amazon seems to have rushed this one into production.


The Beam feature of bigAGI (IMO one of the best model-provider-agnostic GenAI UXs) lets users send the same prompt to multiple GenAI models at the same time, and gives them different approaches to examine, select, and fuse the best results into a better answer, through a very intuitive and seamless UX. It has been my go-to way of using GenAI in the past few weeks, and IMO the results are better than any individual model's results alone. The best thing is that it can (semi)automatically surface the best results from the models: for instance, Claude 3 Opus's results used to be favored, and now the best results lean more towards gpt-4-turbo-2024-04-09, but with Beam you get this "best model auto selection" without having to manually pick one model over the other.


Exactly my thought. As mentioned in the other thread, chat's linear conversation style is not fit for reasoning/exploration type tasks, while Beam's fan-out -> select -> merge is a much better and more natural flow!


With Beam, we can easily experiment with approaches such as Chain-of-Thought with Self-Consistency (CoT-SC) and other reasoning meta-frameworks, but with more manual control. I've always had issues using an LLM's chat-driven interface to figure out/explore issues I'm interested in, since a conversation/chat is always linear while reasoning/working through ideas is structural. Beam seems to be a much better UX than linear chat, and it saves me a lot of copy-paste and save-and-retry. Awesome work!
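
For anyone curious what that fan-out -> select -> merge flow looks like outside the UI, here's a rough sketch assuming an OpenAI-compatible endpoint; the model names, prompt, and fusion instruction are all illustrative, not how Beam itself is implemented.

    # Fan-out -> merge sketch in the spirit of Beam / self-consistency:
    # send one prompt to several models, then have one model fuse the answers.
    # Assumes an OpenAI-compatible API; names and prompts are illustrative.
    from openai import OpenAI

    client = OpenAI()
    models = ["gpt-4-turbo", "gpt-3.5-turbo"]  # could be any mix of endpoints

    def ask(model, prompt):
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content

    prompt = "What are the trade-offs between RAG and fine-tuning? Think step by step."

    # Fan out: one candidate answer per model.
    candidates = [ask(m, prompt) for m in models]

    # Merge: a fusion prompt asks one model to reconcile the candidates.
    fusion = ("Combine these answers, keeping points they agree on "
              "and flagging conflicts:\n\n" + "\n\n---\n\n".join(candidates))
    print(ask(models[0], fusion))

The manual "select" step in Beam sits between those two stages: you drop the weak candidates before fusing.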


Yes, the only issue is token usage, which is obviously greater since we are sampling more of the solution space. But it's a compromise worth making to get GPT-4.5-level intelligence out of GPT-4.


Probably an even bigger jump, since the models each have some amount of unique training data, they fact-check each other toward a more common "truth", and hallucinations are weeded out.


Awesome feature! Quick question: how do you choose which model to use when you "fuse" multiple beams back into one?


There's a combo box on the right side, and when you click on the "Add Merge" (green) button, the currently active model will be selected.


got it!


I have small kids, toddlers, who can already speak the language but are still developing their "sense of the world", or "theory of mind" if you will. Maybe it's just me, but talking to toddlers often reminds me of interacting with LLMs, where you have this realization from time to time: "oh, they don't get this, I need to break it down more to explain". Of course an LLM has more elaborate language skills due to its exposure to a lot more text (toddlers definitely can't speak like Shakespeare if you ask them, unless, maybe, you are the tiger parent who's been feeding them Romeo and Juliet since age 1), but their ability of "reasoning" and "understanding" seems to be on a similar level. Of course, the other "big" difference is that you expect toddlers to "learn and grow" and eventually be able to understand and develop metacognitive abilities, while LLMs, unless you retrain them (maybe with another architecture, or meta-architecture), "stay the same".


> Maybe it's just me, but talking to toddlers often reminds me of interacting with LLMs

It's not just you. It hit me almost a year ago, when I realized my then-3.5yo daughter had a noticeable context window of about 30 seconds: whenever she went on one of her random rants/stories, anything she didn't repeat within 30 seconds would permanently fall out of the story and never be mentioned again.

It also made me realize why small kids talk so repetitively: what they don't repeat they soon forget, and what they feel like repeating remains, so over the course of a couple of minutes their story kind of knots itself into a loop, being mostly made of the thoughts they feel compelled to carry forward.


And, if you change their context, the story unspooling will change.


Yes. And if they're looped enough in their original story, this feels like the spring from a mechanical watch rapidly unwinding.


It's not just true of toddlers but also of adults in a particular time frame. Maturity of thought is a cultural phenomenon. Descartes thought animals were automatons even though they behaved exactly like humans in almost every aspect in which he could investigate animals and humans in those times, and yet he reached that illogical conclusion.


That's a great point. Just thinking out loud: if we could time travel back to caveman times, and assuming we spoke their language, there would still be so much that we couldn't explain or that they wouldn't be able to understand, even for the smartest caveman adults. Unless, of course, we spent significant time and effort to "bring them up to speed" with modern education.


In Jaynes's 'The Origin of Consciousness in the Breakdown of the Bicameral Mind', there's some interesting investigation into some of our oldest known tales... Beowulf, The Iliad, etc.

In those texts, emotional and mental states are almost always referred to with analogs to physical sensation: 'anger' is the heating of your head, 'fear' is the thudding of your heart. He claims that at the time there wasn't a vocabulary that expressed abstract mental states, and so the distinction between mind and body was not clear-cut. Then, over time, specialized terms to represent those states were invented and passed into common usage, which enabled an ability to introspect that didn't exist before.

(All examples are made up, I read it more than 20 years ago. But it made an impression.)


I don't think it has anything to do with brain development. I think it's entirely related to the development of an individual concept, whenever the structure of ideas that make up the concept is too simple.

I would claim that most people use intuition/assumptions rather than an internal chain of thought when communicating, meaning they will present that simplified concept without a second thought, leading to the same behavior as the toddler. It's actually trivial to spot someone who doesn't use assumptions, because they take a moment to respond, using an internal chain-of-thought style of consideration to give a careful answer. I would even claim that a fast response is seen as more valuable than a slow one, with a moment of silence before a response being taken as an indication of incompetence. I know I've seen it, where some expert takes a moment to consider/compress, and people get frustrated and second-guess them.


Open-source generic LLM frontend projects such as bigAGI (https://github.com/enricoros/big-agi) have had this feature for many months now. The good news: it even works with open-source and local LLMs.


Aren't fountain codes doing something similar, albeit for a slightly different purpose?


I might be wrong, but it looks like this could help with speculative decoding, which already vastly improves inference speed?

