We were working on this webpage to collect the entire three part article in one place (the third part isn't published yet). We didn't expect anyone to notice the site! Either way, part 3 should be out in a week or so.
> Note that in recent times, some doubt has been cast on if this technique is as powerful as believed. Additionally, there’s significant debate as to exactly what is going on during inference when Chain-of-Thought is being used...
I love this new era of computing we're in where rumors, second-guessing and something akin to voodoo have entered into working with LLMs.
That's the thing, it's a novel form of computing that's increasingly moving away from computer science. It deserves to be treated as a discipline of its own, with lots of words of caution and danger stickers slapped over it.
It’s text (word) manipulation based on probabilistic rules derived from analyzing human-produced text. And everyone knows language is imperfect. That’s why we have introduced logic and formalism so that we can reliably transmit knowledge.
That’s why LLMs are good at translating and spellchecking. We’ve been describing the same world and almost all texts respect grammar. Those are the first things that surface. But you can extract the same rules in other ways and create a program that does it without the waste of computing power.
If we describe computing as solving problems, then it’s not computing, because if your solution was not part of the training data, you won’t solve anything. If we describe computing as symbol manipulation, then it’s not doing a good job, because the rules change with every model and they are probabilistic. No way to get a reliable answer. It’s divination without the divine (no hint from an omniscient entity).
Anyone have a convenient solution for doing multi-step workflows? For example, I'm filling out the basics of an NPC character sheet for my game prep. I'm using a certain rule system and giving the enemy certain tactics, certain stats, certain types of weapons. Right now I have a 'god prompt' trying to walk the LLM through creating the basic character sheet, but the responses get squeezed down into what fits in one or two prompt responses.
If I can do node-red or a function chain for prompts and outputs, that would be sweet.
For me, a very simple "break down tasks into a queue and store in a DB" solution has helped tremendously with most requests.
Instead of trying to do everything in a single chat or chain, add steps to ask the LLM to break down the next tasks, with context, and store that in SQLite or something. Then start new chats/chains on each of those tasks.
Then just loop them back into LLM.
I find that long chats or chains just confuse most models and we start seeing gibberish.
Right now I'm favoring something like:
"We're going to do task {task}. The current situation and context is {context}.
Break down what individual steps we need to perform to achieve {goal} and output these steps with their necessary context as {standard_task_json}. If the output is already enough to satisfy {goal}, just output the result as text."
I find that leaving everything to LLM in a sequence is not as effective as using LLM to break things down and having a DB and code logic to support the development of more complex outcomes.
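For what it's worth, a minimal sketch of that queue pattern in Python; `call_llm` is a stand-in for whatever chat API you use, and the task JSON shape is invented for illustration:

```python
import json
import sqlite3

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for your chat-completion API

db = sqlite3.connect("tasks.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS tasks "
    "(id INTEGER PRIMARY KEY, description TEXT, context TEXT, result TEXT)"
)

def enqueue_subtasks(task: str, context: str, goal: str) -> None:
    """Ask the LLM to break the task down; store each step, or the direct answer, in the queue."""
    reply = call_llm(
        f"We're going to do task {task}. The current situation and context is {context}.\n"
        f"Break down what individual steps we need to perform to achieve {goal}, and output "
        "them as a JSON list of objects with 'description' and 'context' keys. "
        "If you can already satisfy the goal, just answer in plain text."
    )
    try:
        steps = json.loads(reply)
    except json.JSONDecodeError:
        # The model answered directly instead of producing a task list.
        db.execute("INSERT INTO tasks (description, context, result) VALUES (?, ?, ?)",
                   (task, context, reply))
    else:
        for step in steps:
            db.execute("INSERT INTO tasks (description, context) VALUES (?, ?)",
                       (step["description"], step["context"]))
    db.commit()

def drain_queue() -> None:
    """Each pending task gets its own fresh chat instead of one long, confusing chain."""
    pending = db.execute("SELECT id, description, context FROM tasks WHERE result IS NULL").fetchall()
    for task_id, description, context in pending:
        result = call_llm(f"Task: {description}\nContext: {context}\nComplete this task.")
        db.execute("UPDATE tasks SET result = ? WHERE id = ?", (result, task_id))
        db.commit()
```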
One option for doing this is to incrementally build up the "document" using isolated prompts for each section. I say document because I am not exactly sure what the character sheet looks like, but I am assuming it can be constructed one section at a time. You create a prompt to create the first section. Then, you create a second prompt that gives the agent your existing document and prompts it to create the next section. You continue until all the sections are finished. In some cases this works better than doing a single conversation.
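Roughly, that loop looks like the sketch below; the section names and the `call_llm` helper are just illustrative:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for your chat-completion API

def build_document(sections: list[str]) -> str:
    """Generate a document one section at a time, feeding the document-so-far back in."""
    document = ""
    for section in sections:
        prompt = (
            f"Here is the document so far:\n{document}\n\n"
            f"Write only the next section, '{section}', staying consistent "
            "with what is already written."
        )
        document += f"\n## {section}\n" + call_llm(prompt)
    return document

# e.g. build_document(["Stats", "Weapons", "Tactics", "Loot"]) for an NPC sheet
```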
You can do multi-shot workflows pretty easily. I like to have the model produce markdown, then add code blocks (```json/yaml```) to extract the interim results. You can lay out multiple "phases" in your prompt and have it perform each one in turn, and have each one reference prior phases. Then at the end you just pull out the code blocks for each phase and you have your structured result.
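Pulling out the interim results at the end is then a small bit of parsing; a rough sketch, assuming the phases emit ```json blocks:

```python
import json
import re

FENCE = "`" * 3  # markdown code-fence delimiter

def extract_json_blocks(reply: str) -> list[dict]:
    """Return the parsed contents of every json-fenced block in the model's markdown reply, in order."""
    pattern = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)
    return [json.loads(block) for block in pattern.findall(reply)]

# phase_results = extract_json_blocks(model_reply)  # one parsed dict per phase
```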
RAG does not prevent hallucinations, nor does it guarantee that the quality of your output is contingent solely on the quality of your input. Using LLMs for legal use cases, for example, has shown them to be poor for anything other than initial research, as they are at best 65% accurate:
> So would strongly disagree that LLMs have become "good enough" for real-world applications based on what was promised.
I can't speak for "what was promised" by anyone, but LLMs have been good enough to live in production as a core feature in my product since early last year, and have only gotten better.
You may be interested in "Deterministic Quoting"[1]. This doesn't completely "solve" hallucinations, but I would argue that we do get "good enough" in several applications.
Looks like the same content that was posted on oreilly.com a couple days ago, just on a separate site. That has some existing discussion: https://news.ycombinator.com/item?id=40508390.
As we move LLM-enabled products into production, we definitely see a lot of what is being discussed here resonate. We also see the areas below as ones that need to be expanded upon for developers building in this space to take products to production.
I would love to see this article also expand to touch on things like:
- data management - (tooling, frameworks, open vs closed data management, labelling & annotations)
- inference as a pipeline - frameworks for breaking down model inference into smaller tasks & combining outputs (do DAGs have a role to play here?)
- prompts - areas like caching, management, versioning, evaluations
- model observability - tokens, costs, latency, drift?
- evals for multimodality - how do we tackle evals here which in turn can go into loops e.g. quality of audio, speech or visual outputs
I'm not saying the content of the article is wrong, but what apps are the people/companies writing articles like this actually building? I'm seriously unable to imagine any useful app. I only use GPT via the API (as a better Google for documentation, and its output is never usable without heavy editing). This week I tried to use "AI" in Notion: I needed to generate 84 check boxes, one for each day starting from a specific date. I got 10 check boxes and a line saying "the rest should go here..." (or some variation of such lazy output). Completely useless.
I've built many production applications using a lot of these techniques and others - it's made money either by increasing sales or decreasing operational costs.
I think you're going about it backwards. You don't take a tool, and then try to figure out what to do with it. You take a problem, and then figure out which tool you can use to solve it.
But it seems to me that's what they're doing: "We have LLMs, what to do with them?" But anyway, I'm seriously just looking for an example of an app that is built with the stuff described in the article.
Me personally, I've only used an LLM for one "serious" application: I used GPT-3.5 Turbo to transform unstructured text into JSON. It was basically just an ad-hoc Node.js script that called the API (the prompt was a few examples of input-output pairs) and then did some checks (these checks usually failed only because GPT also corrected misspellings). It would have taken me weeks to do it manually, but with GPT's help it took a few hours (writing the script, plus I made a lot of misspellings, so the script stopped a lot). But I cannot imagine anything more complex.
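The shape of that kind of script is roughly the sketch below (in Python rather than Node.js, with invented fields; `call_llm` stands in for the API call):

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for the GPT-3.5 API call

FEW_SHOT = """Convert the note into JSON with keys "name", "city", "year".

Note: Met Jane Doe in Lisbon back in 2019.
JSON: {"name": "Jane Doe", "city": "Lisbon", "year": 2019}

Note: <note>
JSON:"""

def note_to_json(note: str) -> dict:
    record = json.loads(call_llm(FEW_SHOT.replace("<note>", note)))
    missing = {"name", "city", "year"} - set(record)
    if missing:
        # The cheap sanity checks: stop and inspect instead of trusting the output.
        raise ValueError(f"missing keys: {missing}")
    return record
```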
Since you seem to have not noticed my comment above, here's another example of a project that implements many of these techniques. Me and many others have used this to transcribe hour long videos into a well organized "docs site" that makes the content easy to read.
This was completely auto-generated in a few minutes. The author of the library reviewed it and said that it's nearly 100% correct and people in the company where it was built rely on these docs.
Tell me how long it would take you to write these docs. I'm really confused where your dismissive mentality is coming from in the face of what I think is overwhelming evidence to the contrary. I'm happy to provide example after example after example. I'm sorry, but you are utterly, completely wrong in your conclusions.
But that seems to belong to the category "text transformation" (e.g. translating, converting unstructured notes into structured data, etc.), which I acknowledge LLMs are good at; instead of the category of "I'll magically debug your SQL!" wishes.
I believe we were discussing the former not the latter? I agree that for lots of problem solving tasks it can be hit or miss - in my experience, all the models are quite bad at writing decent frontend code when it comes to the rendered page looking the way you want it to.
What you're describing is more about reasoning abilities - that's not really what the article was about or the problems the techniques are for. The techniques in the article are more for stuff like Q&A, classification, summarization, etc.
I've tried this type of thing quite a bit (generating documentation based on code I've written), and it's generally pretty bad. Even just generating a README for a single source file project produces bloviated fluff that I have to edit rigorously. I'd say it does about 40% of the job, which is obviously a technical marvel, but in a practical sense it's more novelty than utility.
Please just go and try the lumentis library I mentioned - that is what was used to generate this. It works. For the library docs I showed, I literally just wrote a zsh script to concat all the code together into one file, each one wrapped with XML open/close tags, and fed that in. Just because you weren't able to do it doesn't mean it's a novelty.
Are you sure? The article says "cite this as Yan et al. (May 2024)" and published-time in the metadata is 2024-05-12.
Weird: I just refreshed the page and it now redirects to a different domain (than the originally-submitted URL) and has a date of June 8, 2023. It still cites articles and blog posts from 2024, though.
Almost all of this should flow from common-sense. I would use what makes sense for your application, and not worry about the rest. It's a toolbox, not a rulebook. The one point that comes more from experience than from common-sense is to always pin your model versions. As a final tip, if despite trying everything, you still don't like the LLM's output, just run it again!
Here is a summary of all points:
1. Focus on Prompting Techniques:
1.1. Start with n-shot prompts to provide examples demonstrating tasks.
1.2. Use Chain-of-Thought (CoT) prompting for complex tasks, making instructions specific.
1.3. Incorporate relevant resources via Retrieval Augmented Generation (RAG).
2. Structure Inputs and Outputs:
2.1. Format inputs using serialization methods like XML, JSON, or Markdown.
2.2. Ensure outputs are structured to integrate seamlessly with downstream systems.
3. Simplify Prompts:
3.1. Break down complex prompts into smaller, focused ones.
3.2. Iterate and evaluate each prompt individually for better performance.
4. Optimize Context Tokens:
4.1. Minimize redundant or irrelevant context in prompts.
4.2. Structure the context clearly to emphasize relationships between parts.
5. Leverage Information Retrieval/RAG:
5.1. Use RAG to provide the LLM with knowledge to improve output.
5.2. Ensure retrieved documents are relevant, dense, and detailed.
5.3. Utilize hybrid search methods combining keyword and embedding-based retrieval.
6. Workflow Optimization:
6.1. Decompose tasks into multi-step workflows for better accuracy.
6.2. Prioritize deterministic execution for reliability and predictability.
6.3. Use caching to save costs and reduce latency.
7. Evaluation and Monitoring:
7.1. Create assertion-based unit tests using real input/output samples (see the sketch after this list).
7.2. Use LLM-as-Judge for pairwise comparisons to evaluate outputs.
7.3. Regularly review LLM inputs and outputs for new patterns or issues.
8. Address Hallucinations and Guardrails:
8.1. Combine prompt engineering with factual inconsistency guardrails.
8.2. Use content moderation APIs and PII detection packages to filter outputs.
9. Operational Practices:
9.1. Regularly check for development-prod data skew.
9.2. Ensure data logging and review input/output samples daily.
9.3. Pin specific model versions to maintain consistency and avoid unexpected changes.
10. Team and Roles:
10.1. Educate and empower all team members to use AI technology.
10.2. Include designers early in the process to improve user experience and reframe user needs.
10.3. Ensure the right progression of roles and hire based on the specific phase of the project.
11. Risk Management:
11.1. Calibrate risk tolerance based on the use case and audience.
11.2. Focus on internal applications first to manage risk and gain confidence before expanding to customer-facing use cases.
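As a rough illustration of 7.1, an assertion-based test built around a real input/output sample might look like this; `summarize_ticket` is a made-up stand-in for whatever LLM-backed function is under test:

```python
def summarize_ticket(ticket: str) -> str:
    raise NotImplementedError  # stand-in for your LLM-backed summarizer

def test_summary_is_grounded_and_short():
    ticket = "Customer reports checkout fails with a 502 error after applying coupon SAVE10."
    summary = summarize_ticket(ticket)

    assert len(summary.split()) < 40            # stays concise
    assert "502" in summary                     # keeps the key fact
    assert "SAVE10" in summary                  # keeps the coupon code
    assert "refund" not in summary.lower()      # doesn't invent an action nobody asked for
```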
Interesting blog. It seems to be a compendium of advice for all kinds of folks ranging from end user to integration partner. For a slightly different take on how to use LLMs to build software, you might be interested in https://www.infoq.com/articles/llm-productivity-experiment/ which documents an experiment where the same prompt was given to various prominent LLMs, asking each to write two unit tests for an already existing code base. The results were collected, metrics were analyzed, then comparisons were made. No advice on how to write better prompts, but some insight on how to work with and what you can expect from LLMs in order to improve developer productivity.
No offense, but I'd love to see what they've successfully built using LLMs before taking their advice too seriously. The idea that fine-tuning isn't even a consideration (perhaps even something they think is absolutely incorrect, if the section titles of the unfinished section are anything to go by) is very strange to me and suggests a pretty narrow perspective IMO.
We work in some pretty serious domains and try to stay away from fine tuning:
- Most of our accuracy ROI is from agentic loops over top models, and dynamic RAG example injection goes far enough here that the relative lift from adding fine-tuning isn't worth the many costs
- A lot of fine-tuning is for OSS models that do worse than agentic loops over the proprietary GPT4/Opus3
- For distribution, it's a lot easier to deploy for pluggable top APIs without requiring fine-tuning, e.g., "connect to your gpt4/opus3 + for dumber-but-bigger tasks, groq"
- The resources we could put into fine-tuning are better spent on RAG, agentic loops, prompts/evals, etc
We do use tuned smaller dumber models, such as part of a coarse relevancy filter in a firehose pipeline... but these are outliers. Likewise, we expect to be using them more... but again, for rarer cases and only after we've exhausted other stuff. I'm guessing as we do more fine-tuning, it'll be more on embeddings than LLMs, at least until OSS models get a lot better.
See, if the article said this, I would have agreed - fine-tuning is a tool and it should be used thoughtfully. I personally believe that in this funding climate it makes sense to make data collection and model training a core capability of any AI product. However, that will only be available and wise for some founders.
Agreed, model training and data collection are great!
The subtle bit is that it just doesn't have to be for LLMs, as these are typically part of a system-of-models. E.g., we <3 RAG, and GNNs for improving your KG are fascinating. Likewise, dspy's explorations in optimizing prompts, vs LLMs, are very cool.
Yeah I would recommend sticking to RAG on naively chunked data for weekend projects by 1 person. Likewise, a consumer tool like perplexity's search engine where you minimize spend per user task or go bankrupt, same thing, do the cheap thing and move on, good enough
Once RAG projects become important and good answers matter - we work with governments, manufacturers, banks, cyber teams, etc - working through data quality, data representation, & retrieval quality helps
Note that we didn't start here: We began with naive RAG, then relevancy filtering, then agentic & neurosymbolic querying, then dynamic example prompt injection, and now are getting into cleaning up the database/kg itself
For folks doing investigative/analytics projects in this space, happy to chat about what we are doing w Louie.AI. These are more implementation details we don't normally write about.
We tried dspy and a couple others like it. They're neat and I'm happy those teams are experimenting with these frameworks. At the same time, they try to do "too much" by taking over the control flow of your code and running autotuning everywhere over it. We needed to write our own agent framework as even tools like langchain are too insecure and inefficient for being an enterprise platform, and frameworks like dspy are even more far out there.
A year+ later, the most interesting kernel of insight to us from dspy is autotuning a single prompt: it's an optimizable model just like any other. As soon as you have an eval framework in place for your prompts, having something like dspy tune your prompts on a per-LLM basis would be very cool. I'm not sure where they are on that; it seems against the grain for their focus. We're only now reaching the point where we would see ROI on that kind of thing - it took a long time to get here.
We do run an agentic framework, so doing cross-prompt autotuning would be neat too -- especially for how the orchestrator (ex: CoT) composes with individual agents. We call this the "composition problem" and it's frustrating. However, again, dspy and friends do "too much", by trying to also be the agent framework & runtime, while we just want the autotuner.
The rest is neat but scary for most production scenarios, while a prompt autotuner can give significant lift + resilience in a predictable & maintainable way to most typical LLM apps
Again... I'm truly happy and supportive that academics are exploring a wild side of the design space. Just, as we are in the 'we ship code people rely on' side of the universe, it's hard to find scenarios where its potential benefits outweigh its costs.
Entity resolution - RAG often mixes vector & symbolic queries, and ER improves reverse indexing, which is a starting point for a lot of the symbolic ones
Identifying misinfo - Ranking & summarization based on internet data should be a lot more careful, and sometimes the controversy is the interesting part
> The idea that fine-tuning isn't even a consideration (perhaps even something they think is absolutely incorrect, if the section titles of the unfinished section are anything to go by) is very strange to me and suggests a pretty narrow perspective IMO
The article has a section called "When to finetune", along with links to separate pages describing how to do so. They absolutely don't say that "fine-tuning isn't even a consideration". Instead, they describe the situations in which fine-tuning is likely to be helpful.
If you’re interested in using one of the LLM applications I have in prod, check out https://hex.tech/product/magic-ai/ - it has a free limit every month to give it a try and see how you like it. If you have feedback after using it, we’re always very interested to hear from users.
As far as fine-tuning in particular, our consensus is that there are easier options first. I personally have fine-tuned gpt models since 2022; here’s a silly post I wrote about it on gpt 2: https://wandb.ai/wandb/fc-bot/reports/Accelerating-ML-Conten...
I took a look at Magic earlier today and it didn't work at all for me, sorry to say. After the example prompt, I tried to learn about a table and it generated bad SQL (a correct query to pull a row, but with LIMIT 0). I asked it to show me the DDL and it generated invalid SQL. Then I tried to ask it to do some population statistics on the customer table and ended up confused about why there appeared to be two windows in the cell, with the previously generated SQL on the left and the newly generated SQL on the right. The new SQL wouldn't run when I hit run cell; the error showed the originally generated SQL. I gave up and bounced.
I went back while writing this comment and realized it might be showing me a diff (better use of color would have helped, I have been trained by github). But I was at a loss for what to do with that. I just now figured out the Keep button exists and it accepted the diff and now it sort of makes sense, but the SQL still doesn't return any results.
My honest feedback is that there is way too much stuff I don't understand on the screen and it makes me confused and a little stressed. Ease me into it please, I'm dumb. There seem to be cells that are linked together and cells that aren't (separated by a purplish background?), and I don't understand it. I am a jupyter user and I feel like this should be intuitive to me, but it isn't. I am not a designer, but I suspect the structural markings like cell boundaries are too faint compared to the content of the cells, and/or the exterior of a cell having the same color as the interior is making it hard for me. I feel lost in a sea of white.
But the core issue is that, excluding the prompt I copy-pasted word for word (which worked like a charm), I am 0 out of 4 on actually leveraging AI to solve the problems I asked of Magic. I like the concept of natural-language BI (I worked on it in the early days when Alexa came out), so I probably gave it more chances than I would have for a different product.
For me, it doesn't fit my criteria for good problems to solve with AI in 2024 - the conversational interface and binary right/wrong nature of querying/presenting data accurately make the cost of failure too high, which is a death sentence for AI products IMO (compare to proactive, non-blocking products like copilot or shades-of-wrong problems like image generation or conversations with imaginary characters). But text-to-SQL and data presentation make sense as AI capabilities in 2024 so I can see why that could be a good product to pursue. If it worked, I would definitely use it.
This was kind of conventional wisdom ("fine tune only when absolutely necessary for your domain", "fine-tuning hurts factuality"), but some recent research (some of which they cite) has actually quantitatively shown that RAG is much preferable to FT for adding domain-specific knowledge to an LLM:
But "knowledge injection" is still pretty narrow to me. Here's an example of a very simple but extremely valuable usecase - taking a model that was trained on language+code and finetuning it on a text-to-DSL task, where the DSL is a custom one you created (and thus isn't in the training data). I would consider that close to infeasible if your only tool is a RAG hammer, but it's a very powerful way to leverage LLMs.
This is exactly (one of) our use cases at Eraser - taking code or natural language and producing diagram-as-code DSL.
As with other situations that want a custom DSL, our syntax has its own quirks and details, but is similar enough to e.g. Mermaid that we are able to produce valid syntax pretty easily.
What we've found harder is controlling for edge cases about how to build proper diagrams.
Agree that your use-case is different. The papers above are dealing mostly with adding a domain-specific textual corpus, still answering questions in prose.
"Teaching" the LLM an entirely new language (like a DSL) might actually need fine-tuning, but you can probably build a pretty decent first-cut of your system with n-shot prompts, then fine-tune to get the accuracy higher.
Fine-tuning is absolutely necessary for true AI, but even though it's desirable, it's unfeasible for now for any large model, considering how expensive GPUs are. If I had infinite money, I'd throw it at continuous fine-tuning and would throw away the RAG. Fine-tuning also requires appropriate measures to prevent forgetting of older concepts.
It is not unfeasible. It is absolutely realistic to do distributed finetuning of an 8B text model on previous generation hardware. You can add finetuning to your set of options for about the cost of one FTE - up to you whether that tradeoff is worth it, but in many places it is. The expertise to pull it off is expensive, but to get a mid-level AI SME capable of helping a company adopt finetuning, you are only going to pay about the equivalent of 1-3 senior engineers.
Expensive? Sure, all of AI is crazy expensive. Unfeasible? No
I don't consider a small 8B model to be worth fine-tuning. Fine-tuning is worthwhile when you have a larger model with capacity to add data, perhaps one that can even grow its layers with the data. In contrast, fine-tuning a small saturated model will easily cause it to forget older information.
All things considered, in relative terms, as much as I think fine-tuning would be nice, it will remain significantly more expensive than just making RAG or search calls. I say this while being a fan of fine-tuning.
A well-trained 8B model will already be over-saturated with information from the start. It will therefore easily forget much old information when fine-tuning it with new materials. It just doesn't have the capacity to take in too much information.
Don't get me wrong. I think a 70B or larger model would be worth fine-tuning, especially if it can be grown further with more layers.
> A well-trained 8B model will already be over-saturated with information from the start
Any evidence of that that I can look at? This doesn't match what I've seen nor have I heard this from the world-class researchers I have worked with. Would be interested to learn more.
Upon further thought, if fine-tuning involves adding layers, then the initial saturation should not matter. Let's say if an 8B model adds 0.8*2 = 1.6B of new layers for fine-tuning, then with some assumptions, a ballpark is that this could be good for 16 million articles for fine-tuning.
The reason to fine tune is to get a model that performs well on a specific task. It could lose 90 percent of its knowledge and still beat the untuned model at the narrow task at hand. That's the point, no?
It is not really possible to lose 90% of one's brain and do well on certain narrow tasks. If the tasks truly were so narrow, you would be better off training a small model just for them from scratch.
Fine tuning has been on the way out for a while. It's hard to do right and costly. LoRAs are better for influencing output style as they don't dumb down the model, and they're easier to create. This is on top of RAG just being better for new facts like the other reply mentioned.
How much of that is just the flood of traditional engineers into the space and the fact that collecting data and then fine-tuning models is orders of magnitude more complex than just throwing in RAG? I suspect a huge amount of RAG's popularity is just that any engineer can do a version of it + ChatGPT API calls in a day.
As for lora - in the context of my comment, that's just splitting hairs IMO. It falls in the category of finetuning for me, although I understand why you might disagree. But it's not like the article mentions lora either, nor am I aware of people doing lora without GPUs which the article is against (No GPUs before PMF)
I disagree. No amount of fine tuning will ever give the LLM the relevant context with which to answer my question. Maybe if your context is a static Wikipedia or something that will never change, you can fine tune it. But if your data and docs keep changing, how is fine tuning going to be better than RAG?
Luckily it's not one or the other. You can fine tune and use RAG.
Sometimes RAG is enough. Sometimes fine tuning on top of RAG is better. It depends on the use case. I can't think of any examples where you would want to fine tune and not use rag as well.
Sometimes you fine tune a small model so it performs close to a larger variant on that specific narrow task, and you improve inference performance by using a smaller model.
Continuous retraining and deployment maybe? But I'm actually not anti-RAG (although I think it is overrated because the retrieval problem is still handled extremely naively), I just think that fine-tuning should also be in your toolkit.
Why is the retrieval part overrated? There isn't even a single way to retrieve. It could be a simple keyword search, a vector search, a combo, or just simply retrieving a single doc and stuffing it in the context.
People will disagree, but my problem with retrieval is that every technique that is popular uses one-hop thinking - you retrieve information that is directly related to the prompt using old-school techniques (even though the embeddings are new, text similarity is old). LLMs are most powerful, IMO, at horizontal thinking. Building a prompt using one-hop narrow AI techniques and then feeding it into a powerful generally capable model is like building a drone but only letting it fly over streets that already exist - not worthless, but only using a fraction of the technology's power.
A concrete example is something like a tool for querying an internal company wiki and the query "tell me about the Backend Team's approach to sprint planning". Normal retrieval approaches will pull information directly related to that query. But what if there is no information about the backend team's practices? As a human, you would do multi-hop/horizontal information extraction - you would retrieve information about who makes up the backend team, then retrieve information about them and their backgrounds/practices. You might have a hypothesis that people carry over their practices from previous experiences, so you look at the previous teams and their practices. Then you would have the context necessary to give a good answer. I don't know of many people implementing RAG like that. And what I described is 100% possible for AI to do today.
Techniques that would get around this like iterative retrieval or retrieval-as-a-tool don't seem popular.
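For anyone who wants the flavor of it, a minimal sketch of that kind of iterative retrieval loop (both `search` and `call_llm` are placeholders for your retriever and chat API):

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for your chat-completion API

def search(query: str) -> list[str]:
    raise NotImplementedError  # stand-in for keyword/vector retrieval over the wiki

def multi_hop_answer(question: str, max_hops: int = 3) -> str:
    """Let the model decide what to look up next until it has enough context to answer."""
    context: list[str] = []
    query = question
    for _ in range(max_hops):
        context.extend(search(query))
        reply = call_llm(
            "Question: " + question + "\n"
            "Context so far:\n" + "\n".join(context) + "\n"
            "If you can answer, start with ANSWER:. Otherwise start with SEARCH: "
            "followed by the next query to run (e.g. who is on the backend team, "
            "where did those people work before)."
        )
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        query = reply.removeprefix("SEARCH:").strip()
    return call_llm("Answer as best you can.\nQuestion: " + question +
                    "\nContext:\n" + "\n".join(context))
```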
People can't do that because of cost. If every single query involved taking everything even remotely related to the query and passing it to OpenAI, it would get expensive very, very fast.
It's not a technical issue, it's a practicality issue imo.
That's very true. Although it is feasible if you are well-resourced and make the investment to own the toolchain end-to-end. Serving costs can be quite low (relatively speaking) if you control everything. And you have to pick the correct problem where the cost is worthwhile.
I don't see why this is seen as an either-or by people? Fine-tuning doesn't eliminate the need for RAG, and RAG doesn't obviate the need for fine-tuning either.
Note that their guidance here is quite practical:
> If prompting gets you 90% of the way there, then fine-tuning may not be worth the investment.
This is pure gold!! Thank you so much Eugene and gang for doing this. For those of them that I have encountered, I can 100% agree with them. This is fantastic!! So many good insights.
If you didn't follow what has been happening in the LLM space, this document gives you everything you need to know about state-of-the-art LLM usage & applications.
Show me the use cases you have supported in production. Then I might read all the 30 pages praising the dozens (soon to be hundreds?) of “best practices” to build LLMs.
Hi, Hamel here. I'm one of the co-authors. I'm an independent consultant and not all clients allow me to talk about their work.
However, I have two that do, which I've discussed in the article. These are two production use cases that I have supported (which again, are explicitly mentioned in the article):
Eugene Yan works with LLMs extensively at Amazon and uses that to inform his writing: https://eugeneyan.com/writing/ (However he isn't allowed to share specifics about Amazon)
We use LLMs in dozens of different production applications for critical business flows. They allow for a lot of dynamism in our flows that aren’t amenable to direct quantitative reasoning or structured workflows. Double digit percents of our growth in the last year are entirely due to them. The biggest challenge is tool chain, limits on inference capacity, and developer understanding of the abilities, limits, and techniques for using LLMs effectively.
I often see these messages from the community doubting the reality, but LLMs are a powerful tool in the tool chest. But I think most companies are not staffed with engineers skilled enough, or with a creative enough bent, to really take advantage of them yet, or willing to fund basic research and from-first-principles toolchain creation. That's ok. But it's foolish to assume this is all hype like crypto was. The parallels are obvious but the foundations are different.
No one is saying that all of AI is hype. It clearly isn't.
But the facts are that today LLMs are not suitable for use cases that need accurate results. And there is no evidence or research that suggests this is changing anytime soon. Maybe forever.
There are very strong parallels to crypto in that (a) people are starting with the technology and trying to find problems and (b) there is a cult like atmosphere where non-believers are seen as being anti-progress and anti-technology.
Yeah I think a key is LLMs in business are not generally useful alone. They require classical computing techniques to really be powerful. Accurate computation is a generally well established field and you don’t need an LLM to do optimization or math or even deductive logical reasoning. That’s a waste of their power which is typically abstract semantic abductive “reasoning” and natural language processing. Overlaying this with constraints, structure, and augmenting with optimizers, solvers, etc, you get a form of computing that was impossible more than 5 years prior and is only practical in the last 9 months.
On the crypto stuff, yeah, I get it - especially if you're not in the weeds of its use. A lot of people formed opinions from GPT-3.5, Gemini, Copilot, and other crappy experiences and haven't kept up with the state of the art. The rate of change in AI is breathtaking and, I think, hard to comprehend for most people. Also, the recent mess of crypto and the fact that grifters grift etc. also hurts. But people who doubt -are- stuck in the past. That's not necessarily their fault, and it might not even apply to their career or lives in the present, and the flaws are enormous, as you point out. But it's such a remarkably powerful new mode of compute that it, in combination with all the other powerful modes of compute, is changing everything and will continue to, especially if next-generation models keep improving as they seem likely to.
That text applies to basically every new technology. The point is that you can't predict its usefulness in 20 years from that.
To me it still looks like a hammer made completely from rubber. You can practice to get some good hits, but it is pretty hard to get something reliable. And a beginner will basically just bounce it around. But it is sold as rescue for beginners.
I didn't see anything in the article that indicated the authors believed that those who don't see use cases for LLMs are anti-progress or anti-technology. Is that comment related to the authors of this article, or just a general grievance you have unrelated to this article?
> We use LLMs in dozens of different production applications for critical business flows. They allow for a lot of dynamism in our flows that aren’t amenable to direct quantitative reasoning or structured workflows. Double digit percents of our growth in the last year are entirely due to them. The biggest challenge is tool chain, limits on inference capacity, and developer understanding of the abilities, limits, and techniques for using LLMs effectively.
That sounds like corporate buzzword salad. It doesn't tell much as it stands, not without at least one specific example to ground all those relative statements.
Hi, Hamel here. I'm one of the co-authors. I'm an independent consultant and not all clients allow me to talk about their work.
However, I have two that do, which I've discussed in the article. These are two production use cases that I have supported (which again, are explicitly mentioned in the article):
Eugene Yan works with LLMs extensively at Amazon and uses that to inform his writing: https://eugeneyan.com/writing/ (However he isn't allowed to share specifics about Amazon)
You've linked to a query generator for a custom programming language and a 1 hour video about LLM tools. The cynic in me feels like the former could probably be done by chatgpt off the shelf.
But those do not seem to be real world business cases.
Can you expand a bit more on why you think they are? We don't have hours to spend reading, and you say you've been allowed to talk about them.
So can you summarise the business benefits for us, which is what people are asking for, instead of linking to huge articles?
> The cynic in me feels like the former could probably be done by chatgpt off the shelf.
Hello! I'm the owner of the feature in question who experimented with chatgpt last year in the course of building the feature (and working with Hamel to improve it via fine-tuning later).
Even today, it could not work with ChatGPT. To generate valid queries, you need to know which subset of a user's dataset schema is relevant to their query, which makes it equally a retrieval problem as it does a generation problem.
Beyond that, though, the details of "what makes a good query" are quite tricky and subtle. Honeycomb as a querying tool is unique in the market because it lets you arbitrarily group and filter by any column/value in your schema without pre-indexing and without any cost w.r.t. cardinality. And so there are many cases where you can quite literally answer someone's question, but there are multitudes of ways you can be even more helpful, often by introducing a grouping that they didn't directly ask for. For example, "count my errors" is just a COUNT where the error column exists, but if you group by something like the HTTP route, the name of the operation, etc. -- or the name of a child operation and its calling HTTP route for requests -- you end up actually showing people where and how these errors come from. In my experience, the large majority of power users already do this themselves (it's how you use HNY effectively), and the large majority of new users who know little about the tool simply have no idea it's this flexible. Query Assistant helps them with that and they have a pretty good activation rate when they use it.
Unfortunately, ChatGPT and even just good old fashioned RAG is often not up to the task. That's why fine-tuning is so important for this use case.
Thanks for the reply. Huge fan of Honeycomb and the feature. Spent many years in observability and built some of the largest log platforms in use. Tracing is the way of the future and I hope to see you guys eat that market. I did some executive tech strategy stuff at some megacorp on observability and it's really hard to unwedge metrics and logs, but I've done my best when it was my focus. Good luck and thanks for all you're doing over there.
They think they are real business use cases, because real businesses use them to solve their use cases. They know that chatgpt can't solve this off the shelf, because they tried that first and were forced to do more in order to solve their problem.
There's a summary for ya! More details in the stuff that they linked if you want to learn. Technical skills do require a significant time investment to learn, and LLM usage is no different.
I’ve listed plenty in my comment history. I don’t generally feel compelled to trot them all out all the time - I don’t need to “prove” anything and if you think I’m lying that’s your choice. Finally, many of our uses are trade secrets and a significant competitive advantage so I don’t feel the need to disclose them to the world if our competitors don’t believe in the tech. We can keep eating their lunch.
Processing high volumes of unstructured data (text)… we’re using a STAG architecture.
- Generate targeted LLM micro summaries of every record (ticket, call, etc.) continually
- Use layers of regex, semantic embeddings, and scoring enrichments to identify report rows (pivots on aggregates) worth attention, running on a schedule
- Proactively explain each report row by identifying what’s unusual about it and LLM summarizing a subset of the microsummaries.
- Push the result to webhook
Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.
Another is preventing LLMs from adding intro or conclusion text.
> Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.
(Plug) I shipped a dedicated OpenAI-compatible API for this, jsonmode.com a couple weeks ago and just integrated Groq (they were nice enough to bump up the rate limits) so it's crazy fast. It's a WIP but so far very comparable to JSON output from frontier models, with some bonus features (web crawling etc).
We actually built an error-tolerant JSON parser to handle this. Our customers were reporting exactly the same issue- trying a bunch of different techniques to get more usefully structured data out.
> Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.
How are you struggling with this, let alone as a significant barrier? JSON adherence with a well-thought-out schema hasn't been a worry for a while, between improved model performance and various grammar-based constraint systems.
> Another is preventing LLMs from adding intro or conclusion text.
Also trivial to work around by pre-filling and stop tokens, or just extremely basic text parsing.
Also would recommend writing out Stream-Triggered Augmented Generation since the term is so barely used it might as well be made up from the POV of someone trying to understand the comment
Asking even a top-notch LLM to output well formed JSON simply fails sometimes. And when you’re running LLMs at high volume in the background, you can’t use the best available until the last mile.
You work around it with post-processing and retries. But it’s still a bit brittle given how much stuff happens downstream without supervision.
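The usual workaround looks something like this retry loop (a sketch; `call_llm` is a placeholder), which is exactly the brittle part:

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for your chat-completion API

def json_with_retries(prompt: str, required_keys: set[str], attempts: int = 3) -> dict:
    """Post-process and retry until the reply parses and has the expected keys."""
    for _ in range(attempts):
        reply = call_llm(prompt + "\nRespond with JSON only, no prose before or after.")
        start, end = reply.find("{"), reply.rfind("}")  # drop any intro/conclusion text
        try:
            data = json.loads(reply[start:end + 1])
        except ValueError:
            continue
        if required_keys <= data.keys():
            return data
    raise RuntimeError("model never produced valid JSON")
```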
Constrained output with GBNF or JSON is much more efficient and less error-prone. I hope nobody outside of hobby projects is still using error/retry loops.
Constraining output means you don’t get to use ChatGPT or Claude though, and now you have to run your own stuff. Maybe for some folks that’s OK, but really annoying for others.
You're totally right, I'm in my own HPC bubble. The organizations I work with create their own models and it's easy for me to forget that's the exception more than the rule. I apologize for making too many assumptions in my previous comment.
Out of curiosity- do those orgs not find the loss of generality that comes from custom models to be an issue? e.g. vs using Llama or Mistral or some other open model?
I do wonder why, though. Constraining output based on logits is a fairly simple and easy-to-implement idea, so why is this not part of e.g. the OpenAI API yet? They don't even have to expose it at the lowest level, just use it to force valid JSON in the output on their end.
It’s significantly easier to output an integer than a JSON with a key value structure where the value is an integer and everything else is exactly as desired
That's because you've dumbed down the problem. If it was just about outputting one integer, there would be nothing to discuss. Now add a bunch more fields, add some nesting and other constraints into it...
The more complexity you add, the less likely the LLM is to give you a valid response in one shot. It's still going to be easier to get the LLM to supply values for a fixed schema than to get the LLM to give both the answers and the schema.
The best available actually have the fewest knobs for JSON schema enforcement (ie. OpenAI's JSON mode, which technically can still produce incorrect JSON)
If you're using anything less you should have a grammar that enforces exactly what tokens are allowed to be output. Fine Tuning can help too in case you're worried about the effects of constraining the generation, but in my experience it's not really a thing
I only became aware of it recently and therefore haven’t done more than play with in a fairly cursory way, but unstructured.io seems to have a lot of traction and certainly in my little toy tests their open-source stuff seems pretty clearly better than the status quo.
“Use layers of regex, semantic embeddings, and scoring enrichments to identify report rows (pivots on aggregates) worth attention, running on a schedule”
This is really interesting, is there any architecture documentation/articles that you can recommend?
I'm late to this party, but here's a post I wrote about it. This is more motivation but we are working on technical posts/papers for release. Happy to field emails in the meantime if this is timely for you.
We have a company mail, fax, and phone room that receives thousands of pages a day that now sorts, categorizes, and extracts useful information from them all in a completely automated way by LLMs. Several FTEs have been reassigned elsewhere as a result.
It certainly has use cases, just not as many as the hype led people to believe.
For me:
- Regex expressions: ChatGPT is the best multi-million regex parser to date.
- Grammar and semantic check: It's a very good revision tool, helped me a lot of times, especially when writing in non-native languages.
- Artwork inspiration: Not only for visual inspiration, in the case of image generators, but descriptive as well. The verbosity of some LLMs can help describe things in more detail than a person would.
- General coding: While your mileage may vary on that one, it has helped me a lot at work building stuff in languages I'm not very familiar with. Just snippets, nothing big.
I have a friend who uses ChatGPT for writing quick policy statements for her clients (mostly schools). I have a friend who uses it to create images and descriptions for DnD adventures. LLMs have uses.
The problem I see is: how can an "application" be anything but a little window onto the base abilities of ChatGPT, and so effectively offer nothing more to an end-user? The final result still has to be checked, and regular end-users still have to do their own prompting.
Edit: I should also say that anyone who's designing LLM apps that, rather than being end-user tools, are effectively gatekeepers to getting action or "a human" from a company deserves a big "f* you", 'cause that approach is evil.
I think it comes down to relatively unexciting use cases that have a high business impact (process automation, RPA, data analysis), not fancy chatbots or generative art.
For example, we focused on the boring and hard task of web data extraction.
Traditional web scraping is labor-intensive, error-prone, and requires constant updates to handle website changes. It's repetitive and tedious, but couldn't be automated due to the high data diversity and many edge cases. This required a combination of rule-based tools, developers, and constant maintenance.
We're now using LLMs to generate web scrapers and data transformation steps on the fly that adapt to website changes, automating the full process end-to-end.
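Conceptually it's something like the sketch below: have the model produce the extraction step (here, CSS selectors) rather than the data itself, then run that step with ordinary code and regenerate it when the site changes. `call_llm` is a placeholder, and this assumes BeautifulSoup is available:

```python
import json
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for your chat-completion API

def extract(html: str, fields: list[str]) -> dict:
    """Have the LLM propose CSS selectors, then do the actual extraction deterministically."""
    selectors = json.loads(call_llm(
        "Return a JSON object mapping each field name to a CSS selector for this page.\n"
        f"Fields: {fields}\nHTML:\n{html[:20000]}"
    ))
    soup = BeautifulSoup(html, "html.parser")
    return {
        field: (node.get_text(strip=True) if (node := soup.select_one(selector)) else None)
        for field, selector in selectors.items()
    }
```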
If you’re interested in using one of the LLM applications I have in prod, check out https://hex.tech/product/magic-ai/ - it has a free limit every month to give it a try and see how you like it. If you have feedback after using it, we’re always very interested to hear from users.
Part 1: https://www.oreilly.com/radar/what-we-learned-from-a-year-of... Part 2: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...
We were working on this webpage to collect the entire three part article in one place (the third part isn't published yet). We didn't expect anyone to notice the site! Either way, part 3 should be out in a week or so.