I think that it's basically fair and I often write simple agents using exactly the technique that you describe. I typically provide a TypeScript interface for the available tools and just ask the model to respond with a JSON block and it works fine.
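For concreteness, a minimal sketch of the pattern I mean (tool names and argument shapes here are purely illustrative, not any particular API):

```typescript
// Hypothetical tool surface; names and argument shapes are purely illustrative.
interface SearchDocsTool {
  name: "search_docs";
  arguments: { query: string; maxResults?: number };
}

interface ReadFileTool {
  name: "read_file";
  arguments: { path: string };
}

type ToolCall = SearchDocsTool | ReadFileTool;

// The same interface text, pasted verbatim into the prompt.
const toolInterfaceSource = `
interface SearchDocsTool { name: "search_docs"; arguments: { query: string; maxResults?: number } }
interface ReadFileTool   { name: "read_file";   arguments: { path: string } }
type ToolCall = SearchDocsTool | ReadFileTool;
`;

const systemPrompt =
  "You can call these tools, described as TypeScript types:\n" +
  toolInterfaceSource +
  "\nTo call a tool, respond with ONLY a JSON object matching ToolCall, " +
  'e.g. {"name": "read_file", "arguments": {"path": "README.md"}}.';

// Parsing the reply is just JSON.parse plus a narrow check.
function parseToolCall(reply: string): ToolCall | null {
  try {
    const candidate = JSON.parse(reply);
    if (candidate?.name === "search_docs" || candidate?.name === "read_file") {
      return candidate as ToolCall;
    }
  } catch {
    // The model answered in prose instead of emitting a tool call.
  }
  return null;
}
```

The nice property is that the "schema" you show the model and the types you validate against are literally the same text.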
That said, it is worth understanding that the current generation of models is extensively RL-trained on how to make tool calls... so they may in fact be better at issuing tool calls in the specific format that their training has focused on (using specific internal tokens to demarcate and indicate when a tool call begins/ends, etc). Intuitively, there's probably a lot of transfer learning between this format and any ad-hoc format that you might request inline in your prompt.
There may be recent literature quantifying the performance gap here. And certainly if you're doing anything performance-sensitive you will want to characterize this for your use case, with benchmarks. But conceptually, I think your model is spot on.
But to be clear, mdoc already accounts for this through its selective disclosure protocol, without the need for a zero knowledge proof technology. When you share an mdoc you are really just sharing a signed pile of hashes ("mobile security object") and then you can choose which salted pre-images to share along with the pile of hashes. So for example your name and your birth date are two separate data elements and sharing your MSO will share the hashes for both, but you might only choose to share the pre-image representing your birthday, or even a simple boolean claim that you are over 21 years old.
What you don't get with this scheme (and which zero knowledge proofs can provide) is protection against correlation: if you sign into the same site twice or sign into different sites, can the site owners recognize that it is the same user? With the design of the core mdoc selective disclosure protocol, the answer is yes.
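To make the salted-hash idea concrete, here's a rough sketch of the shape of it (this is not the real ISO 18013-5 / CBOR encoding; the element names, digest construction, and signing step are simplified for illustration):

```typescript
import { createHash, randomBytes } from "node:crypto";

// Each data element is disclosed as (salt, identifier, value); the issuer signs only the digests.
interface DisclosableElement {
  identifier: string; // e.g. "family_name", "birth_date", "age_over_21"
  value: string;
  salt: string;       // random per-element salt prevents guessing pre-images
}

function digest(el: DisclosableElement): string {
  return createHash("sha256")
    .update(`${el.salt}:${el.identifier}:${el.value}`)
    .digest("hex");
}

// Issuer side: build the "pile of hashes" that gets signed (roughly, the MSO).
const elements: DisclosableElement[] = [
  { identifier: "family_name", value: "Doe", salt: randomBytes(16).toString("hex") },
  { identifier: "birth_date", value: "1980-01-01", salt: randomBytes(16).toString("hex") },
  { identifier: "age_over_21", value: "true", salt: randomBytes(16).toString("hex") },
];
const signedDigests = elements.map(digest); // the issuer signs this list

// Holder side: share the signed digests plus only the pre-images you choose.
const disclosed = elements.filter((e) => e.identifier === "age_over_21");

// Verifier side: recompute digests for disclosed elements and check membership.
const ok = disclosed.every((e) => signedDigests.includes(digest(e)));
console.log(ok); // true; the verifier learns nothing about the undisclosed values
```

The signature covers only the digests, so withholding a pre-image reveals nothing about that element's value; but every presentation reuses the same signed digest set, which is where the correlation issue above comes from.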
Last week I tried Google's Jules coding agent and saw that it requests broad GitHub OAuth permissions -- essentially "full access to everything your account can do." When you authorize it, you're granting access to all your repositories.
This is partly driven by developer convenience on the agent side, but it's also driven by the GitHub OAuth flow. It should be easier to create a downscoped approval during authorization that still allows the app to request additional access later. It should be easy to let an agent submit an authorization request scoped to a specific repository, etc.
Instead, I had to create a companion GitHub account (https://github.com/jmandel-via-jules) with explicit access only to the repositories and permissions I want Jules to touch. It's pretty inconvenient but I don't see another way to safely use these agents without potentially exposing everything.
GitHub does endorse creating "machine users" as dedicated accounts for applications, which validates this approach, but it shouldn't be necessary for basic repository scoping.
Please let me know if there is an easier way that I'm just missing.
My personal challenge last year was to solve everything on my mobile phone, using LLMs (mostly ChatGPT-4 with Code Interpreter; I didn't paste in the problems, but rather described the code I wanted).
This year I'm declaring "Advent of Claude"!
Challenge: Write a Claude custom style to solve Advent of Code puzzles within Claude's UI.
Score: # adventofcode.com stars earned in 2 daily conversation turns.
Fine print: web app artifacts are allowed, including pasting your custom input into the artifact UI; one click only.
While I personally wouldn't find it a ton of fun to solve the puzzles that way, that's pretty cool. Nice work.
Is there a place where you're blogging this or at least aggregating the links so we can see how far you get with it as the puzzles get more challenging?
I find "higher level" format issues to be of greater concern. These are issues like: is the recipe structured in a way that makes the prep/process flow clear, makes it obvious when a certain ingredient needs to be prepped but divided into multiple parts for use in different stages, or when different stages lead to products that are combined and subsequent poisons in the workflow?
Was just about to say the same thing. I don't care as much about the structure as about the presentation. Something I have also disliked is how recipes are standardized around ingredients and quantities at the top and steps below. I often have a recipe open on my phone and I find it's a non-optimal instruction set.
I've been doing something exactly like you describe, splitting it into functionally two parts: the shopping list, and (for me) the optimal steps of prep/cooking, which include the quantity for each item.
Presentation and structuring is really, really important. The best I've found so far is a multicolumn format: https://i.imgur.com/w0UrJt5.png
Column 1 is the quantity. This doesn't really belong in the first column but it matches traditional ways of writing things and doesn't cause any actual trouble to do it that way, so whatever, we can do it that way. Column 2 is the ingredient. And column 3 is the cooking instructions. The rows are then grouped (shaded) by which ingredients go into which cooking instructions.
You can scan down columns 1 and 2 to get a prep / mise en place list, or just column 2 to get a shopping list (possibly involving deduplication if an ingredient is called for more than once), then execution is just running down column 3. The only real problem with execution is when it gets nonlinear (you want to overlap steps 3 and 4 in that recipe, for example) but that's a problem with any format I know of.
It's not perfect, but it works really, really well, and better than any other format I've ever seen.
...also now I want chili since it's cold and wet here in Seattle. And I should probably revise that recipe to reflect what I really do, but it's just chili, it's pretty tolerant of whatever you have lying around....
Take a look at my website, https://letscooktime.com/ and let me know what you think of the way I render recipes. There is an internal representation like this:
1. A recipe is composed of multiple components
2. A component is composed of ingredient requirements and steps
3. An ingredient requirement is a tuple of an ingredient and a quantity
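In TypeScript terms, that model is roughly the following (field names are approximate, not the actual schema):

```typescript
interface Ingredient {
  name: string;                 // e.g. "bread flour"
}

interface IngredientRequirement {
  ingredient: Ingredient;
  quantity: string;             // e.g. "250 g" or "1 tbsp"
}

interface Step {
  instruction: string;          // free text; may reference ingredients by name
}

interface Component {
  name: string;                 // e.g. "Poolish", "Final dough"
  requirements: IngredientRequirement[];
  steps: Step[];
}

interface Recipe {
  title: string;
  components: Component[];
}
```

The repeated-ingredient case falls out naturally: flour can appear as a separate requirement in both a poolish component and a final-dough component.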
I found good success using this model for recipes, especially complex baking recipes like breads with multiple repeat ingredients.
For myself, the one thing missing is the quantity of items inside the recipe steps themselves. I am often using my phone to read a recipe while cooking, and it's annoying to have to scroll back and forth to see how much of something I need to add. I have all the ingredients out and ready; just tell me in the steps how much of a seasoning I need to be adding.
Yes! Phone screens are too small to show both the instructions and the ingredients at the same time. To address that in CookTime, I highlight any instance of an ingredient found in the recipe instructions and show the quantity if you click/tap on it. I could change it to just show it unconditionally if that wasn't apparent when you looked at a recipe (like this one for example https://letscooktime.com/Recipes/Details?id=ed962bb3-64b7-42...)
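Under the hood it's basically a scan of each step for known ingredient names; here's a simplified sketch of the "show it unconditionally" variant, reusing the types above (not the real implementation):

```typescript
// Given a component's requirements, annotate each mention of an ingredient
// in a step's instruction text with its quantity.
// (Assumes ingredient names contain no regex metacharacters.)
function annotateStep(
  instruction: string,
  requirements: IngredientRequirement[]
): string {
  let annotated = instruction;
  for (const req of requirements) {
    // Case-insensitive whole-word match on the ingredient name.
    const pattern = new RegExp(`\\b${req.ingredient.name}\\b`, "gi");
    annotated = annotated.replace(pattern, (match) => `${match} (${req.quantity})`);
  }
  return annotated;
}

// annotateStep("Fold the flour into the butter", requirements)
//   -> "Fold the flour (250 g) into the butter (100 g)"
```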
I gave your example the instruction "Reformat in two-column format, with the ingredients listed on the left side as they are used" (i.e. the style used in Julia Child's _Mastering the Art of French Cooking_) and ChatGPT failed horribly.
>LLMD-8B achieves state of the art responses on PubMedQA over all models
Hang on -- while this is a cool result, beating a limited number of models that you chose to include in your comparison does not qualify LLMD-8B as SOTA. (For example, Claude 3 Sonnet scores 10 percentage points higher.)
>This result confirms the power of continued pretraining and suggests that records themselves have content useful for improving benchmark performance.
In support of this conclusion, it would be informative to include an ablation study, e.g. evaluating a continued pre-training data set of the same total size but omitting medical record content from the data mix.
Thanks for reading! We'll definitely include our Sonnet results in the next revision. It's worth pointing out that we're comparing accuracy on text responses and not log probability based scoring, which I think is the number you're referring to (based on Section E of this paper https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bb...). But if I'm mistaken and you have a direct pointer, that'd be super helpful! In general, we've been basing our comparisons against the models in the Open Medical LLM leaderboard here: https://huggingface.co/spaces/openlifescienceai/open_medical...
Also definitely a good idea on the ablation study. We had some results internally based on a production-tuned version of our model that includes a much higher weighting of records-data. It's an imperfect ablation, but it supports the story -- so I think it's there, but you're right that it would be more complete to develop and include the data directly.
I can't understand your methods without example prompts or code, so it's hard for me to interpret the data in figure 6. It will be important to document the methodology carefully to avoid concerns that your "text response" methodology is unfairly punishing other models.
In any case, since the methodology that Anthropic applied is documented and straightforward, it would be possible to do an apples to apples comparison with your model.
(I'm also very curious to know how 3.5 Sonnet performs.)
Is your text methodology based on CoT (like the "PubMedQA training dataset enriched with CoT" you trained on) or a forced single token completion like Anthropic used in their evaluation? In the latter case, I'm not sure how "text responses" differ from log probabilities at Temperature T=0 (i.e., isn't the most likely token always going to be the text response?)
A few thoughts -- with some color on 'the why', because we'd love to get your input on how best to get the story and data across. Any thoughts you have would be great.
So, on method: we did NOT force single-token responses. Our goal was to ask "if we use [model x] to serve an app for [this task], how accurate would it be?" -- so we wanted to get as close as possible to pasting the prompt directly in and just grading whether the output was correct or not. In some cases, that directly works; in others, we'd have to lightly adjust the system prompt (e.g. "Answer ONLY yes, no, or maybe"); and in some cases, it required significant effort (e.g. to parse stubbornly verbose responses).
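Roughly, the grading loop was the following (sketched in TypeScript for concreteness; the prompt, parsing, and client wiring here are illustrative, and the stubborn models needed far more than this):

```typescript
// Hypothetical client; each model was actually served by its own stack.
type ChatFn = (system: string, user: string) => Promise<string>;

const SYSTEM =
  "You are answering PubMedQA questions. Answer ONLY yes, no, or maybe.";

function parseAnswer(reply: string): "yes" | "no" | "maybe" | null {
  // Light normalization; verbose models needed far heavier parsing than this.
  const m = reply.trim().toLowerCase().match(/\b(yes|no|maybe)\b/);
  return (m?.[1] as "yes" | "no" | "maybe") ?? null;
}

async function accuracy(
  chat: ChatFn,
  examples: { question: string; context: string; label: string }[]
): Promise<number> {
  let correct = 0;
  for (const ex of examples) {
    const reply = await chat(SYSTEM, `${ex.context}\n\nQuestion: ${ex.question}`);
    if (parseAnswer(reply) === ex.label) correct++;
  }
  return correct / examples.length; // unparseable responses simply count as wrong
}
```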
For the models like GPT-4o, Llama3-70B, and Sonnet that have great instruction following behavior, this works in a straightforward way (and is something we should be able to just add in an appendix). We were surprised how hard this was for a fair number of the domain-specific models with great log-prob benchmark results on the leaderboard -- ultimately a huge gap between numbers saying 'this is a great medical AI model!' and the ability to use it in production -- and to us that was an important part of the story.
For this set of models where a ton of engineering was required to get workable responses, sharing code is the best we can do. I worry a little about rabbit-holing on details of how we could improve tuning or output parsing, because if a model requires so much bespoke effort to work on a task it was built to perform (in log-prob terms), the point still stands that you couldn't be confident using it across different types of tasks.
Stepping back, for us this method supported our experience that benchmark performance is pretty disconnected from how a model does with records. This behavior was a big piece of that puzzle that we wanted to show. I think there's some nuance, though, in how we get this across without getting tied up in the details and options for benchmark hacking.
To your question about the difference between our results and log-prob with T=0: behaviorally, I think of a model like Grok that is tuned to be funny, and perhaps it heavily downweights 'yes' or 'no' on a task like this in favor of saying something entertaining; it may have excellent log-probability benchmark performance, but it would be a much worse choice to power your app than the benchmark scores suggest. We wanted our accuracy to be more reflective of that reality-in-production.
And to your comment about using the phrase state-of-the-art: for us, we _didn't_ want to say "you can get the best model for PubMedQA by doing xyz like we did"; instead, we wanted to say "even if you fully invest in getting great benchmark performance, it doesn't do much for your ability to work with records." So for us, s-o-a is more shorthand for saying "we appropriately exhausted what one can do to tune benchmark performance, and here's a top line number that shows that, so we can stand by the relationship we see between benchmarks and performance on records."
Finally, a last note on something I was seeing yesterday when pawing through some structuring and abstraction tasks that GPT-4o got wrong but LLMD handled well. It really is amazing how many different pockets of necessary domain/contextual bias the records are teaching the model. One obvious example: GPT-4o is undertrained at interpreting whether "lab" means "lab test" or "laboratory facility." LLMD has picked up on the association that a task asking for a reference range is referring to a lab test, and that behavior is coming from pre-training and instruction fine-tuning (I suspect more the latter). In contrast, if we don't tune the prompt to be explicit, GPT-4o will start dropping street names into the lab-name outputs, etc.
To me, the implication is that you could take a whack-a-mole approach, loading the prompt with ultra-precise instructions, and it would improve performance on records. But based on what we saw in the paper, that likely _only_ works on the big models like GPT-4o and Sonnet, and not on the domain models that are so hard to coerce into giving reasonable responses. But also, there's a long tail of such things that would drown you, and so you really have no choice but to train on records data. Another tiny example we saw a few weeks ago that has a huge impact on app-level performance: the unit for the MCV test is so often wrong in records, but the answer can be assumed to be fL in most cases. We'd need to add tons of rules like that if we didn't have records to train on.
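To make that concrete, here's a toy version of the kind of whack-a-mole rule I mean (purely illustrative, not our code):

```typescript
// Toy illustration of one domain-bias rule: records very often carry a wrong
// or missing unit for MCV, but femtoliters (fL) is the safe assumption.
interface LabResult {
  testName: string;
  value: number;
  unit?: string;
}

function normalizeUnit(result: LabResult): LabResult {
  if (result.testName.toUpperCase() === "MCV" && result.unit !== "fL") {
    // Trust the domain assumption over whatever the record says.
    return { ...result, unit: "fL" };
  }
  return result;
}
```

Multiply that by hundreds of tests, abbreviations, and layout quirks, and you can see why training on records wins.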
tl;dr: you need to train on records; if you can't, and you have a very well-defined purpose/input space, use a big model like GPT-4o and load up the prompt to be very precise -- that should work well; pursuing benchmark performance doesn't get you much practically; if you need to work in an unconstrained environment, you have to train on records to pick up all those small biases that matter.
There's so much good stuff here, and I agree it's an important message for you to get across.
I think trying to convey these ideas through a quantitative benchmark result (particularly a benchmark which has a clear common interpretation that you're essentially redefining) risks 1) misleading readers, and 2) failing to convey the rich and detailed analysis you've included here in your HN comment.
I'd suggest you restrict your quantitative PubMedQA analysis to reporting previously published numbers for other models (so you're not in the position of having to defend choices that might cripple other models), or a very straightforward log-probs analysis if no outside numbers are available (making it clear which numbers you've produced vs sourced externally). Then separately explain that many of the small models with high benchmark scores exhibit poor instruction-following capabilities (which will not be a surprise for many readers, since these models aren't necessarily tuned or evaluated for that). You can then make the point that some of them are so poor at instruction following that they're very hard to deploy in contexts that require it; you could even demonstrate that they're only able to follow an instruction to "conclude answers with 'Final Answer: [ABCDE]'" on x% of questions, given a standard prompt that you've created and published. In other words, if it's clear that the problem is instruction following, analyze that.
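Concretely, that measurement could be as simple as the sketch below (assuming you've already collected per-model transcripts under your published prompt):

```typescript
// Fraction of responses that obey a published formatting instruction, e.g.
// "Conclude your answer with 'Final Answer: <letter>'".
function instructionFollowingRate(responses: string[]): number {
  const pattern = /Final Answer:\s*\(?([A-E])\)?\s*$/i;
  const followed = responses.filter((r) => pattern.test(r.trim())).length;
  return followed / responses.length;
}

// Report this rate per model alongside (separately sourced) benchmark accuracy,
// rather than folding instruction-following failures into the accuracy number.
```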
(Not all abstraction pipelines leveraging an LLM need it to exhibit instruction following, and in your own case, I'm not sure you can claim that your model follows instructions well on the basis of its PubMedQA or abstraction performance, since you've fine-tuned on (prompt, answer) pairs in both domains. You'd need a different baseline for comparison to really explore this claim.)
Then I'd suggest creating a detailed table of wrong/surprising stuff that frontier models don't understand about healthcare data, but which your model does understand. Categorize them, show examples in the table, and explain them in narrative much like you've done here.
Server-Sent Events (SSE) with standard gzip compression could be a simpler solution -- or maybe I'm missing something about the websocket + zstd approach.
Well-configured zstd can save a lot of bandwidth over gzip at this scale without major performance impact, especially with the custom dictionary. Initialising zstd with a custom dictionary also isn't very difficult for the client side.
As for application development, I think websocket APIs are generally exposed much better and are much easier to use than SSE. I agree that SSE is the more appropriate technology here, but it's used so little that I don't think the tooling is good. Just about every language has a dedicated websocket client library, but SSE is usually implemented as a weird side effect of an HTTP connection you need to keep alive manually.
The stored zstd objects make sense, as you only need to compress once rather than compressing for every stream (as the author details). It also helps store the collected data more efficiently on the server side, if that's what you want to do.
I don't have an in-depth understanding of SSE, but one of the points the post argues for is compressing once (using a zstd dictionary) and sending that same output to every client.
The dictionary allows for better compression without needing a large amount of data, and sending every client the same compressed binary data saves a lot of CPU time in compression. Streams usually require running the compression separately for each client.
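The shape of that "compress once, fan out to everyone" idea, with hypothetical zstd bindings standing in for whatever library you'd actually use (trainDictionary and compressWithDict are not a real API):

```typescript
// Hypothetical zstd bindings; stand-ins for whatever library you actually use.
declare function trainDictionary(samples: Buffer[]): Buffer;
declare function compressWithDict(data: Buffer, dict: Buffer): Buffer;
declare const sampleMessages: Buffer[];

// One-time setup: train a dictionary on representative messages and ship it to
// clients once (e.g. at connection setup), so small messages still compress well.
const dictionary = trainDictionary(sampleMessages);

// Per message: compress once, then fan the same bytes out to every subscriber,
// instead of running a per-connection compression context the way per-stream
// gzip or permessage-deflate would.
function broadcast(message: Buffer, clients: { send(data: Buffer): void }[]) {
  const compressed = compressWithDict(message, dictionary);
  for (const client of clients) client.send(compressed);
}
```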
I'm surprised this does so well in benchmarks, given the intuition I'm getting about its behavior from quick testing.
I gave it a medium-complexity design problem: design the TypeScript interface for the state of a React app that manages a tree of chat turns/responses and displays the current path through the tree. (In other words, the kind of state that sits logically behind the ChatGPT or Claude web UI, where previous conversation turns can be edited and used as a branching-off point for new turns.)
Reflection-70B suffered from a bad initial idea, just as Llama 70B generally does: it proposed duplicating state between the "tree of all messages" and the "path to the currently displayed message", which is a very common error. The automated reflection process identified a whole bunch of nitpicks but missed the glaring logical bug. Furthermore, the final output was missing many of the details included in the initial reflection / chain-of-thought scratchpad, even though the UI hides the scratchpad as though it's unimportant for the user to read.
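For reference, the kind of shape I was hoping for looks something like this (one of several reasonable designs; the point is that the displayed path is derived from the tree rather than stored alongside it):

```typescript
interface ChatTurn {
  id: string;
  parentId: string | null;         // null for the root turn
  role: "user" | "assistant";
  content: string;                 // edits/regenerations create sibling turns
}

interface ChatTreeState {
  turns: Record<string, ChatTurn>; // single source of truth for all messages
  rootId: string | null;
  currentLeafId: string | null;    // selecting a leaf selects the displayed branch
}

// The displayed path is derived by walking parent links from the current leaf;
// it is never stored alongside the tree, so it can't fall out of sync.
function currentPath(state: ChatTreeState): ChatTurn[] {
  const path: ChatTurn[] = [];
  let id = state.currentLeafId;
  while (id) {
    const turn = state.turns[id];
    path.unshift(turn);
    id = turn.parentId;
  }
  return path;
}
```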
It's well supported by scanners but can create unwieldy values for users to copy/paste.
For more recent work with dynamic content (and the assumption that a web server is involved in the flow), we're just limiting the payload size and using ordinary byte mode (https://docs.smarthealthit.org/smart-health-links/spec)
I'm very pleased this UX includes "can edit any previous conversation turn" functionality, making conversations a tree rather than a list.
For me this is one of the highest-impact and most-often-overlooked features of the ChatGPT web UI (so much so that OpenAI does not even include this feature in their native clients).