Hacker News | leroman's comments

The biggest challenge an agent will face with tasks like these is diminishing quality relative to input size; specifically, I find that inputs above, say, 10k tokens dramatically reduce the quality of the generated output.

This specific case worked well, I suspect, because LLMs have a LOT of prior knowledge of HTML and saw many implementations and parsers of HTML during training.

Thus I suspect that real-world attempts at similar projects, in any domain that isn't well represented in training, will fail miserably.


In my experience it is closer to 25k, but that's a minor point. What task do you need to do that requires more tokens than that?

No, seriously. If you break your task into bite sized chunks, do you really need more than that at a time? I rarely do.


What model are you working with where you still get good results at 25k?

To your question: I put a lot of effort into making my prompts as small as possible (to get the best-quality output). I go as far as removing imports from source files, writing interfaces and types to use in context instead of fat implementation code, and writing task-specific project/feature documentation. (I automate some of this with a library I use to generate prompts from code and other files; think templating language with extra flags.) And still, for some tasks my prompt size reaches 10k tokens, where I find the output quality not good enough.
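To make that concrete, the "generate prompts from code" step looks roughly like the sketch below. The function names and crude regex heuristics are purely illustrative (not my actual library, which uses a proper templating layer); a real tool would lean on the TypeScript compiler API instead of regexes:

```typescript
// Illustrative sketch: build a compact prompt by dropping imports and keeping only
// exported interfaces/type aliases from source files, then templating the task on top.
// Names and heuristics here are hypothetical, not an actual library's API.

function stripImports(source: string): string {
  // Drop import lines; they add tokens without helping the model reason about the task.
  return source
    .split("\n")
    .filter((line) => !/^\s*import\s/.test(line))
    .join("\n");
}

function extractTypeDeclarations(source: string): string {
  // Keep only exported interfaces and type aliases, dropping fat implementation code.
  // A crude heuristic; the TypeScript compiler API would do this properly.
  const matches = source.match(
    /export\s+(?:interface\s+\w+[^{]*\{[\s\S]*?\n\}|type\s+\w+\s*=[\s\S]*?;)/g
  );
  return (matches ?? []).join("\n\n");
}

function buildPrompt(task: string, sources: Record<string, string>): string {
  const context = Object.entries(sources)
    .map(([file, code]) => `// ${file}\n${extractTypeDeclarations(stripImports(code))}`)
    .join("\n\n");
  return `Relevant types:\n\n${context}\n\nTask: ${task}`;
}
```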


I'm working with Anthropic models, and my combined system prompt is already 22k. It's a big project, lots of skill and agent definitions. Seems to work just fine until it reaches 60k - 70k tokens.


Interesting, thanks!


Cool idea! But kind of wasteful... it just feels wrong to me to waste energy. At least you could first turn the page into Markdown with a library that preserves semantic web structure (I authored this one: https://github.com/romansky/dom-to-semantic-markdown), saving many tokens and therefore much less energy.
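Rough usage sketch of what I mean; the exact entry point and option names below are assumptions, so check the README for the actual API:

```typescript
// Sketch: shrink a page to semantic Markdown before handing it to an LLM.
// Function and option names are assumptions; verify against the library's README.
// (In Node you would also need a DOM implementation such as jsdom.)
import { convertHtmlToMarkdown } from "dom-to-semantic-markdown";

async function pageToPrompt(url: string): Promise<string> {
  const html = await (await fetch(url)).text();
  // Converting to semantic Markdown strips scripts, styles and layout noise,
  // so the model sees far fewer tokens than it would with the raw HTML.
  const markdown = convertHtmlToMarkdown(html, {
    extractMainContent: true, // assumed option: keep only the main content area
    refifyUrls: true,         // assumed option: shorten long URLs into ref[n] placeholders
  });
  return `Summarize the following page:\n\n${markdown}`;
}
```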


This is exactly the sort of thing that should be running on a local LLM.

Using a big cloud provider for this is madness.


The title was so confusing to me; the reason I opened the link was to understand how you made the SSH tunnel manager learn the Go programming language.


I don't think the title is confusing; if that were the desired meaning, it'd say "I made an SSH tunnel manager learn Go", i.e. no "to".

I don't think "I made X to do Y" ever means "I made X do Y" does it?


Not for native speakers, but I've heard non-native speakers use "I made X to do Y" in that way.


To be fair: it is a "Show HN" title (which I believe is typically used to denote a project being "shown [off]" by the OP).


It's hilarious they put Claude 3.5 Sonnet in the far right corner while it scores the highest and beats most of Grok's numbers.


Yes, and I also noted how it beats Claude 3.5 Sonnet in Chatbot Arena by a bit of a margin.

This further feeds into my concern that as AI models get more advanced, random enthusiasts at that site may no longer be able to rank them well, and that tuning for Chatbot Arena might become a thing, one that GPT-4o also exploits. GPT-4o absolutely does not rank wildly ahead of Claude 3.5 Sonnet across a wide variety of benchmarks, yet it does in Chatbot Arena... People actually using Claude 3.5 Sonnet are also quite satisfied with its performance, often ranking it more helpful than GPT-4o when solving engineering problems, though at the expense of tighter usage limits.

Chatbot Arena was great when the models were still fairly stupid, but these days, remember that everyday people are tasked with ranking premium LLMs that solve logic puzzles and trick questions and have general knowledge far beyond that of any single human. Raters can go after traditional weaknesses like math, but then all of the models suffer. So it's not an easy task at all, and I'm not sure the site is very reliable anymore other than for smaller models.


There was a mini-uproar when GPT-4o-mini (an obviously "dumber" model) outscored claude-3.5-sonnet on Chatbot Arena, so much so that LMSYS released a subset of the battles: https://huggingface.co/spaces/lmsys/gpt-4o-mini_battles

You can review for yourself and decide if it was justified (you can compare based on W/L/T responses and matchups). Generally, Claude still has more refusals (easy wins for the model that actually answers the request), often has worse formatting (arguable whether this is better, but people like it more), and is less verbose (personally, I'd prefer the right answer in fewer words, but Chatbot Arena users generally disagree).

If you look at the questions (and Chat Arena and Wildchat analyses), most people aren't using LLMs for math, reasoning, or even coding - if anything the arena usage is probably overly skewed to reasoning/trick questions due to the subset of people poking at the models.

Of course, different people value different things. I've almost exclusively been using 3.5 Sonnet since it came out because it's been the best code assistant and Artifacts are great, only falling back to GPT-4o for occasional Code Interpreter work (for tricky problems, Mistral's Codestral actually seems to be a good fallback, often being able to debug issues that neither of those models can, despite being a tiny model in comparison).


Are there yet standardized ways of objectively testing LLMs? The Chatbot Arena thing has always felt weird to me; basically ranking them based on vibes.


Short answer is no, because there is no 'standardized' use case.

One thing is sure: the current commonly used benchmarks are mostly polluted and worthless, so you have to go to niche ones.

For example the one I check for coding is Aider LLM leaderboard [1].

We maintain Kagi LLM Benchmarking Project [2] optimized for the use case of using LLMs in search.

[1] https://aider.chat/docs/leaderboards/

[2] https://help.kagi.com/kagi/ai/llm-benchmark.html


Not really. There's a hundred benchmarks, but all of them suffer from the same issues. They're rated by other LLMs, and the tasks are often too simple and similar to each other. The hope is that just gathering enough of these benchmarks means you get a representative test suite, but in my view we're still pretty far off.


Use this: https://livebench.ai (it's a better benchmark).


Your concerns are valid.

Two more things concerning Chatbot Arena:

- The prompts people use there have an incredible sample bias towards certain tasks and styles, and as such are unrepresentative of "overall performance", which is what people expect from a leaderboard.

- It is incredibly easy to game by a company, their employees or their fanboys if they would like to. No idea if anyone has done so, but it's trivial.

Just to give one example of the bias; advances in non-English performance don't even register on the leaderboard because almost everyone rating completions there is doing so in English. You could have a model that's a 100 in English and a 0 on every other language, and it would do better on the leaderboard than a model that's a 98 in every human language in the world.


Thanks for sharing!!

It would be really helpful if you opened an issue on GitHub with a specific example; happy to look into that!




Bumped this together with the side-by-side comparison task.. so will look into it :)


This is some great feedback, thanks!

1. There are some crazy links with lots of arguments and tracking stuff in them, so the output gets very long. The refification turns them into a numbered "ref[n]" scheme, and you also get a map of ref[n] -> url to do the reverse translation (see the sketch after this list); it really saves a lot in my experience. It's also optional, so you can be mindful about when you want to use this feature.

2. I tried to keep it domain specific (not to reinvent HTML...), so it's mostly Markdown components with some flexibility to add HTML elements (img, footer, etc.).

3. Not sure I'm sold on replacing the switch; it's very useful there because of the many fall-through cases. I find it maintainable, but if you point me to a specific issue there it would help.

4. There are some built-in functions to traverse and modify the AST. It's just JSON at the end of the day, so you can leverage the types and write your own logic to walk it; as long as it conforms to the format you can always serialize it, as you mentioned.

5. The AST is recursive, so not flat. It sounds like you want to either write your own AST -> Semantic-Markdown implementation or plug into the existing one, so I'll keep this in mind for the future.

6. Sounds cool but out of scope at the moment :)

7. This feature would serve to help with scraping, by kind of pointing the LLM at some element? The part I'm missing is how you would code this in advance. There could be some metadata tag you add that gets carried through the pipeline and attached to the generated elements on the other side in some way...
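To illustrate point 1, the refification is conceptually something like the following. This is a simplified sketch of the scheme, not the library's actual implementation:

```typescript
// Simplified sketch of URL "refification": replace long URLs in Markdown links with
// short ref[n] placeholders and keep a map for reverse translation afterwards.
// Illustrates the scheme described above, not the library's actual code.

interface RefifyResult {
  markdown: string;             // Markdown with ref[n] placeholders instead of URLs
  refs: Record<string, string>; // e.g. { "ref[1]": "https://example.com/page?utm_source=..." }
}

function refifyUrls(markdown: string): RefifyResult {
  const refs: Record<string, string> = {};
  const seen = new Map<string, string>(); // url -> ref id, so duplicate URLs share one ref
  let counter = 0;

  const refified = markdown.replace(/\]\(([^)\s]+)\)/g, (_match, url: string) => {
    let id = seen.get(url);
    if (!id) {
      id = `ref[${++counter}]`;
      seen.set(url, id);
      refs[id] = url;
    }
    return `](${id})`;
  });

  return { markdown: refified, refs };
}

// Reverse translation: expand ref[n] placeholders back into the original URLs.
function derefifyUrls(markdown: string, refs: Record<string, string>): string {
  return markdown.replace(/\]\((ref\[\d+\])\)/g, (match, id: string) =>
    refs[id] ? `](${refs[id]})` : match
  );
}
```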


Ah, I suppose you mean a web page one could visit to see a demo :) Added to the backlog!


This totally makes sense; I will look into adding support for additional ways to detect the main content. Super interesting!

