> We find testing and evals to be the hardest problem here. This is not entirely surprising, but the agentic nature makes it even harder. Unlike prompts, you cannot just do the evals in some external system because there’s too much you need to feed into it. This means you want to do evals based on observability data or instrumenting your actual test runs. So far none of the solutions we have tried have convinced us that they found the right approach here.
I'm curious about the solutions the op has tried so far here.
"Because there’s too much you need to feed into it" - what does the author mean by this? If it is the amount of data, then I would say sampling needs to be implemented. If that's the extent of the information required from the agent builder, I agree that an LLM-as-a-judge e2e eval setup is necessary.
In general, a more generic eval setup is needed, with minimal requirements from AI engineers, if we want to move on from vibes-based reliability engineering practices as a sector.
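To make the sampling point concrete, here is a minimal sketch of what I mean, assuming the agent's traces can be iterated as a stream of records. The record shape (`trace_id`, `steps`) and the sample size are made up for illustration; only the sampled subset would go to the judge.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Pick up to k items uniformly at random from a stream (Algorithm R).

    The stream is read once and never held in memory as a whole, which
    matters when the trace volume is too large to evaluate in full.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical trace records; a real setup would read these from
# whatever observability store is in use.
traces = ({"trace_id": f"t{i}", "steps": i % 7 + 1} for i in range(10_000))
to_judge = reservoir_sample(traces, k=50)
```

Uniform sampling is only the simplest choice. If failures are rare, stratifying by something like step count or error status would probably surface more interesting traces.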
Likewise. I have a nasty feeling that most AI agent deployments happen with nothing more than some cursory manual testing. Going with the ‘vibes’ (to borrow an overused term in the industry).
I can confirm this after hundreds of talks about the topic over the last 2 years. 90% of cases are simply not high-volume or high-stakes enough for the devs to care. I'm a founder of an evaluation automation startup, and our challenge is spotting teams right as their usage starts to grow and quality issues are about to escalate. Since that’s tough, we're trying to make getting to first evals so simple that teams can start building the mental models before things get out of hand.
A lot of "generative" work is like this. While you can come up with benchmarks galore, at the end of the day how a model "feels" only seems to come out from actual usage. Just read /r/localllama for opinions on which models are "benchmaxed" as they put it. It seems to be common knowledge in the local LLM community that many models perform well on benchmarks but that doesn't always reflect how good they actually are.
In my case I was until recently working on TTS and this was a huge barrier for us. We used all the common signal quality and MOS-simulation models that judged so-called "naturalness" and "expressiveness" etc. But we found that none of these really helped us much in deciding when one model was better than another, or when a model was "good enough" for release. Our internal evaluations correlated poorly with them, and we even disagreed quite a bit within the team on the quality of output. This made hyperparameter tuning as well as commercial planning extremely difficult and we suffered greatly for it. (Notice my use of past tense here...)
Having good metrics is just really key and I'm now at the point where I'd go as far as to say that if good metrics don't exist it's almost not even worth working on something. (Almost.)
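For anyone wanting to check whether an automated metric tracks human judgment, one rough way to quantify "correlated poorly" is a rank correlation between the metric's scores and averaged human ratings on the same samples. This is a sketch of Spearman's coefficient in plain Python, not what we actually ran, and the numbers below are invented purely to show the call.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Invented example: automated scores vs. mean human ratings per sample.
metric_scores = [3.9, 4.1, 3.2, 4.4, 3.7]
human_ratings = [3.0, 4.5, 3.5, 3.8, 4.2]
rho = spearman(metric_scores, human_ratings)
```

A value near 1 means the metric orders samples the way listeners do; a value near 0 means it tells you little about which model people will prefer.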
What are the main shortcomings of the solutions you tried out?
We believe you need to both automatically create the evaluation policies from OTEL data (data-first) and to bring in rigorous LLM judge automation from the other end (intent-first) for the truly open-ended aspects.
It's a 2-day project at best to create your own bespoke LLM-as-judge e2e eval framework. That's what we did. Works fine. Not great. Still need someone to write the evals though.
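For a sense of scale, the core of such a harness can be quite small. This is a sketch of the general shape, not our actual code: every name here (`EvalCase`, `Verdict`, `run_evals`) is made up, and the agent and judge are stubs. In practice the judge would wrap a model call that grades the transcript against the rubric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    task: str    # what the agent is asked to do
    rubric: str  # what the judge should look for


@dataclass
class Verdict:
    passed: bool
    reason: str


Agent = Callable[[str], str]                # task -> transcript
Judge = Callable[[str, str, str], Verdict]  # (task, transcript, rubric) -> verdict


def run_evals(cases: List[EvalCase], agent: Agent, judge: Judge) -> Dict[str, Verdict]:
    """Run each case end to end and let the judge grade the transcript."""
    results: Dict[str, Verdict] = {}
    for case in cases:
        transcript = agent(case.task)
        results[case.name] = judge(case.task, transcript, case.rubric)
    return results


def pass_rate(results: Dict[str, Verdict]) -> float:
    """Fraction of cases the judge marked as passed (0.0 if there are none)."""
    if not results:
        return 0.0
    return sum(v.passed for v in results.values()) / len(results)


# Stubs so the sketch runs without any model access.
def fake_agent(task: str) -> str:
    return f"Completed: {task}"


def keyword_judge(task: str, transcript: str, rubric: str) -> Verdict:
    # Stand-in for an LLM call: passes if the rubric keyword appears.
    ok = rubric.lower() in transcript.lower()
    return Verdict(ok, "keyword found" if ok else "keyword missing")


cases = [
    EvalCase("refund", "issue a refund for the order", "refund"),
    EvalCase("apology", "cancel the order", "apology"),
]
results = run_evals(cases, fake_agent, keyword_judge)
```

The plumbing is the easy part. The cases and rubrics are where the real work goes, and someone still has to write them.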
> And I’m saying this as a Swede. Buy German cars, specifically within the Volkswagen auto group (Audi, VW, Skoda etc) if you want reliable quality.
I own a 2020 BMW with an electronic gearbox, which broke at around 80k km just a couple of months after the warranty expired (yeah I know!). It was a bit of a headache going back and forth with BMW to request a free repair. Fortunately, the headquarters agreed to cover the cost, and they installed a refurbished electronic gearbox. I was quite relieved that I didn’t have to pay about €10K out of pocket!
All that to say that I wouldn’t call BMW particularly reliable in terms of quality these days, but their customer support was decent, at least in my case.
Ultimately those are tools and I think the goal is to educate students to use them properly. Also because I don't expect the knowledge paradox to disappear anytime soon with these models.
> Q: What makes a good custom interface for reviewing LLM outputs?
Great interfaces make human review fast, clear, and motivating. We recommend building your own annotation tool customized to your domain ...
Ah! This is horrible advice. Why would you recommend reinventing the wheel when there is already great open source software available? Just use https://github.com/HumanSignal/label-studio/ or any other open source annotation software you want to get started. These tools already cover pretty much all the possible use cases, and if they don't, you can just build on top of them instead of building from zero.
I think the truth is somewhere in between. I find Label Studio to be lacking a lot of niceties and generally built very much for the average text labeling or image labeling use case. For anything else (like a multi-step agent workflow or some sort of multi-modal, task-specific problem) it is not quite right, and you do end up trying to build a bit of your own custom interface.
So, imho you should try label studio but timebox and really decide for yourself quickly if it's going to work for you in a day, and if not go vibecode a different view and try it out or build labeling into a copy of a front end you're already using for your task if that's quick.
What I think we really need here is a "lovable meets labelstudio" that starts with simple defaults and lets anyone use natural language, sketches, screenshots, to create custom interfaces and modify them quickly.
I'm ostensibly an expert in the product and I probably use that 90%+ of the time (unless I'm testing something specific) -- using a sketch as input is a cool idea though!
Disclaimer: I'm the VP Product at HumanSignal, the company behind Label Studio.
Label studio is fine if it covers your need, but in many cases the core opportunity in an eval interface is fitting in with the SME’s workflow or current tech stack.
If label studio looks like what they can use, it’s fine. If not, a day of vibecoding is worth the effort to make your partners with special knowledge comfortable.
This advice is awful: it can’t be applied across the board and it misses the point. Starting from zero is extremely easy now with LLMs; the last 10% is the hardest part. Not only that, if you don’t start from zero you aren’t able to build from whatever you think the new first principles are. SpaceX would not exist if it had tried to extend the old paradigm of rocketry.
There’s nothing wrong with starting from scratch or rebuilding an existing tool from the ground up. There’s no reason to blindly build from the status quo.
I'd have agreed with you if the principles were different. But what was shown in the content is EXACTLY what those tools are doing today. Actually, those tools are way more powerful and consider and cover way more scenarios.
> There’s nothing wrong with starting from scratch or rebuilding an existing tool from the ground up. There’s no reason to blindly build from the status quo.
Generally speaking all the options are OK, but not if you want to have something up as fast as you can or if your team is piloting something. I think the time you spend vibe coding it is greater than the time it takes to set any of those tools up.
And BTW, you shouldn't vibe code something that handles proprietary data. At the very least, work with copilots.
Recently started using Cursor for adding a new feature on a small codebase for work, after a couple of years where I didn't code. It took me a couple of tries to figure out how to work with the tool effectively, but it worked great! I'm now learning how to use it with TaskMaster; it's such a different way to build and play with software. Oh, one important note: I also went with Cursor because of the pricing, which, despite being confusing in terms of fast vs. slow requests, feels less consumption-based.
Not op but having played Rebirth, while overall very good, it suffers from the classic case nowadays of adding repetitive "chores" to do around maps, to artificially increase the length of the game with little purpose.
So far (only 6 hours in, but some friends who went further confirmed), Expedition 33 seems to steer away from that, being a lot more story driven.
It also has, by far, the greatest prologue I've seen in a game.
Worst thing so far: the UI in the menu does take a second to get. In particular, the selected state is way too subtle and a bit confusing at first.
Agreed, I've only played FF7 Remake (which I'd waited patiently for years while it was being developed), and just did not enjoy playing the game. It felt like a 20-hour game stretched out to 70 hours with repetitive fetch quests. It lacked the charm and fun of the original, and it sounds like this Clair Obscur game has learnt from this experience.
No, but that is because I played Remake and thought it was really bad, genuinely one of the worst games I've ever played. As you might imagine I did not drop my cash on the follow-up to a game I hated that much.
I think a bunch of people would have been more interested in the ff7 remake had it been able to wrap up the story in a single game that's only 30-60 hours.
Played Remake and put that down after a few hours. The combat was just not appealing; it was packed with filler for anyone who actually played the original FFVII; the VA was just typical JRPG wooden quality. Would rather not invest another $70 for Rebirth.
Just finished FF VII Rebirth, which I consider exactly what an FF should be, with the exception of the last chapter's narrative, which I didn't like. That said, the next one is Clair Obscur, and I'm really looking forward to playing it!
This one’s more of a streamlined experience, with a very nice take on turn based combat - it’s FFX turns with complex mechanics and quicktime parries, basically.
Story and environment rival a FF, with a European touch that’s a nice change. But it’s about 30 hours so not as immersive. Amazing achievement for an unknown team!
I don't have faith that this is something we can fix in the short term because most of us have been educated in a very competitive environment where individuals come first. I'm not saying that the opposite is good either, but we should find a balance in between. I also feel that we are all becoming more disconnected and alone, with ourselves as the only center of gravity. Despite my premise, I still have some hope for future generations, but unfortunately I think that things will get way worse before they correct.
> most of us have been educated in a very competitive environment where individuals come first
This is definitely intertwined with rampant individualism, but I don't think it's just our education or lack thereof that's to blame. It's also the environment we're born into and therefore never really question where it leads us and why. Century of the Self [0] makes an excellent case for where/how things went wrong, and we never deviated from this path because capitalism and its consumption-first economies would never permit such a thing.
For those comparing post-WWII to now, the only real difference seems to be capitalism becoming ever more desperate to squeeze all remaining profits. Capital concentrates [1] and profits continue to trend toward zero as Marx warned they would. It's a fundamental contradiction built into capitalism that has yet to be addressed except for by those few who are already disproportionately benefiting from the arrangement at everyone else's expense.
Consider how the average baby boomer was treated by their company of employment compared to the average worker in the 21st century. Employers now make it painfully obvious that everybody is disposable, and the only thing that matters are the metrics tied to their own compensation, no matter how disconnected that is from producing results that are actually good for society. The workers are all incentivized to become back-stabbing careerist wolves fighting and hoarding secrets instead of cooperating to build actual Good Things. The best way to get a raise is to jump ship to another company. Etc.
Given all of the above, it'd be very strange if we didn't end up in the hellscape that we are currently in.
> we never deviated from this path because capitalism and its consumption-first economies would never permit such a thing.
While I haven't read Century of the Self, I will say that most of East Asia outside of China and NK is fiercely capitalistic. Ads are everywhere and obvious. There's a huge focus on consumption and status. There are generally much looser restrictions on zoning, gambling, and prostitution than in the West. And yet the cultures continue being a lot more collective and understanding of their fellow person. South Asia is less capitalistic (having transitioned from more socialistic modes of economic organization somewhat recently), but is still quite capitalistic.
I think capitalism might exacerbate this in the West but it is fundamentally a Western problem. Most of East and South Asia still operates on an extended family model where there's an expectation that when a person or a family is having a hard time they take resources from their family and when they're in a position to do well they give resources to their struggling family members. Lots of extended families have family members who are ... problematic. Many of these folks have gambling issues, can't hold down jobs, have mental health problems, etc. But families support them. They never really thrive but they usually have food, shelter, companionship, and understanding around them. I think this creates a level of empathy that's just absent from Western society.
My partner and I are Asian but we have caucasian friends. Many of our caucasian friends will cut off problematic family members immediately. Indeed a lot of caucasians I know are very quick to cut people they don't like or who don't align with their values out of their life. This culture of individual supremacy is what I think really plagues the west which used to at one time have a less individualistic nature and now finds its hyper atomization eating away at the foundations of its societies.
Yes, this is the correct understanding of the problem. The thing is, correctly understanding the problem is highly disincentivized, much less doing anything about it.
> B) Demographics are now working against us instead of for us -- turns out everyone has decided not to have kids, which means an end to population growth, consumption growth, ergo hiring growth
Even with population growth, things would not look better, especially because we are in the middle of a tragedy of the commons: we are exhausting the planet's resources faster than they can regenerate. What's the plan for more consumption when there's nothing left to produce, or when it costs so much that only a few can afford it?
I guess that we will see more requests from LLM providers for data labelers who know how to code, to answer Stack Overflow-like questions in order to keep their models up to date.