balazstorok's comments

balazstorok · 2025-05-21T11:42:34 1747827754

At least opening PRs is a safe option, you can just dump the whole thing if it doesn't turn out to be useful.

Also, trying something new out will most likely have hiccups. Ultimately it may fail. But that doesn't mean it's not worth the effort.

The thing may rapidly evolve if it's being hard-tested on actual code and actual issues. For example it will be probably changed so that it will iterate until tests are actually running (and maybe some static checking can help it, like not deleting tests).

Waiting to see what happens. I expect it will find its niche in development and become actually useful, taking off menial tasks from developers.

Frost1x · 2025-05-21T12:28:58 1747830538

It might be a safer option in a forked version of the project that the public can’t see. I have to wonder about the optics here from a sales perspective. You’d think they’d test this out more internally before putting it in public access.

Now when your small or medium size business management reads about CoPilot in some Executive Quarterly magazine and floats that brilliant idea internally, someone can quite literally point to these as examples of real world examples and let people analyze and pass it up the management chain. Maybe that wasn’t thought through all the way.

Usually businesses tend to hide this sort of performance of their applications to the best of their abilities, only showcasing nearly flawless functionality.

xnickb · 2025-05-21T12:31:03 1747830663

> I expect it will find its niche in development and become actually useful, taking off menial tasks from developers.

Reading AI generated code is arguably far more annoying than any menial task. Especially if the said code happens to have subtle errors.

Speaking from experience.

balazstorok · 2025-05-22T06:11:19 1747894279

This is probably version 0.1 or 0.2.

Reviewing what the AI does now is not to be compared with human PRs. You are not doing the work as it is expected in the (hopefully near?) future but you are training the AI and the developers of the AI and more crucially: you are digging out failure modes to fix.

xnickb · 2025-05-23T12:04:40 1748001880

While I admire your optimism regarding those errors getting fixed, I myself am sceptical about the idea of that happening in my lifetime (I'm in my mid 30s).

It would definitely be nice to be wrong though. That'd make life so much easier.

ecb_penguin · 2025-05-21T13:44:18 1747835058

This is true for all code and has nothing to do with AI. Reading code has always been harder than writing code.

The joke is that PERL was a write-once, read-none language.

> Speaking from experience.

My experience is all code can have subtle errors, and I wouldn't treat any PR differently.

xnickb · 2025-05-21T20:42:26 1747860146

I agree, but when working with code written by your teammate you have a rough idea what kind of errors to expect.

AI however is far more creative than any given single person.

That's my gut feeling anyway. I don't have numbers or any other rigorous data. I only know that Linus Torvalds made a very good point about chain of trust. And I don't see myself ever trysting AI the same way I can trust a human.

balazstorok · 2025-05-22T06:13:22 1747894402

It depends what we set as the bar for the AI. Like now, the bar wasn't even "have all tests pass without modifying the actual tests". That is probably lower than for any PR you would need to look at.

cesarb · 2025-05-21T12:24:46 1747830286

> At least opening PRs is a safe option, you can just dump the whole thing if it doesn't turn out to be useful.

There's however a border zone which is "worse than failure": when it looks good enough that the PRs can be accepted, but contain subtle issues which will bite you later.

UncleMeat · 2025-05-21T12:28:20 1747830500

Yep. I've been on teams that have good code review culture and carefully review things so they'd be able to catch subtle issues. But I've also been on teams where reviews are basically "tests pass, approved" with no other examination. Those teams are 100% going to let garbage changes in.

camdenreslink · 2025-05-21T14:10:10 1747836610

Even when you review human-written code carefully, subtle bugs can sneak through. Software development is hard.

UncleMeat · 2025-05-21T16:18:21 1747844301

Of course. AI Agents throwing code at you merely makes it more likely.

ecb_penguin · 2025-05-21T13:42:57 1747834977

Funny enough, this happens literally every day with millions of developers. There will be thousands upon thousands of incidents in the next hour because a PR looked good, but contained a subtle issue.

6uhrmittag · 2025-05-21T13:33:32 1747834412

> At least opening PRs is a safe option, you can just dump the whole thing if it doesn't turn out to be useful.

However, every PR adds load and complexity to community projects.

As another commenter suggested, doing these kind of experiments on separate forks sound a bit less intrusive. Could be a take away from this experiment and set a good example.

There are many cool projects on GitHub that are just accumulating PRs for years, until the maintainer ultimately gives up and someone forks it and cherry-picks the working PRs. I've than that myself.

I'm super worried that we'll end up with more and more of these projects and abandoned forks :/

cyanydeez · 2025-05-21T12:33:53 1747830833

Unfortunately,if you believe LLMs really can learn to code with bugs, then the nezt step would be to curate a sufficiently bug free data set. Theres no evidence this has occured, rather, they just scraped whayecer

balazstorok · 2025-05-14T15:30:00 1747236600

I don't think there is exists a magical political system that we set up and it magically protects us from corruption. Forever. Just like any system (like surviving in an otherwise hostile nature) it needs maintenance. Maintenance in a political or any social structure is getting off your bottom and imposing some "reward" signal on the system.

Corruption mainly exists because people have low standards for enforcing eradication of it. This is observable in the smallest levels. In countries where corruption is deeply engraved, even university student groups will be corrupted. Elected officials of societies of any size will be prone to put their personal interests in front of the groups' and will appoint or employ friends instead of randomers based on some quality metrics. The question is what are the other people willing to do? Is anyone willing to call them out? Is anyone willing to instead put on the job themselves and do it right (which can be demanding)?

The real question is how far are the individuals willing to go and how much discomfort are they willing to embrace to impose their requirements, needs, moral expectations on the political leader? The outcomes of many situations you face in society (should that be a salary negotiation or someone trying to rip you off in a shop) depend on how much sacrifice (e.g. discomfort) you are willing to take on to get out as a "winner" (or at least non-loser) of the situation? Are you willing to quit your job if you cannot get what you want? Are you going to argue with the person trying to rip you off? Are you willing to go to a lawyer and sue them and take a long legal battle? If people keep choosing the easier way, there will always be people taking advantage of that. Sure, we have laws but laws also need maintenance and anyone wielding power needs active check! It doesn't just magically happen but the force that can keep it in check is every individual in the system. Technological advances and societal changes always lead to new ideas how to rip others off. What we would need is to truly punish the people trying to take advantage of such situations: no longer do business with them, ask others to boycott such behaviour (and don't vote for dickheads!, etc.) -- even in the smallest friends group such an issue could arise.

The question is: how much are people willing to sacrifice on a daily basis to put pressure on corrupt people? There is no magic here, just the same bare evolutionary forces in place for the past 100,000 years of humankind.

(Just think about it: even in rule of law, the ultimate way of enforcing someone to obey the rules is by pure physical force. If someone doesn't listen, ever, he will be picked up by other people and forced into a physical box and won't be allowed to leave. And I don't expect that to ever change, regardless of the political system. Similarly, we need to keep up an army at all times. If you simply go hard pacifist, someone will take advantage of that... Evolution. )

Democracy is an active game to be played and not just every 4 years. In society, people's everyday choices and standards are the "natural forces of evolution".

balazstorok · 2025-05-11T05:22:11 1746940931

I have to agree. There is a clear pattern indicating what she thinks is "the best way to live". Be open and be happy. Be otherwise at your own demise. It also sounds a lot like she is trying to convince herself she is striving for the right way of living. First it seemed she has a point, later in the post I felt she lacks intellectual humility, a "healthy" level (oh the irony) of doubt.

balazstorok · 2025-04-29T09:23:59 1745918639

I think James Burke's classic talking about the fragility of our complex interdependent systems starting the episode from the 1965 Northeast blackout is still relevant and an interesting watch: https://www.youtube.com/watch?v=XetplHcM7aQ

balazstorok · 2025-04-17T09:35:49 1744882549

Does someone have a good understanding how 2B models can be useful in production? What tasks are you using them for? I wonder what tasks you can fine-tune them on to produce 95-99% results (if anything).

nialse · 2025-04-17T10:22:31 1744885351

The use case for small models include sentiment and intent analysis, spam and abuse detection, and classifications of various sorts. Generally LLM are thought of as chat models but the output need not be a conversation per se.

mhitza · 2025-04-17T11:52:04 1744890724

My impression was that text embeddings are better suited for classification. Of course the big caveat is that the embeddings must have "internalized" the semantic concept you're trying to map.

From some article I have in my draft, experimenting with open source text embeddings:

    ./match venture capital
    purchase           0.74005488647684
    sale               0.80926752301733
    place              0.81188663814236
    positive sentiment 0.90793311875207
    negative sentiment 0.91083707598925
    time               0.9108697315425
 
    ./store sillicon valley
    ./match venture capital
    sillicon valley    0.7245139487301
    purchase           0.74005488647684
    sale               0.80926752301733
    place              0.81188663814236
    positive sentiment 0.90793311875207
    negative sentiment 0.91083707598925
    time               0.9108697315425

Of course you need to figure out what these black boxes understand. For example for sentiment analysis, instead of having it match against "positive" "negative" you would have the matching terms be "kawai" and "student debt". Depending how the text embedding internalized negatives and positives based on their training data.

snovv_crash · 2025-04-17T11:06:45 1744888005

Anything you'd normally train a smaller custom model for, but with an LLM you can use a prompt instead of training.

logicchains · 2025-04-17T10:11:49 1744884709

2B models by themselves aren't so useful, but it's very interesting as a proof of concept, because the same technique used to train a 200B model could produce one that's much more efficient (cheaper and more environmentally friendly) than existing 200B models, especially with specialised hardware support.

meltyness · 2025-04-17T11:14:32 1744888472

I'm more interested in how users are taking 95-99% to 99.99% for generation-assisted tasks. I haven't seen a review or study of techniques, even though on the ground it's pretty trivial to think of some candidates.

oezi · 2025-04-17T13:32:58 1744896778

Three strategies seem to be:

- Use LLM to evaluate result and retry if it doesn't match.

- let users trigger a retry

- let users edit

future10se · 2025-04-17T10:31:46 1744885906

The on-device models used for Apple Intelligence (writing tools, notification and email/message summaries, etc.) are around ~3B parameters.

I mean, they could be better (to put it nicely), but there is a legitimate use-case for them and I'd love to see more work in this space.

https://machinelearning.apple.com/research/introducing-apple...

https://arxiv.org/abs/2407.21075

Lapel2742 · 2025-04-17T09:51:24 1744883484

I'm just playing / experimenting around with local LLM's. Just to see what I can do with them. One thing that comes to mind is gaming: E.g. text/dialog generation in procedural worlds / adventures.

throwaway314155 · 2025-04-17T09:39:29 1744882769

Summarization on mobile/embedded might be a good usecase?

balazstorok · 2024-10-09T11:03:48 1728471828

Nobel peace prize has countless times been awarded to a group of people or institution. It is differently controlled but the idea is not unprecedented.

ascorbic · 2024-10-09T12:30:43 1728477043

The rules are different

balazstorok · on Feb 1, 2023

Markr | Frontend Engineer, Full-stack Engineer, NLP Scientist | Full time | Remote (EU) | https://testmarkr.com

At markr we are cutting down cost of targeted feedback to students to solve the two-sigma problem: 1-on-1 tutoring and mastery learning improves student outcomes by 2 standard deviations. Help us bring down the costs with the help of NLP. We are not aiming to replace the teachers but instead build gigantic exoskeletons for them to achieve 10x efficiency in bringing out the best from their students.

Join us if you are looking to make the world a better place through edtech.

We're looking for mid/senior in

- Frontend Engineer - Full-stack Engineer - NLP Scientist

Our tech stack includes React + Typescript + Python.

We are looking for more experienced members but we are part of a larger group of companies working in edtech, so make sure to reach out at any level.

https://testmarkr.com/careers for more details

balazstorok · on May 15, 2020

really nice, my 2 year-old is going to love it