Hacker News | kenjackson's comments

You can set your prompt to do that. You can have it be extremely skeptical. You can even make it contrarian, if you wanted to be extreme. My current prompt challenges me often, and wants to find weaknesses in my argument.

LLMs will need to develop a notion of trustworthiness. Interesting that part of the process of learning isn’t just learning, but also learning what to learn and how much value to put into data that crosses your path.

To me, the problem is the blast radius.

All of us are slightly wrong about things, but not all of us are treated as oracles of correct information the way Opus, ChatGPT, etc. are.


you're confusing LLMs with humans

Not massively sure I am

The economics for the company aren't great for people who make high-frequency use of it. And I suspect that the people who would pay for such a service nowadays would make good use of it.

Regarding the 5-day guarantee -- I suspect that most discs would show up in 2-3 days, but if you're going to guarantee it, you'll need some buffer (I think US Mail says first class is 1-5 days). And I think Netflix was just counting on it mostly being shorter (and may have even had distribution centers at some point in its history).


Protected speech can be beyond politics. Politics doesn't subsume all protected speech.

So private companies shouldn’t get to determine who they provide services to? Assuming no extremely malicious intent, I’d be fine if they said it was only going to McDonalds because the founders like Big Macs.

McDonalds isn't a public benefit corporation.

I think it’s also that contrarianism generates an argument they can follow -- it’s often much more simplistic along some axis. For example, flat earthers superficially have a really simple model. Throw a ball up, of course it comes down. You look straight ahead and it looks flat. Ask them how GPS works and they can’t follow the math anyway.

While I agree with the sentiment, using AI to write the final draft of the article isn’t cheating. People may not like it, but it’s more a stylistic preference.

Using AI and a human byline is 100% cheating.

Yeah, I agree. Don't tell me you authored something when Claude did the majority of the writing. Use Claude if you want, but don't pretend you wrote the content when you didn't.

I also hate this style of plastic, pre-digested prose. It's soulless and uninteresting. Maybe I've just read too much AI slop. I associate this writing style with low-quality, uninteresting junk.


Whenever I see these papers and try them, they always work. This paper is two months old, which in LLM years is like 10 years of progress.

It would be interesting to actively track how far along each successive model gets...


I just tried it in ChatGPT "Auto" and it didn't work

> Yes — ((((()))))) is balanced.

> It has 6 opening ( and 6 closing ), and they’re properly nested.

Though it did work when using "Extensive Thinking". The model wrote a Python program to solve this.

> Almost balanced — ((((()))))) has 5 opening parentheses and 6 closing parentheses, so it has one extra ).

> A balanced version would be: ((((()))))

It would be interesting to test a couple of different models without a harness, so that no tool calls are possible.
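For what it's worth, the check itself is trivial to do in code, which is presumably what the "Extensive Thinking" run generated. A minimal Python sketch, using the strings quoted above:

```python
def is_balanced(s: str) -> bool:
    # Track nesting depth; a negative depth means a ')' with no matching '('.
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(is_balanced("((((()))))"))   # True  -- 5 opening, 5 closing
print(is_balanced("((((())))))"))  # False -- the string from the thread: one extra ')'
```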


Weird. I tried in ChatGPT Auto and it worked perfectly. I tried like 10 variations. I also did the letters-in-words tests. Got all of them right.

The one thing I did trip it up on was "Is there the sh sound in the word transportation?" It said no, and then realized I had asked for the "sound", not the letters. It subsequently got the rest of the "sounds-like" tests I did right.

Clearly, my ChatGPT is just better than yours.


heh, interesting that. I just tried it twice more with ChatGPT "Instant" (disabling "Auto-switch to Thinking") and it got it wrong both times. Does yours get it right without thinking or tool calls? If so, maybe it does like you better than me.

OK, I didn't think to disable auto-switch to thinking (didn't know this was a mode). When I did that, it did get it wrong -- oddly, it took about the same amount of time, so thinking mode wasn't taking longer, but it was more accurate.

Right, though I didn't explicitly disable thinking for my first attempt either. I'd guess my prompt was less detailed than yours and so ChatGPT (in "Auto" mode) decided to allocate thinking tokens for your questions but not mine.

Even more interesting to track how many of those are just ad-hoc patched.

Probably zero. At the end of the day, people pay for LLMs that write better code or summarize hundreds of pages of PDFs faster, not for the ones that can count the letter r better.

When LLMs can't count r's: see? LLMs can't think. Hoax!

When LLMs count r's: see? They patched and benchmark-maxxed. Hoax!

You just can't reason with the anti-LLM group.


Whenever an "LLM fail" goes viral like the car wash question, you can observe the exact wording of the question get "fixed" within a week or so, while slight variations in phrasing can still replicate the problem.

Followed by lots of "works perfectly for me, why are people even talking about this?"

I can't say what exactly they're doing behind the scenes but it's a consistent pattern among the big SOTA model providers. With obvious incentive to "fix" the problem so users will then organically "debunk" the meme as they try it themselves and share their experiences.


You are misremembering. There’s no patch. All these examples used the instant model.

The same non-argument could be made about all kinds of benchmark cheating by tech companies, and yet we have tons of documented examples of them caught with their pants down.

>You just can't reason with the anti-LLM group.

On the contrary, the reasoning is simple and consistent:

LLMs failing to count r's shows that LLMs don't actually think in the way we understand thought (since no human with their level of skill in other areas would fail at that). And because of that, there are (likely) patches for commonly reported cases, since it's a race to IPO and benchmark-maxxing is very much conceivable.


Yeah, well, I presume at this point they have an agent that downloads new LLM-related papers as they come out and adds all the edge cases to their training set ASAP.

Is tokenization extremely efficient? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix is endless memorization.
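That character-vs-token gap is easy to see in a toy sketch (`TOY_VOCAB` and `toy_tokenize` below are made up for illustration, not a real tokenizer -- real BPE vocabularies have tens of thousands of entries):

```python
# Hypothetical subword vocabulary for illustration only.
TOY_VOCAB = ["straw", "berry", "s", "t", "r", "a", "w", "b", "e", "y"]

def toy_tokenize(word: str) -> list[str]:
    # Greedy longest-match segmentation -- the rough idea behind
    # BPE-style tokenizers.
    pieces = sorted(TOY_VOCAB, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(word):
        for piece in pieces:
            if word.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            # No vocabulary piece matches: fall back to the raw character.
            tokens.append(word[i])
            i += 1
    return tokens

word = "strawberry"
print(toy_tokenize(word))  # ['straw', 'berry'] -- the model sees two opaque IDs
print(word.count("r"))     # 3 -- trivial once you operate on characters
```

The model only ever receives the token IDs, so the three r's are never individually visible to it, while character-level code counts them trivially.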


You are trying it on a production model. The paper is using models with tool calls disabled.

It worked for you because the paper does the experiment without allowing the model to use any reasoning tokens - something that is grossly misleading.

Actually, almost all LLMs get the numbering wrong when they write numbered sections in markdown. They skip numbers in between and such.

So yes.

And the valuations. Trillion dollar grifter industry.


I learn a lot from code I read but don’t write. Did the author not read the code and simply throw it over the fence?


This is all valid (except probably the last sentence), but it also describes so many attempted changes right until they become darn near the default.

This sounds like the reasons I heard Redfin wouldn’t work, or Netflix, or Amazon, or Uber, or PayPal, etc…. There are always these business complexities that make it seem like these spaces have too much friction, but if there’s enough money -- if it can be done -- then people will figure it out.


tbh this sounds revisionist... I don't recall anyone saying that any of those services "wouldn't work". Uber, I suppose, is one where people thought it might run into regulatory problems, and with some of those companies people were concerned about profitability. But for none of those companies have I ever heard that the product itself was not going to work or be useful. (Nor, indeed, that the product was tested at large scale and performed 3x worse than the incumbent...)


Not revisionist at all.

> or Netflix, or Amazon, or Uber, or PayPal, etc…

Netflix and Amazon were both competing against brick and mortar that was everywhere. Blockbuster was in every town, usually in every major neighborhood. The thought was that on Friday night people wanted to get the movie they wanted, not just happen to have whatever movie was shipped to them. And then with streaming it was "the content on Netflix is old and dated, who would want this?" They slowly ate from below. Blockbuster scrambled with its own mailed-disc offering, and died before it even had a chance to confront streaming.

Repeat this story with B&N, where people said that you had to browse the books physically. You couldn't just blindly order online and wait two weeks to get the book (remember, they got big before "2-Day Prime").

With PayPal it was "they don't understand banking or payments -- and they want to be both?!"

For this OpenAI experience, it doesn't sound great. I have accounts with the places I buy things from. I want to make sure I get my Prime shipping and digital discount by using the Amazon app. But if you could find a way to integrate all my accounts into ChatGPT, things might be different. In the same way, I used to never use Apple Wallet, but now it really is my go-to place for everything I have a card for. I don't have to worry about having my grocery loyalty card, my football season tickets, or my car insurance card with me. It's all in Wallet. Apple Wallet sucked until it was suddenly great.


Sorry, that is revisionist. The idea of getting a movie mailed or streamed always sounded better than a shitty Blockbuster with a limited selection and late fees.

The growth was fast for Netflix/Amazon/PayPal/etc., and people saw how it was an improvement from the get-go.


I seem to recall a lot more hype for these companies than people saying they wouldn't work. You seem to be cherry-picking from the naysayers of the time, not the broad consensus.

