Fascinating! Here are 4 prompts on GPT-4 with the same system prompt and everything:
> With the assumption that you can dry two towels simultaneously in two hours, you would likely need another two-hour cycle to dry the additional two towels. Thus, drying four towels would take a total of four hours.
>Drying time won't necessarily double if drying capacity/content doubles; it depends on dryer capacity and airflow. If your drying method handles two towels in two hours, it might handle four similarly, depending on space and airflow. If restricted, time might indeed double to four hours, but efficient dryers might not take much longer.
>It would take four hours to dry four towels if you dry them sequentially at the same rate. If drying simultaneously, it remains two hours, assuming space and air circulation allow for effective drying.
>Four hours. Dry two towels, then the other two.
But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:
>If three cats eat three fishes in three minutes, how long do 100 cats take to eat 100 fishes?
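For what it's worth, both puzzles reduce to the same batch arithmetic once the hidden assumption is written down. A minimal sketch in Python: `total_time`, `capacity` and `batch_time` are names made up for illustration, and `capacity` (how many items can be handled at once) is exactly the thing the towel question never states. For the cats it assumes each cat works on its own fish, so capacity grows with the number of cats.

```python
import math

def total_time(items: int, capacity: int, batch_time: float) -> float:
    """Time to process `items` when up to `capacity` items run in parallel per batch."""
    if items < 0 or capacity <= 0:
        raise ValueError("items must be >= 0 and capacity must be > 0")
    batches = math.ceil(items / capacity)  # number of full or partial batches
    return batches * batch_time

# Towels: two towels dry in two hours.
print(total_time(4, capacity=2, batch_time=2))  # dryer only fits two at a time
print(total_time(4, capacity=4, batch_time=2))  # clothesline fits all four

# Cats (assumption: one fish per cat, all cats eat at the same time).
print(total_time(100, capacity=100, batch_time=3))
```

Under the "only two fit" assumption the towels take four hours, under the "all four fit" assumption they take two, and the cats take three minutes. The disagreement between the GPT-4 answers is just a disagreement about `capacity`.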
> But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:
This is the inverse of the Frame Problem, or the Qualification problem:
John McCarthy's paper related to it from the 1980's
LLMs currently have the "eager beaver" problem where they never push back on nonsense questions or stupid requirements. You ask them to build a flying submarine and by God they'll build one, dammit! They'd dutifully square circles and trisect angles too, if those particular special cases weren't plastered all over a million textbooks they ingested in training.
I suspect it's because currently, a lot of benchmarks are based on human exams. Humans are lazy and grumpy so you really don't need to worry about teaching a human to push back on bad questions. Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".
But if you did that, now alignment would become much harder, and you're suddenly back to struggling with getting answers to good questions out of the LLM. So it's probably some time off.
> they never push back on nonsense questions or stupid requirements
"What is the volume of 1 mole of Argon, where T = 400 K and p = 10 GPa?" Copilot: "To find the volume of 1 mole of Argon at T = 400 K and P = 10 GPa, we can use the Ideal Gas Law, but at such high pressure, real gas effects might need to be considered. Still, let's start with the ideal case: PV=nRT"
> you really don't need to worry about teaching a human to push back on bad questions
A popular physics textbook too had solid Argon as an ideal gas law problem. Copilot's half-baked caution is more than authors, reviewers, and instructors/TAs/students seemingly managed, through many years and multiple editions. Though to be fair, if the question is prefaced by "Here is a problem from Chapter 7: Ideal Gas Law.", Copilot is similarly mindless.
Asked explicitly "What is the phase state of ...", it does respond solid. But as with humans, determining that isn't a step in the solution process. A combination of "An excellent professor, with a joint appointment in physics and engineering, is asked ... What would be a careful reply?" and then "Try harder." was finally sufficient.
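To see how far off the plug-n-chug answer is, here is a quick sanity check in Python. It is only a sketch: it applies the ideal gas law exactly as the question invites, then converts the result into an implied density using argon's molar mass (about 39.948 g/mol). The function and variable names are mine, not from any textbook.

```python
R = 8.314462618              # molar gas constant, J/(mol*K)
ARGON_MOLAR_MASS_G = 39.948  # g/mol

def ideal_gas_volume_m3(n_mol: float, temperature_k: float, pressure_pa: float) -> float:
    """V = nRT / P, with no check on whether the ideal gas law applies at all."""
    return n_mol * R * temperature_k / pressure_pa

volume_m3 = ideal_gas_volume_m3(1.0, 400.0, 10e9)  # 10 GPa = 10e9 Pa
volume_cm3 = volume_m3 * 1e6                       # 1 m^3 = 1e6 cm^3
implied_density = ARGON_MOLAR_MASS_G / volume_cm3  # g/cm^3

print(f"V = {volume_cm3:.3f} cm^3, implied density = {implied_density:.0f} g/cm^3")
```

The ideal-gas answer comes out to roughly a third of a cubic centimetre, which implies a density of about 120 g/cm^3. That is several times denser than osmium, so the number itself signals that the ideal gas law is the wrong tool here, even before asking about the phase.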
> you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense
Oh, if only that were commonplace. Aspiring to transferable understanding. Maybe someday? Perhaps in China? Has anyone seen this done?
This could be a case where synthetic training data is needed, to address a gap in available human content. But if graders are looking for plug-n-chug... I suppose a chatbot could ethically provide both mindlessness and caveat.
>Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".
I wouldn't even give them credit for cases where there's a lot of good training data. My go-to test is sports trivia and statistics. AI systems fail miserably at that [1], despite the wide availability of good clean data and text about it. If sports is such a blind spot for AIs, I can't help but wonder what else they're confidently wrong about.
This is a good observation. I've noticed this as well. Unless I preface my question with the context that I'm considering whether something may be a bad idea, its inclination is heavily skewed positive until I point out a flaw/risk.
I asked Grok about this: "I've heard that AIs are programmed to be helpful, and that this may lead to telling users what they want to hear instead of the most accurate answer. Could you be doing this?" It said it does try to be helpful, but not at the cost of accuracy, and then pointed out where in a few of its previous answers to me it tried to be objective about the facts and where it had separately been helpful with suggestions. I had to admit it made a pretty good case.
Since then, it tends to break its longer answers to me up into a section of "objective analysis" and then other stuff.
That's interesting, thanks for sharing that. I have found something similar: when I first correct it by pointing out a flaw, the following answers tend to be a bit less "enthusiastic" or skewed towards "can do", which makes sense.
I asked Gemini to format some URLs into an XML format. It got halfway through and gave up. I asked if it truncated the output, and it said yes and then told _me_ to write a python script to do it.
On the one hand, it did better than ChatGPT at understanding what I wanted and actually transforming my data.
On the other, truncating my dataset halfway through is nearly as worthless as not doing it at all (and I was working with a single file, maybe hundreds of kilobytes).
Given that Gemini seems to have frequent availability issues, I wonder if this is a strategy to offload low-hanging fruit (from a human-effort pov) to the user. If it is, I think that's still kinda impressive.
Somehow I like this. I hate that current LLMs act like yes-men, you can't trust them to give unbiased results. If it told me my approach is stupid, and why, I would appreciate it.
I just asked ChatGPT to help me design a house where the walls are made of fleas and it told me the idea is not going to work, and also has ethical concerns.
I tried it with a Gemini personality that uses this kind of attack, and since that kind of prompt strongly encourages it to provide a working answer, it decided that the fleas were a metaphor about botnet clients, and the walls were my network, all so it could give an actionable answer.
I've noticed Gemini pushing back more as well, whereas Claude will just butter me up and happily march on unless I specifically request a critical evaluation.
> they never push back on nonsense questions or stupid requirements
I was reminded of your comment this morning when I asked ChatGPT how to create a path mask in Rhino Grasshopper:
Me: what is a path mask that will get 1;1;0;0;* and also anything lower (like 1;0;5;10 or 0;20;1;15)?
ChatGpt: Short answer: No single path mask can do that. Here's why: (very long answer)
Me: are you sure I can't use greater than, less than in the masks?
ChatGpt: Yes — **I am absolutely sure:** **Grasshopper path masks do *NOT* support greater-than or less-than comparisons.** Official sources and detailed confirmation: (sources and stuff)
...so I think your priors may need to be updated, at least as far as "never". And I especially like that ChatGpt hit me with not just bold, not just italics, but bold italics on that NOT. Seems like a fairly assertive disagreement to me.
Hmm. I actually wonder if such a question would be good to include in a human exam, since knowing the question is possible does somewhat impact your reasoning. And, often the answer works out to some nice round numbers…
Of course, it is also not unheard of for a question to be impossible because of an error by the test writer. Which can easily be cleared up. So it is probably best not to have impossible questions, because then students will be looking for reasons to declare the question impossible.
Reasoning LLMs especially should have no problem with this sort of trick. If you ask them to list out all of the implicit assumptions in (question) that might possibly be wrong, they do that just fine, so training them to do that as the first step of a reasoning chain would probably get rid of a lot of eager beaver exploits.
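A rough sketch of what "list the assumptions first" could look like as a prompt wrapper, in Python. This is only string building: `build_assumption_first_prompt` and the wording of the steps are hypothetical, and nothing here calls a real model API.

```python
def build_assumption_first_prompt(question: str) -> str:
    """Wrap a question so the model audits its premises before answering."""
    steps = [
        "List every implicit assumption in the question that might be wrong.",
        "For each assumption, say whether the question states it or merely implies it.",
        "If any assumption is false or undecidable, explain why and ask for clarification.",
        "Only if all assumptions hold, answer the question.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        f"Question: {question.strip()}\n\n"
        f"Before answering, work through these steps:\n{numbered}"
    )

print(build_assumption_first_prompt("How long does it take to dry four towels?"))
```

Doing this at prompt time is obviously weaker than training it in, but it is a cheap way to check whether a given model can spot the bad premise at all.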
I think you start to hit philosophical limits with applying restrictions on eager beaver "AI", things like "is there an objective truth" matter when you start trying to decide what a "nonsense question" or "stupid requirement" is.
I'd rather the AI push back and ask clarifying questions, rather than spit out a valid-looking response that is not valid and could never be valid. For example.
I was going to write something up about this topic but it is surprisingly difficult. I also don't have any concrete examples jumping to mind, but really think how many questions could honestly be responded to with "it depends" - like my kid asked me how much milk should a person drink in a day. It depends: ask a vegan, a Hindu, a doctor, and a dairy farmer. Which answer is correct? The kid is really good at asking simple questions that absolutely do not have simple answers when my goal is to convey as much context and correct information as possible.
Furthermore, just because an answer appears in context more often in the training data doesn't mean it's (more) correct. Asserting it is, is fallacious.
So we get to the point, again, where creative output is being commoditized, I guess - which explains their reasoning for your final paragraph.
> I also don't have any concrete examples jumping to mind
I do (and I may get publicly shamed and shunned for admitting I do such a thing): figuring out how to fix parenthesis matching errors in Clojure code that it's generated.
One coding agent I've used is so bad at this that it falls back to rewriting entire functions and will not recognise that it is probably never going to fix the problem. It just keeps burning rainforest trying one stupid approach after another.
Yes, I realise that this is not a philosophical question, even though it is philosophically repugnant (and objectively so). I am being facetious and trying to work through the PTSD I acquired from the above exercise.
Tuning the model output to perform better on certain prompts is not the same as improving the model.
It's valid to worry that the model makers are gaming the benchmarks. If you think that's happening and you want to personally figure out which models are really the best, keeping some prompts to yourself is a great way to do that.
There is no guarantee for you that by keeping your questions to yourself that no one else has published something similar. This is bad reasoning all the way through. The problem is in trying to use a question as a benchmark. The only way to really compare models is to create a set of tasks of increasing compositional complexity and running the models you want to compare through them. And you'd have to come up with a new body of tasks each time a new model is published.
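As a toy illustration of "tasks of increasing compositional complexity", here is a sketch in Python. Everything in it is an assumption made for the example: the four primitive operations, the prompt wording, and the idea of using chain length as the complexity measure. A real suite would need far richer tasks; the point is only that the tasks are generated, carry their own ground truth, and can be regenerated with a new seed for each new model.

```python
import random

# Hypothetical primitive operations; any small, unambiguous set would do.
OPS = {
    "add 3": lambda x: x + 3,
    "double": lambda x: x * 2,
    "subtract 5": lambda x: x - 5,
    "square": lambda x: x * x,
}

def make_task(depth: int, rng: random.Random):
    """Return (prompt, answer) for a task that chains `depth` operations."""
    start = rng.randint(1, 9)
    names = [rng.choice(list(OPS)) for _ in range(depth)]
    value = start
    for name in names:
        value = OPS[name](value)  # apply each operation in order
    prompt = f"Start with {start}, then " + ", then ".join(names) + ". What is the result?"
    return prompt, value

def make_suite(max_depth: int, per_depth: int, seed: int):
    """Build {depth: [(prompt, answer), ...]}; a new seed gives a fresh suite."""
    rng = random.Random(seed)
    return {d: [make_task(d, rng) for _ in range(per_depth)]
            for d in range(1, max_depth + 1)}

suite = make_suite(max_depth=4, per_depth=2, seed=2024)
for depth, tasks in suite.items():
    for prompt, answer in tasks:
        print(depth, prompt, "->", answer)
```

Plotting accuracy against `depth` for each model would then show where it stops composing reliably, which is harder to game than a fixed list of questions.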
Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessary. The fact that providers do is evidence that there is no general reasoning, just second order overfitting (loss on token prediction does descend, but that doesn't prevent the 'reasoning loss' from being uncontrollable: cf. 'hallucinations').
> Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessary. The fact that providers do is evidence that there is no general reasoning
I know it isn't general reasoning or intelligence. I like where this line of reasoning seems to go.
Nearly every time I use a chat AI it has lied to me. I can verify code easily, but it is much harder to verify whether the three "SMA but works at cryogenic temperatures" materials it claims exist actually do, or are what it says they are.
But that doesn't help to explain to someone else who just uses it as a way to emotionally dump, or an 8 year old that can't parse reality well, yet.
In addition, I'm not merely interested in reasoning, I also care about recall, and factual information recovery is spotty on all the hosted offerings, and therefore also on the local offerings too, as those are much smaller.
I'm typing on a phone and this is a relatively robust topic. I'm happy to elaborate.
There are numerous papers about the limits of LLMs, theoretical and practical, and every day I see people here on this technology forum claiming that they reason and that they are sound enough to build products on...
It feels disheartening. I have been very involved in debating this for the past couple of weeks, which led me to read lots of papers and that's cool, but also feels like a losing battle. Every day I see more bombastic posts, breathless praise, projects based on LLMs etc.
almost reminds me of stuff like, "no, this fork of the bitcoin source code and the resulting blockchain is the one that will change the world! Forget all those other shitcoins!"
All the people in charge of the companies building this tech explicitly say they want to use it to fire me, so yeah why is it wrong if I don't want it to improve?
So long as the grocery store has groceries, most people will not care what a chat bot spews.
This forum is full of syntax and semantics obsessed loonies who think the symbolic logic represents the truth.
I look forward to being able to use my own creole to manipulate a machine's state to act like a video game or a movie rather than rely on the special literacy of other typical copy-paste middle class people. Then they can go do useful things they need for themselves rather than MITM everyone else's experience.
A third meaning of creole? Huh, I did not know it meant something other than a cooking style and a people in Louisiana (mainly). As in I did not know it was a more generic term. Also, in the context you used it, it seems to mean a pidgin that becomes a semi-official language?
I also seem to remember that something to do with pit bbq or grilling has creole as a byproduct - distinct from creosote. You want creole because it protects the thing in which you cook as well as imparts flavor, maybe? Maybe I have to ask a Cajun.
Pidgin and creole (language) are concepts that have some similarities but don't fully overlap.
"Creole" has colonial overtones. It might be a word of Portuguese origin that means something to the effect of an enslaved person who is a house servant raised by the family they serve ('crioulo', a diminutive derivative of 'cria', meaning 'youngling' - in Neapolitan the word 'criatura' is still used to refer to children). More well documented is its use in parts of Spanish South America, where 'criollo' initially designated South Americans of Spanish descent. The meaning has since drifted in different South American countries. Nowadays it is used to refer, amongst other things, to languages that are formed by the contact between the languages of colonial powers and local populations.
As for the relationship of 'creole' and 'creosote', the only reference I could find is to 'creolin', a disinfectant derived from 'creosote', which is itself derived from tars.
Pidgin is a term used for contact languages that develop between speakers of different languages and somewhat deriving from both, and is believed to be a word originated in 19th century Chinese port towns. The word itself is believed to be a 'pidgin' word, in fact!
Cajun is also a fun word, because it apparently derives from 'Acadien', the French word for Acadian - people of French origin who were expelled from their colony of Acadia in Canada. Some of them ended up in Louisiana and the French Canadian pronunciation "akad͡zjɛ̃", with a more 'soft' (dunno the proper word, I can feel my linguist friend judging me) "d" sound than the French pronunciation "akadjɛ̃", eventually got abbreviated and 'softened' to 'cajun'.
I just confirmed with 2 native Louisianians that "creole" is, in fact, also the stuff that forms in a BBQ. I have to wonder if it is a bit insensitive to use it in that way, though.
I did not know the Acadiana link, thanks for that.
First of all, loyalty happens when both sides have moats. I'm not talking here about the case where one side is very loyal and the other is very disloyal - I'd rather call that "suckering". But in the US, government jobs have lots of mutual loyalty. The business can feel confident the employee isn't likely to leave, because for those jobs a huge part of the package is the pension which you only get after staying 20 years. And they heavily reward tenure. Meanwhile the employees also feel confident they won't be dumped (DOGE aside) because these orgs are structured in such a way that it's very hard to fire people due to process and culture. Lo and behold, plenty of loyalty in government jobs. US companies fire much more easily.
In European companies both firing and quitting are much more complicated, so you get employer loyalty in Germany or the UK for example, because you actually get long-term benefits there and termination is not as simple. US companies of the 1950s-80s, like the author's father's employer, were similar as well.
By the way, US companies don't actually demand loyalty. They pay lip service to it, but complaining about that is like complaining that people in clothing catalogs are too attractive. That's just how the field works, nobody takes it seriously and you look silly complaining about it. "Demanding loyalty" doesn't look like this. If an employer offered a $1 million bonus on your 10 year anniversary, that would be demanding loyalty for real. But neither the employee nor employer side has interest in this, not to mention the implied slowing down of the termination process. Plus the can of worms of knowing the company will even be around then.
Everything is fine, zoomers are not some insanely disloyal alien changelings. We're just in a transitional economy.
This focuses on the case where the acquirer seeks to capture the value of the startup's business. But this is not always the case: sometimes the startup is dubious, but a cash-rich enterprise can purchase startups simply to eliminate potential avenues of competition. They may not be interested in adding a better product to their portfolio, only in quashing any nascent attempts at building the better product so they can keep selling their own mediocre one.
Also, "model innovation" strikes me as missing the point these days. The models are really good already. The majority of applications are capturing only a tiny bit of their value. Improving the models is not that important because model capability is not the bottleneck anymore; what matters is how the model is used. We just don't have enough tools to use them fully, and what we have is not even close to penetrating the market, while all the dominant tools are garbage. Of course application innovation is the place to be!
The main reason for me is that terminal programs are just less crappy, because people who develop them try much harder. The terminal itself strikes me as a terrible platform - no text sizes, no fonts, no graphics... People dismiss it as unnecessary bells and whistles but then every other TUI program jumps through ridiculous hoops to reinvent crappy versions of these.
If only the same people developed their programs with the same philosophy (minimal, simple, clean UI and keyboard driven) but in a normal GUI, so that you don't have to abuse Unicode to draw UI and just draw it.
This has been my take on things forever. Power user tools tend to be made as a TUI, which is great in that it lets you work from the framebuffer and over SSH, but it really makes you wonder why we can't have the same conveniences with proper graphics.
I think the problem is the disconnect between learning vs passing. The goal of writing a book report is supposed to be to develop your brain and improve some skills. But society cannot simply give away knowledge without some kind of testing, so there must be an exam. And you have curricula where students are "required" to take a list of classes. Not all students are deeply excited about every class on that list (or their teacher, or textbook) so some students are in some classes purely to tick a checkbox. That means to them, whatever skill is taught there is useless, so they'll happily use the LLM and cheat in other ways.
First part of the problem is we need to stop cookie cutter course lists. Forcing people to take a course they don't care about is a futile exercise. Back in the day it was easy to do it, but now it has gotten harder due to LLMs and reliance on exams as a compliance tool. Yes, this will make it harder to say someone has a degree in X. Instead you will have to handle a bit more nuance and discuss what specific topics they studied.
Second part is we need to dial down the credentialism. Treating third party exam grades as an indicator of ability is no longer feasible in the LLM world. The only viable way is to have an extremely controlled exam environment, but that greatly restricts what sort of things you can examine. A lot of knowledge is relevant on a timescale of days or longer, not a few hours, and you can't detain people for days just for an exam grade.
Both of these are challenging for sure but I don't think they're impossible. The programming industry has dealt with this for decades. When someone has a degree in CS or a related area, it doesn't mean all that much in practice, and the GPA in that degree is also a weak indicator. Instead, the industry found other ways to directly evaluate ability. Sure, they're not perfect, but not exactly hopeless I would say.
>so some students are in some classes purely to tick a checkbox
As a student I was forced to take classes I would have never willingly chosen to take, and yet I still learned from them. I worked for an A and didn't consider cheating an option. I'm not really sure why, I can answer why I wouldn't today, but I can't particularly say why my yesteryear self was so against it, yet it remains as a key point in me gaining a very useful education.
>Forcing people to take a course they don't care about is a futile exercise.
While I think sometimes we include too many unrelated courses, I also don't agree with the idea of only giving someone courses they are interested in. I would have been weaker for it. I think the issue is the culture that encourages cheating as a valid response, but where does that come from and how to fight it are massive problems.
>The only viable way is to have an extremely controlled exam environment, but that greatly restricts what sort of things you can examine.
I think oral exams are great at testing knowledge, but they suffer other problems. They don't scale at all, and they leave more room for bias than other forms of exams. I'm sure there are other problems, but those two are enough to start with. If only there was some option that had the benefits of an oral exam with an expert without the issues (this sounds like I'm hinting there is such a solution, but I promise I'm not, it is just wistful thinking).
This. Not only that, I don't know of a single person (IRL or online) who used atop, like, ever. In fact, this is the first time I'm even hearing of atop.
IIRC, most folks went from top -> htop -> glances -> various btop variants (bashtop, bpytop, btop++ etc)
atop can record to a file and then be replayed in the future. Sometimes a node is so FUBARed that it won’t even emit metrics so atop can sometimes save your ass when it records metrics to disk.
I used atop sporadically at Facebook to debug performance issues. I actually learned about it there; it was, I think, on all the machines. This was a bunch of years ago, so I'm not sure if it is still there fleetwide, but it was really helpful for getting a granular view of what happened on a machine at some exact second a few days ago, when error rate metrics indicated a particular host was struggling.
I'm genuinely stunned to figure out there's a whole set of lore of *tops.
I'm not sure I'm being rational from a textbook security perspective, but, it'd take a whole lot of tangible reward to get me off the binaries supplied with the system.
btop gives you a more holistic overview of the system: individual disk stats, network stats, graphs of mem/cpu/bandwidth usage over time, etc.
I think it's handy having everything on one screen, but if you know your way around all the individual builtin tools for these, more power to you, no reason to change.
First of all, btop is included in the default repos of most Linux distros, so you don't need to worry about security. This also applies to htop and glances by the way.
In terms of tangible feature benefits, btop also offers disk I/O stats, network throughput stats, partition usage, and even GPU usage (if your distro compiled it with GPU support).
In terms of "nice" stuff that's non-essential, the overall UI is a lot more user-friendly and in many ways, better (subjectively). Eg there are visual graphs for various metrics, you can filter process names by substring, get detailed stats of a specific process, see the tree view of all the processes, easily show/hide various parts of the UI (eg you can focus solely on the process list if that's the only thing you're interested in).
The UI also offers some distinct advantages, such as making it easier to send specific signals to processes. Eg in btop I can just select SIGSTOP from the menu, whereas in top, I'd need to remember or look up the numeric equivalent (eg 19 for SIGSTOP).
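If you do need the number, you don't have to memorise it. A small Python sketch using the standard `signal` module, assuming a POSIX system (the helper name `signal_number` is made up here). The values are platform-dependent: SIGSTOP is 19 on typical Linux x86/ARM systems but differs on some other platforms, which is one more reason to look it up by name.

```python
import signal

def signal_number(name: str) -> int:
    """Look up a signal's numeric value by name, e.g. 'SIGSTOP'."""
    try:
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError(f"unknown signal name: {name}") from None

# Print the numbers this particular system uses.
for name in ("SIGSTOP", "SIGCONT", "SIGTERM"):
    print(name, signal_number(name))
```

In practice you can usually skip the number entirely: `os.kill(pid, signal.SIGSTOP)` in Python or `kill -STOP <pid>` in a shell both take the name.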
Other top alternatives also offer similar feature sets. Glances also shows the most recent warnings/errors from the system logs, as well as container resource usage, which would be handy for some folks.