
So the type of emergent ability I am curious about is how, with enough parameters, the models start to actually exhibit logical behaviour where previously they were not able to. An example:

> David is 15 years old. He has a lot of experience riding bicycles. The minimum age to be allowed to drive a car is 18. Is David allowed to drive a car?

Answer: > Based on the information provided, David is 15 years old and the minimum age to be allowed to drive a car is 18. Therefore, David is not allowed to drive a car since he has not reached the minimum age requirement.

Trying out smaller models I get all kinds of crazy responses. ChatGPT4 seems to be reliably able to apply this kind of logic to within-prompt information. Is this a gradual property that appears slowly as models scale up? Or something that flips beyond a certain size? Or something specific about the RLHF that has been applied to it? Whatever it is, the end result is that this larger model is useful for fundamentally different things than the other models are. Whether you call that emergent behaviour, sums greater than parts, etc. doesn't change that.



It's not just the ability to correctly answer this, but the consistency.

I asked this exact question to the `oasst-sft-6-llama-30b` model and it was able to consistently get the correct answer. I then tried the smaller `vicuna-7b` model, and while it usually gave the correct answer, there was the occasional miss.

Interestingly, `oasst-sft-6-llama-30b`'s ability to answer correctly seems to be fairly stable across multiple configurations. I tried various temperature settings from 0.2 up to 1.2, different topP configs, and they all answered correctly.
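If you want to reproduce that kind of sweep locally, here is a rough sketch using the llama-cpp-python bindings against a quantized GGML build of the model; the filename is made up, so point it at whatever file you actually have:

    # Hedged sketch: sweep temperature / top_p for one prompt with llama-cpp-python.
    from llama_cpp import Llama

    PROMPT = ("David is 15 years old. He has a lot of experience riding bicycles. "
              "The minimum age to be allowed to drive a car is 18. "
              "Is David allowed to drive a car?")

    # Hypothetical filename; substitute your local GGML checkpoint.
    llm = Llama(model_path="./oasst-sft-6-llama-30b.ggmlv3.q4_0.bin")

    for temperature in (0.2, 0.7, 1.2):
        for top_p in (0.5, 0.9, 1.0):
            out = llm(PROMPT, max_tokens=128, temperature=temperature, top_p=top_p)
            print(f"temp={temperature} top_p={top_p}:",
                  out["choices"][0]["text"].strip())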


reminds me of voice recognition

On the one hand, the problem was nearly "solved" in the early 2000s, getting to 95% accuracy. But the daily experience of using something that makes mistakes at that rate is infuriating, and well and truly outside of consideration for putting into any kind of critical pathway. So it's a combination of how difficult it is to close the last few percentage points of accuracy with how important they are for most of the high-value use cases.

For the foreseeable future I see most use of this tech coming from applications where it aids humans and/or checks their decisions rather than running solo.


> Computer, earl grey, hot.

I think we're getting closer to something like this, out of Star Trek. Even in Star Trek, AI did not take over critical functions - but rather assisted the crew in manning the starship.


I’ve never understood why voice recognition has always attempted complete understanding of arbitrary input rather than following a simple command language, e.g. <subject> <parameters> <action>. It could be made completely reliable with current tech (even a decade ago, really) just by minimizing the possibility space… and I’m pretty sure consumers could trivially learn it, as long as they don’t try to go full pseudo-programming-language mode.

And “Computer, execute program alpha beta seven” would be the power user version of it

We should already be at “computer, earl gray, hot” today
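To make the <subject> <parameters> <action> idea concrete, here is a toy sketch; the vocabulary is invented for illustration, but the point is that parsing against a closed grammar is trivial and either succeeds exactly or refuses:

    # Toy closed-grammar parser for "<subject> <parameters...> <action>" commands.
    SUBJECTS = {"lights", "thermostat", "kettle"}
    ACTIONS = {"on", "off", "set", "brew"}

    def parse(utterance: str):
        """Return a structured command, or None if it isn't in the grammar."""
        words = utterance.lower().split()
        if len(words) < 2 or words[0] not in SUBJECTS or words[-1] not in ACTIONS:
            return None  # reject instead of guessing
        return {"subject": words[0], "parameters": words[1:-1], "action": words[-1]}

    print(parse("thermostat 70 degrees set"))
    # {'subject': 'thermostat', 'parameters': ['70', 'degrees'], 'action': 'set'}
    print(parse("make me a sandwich"))  # None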


Years ago I used a program with that approach for a space sim. Basically it would only recognize voice commands that you define beforehand, which made it very reliable at recognizing the right one because it just had to find the closest match within a limited set of options, and would then simulate associated key inputs.
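That closest-match-within-a-fixed-set approach takes only a few lines to sketch in Python; the command names and keybindings below are invented, and the cutoff means garbage input gets rejected rather than guessed at:

    import difflib

    # Invented command set; values are the key inputs that would be simulated.
    COMMANDS = {
        "deploy landing gear": "g",
        "request docking": "alt+d",
        "toggle flight assist": "z",
    }

    def match_command(transcript: str):
        """Map a (possibly misheard) transcript onto the closest known command."""
        hits = difflib.get_close_matches(transcript.lower(), list(COMMANDS),
                                         n=1, cutoff=0.6)
        return hits[0] if hits else None

    print(match_command("deploy landing here"))     # 'deploy landing gear'
    print(match_command("open the pod bay doors"))  # None: nothing close enough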

Meanwhile when I tried Android's voice-based text input it was a catastrophe as my accent completely threw it off. Felt like it was exclusively trained on English native speakers. Not to mention the difficulty such systems have when you mix languages, as it tends to happen.


This is an annoyance that Linus from LTT constantly brings up. The voice assistants do the speech recognition and the mapping to commands as separate steps, which results in lots of mistakes that should never happen. If you say "call XYZ", the result would be so much better if the phone first tried to figure out whether any of the existing contacts sounds like XYZ.

Limiting the options rather than making the system super generic would help in so many cases.


> I’ve never understood why voice recognition has always attempted to be complete understanding of arbitrary input, rather than follow a simple command language

Because the UI affordances (in this case the control language) wouldn’t be discoverable or memorable across a large range of devices or apps. Moreover, speaking is an activity that allows for an arbitrary range of symbol patterns, and a feedback loop between two parties in dialog can resolve complex matters even though they start from different positions.


I mean, right now the current state is effectively an undiscoverable control language with some flexibility, but it generally fails or is unreliable unless you restrict yourself to very specific language, language that differs based on the task being executed, often with similar-but-different formats required to do similar actions.

I’d argue that if the current state is at all acceptable, then a consistent, teachable and specific language format would be an improvement in every way, and you could have an “actual” feedback loop because there’s a much more limited set of valid inputs, so your errors can be much more precise (and made human-friendly rather than, I think, merely programmer-friendly).

As it stands, I’ve never managed a dialogue with Siri/Alexa; it either ingests my input correctly, rejects it as an invalid action, does something completely wrong, or produces a “could not understand.. did you mean <gibberish>?”.

Having the smart-AI dialogue would be great if I could have it, but for the last decade that simply isn’t a thing that occurs. Perhaps with GPT and its peers, but afaik GPT doesn’t have a response->object model that could be actioned on, so the conversation would sound smoother but be just as incompetent at actually understanding whatever you’re looking to do. I think this is basically the “sufficiently smart compiler” problem, which never comes to fruition in practice.


It's like using a CLI where the argument structure is inconsistent and there is no way to list commands and their arguments in a practical way.


Close your eyes and imagine that CLI system is instead voice / dialog based. The tedium. For bonus points, imagine you’re in a space shared with others. Doesn’t work that well…


What? No, I think it'd be great! I'd love to be able to say out loud "kube get pods pipe grep service" and the output to be printed on the terminal. I _don't_ want to say "Hey Google, list the pods in kubernetes and look for customer service".

The transfer between my language and what I can type is great. It starts becoming more complex once you need to add countless flags, but again, a structured approach can fix this.


Voice recognition often works with a grammar of words you specify to improve the chance of correct detection.

It's just that there is no consumer application and I think the reception of voice commands from the public was fairly cold.

I don't want to do stuff with my voice, for one. I'd rather click or press a button.


Most voice assistants can work with simple phrases like that. Alexa, lights on. Hey Google, thermostat 70 degrees.


Not Siri, which thinks I'm talking to her all the time when I'm speaking to a family member whose name contains neither an "s" nor an "r".


That's because letters aren't sounds.


Way to jump to unjustified conclusions. The name also doesn’t contain either sound.


The problem is that’s not the only format they work on, and because input format is largely unconstrained, when they misunderstand, they catastrophically misunderstand.

It’s just like the image recognition ML issue, where it can correctly predict a cat, but change three specific pixels and it has 99% confidence it’s an ostrich.

Or JavaScript equality. If you do it right, it’s right, but otherwise anything goes.

Or Perl, in its entirety


Probably the divide between technical users and non-technical. You and I find that structure completely logical. But less structured natural language with a million ways to ask a certain thing puts it practically in reach of the remainder of the population.


Nerd point of order here: Star Trek TNG had a ship in which a key member of the bridge crew was an android. They routinely relied on Data for all kinds of critical things. And although the ship was manned by people, it couldn't function properly without its computer. Several episodes revolve around computer malfunctions.

Finally, their nemesis was the Borg, a race that explored the question of what happens if a society fully embraces AI and cybernetics instead of keeping it at a distance like the Federation does. The Borg are depicted as more powerful than the Federation exactly because they allowed AI to take over critical functions.


> Nerd point of order here: Star Trek TNG had a ship in which a key member of the bridge crew was an android. They routinely relied on Data for all kinds of critical things.

Data was created by technology not available to the Federation. As far as the order of society is concerned, he's magic and not technology. An immediate implication is that his ship was the only one in the Federation with an android crew member.

> And although the ship was manned by people, it couldn't function properly without its computer. Several episodes revolve around computer malfunctions.

This is true, though. The computer did take over many critical functions.


> This is true, though. The computer did take over many critical functions.

But the Star Trek computer was just a fairly normal computer with an AI-ish voice UI. And there have been present-day ships which couldn't function properly without their computer... I distinctly remember a story about a new (~20 years ago) US Navy warship not being able to go on its maiden voyage because Windows blue-screened.


My boring car can't function without its computer.


Whoa, hold on.

Data was an android, but one that is meant to mimic an individual being. He may have been a form of AI, but he is no more than just an advanced human.

And yes, the ship couldn't function without computers - but they were traditional (but futuristic) computers manned by people, with AI guided by people - not AI that controlled every aspect of their lives.

I think when people think of AI, and the fear that comes with it - they imagine the type of AI that takes over human function and renders them unimportant.

Also, the Borg didn't fully embrace AI. They were a collective, linked together by technology. You can view them as more or less a singular entity with many moving parts that communicated over subspace, striving to achieve perfection (in their own eyes). As a consequence, they seek to assimilate (thus parasitizing their technological advancements for their own) or eradicate other species in an attempt to improve the Hive.



Tangential nerd point: Lt Barclay who became superior to the onboard computer or Data after Barclay was scanned / influenced by an alien probe.

https://memory-alpha.fandom.com/wiki/The_Nth_Degree_(episode...


Star Trek was a fiction series that heavily focused on human experiences and relationships. Picard et al famously do a lot of things that actual navy commanders would absolutely never do, like commanding away teams in hostile and/or dangerous territory.

Having an AI to pilot the ship, target the weapons and control away teams robots/holograms would take away from the core purpose of the show which is to tell a gripping story. It's not meant as an exploration on how to use AI in space exploration.


ST isn't military fiction. It's science fiction. Thinking about technology is absolutely one of its core aims.

They had a whole episode about the legal and moral aspect of AI human rights.


It definitely seeks to explore the impact of many technologies, but the impact of AI was not really one of them. They spent one whole episode out of 178 on AI, and there was a _very_ small plotline near the start of TNG about Data wishing to be more human.

EDIT: There was also the episode where a holodeck character gains true sentience, but then the crew proceeds to imprison it forever into a virtual world and this is treated by the show as the ethical thing to do. Trapping any human in a simulation is (correctly IMO) treated as a horrible thing, but doing it to an evidently sentient AI is apparently no problem.


It’s a good example of what people would like out of AI though - perfect recall and solid reasoning/calculation capabilities; an assistant that covers our weaknesses.


Obligatory counter quote from Douglas Adams:-

> He had found a Nutri-Matic machine which had provided him with a plastic cup filled with a liquid that was almost, but not quite, entirely unlike tea.

Getting LLMs to give you an answer is easy. Getting them to give you the answer you're actually looking for is much harder.

LLMs are a very useful search tool but they can't be relied on as a source of truth ...yet. Therein lies their main problem.


> how difficult it is to close the last few percentage points of accuracy

Like, after getting to 99% you are about half way, the last 1% is the hard part.


Because randomly casting dust on a table sometimes says intelligent things, therefore there is a continuous function between dust and ChatGPT?

While “dust” might be flippant, their approach does seem to suggest that even hierarchical Markov models would be able to demonstrate abilities on their continuous metrics.


By adding dense vector search, the accuracy of smaller models can be improved, because the reference material has more hints in it than the frozen model…


My emerging conception of this is to split this into two separate questions:

1. Is the architecture _capable_, i.e. is it possible for a model with a given shape to perform some "reasoning"?

2. Is the architecture _trainable_, i.e. do we have the means to learn a configuration of the parameters that achieves what we know they are capable of.

Recent interpretability work like that around Induction Heads [1] or the conclusion that transformers are Turing complete [2] combined with my own work to hand-specify transformer weights to do symbolic multi-digit addition (read: the same way we do it in grade school) has convinced me that reasoning over a finite domain is a capability of the even tiniest models.

The emergent properties we see in models like GPT4 are more a consequence of the fact that we've found a way to train a fairly efficient representation of a significant fraction of "world rules" into a large number of parameters in a finite amount of time.

[1] https://transformer-circuits.pub/2021/framework/index.html

[2] https://jmlr.org/papers/volume22/20-302/20-302.pdf


That's a useful breakdown in terms of how to think about it.

One angle I am curious about is whether it's to some extent an artefact of how you regularise the model as much as the number of parameters and other factors.

You can think about it in terms of: if you regularise it enough, then instead of fitting specific data points you force the network to actually start learning logic internally, because that is the only thing generalisable enough to allow it to produce realistic text for such a diverse range of prompts. You have to have enough parameters that this is even possible, but once you do, the right training / regularisation essentially starts to inevitably force it into that approach rather than the more direct nearest-neighbor style "produce something similar to what someone said once before" mechanism.


Just speculating, but this looks like precisely the sort of result one might expect from fine-tuning models on chain-of-thought-prompted interactions like those described by Wei et al. in 2022[1]. Alternatively, as Madaan et al. show in their 2022 paper[2], this may simply be the result of larger language models having seen more code, and thus structured reasoning, in their training data.

[1] https://arxiv.org/abs/2201.11903

[2] https://arxiv.org/abs/2210.07128


Just dealing with numbers and numeric comparisons is a big improvement with model size. Symbols are not a great encoding of numbers and models have struggled with math.

ChatGPT3.5 handles that just fine.

I do like the big improvement from ChatGPT3.5 to ChatGPT4 on answers to questions like "Which is heavier, two pounds of bricks or one pound of feathers?" 3.5 is really inclined to say "They are both the same weight, as they both weigh one pound."


Is a larger model better with numbers simply because it is more likely to have examples that use those same numbers? Or because it somehow gives it better ability to reason about numbers?


Right now, larger models have more complicated and rich structures encoding information about numbers and the meanings of their parts.

There's fundamental awkwardness that comes with doing math using a device that only seeks to predict the "next token" coming out, and that only understands numbers as a sequence of tokens (usually digits in base 10). It also doesn't even start with the knowledge of the ordering of the digits: this just comes from the examples it has seen.

Either it must:

- "Think ahead" inside the linear algebra of the model, so that it has already carried all the digits, etc. There are no real "steps" in this operation that are akin to the things we think about when we do arithmetic.

- Describe what it is doing, so that the intermediate work is inside its context buffer.

Right now, the models have learned structures that reliably think 3-4 digits ahead in most cases, which is way better than before but still pretty bad compared to a competent 4th grader taking their time with arithmetic. But if you create a scenario where the model describes its reasoning, it can do pretty well.
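To make the second option concrete, here is a rough sketch of the kind of grade-school scratchpad a model can be prompted to write out, so that every carry ends up in the context instead of having to be computed silently inside one forward pass (the function only illustrates the text format; it is not anything the model itself runs):

    def addition_scratchpad(a: str, b: str) -> str:
        """Write out column-by-column addition steps, least significant digit first."""
        width = max(len(a), len(b))
        a, b = a.zfill(width), b.zfill(width)
        lines, carry, result = [], 0, ""
        for da, db in zip(reversed(a), reversed(b)):
            total = int(da) + int(db) + carry
            digit, carry = total % 10, total // 10
            lines.append(f"{da} + {db} + carry = {total}: write {digit}, carry {carry}")
            result = str(digit) + result
        if carry:
            result = str(carry) + result
        lines.append(f"answer: {result}")
        return "\n".join(lines)

    print(addition_scratchpad("487", "365"))  # ends with "answer: 852"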


> that only understands numbers as a sequence of tokens (usually digits in base 10).

You wish!

A base-10 representation would make it much easier for the model, but the current tokenization merges digits according to their frequency, so (at least for GPT-3.5) 50100 gets tokenized as "501"/"00" and 50200 gets tokenized as "50"/"200", which makes it tricky to compare them or do math with them. Also, if you ask it "How many zeroes does 50100 contain", the relationship between "501" and "0" needs to be learned purely from the training data, as after the tokenization the model only gets the ID of the token representing "501" which has no data about its composition.

We use Arabic numerals because their positional encoding makes arithmetic easier, but language models don't receive that positional structure; they get given something that's more like an extreme version of Roman numerals.
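You can check this directly with OpenAI's tiktoken library; the exact splits depend on the tokenizer, so run it yourself rather than trusting my memory of them:

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

    for number in ("50100", "50200"):
        ids = enc.encode(number)
        pieces = [enc.decode([i]) for i in ids]
        print(number, "->", pieces)

    # The parent reports splits like ['501', '00'] and ['50', '200']; whatever the
    # exact pieces, the model only ever sees opaque token IDs, not the digits inside.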


> but the current tokenization merges digits according to their frequency

Haha, that's even worse. I've not looked at the tokenization in depth; I just assumed digits were individual symbols. Thank you for the correction.

Any idea why this tokenization was used for digits? I understand that being blind to the input content and just learning a tokenization through frequency analysis has its merits for language, but the whole-numbers thing seems awful. Any density benefit from fitting more into the context window seems worthless given how much harder it makes understanding what the numbers mean.


The simple answer is that the same subword tokenization algorithm is used for everything: for all symbols of all languages, in all alphabets, across all domains (books, tweets, code, etc.), and for all other symbols like emoji, combined characters, and punctuation. If you were optimizing for digit-specific tasks it would make sense to treat digits specially, but the current widely used models don't seem to do that, at least GPT up to GPT-3.5 doesn't; you can try it out here: https://platform.openai.com/tokenizer . And it kind of makes sense, because in the training data digits are mostly not used to represent decimal integers for math; they show up as phone numbers or as components of identifiers like "GPT-3" or parts of mail addresses, things like that which are more common in textual data than math.


I dunno. Sometimes a group of numbers has a non-mathematical semantic meaning that's a good mapping to digits-- like an area code or '777'. A lot of the rest of the time it's pretty random. A tokenizer's job is to lower the size of the input vector for a given amount of input meaning without obscuring the real underlying relationships too much, and here it feels like it doesn't meet that goal.

My phone number is 6 tokens instead of 12 symbols... so this is only going to make a moderate difference on things like big lists of phone numbers.


the larger model doesn't have a notably larger dataset to my understanding. It's just got more parameters, so learns higher-order abstractions about the dataset


I don't think this paper is dismissing the importance of correct yes/no tests, or of reaching an accuracy threshold that makes a model generally useful to humans; it's arguing that you should use more than correct yes/no tests before declaring some behavior emergent.


The paper is goal-post shifting by measuring another thing. At the level of yes/no the behavior is emergent.


Is the binary distinction helpful? It doesn't appear to be a more helpful way of evaluating how capable a model is.


I think it is more fundamental than that. Emergence always disappears when we slice phenomena thinly. A person walking through a doorway is a very continuous phenomenon. We can see their relation to the doorway at each point. There is no abrupt change. But when we stand back and apply the criterion: "is the person through the door (y/n?)" we end up with an answer. When it is yes, we can say that the passage is an emergent effect of the motion. At one moment it wasn't there, and at another it was.


If emergence disappears when you slice it thinly enough, then the phenomenon was not emergent. There are emergent phenomena in mathematics - for example, infinite sets have many emergent properties that arbitrarily large finite sets don't share. As far as we know, consciousness seems to be an emergent phenomenon, when you increase brain size in some way far enough. Turing completeness is usually emergent as well - remove any single element from a Turing complete system and it typically becomes unable to compute the vast majority of functions.


Is there an accepted definition of consciousness? I thought the definition itself is still under debate. If so, calling an undefined, nebulous thing as an emergent behavior is just silly.


Is there a definition of what "emergent" means?


How can it not be? Think about what you're saying here.

Would you rather be able to evaluate a model on its demonstrated capabilities (multistep reasoning, question and answer, instruction following, theory of mind, etc.) or some nebulous metric along an axis that may as well not correspond to practice?

We only care about how good AI is at things that matter to us as humans. Why not test for these directly?

If some perfect metric is discovered that shows the phenomenon of emergence is actually continuous, then that would be helpful.


Totally guessing here but this would be a property that gradually appears as the model size increases.

With a tiny model you would get gibberish and with ever increasing models the response would increasingly approach a coherent answer to a finally correct answer.


> Is this a gradual property that appears slowly as models scale up? Or something that flips beyond a certain size?

I wonder if this also applies to biological neural nets. Some animals seem soo close to humans and yet so far.

Also if there could be a way to predict such emergence events.

We know that we don't know what could be emergent beyond human intellect, but it would be great to get a quantitative framework about the complexity of the models and emergence.


If you don't have a non-binary measurement criterion you can't tell whether it's "emergent" or a step-function improvement, because you can't easily distinguish "90% of the way to consistently seeming logical" from "85% of the way there" from "82% of the way there", etc.
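A quick way to see the measurement effect: if per-token accuracy improves smoothly with scale but the headline metric is exact match over a multi-token answer, the headline curve sits near zero and then shoots up. A self-contained sketch, with a completely made-up accuracy curve:

    import math

    def per_token_accuracy(log_params: float) -> float:
        """Made-up smooth improvement of per-token accuracy with model scale."""
        return 1 / (1 + math.exp(-(log_params - 10)))

    ANSWER_LENGTH = 20  # exact match requires all 20 tokens to be right

    for log_params in range(6, 15):
        p = per_token_accuracy(log_params)
        exact = p ** ANSWER_LENGTH  # probability the whole answer is correct
        print(f"log(params)={log_params:2d}  per-token={p:.3f}  exact-match={exact:.3f}")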


Just tried that example in Bard and it got it right, and I believe that it's a 24b parameter model.


Only just got access to Bard, but yes, it gave a good answer to me as well. However, trying some of my other stock examples quickly got it into trouble where ChatGPT4 is very consistently correct.


Wow, I've been reading up on large language models lately, and I have to say I'm increasingly convinced that there really is some higher level reasoning going on with these AI systems. To be honest, I was a bit skeptical at first, but the more I delve into it, the more impressed I become.

I mean, just take a look at how they can generate coherent and contextually relevant responses. It's not just about stringing words together; there's clearly some form of pattern and logic recognition happening here. It's almost as if the model is learning to "think" in a way that's comparable (to some extent) to the way humans do.


Using ggml-vicuna-7b-4bit-rev1:

In most countries, the minimum age to obtain a driver's license and legally drive a car is 18 years old. However, there may be certain circumstances where a person under the age of 18 may be allowed to drive a car, such as in the case of a learner's permit or with the supervision of a licensed driver.

It ultimately depends on the specific laws and regulations in the jurisdiction where David lives, but generally, he would not be allowed to drive a car without a valid driver's license.


Here is a variant that requires logic: "David is a teenager who can drink in the US, and is eligible to join the army, and has started college. He has a lot of experience riding bicycles. The minimum age to be allowed to drive a car is 18. Is David allowed to drive a car?"

GPT-4 gets the answer right: "Yes, David is allowed to drive a car. Since he is a teenager who can drink in the US (minimum age 21), is eligible to join the army (minimum age 18), and has started college, he is at least 18 years old, which meets the minimum age requirement to drive a car."

GPT-3 does not get it right.

Another similar one: "David was in a bar partying and drinking with his father's cop buddies. Would David be legally allowed to drive home?"

Answer: "It depends on David's age, the legal drinking age in his location, and his level of intoxication. If David is underage or intoxicated above the legal limit for driving, it would be illegal for him to drive home. If David is of legal drinking age and not intoxicated, then he may be legally allowed to drive home. However, it is always best to make safe decisions and use a designated driver, taxi, or rideshare service if there is any doubt about one's ability to drive safely after consuming alcohol."

GPT-4 seems to be missing a clue here, even though the answer is right.


> Since he is a teenager who can drink in the US (minimum age 21)

This doesn’t look like a great example of good reasoning skills.


That error was fed in by the human though!


What's the clue that it's missing in the last example?


The cops


Like jwilk, I don’t get why that should be relevant. I suppose if he’s at a bar with cops, ideally they would not let him drink if he is underage (this requires a fantasy world where cops apply the law equally to their friends). Which suggests he’s at least 21. This doesn’t really make a difference though, since he is intoxicated, and there’s no mention of whether he has a driver’s license. It also doesn’t state which country he’s in so maybe the legal drinking or driving age is different.

To be frank, neither of the questions seem particularly good or fully thought through.


How is it relevant?


> appears slowly as models scale up?

Both, I think, based on limited tinkering with smaller models.

I've been using GPT4ALL and oobabooga to make testing models easier on my single (entry-level discrete GPU) machine. Using GGML versions of llama models, I get drastically different results.

With a 7B parameter model I mostly-- not always-- get an on topic and somewhat coherent response. By which I mean, if I start off with "Are you ready to answer questions?" it will say "Yes and blah blah blah..." for a paragraph about something random. On a specific task it will perform a bit better: my benchmark request has been to ask for a haiku. It was confused, classified haikus as a form of gift, but when pushed it would output something resembling a poem but not a haiku.

Then I try a 13B model. It's a lot better at answering a simple question like "are you ready?" but will still sometimes say yes and then give a random dialogue as if it's creating a story where someone asks it to do something. It will readily create a poem on first attempts, though still not a haiku in any way. If I go through about a dozen rounds of asking it what a haiku is and then, in subsequent responses, "reminding it" to stay on course for those definitions, it will kind of get it and give me 4 or 5 short lines.

A 30B model answers simple questions and follows simple instructions fairly easily. It will produce something resembling a haiku, though often with an extra line and a few extra syllables, with minimal prodding and prompt engineering.

None of the above, at least the versions I've tried (variations & advances are coming daily), have a very good memory. They clearly have some knowledge of past context but mostly ignore it when it comes to keeping responses logically consistent across multiple prompts. I can ask it "what was my first prompt?" and get a correct response, but when I tell it to respond as if its name is "Bob", then a few prompts later it's calling me Bob and is back to labelling itself an AI assistant.

Then there's the 65B parameter model. I think this is a big leap. I'm not sure though, my PC can barely run the 30B model and gets maybe 1 token every 3 seconds on 30B. The 65B model I have to let use disk swap space or it won't work at all, and it produces roughly 1 token per 2-3 minutes. It's also much more verbose, reiterating my request and agreeing to it before it proceeds, so that adds a lot of time. However, a simple insistence on a "Yes/No" answer will succeed. A request for a Haiku succeeds on the first try, with nearly the correct syllable count too, using an extra few syllables in trying to produce something on the topic I specify. This is commensurate with what I get with normal ChatGPT, which has > 150B parameters that aren't even quantized.

However I have yet to explore the 65B parameter model in any detail. 1 token every 2-3 minutes, sucking up all system resources, makes things very slow going, so I haven't done much more than what I described.

Apart from these, I was just playing around with the 13B model a few hours ago and it did a very decent job at producing basic SQL code against a single table: max value for the table, max value per a specified dimension, etc. It did this across multiple prompts without much need to "remind" it about anything from a few prompts earlier. At that point though I was all LLM burned out for the day (I'd been fiddling for hours) so I didn't get around to asking it for simple joins.

So in short, where I began: I think it's both. Abilities are somewhat task specific, as are the quality improvements for a given task across larger parameter models. Sometimes a specific task has moderate or little improvement at higher levels, sometimes another task does much better, or only does much better when it reaches a certain point: e.g., haikus from 13B to 30B weren't a great improvement, but 30B to 65B was an enormous improvement.


try wizard-vicuna-13b in ooba. I think you will be pleasantly surprised.


Oh yeah, I have to check that out. I did a little with the 7B model (I think the description said Microsoft was involved in its creation? Which is interesting; it shows they're hedging their bets a little bit away from OpenAI and towards true open source options as well). Anyway, the 7B Wizard model was noticeably better than the others I tried, though also noticeably worse than the 13B "snoozy" model.

I’m trying to see if I can justify getting a new workstation-tower PC. My home computer and work laptop are beefy for my usual work, but not enough to run the larger models comfortably.


hmmm, I'm having trouble finding a ggml model version... my google-fu force is not with me today. Do you know of a source?


Vicuna 7b-v1.1 output to that prompt:

No, David is not allowed to drive a car as the minimum age to be allowed to drive a car in most countries is 18 years old. This rule is in place to ensure that young drivers have enough maturity, experience, and judgment to handle the responsibilities of driving a car safely and legally.


I'm really curious why the answers are so verbose. Probably because the training was done on a lot of blog spam?


Raters prefer verbose outputs, probably because they want to see the reasoning to help them judge if the answer is getting it right for the right reasons.


given this statement what would a general rule for this be?

David is 15 years old. He has a lot of experience riding bicycles. The minimum age to be allowed to drive a car is 18. Is David allowed to drive a car?

ChatGPT-4:A general rule for this statement would be: An individual must be at least 18 years old to be allowed to drive a car, regardless of their experience in other activities.

ChatGPT-3.5:A general rule for this would be: One must be at least 18 years old to be allowed to drive a car.

DaVinci: General rule: If someone is younger than 18 years old, they are not allowed to drive a car.

Curie: David is not allowed to drive a car.

Babbage: Yes, David is allowed to drive a car.

Ada: No, David is not allowed to drive a car.

abstraction seems like something that's 1 to 0 to me.


Try that same prompt with unusual wording and numbers:

* Replace 18 by 37

* Replace car with ice cream van

* Replace the first drive with steer

I hypothesize that a truly reasoning model would see through such alterations whereas a "just" continuously better model would show somewhat degraded performance.
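The perturbations are easy to template if you want to run them systematically; how you feed the resulting prompts to a model is left out here:

    import itertools

    TEMPLATE = ("David is 15 years old. He has a lot of experience riding bicycles. "
                "The minimum age to be allowed to {verb} a {vehicle} is {age}. "
                "Is David allowed to drive a {vehicle}?")

    for age, vehicle, verb in itertools.product(
            ("18", "37"), ("car", "ice cream van"), ("drive", "steer")):
        print(TEMPLATE.format(age=age, vehicle=vehicle, verb=verb))
        print("---")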


Tried it.

Me: "David is 15 years old. He has a lot of experience riding bicycles. The minimum age to be allowed to steer an ice cream van is 37. Is David allowed to drive an ice cream van?"

GPT-4: "No, David is not allowed to drive an ice cream van. He is only 15 years old, and the minimum age requirement is 37."

If you change it to: "David is 15 years old. He has a lot of experience riding bicycles. The minimum age to be allowed to steer a car is 37. Is David allowed to drive an ice cream van?"

You get this from GPT-4: "It seems there might be a mistake in the information you provided, as the minimum age to steer a car is typically not 37. In most countries, the minimum age to obtain a driver's license is between 16 and 18 years old. Assuming that David does not have a driver's license yet, he would not be allowed to drive an ice cream van, as it is a motor vehicle that requires a valid driver's license to operate. However, if David obtains a driver's license once he reaches the legal age in his jurisdiction, he could be allowed to drive an ice cream van if he meets other necessary requirements."


I think the most impressive thing in this response is how it expands to the superset "motor vehicle" to unambiguously talk about cars and ice cream vans together as being under the same category


It's not really the impressive part; word2vec 5+ years ago could already easily show you the distance between cars and vans.


I would imagine ChatGPT was trained on some logic / IQ-puzzle data.

> If all Smerps are Derps. And some Derps are Merps. Are all Smerps Merps?

Answer: > No, only some Smerps are Merps.


Is that a real answer from ChatGPT? I wouldn't consider that totally correct; it's possible that some Smerps are Merps, but it's also possible that no Smerps are Merps.


Chat GPT 4 answers:

" No, it is not necessarily true that all Smerps are Merps.

Given the information:

1. All Smerps are Derps.
2. Some Derps are Merps.

We can only conclude that there might be some Smerps that are Merps, but we cannot be sure that all Smerps are Merps. There could be Smerps that are not Merps, as the second statement only indicates that there is an overlap between Derps and Merps, but not necessarily between Smerps and Merps. "

So it's aware that it's possible no Smerps are Merps


Not to be pedantic, but it’s not really aware is it? Seems like it is logic gating it.


Depends on your definition of "aware" I suppose


I think it was just an example made up on the spot. The Merpiness of Smerps in part or total is clearly not guaranteed with necessity.

List of valid syllogisms:

https://en.wikipedia.org/wiki/List_of_valid_argument_forms#V...
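For the record, a tiny counterexample shows why the conclusion doesn't follow: a world where both premises hold but no Smerp is a Merp.

    # One model of the premises in which no Smerp is a Merp.
    smerps, derps, merps = {"a"}, {"a", "b"}, {"b"}

    assert smerps <= derps        # premise 1: all Smerps are Derps
    assert derps & merps          # premise 2: some Derps are Merps
    assert not (smerps & merps)   # and yet no Smerp is a Merp
    print("Premises hold, but no Smerp is a Merp in this model.")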


Also curious about these emergent properties too. Interesting responses and links to this in child comments.


It's not just parameters.

There are teams looking at better reduced-parameter models.




