
On a bit of a tangent, and hypothetical, but what if we pooled enough resources together to do a training run that includes everything a human can experience? I am thinking of all five senses and all the data that comes with them, e.g. books, movies, songs, recitals, landscapes, the wind brushing against the "skin", the pain of getting burned, the smell of coffee in the morning, the itchiness of a mosquito bite, etc.

It is not impossible, I think; it would just require so much effort, talent, and funding that the last thing resembling such an endeavor was the Manhattan Project. But if it succeeded, the impact could rival or even exceed what nuclear power has done.

Or am I deluded and there is some sort of fundamental limit or restriction on the transformer that would completely prevent this from the start?



But why would we do that even if we could? Making a very expensive machine act like a human is essentially useless; it is not as if there is a shortage of humans on Earth. It wouldn't even be a great model of a human brain.

The reason we are doing all this is for its potential uses: writing letters, writing code, helping customers, finding information, etc. Even AGI is not about making artificial humans; it is about solving general problems (that's the "G").

And even if we could make artificial humans, there would be a philosophical problem. Since the idea is to make these AIs work for us, if we make them as human-like as possible, isn't that slavery? It is like making artificial meat but insisting on making the meat-making machine conscious so that it can feel itself being slaughtered.


Because right now the main reason people still deny that LLMs are "intelligent" is that they have no connection to, or understanding of, the things they are saying. You can make one say 1+1=2, but it inherently has no real concept of what one thing is and what two things are. Its neural network has just learned the weights that give the most statistically likely answer based on what it was trained on, i.e. text.

So instead of training it that way, the network could potentially be trained to "perceive" or "model" reality beyond the digital world. The only way we know of, or have enough experience and data for, is through our own senses. An embodied AI is, I think, what is required for anything to actually grasp real concepts, or at least get as close to them as possible.

And without that inherent understanding, no matter how useful a model is, it will never be a "general" intelligence.


It makes sense to have an embodied AI, i.e. a robot. Self-driving cars count.

But it doesn't have to be modeled after humans. The purpose of humans, if we can call it that, is to make more of ourselves, like all forms of life. That's not what we build robots for. We don't even give robots the physical ability to do that. Giving them a human mind (assuming we could) would not be adequate. Wrong body, wrong purpose.


How would you model embodiment and embodied experience?


Sight and sound are quite obvious.

Taste and smell are matters of chemical composition. It would take an incredible effort, but something similar to a mass spectrometer could be used to detect every taste and smell we can think of and beyond. How fast and how efficient it could be is probably the main challenge.

Touch is difficult. We don't even fully know why or how an itch "works". But force, temperature, atmospheric-pressure, humidity sensors, etc. are widely available. They can provide a crude approximation, imo.

Just off the top of my head. I am sure smarter people can come up with much more suitable ways to "embody" a machine learning model.
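
To make the "crude approximation" concrete, here's a minimal sketch of what a single "touch" observation could look like as training data; all the field names, units, and normalization constants are my own assumptions, not any established format:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class TouchObservation:
        """Hypothetical bundle of cheap sensor readings standing in for 'touch'."""
        force_n: float          # contact force in newtons
        temperature_c: float    # surface temperature in degrees Celsius
        humidity_pct: float     # relative humidity, 0-100
        pressure_hpa: float     # atmospheric pressure in hectopascals

        def to_vector(self) -> np.ndarray:
            """Normalize into a fixed-length feature vector a model could consume."""
            return np.array([
                self.force_n / 50.0,        # assume ~50 N full scale
                self.temperature_c / 100.0,
                self.humidity_pct / 100.0,
                self.pressure_hpa / 1100.0,
            ], dtype=np.float32)

    # e.g. a warm, humid grip
    print(TouchObservation(force_n=12.0, temperature_c=34.5,
                           humidity_pct=70.0, pressure_hpa=1013.0).to_vector())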


I suggest, before throwing around words like "obvious" and "crude approximation", reading some Martin Heidegger, Hubert Dreyfus, or Joseph Weizenbaum. Half-baked attempts at mechanistic and reductionist implementations of embodiment are a dime a dozen.


I mean, sight and sound in large language models are obvious by now: rendering them into a token-based representation that an LLM can manipulate and learn from (however well it actually succeeds in picking up on the patterns; nothing about that is guaranteed, but the information is there in theory) is currently a conceptually solved problem that will be gradually improved upon.
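
For the visual side, the "conceptually solved" part is essentially patchify-and-project, roughly in the style of a vision transformer. A minimal sketch, where the patch size, embedding width, and untrained random projection are all illustrative placeholders:

    import numpy as np

    def image_to_patch_tokens(image: np.ndarray, patch: int = 16,
                              d_model: int = 768) -> np.ndarray:
        """Split an HxWx3 image into non-overlapping patches and project each
        patch to a d_model-dimensional embedding (random, untrained projection
        here, just to show the shape of the representation)."""
        rng = np.random.default_rng(0)
        h, w, c = image.shape
        assert h % patch == 0 and w % patch == 0, "image must be divisible by patch size"
        # (num_patches, patch*patch*c) matrix of flattened patches
        patches = (image.reshape(h // patch, patch, w // patch, patch, c)
                        .transpose(0, 2, 1, 3, 4)
                        .reshape(-1, patch * patch * c))
        projection = rng.normal(scale=0.02, size=(patch * patch * c, d_model))
        return patches @ projection   # one "token" embedding per image patch

    tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
    print(tokens.shape)  # (196, 768): 196 image "tokens" ready for a transformer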

If reproducing the artifacts and failure modes of human interpretation of this physical data (say, yanny/laurel, or optical illusions, or persistence-of-vision phenomena) is deemed important, that's another matter. If all that's required is a black-box understanding that is idiosyncratic to LLMs in particular, but functionally good enough to be used as sight and hearing, then I don't see why it can't be called "solved" for most intents and purposes in six months' time.

I guess it boils down to this: do you want "sight" to mean "machine" sight or "human" sight? The latter is a hard problem, but I'd prefer to let machines be machines. It's less work, and it gives us a brand-new cognitive lens to analyse what we observe, a truly alien perspective that might prove useful.


This seems to give up on the GP comment's goal of "everything a human can experience" and create nothing more than a fancy Mechanical Turk.


If the goal is to build a human experience simulator that reacts in the same ways a human would, then you can't just collect the sensory data; you need to gather data on how humans react (or have the model learn unsupervised from recorded footage of humans exposed to these stimuli). Unless maybe it's good enough to learn associations from literature and poetry.

No matter how you build it, it is still experiencing everything a human can experience. There's just no guarantee it would react the same way to the same stimuli. It would react in its own idiosyncratic way that might both overlap and contrast with a human experience.

A more "human" experience simulator would paradoxically be more and less authentic at the same time - more authentic in showing a human-style reaction, but at the cost of erasure of model's own emergent ones.


Is there anything except sensory input that you consider part of the embodied experience? What would that be?

Apart from that, I'm afraid that at this point research on sensory input other than audio and visual needs much more advancement. For example, it's not clear to me what kind of data structure would be a good fit for olfactory or other such sensory training data.


As mentioned above, olfactory data can be just chemical fingerprints. Mass spectrometers already do this and provide very distinct signals for every chemical component.

Touch and the like can be approximated through various sensors: temperature, force, humidity, electromagnetic, etc.
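
As a purely illustrative example of the kind of data structure that could hold such an olfactory "fingerprint", one could bin a mass-spectrum-like readout into a fixed-length intensity vector. The bin count, m/z range, and the example peaks below are made up:

    import numpy as np

    def spectrum_to_fingerprint(mz_values: np.ndarray, intensities: np.ndarray,
                                mz_max: float = 500.0, n_bins: int = 500) -> np.ndarray:
        """Bin a (mass-to-charge, intensity) readout into a fixed-length vector,
        normalized so the strongest peak is 1.0 -- one candidate 'smell' feature."""
        fingerprint = np.zeros(n_bins, dtype=np.float32)
        bins = np.clip((mz_values / mz_max * n_bins).astype(int), 0, n_bins - 1)
        np.add.at(fingerprint, bins, intensities)   # accumulate peak intensities per bin
        peak = fingerprint.max()
        return fingerprint / peak if peak > 0 else fingerprint

    # e.g. a made-up three-peak spectrum
    fp = spectrum_to_fingerprint(np.array([41.0, 93.0, 136.0]),   # hypothetical m/z peaks
                                 np.array([0.3, 1.0, 0.6]))
    print(fp.shape, fp.max())  # (500,) 1.0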


Sure, you can punch in a chemical fingerprint for, say, the smell of a specific type of rose. Maybe it doesn't matter for the learning process that in an equivalent human experience it was preceded by someone paying you a compliment a couple of minutes earlier, or that it was combined with all the other chemical fingerprints present at that moment: maybe it had just rained and there's a slight smell of wet earth in the air, or someone smoking a cigarette walked by and a trace of that lingers, or the window wasn't opened for a couple of hours and everything has a slight tint of "used air" to it. All of that might dampen the learning, which might be necessary for this specific kind of learning to happen slowly enough to sink in properly, etc.

Don't get me wrong, I would be curious to see such research done, to see whether it would improve anything above the stochastic-parrot level; it's just going to take a while to figure out what is even relevant.


I think the factors you mentioned are important but ultimately additional context to the main data, i.e. the "rose smell". They can certainly add meaning and alter how the main data is processed, but they are just "context" added on, much like how the word "lie" is very context-dependent and by itself it is nigh impossible to know what it means. Is it a verb? A noun? Which verb, lying to someone or lying on a couch?

But an LLM has no problem at all deciphering, processing, and, most importantly, responding meaningfully to all the ways we can use or encounter the word "lie". I contend that if a large enough model is trained on enough data, the concepts will automatically blend and explain each other sufficiently, or at least enough to cover your example and ones like it.
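
As a small illustration of that point (assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint; this only demonstrates contextual embeddings, not "understanding"), the same surface token "lie" gets noticeably different vector representations in the two readings:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    lie_id = tokenizer.convert_tokens_to_ids("lie")

    def lie_embedding(sentence: str) -> torch.Tensor:
        """Return the contextual embedding of the token 'lie' in the sentence."""
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
        position = inputs.input_ids[0].tolist().index(lie_id)
        return hidden[position]

    a = lie_embedding("He told a lie to his boss.")
    b = lie_embedding("I just want to lie on the couch.")
    print(torch.nn.functional.cosine_similarity(a, b, dim=0).item())  # well below 1.0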



