I suggest, before throwing around words like "obvious" and "crude approximation", reading some Martin Heidegger, Hubert Dreyfus, or Joseph Weizenbaum.
Half-baked attempts at mechanistic and reductionist implementations of embodiment are a dime a dozen.
I mean, sight and sound in large language models are obvious by now: rendering them into a token-based representation that an LLM can manipulate and learn from is currently a conceptually solved problem that will be gradually improved upon. How well a given model actually succeeds in picking up on the patterns is another question - nothing about that is guaranteed - but the information is theoretically there.
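To make the "rendering into tokens" part concrete, here's a minimal sketch of the ViT-style patchification that multimodal models broadly build on. The patch size, embedding width, and the random (rather than learned) projection are illustrative assumptions, not any particular system's pipeline:

```python
# A minimal sketch of turning an image into a sequence of "tokens" an
# LLM-like model could attend over. Patch size, d_model, and the random
# projection are illustrative stand-ins, not a real system's values.
import numpy as np

def image_to_tokens(image: np.ndarray, patch: int = 16, d_model: int = 768,
                    rng: np.random.Generator = np.random.default_rng(0)) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches and project
    each flattened patch to a d_model-dimensional token embedding."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # Rearrange into a (H/p, W/p) grid of (p, p, C) patches, then flatten each.
    grid = image.reshape(h // patch, patch, w // patch, patch, c).swapaxes(1, 2)
    patches = grid.reshape(-1, patch * patch * c)   # (n_patches, p*p*C)
    # In a trained model this projection is learned; here it's random.
    proj = rng.normal(0.0, 1.0 / patch, size=(patch * patch * c, d_model))
    return patches @ proj                            # (n_patches, d_model)

tokens = image_to_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): a 14x14 grid of patch tokens
```

The same move works for audio (spectrogram frames instead of image patches); once the data is a token sequence, the model can learn from it - or fail to, as noted above.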
If reproducing the artifacts and failure modes of human interpretation of this physical data (say, Yanny/Laurel, optical illusions, or persistence-of-vision phenomena) is deemed important, that's another matter. If all that's required is a black-box understanding that is idiosyncratic to LLMs in particular, but functionally good enough to be used as sight and hearing, then I don't see why it can't be called "solved" for most intents and purposes in six months' time.
I guess it boils down to this: do you want "sight" to mean "machine" sight or "human" sight? The latter is a hard problem, but I'd prefer to let machines be machines. It's less work, and it gives us a brand-new cognitive lens to analyse what we observe, a truly alien perspective that might prove useful.
If the goal is to build a human experience simulator that reacts in the same ways a human would, then you can't just collect the sensory data; you also need to gather data on how humans react (or have the model learn unsupervised from recorded footage of humans exposed to these stimuli). Unless maybe it's good enough to learn associations from literature and poetry.
No matter how you build it, it is still experiencing everything a human can experience. There's just no guarantee it would react the same way to the same stimuli. It would react in its own idiosyncratic way that might both overlap and contrast with a human experience.
A more "human" experience simulator would paradoxically be more and less authentic at the same time - more authentic in showing a human-style reaction, but at the cost of erasing the model's own emergent ones.