This is not my area of expertise, but if I understand the article correctly, they created a model that matches pre-existing audio clips to pre-existing images. But instead of returning the matching image, the LLM generates a distorted fake image which is vaguely similar to the real image.
So it doesn't really, as the title claims, turn recordings into images (it already has the images) and the distorted fake images it creates are only "accurate" in that they broadly slot into the right category in terms of urban/rural setting, amount of greenery and amount of sky shown.
It sounds like the matching is the useful part and the "generative" part is just a huge disadvantage. The paper doesn't seem to say if the LLM is any better than other types of models at the matching part.
I think you are misunderstanding. I don't think the network matches the audio to a ground truth image and generates an image. It just takes in audio and predicts an image. They just use the ground truth images for training the model and for evaluation purposes.
The generated images are only vaguely similar in detail to the originals, but the fact that they can estimate the macro structure from audio alone is surprising. I wonder if there's some kind of leakage between the training and test data, e.g. sampling frames from the same videos, because the idea that you could get the time of day right (dusk in a city) from audio alone seems improbable.
EDIT: also minor correction, it's not an LLM it's a diffusion model.
EDIT2: my mistake, there is an LLM too!
I’ve heard clips of hot water being poured vs cold water, and if you heard the examples, you would probably guess right too.
Time of day seems almost easy. Are there animal noises? Those won't sound the same all day. And traffic too. Even things like the sound of wind may generally be different in the morning vs at night.
This is not to suggest the researchers aren't leaking data, or that the examples weren't cherry-picked; it seems probable they are doing one or the other. But it is to say that if a model were trained on a particular intersection and then heard a sample from it, it could probably predict the time of day reasonably well.
Weather patterns have a daily/seasonal rhythm… the strength and direction of the wind will have some distribution that differs at different times of day. Temperature and humidity as well, like the other poster said.
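To put it concretely, here's a minimal sketch of the kind of baseline I have in mind: a plain classifier on summary audio features, predicting a coarse time-of-day bucket. Everything here is hypothetical (stand-in noise instead of real clips, made-up labels), not the paper's setup:

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def clip_features(y, sr=22050):
    # Summarize a clip with mean/std of MFCCs - a crude fingerprint of
    # the soundscape's overall texture (birdsong, traffic, wind).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical data: 5-second clips from one intersection, each labeled
# with a coarse period (0=night, 1=morning, 2=afternoon, 3=evening).
rng = np.random.default_rng(0)
clips = [rng.standard_normal(22050 * 5) for _ in range(60)]
periods = rng.integers(0, 4, 60)

X = np.stack([clip_features(c) for c in clips])
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, periods, cv=5).mean())
```

With real recordings from a fixed location, I'd expect this sort of thing to do well above chance, which is all the "dusk in a city" result really needs.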
It certainly looks like some amount of image matching is going on. Can the model really hear the white/green sign to the left in the first example in figure 3? Can it hear the green sign to the right and red things to the left in the last example?
Yeah, I also saw that sign and thought - 'yeah, this is bullshit.'
It's got exactly the same placement in the frame - which would require some next-level beamforming capability - and also has the same color, which is impossible. There's some serious data leakage going on here.
[edit] The bottom right image is even more suspect. There's a vertical green sign in the same place on the right side of the image, but also some curious red striping in the distance in both images. One could argue 'street signs are green' but the red striping seems pretty unique, and not something where one would just guess the right color.
That would be explained by data leakage too, e.g. sampling frames in the train and test data from the same video sequences. There's nothing in the writeup that suggests the model is explicitly matching audio to ground-truth images.
The researchers' suggestion that certain architectural features might have been encoded in the sound [which is at least superficially plausible] is rather undermined by the data leakage in the model also leading it to generate the right colour signage in the right part of multiple images. The fidelity of the sound clearly isn't enough for the model to register key aspects of the sign's geometry, like it only being a few feet from the observer, but it has somehow managed to pick up that it's green and x pixels from the left of the image...
I don't know if data leakage is the right word - maybe overfitting, if they took a 1-hour clip from the same place and used 90 percent for training and 10 percent for eval/test?
It is still a decent way to start, I think, but after that it needs more varied data, with different geographical locations held out for eval and test.
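The usual guard against that failure mode is a group-aware split: hold out whole videos (or whole locations) rather than individual frames. A minimal scikit-learn sketch, where `frames` and `video_ids` are hypothetical stand-ins, not anything from the paper:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# One row per training sample; samples cut from the same source
# video share a video_id.
frames = np.random.rand(1000, 64)           # stand-in features
video_ids = np.random.randint(0, 50, 1000)  # stand-in source-video tags

# Hold out entire videos, not individual frames.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(frames, groups=video_ids))

# No video now appears on both sides, so the model can't score well
# just by memorizing a scene it saw seconds earlier during training.
assert set(video_ids[train_idx]).isdisjoint(video_ids[test_idx])
```

Grouping by location id instead of video id would be stricter still, and is what the geographic-holdout suggestion above amounts to.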
In response to the correction: the paper says that "we propose a Soundscape-to-Image Diffusion model, a generative Artificial Intelligence (AI) model supported by Large Language Models (LLMs)", so there's an LLM involved somewhere, presumably?
The word "accurate" in that headline is doing a LOT of work.
Here's how the results were scored:
"Computer evaluations compared the relative proportions of greenery, building and sky between source and generated images, whereas human judges were asked to correctly match one of three generated images to an audio sample."
So this is very impressive and a cool piece of research, but unsurprisingly not recreating the space "accurately" if you assume that means anything more than "has the right amount of sky and buildings and greenery".
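For concreteness, that computer evaluation boils down to comparing three pixel fractions per image. A toy version, assuming per-pixel labels from some segmentation model (the class ids and sizes here are made up):

```python
import numpy as np

GREEN, BUILDING, SKY = 1, 2, 3  # hypothetical class ids from a segmenter

def class_proportions(mask):
    # Fraction of pixels labeled greenery / building / sky.
    return np.array([(mask == c).mean() for c in (GREEN, BUILDING, SKY)])

def proportion_error(real_mask, generated_mask):
    # Two images "match" under this metric if these three fractions are
    # close - note how little of the actual scene content it checks.
    return np.abs(class_proportions(real_mask) - class_proportions(generated_mask)).sum()

# Stand-in 64x64 label maps; real ones would come from a segmentation model.
rng = np.random.default_rng(0)
real, fake = rng.integers(0, 4, (64, 64)), rng.integers(0, 4, (64, 64))
print(proportion_error(real, fake))
```

A generated image can score perfectly on this while getting every object, color and position wrong.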
But it's not reporting one bit; it's generating a detailed (color!) photo. Hundreds of kilobits of made-up garbage plus one accurate bit cannot reasonably be described as an "accurate" result.
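Back-of-envelope, assuming a 512x512 RGB output (the resolution is my assumption, not from the article):

```latex
% Bits generated vs. bits actually verified by the evaluation:
%   image:  512 \times 512 \times 24 \approx 6.3 \times 10^{6} \text{ bits}
%   scene category (say 8 classes): \log_2 8 = 3 \text{ bits}
\[
  \frac{512 \times 512 \times 24}{\log_2 8} \approx 2 \times 10^{6}
\]
```

So the verified content is on the order of one part in a million of what gets rendered.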
That's not an unreasonable question, but the larger point is that this sort of thing cannot be done perfectly accurately.
This was established mathematically, answering famous mathematician Mark Kac's 1966 question "Can one hear the shape of a drum?": the answer is no - there isn't a unique shape even when you're allowed to use arbitrary test sounds.
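For the record, the setup behind that slogan: what you "hear" is the Dirichlet spectrum of the Laplacian on the drumhead, and Gordon, Webb and Wolpert (1992) exhibited two non-congruent planar domains with exactly the same spectrum.

```latex
% "Hearing" a drum D means knowing the eigenvalues \lambda_n of the
% Dirichlet Laplacian. Kac (1966) asked whether this spectrum determines
% D up to congruence; Gordon-Webb-Wolpert (1992) answered no, with
% isospectral but non-isometric domains.
\[
  -\Delta u_n = \lambda_n u_n \ \text{in } D, \qquad
  u_n = 0 \ \text{on } \partial D, \qquad
  0 < \lambda_1 \le \lambda_2 \le \lambda_3 \le \cdots
\]
```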
I love this paper, but something I think is often missed when it comes up is that you CAN hear the shape of many drums if you restrict the shape space, for example with a prior of "what a drum should look like"
Zelditch proved spectral uniqueness for convex, simply connected drums with some symmetry.
If you add multiple hearing points, you massively constrain the space of possible drums. The question then becomes something like "can you see the shape of a drum?"
I initially wondered the same, but I don't think either "accurate" or "precise" is the correct word. The correct word here is "plausible" or maybe "realistic", as in, "Researchers Use AI To Turn Sound Recordings Into Plausible Street Images".
Probably the thing you really want from an article on this topic is to be able to see the images at larger than postage-stamp size. And, even more irritating - the images are actually there, in reasonable size, just not linked...
Some of these are not very accurate. That "countryside" image has entirely the wrong foliage color (fall colors vs. spring colors). It also appears to place buildings where the "ground truth" image shows a small stream.
I would not rely on this tool for any meaningful data collection.
I share the prevailing skepticism about this specific project, but it does seem plausible that desiccated fall foliage might interact with sound differently than supple new growth.
I am not by any stretch a mathematician, but AI research like this reminds me of things that excite mathematicians - it’s like people spent three hundred years playing with prime numbers and all of a sudden “oh yeah, silicon, fibre optics, aha, secure encryption”.
There are going to be real useful tools - but we need to play for another century before we have that aha moment. Probably :-)
I think this is cool, but it’s more of a statistical correlation than an AI-related paper.
What I’m saying is that if you were to replace ‘AI’ with “ask humans to draw an image based on these sounds,” you’ll probably get somewhat similar results.
This is more like a classifier. They have a bunch of human-classified image/sound pairs, and they match unclassified sounds to the classified sounds. Then there's a Midjourney image generation step, but that's probably unnecessary.
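In other words, something like nearest-neighbour retrieval in a joint audio-image embedding space. A sketch of that idea as I read it, not the paper's actual pipeline; the encoder and embedding bank here are stand-ins:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical: precomputed embeddings for the labeled image/sound pairs,
# plus an encoder that maps new audio into the same space.
rng = np.random.default_rng(0)
image_embeddings = normalize(rng.standard_normal((500, 128)))

def embed_audio(clip):
    # Stand-in for a real audio encoder (e.g. a CLIP-style tower).
    return normalize(rng.standard_normal(128))

def retrieve(clip):
    sims = image_embeddings @ embed_audio(clip)  # cosine similarities
    return int(np.argmax(sims))                  # index of best-matching image

print(retrieve(None))
```

Returning the retrieved image directly would make the evaluation honest; the diffusion step on top of it mostly adds plausible-looking detail.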
How do you determine if something is publish-worthy? If someone puts in a lot of effort experimenting with something that fails, it can still be publish-worthy so others can learn what works and what doesn't. It should be more about the level of effort, I think. Otherwise the incentives become all wrong too.
In this case the algorithm can determine broad classes like "rural" or "city", and aside from those classes the generated images have little connection with the audio. I think most DL researchers would agree that this is low-effort stuff, and therefore not publish-worthy. In addition to this the word "accurate" in the title is misleading.
Except that it is not. If you want to use this for tasks like navigation then it is useless.
Practically, it is the same as using ChatGPT to generate realistic scenery based on a text description.