Generating natural-sounding synthetic speech using brain activity (humanbioscience.org)
145 points by techben on Sept 2, 2019 | 39 comments



Amazing. One thing I'm not clear on is this: did they have to re-establish the brain activity -> muscle movements model for each patient? Because presumably that wouldn't have worked for a paralysed patient. In that case, the question is how hard is it to generalise a brain activity model so that it can be trained on one population and then used to get data from a paralysed person's brain?


Yes, typically in brain-computer interface tasks there is a need to retrain models for different individuals, or even for the same individual on different days/weeks/months. Another summary page states:

> The researchers also found that the neural code for vocal movements partially overlapped across participants, and that one research subject’s vocal tract simulation could be adapted to respond to the neural instructions recorded from another participant’s brain. Together, these findings suggest that individuals with speech loss due to neurological impairment may be able to learn to control a speech prosthesis modeled on the voice of someone with intact speech.

As for paralyzed individuals, that is a primary target of this area of research. Things get considerably more complex in those cases, however, as they would need to start out with a pre-trained model which couldn't be naturally adapted by listening to their own speech. Additionally, any brain damage which may have contributed to their condition can impair some of the signals in question. Overall the particular problem seems to be advancing from when I last worked with a locked-in patient, but there's still a good ways to go.
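
To make the adaptation idea concrete, here's a rough sketch (PyTorch; the module names, sizes, and two-stage split are my own illustrative assumptions, not the paper's actual code): keep a pretrained articulation-to-speech stage frozen and retrain only the neural front end on the new participant's recordings.

    import torch
    import torch.nn as nn

    # Hypothetical two-stage decoder: neural activity -> articulator
    # kinematics -> acoustic features, loosely mirroring the paper's pipeline.
    class NeuralFrontEnd(nn.Module):          # participant-specific
        def __init__(self, n_channels=256, n_articulators=33):
            super().__init__()
            self.rnn = nn.GRU(n_channels, 128, batch_first=True)
            self.out = nn.Linear(128, n_articulators)
        def forward(self, ecog):              # (batch, time, channels)
            h, _ = self.rnn(ecog)
            return self.out(h)

    class ArticulationToSpeech(nn.Module):    # shared across participants
        def __init__(self, n_articulators=33, n_acoustic=32):
            super().__init__()
            self.rnn = nn.GRU(n_articulators, 128, batch_first=True)
            self.out = nn.Linear(128, n_acoustic)
        def forward(self, kinematics):
            h, _ = self.rnn(kinematics)
            return self.out(h)

    front_end = NeuralFrontEnd()
    synthesis = ArticulationToSpeech()
    # synthesis.load_state_dict(torch.load("pretrained_participant_A.pt"))

    # Freeze the shared stage; adapt only the front end to participant B.
    for p in synthesis.parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam(front_end.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    ecog = torch.randn(8, 200, 256)             # stand-in neural recordings
    target_acoustics = torch.randn(8, 200, 32)  # stand-in acoustic features
    pred = synthesis(front_end(ecog))
    loss = loss_fn(pred, target_acoustics)
    loss.backward()
    optimizer.step()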


I wonder if this tech could one day be useful for general users without disabilities.

> Even when the researchers provided the algorithm with brain activity data recorded while one participant merely mouthed sentences without sound, the system was still able to produce intelligible synthetic versions of the mimed sentences in the speaker’s voice.

That already seems borderline usable for people with speech. I wonder if it could be made to work when you're merely thinking about the mouth movements without actually making them. That would be ideal.


It'd be great if this technology matures to the point where we can perform telepathic communication, à la Ghost in the Shell.


Beyond radio comms, I think a telepathic "Hey [Siri/Alexa/etc]..." could have some startling social implications.

It might even be the killer app to make elective brain implants mainstream. (Of course if this could be done without hazardous and expensive implants, all the better.)


When you just have to think about buying something, and Alexa orders it for you :O


That's something I've been thinking about a lot recently. Not so much accidental orders, but rather who these software assistants will be owned and controlled by in a foreseeable future where this sort of tech creates tighter couplings between software assistants and our own minds.

If the software assistants are sufficiently useful and tightly coupled with the human mind, I think it quite likely that the line between self and software might get blurry for some users. The ability to think a question and hear a correct answer as a voice in your head is the sort of profoundly powerful user experience that I think might plausibly alter the assumptions people make about what it means to be themselves.

If these software assistants become a part of users' own minds in their own perception of themselves, what responsibilities do the owners/operators of those systems have to their users?

I guess we'll cross that bridge when we get there, but the relative immaturity of FOSS software assistants is starting to unnerve me. In 2040, when Amazon starts selling "god in a box" to the general public, a two-way telepathic connection to a state-of-the-art quasi-AGI living in the cloud, will there be a viable FOSS alternative?


It's already blurry; humans are already used to using chunks of the environment as parts of their state of mind. I have personal experience with this: as I've been dragged from my previous highly-customized Linux desktop into a more software-conformity-centric world, it's had pretty distinctly constraining effects on the way I'm able to use my mind, making various forms of easy external fluidity close to impossible. I've tried to fight against it, but the environment evolves so that people expect you to have the executive function of all their devices combined, and the only way that you're allowed to is by having either the same or similar devices or orders of magnitude more resources.

Toddlers are, today, growing up with iPads, YouTube, and Alexa/Siri as just a natural part of the world. As far as I've observed—admittedly not much—educators and parents are far behind, I would speculate both because grasping the indirection-of-agency that this sort of technology creates is a heavy abstract task of the kind that doesn't seem to filter through those parts of civilization well and because the technology has the ability to change too quickly to counter any attempts to pin it down. And pinning it down in too static a fashion could have its own horrible effects.

“FOSS” in the original sense is largely a distraction here (even though it is still an important idea), given that we've wound up in the “programming is specialized” world. The dynamic characteristics involve how agency flows through systems, and in the presence of highly distributed and often SaaSS (Service as a Software Substitute, as the FSF describes it) systems, being able to alter the source code isn't a solid defense even at “skilled programmer” speeds: going against a rushing current just means you get torn apart as soon as you touch the world. I think we need a new word for what you probably meant but which I don't know how to articulate well.


Now, think about children having these augmentations activated from birth. Their "self" would be essentially enmeshed with the external service. Scary.

Edit: strange that I didn't think of it directly, but that is the Borg from Star Trek.


In a way, we're already there, no? We listen to what Yelp reviewers tell us. We stop trying as hard to memorize facts as we offload our cognitive functions to Google search.

I guess it's the tighter coupling, compared to what we have now, that makes the idea repulsive.

Also, as far as FOSS alternative goes, I wouldn't count on it. It's not so much the code - in time it'd be the huge data-crunching that counts, something only big corporations are able to do.


Even if we just had reliable FOSS voice recognition without the rest, I think hackers could create powerful user experiences. But alas, even that seems to be asking too much. There are some FOSS efforts to implement state-of-the-art solutions with lots of training data; Mozilla has been working on this, from what I understand, but last I checked nothing was really ready yet, and the stuff Mozilla is working on needs really beefy server hardware to run. I think that unfortunately disqualifies it as a viable competitor to the commercial offerings (which also use expensive hardware, but don't require end users to know anything about it).


the very possibility of my thoughts being transmitted to any cloud provider in real time sounds very scary to me


Surely that won't be the only option, and you'll be able to run, say, Arch Linux on your implant. Just be sure to read the news before a full system upgrade, lest you end up in a coma and need a hard reboot with a live USB image.


At least you won't have to tell everyone you use Arch, as the implant will telepathically inform everyone nearby automatically.


I think thought-based Googling will be the true killer app. Home assistant stuff would be really nice as well, though.


Well, that's one of the aims of Neuralink:

https://waitbutwhy.com/2017/04/neuralink.html


I applaud them for trying hard to make the operation as approachable as LASIK, but puncturing a hole through my skull, no matter how tiny the hole is, and sewing some tiny threads into my white matter, is a no-no for me.

But then again, the thought of a laser cutting your cornea was probably incomprehensible too at its inception.


"Metaman: The Merging of Humans and Machines into a Global Superorganism" is a pretty interesting read in this area.


The title is intriguing. You got a summary of it?


At a very high level, through BMI and direct brain to brain communication it will be possible to solve incredibly complex problems and create an exponential acceleration in human evolution.


IMO, this is the tech that would make smartglasses REALLY start to make sense, even more so than augmented reality.


I prefer my brain to be airgapped.


Video (buried in the article): https://www.youtube.com/watch?v=3pv0vT82Cys


At 2 minutes:

They can produce words from the brain even when the subject does not speak them aloud. Interrogation tool?


No, this approach is not reading thoughts. The video states that the subjects still mouth the sentences, which triggers similar muscle-movement signals, and those signals appear to be what this method targets. Without the signals to move the muscles, no intelligible speech should be detectable using this approach.


I mean, we can already read subvocalizations.

What I mean is, when we think about moving our arms, the same parts of our brains activate as when we actually move our arms. Maybe when we think about talking, a similar thing happens.


Here is the Google cached version. It seems the site is offline.

http://webcache.googleusercontent.com/search?q=cache:https:/...

Copy the link above to view the research.


Why aren't they replacing the mumbling with closest match words in the synthesizer?

Anyway, putting electrodes inside the brain is not for the general public. Is this at all possible without those intrusions?


Approaches like this are not possible without invasive methods. Placing electrodes in the brain or on the brain provides considerably higher signal fidelity.


That's unfortunate. I have RSI in both my hands and throat and something like this would really be a life changer.


Slap on a style transfer layer and it's a wrap.

Disclaimer: I know very little about how to actually do ML stuff, it just seemed like something that'd be possible in the near future.


I work in this space, so I'd love to give a bit of detail on the ML if you're interested:

Style transfer is trickier to do with speech than with images! One significant issue is the lack of a good "content" versus "style" distinction. In images you can get great results by calling the higher-level features of an object-classifier network "content" and holding that constant. Some people have tried this for audio with e.g. a phoneme classifier, but there are additional characteristics (such as inflection) that convey the emotional content of speech and wouldn't be held constant.
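
For anyone curious what that recipe looks like in code, here's a minimal sketch of the "hold the classifier features constant" idea transplanted to audio. The phoneme classifier here is an untrained stand-in (in practice you'd load a real pretrained one), and as noted above this still wouldn't pin down inflection:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Stand-in for a pretrained phoneme classifier operating on
    # mel-spectrogram frames; in practice you'd load real weights.
    phoneme_classifier = nn.Sequential(
        nn.Conv1d(80, 128, kernel_size=5, padding=2), nn.ReLU(),
        nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
    )

    def content_loss(generated_mel, source_mel):
        # "Content" = higher-level classifier features; penalize the
        # generated audio for drifting away from the source's features.
        feats_gen = phoneme_classifier(generated_mel)
        with torch.no_grad():
            feats_src = phoneme_classifier(source_mel)
        return F.mse_loss(feats_gen, feats_src)

    source_mel = torch.randn(1, 80, 400)        # (batch, mels, frames)
    generated_mel = torch.randn(1, 80, 400, requires_grad=True)
    loss = content_loss(generated_mel, source_mel)
    loss.backward()                             # gradients flow to generated_mel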

Another issue is that much of the speech-classification work is done in spectrogram (or, with further processing, MFCC) space, which lets you treat audio similarly to images and leverage a bunch of technology that we have for classifying those. But for synthesizing speech, spectrograms aren't a fantastic representation, because small errors in spectrogram space can translate into large errors in the waveform which are very clearly audible, and humans in general are pretty sensitive to audio errors. There are cool neural spectrogram-inversion methods out there which can help, but those still need to be trained to be robust to the kinds of errors that a style transfer algorithm would make, so it's still pretty tricky.
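
As a concrete example of that round trip, here's roughly what it looks like with librosa, using plain Griffin-Lim inversion rather than a neural inverter. The parameter values are arbitrary, librosa.example downloads a short demo clip, and the random nudge below stands in for the kind of spectrogram-space error that tends to come back as audible artifacts:

    import numpy as np
    import librosa

    y, sr = librosa.load(librosa.example("trumpet"), sr=16000)

    # Forward: waveform -> mel spectrogram (the "image-like" view).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)

    # Pretend a style-transfer step nudged the spectrogram slightly.
    mel_edited = mel * np.random.uniform(0.95, 1.05, size=mel.shape)

    # Inverse: mel -> waveform via Griffin-Lim (a neural vocoder would
    # usually sound better, but still has to tolerate such errors).
    y_rec = librosa.feature.inverse.mel_to_audio(mel_edited, sr=sr,
                                                 n_fft=1024, hop_length=256)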

My company, Modulate, is building speech style transfer tech; and we've found a lot more success with adversarial methods on raw audio synthesis, where the adversary forces the generator to produce plausible speech from the target speaker!

One of the coolest parts of the kind of BMI research in this article, to me, is the potential to buy back some latency margin for speech conversion! If you're working on already-produced speech, there are super tight latency requirements if you want to hear your own speech in the converted voice - go beyond 20-30ms for the entire audio loop and you start to get echo-like feedback that makes speaking difficult. Even without looping back, you don't want more than 100-200ms of latency in a conversation before it starts impeding the flow of dialogue. This means your style transfer algorithm gets almost no future context, and that limits the kinds of manipulations you can do (not to mention the size of the network you can do them with, depending on available compute power!).
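
For a sense of scale, some back-of-the-envelope arithmetic on those budgets (my numbers, assuming a 16 kHz speech pipeline):

    SAMPLE_RATE = 16_000          # Hz, a common rate for speech models

    def samples_in(ms):
        return int(SAMPLE_RATE * ms / 1000)

    # Hearing your own converted voice: whole loop under ~20-30 ms.
    loopback_budget = samples_in(30)        # 480 samples total
    # Conversational latency: ~100-200 ms before dialogue flow suffers.
    conversation_budget = samples_in(150)   # 2400 samples total

    # A typical 25 ms analysis window with a 10 ms hop already eats
    # most of the loopback budget, leaving almost no future context.
    window, hop = samples_in(25), samples_in(10)
    lookahead = loopback_budget - window
    print(f"loopback budget: {loopback_budget} samples, "
          f"window: {window}, lookahead left: {lookahead} samples")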


For the time being the work is still in the early stages of computational-neuroscience research. Once things get to the point of commercial/medical viability, I'd imagine either the direct synthesis model will improve sufficiently or something akin to Parrotron's voice normalization may be of use: https://google.github.io/tacotron/publications/parrotron/ind... .

For actual patients using the system, however, I'd expect it would be beneficial to keep the output as low-latency and as unmodified as possible, as this will help the individual learn to control the system as neural activity shifts over the course of time.


This is super cool! Does anyone have any insight into whether this sort of muscle-movement based vocalization has been done in the past?


The Cognitive Systems Lab at the University of Bremen[1], Germany, does a lot of research in that field, and I had the pleasure of visiting them a few months ago. If you are interested you should find a lot of research on their homepage, ranging from regular speech-to-text, over silent speech (aka muscle movement to text), up to brain-to-text.

[1] https://www.uni-bremen.de/en/csl/

edit: small correction of my bad english :P


There's a rich literature in birdsong. Also projects at WUSTL and, I believe, MGH on human vocalizations.


The proofs of concept accumulating in this space are so exciting. It's hard not to jump straight into runaway speculation.


I'm curious, what's the effective bit rate? Speech and language have excellent priors, and fancy speech synthesis has been a thing for a while.

Either way, cheers for getting something to work well!


So they use the brain to bypass the part that isn't working properly. That makes sense.



