Remember when the PlayStation 2 was "technically a supercomputer" and taking LSD a certain number of times made you insane? Great moments in marketing history
Register readers with very long memories indeed will recall similar concerns being raised over Sir Clive Sinclair's ZX-81. The fear then was that the sneaky Sovs would try to buy heaps of ZX-81s for their Zilog Z80-A CPUs and mighty 1KB of RAM to upgrade their nuclear missile guidance systems.
I would say that this take was correct, just not in the way the detractors at the time intended. The danger was to the usefulness of the internet.
I have yet to see any benefit to society from GPT's improvements, but I do see the internet quickly becoming more and more unusable due to the inundation of machine-generated spam on nearly every communications platform.
In the most "doomer" possible view of my beliefs, yes, that'd be an accurate description.
I don't know if that will actually come true, of course - human society is pretty resilient in the face of its own self-caused adversity - but it already exacts a significant mental and emotional cost to filter out the "human mimics" on many platforms, especially search engines.
And it's even more plausible that this is just a marketing play to build hype. Say, for example, you've just made some new super-pathogen in your basement lab. It could kill everyone on the planet. This is obviously pretty dangerous, so do you:
A) silently dispose of it and hope nobody else ever makes the mistake of creating it.
or
B) keep it in the freezer and hold a press conference about how you made it but it's too dangerous to share any details.
> Our experiments
on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous
systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2
consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases
> This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public.
If you go back and look at older cities they almost all have the same pattern: walls and gates.
I figure now that the Internet is a badlands roamed by robots pretending to be people as they attempt to rob you for their masters, we'll see the formation of cryptographically-secured enclaves. Maybe? Who knows?
At this point I'm pretty much going to restrict online communication to encrypted authenticated channels. (Heck, I should sign this comment, eh? If only as a performance?) Hopefully it remains difficult to build an AI that can guess large numbers. ;P
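Signing a comment like this can be sketched with the standard library's HMAC as a stand-in; a real comment signature would use an asymmetric scheme (e.g. Ed25519 via PyNaCl) so readers can verify without holding the secret. The key and message below are made up for illustration:

```python
import hmac
import hashlib

# Hypothetical shared secret -- in practice you'd publish a public key
# and sign with the private half, rather than share a symmetric key.
SECRET_KEY = b"not-a-real-key"

def sign(message: bytes) -> str:
    """Return a hex MAC tag binding the message to the key holder."""
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

def verify(message: bytes, tag: str) -> bool:
    """Constant-time check that the tag matches the message."""
    return hmac.compare_digest(sign(message), tag)

comment = b"Hopefully it remains difficult to build an AI that can guess large numbers."
tag = sign(comment)
assert verify(comment, tag)
assert not verify(b"tampered comment", tag)
```

The point of the joke about "guessing large numbers" is exactly this: the scheme's security rests on the key being infeasible to guess, not on the channel being private.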
> so 2024 will be the last human election and what we mean by that is not that it's just going to be an AI running as president in 2028 but that will really be although maybe um it will be you know humans as figureheads but it'll be Whoever greater compute power will win
> AI-generated songs, like the ones featuring Prime Minister Narendra Modi, are gaining traction ahead of India’s upcoming elections. [...] Earlier this month, an Instagram video of Modi “singing” a Telugu love song had over 2 million views, while a similar Tamil-language song had more than 2.7 million. A Punjabi song racked up more than 17 million views.
Unfortunately in the US there is a political party that is attacking education and doesn't want people to learn critical thinking skills at a time when critical thinking is sorely needed. They happen to really "love the poorly educated" for some reason.
classified tech is generally at least 10 years ahead of anything the public has access to. judging by how bizarre and polarizing the previous two US elections have been, i wouldn't be surprised if this prediction had already played out and we just didn't know it yet
From a big tech POV, a political scandal would be most PR-damaging right before a big US election. In the window from right after one election to a couple of years before the next, it wouldn't be as big a deal. Once you open Pandora's box, a scandal in the next election cycle wouldn't matter as much for the company's responsibility/optics, since the tech is already out there and it's a "new normal" at that point.
People saying “too dangerous to release” usually means one (or more) of three things:
1. “… but if you and your big rich company were to acquihire us you'd get access…” — though as this is MS it probably isn't that!
2. That it only works as well as claimed in specific circumstances, or has significant flaws, so they don't want people looking at it too closely just yet. The wording “in benchmarks used by Microsoft” might point to this.
3. That a competitor is getting close to releasing something similar, or has just done so, and they don't want to look like they were in second place or too far behind.
weakest link, if they dont release this, someone else will release one. every time someone noble invents another gen ai toy/weapon, they lock it down with post filters so it cant be used for evil, and then a second person forks it, pops the safeties off, and tells the world to go nuts.
social solutions take too long to use against the tech, but tech solutions are fallible. to be defeatist about it, there's going to be a golden window of time here where some really nasty scams have no impediment.
I really want something that can do a voice change and match the emotion and articulation of a voice clip that I provide. I don't care (or want) it to be based off a real person and the manners in which they would tend to articulate a sentence. Are there any decent open models out there?
Try StyleTTS2. You will still have to experiment with the settings a little to get the right level of adherence to the reference speaker’s voice and the emotion content.
Without looking at this: are you sure it can do speech-to-speech? Maybe my flaw in searching has been disregarding anything labeled "text to speech" as not also being "speech to speech"?
Speech generation has gotten really good, but there's simply no way to faithfully recreate someone's vocal idiosyncrasies and cadence with just "a few seconds" of real audio. That's where the models tend to fall short.
This was my thought as well, but someone pointed out to me that regional accent identification captures a large percentage of cadence and inflection differences (specific word choices and turns of phrase obviously would still not be there).
Few seconds means less than a minute. That’s not nothing. Look at a clock and talk for a minute — it’s longer than you might think.
Do you think you could give a recording of a minute of someone talking to a talented impressionist and they could impersonate that person to some degree? It doesn’t seem that far fetched to me.
"Few" doesn't mean <60 it typically means ~5 or <10.
Getting high-quality audio for an arbitrary private citizen via public means isn't that easy, especially for folks like me that don't post video on public social media and use automated call screening and never say a word until the caller has been vetted.
> "This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public."
These samples are terrible compared to commercially released models like those from ElevenLabs or PlayHT. This is an extension of an interesting architecture, but currently those more traditionally based models are far more convincing.
I can't wait until the free base models get better. The flood of TikToks, shorts, and stories with the standard ElevenLabs voice is getting nauseating.
it might help if the dumb dumbs at the bank would stop trying to make me say "my voice is my password". I've been careful to only say "no fcuk off you fcuking numpty who came up with this idea after voice cloning hit the mainstream".
I can believe a speech generator too good to release, but not even a perfect algorithm can get every one of your inflections and verbal tics with just a few seconds of sample material. Makes me think the whole thing is bs. I instantly see any "ooh our thing we are making on purpose is so dangerous oohhh" as an attempt at regulatory capture until I see proof of the danger.
What is the point of them trying to create this? That something like this would mostly be used to spread disinformation and sow chaos was easily understood before it was made.
There are legitimate uses of this tech, such as preserving the voices of people who are losing theirs, as with Stephen Hawking, or making it easier for blind/low-vision people to follow text and interact with devices. For that latter case, having a more natural voice that is also accurate is a good thing.
I use TTS to listen to articles and stories that don't have access to an audiobook narrator. I've used some of the voices based on MBROLA tech, but those can grate after a while.
The more recent voice models are a lot higher quality and emotive (without the jarring pitch transitions of things like Cepstral) so are better to listen to. However, the recent models can clip/skip text, have prolonged silence, have long/warped/garbled words, etc. that make them harder to use longer term.
You're right, of course. Unfortunately, however, we're all just actors in a giant, multi-player, iterated Prisoner's Dilemma here. If I decide not to pursue human-level automated speech generation, or I end up developing it and don't release because it's "too dangerous," someone else will just come in behind me and take all that market share I could have captured.
It's like we're stuck in some movie that came out in 1994[0], or something. Except, in this version, everything is gonna blow up sooner or later, anyway. Might as well profit from it along the way, right?
At least one good use is video games where the text of some dialogue is determined at runtime. For example, in a game I work on, player chat is local and voiced by TTS configured by the player for their character.
On the one hand, I would love this kind of tech to be available for entertainment purposes. An RPG with convincing NPCs that are able to provide a novel experience for every player? Sounds great.
On the other: this is fraught with ethical problems, not to mention an ideal tool for fraud. At worst, it could be used as a weapon for total asymmetrical warfare on concepts like media integrity and an ideal tool for character assassination; disinformation, propaganda, etc.
I would happily welcome a world where this stuff is nerfed across the board, where videogames and porn are just chock full of AI voice-acting artifacts. We'll adjust and accept that as just a part of the experience, as we have with low fidelity media of the past. But my more cynical side tells me that's not what people in power are concerned about.
This is what happens when you have an industry full of people "looking for challenging problems to solve" without an ethical foundation to warn them that just because you can build something doesn't mean you should.
The point is to spawn a new medium. You'll have to imagine harder how positive that could be; people with lots of ideas are not going to give them to you for free.
Perfecting the tech for widespread use has trade-offs: the need for caller ID, the ease of slander until trust in voice uniqueness recalibrates. All of that is going to change soon anyway, but giving only rich/bad actors the tech at first has its own set of trade-offs. Head in the sand is the irresponsible way.