
It is actually useless. It is very slow, the quality is suboptimal, and it is just the speech generation component. See the discussion here:

https://github.com/SesameAILabs/csm/issues/80


No, there are mathematical reasons LLMs are better. They are trained with a multi-objective loss (coding skills, translation skills, etc.), so they understand the world much better than an MLM. The original post discusses this, but with more words and points than necessary.


GPTs also get gradients from all tokens; BERT only gets them from the ~15% of tokens that are masked, so GPT training is more efficient.
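A minimal sketch of the difference, with toy tensors standing in for real model outputs (nothing here is any particular library's training loop):

    import torch
    import torch.nn.functional as F

    vocab, seq_len = 1000, 128
    logits = torch.randn(seq_len, vocab)           # one prediction per position
    targets = torch.randint(0, vocab, (seq_len,))  # ground-truth token ids

    # CLM (GPT-style): every position predicts the next token,
    # so every token contributes gradient
    clm_loss = F.cross_entropy(logits, targets)

    # MLM (BERT-style): only the ~15% masked positions contribute
    mask = torch.rand(seq_len) < 0.15
    mlm_loss = F.cross_entropy(logits[mask], targets[mask])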


Call it CLM vs MLM, not LLM vs MLM. Soon LMLMs will exist, which will be LLMs too...


It is actually pretty straightforward why those models "reason" or, to be more exact, can operate on complex concepts. By processing huge amounts of text they build an internal representation where those concepts are represented as simple nodes (neurons or groups of neurons). So they really do distill knowledge. Alternatively, you can think of it as a very good principal component analysis that extracts many important aspects, or as a semantic graph built automatically.

Once knowledge is distilled you can build on top of it easily, for example by merging concepts.

So no secret here.
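A toy illustration of that "merging concepts" idea, assuming the gensim package and a small pretrained GloVe model (word vectors rather than an LLM, but the compositionality is the same):

    import gensim.downloader

    vectors = gensim.downloader.load("glove-wiki-gigaword-50")
    # distilled concepts compose almost linearly:
    # king - man + woman lands near "queen"
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=3))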


Do they distill knowledge, or distill the relationships between words (that describe knowledge)?

I know it seems like dancing on the head of a pin, but …


Well, the internal representation is tokens, not words, so… the pin is even smaller?

They distill relationships between tokens. Multiple tokens together make up a word, and multiple words together make up a label for something we recognize as a "concept".

These "concepts" are not just a label though - they are an area in the latent space inside the neural network which happens to contains those words in the sequence (along with other labels that mean similar things).

A simple demonstration of this is how easily multi-modal neural networks build cross-modal representations of the same thing: "cats" end up in the same place in both image and word form, and more complex concepts ("beautiful country fields with a foreboding thunderstorm forming") also align well between the words and the images.
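A sketch of probing that alignment with CLIP via HuggingFace transformers ("cat.jpg" is a placeholder for any local image):

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                       images=Image.open("cat.jpg"),
                       return_tensors="pt", padding=True)
    # a higher logit means closer alignment in the shared latent space;
    # the cat caption should score highest for a cat image
    print(model(**inputs).logits_per_image)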


> Do they distill knowledge or distill the relationship between words (that describe knowledge)

Do we know that there's a difference between the two? Maybe this distinction is just a god of the gaps.


There is also a glitch in "dialogue".


Is anyone else thinking he doesn't look very healthy? It's strange that he is kind of slow in the video where he enters the room. Maybe some biohacking.


Err, I deeply respect the Amazon TTS team, but this paper and synthesis are… You publish a paper in 2024 and include YourTTS in your baselines to look better. Come on! There is XTTS2 around!

The voice sounds robotic and plain. Most likely there are a lot of audiobooks in the training data and less conversational speech. And dropping diffusion was not a great idea: the voice is not crystal clear anymore, it is more like a telephony recording.


xtts2 is great, but it looks like this model is probably more consistent in its output and has a better grasp of meaning in long texts.


Metavoice is one of a dozen GPT-based TTS systems around, starting from Tortoise. And not that great, honestly. You can clearly hear "glass scratches" in their sound, because they trained on MP3-compressed data.

There are much cleaner-sounding systems around. You can listen to StyleTTS2 to compare.


Is the crispness of compressed audio really the benchmark of TTS improvements? I feel like that's an aside. A valid point, but not much of a detractor.


Yes, it is one of the important aspects, in particular if you use TTS to create an audiobook or in video production.


Especially as any finished product may end up being compressed again. Lossy to lossy audio transcodes ALWAYS cause additional audio data to be lost.


I had forgotten about StyleTTS2, and it was discussed here on HN a couple of months ago. Maybe that's what made me feel that there's something going on.


I've tested both. StyleTTS2 is impressive, especially its speed, but the prosody is lacking compared to Metavoice.


Is it possible to run Metavoice and other PyTorch systems on Apple silicon, e.g. the M1? I keep getting issues.
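For context, the generic route I've been trying is PyTorch's MPS backend, roughly this (with a stand-in model, not the actual Metavoice code):

    import os
    # CPU fallback for ops the MPS backend doesn't implement yet;
    # needs to be set before torch is imported
    os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

    import torch

    device = "mps" if torch.backends.mps.is_available() else "cpu"
    model = torch.nn.Linear(16, 16).to(device)  # stand-in for a real TTS model
    x = torch.randn(1, 16, device=device)
    print(model(x).shape, "on", device)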


Good improvements for many languages; numbers here:

https://github.com/openai/whisper/blob/main/language-breakdo...


From the WER numbers alone it looks like a very small difference for English itself, but I've found WER to be a misleading assessment mechanism.

Having extensively tested Whisper v2 large against other 'lower WER' models and found them wanting (because of differences in their methodology for generating output), I'm super curious to get a feel for how v3 holistically behaves.

Will probably test it right now. :)
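For anyone who wants to do the same, a rough harness, assuming the openai-whisper and jiwer packages (the clip path and reference transcript are placeholders):

    import whisper
    import jiwer

    model = whisper.load_model("large-v3")
    hypothesis = model.transcribe("test_clip.mp3")["text"]

    reference = "the ground-truth transcript of the clip"
    print("WER:", jiwer.wer(reference, hypothesis))
    # WER hides normalization/formatting choices, so eyeball the text too
    print(hypothesis)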


I don't understand how a country of 10M people, the Czech Republic, is among the best.

And I can confirm: my app Whisper Memos (https://whispermemos.com) is very popular in the Czech Republic.

It makes perfect sense. Whisper is almost as good at transcribing Czech as English!


Czech pronunciation is extremely regular and straightforward (sounds close to Latin or even Italian) with no weird "which vowel was that" or "half the word is silent" features and just a few exceptions. Usually if you write a letter, you pronounce the sound, and if you hear a sound, you write the letter.

A great example is that — for most words from any language that uses a subset of the Czech alphabet — a Czech speaker can just pronounce the word instead of spelling it and another Czech speaker will be able to write it down.

e.g. "messerschmitt", "nešamas", "cadeira", "philosophy", "tastaturi", "nicchia", "kaupunki", "abordagem", "povjerilac", "primauté" are all foreign words with very unambiguous pronunciation in Czech.


I don't know Czech, but Italian is extremely consistent in the way it's written, so it's at the top of the list with about one or two orders of magnitude less data.


Czech pronunciation is actually very close to Italian (and both close to Latin). We don't do the "ce" and "ci" and "gn" things (we do a "di, ti, ni" thing instead), and we use diacritics to soften certain sounds (ž,š,č,ď,ň), but even ignoring all that and plowing right through, an Italian speaker pronouncing Czech text should be easily intelligible and even spot on for some words.


I'm more impressed by Korean! I didn't even realize it was that good in V2. But I've just seen a lot of systems perform really poorly (judged by my Korean gf, not me), and Korea is only a country of 52M (between Spain and Italy).

A funny note: if Siri is set to Korean and reads your incoming English texts, they sound like a racist imitation of a Korean accent. It is absolutely hilarious.


I also find it funny how Portuguese is also better than English (a Brazilian talking here). I guess it is probably the nature of the languages, the phonetics...

It works amazingly in PT-BR with Whisper V2; I can't even imagine it being better, and it turns out V3 promises to be better...


Wow, a fellow Slovak indie developer, kinda rare to see.


It looks like it's basically whisper-2 with extra training against datasets for specific languages that brought incidental improvements to the rest. Support for some of the languages is still really bad (from real-world experience).


Curious as to how Dutch has the lowest error rate.


They enunciate.


OK, first we screwed up buffers by making them globally tracked instead of just a piece of memory. Now it's time to break all binary modules again.


OK, but the photos look very suspicious. 1400-year-old gold straight from the ground shouldn't shine like that. Compare to the coins here, for example:

https://www.smithsonianmag.com/smart-news/ancient-welsh-gold...


Gold is a noble metal - pure gold doesn't tarnish like that in your photos.

"Gold staters" vary in gold content a great deal:

    The Durotriges issued a series of rapidly debased coins through this period probably starting around 50BC with a largely silver (80%) stater (British B) with a fairly small percentage of gold.
and

    Verica's stater series weighed between 5.27g and 5.29g while the gold content varied between 42% and 44.5%. The gold content appears to have remained stable over time with no sign of debasement.
https://en.wikipedia.org/wiki/Celtic_currency_of_Britain

Pure gold nuggets unearthed after many thousands of years underground "look like gold" when given a good rinse to get the dirt off - they don't look tarnished like the "gold staters" in your photo.


I believe the non-tarnishing properties of gold are why it has achieved its status in the world.


I have seen “guldgubber” come out of the ground and they do look like this for real.

