Future of DeepSpeech / STT after recent changes at Mozilla (discourse.mozilla.org)
147 points by trowngon on Aug 22, 2020 | 74 comments



Maybe we should try to find a list of exactly what they are focussing on going forward instead of the slow drip of things they’re cutting back on (Servo, MDN, DeepSpeech...)

It’s a sad sad day when you have an organisation getting hundreds of millions in funding and turning away from what it's good at. The decline has begun in my eyes; it may not become apparent for a few years yet.


Cutting out DeepSpeech seems sensible to me, it’s out of place in the general portfolio of products.

It would be nice if Mozilla could tell us what their focus is going to be, but I doubt that Mozilla management know at this point.

At this point I’m somewhat concerned that Firefox will be irrelevant in five years, and I don’t currently feel that Mozilla is communicating clearly that they still care about Firefox. I assume they must, but it would be comforting to know that Firefox is still at the core of Mozilla's strategy.


> Cutting out DeepSpeech seems sensible to me, it’s out of place in the general portfolio of products.

I disagree precisely because of the point you make later: "I’m somewhat concerned that Firefox will be irrelevant in five years".

Functionality provided by deep learning is going to be an important component of many types of software interactions going forward. The logistics of this will be quite different from what we are used to in open source: funding and coordinating compute, and collecting and handling data, matter far more than they used to.

There is STT software, some of it mentioned in this thread, that matches or even beats DeepSpeech, but none of it is as ergonomic. Once you account for the value of your time, that means it will be more cost-effective to outsource such capabilities to the cloud, which comes with trade-offs that are difficult to appreciate in the short term: https://news.ycombinator.com/item?id=24236489

I'd say DeepSpeech fits in the mold of Mozilla as a company providing solutions to complicated software problems that are better at respecting the user and their privacy.

In the old days, the most accurate TTS and STT models were built into the OS. These days, you need to call into the cloud to get the best stuff. In [1], Internet Archive complains about the quality of their OCR software. It's not that OCR is so bad, it's that the best OCR is found on Google's and Microsoft's computers. It's possible to cobble something together using open source solutions like EasyOCR or Tesseract+OpenCV, but that will only get you part of the way there. What makes the cloud offerings so good is that they have enough resources to devote to pre-processing pipelines, architecture tweaks, and settings that handle edge cases better. Most of the mass resides in edge cases.
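To make that concrete, the open source route looks roughly like this (a minimal sketch assuming pytesseract and OpenCV, with the image path made up for illustration; the hosted services layer far more pre-processing and tuning on top of this):

    import cv2
    import pytesseract

    # Minimal pre-processing: grayscale, then Otsu binarization to
    # clean up the scan background before recognition.
    img = cv2.imread("scan.png")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Hand the cleaned-up page to Tesseract.
    print(pytesseract.image_to_string(binary))

That handles clean scans fine; it's the skewed, faded, multi-column pages where the extra engineering behind the cloud APIs starts to matter.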

From my vantage, the future looks to be one of software as thin layers built atop APIs which call into programs running on the servers of a handful of companies. You might not think this a big deal, but this software will be what scans the environment, writes the emails, completes the thoughts and plans the calendars for the majority of humans.

[1] https://blog.archive.org/2020/08/21/can-you-help-us-make-the...


Based on the testing I just did with Vosk, Mozilla DeepSpeech, Google Speech to Text and Microsoft Azure, I disagree with your argument that SaaS has the best quality results.

Mozilla DeepSpeech was definitely trailing the bleeding edge, but Vosk using the vosk-model-en-us-daanzu-20200328 model produces very accurate results even on uncommon words, similar in performance to Google and Microsoft (whose output is generally better formatted than Google's STT).

Try it yourself:

Google: https://cloud.google.com/speech-to-text/ See "Put Speech-to-Text into action" header

Microsoft: https://azure.microsoft.com/en-us/services/cognitive-service... See "Upload File"

Vosk: https://alphacephei.com/vosk/
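For the offline option, a rough sketch of the kind of Vosk test I mean (assuming the vosk Python package, the model directory downloaded from the site above, and a 16 kHz 16-bit mono WAV file; paths are placeholders):

    import json
    import wave
    from vosk import Model, KaldiRecognizer

    # Point Model at the unpacked model directory, e.g. the
    # vosk-model-en-us-daanzu-20200328 download mentioned above.
    model = Model("vosk-model-en-us-daanzu-20200328")

    wf = wave.open("test.wav", "rb")  # 16 kHz, 16-bit, mono PCM
    rec = KaldiRecognizer(model, wf.getframerate())

    results = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            results.append(json.loads(rec.Result())["text"])
    results.append(json.loads(rec.FinalResult())["text"])

    print(" ".join(results))

Swap in vosk-model-small-en-us-0.3 if you want the lightweight model instead.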

Had Mozilla provided 4x to 8x more GPU resources and more staff, their STT would likely be competitive. Other small STT developers can iterate and test much faster due to having more hardware at their disposal.


Even Google is trying to offload as many of these computations as possible to on-device chips nowadays though.

Their new Pixel has voice control entirely backed by on-device models for example.

I think SaaS is a stopgap for good ML, and that eventually enough of this will be open source, that basic tasks such as vision and speech will be cheap to solve for any company with high tech competency.


Is now a good time for someone to write the "Unbundling Mozilla" start-up post on substack? I'd love to see something cogent written up about it. Something like this[0]?

[0] - https://latecheckout.substack.com/p/the-guide-to-unbundling-...

EDIT: Add link


I'm not sure about Mozilla's efforts in STT, but they were lagging pretty far in TTS. [1]

Google/Baidu, universities, and an assortment of Chinese/Japanese/Korean social media companies (Line, etc.) are posting the most compelling TTS research, models, and code. Mozilla's TTS system [2] is an amalgam of some of these models, but it lags pretty far behind state of the art.

Mozilla should focus on getting additional revenue streams. We can help them out by trying to get Congress / DOJ to strip Google of its ability to have and maintain a browser with which they entrench their search and advertising moat. I think they're clearly in antitrust/anticompetitive territory.

[1] I'm pretty familiar with this field as I wrote https://vo.codes and https://trumped.com TTS systems. Neither of those are state of the art in terms of mean opinion score (MOS), but they're incredibly efficient.

[2] https://github.com/mozilla/TTS


That's understandable given that there was a single developer working on TTS. It is hard to compete with big academic teams/industry players this way.

I also believe the Mozilla team was restricted by a lack of computing resources. They had just a single 8-GPU server or so.


Said 8-GPU server was consistently in use for training Mozilla DeepSpeech (now renamed Mozilla STT) models. It's impressive how far Mozilla got considering how limited their resources were.


This is an area that I find unbelievably frustrating. A lack of computing resources in the current day is kind of insane. You can buy an 8GB GPU for <$1000. Even with the rest of the costs, the cost of hardware like this is a drop in the bucket when your main office is housed in Mountain View! Especially on a project that ends up being public-facing, these are missed opportunities where a little can go a long way.


I take your point but according to the release details on the repo it was not 8 GB on one card but a server with 8 cards, each a Quadro RTX 6000 with 24 GB, and they're around £4k each currently, so the cost of the GPUs alone is £32k

https://github.com/mozilla/STT/releases/tag/v0.8.2


Ah, I see-- not an 8GB, 1-GPU server, but an 8-GPU server. That does make a bit of a difference, changing the cost from a new workstation to functionally a piece of capital equipment. Still, I'm not sure that my point about equipment costs falls short--even at (call it) $40K, you're probably talking less than 3 months of the company's all-in cost for the developer themself, amortized over multiple years.


We need a SETI@home approach to open source AI models.

Only then we can break our dependency on Google and Facebook - and Mozilla for that matter.



Chromium is open source and you can apply policies to do the things you mention. Based on your logic Mozilla should also be forced to get rid of Firefox Sync.


Chrome is shoved down grandma's throat. She probably doesn't know much other than it's the "Google Internet thing". It's the default on Android and Google.com nags you to install it.

This is worrying given that Google cripples the browser and web standards to favor its own search engine and advertising platform.

Killed the semantic web and semantic markup? Check.

Disabled APIs for blocking ads? Check.

Use Google.com as the default search? Yep.

Embrace and extend the web with AMP and instant apps? Bingo.

Auto log into your Google session or nag until users permit it? Absolutely.

Trying to destroy the notion of a URL? I thought those were cool.

Google is destroying the web and is about as anti-competitive as they come.


> Killed the semantic web and semantic markup? Check.

Based on what evidence?

> Disabled APIs for blocking ads? Check

They didn't. uBlock Origin and adblocker extensions never stopped working.

> Use Google.com as the default search? Yep.

What do you think Edge does here? Easily changed via policies.

> Auto log into your Google session or nag until users permit it? Absolutely

Doesn't nag you and easily disabled in settings or via policy.

> Trying to destroy the notion of a URL? I thought those were cool.

I only get a little frustrated on Android, but just have to remember to hit the edit icon if I want to change it.


> > Disabled APIs for blocking ads? Check

> They didn't. uBlock Origin and adblocker extensions never stopped working.

That was probably this issue in the chromium tracker https://bugs.chromium.org/p/chromium/issues/detail?id=896897...

I don't know what happened after that though; the conclusion of that issue (in Jan 2019) was "these changes are draft, and still being discussed".


I see a lot of what appears to be overreaction... doesn’t sound like DeepSpeech is ending in the first part of the announcement.

“Most of the technical changes were already landed, and we see no reason not to ship it. We’ll be releasing 1.0 soon and encourage everyone to update their applications”

So looks like at least 1.0 is near and still gonna happen... I know these seem like dark times for Mozilla but I believe they will survive. As I recall the decline of Netscape was a pretty dark time and out of that came Phoenix - er Firefox and here we are today... I’m sure Mozilla and many of the great projects will survive


I don’t know what is going to save Mozilla, really I don’t. I just wish there was a way to “reach” them and discuss how we the internet community could come to an agreement about what they could do to derive value we would pay for.

It’s not for a lack of trying on their part for sure, but it feels like just using their browser isn’t all there is to it any more


> what they could do to derive value we would pay for

For someone that found Linux in the 90's and watched the birth of Mozilla from the ashes of Netscape, that's a very strange thing to read.

This site is not Slashdot, I know. It always had another kind of relation to business and money. But still...

I have no idea why Mozilla should need a business model. Much less I understand why should we think of one and agree on it.

How much money does it take to maintain a web browser? If it's a lot, maybe, just maybe, we should agree on a reduced feature set and refuse to use something more complex. Some people here talk about text mode browsers. I'm not so radical. Just keep it simple enough to be maintainable by a dozen volunteers.


Why? Should we apply the same logic to Linux? Why should we arbitrarily restrict user value because something costs money?

Isn’t the main problem that users are not willing to pay for the browser they use?

Google Chrome is probably maintained by much more than 12 people, so if we restrict Firefox to that, everyone is just going to move to Chrome anyways.


> I have no idea why Mozilla should need a business model.

Because developers aren't free and "let's get money from Google searches" is great until Google decides not to fund a competitor any more.


Building B2B services around Rust, i.e. onsite training, consulting, and development, seems better to me than firing people - what am I missing here?


Almost all company-sponsored programming languages are run as loss leaders to enable selling some other profitable product of the company. What is the profitable product that Rust enables?


Well, an IDE would’ve been one option, as well as backend services for enterprises that are migrating to Rust. Otherwise, as I mentioned, the product is services like outsourced development, consultancy and training resources.


> What is the profitable product that Rust enables?

Surely that's Firefox?


Nobody is building anything based on Firefox. It's not like Rails or .NET that gives your application a head start.


People used to build a lot of software around Gecko, and there are still some notable users like Komodo IDE, but Firefox is a lot harder to embed than it once was. Servo from the Rust team was supposed to solve this by providing a new embeddable browser core; not sure if that is still the long-term plan.


Firefox apparently is no longer a focus because it is hard to monetize outside of the search box; see the earlier letter. I would definitely not take Firefox's future for granted at this point.


Firefox is the only thing Mozilla has ever been able to make any money with; anything else has gotten them a pittance at best.

Giving up on that because it's 'too hard', without first proving they have an alternative? That would be insanely foolish. They may as well close up shop now if that's their plan.


Fully agreed. It's a real problem.


Has "the internet community" ever "come to an agreement" on literally anything?


Net neutrality?


What would you personally pay Mozilla for?


Firefox, Rust, and privacy.

It'd be really awesome if they could develop a search engine or phone (I know they tried) that had an open standards / web-compatible development kit.

I want an anti-Google / anti-Apple. Something we own and can extend. Something that doesn't sell our data.

I'd also like to see Mozilla doing lobbying. Partnering with the EFF. We've strayed so far from the bright and open Internet of the 90's and 00's. It's depressing to think about how locked up and proprietary it's all become.

I'll buy Mozilla / Firefox merch. I'll pay a subscription.

edit: Talk to Shuttleworth. Fold Ubuntu in. I'll buy a Mozilla phone and a Mozilla laptop.


I feel bad for doing a "me too" comment, but you've nailed exactly my thoughts on the subject. I feel like Mozilla hasn't really tried something like this. Every time it gets suggested, it quickly gets shot down (by other internet commenters) as "can't be done" and "wouldn't generate nearly enough money".

Well... maybe not with that CEO salary.


Mozilla can model itself after Microsoft somewhat.

Provide a development stack (they're experts at Web and Rust). Make themselves the go-to shop for developers in that realm.

Sell them on an OS and editor with support. Partner with Ubuntu. Hell, I would even reach out to Nadella and see if they'd be willing to work with Mozilla on hedging against Google. Mac is becoming locked down and kind of unpleasant to develop on/for. Mozilla could win this.

Block all the advertising and tracking. Build a Spotify-like news aggregation service you can access from your Mozilla subscription.

Build an email service like Hey and a file backup service like Dropbox. It's too bad Zoom bought Keybase, but perhaps Chris Coyne wants a new gig?

We should team up to beat FAAMG. Most of the FAAMG actors are actually quite damaging to open source despite benefiting from it greatly.


This all sounds to me like capital-intensive businesses going up against entrenched players, where even the not-so-average consumer would likely do no more than pay lip service unless there was some secret sauce that made it more compelling than the existing options.

They need a good out-of-the-park product in those markets to make any real headway. Too idealistic.

My only thought on this is that they should pivot to be like Algolia: focus on Firefox being a reference-implementation browser and sell their expertise to the other vendors, maybe. It’s one of the few verticals I can think of that would work strategically without them having to pivot into things they have no experience with.


Do they? I mean, most of these vendors are already competing, and unlike Firefox, they're not necessarily competing for the average Joe, but technical users who often have different priorities.

Those are also services that groups are used to paying for already, which means if they could eat the start-up costs, even at a reduced scale, they could make a profit at even a slight premium for things that they already do very well, and go from there.


I'm already personally paying Mozilla $8/mo for their VPN and private browser extension.

If they offered something like the services offered by mailbox.org, or Librem One? I'd switch my GMail account tomorrow, including the storage fees I'm paying on it, and would do it at triple the cost for not abusing my data. Hell, they already have the domain experience with their proximity to Thunderbird devs.


MDN, Firefox (voice search would be nice), and anything that is a replacement for Google products.


Does anyone know of other open-source projects in the speech-to-text space? DeepSpeech was one of the most promising projects, especially the latest versions...


Try https://github.com/alphacep/vosk-api. It supports 10 languages, works on Android and RPi and also has big and more accurate server models.

Other good ones are https://github.com/daanzu/kaldi-active-grammar and https://talonvoice.com/

There are toolkits for research like https://github.com/kaldi-asr/kaldi, https://github.com/espnet/espnet, wav2letter, Espresso, Nvidia/Nemo, https://github.com/didi/athena. You can try them too if you want to go deep. Some of them have interesting capabilities.


Comparing DeepSpeech v0.7.4 to Vosk using plain spoken English samples from male and female speakers, they seem to be performing the same if I use vosk-model-small-en-us-0.3 and the full size DeepSpeech model.

When I use vosk-model-en-us-daanzu-20200328 the result is perfect on many of these tests, though it does not do punctuation or capitalization beyond apostrophes. IIRC there is another project on GitHub that can add basic formatting though.

I am quite surprised with Vosk's performance, it even handles odd words like Puget Sound well! Need to test out more accented audio on it, but this is quite exciting.


There are a lot of open source projects in this space. DeepSpeech is actually one of the outsiders (they are not represented well in the academic community), and also not quite competitive with other software (at least last time I checked).

E.g. some very active projects are:

* Kaldi (https://github.com/kaldi-asr/kaldi/) obviously, probably the most famous one, and most mature one. For standard hybrid NN-HMM models and also all their more recent lattice-free MMI (LF-MMI) models / training procedure. This is also heavily used in industry (not just research).

* ESPnet (https://github.com/espnet/espnet), for all kind of end-to-end models, like CTC, attention-based encoder-decoder (including Transformer), and transducer models.

* Espresso (https://github.com/freewym/espresso).

* Google Lingvo (https://github.com/tensorflow/lingvo). This is the open source release of Google's internal ASR system, and used by Google in production (their internal version of it, which is not too much different).

* NVIDIA OpenSeq2Seq (https://github.com/NVIDIA/OpenSeq2Seq).

* Facebook Fairseq (https://github.com/pytorch/fairseq). Attention-based encoder-decoder models mostly.

* Facebook wav2letter (https://github.com/facebookresearch/wav2letter). ASG model/training.

* (RETURNN (https://github.com/rwth-i6/returnn) and RASR (https://github.com/rwth-i6/rasr), our own, although this is currently free for academic use only. It is used in production as well. Supports hybrid NN-HMM, CTC, end-to-end attention-based encoder-decoder, transducer, etc.)

And there are many more.

You will also find lots of ready-to-use trained models.


You seem to know a lot about the topic, any idea about the current state of text-to-speech? Haven't seen any open-source projects that would make, for example, an ebook enjoyable.


A recent, more or less reasonable one is https://github.com/TensorSpeech/TensorFlowTTS; it implements all the latest algorithms. For simple business books it will be OK; for emotional fiction it's probably not there yet.


Extant TTS is already there for fiction, if you approach it with the right expectations (more an alternative to visual reading than dramatically read audio books.) I've 'read' numerous fiction books using MacOS's TTS ('Alex') and with my kindle (3rd gen 'keyboard' model from 2010.)

These extant solutions require an effort-investment from the user to work up to fast speeds, but once the user becomes acclimatized they work great. The neuroplasticity of the human brain seems to do a great job of smoothing out the wrinkles.


I agree - I've been using Google's TTS API for audiobooks and it's great. I switch off between professional audiobooks (OverDrive is amazing and free through public libraries) and TTS and, while professionals can add something, you get used to TTS pretty fast. Google's TTS gives 1 million free characters a month, which is pretty generous for a single person, and it sounds pretty good. I read books with pretty weird character names (like the Wandering Inn web serial) and it never explodes. Sometimes it spells out character names but even for very non-standard names, it does fine.

I've experimented with some Tacotron/ESPnet TTS models to do the TTS on my computer and they work alright. Sometimes you get weird edge cases and it makes some pretty weird sounds (and even if your laptop doesn't have a GPU, Google Colab works well for quick audiobook generation). I don't hit the million characters that often so it hasn't been a big deal but I'll probably move to home-made just because I like tweaking it.

The way I think about it is that the written word doesn't have much intonation anyway so as long as the audiobook doesn't offend me, it's a pretty good solution (and helps prevent eye strain after working on a computer all day)
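For anyone wanting to try the same thing, a single request looks roughly like this with the google-cloud-texttospeech Python client (a sketch; the voice settings and file names are placeholders, and you still have to split the book into request-sized chunks yourself):

    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    def chunk_to_mp3(text, out_path):
        # One synthesis request per chunk of the book.
        response = client.synthesize_speech(
            input=texttospeech.SynthesisInput(text=text),
            voice=texttospeech.VoiceSelectionParams(
                language_code="en-US",
                ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
            ),
            audio_config=texttospeech.AudioConfig(
                audio_encoding=texttospeech.AudioEncoding.MP3
            ),
        )
        with open(out_path, "wb") as f:
            f.write(response.audio_content)

    chunk_to_mp3("Chapter one. It was a dark and stormy night.", "chapter1.mp3")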


Can you run audio files through any of these or do they only support audio from microphones?


At the point of them taking in input to process, audio that comes from a microphone or comes from a file is basically just a series of numbers and is the same. So there's no barrier in terms of feasibility.

Whether they're all set up to do that "off the shelf" is a different matter but it should be fairly straightforward to add this to any that lack it and because they're open-source anyone could do a bit of Googling etc and find suitable code to adapt to do it. I know DeepSpeech definitely can take audio from files directly as input as I've used it that way before, and I strongly expect many (or possibly all) of the others could too.


DeepSpeech and Vosk can accept audio files, although each wants them formatted in a slightly different mono WAV format.

See my other comment for a comparison of the two: https://news.ycombinator.com/item?id=24248238
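For DeepSpeech, the file route looks roughly like this (a sketch assuming the 0.7.x Python package and the released .pbmm/.scorer files; the WAV needs to be 16 kHz 16-bit mono, so resample first if yours isn't):

    import wave
    import numpy as np
    from deepspeech import Model

    # Released acoustic model plus the external scorer (language model).
    ds = Model("deepspeech-0.7.4-models.pbmm")
    ds.enableExternalScorer("deepspeech-0.7.4-models.scorer")

    # DeepSpeech expects 16 kHz, 16-bit, mono PCM samples.
    with wave.open("test.wav", "rb") as wf:
        audio = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

    print(ds.stt(audio))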


deepspeech.pytorch is a good one. Since Mozilla's DeepSpeech project is still using TensorFlow 1.x, I think the PyTorch implementation is actually better. https://github.com/SeanNaren/deepspeech.pytorch


Between this and Servo I guess Mozilla is just giving up on relevance. That really sucks.


That’s what got them in this mess in the first place: fifty pie-in-the-sky projects to stay relevant instead of focusing on Firefox or just saving revenue aggressively.


I work with Mozilla's DeepSpeech every day. Mozilla's STT is critical to the survival of important indigenous languages throughout the world.

I sincerely hope we can help make this project continue and that Mozilla can help us do that.

Ensuring indigenous languages have digital representation is essential to their survival. Speech recognition and synthesis are a vital part of that. Indigenous communities are often ignored by Big Tech because they bring little financial value to their bottom lines, but financial bottom lines are not everything. Culture is more important. Open source tools like DeepSpeech allow communities to build the tools they need for themselves.

Māori have been working to help build tools for te reo Māori, and our project is at the forefront of using open source tools like DeepSpeech to revitalize the Māori language. The core of a good speech recognition system helps us in many practical ways, such as improved transcription, support for pronunciation, correct announcements in public transport, correct information on maps and in many other ways. We may well continue to support and use DeepSpeech if the project can continue.

But there are also many other projects in other countries in the world who may follow on - such as the Kabyle people of Algeria who are using DeepSpeech, or the Mohawk nation in North America who have been looking into it.

By the way, we are working on our web presence, but for now this quick one-pager gives some idea of the work we are doing: https://papareo.nz.


Is your current data public? Do you have a sufficient amount of untranscribed data (1000+ hours)? We could help you.


Are they still collecting data for Common Voice, or is it both projects that they're terminating?



Ugh this is a deep gut punch, as this is one of the most interesting recent projects Mozilla was working on in my opinion.


Yikes. All this because they refuse to trim the fat at the C-level. A company can't be profitable by only employing overhead. They'll all be forced to take the ultimate pay cut when Mozilla closes up shop.


I was hoping DeepSpeech would lead to in-home, cloud-less "Alexas". Just ask me for a subscription on it and productize it, please.


So to confirm what you have in mind:

That's a physical device with all computation and information local?

And you'd pay for an upfront cost of the device as well as a subscription?

What comes with the subscription? Updated data, new features? Just keeping the lights on?


There are many cloud-less Alexas already; a good one is Rhasspy (https://rhasspy.readthedocs.io/en/latest/), though it is not based on DeepSpeech.


In the repo/docs, it suggests that DeepSpeech is an option for some languages (English & German). Haven't tried it, but with recent(ish) performance improvements in DS it can run on somewhat less powerful computers than used to be the case.

Other options for similar assistants that can also use DS are Mycroft (https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/customiz...) and DragonFire (https://github.com/DragonComputer/Dragonfire)


DS is by far the easiest to use/most promising ASR library/kit/thing I’ve used; really hoping it keeps going


Are there other foundations like Mozilla we can donate to? For initiatives that are in the interest of the public? The Apache Foundation is all I can think of, but they focus on corporate-use projects.


Really a shame. I find so many of Google's STT quirks infuriating enough I'd love a robust alternative.


Where does it say DeepSpeech is on hold? I don't see that anywhere.


Submitted title was "Mozilla to put DeepSpeech project on hold". We've replaced that with the article title per this guideline: "Please use the original title, unless it is misleading or linkbait; don't editorialize." https://news.ycombinator.com/newsguidelines.html


I guess that explains why I was under the impression the project was being shuttered after reading the comments, but not the actual post yet.

The actual forum post just says they don't know anything about the future of DeepSpeech yet, for those doing the same.


> Until a proper decision is being made regarding the future of the project, we will “keep the lights on” and try to address existing issues and review your contributions to the best accommodation we can in the scope of our new roles.

You could say that "keep the lights on" is the same as on hold.


Mozilla let politics take over its corporation to the point where it's basically a far left extremist group now that's relying on semi-bribe funding from Google.

There's no technology left anymore.



