> Cutting out DeepSpeech seems sensible to me, it’s out of place in the general portfolio of products.

I disagree precisely because of the point you make later: "I'm somewhat concerned that Firefox will be irrelevant in five years".

Functionality provided by deep learning is going to be an important component of many kinds of software interaction going forward. The logistics of this are quite different from what we are used to in open source: the need to fund and coordinate compute, and to collect and handle data, matters far more than it did in the past.

There is other STT software, some of it mentioned in this thread, that matches or even beats DeepSpeech, but none of it is as ergonomic. Once you account for the value of time, that means it will be more cost effective to outsource such capabilities to the cloud, which comes with trade-offs that are difficult to appreciate in the short term: https://news.ycombinator.com/item?id=24236489
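To give a sense of what "ergonomic" means here, a minimal transcription sketch with DeepSpeech's Python bindings looks roughly like this (the model, scorer, and audio file names are assumptions; any released acoustic model and scorer pair should work):

    import wave

    import numpy as np
    import deepspeech

    # Assumed file names for the released acoustic model and scorer.
    model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
    model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

    # DeepSpeech expects 16 kHz, 16-bit mono PCM audio.
    with wave.open("audio.wav", "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16)

    print(model.stt(audio))

A few lines, no account signup, no per-request billing, and the audio never leaves the machine.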

I'd say DeepSpeech fits the mold of Mozilla as a company providing solutions to complicated software problems while respecting the user and their privacy.

In the old days, the most accurate TTS and STT models were built into the OS. These days, you need to call into the cloud to get the best stuff. In [1], the Internet Archive complains about the quality of its OCR software. It's not that open OCR is so bad; it's that the best OCR runs on Google's and Microsoft's computers. It's possible to cobble something together using open source tools like EasyOCR or Tesseract+OpenCV, but that only gets you part of the way there. What makes the cloud offerings so good is that they have the resources to devote to pre-processing pipelines, architecture tweaks, and settings that handle edge cases better. Most of the mass resides in the edge cases.
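For illustration, a bare-bones version of such a pre-processing pipeline, with OpenCV feeding Tesseract, might look like the sketch below. The file name and the blur kernel size are assumptions, and real pipelines do far more (deskewing, layout analysis, per-page tuning), which is exactly where the cloud offerings pull ahead:

    import cv2
    import pytesseract

    # Hypothetical input scan.
    img = cv2.imread("scan.png")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Otsu binarization to even out lighting, then light denoising.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    clean = cv2.medianBlur(binary, 3)

    print(pytesseract.image_to_string(clean))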

From my vantage point, the future looks like one of software as thin layers built atop APIs that call into programs running on the servers of a handful of companies. You might not think this is a big deal, but that software will be what scans the environment, writes the emails, completes the thoughts, and plans the calendars for the majority of humans.

[1] https://blog.archive.org/2020/08/21/can-you-help-us-make-the...




Based on the testing I just did with Vosk, Mozilla DeepSpeech, Google Speech-to-Text, and Microsoft Azure, I disagree with your argument that SaaS has the best quality results.

Mozilla DeepSpeech was definitely trailing the bleeding edge, but Vosk with the vosk-model-en-us-daanzu-20200328 model produces very accurate results even on uncommon words, similar in performance to Google and Microsoft (whose output is generally better formatted than Google's).

Try it yourself:

Google: https://cloud.google.com/speech-to-text/ See "Put Speech-to-Text into action" header

Microsoft: https://azure.microsoft.com/en-us/services/cognitive-service... See "Upload File"

Vosk: https://alphacephei.com/vosk/
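If you'd rather test Vosk locally than through a web demo, a minimal sketch with its Python bindings looks roughly like this (the WAV file name is an assumption; the model directory is the one mentioned above, unpacked next to the script):

    import json
    import wave

    from vosk import Model, KaldiRecognizer

    # Assumes the model from above has been downloaded and unpacked here.
    model = Model("vosk-model-en-us-daanzu-20200328")

    # Assumed input: a 16-bit mono PCM WAV file.
    with wave.open("test.wav", "rb") as wav:
        rec = KaldiRecognizer(model, wav.getframerate())
        while True:
            data = wav.readframes(4000)
            if len(data) == 0:
                break
            rec.AcceptWaveform(data)

    print(json.loads(rec.FinalResult())["text"])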

Had Mozilla provided 4x to 8x more GPU resources and more staff, their STT would likely be competitive. Other small STT developers can iterate and test much faster because they have more hardware at their disposal.


Even Google is trying to offload as many of these computations as possible to on-device chips nowadays, though.

Their new Pixel has voice control backed entirely by on-device models, for example.

I think SaaS is a stopgap for good ML, and that eventually enough of this will be open source that basic tasks such as vision and speech will be cheap to solve for any company with high technical competency.



