I also decided to learn WebRTC and built a video chat app project: https://zonko.chat
The last time I did any p2p networking was back in 2002 or something, when you still had to do it all manually. We used all sorts of fun tricks like NAT hole punching and little script endpoints that captured and forwarded along port and public IP address information.
It was fun to see that all of this has since been formalized under the "ICE framework". I was surprised to see that the STUN spec is only 12 years old now, despite the techniques involved being used for at least 20 years, probably more like 30+.
So if anyone who's new to this whole p2p world feels that WebRTC and the ICE framework are confusing or onerous, I would point out that just a short while ago these were basically just a handful of heuristic techniques developed through trial and error over the years. It's really much easier nowadays! zonko.chat only took me 12 or so hours to build (and seems to be well supported by Chrome and Firefox, even mobile).
Edit: Upon reflection, I don't even remember how I learned about some of them. The concept of TURN was probably one that I, and many thousands of others, invented from scratch due to necessity (failed to punch the hole? fall back to this custom relay I wrote in perl). STUN was an easy one to figure out yourself, too. I don't remember how I learned about hole punching though. Probably a forum or a book. Or possibly just an experiment ("what if the two connections touch somewhere in the internet at the same time... hey wait, it worked?") What's interesting to me is that the core "ICE" concepts (hole punching, STUN, TURN) are still pretty simple even in their mature, formalized, scientific form. But the concept of "SIP" is much more sophisticated today than it was back then.
> The last time I did any p2p networking was back in 2002 or something when you still had to do it all manually.
> It's really much easier nowadays! zonko.chat only took me 12 or so hours to build
While it may have only taken you 12 hours to build in "real time", I'd say you've been "building" it for the last 20 years. My guess is that if a newbie tried to do this, they could expect to spend a few weeks or more on the project.
That's a good point! I did not need to relearn core concepts.
The bulk of that 12 hours was actually spent debugging negotiation and timing/race issues in the signaling/SIP layer. Even for me, figuring out how the WebRTC API is supposed to work was a little difficult.
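For anyone else hitting those races: the "perfect negotiation" pattern now documented on MDN resolves most of them. A hedged sketch in TypeScript (signaler stands in for whatever signaling channel you own, and polite is agreed out of band so exactly one side yields when offers collide):

    // Perfect negotiation, adapted from the MDN pattern.
    declare const signaler: {
      send(msg: any): void;
      onmessage: (msg: any) => void;
    };
    const polite = true; // the polite peer yields on offer collisions
    const pc = new RTCPeerConnection();
    let makingOffer = false;

    pc.onnegotiationneeded = async () => {
      makingOffer = true;
      try {
        await pc.setLocalDescription(); // implicit offer
        signaler.send({ description: pc.localDescription });
      } finally {
        makingOffer = false;
      }
    };

    signaler.onmessage = async ({ description, candidate }) => {
      if (description) {
        const collision = description.type === 'offer' &&
          (makingOffer || pc.signalingState !== 'stable');
        if (collision && !polite) return; // impolite side ignores the glare
        await pc.setRemoteDescription(description);
        if (description.type === 'offer') {
          await pc.setLocalDescription(); // implicit answer
          signaler.send({ description: pc.localDescription });
        }
      } else if (candidate) {
        await pc.addIceCandidate(candidate);
      }
    };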
I hope I didn't come off as overly aggressive; the comment was mostly for myself. I was feeling bad that there was no way I would be able to implement something like this in 12 hours.
Often veterans here will talk about projects they did trivially, but unless you dig into their profile and find out who they are, it feels like every Joe Schmo on HN is a 1/2mil+ SWE at Google.
No no, that didn't sound aggressive at all! Though I still wouldn't say the project was trivial, just very limited in scope. And had I done everything perfectly -- meaning, being able to stream code from my head with no errors -- the project may only have taken 3 hours. That's how small it is. I think the finished product is 800 lines of code, server and client together. It's conceptually very small too. (Maybe this is a 'veteran' skill as well, being able to keep things small.)
My point is that a whole 75% of the time I spent on this project was just me flailing around with stuff that wasn't working as I expected (that's relatable at all experience levels!). Perhaps it's true that veterans can get through certain things more quickly than novices can, but we're not immune to the 80/20 rule either! We just get stuck on different types of problems.
I worked for a company developing a custom P2P video streaming protocol. It was amazingly hard to get working properly and testing all the quirks in different routers from different manufacturers. The combination of all the NAT strategies (I think there were 5 main ones) created quite a matrix of possible ways that things can work, and you had to implement and test them all before there was any kind of standard way to do it. We built a mock internet with dozens of consumer routers, as many as we could get our hands on. We eventually got it to work and that company exists to this day, but I think they switched to WebRTC long ago.
WebRTC seems easy when you're creating a proof of concept within your own network; once you get into complex situations behind firewalls across the internet, it's a whole different story.
The article mentions: "This section we will just touch and go about when do you need a TURN server. It is not needed in all situations but a component needed if you have to deal with slightly less straightway use cases." In practice, a TURN server is a must in the real world...
True, but the problem you're mentioning is solved, for the most part. STUN/TURN servers serve this purpose; there are open source ones[0][1] and you can even use public STUN servers[2] (as you might expect, free TURN servers aren't really a thing).
Solutions like Jitsi Meet[3] and Whereby (formerly appear.in)[4] use WebRTC to great success and are fantastic for quick meetings.
WebRTC these days is pretty mature and ready for primetime -- TURN servers are a last resort, but they're a small price to pay for something that might be free (to you the service provider) most of the time, if signaling succeeds.
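For a side project, the happy-path config really is tiny; a minimal sketch using one of the public Google STUN servers mentioned above:

    // Public STUN covers address discovery; add a TURN entry (with
    // credentials) when you need a relay fallback.
    const pc = new RTCPeerConnection({
      iceServers: [{ urls: 'stun:stun.l.google.com:19302' }],
    });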
I do a significant amount of work with hospitals and secure environments (military, etc). TURN is needed 100% of the time. P2P traffic is not allowed, and all IP addresses need to be known upfront and kept static for firewall whitelisting.
This means products which help alleviate WebRTC infrastructure, such as AWS Kinesis, are not allowed (due to how they allocate TURN servers with unknown IP addresses). A company either needs to manage its own infrastructure and TURN servers, which lets you cherry-pick server locations (for HIPAA and country-specific legal requirements on what is streamed), or accept the large IP ranges of Twilio and its competitors (giving up server location flexibility and accepting increased commercial and market growth restrictions).
Whichever route you go down it is quite an undertaking!
P.S. Tsahi Levent-Levi is truly exceptional in this area. I highly recommend reading his blog and training courses: https://bloggeek.me/, https://webrtccourse.com/, and he runs an amazing testing product, https://www.testrtc.com. If you build your own infrastructure, testRTC is a must.
STUN is only useful if you're trying to negotiate a P2P connection, which isn't the case when using an SFU. If everything you're doing is going through an SFU then you don't need STUN.
I think the most frustrating part is that people don't know what NAT they are behind. I wish WebRTC easily told people the attributes of their NAT. Not sure if it would help production deploys, but would make learning a lot easier.
Someone contributed a really cool tool to Pion, stun-nat-behavior[0], that I use a lot. It prints out NAT details using the modern, correct terminology. I see a lot of docs that still talk about "symmetric NAT" etc.; RFC 4787[1] recommends against all that.
After hand-rolling my own setups and working with a few libraries, I have found https://mediasoup.org/ v3 to be the easiest library to use that still gives me the freedom to work with the architecture I want. This of course assumes you're not using WebRTC for its p2p capabilities and are willing to scale via SFUs, which is a common approach these days.
STUN fails under symmetric NAT, not strict NAT. That Google document cites no source for its 92% figure, but I assume that's for desktop traffic only. Pretty much all mobile/cellular connections would require TURN too.
The issue with webrtc is once you step out of the side-project domain, you have to confront the endless implementation differences between browsers, whether it's undocumented SDP behavior, different codecs, non-conformant behavior for low level calls, etc.
We've built a push-to-talk walkie talkie system called Squawk[0] which holds long-lived WebRTC connections in the background throughout the day. We use simplepeer[1] as the base to help bootstrap some of the browser shimming, but it's not perfect. So ultimately we've had to build all sorts of checks into our protocols, like an audio keepalive where we send periodic 20 ms frames of silence down the media channel and verify that we received some additional header bytes on the remote end. Otherwise WebRTC would let the connections rot, and you wouldn't know until you needed them, which in a push-to-talk situation is too late.
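For the curious, here's a rough sketch of one way to synthesize those silent frames with WebAudio (not our exact implementation; names are illustrative):

    declare const pc: RTCPeerConnection; // your existing connection

    // An oscillator muted by a zero-gain node still produces a steady
    // stream of (silent) audio frames, keeping the media path warm.
    const ctx = new AudioContext();
    const dest = ctx.createMediaStreamDestination();
    const osc = ctx.createOscillator();
    const mute = ctx.createGain();
    mute.gain.value = 0;
    osc.connect(mute).connect(dest);
    osc.start();
    pc.addTrack(dest.stream.getAudioTracks()[0], dest.stream);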
Also, to increase adoption, perhaps give users a link to the app.squawk.to URL to use directly as well. I tried that on my personal device and it works as advertised.
Out of those, Jitsi and Mediasoup seem to be better as SFUs than the rest because they have what appears to be decent-looking congestion control, bitrate allocation, and support for simulcast. The rest apparently do not (at least I couldn't find anywhere in the code where it happened).
Also interesting to note is that so many of the newer ones are written in Go based on Pion. If Pion ever gains the ability to do decent congestion control (perhaps based on transport-cc like Jitsi and Mediasoup do), that could improve things for all of those.
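As an aside, the simulcast half of that is already exposed in the browser API; a hedged sketch (the rid names and bitrates are illustrative, and the SFU has to understand simulcast for this to pay off):

    const stream = await navigator.mediaDevices.getUserMedia({ video: true });
    const pc = new RTCPeerConnection();
    // Send three spatial layers; the SFU forwards whichever fits
    // each receiver's bandwidth.
    pc.addTransceiver(stream.getVideoTracks()[0], {
      direction: 'sendonly',
      sendEncodings: [
        { rid: 'q', scaleResolutionDownBy: 4, maxBitrate: 150_000 },
        { rid: 'h', scaleResolutionDownBy: 2, maxBitrate: 500_000 },
        { rid: 'f', maxBitrate: 1_500_000 },
      ],
    });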
What you want are transcoding features -- some of the servers offer them (e.g. Kurento[0], which is not listed above), but some don't (e.g. mediasoup[1]), and some offer recording but you need to wrangle formats yourself (e.g. Janus[2]).
IPv6 is becoming increasingly common here in Asia (I'm based in Thailand but travel a lot around Asia). For context, I have used 3 different ISPs here and all have dual stack; both my cellular connections have also been dual stack.
This has meant NAT is less of an issue for native IPv6 endpoints, including P2P.
Hopefully when IPv6 is finally widespread in US/Europe we will see stuff taking more advantage of this fact.
I'm eager to create a higher quality video broadcasting (not web meeting, one way only) app for some local yoga studios I help out with and am hoping this article gives me a push in the right direction.
The audio quality on Zoom is just terrible, whether you disable DSP or not.
So many yoga classes require high quality music.
It's frustrating that Chaturbate provides top-notch video and audio quality essentially for free, while paying $20/mo for Zoom gives you what looks like 380p video, and audio quality I have yet to find a sufficiently poor comparison for...
Does anyone know how one could emulate what chaturbate does?
Any good articles outlining how they do what they do?
Ideally, the teacher would just plop their phone down in front of them, hit broadcast, and a few seconds of buffering later 1080p video and quality audio would be visible through a browser.
Why is that so tough to do??? I haven't been able to find a single article that simplifies or distills it at all.
From a technological perspective, streaming with a couple of seconds of delay is a world of difference from streaming with sub-second latency. You can account for network dynamics with more buffers, encode in higher quality (even using multiple passes), transcode to multiple targets, etc.
Zoom getting the music audio through the mic sounds like the real problem. You should be aiming to stream the audio from a digital source; then you could have the song titles overlaid on the video. There are definitely licensing issues, though the instructors are probably already not using legit licenses for their classes.
Also a lot of audio codecs are tuned towards speech and filter out high frequencies. You should pick one meant for music.
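In WebRTC terms, Opus can be coaxed into a music-friendly mode. A hedged sketch that munges the SDP (parameter names are from RFC 7587; the regex assumes a Chrome-style fmtp line, and the remote end is free to ignore the hints):

    declare const pc: RTCPeerConnection; // your existing connection

    function preferMusicOpus(sdp: string): string {
      // Request stereo and a higher target bitrate on the Opus fmtp line.
      return sdp.replace(
        /(a=fmtp:\d+ .*useinbandfec=1)/,
        '$1;stereo=1;maxaveragebitrate=256000',
      );
    }

    const offer = await pc.createOffer();
    await pc.setLocalDescription({ type: 'offer', sdp: preferMusicOpus(offer.sdp!) });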
Most of the streaming platforms use HTTP Live Streaming (HLS) because it avoids all the networking NAT headaches that come with p2p connections, and it handles variable quality better because each client fetches the best quality for its bandwidth. As far as I know, with WebRTC the sender degrades quality to satisfy the slowest peer.
That said, the downsides of HLS are potentially higher infrastructure costs, since you need to transcode the video to the different qualities, and, somewhat related, higher latency to live. With proper tweaking you might get 2-3 seconds of latency, but that might be too much for your use case.
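On the client side, HLS playback is only a few lines; a minimal sketch using the hls.js library (the manifest URL is a placeholder for wherever your transcoder publishes):

    import Hls from 'hls.js';

    const video = document.querySelector('video')!;
    const src = 'https://example.com/live/playlist.m3u8'; // hypothetical URL
    if (video.canPlayType('application/vnd.apple.mpegurl')) {
      video.src = src; // Safari plays HLS natively
    } else if (Hls.isSupported()) {
      const hls = new Hls({ lowLatencyMode: true });
      hls.loadSource(src);
      hls.attachMedia(video);
    }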
If there is voice interaction between the yoga instructor and their students the HLS delay will certainly be noticeable.
> WebRTC the sender degrades quality to satisfy the slowest peer
No, WebRTC is 1 to 1. Each connection is adapted independently. But, you can build services that have rooms with more participants, then it's up to you to shape the traffic as you want. If you use a central server (SFU), it can just send each peer the best they can receive, each independently from one another. It's a property of the service, not the technology.
I looked into that but it's not ideal mainly because we don't need a video conferencing system.
I suppose I could dig through the code and disable everyone's video feeds but the host's, but I don't have a lot of time to dedicate to this project, unfortunately.
Isn't your use case just a one-to-many stream, like YouTube, Twitch, Periscope, etc. all provide? That should be easier than a proper n-way meeting.
While the video streaming isn't that great on Zoom, my experience is that the audio quality piped through their custom audio mode for music is pretty good.
I've been looking into WebRTC and used the "WebRTC samples", which are good in many ways. It is fairly easy to get something up and running, but I found several areas that were difficult:
* Debugging. One user's sound just doesn't work while it works perfectly for me on different machines. I am clueless as to how to debug it.
* ICE. While it works, I had a hard time understanding, tracking, and debugging what was going on.
* Closing and restarting connections.
* Multiple clients in one room?
* Echo cancellation. This was frustrating for users.
* TURN. Is there a tool or way to know which clients need a TURN server? Or are using a TURN server?
I ended up concluding that turning it into an actual product would be fairly time-consuming.
WebRTC doesn't do everything for you; it's really just responsible for tying together ICE with media streams. Signaling is up to you to figure out. For instance, multiple clients in one room: this is part of the signaling layer and is not WebRTC's responsibility (I built this into zonko.chat if you want to see how it works though).
Closing and restarting connections is signaling-layer stuff too, i.e. your responsibility.
Echo cancellation is really supposed to be application layer and up to you as well, but I think this will probably shift to be the browser's/WebRTC's/getUserMedia's responsibility at some point.
Re. TURN: ICE is the process that works out whether a specific client needs to relay through a TURN server. The question is: do you need to deploy a TURN server? The answer is yes. If you build a P2P app that you want to work for all users, you will always need a TURN server. You can run coturn on the same box that you serve your app from; most likely a side project will never hit the scale requiring more than a $5 DigitalOcean box for TURN.
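A hedged sketch of both halves: pointing the browser at your own coturn box (hostname and credentials are placeholders) and then using getStats() to see whether the nominated candidate pair actually went through the relay:

    const pc = new RTCPeerConnection({
      iceServers: [
        { urls: 'stun:turn.example.com:3478' },
        { urls: 'turn:turn.example.com:3478', username: 'user', credential: 'secret' },
      ],
    });

    // True if the selected candidate pair uses a relayed (TURN) candidate.
    async function usingTurn(conn: RTCPeerConnection): Promise<boolean> {
      const stats = await conn.getStats();
      let relayed = false;
      stats.forEach((report) => {
        if (report.type === 'candidate-pair' && report.nominated &&
            report.state === 'succeeded') {
          const local = stats.get(report.localCandidateId);
          if (local && local.candidateType === 'relay') relayed = true;
        }
      });
      return relayed;
    }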
And yes, it should not be a surprise that products are time consuming to build :) WebRTC is plumbing; you probably were expecting something more like Jitsi.
FYI, echo cancellation actually does work (Chrome definitely); just make sure you specify the audio constraint so that it has a sample rate of 16 kHz (AEC does not work in the default 44.1/48 kHz modes).
> Echo cancellation is really supposed to be application layer and up to you as well, but I think this will probably shift to be the browser's/WebRTC's/getUserMedia's responsibility at some point.
Echo cancellation typically can't be application layer. The APIs I've seen (Android, iOS, WebRTC) require low latency and work best as close to the hardware as possible.
{ echoCancellation: true } as a track constraint in getUserMedia works.
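For example (a minimal sketch; browsers treat constraints as hints, so read back getSettings() to see what you actually got):

    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        echoCancellation: true,
        noiseSuppression: true,
        sampleRate: 16_000, // per the sibling comment; a hint, not a guarantee
      },
    });
    console.log(stream.getAudioTracks()[0].getSettings());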
I've never actually gotten { echoCancellation: true } to work for me, but your sibling comment does have a suggestion I need to try out!
Echo cancellation is a pretty lightweight DSP/FIR task. Whether you do it close to the hardware (though I suspect that's not actually the case with getUserMedia; it's still an audio-stream algorithm) or in the application layer, echo cancellation requires the same amount of added latency.
But in any case, I did say I suspected echo cancellation would shift to getUserMedia. It's not fully there yet, but it will be.
It depends what the product is. If you're trying to build another Zoom (which I gather from the "rooms" question), yes, it will take quite some time. For one thing, the mesh topology of P2P won't scale up beyond a handful of users, so you'll need to make it client/server. And besides time-consuming, that starts to get operationally expensive. Decoding, compositing, and encoding high-resolution video streams in real time take some processing power.
If you want to try a platform that abstracts some parts of it (such as signaling) and aims to provide an all-in-one package (compared with WebRTC, which is a collection of puzzle pieces that you are responsible for putting together), have a look at OpenVidu.
The team behind Kurento is working on this (I am part of it) for people who don't really care about all the intricacies of the standard(s) and just want to build a product on top of it. A single Docker container to deploy, and you're all set to write your app.
Still, this is a complex topic, so there are a thousand ways this technology can be made easier to use and understand. And I agree with other comments about the issue of debugging; there is totally an empty space in the market for a comprehensive solution that can help with troubleshooting when WebRTC fails.
The debugging bit is so frustrating. I spent almost a whole week trying to find a bug in my PeerConnections, only to find out that the TURN server was misconfigured (even though Trickle ICE was successful). And even then, just setting up a TURN server consumed a whole day.
WebRTC is end-to-end encrypted by default. There is a signaling server that helps establish the connections between users in a room, but after that the communication is encrypted. The TURN and STUN servers are only required for technical reasons to get peer-to-peer working, so no content is ever passed unencrypted.
That's the difference from other services like Zoom and Jitsi, where a server in the middle receives the video streams unencrypted and then redistributes them. Although Jitsi is adding end-to-end encryption support as well soon.
We started a simple webrtc app in 2018. Thought it would be simple. Now two years later we are still tweaking the code and dealing with handshakes and codecs across browsers, as well as edge cases involving firewalls and what to do if someone disconnects for longer than the timeout.
One example: H.264 is hardware-accelerated on iPhone, so one might prefer it over VP8, which can drain the device's battery pretty quickly when used in a P2P mesh setup.
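If you want to steer that choice from the app, setCodecPreferences can put H.264 first; a hedged sketch (it falls back silently to the default order if H.264 isn't available):

    const pc = new RTCPeerConnection();
    const transceiver = pc.addTransceiver('video', { direction: 'sendrecv' });
    const caps = RTCRtpSender.getCapabilities('video');
    if (caps) {
      // Reorder so H.264 payloads come first in the generated SDP.
      const h264 = caps.codecs.filter((c) => c.mimeType === 'video/H264');
      const rest = caps.codecs.filter((c) => c.mimeType !== 'video/H264');
      if (h264.length > 0) transceiver.setCodecPreferences([...h264, ...rest]);
    }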
First off, I'm a total noob when it comes to WebRTC, but having read the docs Google provides for it (and the accompanying mini video app tutorial), it seemed like it's dead simple to implement and use. I understood that the complexities, and basically everything you talked about above, were already handled. Is your implementation different from theirs, or did I maybe misunderstand the value proposition of WebRTC?
Twilio costs money, but it's not a bad idea. I created Remotehour (https://remotehour.com), which allows you to have an "open-door policy" video call easily. It works with Twilio :)
This reminds me of Icecomm[0] from a few years back.
Unfortunately, it didn't stick around for too long. It was pretty easy to use, as well, and a lot of people here ended up in a video chat together[1]. LOL!
I have some experience in this from developing https://p2p.chat a while back.
As others have mentioned, building a simple project is fairly simple. The difficulty comes when you want to scale to more than ~4 users without the app becoming unusable. Adjusting audio/video constraints to ensure that you get optimal media streams is also quite difficult, never mind dynamically tweaking them!
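For the constraint-tweaking part, applyConstraints works on a live track; a small sketch (values are illustrative, and the browser treats them as targets):

    const stream = await navigator.mediaDevices.getUserMedia({ video: true });
    const [track] = stream.getVideoTracks();
    await track.applyConstraints({ width: { ideal: 640 }, frameRate: { max: 24 } });
    console.log(track.getSettings()); // what was actually applied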
In real life, the STUN server rarely works on its own, and thus the myth of this peer-to-peer utopia was never realised, which is why WebRTC did not receive much attention.
A small group of friends and I are working on a virtual karaoke club using WebRTC and Go, https://github.com/ryanrolds/club. 100% agree that WebRTC makes it easy to create proofs of concept, but there are a lot of edge cases and browser differences that have to be worked through.
The current plan is to keep everyone in "groups" (think friends at a table) small, max 6. The server will maintain peer connections with everyone in the "room" and broadcast the singer via that peering. As the singer changes, the server will simply allow the KJ to pick who is getting broadcast over the other server -> client peers.
I really learnt so, so much from the entire thread discussion here today! WebRTC, I've gathered, is oftentimes easy to get started with, but the real challenge is in the scalability. From the way it seems, scaling is only possible with a forwarding architecture using a Selective Forwarding Unit or the like.
I always wonder if there is a way to think outside of this 'box'.
One thing I've wanted to make is something like gather.town where I can remix the audio and video so that different users sounded louder or quieter. But, I never figured out where in the WebRTC API that is done. It seems like I need to set up my own SFU and put the necessary logic over there.
Are you trying to adjust the volume of the remote audio streams? Can you change the volume property of the `audio` element in the DOM?
I think you could also do it with the WebAudio API. If you throw up a repo would love to try and help :) having a backend makes it so much harder to deploy/maintain stuff.
It's not in the WebRTC API. You can change the volume on the audio tags you create, or you can pipe the audio media stream into a WebAudio graph and modify it there.
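A minimal sketch of the WebAudio route (remoteStream would come from your ontrack handler; note that Chrome has historically needed the stream attached to a muted audio element before WebAudio receives any data):

    const ctx = new AudioContext();

    // Give each remote participant their own gain node so volumes can
    // be adjusted independently.
    function playWithVolume(remoteStream: MediaStream): GainNode {
      const src = ctx.createMediaStreamSource(remoteStream);
      const gain = ctx.createGain();
      src.connect(gain).connect(ctx.destination);
      return gain; // later: gain.gain.value = 0.25 for a quieter peer
    }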
I've built a number of WebRTC apps over the years. Recently, I built just such a thing as you described and open sourced it: https://www.calla.chat. I opted to build it on top of Jitsi Meet this time. It's actually advantageous that it's not through the WebRTC API because Jitsi doesn't give access to the raw WebRTC commands. But hijacking the audio elements it creates is completely doable.
Sadly, WebRTC p2p does not scale. Sadly, current HTML-based solutions for media servers are slow and highly CPU-intensive. Even more sadly, Adobe Flash solved the problem of multi-party video chat decades ago, but we decided to deprecate it without an alternative.
This article is a strange one. They mention WhatsApp and some other mobile products, but then proceed to frame everything in the context of a browser.
It's interesting to see the rise of unified communications again. All this technology, specifically WebRTC, has been around for a few years now. Innovation is minimal -- why? Because most of the problems are solved. When a technology is mature, most of the focus is on security or on applying other technologies to improve it, such as machine learning. VoIP and video apps are very mature, having been built up since the inception of H.323, SIP, SCCP, RTP, SRTP, and most recently JS and WebRTC.