
What part of that calculation is incorrect in your view?

> 380 UDP packets sent from source to recipient during a 1 minute call, and a handful of TCP packets to whatsapp's servers. That would yield a transmission overhead of about 2.2kbps.

That sounds like way too many packets! 380 packets per second, at 40 bytes of overhead per packet, would be about 120 kbps.

My calculation assumes just 50 packets per second, and that's already quite a high packet rate.
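For anyone who wants to check the arithmetic, here is a minimal sketch. The 40 bytes per packet is the figure used above; reading it as 20 B IPv4 + 8 B UDP + 12 B RTP (no options or extensions) is my assumption, and `overhead_kbps` is just an illustrative helper name.

```python
# Assumption: 40 bytes of header overhead per packet, as used in this comment
# (for example 20 B IPv4 + 8 B UDP + 12 B RTP, without options or extensions).
OVERHEAD_BYTES = 40


def overhead_kbps(packets_per_second: float,
                  overhead_bytes: int = OVERHEAD_BYTES) -> float:
    """Header overhead in kbit/s (1 kbit = 1000 bit) at a given packet rate."""
    return packets_per_second * overhead_bytes * 8 / 1000


if __name__ == "__main__":
    # 380 packets/s is the (mis)reading discussed here; 50 packets/s is a 20 ms ptime.
    for pps in (380, 50):
        print(f"{pps:4d} packets/s -> {overhead_kbps(pps):6.1f} kbps of header overhead")
```

At 50 packets per second the headers alone come to 16 kbps, which is already more than many low-bitrate codecs spend on the audio itself.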

> you can rely on protocols that have less error correction

You could, but there's no way to get a regular smartphone IP stack running over Wi-Fi or mobile data to actually expose that capability to you. Even just getting the OS's UDP stack (to say nothing of middleboxes) to ignore UDP checksums and let you use those extra two bytes for data can be tricky.

Non-IP protocols, or even just IP or UDP header compression, are completely out of reach for an OTT application. (Networks might transparently do it; I'm pretty sure they'd still charge based on the gross data rate though, and as soon as the traffic leaves their core network, it'll be back to regular RTP over UDP over IP.)

What they could do (and I suspect they might already be doing) is to compress RTP headers (or use something other than RTP) and/or pick even lower packet rates.

> I don't think it's unreasonable to assume this could reduce their total audio-sourced bandwidth consumption by a considerable amount while maintaining/improving reliability and perceived "quality".

I definitely don't agree on the latter assertion – packet loss resilience is a huge deal for perceived quality! I'm just a bit more pessimistic on the former, unless they do the other optimizations mentioned above.



I think you’re misreading OP, as he says 380 packets per minute, not per second. At 40 bytes per packet, that would give you an overhead of about 253 bytes per second (roughly 2 kbps), which sounds a lot more reasonable.


Yes, 380/min = ~6/s, which is a very long ptime of >100 ms. This can also be dynamic and change on the fly. It ultimately comes down to how big the packet can be before it gets split, which is a function of MTU.

If you have 50ms of latency between parties, and you are sending 150ms segments, you'll have a perceived latency of ~200ms which is tolerable for voice conversations.
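A rough sketch of those numbers, in case it helps. The 1500-byte MTU, the 40 bytes of headers and the 6 kbps codec bitrate are all assumptions picked for illustration, not anything WhatsApp has published, and the helper names are made up.

```python
MTU_BYTES = 1500     # assumption: common Ethernet/Wi-Fi MTU
HEADER_BYTES = 40    # assumption: IPv4 + UDP + RTP, no options or extensions


def ptime_ms(packets: int, duration_s: float) -> float:
    """Average amount of audio carried per packet, in milliseconds."""
    return duration_s * 1000 / packets


def payload_bytes(codec_kbps: float, ptime: float) -> float:
    """Codec payload per packet for a constant-bitrate codec."""
    return codec_kbps * 1000 / 8 * ptime / 1000


def fits_in_mtu(codec_kbps: float, ptime: float) -> bool:
    """True if payload plus headers stays within one unfragmented packet."""
    return payload_bytes(codec_kbps, ptime) + HEADER_BYTES <= MTU_BYTES


def rough_delay_ms(network_ms: float, ptime: float) -> float:
    """Network latency plus packetization time; ignores jitter buffer and codec delay."""
    return network_ms + ptime


if __name__ == "__main__":
    p = ptime_ms(380, 60)  # 380 packets over a 60 s call
    print(f"ptime: {p:.0f} ms, payload at 6 kbps: {payload_bytes(6, p):.0f} bytes")
    print(f"fits in MTU: {fits_in_mtu(6, p)}")
    print(f"rough one-way delay at 50 ms network latency: {rough_delay_ms(50, 150):.0f} ms")
```

At these bitrates the MTU is nowhere near the limiting factor; the latency budget runs out long before the packet gets big enough to be split.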

One other note: this applies ONLY to live voice communication such as calling, where two parties need to hear and respond within a reasonable delay. For downloaded audio messages or audio in videos, including one-way livestreams, ptime is irrelevant and you're not encapsulating with SRTP; that is just for VoIP-like live audio.

There is some truth in what OP posted: there are diminishing returns in actual gains as the bitrate gets lower. But modern voice implementations in apps like WhatsApp use dynamic ptime and are very smart about adapting the voice stream to latency, packet loss and bandwidth.


In my personal experience, WhatsApp's calling is subpar compared to FaceTime audio, Skype or even VoWiFi: higher latency, lower sound quality and very sensitive to spotty connections.


Wow, that would be an extremely low packet rate indeed!

That would definitely increase the utility of low bitrate codecs by a lot, at the expense of some latency (which is probably ok, if the alternative is not having the call at all).
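To put a number on "by a lot", here is a small sketch of how the header share of the total bitrate shrinks as ptime grows. The 40-byte header and the 6 kbps codec are assumptions chosen for illustration.

```python
HEADER_BYTES = 40   # assumption: IPv4 + UDP + RTP headers per packet
CODEC_KBPS = 6.0    # assumption: a hypothetical low-bitrate voice codec


def header_kbps(ptime_ms: float) -> float:
    """Header bitrate in kbps when one packet is sent every ptime_ms milliseconds."""
    return HEADER_BYTES * 8 / ptime_ms  # bits per millisecond equals kbit/s


def overhead_share(ptime_ms: float, codec_kbps: float = CODEC_KBPS) -> float:
    """Fraction of the total bitrate (headers + codec) spent on headers."""
    headers = header_kbps(ptime_ms)
    return headers / (headers + codec_kbps)


if __name__ == "__main__":
    for ptime in (20, 60, 160):
        print(f"ptime {ptime:3d} ms: headers {header_kbps(ptime):5.2f} kbps, "
              f"{overhead_share(ptime):.0%} of the total")
```

With these assumptions, headers go from roughly three quarters of the stream at 20 ms to a quarter at 160 ms, and the price is the extra packetization delay.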


I read it as 380 packets for the whole call, which was a minute long, not 380 packets per second during 1 minute.


That's about 160 ms of audio per packet. That's a lot of latency to add before you even hit the network.


Assuming continuous sound. You don’t need to send many packets for silence.
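A quick sketch of how much that changes the picture. It assumes packets are sent only while the sender is actually talking and that no comfort-noise packets are sent at all, which is a simplification; the speech fractions are made-up illustrative values.

```python
def implied_ptime_ms(packets: int, call_s: float, speech_fraction: float) -> float:
    """Audio per packet if packets are only sent during speech.

    Assumption: silence produces no packets at all (no comfort-noise updates).
    """
    if not 0 < speech_fraction <= 1:
        raise ValueError("speech_fraction must be in (0, 1]")
    return call_s * speech_fraction * 1000 / packets


if __name__ == "__main__":
    for fraction in (1.0, 0.5, 0.3):
        ptime = implied_ptime_ms(380, 60, fraction)
        print(f"talking {fraction:.0%} of the call -> about {ptime:.0f} ms per packet")
```

So if the sender only spoke for half of the minute, the same 380 packets would imply something closer to 80 ms per packet than 160 ms.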


Voice activity detection and comfort noise have been available in VoIP since the very beginning, but now I wonder if there's some clever optimization that could be done based on a semantic understanding of conversational patterns:

During longer monologues, decrease packet rates; for interruptions, send a few early samples of the interrupter to notify the speaker, and at the same time make the (former) speaker's stack flush its cache to allow "acknowledgement" of the interruption through silence.

In other words, modulate the packet rate in proportion to the instantaneous interactivity of a dialogue, which allows spending the "overhead budget" where it matters most.
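Purely as a thought experiment, the idea could be sketched like this. Everything here is hypothetical: the interactivity score, the 10-second window, the rate limits and the function names are invented for illustration and do not describe any existing VoIP stack.

```python
MIN_PPS = 6.25   # hypothetical floor: one packet per 160 ms during monologues
MAX_PPS = 50.0   # hypothetical ceiling: one packet per 20 ms during rapid exchanges


def interactivity_from_turns(turn_starts_s: list[float], now_s: float,
                             window_s: float = 10.0,
                             saturating_turns: int = 5) -> float:
    """Crude interactivity score in [0, 1]: speaker changes in the recent window."""
    recent = [t for t in turn_starts_s if now_s - window_s <= t <= now_s]
    return min(1.0, len(recent) / saturating_turns)


def choose_packet_rate(interactivity: float) -> float:
    """Scale the packet rate linearly between the floor and ceiling."""
    x = min(1.0, max(0.0, interactivity))
    return MIN_PPS + x * (MAX_PPS - MIN_PPS)


if __name__ == "__main__":
    monologue = interactivity_from_turns([2.0], now_s=30.0)
    banter = interactivity_from_turns([22.0, 24.0, 25.5, 27.0, 29.0], now_s=30.0)
    for label, score in (("monologue", monologue), ("banter", banter)):
        pps = choose_packet_rate(score)
        print(f"{label}: {pps:.2f} packets/s, ptime {1000 / pps:.0f} ms")
```

A real implementation would also need the interruption handling described above (early samples from the interrupter, flushing the speaker's buffer), which this sketch leaves out.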


pretty sure they said 380 packets total in the 1 minute call (~6-7/s)



