What part of that calculation is incorrect in your view? > 380 UDP packets sent ...

markus92 · on June 13, 2024

I think you’re misreading OP, as he says 380 packets per minute, not second. That would give you an overhead of 253 bytes per second, sounds a lot more reasonable.

tgtweak · on June 13, 2024

Yes 380/min = ~6/s which is a very open ptime of >100ms, this can also be dynamic and change don the fly. It ultimately comes down to how big the packet can be before it gets split which is a function of MTU.

If you have 50ms of latency between parties, and you are sending 150ms segments, you'll have a perceived latency of ~200ms which is tolerable for voice conversations.

One other note is that this is ONLY for live voice communication like calling where two parties need to hear and respond within a resonable delay - for downloading of audio messages or audio on videos, including one-way livestreams for example, this ptime is irrelevant and you're not encapsulating with SRTP - that is just for voip-like live audio.

There is a reality in what OP posted which is that there is diminishing returns in actual gains as you get lower in the bitrate, but modern voice implementations in apps like whatsapp are using dynamic ptime and are very smart about adapting the voice stream to account for latency, packet loss and bandwidth.

markus92 · on June 14, 2024

In my personal experience, Whatsapp's calling is subpar compared to Facetime audio, Skype or VoWIFI even. Higher latency, lower sound quality and very sensitive to spotty connections.

lxgr · on June 13, 2024

Wow, that would be an extremely low packet rate indeed!

That would definitely increase the utility of low bitrate codecs by a lot, at the expense of some latency (which is probably ok, if the alternative is not having the call at all).

roman-holovin · on June 13, 2024

I read it as in 380 packets per whole call, which was a minute long, not 380 packets per second during 1 minute.

mikepavone · on June 13, 2024

That's about 160 ms of audio per packet. That's a lot of latency to add before you even hit the network

ant6n · on June 13, 2024

Assuming continuous sound. You don’t need to send many packets for silence.

lxgr · on June 13, 2024

Voice activity detection and comfort noise have been available in VoIP since the very beginning, but now I wonder if there's some clever optimization that could be done based on a semantic understanding of conversational patterns:

During longer monologues, decrease packet rates; for interruptions, send a few early samples of the interrupter to notify the speaker, and at the same time make the (former) speaker's stack flush its cache to allow "acknowledgement" of the interruption through silence.

In other words, modulate the packet rate in proportion to the instantaneous interactivity of a dialogue, which allows spending the "overhead budget" where it matters most.

newobj · on June 13, 2024

pretty sure they said 380 packets total in the 1 minute call (~6-7/s)