> I bet that WhatsApp is one of the rare services you use which actually deployed servers to Australia. To me, 200ms is a telltale sign of intercontinental traffic.
So, I used to work at WhatsApp. And we got this kind of praise when we only had servers in Reston, Virginia (not at AWS us-east-1, but in the same neighborhood). Nowadays, Facebook is most likely terminating connections in Australia, but messaging probably still goes through another continent. Calling within Australia should stay local though (either p2p or through a nearby relay).
There's lots of things WhatsApp does to improve experience on low quality networks that other services don't (even when we worked in the same buildings and told them they should consider things!)
In no particular order:
0) offline first, phone is the source of truth, although there's multi-device now. You don't need to be online to read messages you have, or to write messages to be sent whenever you're online. Email used to work like this for everyone; it was no big deal to grab mail once in a while, read it and reply, and then send in a batch. Online messaging is great, if you can get it, but for things like being on a commuter train where connectivity ebbs and flows, it's nice to pick up messages when you can. (See the outbox sketch after this list.)
a) hardcode fallback IPs for when DNS doesn't work (not if); see the connect sketch after this list
b) set up "0-RTT" fast resume, so you can start getting messages on the second round trip. This is part of Noise Pipes (or whatever they're called) and TLS 1.3. (Resumption sketch after this list.)
c) do reasonable-ish things to work with MTU (MSS clamp sketch after this list). In the old days, FreeBSD reflected the client's MSS back to it, which helps when there's a tunnel like PPPoE that only modifies outgoing SYNs and not incoming SYN+ACKs. Linux never did that, and afaik FreeBSD took it out. Behind Facebook infrastructure, they just hardcode the MSS for, I think, a 1480-byte MTU (you can/should check with tcpdump). I did some limited testing, and really the best results come from monitoring for /24's with bad behavior (it's pretty easy if you look for it --- the client never got any large packets, and packet gaps are a multiple of the MSS minus the space for TCP timestamps) and then sending back the client's MSS - 20 to those; you could also just always send back client - 20. I think Android finally started doing pMTUD blackhole detection stuff a couple years back; Apple has been doing it really well for longer. Path MTU Discovery is still an issue, and anything you can do to make it happier is good.
d) connect in the background to exchange messages when possible. Don't post notifications unless the message content is on the device. Don't be one of those apps that can only load messages from the network when the app is in the foreground, because the user might not have connectivity then
e) prioritize messages over telemetry. Don't measure everything, only measure things when you know what you'll do with the numbers. Everybody hates telemetry, but it can be super useful as a developer. But if you've got giant telemetry packs to upload, that's bad by itself, and if you do them before you get messages in and out, you're failing the user.
f) pay attention to how big things are on the wire. Not everything needs to get shrunk as much as possible, but login needs to be very tight, and message sending should be too. IMHO, HTTP and JSON and XML are too bulky for those, but they're OK for multimedia, because the payload is big so framing doesn't matter as much, and they're OK for low-volume services because they're low volume. (Framing sketch after this list.)
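To make (0) concrete, here's a minimal offline-first outbox sketch in Python. The local store is the source of truth: reads never need the network, writes always succeed offline, and queued messages flush whenever connectivity shows up. The schema and names are hypothetical, not WhatsApp's actual design.

```python
import sqlite3
import time

class Outbox:
    def __init__(self, path="messages.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            " id INTEGER PRIMARY KEY, recipient TEXT, body TEXT,"
            " created REAL, sent INTEGER DEFAULT 0)"
        )

    def queue(self, recipient, body):
        # Writing only touches local storage; it succeeds whether or not
        # the device is online.
        self.db.execute(
            "INSERT INTO outbox (recipient, body, created) VALUES (?, ?, ?)",
            (recipient, body, time.time()),
        )
        self.db.commit()

    def flush(self, send):
        # Call whenever connectivity appears; `send` is whatever transport
        # callable the app has. A failure leaves the row queued for next time.
        rows = self.db.execute(
            "SELECT id, recipient, body FROM outbox WHERE sent = 0 ORDER BY id"
        ).fetchall()
        for row_id, recipient, body in rows:
            try:
                send(recipient, body)
            except OSError:
                break  # network went away again; retry on the next flush
            self.db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
            self.db.commit()
```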
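For (a), a sketch of connect-time fallback: try DNS first, and if resolution fails, walk a baked-in address list. The hostname and IPs are placeholders (documentation ranges), not WhatsApp's.

```python
import socket

HOST, PORT = "chat.example.com", 5222
FALLBACK_IPS = ["203.0.113.10", "203.0.113.11"]  # placeholder addresses

def connect(timeout=10):
    try:
        candidates = [ai[4][0] for ai in
                      socket.getaddrinfo(HOST, PORT, type=socket.SOCK_STREAM)]
    except socket.gaierror:
        candidates = FALLBACK_IPS  # DNS is broken: "when", not "if"
    last_err = None
    for ip in candidates:
        try:
            return socket.create_connection((ip, PORT), timeout=timeout)
        except OSError as err:
            last_err = err  # try the next address
    raise last_err
```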
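For (b), the closest the Python stdlib gets is TLS session resumption: the ssl module doesn't expose TLS 1.3 0-RTT early data, so treat this as an approximation of the idea (skip the full handshake on reconnect) rather than a real 0-RTT setup.

```python
import socket
import ssl

ctx = ssl.create_default_context()
saved_session = None

def tls_connect(host, port=443):
    raw = socket.create_connection((host, port), timeout=10)
    # Passing the previous session (if any) requests an abbreviated handshake.
    return ctx.wrap_socket(raw, server_hostname=host, session=saved_session)

conn = tls_connect("example.com")  # full handshake
conn.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
conn.recv(1024)  # with TLS 1.3 the session ticket may only arrive after some I/O
saved_session = conn.session  # stash the ticket for next time
conn.close()

conn = tls_connect("example.com")  # abbreviated handshake if the server resumed
print("resumed:", conn.session_reused)
conn.close()
```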
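For (c), a server-side sketch of MSS clamping on Linux using TCP_MAXSEG. The 1440 assumes a 1480-byte path MTU minus 40 bytes of IPv4+TCP headers; the adaptive per-/24 approach described above would compute the value from observed client behavior instead of hardcoding it.

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Set before listen(): accepted sockets inherit the clamp and advertise
# at most this MSS to clients.
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, 1440)
srv.bind(("0.0.0.0", 5222))
srv.listen(128)
```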
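For (f), a toy byte count of the same message framed as JSON versus a length-prefixed binary form. The binary layout is invented for illustration; it is not WhatsApp's actual framing.

```python
import json
import struct

msg = {"to": "15551234567", "type": "text", "body": "on my way"}

as_json = json.dumps(msg).encode()

# Invented layout: 1-byte type tag, 1-byte recipient length, recipient,
# 2-byte body length, body.
dest, body = msg["to"].encode(), msg["body"].encode()
as_binary = struct.pack(f"!BB{len(dest)}sH{len(body)}s",
                        1, len(dest), dest, len(body), body)

print(len(as_json), "bytes as JSON")      # 58 bytes
print(len(as_binary), "bytes as binary")  # 24 bytes
```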
WhatsApp is (or was) using XMPP for the chat part too, right?
When I was IT person on a research ship, WhatsApp was a nice easy one to get working with our "50+ people sharing two 256kbps uplinks" internet. Big part of that was being able to QoS prioritise the XMPP traffic which WhatsApp was a big part of.
Not having to come up with filters for HTTPS for IP ranges belonging to general-use CDNs that managed to hit the right blocks used by that app, was a definite boon. That, and the fact XMPP was nice and lightweight.
As far as I know Google Cloud Messaging (GCN? GCM? firebase? Play notifications? Notifications by Google? Google Play Android Notifications Service?) also did/does use XMPP, so we often had the bizarre and infuriating situation of very fast notifications _where sometimes the content was in the notification_, but when you clicked on it, other apps would fail to load it due to the congestion and latency and hardcoded timeouts TFA mentions... argh.
But WhatsApp pretty much always worked, as long as the ship had an active WAN connection... And that kept us all happy, because we could reach our families.
> WhatsApp is (or was) using XMPP for the chat part too, right?
It's not exactly XMPP. It started with XMPP, but XML is big, so it's tokenized (some details are published in the European Market Access documentation), and there's no need for interop with standard XMPP clients, so the login sequence is, I think, way different.
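A toy illustration of the tokenization idea: frequent tag and attribute names collapse to single bytes, with a length-prefixed escape for literal strings. The token table here is made up, and the real scheme differs in detail.

```python
# Hypothetical token table; the real one is much larger.
TOKENS = {"message": 0x01, "to": 0x02, "type": 0x03, "body": 0x04, "chat": 0x05}

def tok(s):
    # One byte for a known token, else a length-prefixed literal.
    if s in TOKENS:
        return bytes([TOKENS[s]])
    raw = s.encode()
    return bytes([0xFF, len(raw)]) + raw

stanza = b"".join([tok("message"), tok("to"), tok("15551234567"),
                   tok("type"), tok("chat"), tok("body"), tok("hello")])

xml = b'<message to="15551234567" type="chat"><body>hello</body></message>'
print(len(xml), "bytes as XML ->", len(stanza), "bytes tokenized")  # 66 -> 25
```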
But it runs on port 5222 by default, I think (with fallbacks to ports 443 and 80).
I think GCM (or whatever it's called today) is plain XMPP (including, optionally, on the server-to-server side), and runs on ports 5228-5230. Not sure what protocol Apple push is, but they use port 5223, which is affiliated with XMPP over TLS.
So I think using a non-443 port was helpful for your QoS? But being available on port 443 is helpful for getting through blanket firewall rules. AOL used to run AIM on all the ports, which is even better at getting through firewalls.
I once got asked "what was a life changing company/product" and my answer was WhatsApp - to slightly bemused looks.
WhatsApp connected the world for free. Obviously they weren't the first to try, but when my (very globally distributed) family picked up WhatsApp in '09/'10, we knew we were onto something different. Being able to stay in touch with my brother halfway across the world in realtime was very special. Nothing else at the time really competed. SMS was expensive and had latency. Email felt clunky and oddly formal; email clients don't feel "chatty". MSN was crap on mobile, and you both had to be online. Ditto for Skype. For calls, we even used to do this odd VOIP bridge where you would each call an endpoint for cheap international phone calls.
Meanwhile in 2012, I was able to install WhatsApp on my mum's old Nokia Symbian feature phone and use it on a pay-as-you-go SIM plan in Singapore, communicating over WAP. The data consumption was so low I basically survived 2 months on maybe 1-2 top-ups. Compare that with the other day, when I turned on roaming on my phone (so I could connect to Singtel to BUY a roaming package) and it passively fetched ~50+ MB in seconds; I was hit with SGD 400 of data charges (I was able to get them refunded).
I am very grateful to all the work and thought WhatsApp put into building an affordable global resilient communication network and I hope every one of the people involved got the payout they deserve.
This is a big one that makes low-bandwidth connections unusable in a lot of apps. The deluge of ad/tracking/telemetry SDKs' requests all being fired in parallel with the main business-logic requests makes them all saturate the slow pipe and usually leads to all of them timing out. By being third-party SDKs they may not even give you control of the underlying network requests nor the ability to buffer/delay/cache those requests.
One advantage of being Facebook in this case is that they're the masters of spyware and are unlikely to need to embed third-party spyware, so they can blend tracking/telemetry traffic within their business logic traffic and apply prioritization, including buffering any telemetry and sending it during less critical times.
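A sketch of what that blending can look like: one outbound pump that always drains user messages first, and lets at most one buffered telemetry item ride along per pump so a backlog can never starve a message. Names are illustrative, not Facebook's actual code.

```python
import collections

class PrioritizedSender:
    def __init__(self, transport):
        self.transport = transport  # callable taking bytes
        self.messages = collections.deque()
        self.telemetry = collections.deque()

    def send_message(self, payload):
        self.messages.append(payload)

    def record_metric(self, payload):
        self.telemetry.append(payload)  # buffered; never urgent

    def pump(self):
        # Drain all user-visible traffic first.
        while self.messages:
            self.transport(self.messages.popleft())
        # Telemetry goes out only once messages are clear, one item per
        # pump, so a metrics backlog can't delay the next message.
        if self.telemetry:
            self.transport(self.telemetry.popleft())
```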
> Why is there WhatsApp for most commonly used devices, but not iPads?
I was frustrated by this a while back, so I asked the PMs. Basically when investing engineering effort WhatsApp prioritises the overall number of users connected, and supporting iPads doesn't really move that metric, because (a) the vast majority of iPad owners also own a smartphone, and (b) iPads are pretty rare outside of wealthy western cities.
I've been gone too long for accurate answers, but I can guess.
For iPad, I think it's as the sibling comments note: expected use is very low, so it didn't justify the engineering cost while I was there. But I see some signs it might happen eventually [1]; WhatsApp for Android Tablets wasn't a thing when I was there either, but it is now.
For the four-device limit, there are a few things going on IMHO. Synchronization is hard, and the more devices are playing, the harder it is. Independent devices make it easier in some ways, because the user's devices don't have to be online together to communicate (as they did when WhatsApp Web was essentially a remote control for your phone), but it does mean that all of your communication partners' devices have to work harder, and the servers have to work harder, too.
Four devices cover your phone, desktops at home and at work, and a laptop; but really, most users only have a phone. Allowing more devices makes it more likely that you'll lose track of one, or not use it for long enough that it loses sync, etc.
WhatsApp has usually focused on product features that benefit the most users, and more than 4 devices isn't going to benefit many people, and 4 is plenty for internal use (phone, prod build, dev build, home computer). I'm sure they've got metrics of how many devices are used, and if there's a lot of 4 device users and enough requests, it's a #define somewhere.
Yeah, it's very, very noticeable that WhatsApp is architected in a way that makes the experience great in all kinds of poor-connectivity scenarios where most other software just... isn't.