
I've seen first-hand validation on massive workloads moving to Graviton-based instances. This includes low-latency, high-TPS Java services and offline big-data compute on EMR.

All combined, the hype is quite real. Heck, even moving an Intel-based service to newer Nitro-based EC2 instances resulted in a drastic performance improvement. Moved from m5.24xlarge --> m6g.8xlarge with better service performance and improved latency characteristics. Intel is in trouble in my opinion.



m5 instances use the Nitro system. In addition, m5.24xlarge is a quite quirky instance type: it uses 2 CPUs with 24 cores each in a NUMA configuration. Half of the RAM is attached to each CPU, and access from the other CPU is much slower. On top of that, the CPU cores use a microarchitecture from 8 years ago, so the cores are quite slow in practice.

All of this means that a lot can go wrong when running code on those instances, resulting in lower performance. It is advisable either to run a separate process on each NUMA domain or to use NUMA-aware code (which Java code almost never is). In addition, the code (or the system) needs to scale well across many CPU cores.
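As a sketch of the first option (assuming a hypothetical service.jar and a two-node box like m5.24xlarge), you can pin one process per NUMA node with numactl:

    numactl --cpunodebind=0 --membind=0 java -jar service.jar &
    numactl --cpunodebind=1 --membind=1 java -jar service.jar &

HotSpot also has a -XX:+UseNUMA flag that makes the parallel (and, on newer JDKs, G1) collector allocate heap memory node-locally, but that only covers the heap, not locks or shared data structures.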

In addition, the cores are old enough to suffer from the Spectre/Meltdown patches and workarounds, which especially hurt syscall performance.


In our case that instance type is just about the only workhorse for the job: high TPS (scales well with the core count) plus a need for a large on-disk configuration for low-latency key-value retrieval of data deployed on disk.

Having seen your reply, I realize I slightly misspoke about the instance move. We moved from m5.24xlarge to m6i.16xlarge. Sorry for the confusion.

That said, you shared some interesting information. I'd love to read up more on this; is there any specific place I can dig a bit deeper into the finer details of these instance types and architectures?


Just to note: m6i instances are Intel-based.

As for getting information on AWS instances, the best way in my opinion is just to spin up the instance and look up which exact CPU model it uses. Then you can go for example to WikiChip (https://en.wikichip.org/wiki/WikiChip) to see more information about the CPU. Other good sources include Anandtech (for example https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...) and Chips and Cheese (for example https://chipsandcheese.com/2022/05/29/graviton-3-first-impre...).

Things like NUMA configuration can be inspected with tools like numactl.
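For example (all real flags, illustrative usage):

    numactl --hardware   # nodes, CPU-to-node mapping, per-node memory, node distances
    numactl --show       # NUMA policy of the current shell/process
    lscpu                # core/thread topology, including NUMA node lists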


Yes, I'm aware. The service in question couldn't easily be moved, so we went to m6i, which isn't ARM-based but does leverage Nitro. We saw substantial improvements in that configuration too. I'm not sure what the difference is, because you said m5 uses Nitro as well; my assumption was that m6i with reduced hypervisor overhead from Nitro was why we saw the improvement.


m6i is a much newer CPU architecture, based on Intel Ice Lake rather than Skylake. It is quite significantly faster just from that alone. In addition, the CPU has about 10% higher clock speed.

The 16xlarge version is also a 32-core, single-socket CPU, meaning there should be no issues with NUMA. I would expect it to be much better than m5.24xlarge in most applications, given the much faster single-threaded performance. Of course, nothing beats benchmarking and measuring yourself.

I have personally seen issues with NUMA systems and code that theoretically parallelizes very well. Any synchronized mutable state becomes an issue on these kinds of systems. For example, I have had an issue where third-party code used the C "rand" function for randomness. Even though this was not in a hot code path, on m5.24xlarge >90% of the execution time was spent just on the lock guarding the internal random state. On a "normal" system with fewer cores this never showed up while profiling.
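To make that concrete, here's a minimal C sketch of the failure mode (my own repro, assuming glibc behavior; not the actual third-party code). Compile with cc -pthread:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>

    enum { THREADS = 48, ITERS = 1000000 };

    /* Contended: glibc's rand() guards one hidden global state with a
       lock, so all 48 cores (across 2 NUMA nodes) serialize on it. */
    static void *worker_shared(void *arg) {
        (void)arg;
        unsigned long sum = 0;
        for (int i = 0; i < ITERS; i++)
            sum += (unsigned long)rand();
        return (void *)sum;
    }

    /* Fix: rand_r() keeps its state in a caller-owned variable, so each
       thread stays on its own cache line and never takes a lock. */
    static void *worker_local(void *arg) {
        unsigned int seed = (unsigned int)(uintptr_t)arg;
        unsigned long sum = 0;
        for (int i = 0; i < ITERS; i++)
            sum += (unsigned long)rand_r(&seed);
        return (void *)sum;
    }

    int main(void) {
        pthread_t t[THREADS];
        /* swap in worker_shared to see the lock contention in a profiler */
        for (int i = 0; i < THREADS; i++)
            pthread_create(&t[i], NULL, worker_local, (void *)(uintptr_t)(i + 1));
        for (int i = 0; i < THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }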


> Moved from m5.24xlarge --> m6g.8xlarge with better service performance and improved latency characteristics. Intel is in trouble in my opinion.

I wonder if this is actually an Intel issue or if there are some other optimizations at play, such as in the virtualization layer.

Because at one point I wanted to try JetBrains' new "Gateway" product, which basically runs a remote IDE and only shows the GUI locally. I was curious on the one hand, but I also wanted a machine with a bit more oomph for my occasional compilation needs (Rust on Linux, fwiw). I was really unimpressed: the c6i was comparable to my slim laptop running an 11th-gen Intel i7 U-series part, and my similar slim AMD 5650U laptop is actually faster. IIRC, the c6i.metal wasn't particularly faster at this kind of single-threaded work either.


The difference is in the pricing and the fact that the cores are "whole".

On Intel-based AWS instances, you pay per hyperthread. On Graviton, you pay per physical core.

But on this kind of workload, and with modern schedulers, the HT bump is rather limited. So in practice you are paying twice the price for the same number of cores.

This is the biggest contributing factor to that difference, and I keep being surprised that no one mentions it.


Not sure what you're talking about: on AWS x86 you pay by core (well, as much as you pay by core with ARM anyway; you can't just buy a 1 GB server with 64 cores).


AWS x64 "cores" are the virtual cores you see on hyperthreaded CPUs and map 2:1 to physical cores on the CPU, but the AWS ARM offering doesn't have hyperthreading, so its virtual cores map 1:1 to CPU cores.

You can disable hyperthreading on the x64 instances, at the cost of halving the number of vCPUs available in the instance you paid for.
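As a sketch with the AWS CLI (CPU options are set at launch; <ami> and <subnet> are placeholders, and the other usual flags are omitted):

    # m5.24xlarge normally exposes 96 vCPUs (48 cores x 2 threads);
    # ThreadsPerCore=1 leaves 48 vCPUs mapping 1:1 to physical cores.
    aws ec2 run-instances \
        --instance-type m5.24xlarge \
        --cpu-options CoreCount=48,ThreadsPerCore=1 \
        --image-id <ami> --subnet-id <subnet>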


Yes, but this doesn't really matter: just multiply the cost by two in your accounting. It's not like you get one physical core plus 15 hyperthreads.


"Intel is in trouble" since the calxeda days and ARM is still insignificant to this day.


> ARM is still insignificant to this day.

Is it? The phone you use probably uses ARM. If you buy a Mac now, it's probably gonna be ARM. It's very much different from the Calxeda days!



