In my experience Ray in AWS is a good way to badly utilize resources and waste a lot of money (as is generally anything cloud or anything Python; when you do both, it multiplies).
We have a traditional cluster, and we were asked to investigate cloud usage.
Ended up with a bursting-type setup where you can use the cloud to spin up more nodes than we have in our cluster, or larger nodes than we have available.
Sounds great. Works okay. The problem was with billing.
Our traditional cluster has been around for a while, upgraded and added on to, and there is a bit of a handshake agreement over who pays how much, based on usage. Every year or two the company departments get together and either ask for less cost or take on more, based on usage stats. Each of them pays a fixed cost per year.
With the cloud setup, every job came back with charges from our cloud provider: "User C consumed 48 hours of usage on 150 nodes of type Z. That will be $900."
After just a few short months they asked us to turn that capability off.
In my experience Ray in AWS is a good way to badly utilize resources and waste a lot of money
That pretty much sums up Autodesk's relationship with Amazon in its entirety. OTOH if operations isn't your core competency, it's still money well spent.
Edit: Before smashing the vote button I'd talk to someone at Autodesk about just how little oversight goes into their AWS usage and how chaotic the billing is.
You can build a SLURM cluster out of UltraCluster nodes in AWS. Money comparisons can be misleading because many people ignore ancillary expenses in running an HPC facility.
Virtual machines perform extremely poorly, so you must take metal instances. These will cost you the same as buying the hardware outright after 3 months of usage.
And you're still stuck on a non-deterministic high-latency network you can't get rid of, and with very limited hardware configurations.
It's more like a grid than an HPC cluster.
There are only two possible advantages:
- you want a lot of hardware very quickly rather than wait for it to be delivered.
- you don't have the desire/capability to be/hire a network engineer.
When you say "virtual machines perform extremely poorly", on what do you base that?
(Note: I've worked in supercomputing and HPC for over two decades.)
The network I was talking about is called UltraCluster, which has extremely high bandwidth and low latency, designed to get great scaling on MPI jobs (as well as ML). Typical instances used with UltraCluster are p5, which have 8 NVIDIA H100 GPUs, 192 vCPUs, 2 TB RAM, 3.2 Tbps network bandwidth per machine, 900 GB/s between GPU peers, and 8x 3.84 TB SSDs. They are not marketed as metal instances.
No, it's not like a grid. Your thinking is dated and not representative of how people do HPC on AWS, Azure, or Google.
It seems how people do HPC on AWS is limited by what AWS can do (and maybe costs). Our experience was that even the elastic feature wasn't really elastic, and we often couldn't get resources anyway.
Maybe dated, but for context, we had 2TB and 128 real cores a decade ago, and I currently work with Summit-type hardware; I'd rather not admit after how long.
Looking into the UltraClusters page you linked to in a sibling comment, it seems like the host machines pretty much fill out their PCIe connections with InfiniBand networking to reach that figure:
EFA is also coupled with NVIDIA GPUDirect RDMA (P5, P4d) and
NeuronLink (Trn1) to enable low-latency accelerator-to-accelerator
communication between servers with operating system bypass.
If you care about correct NUMA and HyperThreading usage, and even more so if you care about latency on the CPU (for example, for real-time trading), the only things that perform well are either metal or full-machine-but-with-hypervisor instances.
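To make the NUMA/HyperThreading point concrete, here is a minimal Linux-only sketch (mine, not from the comment above; node 0 and the sysfs layout are assumptions about the host) that pins the current process to one CPU per physical core on a single NUMA node:

    import os

    def physical_cores_of_node(node=0):
        # One CPU id per physical core on the given NUMA node,
        # skipping hyperthread siblings.
        seen, cpus = set(), []
        node_path = f"/sys/devices/system/node/node{node}"
        for entry in sorted(os.listdir(node_path)):
            if not (entry.startswith("cpu") and entry[3:].isdigit()):
                continue
            cpu = int(entry[3:])
            with open(f"/sys/devices/system/cpu/cpu{cpu}/topology/core_id") as f:
                core = int(f.read())
            if core not in seen:
                seen.add(core)
                cpus.append(cpu)
        return cpus

    if __name__ == "__main__":
        os.sched_setaffinity(0, physical_cores_of_node(0))  # pin this process
        print("pinned to CPUs:", sorted(os.sched_getaffinity(0)))

Pinning like this only means something when you know the real topology underneath you, which is exactly what a shared virtual machine hides.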
Obviously nobody is going to run workloads that need to exploit these kinds of things on ECS instances. But these workloads are niche, not normal. Most code that's written and deployed to some notion of "production" is not CPU bound, it is I/O bound.
Many people ignore various actual or equivalent expenses in running on AWS (like missed optimization opportunities or bad scaling). It's old, but NASA did a study [1] that rings true and that I haven't seen rebutted. I saw it after an AWS HPC expert stated that you could scale tightly-coupled MPI applications like density functional theory (think 3D FFTs) on EC2; I eventually got him to admit that was rubbish -- and not the only rubbish claim. The "low latency" network is at least no better than what I measured intra-switch with the normal 1GbE NICs on Sun x2200s long ago -- an order of magnitude worse than real HPC fabrics.
1. https://www.nas.nasa.gov/assets/nas/pdf/papers/NAS_Technical...
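For reference, the point-to-point latency numbers being argued about here are usually taken with a ping-pong microbenchmark. A rough sketch, assuming mpi4py and NumPy are installed and the file (pingpong.py is a made-up name) is launched with mpirun -n 2 python pingpong.py:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    buf = np.zeros(8, dtype="b")   # tiny message: we care about latency, not bandwidth
    reps = 10000

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = MPI.Wtime() - t0

    if rank == 0:
        # one-way latency is half the round-trip time
        print(f"one-way latency: {elapsed / reps / 2 * 1e6:.2f} us")

The figures quoted in this thread -- roughly 15 µs for the UltraCluster adapters versus the low single-digit microseconds typical of dedicated HPC fabrics -- are the kind of thing this measures.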
Are you familiar with UltraCluster? https://aws.amazon.com/ec2/ultraclusters/
It's a product directly intended to compete with other HPC fabrics. I believe it was launched after your paper was published.
Even before then, AWS had ways of placing all your jobs on the same rack, which used same-rack switching. Whenever AWS told us stuff about HPC we tested it carefully, and I'd say that before UltraCluster, AWS was being misleading about performance, although Azure was not.
I don't do AWS any more, but the adapters apparently used with UltraCluster had a latency of 15 µs last I knew. I think the NAS benchmarks are overall less latency-sensitive than DFT and would benefit less from GPUs, and I assume UltraCluster is rather expensive even by AWS standards.
[Former employee of several years.] Generally speaking, I would say if Autodesk believes open source can advance a business goal, it will support that open source project. But I never got a sense of Autodesk caring about open source culturally or in a deeply embedded way. High marks for how they treat employees and as a place to work, but not an open-source leader.
Off topic, but: I currently need to use AutoCAD for a work project; any recommendations for learning AutoLISP, or whatever the scripting language is? I was having trouble finding a good resource for this; it seems like everyone just knows it, and the only way to learn is to have good enough Google-fu to find the forum post that fulfills your criteria. And hope it's not out of date.
Autodesk as an employer has fallen quite a bit from even a decade ago. Despite the crabs-in-a-bucket culture of playing at empire building, Bass's tenure was marked by generally employee-friendly policies and a sincere passion behind some of their projects. The same definitely can't be said for the marketing dweebs that took over (and judging by Fusion, they've ramped up their anti-customer initiatives too).
The shift towards subscription-based bullshit was essentially the start of the effort to oust Bass.
[Current employee of Autodesk] Minor contributions by employees to existing OSS are generally encouraged day-to-day.
There are also more strategic open source initiatives such as the USD stuff covered here.
Many of us, especially in the research division, would love to put more code out there (for example, some tools that we use internally). The good news on this front is that there is now a sanctioned process for this to happen, and the attitude seems much warmer than when I joined a decade ago.
I’m personally involved in trying to open source some of my own work in the robotics domain, and have been pleasantly surprised with the response.
I do not quite get this. How does this enable someone to run Ray or Metaflow on a typical batch-scheduled HPC system (Slurm or the like)? Inter-node communication is done via the Lustre file system, right?
Metaflow integrates with AWS Batch, which many folks use for serious HPC. Inter-node scheduling happens through the multi-node scheduling supported by AWS Batch; networking is via EFA, etc.
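As a hedged sketch of what that looks like from the user's side (the flow name, resource numbers, and step bodies below are made up; the @batch decorator is Metaflow's AWS Batch integration), a single step can be pushed to Batch like this:

    from metaflow import FlowSpec, step, batch

    class BurstFlow(FlowSpec):

        @step
        def start(self):
            self.shards = list(range(4))
            self.next(self.crunch, foreach="shards")

        @batch(cpu=8, memory=32000)   # this step runs in a container on AWS Batch
        @step
        def crunch(self):
            self.result = self.input ** 2   # placeholder for the real work
            self.next(self.join)

        @step
        def join(self, inputs):
            self.results = [i.result for i in inputs]
            self.next(self.end)

        @step
        def end(self):
            print(self.results)

    if __name__ == "__main__":
        BurstFlow()

You run it like any other Metaflow flow (python burst_flow.py run), assuming AWS credentials and a Batch compute environment are already configured.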
I think it said that data access is via Lustre, and communication is by NVIDIA MLNX NCCL, which seems to be some kind of NVIDIA GPU-specific MPI-type library; it would seem to be doing RDMA from GPU to GPU via fabric interconnects, so far as I can tell...
I'd rather have a real HPC cluster.
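To give the NCCL bit above some shape: from user code it usually sits behind a framework's collective API rather than being called directly. A rough sketch using PyTorch's distributed module (assumes a CUDA build of PyTorch, launched with torchrun --nproc_per_node=<gpus>; whether the inter-node hop actually goes over GPUDirect RDMA/EFA depends on how the cluster is set up):

    import os
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")      # NCCL handles the GPU-to-GPU transport
        local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
        torch.cuda.set_device(local_rank)

        x = torch.ones(1024, device="cuda") * dist.get_rank()
        dist.all_reduce(x, op=dist.ReduceOp.SUM)     # collective across all GPUs and nodes
        if dist.get_rank() == 0:
            print("sum over ranks:", x[0].item())

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()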