What are you using for K8s autoscaling? We initially tried a few standard K8s scaling mechanisms and found that they didn't work well for GPU workloads. For example, if we were serving a low-RAM Huggingface model on a GPU, the usual CPU/memory-based autoscaling would never trigger a scale-up. And since the GPU can only process one request at a time, the system would get bottlenecked while it waited to process requests one by one.
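To make that concrete, here's a rough sketch of scaling on request backlog instead of CPU/memory, using the official `kubernetes` Python client; the deployment name, the target backlog per replica, and the queue-depth lookup are all hypothetical placeholders, not our actual implementation:

```python
# Rough sketch: scale a GPU deployment on request backlog instead of CPU/RAM.
# Deployment name, namespace, and get_queue_depth() are hypothetical placeholders.
import math
import time
from kubernetes import client, config

config.load_kube_config()            # use load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

DEPLOYMENT, NAMESPACE = "gpu-model-server", "default"
REQUESTS_PER_REPLICA = 4             # backlog each single-request GPU replica should absorb
MIN_REPLICAS, MAX_REPLICAS = 1, 8

def get_queue_depth() -> int:
    """Hypothetical: fetch the number of pending requests from your queue/metrics store."""
    return 0

while True:
    backlog = get_queue_depth()
    desired = max(MIN_REPLICAS,
                  min(MAX_REPLICAS, math.ceil(backlog / REQUESTS_PER_REPLICA)))
    apps.patch_namespaced_deployment_scale(
        DEPLOYMENT, NAMESPACE, {"spec": {"replicas": desired}}
    )
    time.sleep(15)
```

The point is just that the scaling signal has to be request-driven, since CPU and memory stay flat while the GPU is saturated.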
Sharing GPUs only really makes sense for GPUs that are large enough to share. MIG can work for 80GB A100s but isn't supported on smaller cards like T4s. Sharing also adds latency to GPU operations. Unfortunately there's not yet a silver bullet for this stuff.
That’s why I was curious about utilization, since you mentioned low memory usage. I believe time slicing can work on those smaller cards these days. Did you explore any other optimizations, like batching or concurrency for the same model?
Model heterogeneity seems like a real challenge there — you could optimize usage if you knew all the model sizes ahead of time and actually had the GPU capacity to do efficient allocations, but that's way harder than just doling out one GPU per pod.
edit: also, is the latency because of reduced resources? Or what do you mean?
I think the Beam website should be a lot clearer about how things work[0], but my read is that Beam bills you for your actual usage, in a serverless fashion. So, unless you're continuously running computations for the entire month, it won't cost $1200/mo.
If it works the way I think it does, it sounds appealing, but the GPUs also feel a bit small. The A10G only has 24GB of VRAM. They say they're planning to add an A100 option, but... only the 40GB model? Nvidia has offered an 80GB A100 for several years now, which seems like it would be far more useful for pushing the limits of today's 70B+ parameter models. Quantization can get a 70B parameter model running on less VRAM, but it's definitely a trade-off, and I'm not sure how the training side of things works with regard to quantized models.
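For reference, 4-bit quantized loading with the transformers + bitsandbytes stack looks roughly like the sketch below (the model name is illustrative, and whether it actually fits in 24GB depends on the card and context length):

```python
# Rough sketch: loading a large causal LM in 4-bit to squeeze it into less VRAM.
# Model name is illustrative; actual fit depends on the GPU and context length.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-hf"      # illustrative 70B-class checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # ~4 bits per weight instead of 16
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still happens in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # spread layers across available GPUs / offload to CPU
)
```

(My understanding is that training against a quantized base usually means QLoRA-style fine-tuning, where the base weights stay frozen and only small adapter weights get updated.)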
Beam's focus on Python apps makes a lot of sense, but what if I want to run `llama.cpp`?
Anyways, Beam is obviously a very small team, so they can't solve every problem for every person.
[0]: What is the "time to idle" for serverless functions? Is it instant? "Pay for what you use, down to the second" sounds good in theory, but AWS also uses per-second billing on tons of stuff, and EC2 instances don't just stop billing you when they go idle; you have to manually shut them down and start them up. So, making the lifecycle clearer would be great. Even a quick example of how you would be billed might be helpful.
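For what it's worth, the arithmetic I'd want spelled out is something like the sketch below (the per-second rate and traffic numbers are completely made up, and it ignores whatever idle window containers are kept warm for):

```python
# Hypothetical per-second billing arithmetic; the rate and traffic are made up.
RATE_PER_SECOND = 0.0005          # $/s of active GPU time (hypothetical)

requests_per_day = 2_000
seconds_per_request = 3           # GPU busy time per request (hypothetical)
active_seconds = requests_per_day * seconds_per_request * 30   # 30-day month

print(f"${active_seconds * RATE_PER_SECOND:,.2f}/mo")  # $90.00 vs. ~$1,200 always-on
```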
Slai is a tool to quickly build machine learning-powered applications. Our browser-based sandbox is the easiest way to build, deploy, and share machine learning models with zero setup [1]. We’re currently a team of four, and we’re looking to hire someone to help us with SRE / DevOps work.
You should have experience setting up and maintaining infrastructure at scale - ideally, you’ll have fluency with Docker/Kubernetes, EKS, Knative, Terraform, Terragrunt, Gunicorn, and Python. You should be able to communicate clearly in English, and you can work from anywhere (although if you prefer to work in person, we have an office in Cambridge, MA).
We have a hackathon-inspired culture – the team works in one-week sprints and has very few meetings besides daily standup and a Friday afternoon all-hands. If this interests you, please send a brief email with your resume to eli at slai dot io.
On paper, Sagemaker does everything, but it doesn’t do many of those things well. I think Sagemaker is a great product for enterprises that want to maximize the products procured from a single vendor — it’s easy to buy Sagemaker when all of your infra is already on AWS.
It’s fairly painful to productionize a model on Sagemaker — they make you think about a lot of things and fit your workflow into AWS primitives. Besides the code for the model itself, we don’t force users to think about anything. Our focus is helping engineers get models into production, not making them read documentation.
Using our tool, you can fork a model and deploy it to production right away — there’s no time spent battling AWS primitives. We’re focused on developer experience above everything else, which means we ensure that every sandbox on our platform is consistent and reproducible.
Yep! You can upload pre-trained models — just upload a pickled binary of your model into the “data” section of the sandbox and then load and return the object in the train function.
We chose to do it this way to ensure that the binary you upload is properly tracked in our versioning system, and that it can be integrated into your handler.
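Here’s a minimal sketch of what that could look like, assuming the file is uploaded as data/model.pkl; the exact path and function signatures below are illustrative rather than our documented API:

```python
# Minimal sketch, assuming the uploaded file lands at data/model.pkl.
# The train/handler split mirrors the flow described above; exact signatures
# here are illustrative, not the documented API.
import pickle

def train():
    with open("data/model.pkl", "rb") as f:
        model = pickle.load(f)   # pre-trained model uploaded to the "data" section
    return model                 # returned object is versioned and passed to the handler

def handler(model, payload):
    # payload shape is up to you; here we assume a {"features": [...]} request body
    return {"prediction": model.predict([payload["features"]]).tolist()}
```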
Hi, thanks for checking it out! You bring up some great points.
When it comes to SWEs doing ML, our goal is to bring together a bunch of apps that get people most of the way to something they can bring into production.
I appreciate your skepticism of online IDEs, and it’s unlikely we’ll ever completely replace notebooks or whatever IDEs people prefer hacking on locally. Instead, we’d like to take a hybrid approach: our online IDE is sufficient for making minor tweaks to a model after it’s in production, but the brunt of development can still happen locally, with changes pushed to Git and synced with our online IDE.
It’s true that SWEs can deploy their own APIs, but that glosses over a lot of work we tend to take for granted. At a high level you’re just setting up an API, but really you’re also going to set up a versioning system, a Dockerfile, and monitoring, and all of that adds up to a lot of cognitive overhead.
BTW - the Jupyter-like cell execution can be turned on by clicking the “Interactive Mode” button on the bottom right corner of the IDE.
Makes a lot of sense! It really is true that ML models kind of live in their own little world with their training loop, and should interact with everything else through a REST API. Every now and then new data gets added for training, and the API needs to change a bit (maybe we tweak the labels). You managed to encapsulate that part. I might port one of our text models to it to try it out :)
The core code editor itself uses Monaco (the same thing under the hood of VSCode), but everything else is custom (e.g. the file browser, language server, syntax highlighting, tabs, etc.)
We think developer experience is the factor that has been sorely lacking from the ML tooling space. I’m curious: how has your experience with Sagemaker been?
Hi, thanks! The main difference is that HuggingFace contains a huge repository of pretrained models, whereas we're providing the scaffolding to build your own end to end applications. For example, in Slai you could actually embed a HuggingFace model (or maybe two models), combine them into one application, along with API serialization/deserialization, application logic, CI/CD, versioning, etc.
You can think of us as a store of useful ML-based microservices, not just a library of pre-trained models.
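As a rough illustration of combining two HuggingFace models into one application, chaining a pair of off-the-shelf pipelines behind a single handler might look like the sketch below (the handler signature and model choices are simplified for the example):

```python
# Rough sketch: chaining two off-the-shelf HuggingFace pipelines behind one handler.
# The handler signature and model choices are simplified for illustration.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
sentiment = pipeline("sentiment-analysis")

def handler(payload):
    summary = summarizer(payload["text"], max_length=60)[0]["summary_text"]
    mood = sentiment(summary)[0]      # classify the summary, not the raw text
    return {"summary": summary, "sentiment": mood["label"], "score": mood["score"]}
```

The surrounding pieces (API serialization, versioning, CI/CD) are what the platform wraps around a handler like this.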
We wrote a bit about this here, if anyone is interested: https://www.beam.cloud/blog/serverless-autoscaling