
beam.cloud (YC W22) | Product Engineer, SWE, SRE, Infrastructure, DevOps, Networking | NYC | Onsite

Beam is an ultrafast AI inference platform. We built a serverless runtime that launches GPU-backed containers in less than 1 second and quickly scales out to thousands of GPUs. Developers use our platform to serve apps to millions of users around the globe.

We’re working on challenging problems, including:

* Low-level systems development: working with container runtimes, OCI image formats, and lazy-loading large files from content-addressable storage

* Efficiently packing thousands of workloads into GPUs across multiple clouds

* Working with cutting-edge technologies, like GPU checkpoint restore and CRIU

You don’t need prior experience with AI/ML, only an interest in working on hard problems and shipping quickly.

Email us at founders<at>beam<dot>cloud, or apply here: https://www.ycombinator.com/companies/beam/jobs/


You should look into beam.cloud (I'm the founder, but it's pretty great)

It lets you quickly run long-running jobs on the cloud by adding a simple decorator to your Python code:

  from beam import function

  # Some long training function
  @function(gpu="A100-80")
  def handler():
    return {}

  if __name__ == "__main__":
    # Runs on the cloud
    handler.remote()

It's fully open-source, too.


You should check out beam.cloud (I'm the founder). It's a modern FaaS platform for Python, with support for REST endpoints, task queues, scheduled jobs, and GPUs.
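
To give a rough idea of the programming model, here's a minimal sketch. The decorator names and parameters below are from memory and may not match the current SDK exactly, so treat them as illustrative:

  from beam import endpoint, task_queue

  # Serve a function as a REST endpoint (names/params are illustrative)
  @endpoint(cpu=1)
  def predict(prompt: str):
      return {"echo": prompt}

  # Or push work onto a managed task queue instead of serving it synchronously
  @task_queue(gpu="T4")
  def process(item_id: str):
      # long-running background work goes here
      print(f"processing {item_id}")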


Thanks so much for sharing! I'm going to check out the website and learn more about your service. I'll reach out if I have any questions.


The major clouds don't support serverless GPU because the architecture is fundamentally different from running CPU workloads. For Lambda specifically, there's no way of running multiple customer workloads on a single GPU with Firecracker.

A more general issue is that the workloads that tend to run on GPU are much bigger than a standard Lambda-sized workload (think a 20Gi image with a smorgasbord of ML libraries). I've spent time working around this problem and wrote a bit about it here: https://www.beam.cloud/blog/serverless-platform-guide


> there's no way of running multiple customer workloads on a single GPU with Firecracker.

You can do this with SR-IOV-enabled hardware.

https://docs.nvidia.com/networking/display/mlnxofedv581011/s...
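
Roughly: the host carves the device into virtual functions via sysfs, and each VF can then be bound to vfio-pci and passed through to its own microVM. A minimal host-side sketch (the PCI address and VF count here are made-up examples; this needs root and an SR-IOV-capable device):

  from pathlib import Path

  # Example PCI address of an SR-IOV-capable device; substitute your own (see lspci)
  dev = Path("/sys/bus/pci/devices/0000:3b:00.0")

  # How many virtual functions the device supports
  total_vfs = int((dev / "sriov_totalvfs").read_text())
  print(f"device supports up to {total_vfs} VFs")

  # Create 4 VFs (the device must currently have 0 VFs configured); each one
  # shows up as its own PCI function that can be handed to a separate microVM
  (dev / "sriov_numvfs").write_text("4")

  vfs = sorted(link.resolve().name for link in dev.glob("virtfn*"))
  print("virtual functions:", vfs)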


beam.cloud | Founding Software Engineer, Infrastructure | Full-time | REMOTE | New York, NY USA

Beam is building a cloud runtime for running remote containers on GPUs. Thousands of developers use us to power their generative AI apps, including teams at companies like Coca-Cola, and we’re backed by great investors like YC and Tiger.

We’re building gnarly low-level distributed systems. You’ll have a major impact on the product and ship features directly to users. If working on a new Pythonic cloud runtime sounds exciting, you might really like it here.

Apply here -> https://www.ycombinator.com/companies/beam/jobs/9fKNUsT-foun...


Do you consider non-US applicants?


Yes, we’re hiring internationally.


There are a number of good options here. The main axes are GPU cost, performance, and ease of use / developer experience. You might consider beam.cloud (I'm one of the founders), which is strongly oriented toward performance and developer experience.


You should check out https://beam.cloud (I'm the founder). It'll give you access to plenty of cloud GPU resources for training or inference.

Right now it's pretty hard to get GPU quota on AWS/GCP, so hopefully this is useful for you.


Hi, thanks! I've already logged in and I'm exploring how to use it right now.


Cloudflare AI and Replicate are great for running off-the-shelf models, but anything custom is going to incur a 10+ minute cold start.

For running custom fine-tuned models on serverless, you could look into https://beam.cloud which is optimized for serving custom models with extremely fast cold start (I'm a little biased since I work there, but the numbers don't lie)


Thanks! Looks promising from the outside. Will definitely check it out.


Why would it incur a cold start of 10 minutes on Cloudflare? :O

Any proof?


Serverless only works if the cold boot is fast. For context, my company runs a serverless cloud GPU product called https://beam.cloud, which we've optimized for fast cold start. In production we see Whisper cold start in under 10s (across model sizes). A lot of our users are running semi-real-time STT, and this seems to be working well for them.


>...this seems to be working well for them.

Is this because the users are streaming audio in a more conversational style?

For example, when you give Siri a command, you state it and then stop speaking.

For most of ChatGPT’s life, in OpenAI’s iOS app, if you wanted to speak to input text, you would tap the record button and then tap it off, using either the app’s own speech-to-text capability or Siri’s input-field speech-to-text.

Conversational speech-to-text is more ongoing, though, which would make a 10-second cold start OK: you don’t sense as much lag because you’re continuing to speak.

Or perhaps people generally record input longer than 10 seconds, and you send the first chunk as soon as possible to get Whisper going.

Then follow-up chunks are handled as warm boots, and the text is reassembled? Is that roughly correct?

Anything you can share about the request and data flow that works with a longer cold boot time, in the context of a single recording versus streaming, and how the audio is broken up, would be helpful.
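
For concreteness, here's the kind of flow I'm imagining; everything below is hypothetical on my part, just to make my mental model explicit:

  from concurrent.futures import ThreadPoolExecutor

  CHUNK_BYTES = 10 * 1024 * 1024  # arbitrary chunk size for illustration

  def split_audio(audio: bytes) -> list[bytes]:
      # Naive fixed-size split; in practice you'd split on silence or time windows
      return [audio[i:i + CHUNK_BYTES] for i in range(0, len(audio), CHUNK_BYTES)]

  def transcribe(chunk: bytes) -> str:
      # Placeholder for the remote Whisper call (e.g. an HTTP request to the
      # deployed endpoint); not a real API
      return f"<transcript of {len(chunk)} bytes>"

  def transcribe_recording(audio: bytes) -> str:
      chunks = split_audio(audio)
      # The first request pays the ~10s cold start; by the time the later
      # chunks go out, containers are warm, so those come back quickly
      with ThreadPoolExecutor() as pool:
          futures = [pool.submit(transcribe, c) for c in chunks]
      # Reassemble the transcript in chunk order
      return " ".join(f.result() for f in futures)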

