Parallel C++ on AWS Lambda for CRISPR (benchling.engineering)
72 points by binarynate on Feb 3, 2018 | 17 comments


We have a similar application (parallelized C++ code operating on large files, for bioinformatics even) and ended up down the reverse path: started on Lambda, moved to our own RPC system. Lambda got super expensive (in part because there was no way to reuse a worker while it was doing async work like an S3 download), couldn't parallelize nearly enough without hitting AWS hard-limit quotas, had significantly lower CPU perf and didn't provide a way to cache files (like a genome in this case). Spot/preemptible instances keep our costs down while letting us keep a few hundred servers up at a time.


I feel like serverless is only beneficial for low-volume / low-traffic workloads.

The only other perk I've found is that the serverless billing model makes it easier to estimate costs.


I found it lowered our server costs incredibly for a high-volume, read-heavy site. It allowed us to scale in response to increased traffic (not instantly, that's a lie; spikes in the thousands of requests a second are not handled well, but over a few minutes it catches up without issue) and not have to provision and spin up servers. We were constantly over-provisioned before, and now it's a much lower, but moving, margin.


You can do that today though with K8s and auto-scaling. Not sure how well autoscaling works on AWS, but on GCP it's a breeze.


Well, yes. This is why we are building a solution that you can install anywhere: https://github.com/1backend/1backend

You might jokingly say we are reintroducing the servers into the serverless concept :D

But I think we are just giving freedom back to the users.


Lambda is an exciting platform, but it does require some bending to get it to work for certain use cases. This reminds me of some problems we solved last year [0], which should not have been problems for a fairly straightforward application.

[0]: https://iplayer.engineering/evaluating-tensorflow-models-in-...


I read that article before and just reread it now - thanks for writing it up, but I still have the same question I did the first time around. You posit:

> Because we run 200 invocations or so in parallel we’ll only need to download the model once and save it there

From my own reading of the Lambda docs, it seems that a simultaneous request for the same Lambda may or may not spin up a new container, i.e. while serial requests within the 15-minute timeout will likely reuse the same instance (with the same frozen/cached /tmp data), parallel requests have no such guarantee.

Was this your finding in practice? If so, were your CloudWatch “keep warm” events a series of just one Lambda invocation every ~10-15 minutes, or 200 simultaneous requests serially spaced to keep the instances spun up?
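For what it's worth, the warm-container caching pattern under discussion is easy to sketch. A hedged Python illustration (the path and the stubbed download are hypothetical; a real handler would pull from S3 with boto3): check /tmp before downloading, since a warm container keeps /tmp between invocations, while each parallel cold-started container pays the download once.

```python
import os

MODEL_PATH = "/tmp/model.pb"  # hypothetical local cache path


def _fetch_model():
    # Stand-in for the real S3 download (e.g. boto3 get_object),
    # stubbed so the sketch stays self-contained.
    with open(MODEL_PATH, "wb") as f:
        f.write(b"model-bytes")


def get_model_path():
    # On a warm start the file survives from the previous invocation,
    # so the download is skipped; on a cold start we pay for it once.
    # Parallel invocations may each land on a fresh container, so each
    # of those pays its own cold-start download.
    if not os.path.exists(MODEL_PATH):
        _fetch_model()
    return MODEL_PATH
```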


> From my own reading of the Lambda docs, it seems that a simultaneous request for the same Lambda may or may not spin up a new container, i.e. while serial requests within the 15-minute timeout will likely reuse the same instance (with the same frozen/cached /tmp data), parallel requests have no such guarantee.

This is true. I learned this the hard way some time ago while trying to understand why our Lambda function crashed from time to time. It turned out we weren't cleaning the /tmp folder, and we used that space quite extensively. There's a good article on this:

https://aws.amazon.com/blogs/compute/container-reuse-in-lamb...
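A generic sketch of the fix (illustrative, not the code from our service): clear stale files at the start of the handler, since /tmp (capped at 512 MB on Lambda at the time) persists across invocations of a reused container.

```python
import os
import shutil


def clean_dir(path="/tmp", keep=()):
    # Remove leftovers from previous invocations of a reused container.
    # /tmp is size-capped, so stale artifacts eventually fill it and
    # crash the function, as described above. `keep` whitelists files
    # we deliberately cache across invocations.
    for name in os.listdir(path):
        if name in keep:
            continue
        entry = os.path.join(path, name)
        if os.path.isdir(entry):
            shutil.rmtree(entry, ignore_errors=True)
        else:
            try:
                os.remove(entry)
            except OSError:
                pass
```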


The CloudWatch "keep warm" events are just one invocation, yes.

I do not remember having had issues with it, but honestly, I don't think I actually have stats on that anymore.

I've just checked in S3, but it doesn't look like we have request or data transfer metrics enabled on the model bucket. I may enable those next week to monitor the effectiveness of our strategy better.


I imagine if the code were properly written to deal with race conditions you wouldn’t notice any issues either way besides an increased latency for some requests.

Good luck!
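One defensive pattern for that (an illustrative sketch, not anyone's production code): download into a temp file and rename it into place. os.rename is atomic on POSIX within one filesystem, so a reader never observes a partially written cache file even if an invocation is interrupted mid-download.

```python
import os
import tempfile


def atomic_write(path, data):
    # Write to a temp file on the same filesystem, then rename over the
    # target. Readers see either the old file, no file, or the complete
    # new file; never a half-written one.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.rename(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```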


What's the difference between this CRISPR search problem and DNA sequence alignment? There has been extensive development in the latter, and it is highly optimized. The author seems to be coming up with a solution from scratch.

https://en.wikipedia.org/wiki/Sequence_alignment


Original author here.

You're right - conceptually the CRISPR search problem and DNA sequence alignment are related. In both, you're looking for places where two (or more) sequences are very similar. I would say there are two major differences.

The first is in the goal of the search. Typically, alignment tools try to find the best positional alignment for two (or more) sequences. The CRISPR search problem is to find every possible match above some similarity threshold.

There are also a few constraints on the CRISPR search problem that allow us to make this much faster than a general DNA sequence alignment tool:

1) We know that guide sizes tend to be very small (~20bp).

2) Part of the guide must match exactly (the PAM site), allowing us to restrict our search even further.

3) We don't need to worry about insertions or deletions in our search.

Using those three constraints, we can do this search a lot faster than a more general DNA alignment tool!
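To make the constraints concrete, here's a minimal sketch of that style of search (my own illustration, not Benchling's actual algorithm; the NGG PAM layout, guide length, and mismatch threshold are assumptions). Because the PAM must match exactly and indels are out of scope, the scan reduces to a cheap PAM filter followed by a Hamming-distance check over a small fixed window.

```python
def find_offtargets(genome, guide, pam="GG", max_mismatches=3):
    """Scan one strand for candidate sites: guide, then an NGG PAM.

    Assumed site layout: [protospacer of len(guide)][any base][pam].
    """
    hits = []
    k = len(guide)
    for i in range(len(genome) - k - len(pam)):
        # Constraint 2: the PAM (here NGG: any base, then "GG") must
        # match exactly, so most positions are rejected cheaply.
        if genome[i + k + 1 : i + k + 1 + len(pam)] != pam:
            continue
        # Constraints 1 and 3: small fixed window, substitutions only,
        # so "similarity" is just a Hamming distance over ~20bp.
        mismatches = sum(a != b for a, b in zip(guide, genome[i : i + k]))
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits
```

A general aligner has to consider gaps and find one best alignment; this just enumerates every position passing the threshold, which is the difference in goals described above.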


Honestly, you are taking a big risk designing a DNA search algorithm from scratch. It's akin to the risk people take when they roll their own crypto. There are aspects of this that you may not be considering, and it tends to be best to rest on the extensive work in the field rather than assume it is a trivial problem.

How do you deal with natural variation in the genome? Can you be sure your gRNA doesn't target an essential locus in some percentage of people who carry a particular allele? The data to solve this is out there (1000 Genomes, for instance).

Edit: excuse me, I appreciate that you are using a collection of whole genomes as your target. Will this reliably scale to thousands or millions of genomes and likely recombinations between them?


(I work at Benchling and worked on the protein alignments tool.)

Adding to Vineet's response, Benchling has both CRISPR search and sequence alignment, and they have different use cases. CRISPR search uses the custom algorithm described in the article (and the "last post" linked at the top), while alignments use the two popular alignment tools Clustal Omega[1] and MAFFT[2]. It's certainly true that the general alignment problem is complex and has undergone plenty of research around performance and producing good results, which is why we used an off-the-shelf tool for computing the alignment itself (but a custom UI for viewing the result).

[1] http://www.clustal.org/omega/

[2] https://mafft.cbrc.jp/alignment/software/


Articles like this are why I return to HN every day. There are three terms in the title that I know well, but put them together and they're something completely new to me. And the cost savings here are pretty remarkable.


I hope Lambda supports C++ natively someday. That would be awesome. Glad to see Go support was added recently too.


Wouldn't this be easier to do in Fargate?



