bcongdon's comments

Thanks! This comment was super helpful. :) I made a couple updates to the post with these findings.

I suspected something like the `ff` labeling approach would improve the solution, but I had no idea how much of a difference `all_different` made vs. `all_distinct`!


https://benjamincongdon.me/blog

I like to write about productivity, Go/Rust, and my various web development projects. I also tend to write pretty frequently about programming language ergonomics.

Popular posts:

* https://benjamincongdon.me/blog/2019/11/11/The-Value-in-Gos-...

* https://benjamincongdon.me/blog/2018/03/23/Python-Idioms-in-...

* https://benjamincongdon.me/blog/2018/03/01/Scraping-the-Web-...

Favorite posts:

* https://benjamincongdon.me/blog/2019/03/07/Generative-Doodli...

* https://benjamincongdon.me/blog/2018/10/07/Wordscapes/

---

I also have a blogroll of other blogs I think are interesting: https://benjamincongdon.me/blogroll

... and a list of books that I've read: https://benjamincongdon.me/books


Hi HN, author here. Corral is my attempt at a performant, easy-to-deploy MapReduce. Unlike traditional frameworks like Hadoop, it uses AWS Lambda for execution and is “serverless” as a result. It was initially kicked off by AWS adding Lambda support for Go, but it draws on experience I’ve had using Lambda tools like Zappa[1] and Serverless[2] in the past.

I think there are a lot of interesting applications for function-as-a-service platforms as executors in data processing frameworks such as this.

If you’re more interested in the development/internals of this project, I wrote a blog post with more details: https://benjamincongdon.me/blog/2018/05/02/Introducing-Corra...

[1]: https://github.com/Miserlou/Zappa

[2]: https://serverless.com/


I'd like to see your readme expanded with some figures:

* Processing speed - that is, how long does it take to do that word count example on a nontrivial dataset? Something that takes hours on a local machine, vs minutes in map/reduce. Comparing local to this to e.g. Hadoop or Google BigQuery or whatever viable alternative there is.

* Cost. I think that's probably the biggest factor here. I don't get the impression that Lambda was intended for big data or highly resource / i/o / processing intensive operations, but, I'd love to be proven wrong.

* Actually, mostly just cost vs performance.

I mean it's a neat idea but if the serverless benefit is outweighed by difficulty in setting up, cost, performance, etc compared to dedicated big data solutions it's going to stay a proof of concept.


Hi,

For a small MapReduce load, say a terabyte (to replace a single MR node), how much would you estimate the AWS cost would be?


Pricing depends a lot on how much memory your job requires[1] and how much processing each record requires -- i.e. the pricing is more sensitive to usage.

As a very rough estimate, for a light-to-medium load of 1 TB, the cost would probably be in the ballpark of $0.50. AWS's own reference MR framework[2] (which is mostly a tech demo) quotes prices in a similar order of magnitude.

Corral isn't great for processing-heavy MR jobs, as Lambda pricing rises quickly if you need a lot of memory or take a lot of time with each record. But, for small-ish low-overhead jobs, it can pretty easily beat the pricing and hassle of using something like EMR.

[1]: https://aws.amazon.com/lambda/pricing/#Lambda_pricing_detail...

[2]: https://github.com/awslabs/lambda-refarch-mapreduce/
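
To make the "depends on memory and per-record time" point concrete, here is a back-of-the-envelope sketch in Go. Everything in it is an assumption for illustration: the two price constants are list prices that vary by region and change over time (check the pricing page in [1]), `estimateCost` is a hypothetical helper, and the job shape in `main` (10,000 invocations at 512 MB for 5 seconds each) is made up rather than measured from Corral. It also ignores S3 request and transfer charges.

```go
package main

import "fmt"

// Assumed Lambda list prices in USD. These are illustrative; check the
// current pricing page for your region before relying on them.
const (
	pricePerGBSecond = 0.0000166667
	pricePerRequest  = 0.20 / 1_000_000
)

// estimateCost returns a rough USD cost for `invocations` Lambda calls,
// each configured with memoryMB of memory and running for secondsEach.
// It covers compute and request charges only (no S3, no free tier).
func estimateCost(invocations int, memoryMB, secondsEach float64) float64 {
	gbSeconds := float64(invocations) * (memoryMB / 1024) * secondsEach
	return gbSeconds*pricePerGBSecond + float64(invocations)*pricePerRequest
}

func main() {
	// Hypothetical job: 10,000 chunks, 512 MB functions, 5 s per chunk.
	fmt.Printf("light job:  $%.2f\n", estimateCost(10000, 512, 5))
	// Same job with 4x the memory and 4x the time per chunk.
	fmt.Printf("heavy job:  $%.2f\n", estimateCost(10000, 2048, 20))
}
```

Under these assumed numbers the light job comes to about $0.42. Cost scales linearly with both memory and duration, so the heavier configuration is roughly 16x more expensive for the same number of chunks.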


Hi OP, I am the person that made the most recent changes to the AWS Labs refarch. I had been working on a Go version and wanted to clean up the Python one. Sunil, the original author, used the AMPLab benchmark to calculate the results table. I was planning on updating it with the 1 and 5 node test. Would be happy to include Corral as well.


Nice! Always thought this would be cool. A few questions, though: how do you get things like consistent hashing? Spark, for example, can shuffle data somewhat efficiently by sending data to (or fetching it from) the right node based on the hash key, right? How, in a serverless, stateless world, do you call a specific function instance? Assuming it can’t be done, aren’t you losing the performance gained from data locality? E.g. does the data have to be saved in a massive, efficient key-value store? Isn’t that much slower than Spark’s in-memory processing and data locality (bringing the compute to the data and not vice versa)? Would love to see benchmarks on this. This is the future IMHO... well done.
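
One common way to shuffle without addressing a specific function instance is plain hash partitioning: each mapper writes a record to a numbered bin in shared storage, and each reducer reads exactly one bin. The Go sketch below illustrates that idea only. It is not confirmed to be how Corral works, it is simple modulo hashing rather than consistent hashing, and `partition` and the bin count are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partition maps an intermediate key to one of numBins reducer bins
// using an FNV-1a hash. The same key always lands in the same bin, so
// a reducer assigned to that bin sees every value for the key.
func partition(key string, numBins int) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(numBins))
}

func main() {
	const numBins = 4 // hypothetical number of reducers
	for _, key := range []string{"apple", "banana", "cherry", "apple"} {
		// A mapper would append the record to a shared-storage object
		// for this bin instead of sending it to a particular node.
		fmt.Printf("key %q -> bin %d\n", key, partition(key, numBins))
	}
}
```

The trade-off raised in the question still applies: the intermediate data goes through shared storage rather than staying in memory next to the compute, so locality is given up in exchange for stateless workers.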


Nice! I've been thinking about doing this kind of thing for a while. I have long experimented with doing map reduce style work outside of things like Hadoop and gotten much better results due to being able to tune for different things much more quickly.


Hi.

How do you deal with the 5 min (IIRC) execution time limit of Lambda?


Yeah, max execution time (and max memory usage) are the main constraints of using Lambda.

Corral deals with this by splitting input data into small enough chunks that each chunk can be processed within the timeout -- I exposed options for setting the amount of data that each Lambda function has to process. However, if each data item requires more than 5 min of processing, then corral won't work for you.

The "driver" that coordinates the Lambda functions runs locally (not in Lambda), so it doesn't have this constraint.
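
As a rough illustration of the chunking idea, here is a minimal Go sketch that cuts one input file into byte ranges no larger than a configured maximum, so that each range can be handed to a separate function invocation. The `split` type, `splitFile`, and the sizes used are hypothetical names and values for this example, not Corral's actual API or defaults.

```go
package main

import "fmt"

// split describes one unit of work: the byte range [start, end) of a file.
type split struct {
	file       string
	start, end int64
}

// splitFile divides a file of the given size into consecutive byte
// ranges of at most maxBytes each. It returns nil if maxBytes is not
// positive or the file is empty.
func splitFile(file string, size, maxBytes int64) []split {
	if maxBytes <= 0 {
		return nil
	}
	var out []split
	for off := int64(0); off < size; off += maxBytes {
		end := off + maxBytes
		if end > size {
			end = size // last chunk may be shorter
		}
		out = append(out, split{file, off, end})
	}
	return out
}

func main() {
	// Hypothetical 250-byte file with a 100-byte chunk limit.
	for _, s := range splitFile("input.txt", 250, 100) {
		fmt.Printf("%s [%d, %d)\n", s.file, s.start, s.end)
	}
}
```

A smaller `maxBytes` means more invocations that each finish sooner, which is the knob that keeps every invocation under the timeout. A real splitter would also need to handle records that straddle a chunk boundary.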


The iOS app has been pretty good (if a bit buggy and slow), but the desktop web interface is bad to the point that I haven't used it in over a year -- and I use MyFitnessPal daily.


I've definitely had the app perma-crash on me recently, where the only way to get out of the boot-then-crash loop was to delete and reinstall it.

It also fails at the iOS quick access menu about 80% of the time. e.g. you hard-press the app icon, go to "log food" and it goes back to the home page. Or you do the same for "scan barcode" and again most of the time it opens the app and sends you to the home page again. This has been the case for months.


Fair, I also haven’t. I imagine barely anybody does.


This is so cool! I've been following Inconvergent's work (https://twitter.com/inconvergent). This seems like something I might be interested in.

Thanks for sharing!


Cool! I've been looking for something that does just this


Unfortunate that this doesn't apply to "The Pragmatic Programmer". Still, very cool show of support against DRM.


Hi HN,

This is a project I've been working on for a while now. It's a hobby horse to try new technologies, but I've really enjoyed working on it.

I'm a fan of the GamesDoneQuick (gamesdonequick.com) speedrun marathon that pops up to the top of Twitch twice a year. Inspired by previous sites that have done visualizations like this, I decided to take a crack at realtime stats last year.

This is the current iteration of GDQStatus, which is powered by React + Recharts and AWS Lambda.

The site itself is static and hosted on GitHub Pages; data is fetched from S3 (as a caching layer), with hot data served by a Lambda.

Let me know if you have any questions!

