I have a friend interning at a Hadoop startup over the summer. I'm particularly interested in how their business model will work. I believe their distribution of Hadoop is offered for free, and they're basically cloud computing consultants whose goal is to customize Hadoop for different industries and clients. It's a huge market opportunity.
Running a consulting firm around Hadoop and related technologies makes a lot of sense, but taking VC funding to do so is surprising, IMHO (Cloudera are backed by Accel and Greylock). A consulting firm has relatively small capital requirements, but is fundamentally less scalable than a "product" firm: to make 2x revenue, you need ~2x the staff. I'm curious to see whether they'll be able to achieve the sort of returns that a typical VC expects if they stick to a purely-consulting business model.
Cloudera have taken ~$11M in VC funding so far[1]; is there really ~$110M in profit to be made off Hadoop consulting, training and support in the medium term? I wonder.
One possibility is that they're using consulting to build revenue and mindshare in the short-term, and using the capital they've raised to launch something more substantial in the longer-term (say, running their own cloud/hosted Hadoop service).
It was great to have Pete Skomoroch speak about this at the Hadoop meetup in DC. I'm really glad it's being shared with the rest of the community now. Cloudera is collecting good use cases and sharing innovative ideas on their blog. Thanks again for sharing, Pete.
Why are you loading the processed data into MySQL tables? I'm not sure how well MySQL would scale, given that Wikipedia has ~3 million articles. Like I said, I'm working on a similar problem right now and we're trying to avoid MySQL. Did you guys consider HBase or other BigTable-like implementations?
The live site trendingtopics.org uses MySQL for all 3 million articles and handles it pretty well with the right indexing, bulk loads, and memcached. I built the initial demo in 10 days, so I chose Rails with MySQL mostly for simplicity, with the intention of adding Solr or Sphinx search later. The way the data is stored (key-value style with JSON timelines) was actually intended to lend itself to replacing MySQL with another fast BigTable-like datastore.
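If it helps to picture the key-value style layout, here's a toy sketch in plain Ruby. The record shape and field names are made up for illustration, not taken from the actual schema; the point is just that each article's timeline travels as one JSON blob, which any key-value store could hold just as easily as a MySQL TEXT column.

```ruby
require 'json'

# Hypothetical article record: the daily pageview timeline is kept
# as a single JSON document keyed by article title, so swapping
# MySQL for another key-value datastore later stays trivial.
record = {
  "title"    => "Hadoop",
  "timeline" => { "dates" => ["2009-06-01", "2009-06-02"],
                  "views" => [1200, 1850] }
}

# What would live in a MySQL TEXT column:
blob = JSON.generate(record["timeline"])

# Reading it back in the presentation layer:
timeline = JSON.parse(blob)
puts timeline["views"].last  # most recent day's pageview count
```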
We're just using a single c1.medium instance for the database right now. Trendingtopics.org is a relatively low-traffic, read-only site, and most of the reads are for a handful of URLs on the front page, which can be cached.
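The caching pattern here is just read-through caching of a few hot keys; a minimal stand-in for memcached (an in-memory hash, purely illustrative) shows why the single database instance barely gets touched:

```ruby
# Toy read-through cache standing in for memcached: the handful of
# hot front-page queries are computed once, then served from memory,
# so the lone MySQL instance only sees cache misses.
class SimpleCache
  def initialize
    @store = {}
  end

  # Return the cached value, or run the block on a miss and cache it.
  def fetch(key)
    @store.fetch(key) { |k| @store[k] = yield }
  end
end

cache   = SimpleCache.new
db_hits = 0
query   = lambda { db_hits += 1; ["Article A", "Article B"] }  # stand-in for a MySQL query

3.times { cache.fetch("front_page_top25") { query.call } }
puts db_hits  # the database was only queried on the first request
```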
Also, after processing the raw log data with Hadoop, we only need to store/lookup 3M records in the MySQL presentation layer, which is well within the capabilities of a tuned RDBMS. Many Rails sites are backed by MySQL, so I thought linking Hadoop/Hive to a common data workflow would make for a good example.
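The hand-off step between Hadoop/Hive and MySQL can be sketched like this: Hive emits tab-delimited files, and MySQL bulk-loads them far faster than row-by-row INSERTs. This is a hedged illustration, not the project's actual script; the table and column names (`daily_trends`, `title`, `pageviews`) are invented for the example.

```ruby
require 'tempfile'

# Pretend this is tab-delimited output from a Hive query.
rows = [["Hadoop", 1850], ["MapReduce", 940]]

tsv = Tempfile.new(['daily_trends', '.tsv'])
rows.each { |title, views| tsv.puts([title, views].join("\t")) }
tsv.close

# Build the bulk-load statement MySQL would run; LOAD DATA is
# the fast path compared to inserting 3M rows one at a time.
load_sql = "LOAD DATA LOCAL INFILE '#{tsv.path}' " \
           "INTO TABLE daily_trends (title, pageviews);"
puts load_sql
```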
I've been hearing that recent improvements in HBase 0.20 could make it a contender: http://stackoverflow.com/questions/1022150/is-hbase-stable-a... and some high-volume sites like Mahalo are already using it. That said, there are other alternative datastores (Cassandra, Voldemort, Tokyo Tyrant) that might be worth exploring if a relational database isn't cutting it for you.
Full source code on Github: http://github.com/datawrangling/trendingtopics/tree/master
Dataset on Amazon Public Data Sets: http://developer.amazonwebservices.com/connect/entry.jspa?ex...