I have a friend interning at a Hadoop startup over the summer. I'm particularly interested in how their business model will work. I believe their distribution of Hadoop is offered for free, and they're basically cloud computing consultants whose goal is to customize Hadoop for different industries and clients. It's a huge market opportunity.
Running a consulting firm around Hadoop and related technologies makes a lot of sense, but taking VC funding to do so is surprising, IMHO (Cloudera are backed by Accel and Greylock). A consulting firm has relatively small capital requirements, but is fundamentally less scalable than a "product" firm: to make 2x revenue, you need ~2x the staff. I'm curious to see whether they'll be able to achieve the sort of returns that a typical VC expects if they stick to a purely-consulting business model.
Cloudera have taken ~$11M in VC funding so far[1]; is there really ~$110M in profit to be made off Hadoop consulting, training and support in the medium term? I wonder.
One possibility is that they're using consulting to build revenue and mindshare in the short-term, and using the capital they've raised to launch something more substantial in the longer-term (say, running their own cloud/hosted Hadoop service).
It was great to have Pete Skomoroch speak about this at the Hadoop meetup in DC. I'm really glad it's being shared with the rest of the community now. Cloudera is collecting good use cases and sharing innovative ideas on their blog. Thanks again for sharing, Pete.
Why are you loading the processed data into MySQL tables? I'm not sure how well MySQL would scale, given that Wikipedia has ~3 million articles. Like I said, I'm working on a similar problem right now and we're trying to avoid MySQL. Did you guys consider HBase or other BigTable-like implementations?
The live site trendingtopics.org uses MySQL for all 3 million articles and handles it pretty well with the right indexing, bulk loads, and memcached. I built the initial demo in 10 days, so I chose Rails with MySQL mostly for simplicity, with the intention of adding Solr or Sphinx search later. The way the data is stored (key-value style with JSON timelines) was actually intended to lend itself to replacing MySQL with another fast BigTable-like datastore.
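If it helps to picture the key-value style layout, here's a toy sketch in plain Ruby. The record shape and field names are made up for illustration, not taken from the actual schema; the point is just that each article's timeline travels as one JSON blob, which any key-value store could hold just as easily as a MySQL TEXT column.

```ruby
require 'json'

# Hypothetical article record: the daily pageview timeline is kept
# as a single JSON document keyed by article title, so swapping
# MySQL for another key-value datastore later stays trivial.
record = {
  "title"    => "Hadoop",
  "timeline" => { "dates" => ["2009-06-01", "2009-06-02"],
                  "views" => [1200, 1850] }
}

# What would live in a MySQL TEXT column:
blob = JSON.generate(record["timeline"])

# Reading it back in the presentation layer:
timeline = JSON.parse(blob)
puts timeline["views"].last  # most recent day's pageview count
```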
We're just using a single c1.medium instance for the database right now. Trendingtopics.org is a relatively low-traffic, read-only site, and most of the reads are for a handful of URLs on the front page, which can be cached.
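The caching pattern here is just read-through caching of a few hot keys; a minimal stand-in for memcached (an in-memory hash, purely illustrative) shows why the single database instance barely gets touched:

```ruby
# Toy read-through cache standing in for memcached: the handful of
# hot front-page queries are computed once, then served from memory,
# so the lone MySQL instance only sees cache misses.
class SimpleCache
  def initialize
    @store = {}
  end

  # Return the cached value, or run the block on a miss and cache it.
  def fetch(key)
    @store.fetch(key) { |k| @store[k] = yield }
  end
end

cache   = SimpleCache.new
db_hits = 0
query   = lambda { db_hits += 1; ["Article A", "Article B"] }  # stand-in for a MySQL query

3.times { cache.fetch("front_page_top25") { query.call } }
puts db_hits  # the database was only queried on the first request
```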
Also, after processing the raw log data with Hadoop, we only need to store/lookup 3M records in the MySQL presentation layer, which is well within the capabilities of a tuned RDBMS. Many Rails sites are backed by MySQL, so I thought linking Hadoop/Hive to a common data workflow would make for a good example.
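The hand-off step between Hadoop/Hive and MySQL can be sketched like this: Hive emits tab-delimited files, and MySQL bulk-loads them far faster than row-by-row INSERTs. This is a hedged illustration, not the project's actual script; the table and column names (`daily_trends`, `title`, `pageviews`) are invented for the example.

```ruby
require 'tempfile'

# Pretend this is tab-delimited output from a Hive query.
rows = [["Hadoop", 1850], ["MapReduce", 940]]

tsv = Tempfile.new(['daily_trends', '.tsv'])
rows.each { |title, views| tsv.puts([title, views].join("\t")) }
tsv.close

# Build the bulk-load statement MySQL would run; LOAD DATA is
# the fast path compared to inserting 3M rows one at a time.
load_sql = "LOAD DATA LOCAL INFILE '#{tsv.path}' " \
           "INTO TABLE daily_trends (title, pageviews);"
puts load_sql
```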
I've been hearing that recent improvements in HBase 0.20 could make it a contender: http://stackoverflow.com/questions/1022150/is-hbase-stable-a... and some high-volume sites like Mahalo are already using it. That said, there are other alternative datastores (Cassandra, Voldemort, Tokyo Tyrant) that might be worth exploring if a relational database isn't cutting it for you.
Full source code on Github: http://github.com/datawrangling/trendingtopics/tree/master
Dataset on Amazon Public Data Sets: http://developer.amazonwebservices.com/connect/entry.jspa?ex...