How many big data jobs were being processed by MapReduce in the 70s, 80s, early 90s? Ya, that's right: none. Sanjay and Jeff were the first to apply the combination of map-shuffle-and-reduce as we know it today to big data processing.
British Telecom used map/reduce in billing systems for the Dialcom (Telecom Gold) platform in the '80s - that was on the largest (non-black) Prime minicomputer site in the UK.
Back then, 17x Prime 750s would be roughly the same as one of the 5k-plus clusters that Yahoo et al. use.
We used the normal file system (Prime's, probably descended from ITS) and had a load of job-control scripts written in CPL (Prime's JCL language) to sync everything up over our Cambridge Ring to two sites.
> MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
...
> The "MapReduce System" (also called "infrastructure" or "framework") orchestrates by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
...
> The name MapReduce originally referred to the proprietary Google technology but has since been genericized.
So it would be quite impossible to have a MapReduce system without distributed computing infrastructure; even if you were doing mapping and reducing, it wouldn't be MapReduce.
How do you do distributed processing without a distributed filesystem? Do you mean you'd load the filesystem into memory and send it to the "processors"?
The data could be stored on a network device, such as a file server or database, for example. It could indeed be local, but it needn't be distributed.
In the example GP gave, the data could possibly have been stored in a database queried using segmentation via consistent hashing (a basic way to distribute jobs across a known number of workers).
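Something like this toy sketch shows the idea (pure Python, made-up names, nobody's actual system): each key hashes to a fixed point on a ring, so a known set of workers can split the keyspace between them without any shared distributed filesystem:

    import bisect
    import hashlib

    # Minimal consistent-hash ring: a key goes to the first worker whose
    # point on the ring is >= the key's hash (wrapping around at the end).
    class HashRing:
        def __init__(self, workers, replicas=100):
            self.ring = sorted(
                (self._hash("%s:%d" % (w, i)), w)
                for w in workers for i in range(replicas)
            )
            self.points = [p for p, _ in self.ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def worker_for(self, key):
            idx = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
            return self.ring[idx][1]

    ring = HashRing(["worker-1", "worker-2", "worker-3"])
    print(ring.worker_for("customer:42"))  # the same key always lands on the same worker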
...defeating the entire purpose of large-scale parallelism on commodity machines. OTOH, if you have a way of achieving on the order of 500x parallelism with a centralized commodity server or database, I would love to hear it.
EDIT @supermatt: Ah I see, we differ on the definition then; to me it isn't big data/large scale unless it churns through large amounts of stored data. Bitcoin mining is nowhere in the ballpark of this; it's an append-only log of solutions computed in parallel.
How on earth do you think bitcoin mining pools work (as an extremely trivial example)? They coordinate ranges between a number of workers. The stored size of those ranges is minuscule in comparison to the hashing work done over those ranges on each 'miner'. These 'coordinators' absolutely work as a centralised 'commodity' storage server (or database) resource for 500x+ parallelism.
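A deliberately toy sketch of that coordination (not a real pool protocol, just to show the shape): the coordinator's state is a handful of (start, end) pairs, while the expensive hashing happens on the workers:

    import hashlib

    # Coordinator: split the nonce space into ranges - a few bytes per worker.
    def make_ranges(total, workers):
        step = total // workers
        return [(i * step, (i + 1) * step) for i in range(workers)]

    # Worker: grind through its range looking for a hash with the target prefix.
    def mine(header, nonce_range, prefix="0000"):
        for nonce in range(*nonce_range):
            digest = hashlib.sha256(("%s%d" % (header, nonce)).encode()).hexdigest()
            if digest.startswith(prefix):
                return nonce, digest
        return None

    ranges = make_ranges(total=400_000, workers=4)
    print([r for r in (mine("example-header", rng) for rng in ranges) if r])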
'Big Data' means 'Big Data', not 'Big Storage'. They are completely different things.
The bitcoin example may be a bit oversimplified, and may indeed lean more towards HPC. The example was intended to illustrate data locality (as per the parent question), not the actual computation.
Big Data may incorporate data from various third-party, remote, local, or even random sources. For example, testing whether the URLs in a search engine's index are currently available: this may be a map/reduce job, it may use a local source of URLs, but it will also incorporate a remote check of each URL.
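As a rough sketch of what such a job could look like (standard-library Python, hypothetical example.com URLs, single process rather than a real cluster): the map step does the remote availability check, the reduce step just tallies the results:

    import urllib.request
    from collections import Counter

    # Map step: the URL comes from a local list, but checking it is a remote operation.
    def check(url):
        try:
            urllib.request.urlopen(url, timeout=5)
            return ("available", url)
        except Exception:
            return ("unavailable", url)

    # Hypothetical slice of an index; a real job would have millions of URLs.
    urls = ["https://example.com", "https://example.org/missing-page"]

    mapped = [check(u) for u in urls]                  # map
    counts = Counter(status for status, _ in mapped)   # reduce: tally per status
    print(counts)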
As I said a few links up: DFS is not a requirement for map/reduce.
All MapReduce frameworks I know about today are built on DFSs. There are definitely plenty of frameworks that support map and reduce that don't (e.g. MPI), but these aren't systems based on what was described in the OSDI 2004 paper where the word MapReduce was introduced.
I guess people just fixate on the terms map and reduce, when the focus of MapReduce really was... shuffle.
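To make that concrete, here's a toy single-process word count (assumed example data, nothing from the paper): the map and reduce functions are one-liners, and the part that actually defines MapReduce-the-system is the shuffle, grouping every intermediate pair by key so each reduce call sees all the values for one key - which on a real cluster means moving data between machines:

    from collections import defaultdict

    def map_fn(doc):
        return [(word, 1) for word in doc.split()]  # map: emit (word, 1)

    def reduce_fn(word, counts):
        return word, sum(counts)                    # reduce: sum per word

    docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # Shuffle: group intermediate pairs by key; this is the step a distributed
    # runtime spends most of its effort on (partitioning, sorting, network I/O).
    shuffled = defaultdict(list)
    for doc in docs:
        for word, count in map_fn(doc):
            shuffled[word].append(count)

    print(dict(reduce_fn(w, c) for w, c in shuffled.items()))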
I think the problem is that we are talking about two different things.
The very start of the paper describes the term and its methodology (which is what we are discussing), and then goes on to explain Google's own implementation using GFS (which you seem to be getting hung up on).
Keep in mind that this whole thread is about "MapReduce", which Holzle was talking about, not the more generic map and reduce that has been around since the 1800s (and they will continue mapping and reducing in their new dataflow framework, they just won't be using MapReduce). Now for the paper:
> Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages.
Inspired doesn't mean equivalent.
> Our use of a functional model with user specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
They are using map and reduce as a tool to get something else.
> The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.
They are very specific about what the contribution is. All work that has claimed to be an implementation of MapReduce has followed their core tenets. Even if MPI has a reduce function, it is not MapReduce, because it is based on other techniques.
I'm really tired of people who claim there is nothing new or even significant when there clearly was. Ya, everything is built on something these days, but so what? In the systems community, MapReduce has been a huge advance, and now we are moving on (at least for streaming).
I'm still in the camp of there being nothing new here. Now, GFS may be a different matter, but that was part of a different paper, and not a requirement of this one. Which is why I have kept stating that a DFS is not a requirement.
If that's what you believe, then you are going to miss out on the last 10 or so years of systems research and improvements. And when Google stops using MapReduce but the new thing still uses map and reduce, you are going to be kind of confused.
I've seen MapReduce done against fairly significant amounts of data stored (10s of TBs per run) on a SAN running over fibre. The compute nodes weren't particularly cheap either - I guess they were commodity machines, but quite a long way from the "cheapest possible" things Google uses.
But it was still useful: it was a good computing model for letting as many compute nodes as possible process data.
That might not be what Google was trying to achieve, but it's difficult to argue that it isn't MapReduce.
Databases? We should be so lucky :-) This was old-school ISAM files updated with Fortran 77, and 4 different log files, all with multiple types of records.
Our "Mappers" did quite a lot of work compared to most modern map functions.
I don't know about Mr Holzle but you're wrong about map/reduce. I'm aware of two significant counterexamples. I'm sure there are others.
Teradata's been doing map/reduce in their proprietary DBC 1012 AMP clusters since the '80s, providing analytical data warehousing for some of the world's largest companies[1]. Walmart used them to globally optimize their inventory.
MPI systems have been supporting distributed map/reduce operations since the early '90s (see MPI_REDUCE[2]).
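For what it's worth, a distributed reduction in MPI looks roughly like this (a sketch using the mpi4py Python bindings rather than the C API; assumes mpi4py is installed and the script is launched with something like mpirun -n 4):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank "maps" over its own slice of the input...
    local_sum = sum(x * x for x in range(rank * 1000, (rank + 1) * 1000))

    # ...and the reduce combines the partial results on rank 0 (MPI_Reduce under the hood).
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)

    if rank == 0:
        print("global sum of squares:", total)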
I see the crazies are out trying to redefine MapReduce as just being map and reduce and completely missing the point. But whatever, they've probably never seen big data loads and are definitely not involved in the industry.
There's certainly a hype around big data nowadays, often even up to the point of being ridiculous.
The point is that people are starting to use this term to describe something that isn't even technical anymore, let alone about an actual amount of data: merely using data to drive decision making.
This is not a new thing [0], yet there is a clear trend that shows how this kind of dependency is shifting from being auxiliary to being generative; some of the reasons are:
1. cheaper computing and storage power
2. increased computing literacy among scientists and non-scientists alike.
3. increased availability of digitalised content in many areas that capture human behaviour.
Where there's demand, there's opportunity for business. One thing that is new and big about Big Data is the market. It should be called "Big Market (of data)".
It's an overloaded term. IMHO it's counterproductive to let the hype around Big Data as a business term pollute the discussion about what contribution Google and others have made in the field of computer science and data processing.
So what did Google really invent? Obviously the name and the concept behind MapReduce weren't new. Nor was the fact that they started to process large amounts of data.
Size and growth are two key factors here. Although it's possible that NIH syndrome affected Google, it's also possible that existing products just weren't able to meet those two requirements. It's difficult to tell exactly how large, given that Google is not very keen on releasing numbers, although it's possible to find some announcements like [1]: "Google processed about 24 petabytes of data per day in 2009".
20 PB is 100 times more than 200 TB, and 10,000 times more than the terabyte-scale warehouses of the '80s. Stop and think a moment about what 10,000 means.
It's enough to completely change the problem, almost any problem. A room full of people becomes a metropolis; a US annual low-wage salary becomes 100 million dollars, more than the annual spending of Palau [2]. Well, it's silly to make those comparisons, but it's hard to think of anything that, scaled by 10,000, doesn't change profoundly.
Hell, this absurdly long post is well under 10k!
To stay in the realm of computer science: processor performance didn't increase by a factor of 10,000 from the PDP-11 of 1978 to a Xeon of 2005 [3].
Working at that scale poses unique problems, and that's where the real contributions to the advancement of the field made by the engineers and the engineering culture at Google lie. If anything, just knowing it's possible, and having some accounts of what they focused on, is inspiring.
This is the Big Data I care about. It's not about fanboyism. It's cool, it's real, it's rare. Arguing over who invented the map/reduce mechanics is like arguing that hierarchical filesystems were already there, hence any progress made in that area by countless engineers is just trivial.
This guy may work for Google, but he's a clown.