I am *pretty sure* that Google didn't invent map-reduce, which has been around s...

seanmcdirmid · on June 26, 2014

How many big data jobs were being processed by MapReduce in the 70s, 80s, early 90s? Ya, that's right: none. Sanjay and Jeff were the first to apply the combination of map-shuffle-and-reduce as we know it today to big data processing.

Also, Urs Holzle is not a clown.

walshemj · on June 26, 2014

British Telecom used map reduce in billing systems for the dialcom (telecom gold) platform in the 80's - that was on the largest (non black) prime minicomputer site in the UK.

Back then 17x 750's would be roughly the same as one the 5k plus clusters that yahoo etal use.

We even sold the system to NZ telecom

seanmcdirmid · on June 26, 2014

Interesting. What kind of distributed file system were they using?

walshemj · on June 26, 2014

we used the normal file system (primes probably descended from ITS) and had a load of JCL written in CPL (prime JCL) language to sync every thing up over our Cambridge ring to two sites.

(we had oxford street dug up for our 10MBs link)

supermatt · on June 26, 2014

A dfs isn't a requirement for map/reduce.

seanmcdirmid · on June 26, 2014

From http://en.wikipedia.org/wiki/MapReduce:

> MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

...

> The "MapReduce System" (also called "infrastructure" or "framework") orchestrates by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

...

> The name MapReduce originally referred to the proprietary Google technology but has since been genericized.

So it would be quite impossible to have a MapReduce system without distributed computing infrastructure; even if you were doing mapping and reducing, it wouldn't be MapReduce.

supermatt · on June 26, 2014

I see no mention of a distributed file system there. Local storage is not a requirement of distributed processing.

1stop · on June 26, 2014

How do you do distributed processing without a distributed filesystem? Do you mean you'd load the filesystem into memory and send it to the "processors"?

supermatt · on June 26, 2014

The data could be stored on a network device, such as a file server or database, for example. It could indeed be local, but it needn't be distributed.

In the example GP gave, the data could possibly have been stored in a database queried using segmentation via consistent hashing (a basic way to distribute jobs across a known number of workers).

srean · on June 26, 2014

...defeating the entire purpose: of large scale parallelism on commodity machines. OTOH if you have a way of achieving order 500X parallelism with a centralized commodity server or database, I would love to hear.

EDIT @supermatt Ah I see, we differ in the definition then, to me it isnt bigdata/largescale unless it churns through big amounts of stored data. Bitcoin mining is no where in the ball park of this, its an append only log of solutions computed in parallel.

supermatt · on June 26, 2014

How on earth do you think bitcoin mining pools work (as an extremely trivial example). They coordinate ranges between a number of workers. The stored size of those ranges is miniscule in comparison to the data of the hashes between those ranges calculated on each 'miner'. These 'coordinators' absolutely work as a centralised 'commodity' storage server (or database) resource for 500x+ parallelism.

'Big Data' means 'Big Data', not 'Big Storage'. They are completely different things.

seanmcdirmid · on June 26, 2014

Big data doesn't mean big computation, it actually means big data on lots of disks across many nodes. They are completely different things.

You might be into HPC, but that's not what Sanjay and Jeff did. HPC and big data loads are quite different.

supermatt · on June 26, 2014

The bitcoin example may be a bit oversimplified, and may indeed lean more towards HPC. The example was intended to illustrate data locality (as per the parent question), not the actual computation.

Big Data may incorporate data from various 3rd party, remote, local, or even random sources. For example, testing whether URLs in a search engines index are currently available. This may be a map/reduce job, it may utilize a local source of urls, but it will also incorporate a remote check of the url.

As I said a few links up: DFS is not a requirement for map/reduce.

seanmcdirmid · on June 26, 2014

All MapReduce frameworks I know about today are built on DFSs. There are definitely plenty of frameworks that support map and reduce that don't (e.g. MPI), but these aren't systems based on what was described in the OSDI 2004 paper where the word MapReduce was introduced.

I guess people just fixate on the terms map and reduce when the focus of MapReduce really was....shuffle.

supermatt · on June 26, 2014

I think the problem is that we are talking about two different things.

The very start of the paper describes the term and it's methodology (which is what we are discussing), and then goes on to explain googles own implementation using GFS (which you seem to be getting hung up on.)

seanmcdirmid · on June 27, 2014

Keep in mind that this whole thread is about "MapReduce", which Holzle was talking about, not the more generic map and reduce that has been around since the 1800s (and they will continue to mapping and reducing in their new dataflow framework, they just won't be using MapReduce). Now for the paper:

> Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages.

Inspired doesn't mean equivalent.

> Our use of a functional model with user specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

They are using map and reduce as a tool to get something else.

> The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

They are very specific about what the contribution is. All work that has claimed to be an implementation of MapReduce has followed their core tenants. Even if MPI has a reduce function, it is not MapReduce because it is based on other techniques.

I'm really tired of people who claim there is nothing new or even significant when there clearly was. Ya, everything is built on something these days, but so what? In the systems community, MapReduce has been a huge advance, and now we are moving on (at least for streaming).

supermatt · on June 27, 2014

I'm still in the camp of there being nothing new here. Now gfs may be a different matter, but that was part of a different paper, and not a requirement of this one. Which is why I have kept stating that a dfs is not a requirement.

seanmcdirmid · on June 28, 2014

If that's what you believe, then you are going to miss out on the last 10 or so years of systems research and improvements. And when Google stops using MapReduce but the new thing still uses map and reduce, you are going to be kind of confused.

nl · on June 26, 2014

I've seen MapReduce done against fairly significant amounts of data stored (10s of TBs per run) on a SAN running over fibre. The compute nodes weren't particularly cheap either - I guess they were commodity machines, but quite a long way from the "cheapest possible" things Google uses.

But it was still useful: it was a good computing model for letting as many compute nodes as possible process data.

That might not be what Google was trying to achieve, but it's difficult to argue that it isn't MapReduce.

walshemj · on June 26, 2014

Databases we should be so lucky :-) this was old school ISAM files updated with Fortran 77 and 4 different log files all with multiple types of records.

Our "Mappers" did quite a lot of work compared most modern map functions

walshemj · on June 26, 2014

I our case the first stage synced up all the required file systems and applied all the required updates before kicking off the mapper stage.

walshemj · on June 26, 2014

effectively yes each worker machine had an identical copy of the required ISAM files which where kept in sync by our system.

We had to build a lot of the functionality that comes out of the box in more modern system like hadoop

dbc1012 · on June 26, 2014

I don't know about Mr Holzle but you're wrong about map/reduce. I'm aware of two significant counterexamples. I'm sure there are others.

Teradata's been doing map/reduce in their proprietary DBC 1012 AMP clusters since the 80's, providing analytical data warehousing for some of the world's largest companies[1]. Walmart used them to globally optimize their inventory.

MPI systems have been supporting distributed map/reduce operations since the early 90's (see MPI_REDUCE[2]).

1- http://www.cs.rutgers.edu/~rmartin/teaching/fall99/papers/te...

2- http://www.mpi-forum.org/docs/mpi-1.0/mpi-10.ps

walshemj · on June 26, 2014

what does falsely claiming that google invented MR make him then ?

gaius · on June 26, 2014

I see the Google fanboys and wannabes are out in force on this thread.

seanmcdirmid · on June 26, 2014

I see the crazies are out trying to redefine MapReduce as just being map and reduce and completely missing the point. But whatever, they've probably never seen big data loads and are definitely not involved in the industry.

gaius · on June 26, 2014

Ooh, scary big data.

I could run your workloads in Excel without breaking a sweat. But go on kidding yourself.

seanmcdirmid · on June 28, 2014

I don't think Excel scales to 10 or 100 TB of data.

gaius · on June 28, 2014

In all seriousness tho', I was running data sets that big in Oracle, in 2006. You can see why I don't take "big data" seriously.

ithkuil · on July 1, 2014

There's certainly a hype around big data nowadays, often even up to the point of being ridiculous.

The point is that people are starting to use this term to describe something that it's not even technical anymore, let alone describe the actual amount of data: merely using data to drive decision making.

This is not a new thing [0], yet there is a clear trend that shows how this kind of dependency is shifting from being auxiliary to being generative; some of the reasons are:

1. cheaper computing and storage power

2. increased computing literacy among scientists and not.

3. increased availability of digitalised content in many areas that capture human behaviour.

When there's request, there's opportunity for business. One thing that is new and big about Big Data is the market. It should be called "Big Market (of data)".

It's an overloaded term. IMHO it's counterproductive to let the hype around Big Data as a business term pollute the discussion about what contribution Google and others have made in the field of computer science and data processing.

So what did Google really invent? Obviously the name and concept behind MapReduce wasn't new. Nor the fact that they did start to process large amounts of data.

Size and growth are two key factors here. Although it's possible that the NIH syndrome affected Google, it's possible that existing products just weren't able to solve those two requirements. It's difficult to tell exactly how large given that the Google is not very keen at releasing numbers, although it's possible to find some announcements like [1] "Google processed about 24 petabytes of data per day in 2009".

20P is 10000 times more that 200 T. Stop to think a moment what does 10000 mean. It's enough to completely change the problem, almost any problem. A room full of people becomes a metropolis; an US annual low wage salary becomes 100 million dollars, more than the annual spending of Palau [2]. Well, it's silly to make those comparison, but it's hard to think about anything that scaled by 10000 doesn't change profoundly. Hell, this absurdly long post is well under 10k!

To stay in the realm of computer science, processor performance didn't increase by a factor of 10000 since PDP-11 from 1978 to Xeon from 2005 [3].

Working at that scale poses unique problems, and that's where real the contributions to the advancement of the field made by the engineers and the engineering culture at Google are placed. If anything, just knowing it's possible and having some accounts on what they focused on is inspiring.

This is the Big Data I care about. It's not about fanboyism. It's cool, it's real, it's rare. Arguing who invented the map reduce mechanics is like arguing that hierarchical filesystems where already there hence any progress made in that area by countless engineers is just trivial.

[0] Historical perspective: James Gleick , http://en.wikipedia.org/wiki/The_Information:_A_History,_a_T...

[1] http://dl.acm.org/citation.cfm?doid=1327452.1327492

[2] https://www.cia.gov/library/publications/the-world-factbook/...

[3] http://www.cs.columbia.edu/~sedwards/classes/2012/3827-sprin...

rbanffy · on June 30, 2014

What was big data in the 70s, 80s and 90s? We just didn't call it map-reduce at the time.

gaius · on June 26, 2014

"Big data" is not a thing, and neither is "the cloud", while I'm here.

seanmcdirmid · on June 26, 2014

Well, then, you really don't understand the value of their contribution, which you have in your mind is just "map" and "reduce."