An Interview with an Anonymous Data Scientist (2016) (logicmag.io)
265 points by PaulJulius on Dec 11, 2017 | 103 comments



Good interview; there are a bunch of bits I feel like I ought to be Quoting For Truth, but then I'd end up with a pretty bloated reply.

> I want to emphasize that historically, from the very first moment somebody thought of computers, there has been a notion of: “Oh, can the computer talk to me, can it learn to love?” And somebody, some yahoo, will be like, “Oh absolutely!” And then a bunch of people will put money into it, and then they'll be disappointed.

Reminds me of a pre-transistor computing quote from Charles Babbage, about some overeager British politicians:

> On two occasions I have been asked, — "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" In one case a member of the Upper, and in the other a member of the Lower, House put this question. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.


For some devil's advocacy...

I remember hearing from some old salt in the oil business that geologists are the wrong people to ask about peak oil. They always underestimated future discoveries. The ones that tended to get it right were financiers and investors.

The idea is that geologists have their noses down in the details of practical, useful knowledge that they have or can get. Financiers don't really know anything, just that wells have been found in the past. They just model things like exploration money, the rate and quality of new finds, oil prices, production costs...

There could be something similar here. The real technology people see mostly problems. All the stuff that would need to be solved, that they have no idea how to solve. The fact that we don't even know what intelligence is. The frauds making audacious claims.

Outsiders see drones, self driving cars, spam filters, google search, chess, face recognition, translation, chatbots^. They see that voice recognition now works. I reckon medical diagnosis might do something soon. In any case, it seems that one way or another, these add up to something. ...just as a hunch.

Obviously I don't know the answer and this whole comment is based on an anecdote that may not even be true. Still, I don't discount the possibility that the unwashed masses are right.

^just kidding


This reminds me of the inspiration for the name of Taleb's "green lumber fallacy". From Wikipedia:

The term green lumber refers to a story by authors Jim Paul and Brendan Moynihan in their book What I Learned Losing A Million Dollars, where a trader made a fortune trading lumber he thought was literally "green" rather than fresh cut.[26] "This gets at the idea that a supposed understanding of an investment rationale, a narrative or a theoretical model is unhelpful in practical trading."[27]

The protagonist makes a big discovery. He remarks that a fellow named Joe Siegel, one of the most successful traders in a commodity called "green lumber," actually thought that it was lumber painted green (rather than freshly cut lumber, called green because it had not been dried). And he made it his profession to trade the stuff! Meanwhile the narrator was into grand intellectual theories and narratives of what caused the price of commodities to move, and went bust. It is not just that the successful expert on lumber was ignorant of central matters like the designation "green." He also knew things about lumber that nonexperts think are unimportant. People we call ignorant might not be ignorant. The fact is that predicting the order flow in lumber and the usual narrative had little to do with the details one would assume from the outside are important. People who do things in the field are not subjected to a set exam; they are selected in the most nonnarrative manner—nice arguments don’t make much difference.[25]


That is a very impressive anecdote in how powerfully it expresses the dichotomy in thinking styles between two different types of people. Babbage the engineer presumed the most literal interpretation of the query and came to the most logical conclusion. But arguably the politicians, having a keener understanding and appreciation of human fallibility, had the more profound and shockingly prognosticative insight.


What is this supposed to mean?


I believe what the politicians really wanted to ask was: “how do we account for the tendency for human operators to make mistakes?” If you stop to think about it, what seems at first like a dumb question in fact presages difficulties that will have to be addressed by everything from unit testing to spell checkers to entire classes of non-deterministic algorithms.


I've always wondered if those questions came after a statement like "this eliminates the possibility of errors", and were less of a question than a statement.


Speaking as a 'loon', his AI history is wrong in several places:

1. The Fifth Generation Project (https://en.wikipedia.org/wiki/Fifth_generation_computer) was a 1980s effort officially ending in 1992, not 'late 1990s' (during the Dot-com bubble?!).

2. The Lisp bubble didn't pop because of a failed DoD piloting project; it popped because of the first AI Winter + commodity SPARC/x86 pressure + recession (https://en.wikipedia.org/wiki/Lisp_machine). (And I don't recall DARPA instituting any policy like 'no AI', just stopping subsidizing Symbolics and later Connection Machine.)

3. The Club of Rome report couldn't've killed its modeling language, because it only really acquired its present ill repute by the 1990s; the implementation language Modelica (https://en.wikipedia.org/wiki/Modelica) didn't die (last release: April 2017) and is still in industrial use, which is more than almost all languages from the 1960s-1970s can say; and even the World3 model (https://en.wikipedia.org/wiki/World3) analyzed in the report continued development for decades.

4. The Oxford paper (https://www.fhi.ox.ac.uk/wp-content/uploads/The-Future-of-Em...) doesn't make precise forecasts for when any automation may happen (merely saying "associated occupations are potentially automatable over some unspecified number of years, perhaps a decade or two").

5. The GPU server comparison is really weird, as computers have almost always cost more than humans, and only relatively recently do any computers' hourly costs fall below minimum wage.

6. The Dartmouth description is wrong; the conference merely proposed (http://www-formal.stanford.edu/jmc/history/dartmouth/dartmou...) that meaningful progress could be made by 10 researchers, not grad students ("We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College...We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.")

Also, come on dude, Keras isn't hard to use - it's not even comparable to Tensorflow. But at least he didn't tell the tank story.
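
For what it's worth, here's a minimal sketch of the kind of model definition Keras makes easy (my own toy example; the layer sizes and the 100-feature input are arbitrary placeholders, not anything from the interview):

    # A tiny binary classifier in Keras; sizes are arbitrary placeholders.
    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential([
        Dense(64, activation='relu', input_shape=(100,)),  # 100 input features (assumed)
        Dense(1, activation='sigmoid'),                    # binary output
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    # model.fit(X_train, y_train, epochs=5)  # X_train/y_train would be your own data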


Here's another factual error: the term "data science" dates back to the 1960s and was used in a paper published by Peter Naur in 1974: https://en.wikipedia.org/wiki/Data_science


Data science is actually statistics, which goes quite a bit further back than the 1960s. In fact, today's data scientists love to quote Box and Fisher.

Data science and data mining are victories of marketing over common sense.


Sorry, I meant that in the sense of the origin of the term. But yes, DS is mostly just another word for statistics. About as pointless as the term AI has become.


And there's more where he's plain wrong, like Aluminium.

Despite all that a great antidote to the overhype that I see most days.


I did notice that one, but aluminum is kind of a complex topic (https://en.wikipedia.org/wiki/Aluminium#Synthesis_of_metal): the early cost was both the chemical processing and the low ore content, and one could charitably read him as referring to discovering bauxite and the electrolysis method, and then he's certainly right about the cost of electricity coming down drastically and making aluminum even cheaper. So not clearly wrong IMO, given that it's an extemporaneous interview.


> On two occasions I have been asked, — "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" In one case a member of the Upper, and in the other a member of the Lower, House put this question. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

Luckily, math has developed methods such as error-detecting/error-correcting codes (to guard against small typos/transmission errors) and constructive results on the continuity and robustness of functions: i.e., we can prove that if the error in the input data is less than some concretely computable delta, the solution will have an error less than epsilon; or we can ensure that the error in the solution is less than some computable epsilon if we can ensure that the error in the input data "is not too large" (i.e., bounded by some computable delta), etc.

In this sense I don't consider the question as that absurd.
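
As a toy illustration (my own sketch, not from the thread): a 3x repetition code corrects any single flipped bit per group, though it obviously cannot fix an input that was already wrong before encoding, which is the politicians' scenario:

    # Toy (3,1) repetition code: each bit is sent three times and decoded by
    # majority vote, so any single transmission error per triple is corrected.
    def encode(bits):
        return [b for b in bits for _ in range(3)]

    def decode(coded):
        triples = [coded[i:i+3] for i in range(0, len(coded), 3)]
        return [1 if sum(t) >= 2 else 0 for t in triples]

    msg = [1, 0, 1, 1]
    sent = encode(msg)
    sent[4] ^= 1                  # flip one bit "in transit"
    assert decode(sent) == msg    # the transmission error is corrected
    # But if the message itself was wrong before encoding, nothing here helps.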


From what I know, error correcting codes wrap around information (in a manner of speaking) so as to provide a measure of consistency, which then enables error correction properties. If the information itself is riddled with errors then the error correcting code can't do anything here.

People using Babbage's machine would have entered raw information into that thing. No error correcting code would correct the human induced flaws in that. So the question was absurd at the time.


And yet the Schiaparelli lander crashed because the machine couldn't give the right answer to a question that was wrong.

All these solutions are good for a noisy input, but have no use when the input is incorrect (i.e. doesn't match reality).


> You become so acutely aware of the limitations of what you’re doing that the interest just gets beaten out of you. You would never go and say, “Oh yeah, I know the secret to building human-level AI.”

A colleague of mine called these "educated incapacities" - where we become acutely aware of impossibilities and lose sight of possibilities. Andrej Karpathy, in one of his interviews iirc, said something like "if you ask folks in nonlinear optimization, they'll tell you that DL is not possible".

It is useful to keep that innocence alive despite being educated, especially if the cost to trying something out doesn't involve radical health risks. That plus a balance with scholarship.

Knowledge, courage and the means to execute are all needed.


Right. I found that part of the article particularly irritating - there are tons of examples of researchers making substantial contributions outside of their primary field, cf https://mathoverflow.net/q/173268/6360


> If you ask folks in nonlinear optimization, they'll tell you that DL is not possible.

I sincerely doubt anyone who knows more than one sentence about deep learning would say that, since deep learning doesn't claim to optimize.


I suspect that what he's referring to is that he's heuristically minimizing a somewhat arbitrary (loss) function in a million-ish dimensions using the simple variants of gradient descent that work under these conditions. It sounds far too WIBNI ("wouldn't it be nice if") to produce good results reliably (in practice, let alone in theory). The landscape has so many stationary points at which to get stuck; why would you ever get good results?

There's a small cottage industry of papers (like [0]) that try to explain this.

[0] https://arxiv.org/pdf/1412.0233.pdf
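
For concreteness, a toy sketch (mine, not taken from the cited paper) of what "heuristically minimizing a non-convex loss with plain gradient descent" looks like; on a bumpy 2-D surface each random start settles into some nearby stationary point rather than the global optimum:

    import numpy as np

    # A deliberately bumpy, non-convex 2-D "loss" surface.
    def loss(w):
        return np.sum(w**2) + 2.0 * np.sum(np.sin(3.0 * w))

    def grad(w):
        return 2.0 * w + 6.0 * np.cos(3.0 * w)

    rng = np.random.default_rng(0)
    w = rng.normal(size=2) * 3.0       # random initialization
    lr = 0.01
    for _ in range(2000):              # plain gradient descent, no tricks
        w -= lr * grad(w)

    print("settled at", w, "loss", loss(w))
    # Different random starts land in different stationary points; in deep
    # learning the surprise is that most of them generalize about equally well.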


I think this recent paper [1] sheds quite a bit of light on this.

[1] https://arxiv.org/abs/1703.00810v3


Really don't think that's the best paper to say "sheds quite a bit of light on this". That paper has been somewhat controversial since it came out.

I think https://arxiv.org/abs/1609.04836 is seminal in linking non-sharp (flat) minima to generalization, the parent's paper is good for showing that gradient descent over non-convex surfaces works fine, and https://arxiv.org/abs/1611.03530 is a landmark for kicking off this whole generalization business (it mainly shows that traditional models of generalization, namely VC dimension and notions of "capacity", don't make sense for neural nets).


You are right. Unfortunately, many (doubly unfortunately, even in academia; well, many who switched careers from optimization to ML) think that machine learning is just optimization.

Regarding deep NNs, one should be careful what one wishes for, because sometimes wishes come true. Ending up with the global optimum of that loss would likely be the last thing one wants.

The key to deep NNs is to do such a pathetic job of optimizing the loss that the generalization is good. A problem is that there are several different ways of doing a poor job, and not all of them generalize well. When I have my engineer hat on, I would rather not have lots of indeterminism on my watch if I can afford it. Too dang hard to maintain correctness of.

On the other hand if one has a "with high probability" style result where the probabilities are high enough to be practically relevant, then we have something more workable.


I don't understand why you don't want a global optimum. Is this obvious? Does the following paragraph explain it, because I don't see the connection.


It happens when practitioners generalize theorems to scenarios that look similar but where the theorems don't apply. The common pattern is misapplying an infinite-set theorem to the finite-set case. If you don't know about the theorem in question to begin with, there is no way for you to misrepresent it.


I think there is also a pretty pervasive over-estimation of how capable humans are.

As I see more of the failure modes of deep learning, a lot of successes and mistakes made by humans start to become more understandable. Machines don't need to be perfect or avoid failures; like humans, they need to work most of the time and then be used in systems that are tolerant of their potential faults and mistakes.


I work at a tech company, and one of the things I have recently noticed is how the terms ML and AI are increasingly used by the business people, the ones with no technical understanding: accountants or marketing guys saying we should ask the tech team to design ML to solve these problems. It's as if ML is a thing to throw at every kind of imaginable problem and it will be magically solved. I believe a lot of this has to do with the PR around it by big tech companies.

Take, for example, the recent AlphaZero vs. Stockfish PR: it has been spun by Google as if it were some kind of magic. You hear a lot about how it took just 4 hours, and I find it hard to explain to people that the 4-hour figure is meaningless; what matters is how many games it could play in that time. Moreover, the match happened between two systems on different hardware, which is a big difference, as is the fact that it used an arbitrary time control of 1 min/move. Again, this can make a big difference, but it is a struggle to get past the PR fluff.

To be clear, I am not denying the advances made by DeepMind. I just want people to understand that they came on the back of probably the world's best team of scientists, alongside state-of-the-art Google-designed hardware and the incredible monetary resources of Google.


I'm pretty sure you can throw IBM Watson's AI at any of these business problems and you can solve it very quickly.


This articulated so much of what I have learned about the field in the past 5 years. As someone who inherited the title 'data scientist' because that's how my department designated us when it became fashionable, who felt fraudulent due to the unlimited expectations of what data science is vs. what I understood it to be, and who has subsequently interviewed probably nearly a hundred data science and machine learning 'experts': there seems to be little cohesion to what these terms describe, little understanding by laypersons of data science beyond that it is some kind of magic that only the very gifted can command, and I have never seen a greater distance between hubris and praxis sustain itself for so long and so intensely.

The whole interview was an absolute joy to read.


It was 2016 and he said "I’ve noticed on AWS prices was that a few months ago, the spot prices on their GPU compute instances were $26 an hour for a four-GP machine, and $6.50 an hour for a one-GP machine. That’s the first time I’ve seen a computer that has human wages.."

Minimum wage (or thereabouts $7.20) now gets you a whopping p2.8xlarge (8 GPU, 32 vcpus, 488GB RAM), and the single GPU machine p2.xlarge is now $0.9 per hour.

This is a crazy data point. What will minimum wage buy you five years from now?
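
Back-of-the-envelope, using only the figures quoted in this thread (a rough sketch that ignores spot-price volatility and differences between GPU generations):

    # Per-GPU-hour cost implied by the numbers above.
    cost_2016_4gpu = 26.00            # $/hr, 4-GPU spot instance quoted in the interview
    cost_2017_8gpu = 7.20             # $/hr, roughly minimum wage, p2.8xlarge (8 GPUs)

    per_gpu_2016 = cost_2016_4gpu / 4   # -> 6.50 $/GPU-hr
    per_gpu_2017 = cost_2017_8gpu / 8   # -> 0.90 $/GPU-hr (matches the p2.xlarge price)

    print(per_gpu_2016 / per_gpu_2017)  # ~7x cheaper per GPU-hour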


Depends, do you think the lowest legal wage should go up or down?


Even if I wanted it to double, I don't think that would make it more likely to actually happen. I think the likelihood of machine power available being double or quadruple what it is now is pretty good.


g3.xlarge is many times faster and spot prices are like $0.50 per hour.


This reminds me of ... What’s the difference between a data scientist and a statistician? A data scientist lives in San Francisco.


> A data scientist lives in San Francisco

That's completely over-simplifying matters. Data scientists also drink soy lattes and ride children's push scooters.


More cynically, the difference is 100k/year :P.


In rent? :D


Believe it or not, they are flooding Southeast Asia, too.


The data scientist optimizes ad clicks


data science? doing statistics on a mac...


It's an interesting read, though not very enlightening in terms of new information. It's the same old pre-existing arguments put in a more informal, more directly honest package.

As another person who's seen robots fall over again and again and has a sense of the scale of the difficulty of the problem, I'd say there's also the risk of the day to day failures making us lose sight of the forest for the trees, with availability bias working against us.

Also,

> the Y Combinator autistic Stanford guy thing

> the Aspy worldview

It's a bit worrying that use of these terms has turned into a kind of slur, to lump a kind of imagined stunted-worldview with a medical diagnosis. Not particularly pissed that this guy used these, more worried about what it indicates - that these have become so common as to infiltrate friendly informal conversations from seemingly intelligent people.


Yeah, I was shocked when I came across that. The data scientist appeared to be really in tune with ethical problems, and then speaks like that. It's very disappointing.


It is just so amazingly refreshing to read something not put together by a know-nothing.

I wish I saw more than one or two of these a year.


Any bets on when the current deep learning bubble is going to burst?

It’s shocking to me how much technical people buy into this, how “this time it’s different” and AI isn’t “over-promising and substantially under-delivering” this time. Really odd to watch it come round again, when the reality is we’re more likely to see some more incremental progress, partly fueled by more compute and algorithmic advances, and partly by a lot of PR.


I think we're just used to computers advancing noticeably on a regular basis: "Is this year's iPhone better enough to justify an upgrade?"

Also, we judge the difficulty of things by our own experience. It took us ~1 billion years to get to the point where we could communicate abstract ideas and play chess. These were once believed to be the challenging problems in AI.

It turned out that chess is easy; we're just relatively bad at it.


Chess is easy when you have the hardware to effectively brute force it. Once someone develops an algorithm that evaluates a number of moves comparable to a human, and still significantly outperforms a human, then AI will be interesting.


I find it somewhat understandable from non-tech people. I’m more surprised at how much people within the tech world buy the hype.


The big tech companies are demonstrably using deep learning to solve previously unsolvable problems. It's a significant advance.

What's yet to be seen is if startups can profit from this advance, since it depends on massive data and compute.


AlphaGo is interesting. But what big new problems have been solved? (rather than incrementally improved).


The big results as far as I understand:

- Image recognition

- Winning the games computers hadn't won already

- Incremental progress on translation. Plus translation that doesn't need as many domain experts

- Self-driving cars (with related automation applications)

Of these, image recognition stands out as the big leap, and the rest are relatively incremental. One of the things with the other applications is that deep learning provides a recipe that's more systematic than previous approaches. A lot of vision approaches pre-deep-learning were very hit-or-miss. Deep learning has a lot of black art involved in effective training and a lot of time investment, but my impression is that it is more reliable than what came before.

Any other examples welcome


You forgot speech recognition.


This isn't related to alpha go as such, but we can now predict which citizens are going to need help raising their children by having a machine cross reference their case history with public records.

It's not legal yet, but it will be, because it will potentially save lives (and money).


Well, "big new" sounds like a destructive qualification. Why can't it be "big" and "old"? https://arxiv.org/pdf/1712.01208.pdf



Looks like great incremental progress. Have you seen the state of Japanese<->English translation? It’s almost completely useless.

I really don’t see this as a huge win for deep learning, anything else?


Whether progress is incremental is an ill defined question. I don't consider "super human translation" to be incremental. The key point here is that deep learning has produced significant results. I'm not sure why you care to argue semantics.


Well, I’m interested in understanding how valuable deep learning is and whether it lives up to the hype.

Better translation of European languages (which wasn’t a totally unsolved problem anyway) doesn’t seem to be something that really lives up to the hype.

Particularly as the article cited doesn’t seem to back up its statements very well.

So... anything else?


If super human translation doesn't impress you, what will?


The article doesn’t make that statement. The article doesn’t provide data to support any statements (it’s a pop science piece).

The original blog:

https://research.googleblog.com/2016/09/a-neural-network-for...

It is a better source, and it suggests deep learning resulted in maybe a 10% improvement. It isn’t as good as a human in all cases.


Ah I misread "as good" as "better". It's still an epsilon difference, though. And the article lists other applications that have had "step wise" improvements, which is the opposite of incremental of course.

Also, you didn't answer the question.


It won't. It's going to change the way we run the public sector. Not so much because the tech is revolutionary but because the upper echelons of society are sold on it and are actually putting it to use.

Technically you could do a lot of the decision making it'll be doing with human made models and a lot of data, but the machine is cheaper and it's backed by consulting agencies.

RPA was the first indication. It's basically screenscraping and small bots, stuff that's been around for a long time, I mean, it's basically what people use to bot in video games. Yet it's become a multimillion dollar industry over the course of a few years because it caught the right drift.

Like RPA, machine learning isn't just hype. It actually does some things with data really well, and when you couple that with the fact that ministers want this tech, well, that's all you need.


Depends who you ask. If you talk to people knowledgeable about deep learning and its applicability they’ll say we’re in the productivity regime. If you’re asking people who aren’t knowledgeable then they will display their hype.


I've noticed an alarming uptick in articles around job titles and what people call themselves, so I feel compelled to say something. I couldn't care less what someone calls themselves as long as they can actually get shit done. The focus on titles is misplaced, especially for people who work at a BigCo, since most titles in such places are handed down by HR anyway, so I don't focus too much on them. What is the person actually doing on a day-to-day basis? Is it stats? Is it exploratory analysis and modeling? Are they using ML, or working with data that doesn't fit on a single commodity machine? Writing people off based on the titles they might have had at some job (which they probably had no control over) is a good way to lose out on talent that you might have appreciated. But of course, this cuts both ways: would you want to work for someone who gets hung up on things like that?

Anyway, overall great article, but this was the one thing that bothered me enough to comment.


I enjoyed his comments on Tensorflow.

> It’s really bad to use. There’s so much hype around it, but the number of people who are actually using it to build real things that make a difference is probably very low.

I wonder how many data scientists out there are actually developing Tensorflow models for a mission-critical project at work. I'm not. I have used Tensorflow successfully within my personal projects, but I've yet to need it for anything "real."


We used it for a sales email classification problem--it significantly out-performed our conventional approaches (i.e. logistic regression + bag-of-words), but we were not PhDs and none of our job titles were "data scientist" so I guess that makes us charlatans ;)

That service offering was marginal relative to the rest of the business, so it never became something our sales team pitched to customers very aggressively; in this particular case TensorFlow did not move the needle, so to speak.
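
For reference, the kind of conventional baseline being described is roughly this (my own minimal scikit-learn sketch, not their actual pipeline; the example emails and labels are made up):

    # Bag-of-words + logistic regression email classifier: the "conventional
    # approach" baseline. The emails/labels here are placeholder toy data.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    emails = ["let's schedule a demo next week", "unsubscribe me please"]
    labels = [1, 0]   # e.g. 1 = sales opportunity, 0 = not

    clf = make_pipeline(CountVectorizer(), LogisticRegression())
    clf.fit(emails, labels)
    print(clf.predict(["can we get a quote for 50 seats?"]))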


I'm wondering what Tensorflow has to do with that out-performance, since it must really be about the model/algorithm you implemented in it. You could have had TF code running the same conventional approach you mentioned above, and it wouldn't have done any magic. Isn't it the algorithm, like a convnet, doing the magic, rather than TF itself?


Yes, TF is merely a framework for implementing convolutional neural nets, not a novel algorithm in itself.

We chose TF over other convolutional neural net libraries because it was 1. Python and 2. heavily sponsored by Google.


What TF model did you use?


This was "ages" ago, pre 1.0 so ~2 years ago. TBH, I can't recall which model we used. We ran it in production for several months on a proprietary training dataset of 30k emails, re-training it once a week.

I regret not following through more on that project, but hey, you've only got so much political capital to burn when people ask you "and how does it make us money?"


I'm currently using TF for a scientific algorithm that's completely unrelated to deep learning. The speedup over our previous solution is probably on the order of 1000x. There's nothing magical about Tensorflow, we were just too lazy/busy to dive deep on the legacy code, GPUify it, etc. Tensorflow let me do that in a couple of days. So, that's a win. OTOH I completely agree that the API and docs are completely inscrutable at times. Presumably Google is happy with it.


As other comments mention, if Tensorflow is seen for what it is, a framework for computation rather than just "a deep learning thingy", it may be pretty useful.

It is probably quite far from standard usage, but Tensorflow may be used to write custom inference for graphical models, for example. To be practical, these algorithms cannot be implemented in, say, pure Python.

The point is that Tensorflow gets you pretty close to bare-metal computation speed. The alternative is to write in, say, Cython, which is much more time-consuming and does not give you parallelization for free. Another alternative, I guess, would be Torch, but that is much the same as Tensorflow the way I see it.
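
A minimal sketch of that idea (mine, written against the TF 2.x eager API, which postdates this thread; the curve-fitting problem is just an arbitrary non-deep-learning stand-in):

    # Using TensorFlow as a generic numeric/optimization framework, not a DL library:
    # fit y = a*x + b by gradient descent, with autodiff and GPU execution for free.
    import tensorflow as tf

    x = tf.constant([0.0, 1.0, 2.0, 3.0])
    y = tf.constant([1.0, 3.1, 4.9, 7.2])        # noisy line, roughly y = 2x + 1

    a = tf.Variable(0.0)
    b = tf.Variable(0.0)

    for _ in range(500):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean((a * x + b - y) ** 2)
        da, db = tape.gradient(loss, [a, b])
        a.assign_sub(0.05 * da)
        b.assign_sub(0.05 * db)

    print(a.numpy(), b.numpy())                  # ~2.0, ~1.0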


Can anyone comment on his point about Spark's ML libs? I note that was from last year (about 2015 code), not sure what level of beta they were at, but yeah, I use it for batch processing, but have never used the ML aspects, so just curious.

> And even up to last year, there’s just massive bugs in the machine learning libraries that come bundled with Spark. It’s so bizarre, because you go to Caltrain, and there’s a giant banner showing a cool-looking data scientist peering at computers in some cool ways, advertising Spark, which is a platform that in my day job I know is just barely usable at best, or at worst, actively misleading.


Getting better obviously, but the feet-on-the-ground experience for MLlib is still far from pleasant: hard to configure, hard to manage, hard to scale, hard to debug.

By way of anecdote, Spark's MLlib used to contain an implementation of word2vec that failed when used on more than 2 billion words (some arcane integer overflow). So much for scale!

As for performance, in 2016, the break-even point where a Spark cluster started being competitive with a single-machine implementation was around 12 Spark machines (a bit of a hindrance to rapid iterative development, which is the cornerstone of R&D): https://radimrehurek.com/florence15.pdf


Can you be more specific in terms of issues with ML Lib? I'm thinking of using it with Spark cause of big data requirements, but have heard MLLib in particular is highly unreliable.


lol, that PDF is referencing Spark 1.3 from March 2015, and to say that you need 12 modern Spark machines to break even with one machine running a non-distributed ML framework is ridiculously wrong. And he ran Spark on EMR, which was pretty unoptimized back then.


Hah, this is a great interview! [You can't really trust someone who calls themselves a data scientist; they are just taking that exciting and financially rewarding name], loosely paraphrasing. Too bad it is anonymous. It totally fits my unfair preconceptions of this field. I know, I'm a "computer scientist" with a PhD; it's not a real science if you have to put science in the name, that's what they tell me.


I've been seeing nothing but negative, dismissive comments about data science on HN lately, which is really disappointing. There's definitely a lot of hype right now about DL, but almost all of my job does not deal with Big Data or Deep Learning, 'just' machine learning + stats + calc + scripting + data cleaning + deploying models.

I think most people don't have big data (Amazon has an x1 with 4 TB of RAM, after all!) but there's no shame in that. I'll use a big machine for grid search or other embarrassingly parallelizable stuff, but I can confirm that Spark is usually a bad tool for actual ML unless you use one of their out-of-the-box algos. Even then, tuning the cluster on EMR with YARN is a pain, especially for pyspark. There's a gap, I think, between the inflated expectations of "I'm going to get general AI in 5 years and CHANGE THE WORLD" and "this K-means clustering will be a good way to explore our reviews", but somewhere in the middle there is actual value.
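
For the curious, the "grid search on a big machine" pattern is just something like the following sketch; the estimator and parameter grid are arbitrary examples of mine, not a recommendation:

    # Embarrassingly parallel hyperparameter search on one big machine:
    # n_jobs=-1 uses every core, no Spark cluster required.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
        cv=5,
        n_jobs=-1,
    )
    grid.fit(X, y)
    print(grid.best_params_)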

(I also hate that "AI" is becoming the new hype-train; I don't consider anything I do to be "AI", but you have people calling CNNs or even non-deep-learning models "AI".) This is only going to result in inflated expectations; DS practitioners have to communicate the value without hype, and also find a way to weed out charlatans.


It's silly you're getting downvoted for this well-articulated and insightful comment.

I think much of the negativity towards DS from the programming community is because the Data Scientist is what the programmer used to be ~15 years ago. It's that nerdy thing for a select group of very smart people, whereas being a software developer/engineer/architect/whatever has become just another common job (at least outside of Silicon Valley).

Also, from my experience as the lone developer taking the first steps to implement machine learning techniques in my company - lots of developers also think DS/ML is a cool thing with value, but they simply, absolutely don't understand it (and don't want to put in the effort to learn). These techniques are not hard and not magic, but they require a completely different way to think about problems than "traditional" programming does. I've seen developers up and down the hierarchical ladder struggle with wrapping their heads around these concepts, and it's way easier to dismiss it all as "hype" instead of accepting the fact that these techniques will be a huge part of what software development will look like in the future.


> I've been seeing nothing but negative, dismissive comments about data science on HN lately, which is really disappointing. There's definitely a lot of hype right now about DL, but almost all of my job does not deal with Big Data or Deep Learning, 'just' machine learning + stats + calc + scripting + data cleaning + deploying models.

But all of those are things people did in the '90s or even earlier. It was called "data warehousing" or "decision support" back then. The fundamental techniques - linear regression, logistic regression, k-means clustering - go back even earlier, to the OR community post-WW2. Banks have been doing credit scoring with these techniques for a loooong time. The manufacturing industry has been using these techniques for even longer. Engineering for even longer than that.

So you can see why people are quite cynical about the way old, established techniques are being presented as the hot new thing - and you can see why people who have been doing this stuff for 20+ years might be annoyed at 20-somethings who claim to have invented this new thing. What's wrong with someone calling themselves a "statistician" or an "applied mathematician"?

But this is by no means purely a DS thing; it seems no one is a programmer anymore either, they're all "senior certified enterprise solution architects" or some grandiose thing.


> But, all those things people did in the '90's or even earlier. It was called "data warehousing" or "decision support" back then.

I would say data warehousing is more concerned with things like OLAP, Star Schema, ETL, etc. than what people are calling 'data science' right now. The same thing with 'decision support', since data warehousing grew out of decision support systems. The most overlap here is with 'data mining' algorithms like association rules and clustering.

> The fundamental techniques - linear regression, logistic regression, k-mean clustering - go back even earlier, to the OR community post-WW2.

Here I think you've got a stronger argument. OR has a long, proud history of using applied math for business objectives. But again, I would say most of OR deals with different problems and different techniques - it's more about prescriptive analytics, constrained optimization, linear programming, simulations, etc. than the type of predictive modeling in most data science.

I see data science as a separate field even though it's stitched together from a bunch of others. It's certainly not entirely new, and certainly overhyped in some annoyingly-breathless news reports. I could say the same thing about CS - was it entirely "new" when it started as a discipline? Isn't CS "just" applied math?


> seems noone is a programmer anymore either, they're all "senior certified enterprise solution architects"

To be fair, few of the "senior architects" I've worked with in big companies knew how to program very well.


I think their hype got even you a little bit. That is revealed by the word "even" in the phrase: 'people calling CNNs or even non-deep-learning models "AI"'...


What I mean by this is - I don't see how anyone could reasonably call a Random Forest "AI" with a straight face, whereas someone could (wrongly, but understandably) call a CNN / RNN / etc. AI if only because it has the word "neural" in it.

There's two groups:

- People who are overly enthusiastic about neural nets

- People who are cynically calling every ML algorithm "AI", up to and including linear regression

and I'm more annoyed at the last one.


To anyone non-technical, a decision is AI. 99% of the world is non-technical, so it's probably only going to continue to be this way.


Jeff Hammerbacher, the guy who coined the term Data Science, also said "The best minds of my generation are thinking about how to make people click ads. That sucks.”


Um, no, that's yet another falsehood in that interview; the term DS is much older and stems from Peter Naur: anecdotally coined in the 1960s, with a verifiable [edit: removed wrong ref] paper from 1974 using that term: https://en.wikipedia.org/wiki/Data_science


Interestingly, Tukey's (of fast Fourier fame) paper, "The future of Data Analysis" [1], was published circa 1961.

[1]: https://projecteuclid.org/download/pdf_1/euclid.aoms/1177704...


As important as it is to debunk the hype surrounding AI, it is also important to note that the recent advances in neural nets hinted that we're onto something regarding the functioning of the brain, and in my opinion it would be equally foolish to dismiss the _possibility_ of a breakthrough that would get us much closer to general AI (for instance, if someone came up with some kind of short-term/long-term memory mechanism that works well).

I personally think that the main reason why general AI may be very far away is that there is little incentive today to work on it. Specialized AI seems good enough to drive cars. Specialized AI should be good enough to put objects in boxes, cut vegetables, flip burgers and so on, and the economic impact of building that is much greater than the economic impact of making a robot that barely passes the Turing test and is otherwise fairly dumb or ethically unbounded.


> the data sets have gotten large enough where you can start to consider variable interactions in a way that’s becoming increasingly predictive. And there are a number of problems where the actual individual variables themselves don’t have a lot of meaning, or they are kind of ambiguous, or they are only very weak signals. There’s information in the correlation structure of the variables that can be revealed, but only through really huge amounts of data

This isn't really true, since this can be said of any ML model. ML is nothing new. Deep learning is new. It works because we have so much data that we can start to extract complex, nonlinear patterns.


> I feel like the Hollywood version of invention is: Thomas Edison goes into a lab, and comes out with a light bulb. And what you’re describing is that there are breakthroughs that happen, either at a conceptual level or a technological level, that people don’t have the capacity to take full advantage of yet, but which are later layered onto new advances.

Brilliant.


I'm not a native English speaker and I find this sentence from the article weird:

> Because the frightening thing is that even if you remove those specific variables, if the signal is there, you're going to find correlates with it all the time, and you either need to have a regulator that says, “You can use these variables, you can't use these variables,” or, I don't know, we need to change the law. As a data scientist I would prefer if that did not come out in the data. I think it's a question of how we deal with it. But I feel sensitive toward the machines, because we're telling them to optimize, and that's what they’re coming up with."

So is he saying that he is worried the optimisation throws up results that are not what he would like to see?


Race is an incredibly sensitive topic in America. The best analogy I can come up with for the author's statement is this:

You're looking to pick the fastest runners out of a group of people. You run an optimization algorithm to pick out the fastest in that group. Nothing about this optimization accounts for the fact that 1/3 of the people in the group had been shot in the foot prior to your optimization. The data will show that they are poor runners without addressing the crime previously committed. In fact, many people would consider it a second act of crime.
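
To make the "you're going to find correlates" point concrete, here is a toy sketch of my own with synthetic data: drop the sensitive variable entirely, and a model trained on an innocuous-looking correlated feature still reproduces the group disparity.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 10000
    z = rng.integers(0, 2, n)                        # sensitive attribute (group membership)
    proxy = z + rng.normal(0, 0.5, n)                # innocuous-looking feature correlated with z
    y = (z + rng.normal(0, 1, n) > 0.5).astype(int)  # outcome historically tied to z

    # Train WITHOUT the sensitive attribute; only the proxy is available.
    model = LogisticRegression().fit(proxy.reshape(-1, 1), y)
    pred = model.predict(proxy.reshape(-1, 1))

    # Predictions still differ sharply by group, even though z was never used.
    print("positive rate, group 0:", pred[z == 0].mean())
    print("positive rate, group 1:", pred[z == 1].mean())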


DL is hyped as a big thing, but why are multiple layers on a NN a breakthrough? The only breakthrough is hardware, but I don't see that hyped.


Shh, will you. Some truths are not to be aired in public.

We know that no manager got fired for choosing Java.

There is a researcher's version of that. No researcher got fired for making a neural network more 'convoluted'. It helps if there exists one dataset where it does 0.3% better. Doesn't matter if that data set is (and has been since the late '90s) standard fare as a homework problem in a machine learning course.

That said we do understand these things a bit better than before. Some concrete math is indeed coming out.


More layers allowed us to explore exponentially more network architectures. And if you look at a lot of advances in deep learning, particularly in convnets, the architecture is actually key, as important as or more important than the weights themselves. I guess another thing is that more layers give a disproportionate increase in performance. Some of it is hardware, but there have definitely been advances in the theory; people aren't getting these new results from 10- or 20-year-old networks that have just been made larger.


Am I the only one who was expecting to learn about data science and instead got some moralising?


Different communities play a game at different times: the pioneers at first, then the early comers, then the businessmen, then the masses, in the end the legislators.


I feel so many data scientists are bullshit. I've had the worst interviews, like someone telling me how ARIMA is so good and why would I even use an LSTM network. Even worse, they cite some bullshit consulting article with skewed data to prove their point.


Some interviewers ask me the stupidest questions: "how large is your dataset?", "have you ever worked with 100GB of data". Fucking morons.


Eh, pretty disappointing interview. It doesn't take a team to utilize GPU computing; it takes one person, and I've done it. Also, you can't complain about there being no strong-AI companies and then list accomplishments of strong-AI companies.

I personally don’t like the phrase data scientist but I get it and I get why it’s science as opposed to engineering. I personally like the split between machine learning, BI, and data engineering.


I think the contrast is with statistics and physics PhDs compiling GPU support... even some CS PhDs have a hard time with that... this is less important as time goes on since the engineers figure it out and make it readily available.


When I installed Theano, it was just `pip install theano`, and editing a couple of lines in a config file. Are other GPU libs (tensorflow, caffe, etc.) really that much more difficult?
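
For context, the "couple of lines in a config file" is roughly a ~/.theanorc along these lines; the exact device name depends on the Theano version and backend:

    [global]
    device = gpu        # "cuda" on the newer libgpuarray backend
    floatX = float32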


pip install tensorflow-gpu

is all I do, once the dependencies are set up.



