Can Spark Streaming Survive Chaos Monkey? (netflix.com)
56 points by ddispaltro on March 23, 2015 | hide | past | favorite | 18 comments



Interesting stuff as usual from Netflix, and the identifying and reporting of issues discovered under these conditions is really valuable for the community. Kudos to you all.


I notice they're using spark standalone clusters.

I've had problems with executors dying when running under YARN, and it makes keeping track of running jobs and finding log output more difficult. So unless you really need to run on the same instances as other MR tech, it seems standalone is the way to go.

If only AWS provided EMR AMIs for Spark standalone clusters, I could switch...


Is it Spark on YARN, or YARN on AWS, that's causing the issue?


I was trying to run Spark on EMR, so I was running it under YARN on EMR.

The problem I found seemed to have been hit by one or two other people on mailing lists - it seemed to be something to do with YARN terminating Spark executors for memory usage. It was odd since I wasn't using the cluster for anything else - YARN was only used as the scheduler because it was installed in the AMI, and I didn't want to use the ec2 scripts to start a standalone cluster on plain EC2 instances.
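One thing I've since seen suggested for this symptom: YARN kills containers that exceed their allocation, and spark.executor.memory alone doesn't account for the JVM's off-heap usage, so bumping the overhead setting gives the executor headroom. A rough sketch (values illustrative, not tuned for any workload):

```shell
# Hypothetical invocation: spark.yarn.executor.memoryOverhead (in MB)
# adds off-heap headroom so YARN's container limit isn't tripped.
spark-submit \
  --master yarn-cluster \
  --conf spark.executor.memory=4g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  my_job.jar
```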

It was difficult to debug because YARN makes things pretty opaque when tasks just die - logs weren't copied around the cluster so I had to SSH to the instance that died and try to find something to indicate what had happened, and half the time the error messages were vague.

I also didn't work out how to view the job tracker UI when running under YARN. "Learning Spark" says I need to proxy through the YARN cluster somehow but doesn't give an example... So at the moment I have no visibility into running jobs, only finished ones.
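If it helps anyone: I believe the proxy the book alludes to is YARN's web application proxy. While an app is running, its Spark UI is served through the ResourceManager's web UI (default port 8088):

```shell
# Hostname and application ID below are placeholders, not real values.
#   http://<resourcemanager-host>:8088/proxy/<application-id>/
# or click the "ApplicationMaster" link next to the app in the RM web UI.
```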

In the end, I upped the instance size I was using and everything went well. I think my dataset must have been too large and Spark was spilling to disk, which tripped things up. So at the moment I don't have confidence that I could process huge datasets with Spark on EMR, which is a bit annoying :-(


I would like to know which language they are using with Spark: Python, Java or Scala. I have experience with both Python and Java, and while Python works very well for prototyping new Spark flows, once a flow becomes too complex I fall back to Java because of the generics in RDDs.


I'd really recommend looking at Scala - IMO it combines the best parts of both. The syntax is often just as clear and concise as Python, but you also get full static type safety.


I don't think that is relevant to the post, but from what I know of Netflix they use Java and increasingly Scala. I would guess Scala for Spark; the API is a lot nicer.


I'm curious how certain services can survive Chaos Monkey. Memcached is one example: if you start destroying instances, you're going to stampede your persistent datastore to get the replacement memcached node hot again.


The surviving part doesn't have to be in the server itself. Here's an idea for memcached "surviving":

- set up N available servers

- make clients store to all N servers at the same time when they calculate the value

- query up to M (where M <= N) servers before deciding you have to recalculate

- if you got a response from any server other than the first (i.e. server 2 through M), re-store the value everywhere (just pay attention to preserving timeouts)

And you get a distributed, self-healing, Chaos Monkey-resistant memcached without any support on the server side.
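A minimal sketch of that scheme in Python, using plain dicts as stand-ins for the N memcached connections (the class and method names are hypothetical, not a real client library):

```python
import time

class ReplicatedCache:
    """Toy model of the scheme above: dicts stand in for N memcached servers."""

    def __init__(self, servers, read_quorum):
        self.servers = servers            # N dict-like "connections"
        self.m = read_quorum              # query up to M (M <= N) on reads

    def set(self, key, value, ttl):
        # Store to all N servers with an absolute expiry, so later
        # re-stores preserve the original timeout.
        expires_at = time.time() + ttl
        for server in self.servers:
            server[key] = (value, expires_at)

    def get(self, key):
        missed = []
        for server in self.servers[: self.m]:
            entry = server.get(key)
            if entry is not None and entry[1] > time.time():
                # Self-heal: repopulate every server that missed,
                # keeping the remaining TTL intact.
                for dead in missed:
                    dead[key] = entry
                return entry[0]
            missed.append(server)
        return None                       # all M missed: caller recalculates
```

So if Chaos Monkey replaces a node (an empty dict here), the next read that finds the value on any surviving replica repopulates the fresh one as a side effect.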

Also, if you want to avoid a stampede, you could insert 1s-TTL placeholders that mean "back off, someone else is calculating" into keys you know are popular and may experience contention. Just make sure you use CAS so you don't overwrite real data with the placeholder.
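A sketch of that placeholder idea against a toy in-memory store with CAS semantics (CasStore and its gets/add/cas methods are simplified stand-ins for a real memcached client; here add(), which never clobbers an existing value, plays the "don't overwrite data" role for the placeholder, and CAS guards the final write):

```python
import time

BUSY = object()  # placeholder value: "back off, someone else is calculating"

class CasStore:
    """Toy in-memory stand-in for memcached's gets/add/cas operations."""

    def __init__(self):
        self._data = {}       # key -> (value, expires_at, version_token)
        self._version = 0

    def gets(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= time.time():
            return None, None
        return entry[0], entry[2]         # (value, cas token)

    def add(self, key, value, ttl):
        # Succeeds only if the key is absent or expired.
        if self.gets(key)[0] is not None:
            return False
        self._version += 1
        self._data[key] = (value, time.time() + ttl, self._version)
        return True

    def cas(self, key, value, ttl, token):
        # Succeeds only if nobody wrote the key since we read `token`.
        current = self._data.get(key)
        if current is None or current[2] != token:
            return False
        self._version += 1
        self._data[key] = (value, time.time() + ttl, self._version)
        return True

def get_or_compute(store, key, compute, ttl=60):
    value, _ = store.gets(key)
    if value is BUSY:
        return None                       # contended: back off and retry later
    if value is not None:
        return value
    # Claim the key with a 1s placeholder; add() fails rather than
    # overwrite a real value that raced in ahead of us.
    if not store.add(key, BUSY, ttl=1):
        return None
    _, token = store.gets(key)
    result = compute()
    # CAS the real value in: a no-op if our claim expired and lost a race.
    store.cas(key, result, ttl, token)
    return result
```

A caller that gets None backs off briefly and retries, instead of joining the stampede.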


That's what I assumed. You lose capacity as you increase reliability, as you're going to need to redundantly store that data somewhere in your memcached cluster if you don't want to go back to the persistent layer. Thanks!


The memcache at Netflix is triple-replicated, so if one node goes away, the clients ask one of the other nodes holding the same data in a different data center, and then repopulate the node in their own data center once a replacement has been launched.

(I worked on the system)

There is also another project called Dynomite [0] that puts a gossip/Cassandra-like protocol in front of Redis.

[0] http://techblog.netflix.com/2014/11/introducing-dynomite.htm...


Thanks for the info!


If you can't survive chaos monkey, how are you going to survive real life?


Have a bunch of memcached servers so that losing one isn't a big deal.


Great write-up. Using Spark / GraphX, and so far it's pretty awesome.


Interesting title, given that Netflix has just released pricing for its upcoming New Zealand launch, undercutting the local offering from Spark (formerly Telecom).


About the same time that I introduced Spark in our workplace, Telecom announced their multi-million dollar name change.

I spent the first two weeks clarifying that I was indeed referring to Apache Spark, not the Spark-Formerly-Known-As-Telecom. They'll always be Telecom to me.


I just got a "Netflix can't play this title at this time" error message - are you guys experimenting?



