Content Square is one of the fastest-growing companies
today, deploying many analytics tools through a
critical data pipeline. This infrastructure needs to
remain highly reliable and available, with minimal
downtime.
RESPONSIBILITIES:
- Build and maintain alerting tools, metrics, and methodologies to
reduce potential downtime.
- Ensure production-ready applications fit the expected availability
constraints.
- React to system inefficiencies and resolve issues quickly to ensure
system availability and performance.
- Troubleshoot and track down performance, load, networking, I/O
and memory problems.
- Coordinate engineering and external communications.
REQUIREMENTS:
- 5+ years of experience with Linux system administration.
- Experience monitoring systems with tools like Nagios, Icinga,
Shinken, or OpenTSDB, and writing health checks.
- Interest in learning and managing newer technologies like Spark,
Hadoop, Elasticsearch, Kafka…
- Experience with a classic network stack: CDN, DNS, load balancers,
TCP/IP...
- Good understanding of data durability (think backups, maximum
time to recovery, and, more generally, how to avoid data loss
at all costs).
PREFERRED:
- Experience with system management tools like Puppet or Chef.
- Experience with Scala and/or the JVM.
Ref. CS-backend-2015-08-SRO URL http://www.contentsquare.com/en/jobs/#senior-site-reliabilit...