
I'm the PMT for this project on the EFS team. The "flip the switch" part was indeed one of the harder parts to get right. Happy to share some limited details. The performance improvement builds on a distributed consistent cache, which you enable in multiple steps. First you deploy software that supports the caching protocol across the entire stack, but leave it disabled by configuration. Then you turn it on for the components involved, in the right order. Another thing that was hard to get right was ensuring there are no performance regressions due to the consistency protocol.
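
To make the general shape of that concrete, here is an illustrative sketch only (hypothetical names and components, not our actual implementation): ship the protocol-aware build everywhere first, dark behind a config flag, then flip the flag per component in dependency order.

    # Illustrative sketch: staged enablement of a cache behind a config flag.
    # All names are hypothetical, not the actual EFS components.
    from dataclasses import dataclass

    @dataclass
    class ComponentConfig:
        name: str
        protocol_deployed: bool = False  # step 1: new software is running
        cache_enabled: bool = False      # step 2: flag flipped

    # Order matters: downstream components must speak the consistency
    # protocol before upstream components start relying on it.
    ENABLE_ORDER = ["storage-replica", "extent-leader", "nfs-frontend"]

    def flip_the_switch(fleet):
        # Refuse to enable anything until every component runs the new build.
        if not all(c.protocol_deployed for c in fleet.values()):
            raise RuntimeError("deploy the protocol-aware build everywhere first")
        for component in ENABLE_ORDER:
            fleet[component].cache_enabled = True
            # A real rollout would validate and allow rollback between steps.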

Shameless plug, only one I promise. If you think this is cool, we are hiring for multiple PMT and SDE positions, which can be fully remote at the Sr. level and above. DM me for details, see [1], or see amazon.jobs and search for EFS.

[1] https://www.amazon.jobs/en/jobs/1935130/senior-software-deve...

EDIT: public link



Thank you for calling out on-call responsibilities in your job listing. Too many job listings today fail to mention that _very significant_ responsibility.

I enjoy working with distributed storage systems, but I don't think I will ever carry a pager for one again. I wish the industry could figure out how to separate designing and building such systems, from giving up your nights and weekends to operate them.


Separating design and build from operate is antithetical to Amazon. It isn’t a “figure out” for a lot of companies including Amazon — it’s very intentional and seemingly unlikely to change. They’ve observed that they create a stronger culture of ownership (which then drives getting things fixed faster and more empathy for the customers) through having the builders also be the operators.

Still needs supportive management: there are teams at Amazon that have time to fix everything that paged them at anti-social hours, and there are teams that don't prioritize beyond minding the SLA of their COE Action Items, silently accruing operational debt and paging people more often. Tricky balance, to be sure.

Even the ‘SRE’ or ‘PE’ approaches you see at Google and Meta don't obviate the need for development teams to have on-call rotations. At least in "BigTech", where teams operate services instead of shipping shrink-wrapped software, it's becoming rare NOT to see some on-call responsibility attached to engineering roles (including management). It isn't just on-call, either: the other big change in BigTech over the last decade was the fairly widespread elimination of QA teams and SDET roles, with those responsibilities merged into the feature/service teams and the SDE role.


There are different schools of thought around this, and I certainly understand your perspective. At AWS, carrying a pager for limited periods (in our team, 2-3 weeks per quarter, as mentioned in the link) is considered an important part of our culture of operating at-scale services. In our team we try to minimize the oncall burden as much as possible by investing in automation, and only alarm if the system really doesn't know what to do. We have a separate planning bucket for burden reduction every quarter.

Another interesting thing to mention is that as an SDE you're not the only one with oncall duties. In our team at least, PMTs are also oncall for about the same amount of time. This creates a good dynamic, as everyone is incentivized to minimize the oncall burden.


Being on call aligns incentives. If what you design and build is someone else's problem to operate, it will operate less well.


Isn't that the idea behind separating out the SRE (site reliability engineer) role from software engineering?


Sort of. Many teams in FAANG put their devs on rotations that aren't full on-call like SRE (and some managers put their devs into full SRE rotations without mentioning there is a bonus). I always check with my future managers that they don't plan to do this.


Haha well aware as a current on call SDE at one of them!


I watched the SNIA presentation from SDC2020 on EFS and it described each extent in the file system as a state machine replicated via multi-paxos.

It seems possible to implement this feature via a time-based leader lease on the extent, where a read request goes to a read-through cache. The cache stores some metadata (e.g. a hash or a version number) for the block it is trying to validate and sends that to the leader along with the read request.

If the leader has the same version number for the block as the one the cache sent, the block itself does not need to be transferred and the other replicas don't need to be contacted. If you place the cache in the same AZ as the EC2 instance reading the file system, 600µs sounds viable.
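
Roughly the flow I'm imagining, as a toy sketch (made-up names, not anything from the talk):

    # Toy sketch of the version-check read path; not the actual EFS/extent
    # protocol.
    from dataclasses import dataclass

    @dataclass
    class Block:
        data: bytes
        version: int  # could equally be a content hash

    class ExtentLeader:
        """Holds a time-based lease, so it can answer reads without a quorum."""
        def __init__(self, blocks):
            self.blocks = blocks  # block_id -> Block

        def read(self, block_id, cached_version):
            current = self.blocks[block_id]
            if cached_version == current.version:
                return ("not_modified", current.version, None)  # no data moved
            return ("data", current.version, current.data)

    class AzLocalCache:
        """Read-through cache placed in the same AZ as the client."""
        def __init__(self, leader):
            self.leader = leader
            self.local = {}  # block_id -> Block

        def read(self, block_id):
            cached = self.local.get(block_id)
            status, version, data = self.leader.read(
                block_id, cached.version if cached else None)
            if status == "not_modified":
                return cached.data  # validated without transferring the block
            self.local[block_id] = Block(data, version)
            return data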

Am I on the right track? :)


> Am I on the right track? :)

The caches are local to each AZ, so you get the low latency in each AZ; the other details are different. Unfortunately I can't share additional details at the moment, but we are looking to do a technical update on EFS at some point soon, maybe at a similar venue!


Sounds good! The SNIA presentation was very interesting.


You shared an internal link. Aside from that, great work!

CDO salutes you.


This sounds a lot more interesting than I was expecting. I assumed you just swapped to better network hardware. Will there be more info at some point?


I thought they were already running pretty top-of-the-line stuff across their whole network; it's the only way to scale that big, isn't it?


Are there targets for what percentage of an EFS filesystem's reads can be satisfied by this cache?


NFS workloads are typically metadata heavy and highly correlated in time, so you can achieve very high hit rates. I can't share any specific numbers unfortunately.
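
As a generic illustration of why that temporal correlation helps (a toy example of my own, nothing EFS-specific): even a modest LRU cache gets a very high hit rate when most lookups land on a small hot set of inodes.

    # Toy illustration: correlated, metadata-style lookups against a small
    # hot set give a high hit rate even with a modest LRU cache.
    from collections import OrderedDict
    import random

    def lru_hit_rate(trace, capacity):
        cache, hits = OrderedDict(), 0
        for key in trace:
            if key in cache:
                hits += 1
                cache.move_to_end(key)
            else:
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)
        return hits / len(trace)

    hot = [f"inode-{i}" for i in range(100)]             # hot working set
    cold = [f"inode-{i}" for i in range(100, 100_000)]   # long cold tail
    trace = [random.choice(hot) if random.random() < 0.9 else random.choice(cold)
             for _ in range(50_000)]

    print(f"hit rate with a 1k-entry cache: {lru_hit_rate(trace, 1_000):.1%}")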



