
I'm the PMT for this project on the EFS team. The "flip the switch" part was indeed one of the harder parts to get right. Happy to share some limited details. The performance improvement builds on a distributed consistent cache, which you enable in multiple steps. First you deploy software that supports the caching protocol across the entire stack, but leave it disabled by configuration. Then you turn it on for the components involved, in the right order. Another thing that was hard to get right was ensuring there are no performance regressions due to the consistency protocol.
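
To make the general shape of that concrete, here is an illustrative sketch only (hypothetical names and components, not our actual implementation): ship the protocol-aware build everywhere first, dark behind a config flag, then flip the flag per component in dependency order.

    # Illustrative sketch: staged enablement of a cache behind a config flag.
    # All names are hypothetical, not the actual EFS components.
    from dataclasses import dataclass

    @dataclass
    class ComponentConfig:
        name: str
        protocol_deployed: bool = False  # step 1: new software is running
        cache_enabled: bool = False      # step 2: flag flipped

    # Order matters: downstream components must speak the consistency
    # protocol before upstream components start relying on it.
    ENABLE_ORDER = ["storage-replica", "extent-leader", "nfs-frontend"]

    def flip_the_switch(fleet):
        # Refuse to enable anything until every component runs the new build.
        if not all(c.protocol_deployed for c in fleet.values()):
            raise RuntimeError("deploy the protocol-aware build everywhere first")
        for component in ENABLE_ORDER:
            fleet[component].cache_enabled = True
            # A real rollout would validate and allow rollback between steps.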

Shameless plug, only one I promise. If you think this is cool, we are hiring for multiple PMT and SDE positions, which can be fully remote at the Sr. level and above. DM me for details, see [1], or see amazon.jobs and search for EFS.

[1] https://www.amazon.jobs/en/jobs/1935130/senior-software-deve...

EDIT: public link



Thank you for calling out on-call responsibilities in your job listing. Too many job listings today fail to mention that _very significant_ responsibility.

I enjoy working with distributed storage systems, but I don't think I will ever carry a pager for one again. I wish the industry could figure out how to separate designing and building such systems, from giving up your nights and weekends to operate them.


Separating design and build from operate is antithetical to Amazon. It isn’t a “figure out” for a lot of companies including Amazon — it’s very intentional and seemingly unlikely to change. They’ve observed that they create a stronger culture of ownership (which then drives getting things fixed faster and more empathy for the customers) through having the builders also be the operators.

Still needs supportive management: there are teams at Amazon that have time to fix everything that paged them at anti-social hours, and there are teams that don't prioritize beyond minding the SLA of their COE Action Items, silently accruing operational debt and paging people more often. Tricky balance, to be sure.

Even the ‘SRE’ or ‘PE’ approaches you see at Google and Meta don't obviate the need for development teams to have on-call rotations. At least in "BigTech", where teams operate services instead of shipping shrink-wrapped software, it's becoming rare NOT to see some on-call responsibility attached to engineering roles (including management). It isn't just on-call, either: the other big change in BigTech over the last decade was the fairly widespread elimination of QA teams and SDET roles, with those responsibilities merged into the feature/service teams and the SDE role.


There are different schools of thought around this, and I certainly understand your perspective. At AWS, carrying a pager for limited periods (in our team, 2-3 weeks per quarter, as mentioned in the link) is considered an important part of our culture of operating at-scale services. In our team we try to minimize the oncall burden as much as possible by investing in automation, and only alarm if the system really doesn't know what to do. We have a separate planning bucket for burden reduction every quarter.

Another interesting thing to mention is that as an SDE you're not the only one with oncall duties. In our team at least, PMTs are also oncall for about the same amount of time. This creates a good dynamic, as everyone is incentivized to minimize the oncall burden.


Being on call aligns incentives. If what you design and build is someone else's problem to operate, it will operate less well.


Isn't that the idea behind separating out the SRE (site reliability engineer) role from software engineering?


Sort of. Many teams in FAANG put their devs on rotations that aren't full on-call like SRE (and some managers put their devs into full SRE rotations without mentioning there is a bonus). I always check with my future managers that they don't plan to do this.


Haha well aware as a current on call SDE at one of them!


I watched the SNIA presentation from SDC2020 on EFS and it described each extent in the file system as a state machine replicated via multi-paxos.

It seems possible to implement this feature via a time-based leader lease on the extent, where a read request goes to a read-through cache. The cache stores some metadata (e.g. a hash or a version number) for the block it is trying to validate and sends that to the leader along with the read request.

If the leader has the same version number for the block as the one the cache sent, the block itself does not need to be transferred and the other replicas don't need to be contacted. If you place the cache in the same AZ as the EC2 instance reading the file system, 600µs sounds viable.
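
Roughly the flow I'm imagining, as a toy sketch (made-up names, not anything from the talk):

    # Toy sketch of the version-check read path; not the actual EFS/extent
    # protocol.
    from dataclasses import dataclass

    @dataclass
    class Block:
        data: bytes
        version: int  # could equally be a content hash

    class ExtentLeader:
        """Holds a time-based lease, so it can answer reads without a quorum."""
        def __init__(self, blocks):
            self.blocks = blocks  # block_id -> Block

        def read(self, block_id, cached_version):
            current = self.blocks[block_id]
            if cached_version == current.version:
                return ("not_modified", current.version, None)  # no data moved
            return ("data", current.version, current.data)

    class AzLocalCache:
        """Read-through cache placed in the same AZ as the client."""
        def __init__(self, leader):
            self.leader = leader
            self.local = {}  # block_id -> Block

        def read(self, block_id):
            cached = self.local.get(block_id)
            status, version, data = self.leader.read(
                block_id, cached.version if cached else None)
            if status == "not_modified":
                return cached.data  # validated without transferring the block
            self.local[block_id] = Block(data, version)
            return data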

Am I on the right track? :)


> Am I on the right track? :)

The caches are local to each AZ, so you get the low latency in each AZ; the other details are different. Unfortunately I can't share additional details at the moment, but we are looking to do a technical update on EFS at some point soon, maybe at a similar venue!


Sounds good! The SNIA presentation was very interesting.


You shared an internal link. Aside from that, great work!

CDO salutes you.


This sounds a lot more interesting than I was expecting. I assumed you just swapped to better network hardware. Will there be more info at some point?


I thought they were already running pretty top-of-the-line stuff across their whole network; it's the only way to scale that big, isn't it?


Are there targets for what percentage of an EFS filesystem's reads can be satisfied by this cache?


NFS workloads are typically metadata heavy and highly correlated in time, so you can achieve very high hit rates. I can't share any specific numbers unfortunately.
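
As a generic illustration of why that temporal correlation helps (a toy example of my own, nothing EFS-specific): even a modest LRU cache gets a very high hit rate when most lookups land on a small hot set of inodes.

    # Toy illustration: correlated, metadata-style lookups against a small
    # hot set give a high hit rate even with a modest LRU cache.
    from collections import OrderedDict
    import random

    def lru_hit_rate(trace, capacity):
        cache, hits = OrderedDict(), 0
        for key in trace:
            if key in cache:
                hits += 1
                cache.move_to_end(key)
            else:
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)
        return hits / len(trace)

    hot = [f"inode-{i}" for i in range(100)]             # hot working set
    cold = [f"inode-{i}" for i in range(100, 100_000)]   # long cold tail
    trace = [random.choice(hot) if random.random() < 0.9 else random.choice(cold)
             for _ in range(50_000)]

    print(f"hit rate with a 1k-entry cache: {lru_hit_rate(trace, 1_000):.1%}")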



