
Does anyone run these "at home" with small clusters? I've been googling unsuccessfully and this thread doesn't refer to anything.

So a non-quantized Scout won't fit in a machine with 128GB of RAM (like a Framework desktop or an M4 Mac Studio). Maverick maybe needs a 512GB M3 Ultra Mac Studio. Is it possible (and if so, what are the tradeoffs of) running one instance of Scout across three 128GB Frameworks?


Half on topic: what libs/etc did you use for the animations? Not immediately obvious from the source page.

(It's a topic I'm deeply familiar with, so I don't have a comment on the content; it looks great on a skim!) I've been sketching animations for my own blog and haven't liked the last few libs I tried.

Thanks!


I heavily, heavily abused d3.js to build these.


Small FYI that I couldn't see them in Chrome 133.0.6943.142 on MacOS. Firefox works.


It's the complete opposite for me — there are no animations in Firefox even with uBlock Origin disabled, but Brave shows them fine.

The browser console spams this link: https://react.dev/errors/418?invariant=418

edit: looks like it's caused by a userstyles extension injecting a dark theme into the page; React doesn't like it and the page silently breaks.


Ohhh interesting! Obviously not ideal, but I guess just an extension issue?


Interesting. Running any chrome extensions that might be messing with things? Alternatively, if you can share any errors you're getting in the console lmk.


Oh, looks like it. I disabled extensions one by one til I found it was reflect.app's extension. Edit: reported on their discord.

False alarm :) Amazing work!!


Throwing out a clarification: EVCache is effectively a complex memcached client plus an internal ecosystem at Netflix. You can get much of its benefit with other systems (such as memcached's built-in proxy: https://docs.memcached.org/features/proxy/).

For plugging into other apps you may only need a small slice of EVCache: just the local-then-far fetch, copying sets to multiple zones, etc. A greenfield client with the same backing store would be trivial to do; see the sketch below.
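For a flavor of what that greenfield client could look like, here's a toy sketch in Python, speaking the plain memcached text protocol over raw sockets. The class name and addresses are hypothetical, and real code would want connection pooling, timeouts, and error handling:

    import socket

    def _line(s):
        # read one \r\n-terminated protocol line
        buf = b""
        while not buf.endswith(b"\r\n"):
            buf += s.recv(1)
        return buf[:-2]

    def _get(s, key):
        # text protocol: "get <key>" -> "VALUE <key> <flags> <bytes>" or "END"
        s.sendall(b"get " + key + b"\r\n")
        hdr = _line(s)
        if hdr == b"END":
            return None
        size = int(hdr.split()[3])
        data = b""
        while len(data) < size + 2:
            data += s.recv(size + 2 - len(data))
        _line(s)  # trailing "END"
        return data[:-2]

    class ZonedCache:
        def __init__(self, local_addr, far_addrs):
            # one connection per zone; the local zone comes first
            self.local = socket.create_connection(local_addr)
            self.far = [socket.create_connection(a) for a in far_addrs]

        def get(self, key):
            # local-then-far: try the same-zone pool, then the others
            hit = _get(self.local, key)
            if hit is None:
                for s in self.far:
                    hit = _get(s, key)
                    if hit is not None:
                        break
            return hit

        def set(self, key, value, ttl=0):
            # copy sets to every zone so reads stay zone-local
            for s in (self.local, *self.far):
                s.sendall(b"set %b 0 %d %d\r\n%b\r\n" % (key, ttl, len(value), value))
                _line(s)  # expect "STORED"

Usage would be something like ZonedCache(("10.0.1.5", 11211), [("10.0.2.5", 11211)]), with one address per availability zone (addresses made up here).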

That all said, I wouldn't advise people copy their method of expanding cache clusters: with consistent hashing it's possible to add or remove one instance at a time without rebuilding and re-warming the whole thing, as the sketch below illustrates.
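To illustrate, a minimal ketama-style consistent hash ring in Python (virtual-node count and hash choice are arbitrary): adding an eleventh node to a ten-node ring remaps roughly 1/11th of the keyspace, leaving the rest of the warm cache untouched.

    import hashlib
    from bisect import bisect

    class Ring:
        def __init__(self, nodes, vnodes=100):
            # each node gets `vnodes` points on the ring
            self.points = sorted(
                (self._hash("%s-%d" % (n, i)), n)
                for n in nodes for i in range(vnodes))

        @staticmethod
        def _hash(s):
            return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

        def node_for(self, key):
            # first ring point clockwise from the key's hash
            i = bisect(self.points, (self._hash(key),))
            return self.points[i % len(self.points)][1]

    old = Ring(["mc%d" % i for i in range(10)])
    new = Ring(["mc%d" % i for i in range(11)])  # add a single node
    keys = ["user:%d" % i for i in range(100000)]
    moved = sum(old.node_for(k) != new.node_for(k) for k in keys)
    print(moved / len(keys))  # ~0.09: only ~1/11th of keys move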


:'(


<3

Thanks for your continuing work on memcached! I'd be very curious how garnet's benchmarks compare with memcached.


<3 and Thank You :)


At some point the OBSD CVS server was something I donated. Don't think it's in this picture as I don't think it was a Dell. Never saw a picture of the thing in action actually, doh.

I avoided spending a quarter mill on F5s by deploying OpenBSD as firewalls/L4 routers, and I was able to keep some of the budget... For a year there I would e-mail them every 6 months and ask what they wanted. Sad when I changed jobs and had to stop.


I just got one a few weeks back but haven't gotten to spend a ton of time with it yet. It's taking some adjustment but I'm liking it so far.

- Had 1440p + 1080p monitors on stands side by side before. Now it's just this one on an arm (which is excellent), which I can adjust to keep my position from being static.

- not having to hold my neck angled while reading my side monitor is helpful.

- Realistically there are a few "modes" of working on here. While coding it's pulled a bit closer; while in CAD or similar creative work I might push it back a bit and get more of the monitor in view.

- I recline slightly so the monitor is tilted a bit which gives me a solid view of the bottom 60-70% of the monitor. The top is a bit out of range at close distance.

- For coding so far I have the middle-ish of the monitor as a 1440p code-only view. Below that are a few windows for manpages/reference/interactive debugging/repl/etc. On the top end, which is normally slightly out of view, I have compilation and long-running test output that I glance at by moving my eyes.

I like not having to page between desktops while coding when possible. The bottom view is also large enough to hold a browser window or simulator window. Need to also try pushing it back a bit with slightly larger text and see if that's any better.

I don't intend to game on it, maybe windowed mode in the middle or something.

edit: well, also it has this mode where you can split it into two 1440p monitors on different inputs (which you can hook up to the same computer), so depending on the game I might do that as well.


I have a similar experience with coding! I bought this monitor precisely for that. It's nice to keep some bottom IDE panel open (test results, find results, git log, etc) while keeping the rest of my editor at a normal vertical height.

Similarly, I can keep a browser open at a normal (or even extended!) height, plus keep the developer console open at the bottom. It's made web development more pleasant, just the feeling of not being so cramped vertically.


The client ecosystem is definitely a sore point now. I've just sort of started working on a replacement for libmemcached, to hopefully cut down on the complexity... but then that's a migration, and nobody wants to do that.

Pinterest should drop me a line if they're interested in sponsoring work though :)


Thanks for memcached! I'm surprised pylibmc even works at this point; its last release was in 2019. I do hope Pinterest sponsors.


Now a more philosoraptor-style comment: I see Mcrib is a service built to quickly detect and replace memcached instances. I treat memcached in infrastructure as a very stable service: it rarely needs upgrading and will generally not fail on its own. When it does, failures are highly infrequent compared to services with higher churn or more complexity/dependencies. This means if they're failing often enough that you need to rapidly detect and replace them, you have a more fundamental problem.

From a structural standpoint I think my technical comment can be useful. If things really are failing this much: A) you should figure out why and slow that down; B) if you have a generally stable system and understand the typical rate of failure, you can add tripwires into Mcrib to avoid over-culling services and to loudly raise alarms (a sketch of the idea below); and C) you can improve technical reliability with redundancy/extstore/etc.
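A sketch of what such a tripwire might look like; the thresholds and the page_human hook are made up:

    import time
    from collections import deque

    def page_human(msg):
        # hypothetical alerting hook; wire this to your pager of choice
        print("ALERT:", msg)

    class CullTripwire:
        # allow at most `limit` automated evictions per `window` seconds;
        # beyond that, stop culling and raise a loud alarm instead
        def __init__(self, limit=2, window=3600):
            self.limit, self.window = limit, window
            self.culls = deque()

        def may_cull(self, node):
            now = time.monotonic()
            while self.culls and now - self.culls[0] > self.window:
                self.culls.popleft()
            if len(self.culls) >= self.limit:
                page_human("cull rate exceeded; refusing to evict %s" % node)
                return False
            self.culls.append(now)
            return True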

I've also seen plenty of cases where folks let a dependency of a service determine whether that service is usable, which I disagree with quite strongly. Consul being down on a node should trigger something to check whether the service is actually dead, not be treated as proof that it is. This matters both for reliability (don't kill perfectly working things, or you end up having to design around it) and for maintainability, as you've now made people afraid of upgrading Consul or other co-dependent services. A similar failure is single-point-of-testing availability checking, where you probably want two points of truth before shooting a service.

Now you risk people being afraid of upgrading practically anything, which means they will work around it, abstract it, or needlessly replace it with something they feel safer managing. The latter is at best a waste of time, at worst a time bomb until you find out what conditions the new thing breaks under.

This isn't advocating that you design without assuming anything can fail anywhere at any time; just pointing out that how often a service _should_ fail is extremely useful information when designing systems, fail-safes, alerts, monitoring, etc.


"I treat memcached in infrastructure as a very stable service."

I run memcached at a large scale. You are totally right. Every other year we will find ONE bad memcached node down. We use nutcracker instead of mcrouter for consistent hashing to each memcached node. Once I read "We also run a control plane for the cache tier, called Mcrib. Mcrib's role is to generate up-to-date Mcrouter configurations" I was like, oooooh boy, here we go....

Knowing memcached is a rock comes with experience, though.


Our underlying hardware (AWS) is nothing like this reliable. We see regular (several times a year) failure of racks of machines or whole DCs.

Across the whole fleet (all services), we lose 1-10 servers per day as a baseline. Major events are then on top of that and can impact thousands of hosts at once.


What service is this?? This must be huge.


> I run memcached at a large scale

I don't believe you run it at the scale Slack does.

The people at Slack who decided to use Mcrouter (and created Mcrib) have experience running Memcached, Mcrouter and Nutcracker in production at two of the biggest web properties in the world.

Trust that they know whereof they speak.


You may not be wrong, in fact you are very likely right, but this is not an argument.

The larger an org gets, the more likely it is to do weird things to mitigate organizational difficulties, be they budget, human, or otherwise.

Those types of things rarely show up in postmortems for obvious reasons.


"I don't believe you run it at the scale Slack does."

Definitely not. We host about 80% of elementary schools in the US. Not Slack scale, but we definitely face many of the same issues :/


I think you nailed the real issue that caused the incident: treating "Consul down == unhealthy memcached", then evicting the node. If Mcrib instead did an actual applicative healthcheck (e.g. a memcached ping, like the sketch below), correlated with some system metrics (CPU, RAM), it could avoid evicting perfectly good nodes with a warm cache that just happen to have a restarting Consul agent.
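An applicative check can be as small as asking memcached itself; a sketch in Python (any healthy memcached answers the `version` command, and you'd still want several consecutive failures, or a second observer, before evicting):

    import socket

    def memcached_alive(host, port=11211, timeout=1.0):
        # ask the service itself instead of inferring death from a sidecar
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.sendall(b"version\r\n")
                return s.recv(64).startswith(b"VERSION ")
        except OSError:
            return False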

Granted, this is easy to say after the incident, with an excellent postmortem in hand, but this should be an industry-wide wakeup call: don't do this.

I have the same issue at work, where people treat a "prometheus node_exporter down" alert as "the app on the machine is down". I've started adding the actual app name to our alerts, and now people don't freak out anymore when they see "down" alerts: oh, node_exporter is down but not the app? Don't panic, calmly check why.


It’s likely that the memcached install is so large that the underlying instances themselves are failing. When you have hundreds or thousands of instances, failures in the instances themselves become pretty regular.


I don't see this. I have thousands of long-lived instances (full VMs, not containers) running on our hardware.

If they start "going bad", something is wrong. That's a signal I wouldn't want to ignore.

It has happened: once an HBA in a storage node was causing occasional corruption; another time, due to a communication failure, people were building things with the wrong version of something that had a memory leak and would eventually summon the OOM killer. There have been other issues.

"Have you tried turning it off and back on again" is still a terrible system management strategy.


Failure rates in AWS are probably higher than what you're seeing in your own hardware.


Maybe. If you don't look, you don't know.

But given the number of people I've heard using "we're on AWS, out of my control" as an excuse, this appears to be an unofficial service they offer.


I can say with certainty this isn't strictly true. The failures should be relatively rare; when I say relatively, I mean on the level of natural node failure. If natural node failure isn't survivable without special systems to quickly replace downed nodes, you don't actually have an N+1 redundancy system. Thus, the pools aren't large enough :) Or, in this case, if they really are failing this much, then having them always lose their cache is a major reliability hole.

It's a subtle difference. I think many operators get used to node failures being extremely common when they don't necessarily have to be. I suspect the note on "if they come back on their own, ensure they're flushed" means they have something unusual causing ephemeral failures. If that's just "cloud networking" there isn't much they can do, but it's almost always fixable.


> The failures should be relatively rare; when I say relatively I mean on the level of natural node failure.

And exactly how rare do you believe this to be?

In my experience, node failures at the scale of hundreds to thousands of nodes are monthly to weekly, if not daily. Generally speaking, stability is a normal distribution: young, new instances experience failure rates similar to old instances. If you have any sort of maximum node lifetime (for example, a week) or scale dynamically on a daily basis, then you'll see a lot of failures.


Which still means you could implement a hard limit of one failure per hour and only allow more replacements with manual intervention. With a thousand nodes, several (or hundreds) failing within a few hours is so unlikely that you're probably better off preventing automatic failover in those cases.

But that generally mirrors my experience that automatic failover for stable software tends to cause more issues than it solves. A good (i.e. redundant hardware and software) PostgreSQL server is also so unlikely to fail that wrong detection and cascading issues from automatic failover are more likely than its actual benefits.


I think you're looking at it the wrong way. A server is never just Postgres or memcached; there's always other stuff running, and it's that other stuff that causes problems. Like maybe you're patching the fleet and a node fails to come back up, or a misconfiguration fills the disk.

I'd argue that stable systems are actually worse for operational stability: you become complacent and comfortable, and when shit hits the fan you are unprepared.


More likely, they are using "spot instances" for memcached, which will cause them to be evicted fairly frequently.


Or horizontal autoscaling based on demand.


Hi! I'd like to offer some hopefully useful information if any Slack folks end up reading this, or anyone else with a similar infrastructure. I'll start with some tech and make a separate philosophical comment.

Also caveat: I have no deep view into Slack's infrastructure so anything I say here may not even be relevant. YMMV.

First, some self-promotion: memcached itself now ships router/proxy software (https://github.com/memcached/memcached/wiki/Proxy). Mcrouter is difficult to manage and unsupported. This proxy is community developed, more flexible, likely faster, and will support more native features of memcached. We're currently in a stabilization round ensuring it won't eat pets, but all of the basic features have been in for a while. Documentation and example libraries are still needed, and community feedback (or any kind of question/help request) helps speed those up tremendously.

It's not clear to me why memcached is being managed like this; mcrouter seems to be used only to abstract the configuration from the clients. It has a lot of features for redundant pools and so on. Especially with what sounds like globally immutable data and the threat of cascading failures during rolling upgrades, it sounds like it would be very helpful here.

If cost or pool sizes are the main reasons the structure is flat, using Extstore (https://github.com/memcached/memcached/wiki/Extstore) can likely help. Even if object values are in the realm of 500 bytes, flash storage can still greatly reduce the amount of RAM necessary or shrink the pool size (granted the network can still keep up) with nearly identical performance. Extstore makes a lot of tradeoffs (i.e. keeping keys in RAM) to ensure most operations don't actually write to flash or double-read. Extstore's in use in tons of places and everyone's immediately addicted.

Finally, the Meta Protocol (https://github.com/memcached/memcached/wiki/MetaCommands) can help with stampeding herds, keeping DB load from exploding without adding excess network roundtrips under normal conditions. I've seen lots of workarounds people build, but this protocol extension gives you a lot of flexibility to survive degraded states: anti-stampeding-herd, serve-stale, better counter semantics, and so on. A rough sketch of the anti-stampede flow is below.
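Roughly, the flow looks like this. This is a sketch over raw sockets based on my reading of the MetaCommands wiki; exact flag semantics may differ, and a real client should also handle the "someone else already won" (Z flag) case by retrying or serving stale:

    import socket

    def _line(s):
        buf = b""
        while not buf.endswith(b"\r\n"):
            buf += s.recv(1)
        return buf[:-2]

    def get_or_recache(s, key, recompute, ttl=30):
        # meta-get with N<ttl>: on a miss memcached vivifies a stub and hands
        # exactly one client the W ("won") flag, so only that client recomputes
        s.sendall(b"mg %b v t N%d\r\n" % (key, ttl))
        parts = _line(s).split()  # e.g. [b"VA", b"5", b"t30"] or [b"EN"]
        value = None
        if parts[0] == b"VA":
            size = int(parts[1])
            data = b""
            while len(data) < size + 2:
                data += s.recv(size + 2 - len(data))
            value = data[:-2]
        if b"W" in parts or parts[0] == b"EN":
            value = recompute()  # only the winner hits the database
            s.sendall(b"ms %b %d T%d\r\n%b\r\n" % (key, len(value), ttl, value))
            _line(s)  # expect "HD" (stored)
        return value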


License?

