Original author here.
Thanks for your comment. I don't think you meant any of it in a bad or snarky way; I like reading critical feedback. That's how you learn.
But I guess you are mixing a few things up.
So yes, this architecture has grown over time, and the whole story didn't happen yesterday. It started on September 3, 2010, more than six years ago. The first _real_ problems appeared in 2013, about four years ago.
With our current knowledge, I agree that those three use cases should not have been put into a single Redis instance.
> I think the post could have been more useful if it would have conceded the point earlier that they made bad decisions from the get-go, that they had to play catch up with those decisions, and that the advice was for attempting to catch up to these decisions.
This post is written as a kind of story. Of course it would have been possible to condense it down to just the learnings.
> It would be very interesting to understand why they made the choices they did.
Feel free to ask any question you like. I am here and happy to answer.
> Why are they running memcached and Redis at the same time?
Because they are two different systems with two different concepts for different use cases.
For pure caching, I agree: both can cache data.
But for use cases where you need master-slave replication (possibly across data centers), memcached may not fit.
IMO it is hard to say both systems are the same.
E.g. if you cache data whose entries vary a lot in size (within the same data concept), one memcached instance / consistent hashing ring might be the wrong solution, because of memcached's slab concept.
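To make the slab point concrete: memcached hands out memory in fixed-size chunks grouped into slab classes (growing by a factor, 1.25 by default), and an entry always occupies the smallest chunk it fits into. A rough back-of-the-envelope sketch (simplified; it ignores item header overhead, so real chunk sizes differ slightly):

```python
# Simplified illustration of memcached's slab allocation. Numbers are
# approximate; this is not memcached's actual implementation.
GROWTH_FACTOR = 1.25  # memcached's default -f growth factor
MIN_CHUNK = 96        # approximate smallest chunk size in bytes

def chunk_size_for(item_size: int) -> int:
    # Find the smallest slab chunk the item fits into.
    chunk = MIN_CHUNK
    while chunk < item_size:
        chunk = int(chunk * GROWTH_FACTOR)
    return chunk

for size in (100, 1_000, 10_000, 500_000):
    chunk = chunk_size_for(size)
    wasted = 100.0 * (1.0 - size / chunk)
    print(f"{size:>7} B item -> {chunk:>7} B chunk ({wasted:.0f}% wasted)")
```

On top of the per-chunk waste, memory that was once assigned to one slab class is not easily reused by another, so a workload with wildly varying entry sizes can see evictions in one class while other classes still have free memory.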
> Why wasn't a regular database suitable for data persistence or temporary data?
I can't answer this question in general, but I'll give one concrete use case.
We had a really read-heavy MySQL table on InnoDB with ~300,000,000 rows. The indexes were set up for the read patterns; everything was fine.
But a normal insert into this table took some time, and we wanted to avoid spending that time during a user request, because it would slow the request down.
One option the dev team considered was to write the data to Redis and have a cron job read it back out of Redis and store it in the table.
With this, the insert time was moved away from the user and back to the system.
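Roughly, the pattern looks like this (a minimal sketch, not our actual code; it assumes the redis-py and mysql-connector-python packages, and the queue key, table, and column names are purely illustrative):

```python
# Write-buffer pattern: the request handler does a cheap RPUSH instead of a
# slow INSERT; a cron job later drains the list and batch-inserts into MySQL.
import json

import mysql.connector
import redis

r = redis.Redis(host="localhost", port=6379)
QUEUE_KEY = "pending_inserts"  # hypothetical list key

def record_event(payload: dict) -> None:
    # Called during the user request: O(1) push instead of a slow insert.
    r.rpush(QUEUE_KEY, json.dumps(payload))

def flush_to_mysql(batch_size: int = 1000) -> None:
    # Called by the cron job: drain the queue in batches and insert the rows,
    # moving the insert cost out of the request path.
    db = mysql.connector.connect(user="app", password="secret", database="app")
    cursor = db.cursor()
    while True:
        batch = []
        for _ in range(batch_size):
            raw = r.lpop(QUEUE_KEY)
            if raw is None:
                break
            batch.append(json.loads(raw))
        if not batch:
            break
        cursor.executemany(
            "INSERT INTO events (user_id, created_at)"
            " VALUES (%(user_id)s, %(created_at)s)",
            batch,
        )
        db.commit()
    cursor.close()
    db.close()
```

The trade-off is that the data is only eventually consistent in MySQL, which was acceptable for this use case.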
> Why wasn't the cache for search installed on the application server to remove network latency altogether?
There is a cache on the application server, but we have several cache layers in our architecture. This was just one of them.
> Why weren't they running php-fpm from the get-go?
We are working on a switch from the typical prefork model to php-fpm.
Such a task sounds easy, but in a bigger environment it can become quite a challenge.
> I made the comment because doing some research or having some understanding from the beginning, might have avoided the post altogether.
Agreed. The challenge here was that the people who had introduced Redis back then had left the company. So in short: we had the problem and had to fix it. With our current knowledge, we would fix this in a different way and maybe choose different approaches. But yeah, I assume this is a normal learning process.

Anyhow, thanks for your feedback.
Thanks for the awesome writeup. Since you opened it up for questions:
From the start of your cutover (when you started seeing the 500s) to the resolution of the various issues along the way (deciding to swap client libraries, A/B testing, the Redis upgrade, shifting load to dedicated instances, implementing proxies/memcached), how much time elapsed?
And what was the end-user impact? (e.g., 50% of users would see timeouts during peak usage for the day, or users of certain functionality would be affected 1% of the time, etc.)
Just trying to get a sense of the level of urgency involved in chasing down all these leads. It seemed pretty methodical, so it's hard to tell whether it was a slow-burning, persistent, nagging issue that you chipped away at over a couple of months, or an all-hands-on-deck sequential process of trying a lot of different things in a relatively short period of time to keep the site afloat.