Lots of pets here, not many cattle. When you get to this scale, tools like Kubernetes make more sense: you can think about your system simply in terms of "how much CPU/RAM/storage do I need in total?" Unless every one of those servers is running at or near capacity, a lot of cash is being wasted, and there is a maintenance cost on top. If one of these vital boxes goes down, what are the downtime implications and the restoration cost?
I am not saying any of this to be critical of Lichess. There are different ways to solve these problems, and their way is clearly working. This also happens slowly over many years, so it is hard/impossible to see the end state until you are there. The app is very quick and responsive. I got my ass handed to me on my first anon game :) My feedback is more for the community here in the context of using this as a byte-sized case study.
At the end of the day we are reading and writing 1's and 0's to a network device, or a disk. Have to imagine you can run and persist chess games with far fewer resources.
> When you get to this scale, tools like Kubernetes make more sense.
If you look at the details[0], servers are really "just" 6k$ per month. A lot of that goes towards the databases, which you can't optimize much with k8s, so the optimization potential overall is already rather limited. If you then need a consultant or a developer to do the migration, you'll quickly outspend any potential savings.
I'm a fan of k8s, but this is really a showcase for hardware being (comparatively) cheap and the scale that you can go to with conventional hardware.
Why is kube and conventional hardware mutually exclusive?
It’s just a different way to utilize resources. I don't see an issue with the costs - I see an issue with this massive list of inventory that needs to be carefully managed. Managing 30 boxes is not easy when they are all funny shapes and sizes. Are they all running the same OS? Same version? How do you roll out security updates? With kube, you just bring a new node online and then kill the old one. Done.
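To sketch what that rotation looks like in practice (node names are placeholders, and this assumes a kubeadm-style cluster where joining a fresh node is already automated):

```shell
# Bring the replacement node online first (e.g. via `kubeadm join`),
# then retire the old one:
kubectl cordon old-node-01        # stop scheduling new pods onto it
kubectl drain old-node-01 --ignore-daemonsets --delete-emptydir-data
                                  # evict running pods onto other nodes
kubectl delete node old-node-01   # remove it from the cluster
```

The workloads reschedule themselves; the OS on the old box never gets patched, it just gets replaced.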
My argument still remains - what happens when a vital box dies? How quickly can a new one be brought online? Is that automated? What about in the meantime, is service degraded or completely down? What about when the ES boxes shit the bed and you realize you actually needed a cluster of 5 nodes? Do you buy 5 new boxes that are carefully sized, or do you just scale your mega cluster from 4 to 5 and add a few new pods?
Kube is not nearly as big and scary as this community makes it out to be.
> Why is kube and conventional hardware mutually exclusive?
Conventional setup might've been a better choice of words. That being said, in my experience, k8s in smaller setups tends to use more small machines, rather than (very) few big ones.
> It’s just a different way to utilize resources. I don't see an issue with the costs - [...]
Assuming the developer knows k8s and the app doesn't need much integration work [0]. If either of those isn't met, you'll need to invest the time to rethink your app and infrastructure. And you really don't want to build such a setup with k8s knowledge gathered from a few Medium posts, so you do need to hire an expert, plus plan some extra hardware for your control plane - there's definitely a money issue.
> [...] I see an issue with this massive list of inventory that needs to be carefully managed.
> Managing 30 boxes is not easy when they are all funny shapes and sizes. Are they all running the same OS? Same version? How do you roll out security updates? With kube, you just bring a new node online and then kill the old one. Done.
Going by the price, those are most likely dedicated boxes. So it's not like spinning down an EC2 machine and starting a new one (and not going dedicated with these hw requirements would increase costs quite a bit); instead, you'd need to reimage. So, basically, you still have to manage those boxes - on top of now also updating your containers.
Or you simply run `apt install unattended-upgrades` and move on with your life. Managing 30 hosts is really not that big of an issue nowadays.
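For reference, the Debian/Ubuntu knobs behind that one-liner look roughly like this (stock package and file paths; a sketch, not a hardened config):

```shell
sudo apt install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades  # writes the file below

# /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
```

Push that out with whatever config management you already have (even a for-loop over ssh) and security updates apply themselves.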
> My argument still remains - what happens when a vital box dies? How quickly can a new one be brought online? Is that automated?
For the frontend, you could only really solve that by running a second box of the same size. And if you did that, you could automate failover quite easily with something like a heartbeat - a failover can be handled without k8s. And do you really need all that uptime? This is a non-profit, not a startup; being down for an hour does not end your business.
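A minimal sketch of such a heartbeat failover with keepalived (VRRP), assuming two identical frontend boxes sharing a floating IP; the interface name and addresses are placeholders:

```
# /etc/keepalived/keepalived.conf on the primary
vrrp_instance frontend {
    state MASTER            # the standby box uses state BACKUP
    interface eth0
    virtual_router_id 51
    priority 150            # standby gets a lower priority, e.g. 100
    advert_int 1            # heartbeat interval in seconds
    virtual_ipaddress {
        192.0.2.10/24       # floating IP that moves on failover
    }
}
```

If the primary stops advertising heartbeats, the standby claims the IP within a few seconds - no orchestrator involved.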
> What about when the ES boxes shit the bed and you realize you actually needed a cluster of 5 nodes? Do you buy 5 new boxes that are carefully sized, or do you just scale your mega cluster from 4 to 5 and add a few new pods?
I give you the point, for ES it would really be useful. But horizontally scaling a cluster is really not that hard.
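For illustration, adding a fifth Elasticsearch node is mostly a matter of pointing it at the existing cluster (hostnames and the cluster name here are hypothetical):

```yaml
# elasticsearch.yml on the new node
cluster.name: lichess-search
node.name: es-node-5
discovery.seed_hosts: ["es-node-1", "es-node-2", "es-node-3", "es-node-4"]
```

Once the node joins, ES rebalances shards onto it on its own.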
> Kube is not nearly as big and scary as this community makes it out to be.
No, it really isn't. But it does bring in a lot of complexity - how do you handle ingress? Storage? Audits? How do you scale? Does your app run in pods, can it easily fail-over? How do you handle shared state? You also need some additional hardware to handle controlling etc. and, as mentioned above, the knowledge to configure and run a cluster for such a high-profile site.
That being said, I agree with you that, seen from a clean slate, using k8s (or maybe just k3s) here would not be a wrong decision. For all we know, they might actually be using it for ES or some other parts. But we shouldn't disregard that k8s also brings in a lot of complexity, configuration overhead, costs and, next to that, a whole paradigm shift - your storage goes from permanent to ephemeral unless explicitly marked otherwise, and accessing your app works completely differently, just to name a few things. Learning k8s takes time and, while it is definitely worth it (IMO), you can still manage an infrastructure of that size with simpler tools without any problems whatsoever. It's pretty safe to say that the overhead in both learning and operations is simply not worth it and would cost much more than any savings k8s brings - which is exactly what my original response pointed out.
[0] The app might depend on specific OS features, use a lot of storage (in which case you will also need to setup a properly HA storage provider) or have some significant state, which would need to be synced to a new instance.
> When you get to this scale, tools like Kubernetes make more sense
They are nowhere near the scale where Kubernetes makes sense as anything but a tool for making deployments easier. Ignoring the deployment part, I think many are pretty clueless about how big you truly need to be to benefit from Kubernetes and how much extra overhead you'll have in terms of managing clusters.
To be fair, if your application was designed with Kubernetes in mind, the scale where you can make use of it starts a lot smaller. When paying ~6k$ in servers, you're way past the point where adding it would noticeably impact your bill.
As I outlined in my other response, it does not make sense to make the switch in this case (IMO). But if I were to design an application at this scale and the team would have sufficient knowledge in Kubernetes, I'd definitely consider deploying to a cluster - the overhead could be quickly countered by development speed and fail safety, especially when choosing something simpler like k3s.
But I agree that it's definitely not needed here, that scale is a few orders of magnitude away.
IME, Kubernetes helps solve scaling problems. If you aren’t there yet, it’s unnecessarily complex for you. It requires at least 2-3 dedicated ops members to run.