
A bit disappointed by the architecture -- it's a Django stack with MySQL, Redis, RabbitMQ, and Celery -- for what is effectively AlertManager (a single golang binary) with a nicer web frontend + Grafana integration + etc.

I'm curious why/if this architecture was chosen. I get that it started as a standalone product (Amixr), but in the current state it is hard to rationalize deploying this next to Grafana in my current containerless setting.




I agree that a multi-component architecture is harder to deploy. We did our best to prepare tooling that makes deployment easy.

There's a Helm chart (https://github.com/grafana/oncall/tree/dev/helm/oncall) and docker-compose files for hobby and dev environments.

Besides deployment, there are two main priorities for the OnCall architecture: 1) it should be as "default" as possible, with no fancy tech and no hacking around, and 2) it should deliver notifications no matter what.

We chose the most "boring" (no offense Django community, that's a great quality for a framework) stack we know well: Django, Rabbit, Celery, MySQL, Redis. It's mature, reliable, and allows us to build a message bus-based pipeline with reliable and predictable migrations.

It's important for such a tool to be built on a message bus because it should have no single point of failure. If a worker dies, another will pick up the task and deliver the alert. If Slack goes down, you won't lose your data; it will continue delivering to other destinations and will deliver to Slack once it's back up.
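To make the pattern concrete, here's a minimal Celery sketch of that retry-until-delivered behavior. This is not OnCall's actual code; the task, helper, and exception names are made up.

  from celery import Celery

  app = Celery("notifications", broker="amqp://localhost")  # assumed broker URL

  class SlackDown(Exception):
      """Stand-in for a Slack API outage."""

  def send_slack_message(alert_id):
      """Hypothetical delivery helper; raises SlackDown when Slack is unreachable."""

  @app.task(bind=True, acks_late=True, default_retry_delay=60, max_retries=None)
  def deliver_to_slack(self, alert_id):
      # acks_late: the broker keeps the message until the task finishes,
      # so if this worker dies mid-delivery, another worker picks it up.
      try:
          send_slack_message(alert_id)
      except SlackDown as exc:
          # Keep retrying until Slack is back; other destinations run as
          # separate tasks, so a Slack outage doesn't block them.
          raise self.retry(exc=exc)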

The architecture you see in the repo has been live for 3+ years now. We've performed a few hundred data migrations without downtime and had no major outages or data loss. So I'm pretty happy with this choice.


I think your decisions were reasonable, and so is the opinion of the person you're responding to.

To be fair, even in its current form, it should be possible to operate this system with SQLite (i.e. no DB server) and in-process Celery workers (i.e. no RabbitMQ) if configured correctly, assuming they're not using MySQL-specific features in the app.
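For illustration, a hypothetical Django settings override along those lines. It is not a configuration OnCall documents as supported, but the setting names are the standard Django and Celery ones (assuming the usual config_from_object(..., namespace="CELERY") setup):

  # SQLite instead of a MySQL server
  DATABASES = {
      "default": {
          "ENGINE": "django.db.backends.sqlite3",
          "NAME": "/var/lib/oncall/db.sqlite3",  # hypothetical path
      }
  }

  # Run Celery tasks synchronously in-process, so no RabbitMQ/Redis broker is needed
  CELERY_TASK_ALWAYS_EAGER = True
  CELERY_TASK_EAGER_PROPAGATES = True  # surface task exceptions immediately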

Using a message bus, a persistent data store behind a SQL interface, and a caching layer are all good design choices. I think the OP's concern is less with your particular implementations, and more with the principle of preventing operators from bringing their own preferred implementation of those interfaces to the table.

They mentioned that it makes sense because you were a standalone product, so stack portability was less of a concern. But as FOSS, you're opening yourself up to different standards on portability.

It requires some work from the maintainer to make the application tolerant of different fulfillments of the same interfaces. But it's good work. It usually results in cleaner separation of concerns between application logic and caching/message bus/persistence logic, for one. It also allows your app to serve a wider audience: for example, those who are locked in to using Postgres/Kafka/Memcached.


Nothing wrong with that. I managed 7+ Sensu "clusters" at a previous job, and its stack was a Ruby server, Redis and RabbitMQ. But I completely ditched RabbitMQ and used Redis for the queue and the data. Simpler, more performant and more reliable (even if the feature was marked experimental). Our alerts were really spammy, and we had ~8k servers (each running a bunch of containers) per cluster, so these things were busy. Each cluster was 3x small nodes (6 GB memory, 2 CPUs). Memory usage was minuscule, typically <300 MB. Any box could be restarted without any impact because Redis just operated in failover mode and Sensu was horizontally scalable.

I get why you would add a relational DB to the mix. Personally, I'd like a Rabbit-free option.


Your message bus assumption sounds like one of the most ridiculous claims I've heard.

Sorry, but why is RabbitMQ really necessary?


You don't need Rabbit, Celery, or Redis. You should be able to replace MySQL with SQLite. Then it would be radically easier to deploy.


A MySQL database cluster, and a local copy of a SQL database on a single file on a single filesystem, are not close to the same thing. Except they both have "SQL" in the name.

One of them allows a thousand different nodes on different networks to share a single dataset with high availability. The other can't share data with any other application, doesn't have high availability, is constrained by the resources of the executing application node, has obvious performance limits, limited functionality, no commercial support, etc etc.

And we're talking about a product that's intended for dealing with on-call alerts. The entire point is to alert when things are crashing, so you would want it to be highly available. As in, running on more than one node.

I know the HN hipsters are all gung-ho for SQLite, but let's try to rein in the hype train.


This discussion is in the context of a self-contained app called Grafana OnCall, which is built on Django, which does not particularly care which RDBMS you are using.

At the very least, SQLite should be the default database for this product, and users can swap it out with their MySQL database cluster if they really are Google-scale.


> The entire point is to alert when things are crashing, so you would want it to be highly available. As in, running on more than one node.

An important question to ask is how much availability you are actually gaining from the setup. It wouldn't be the first time I've seen a system move from single-node to multi-node and end up less available than before due to the extra complexity and moving pieces.


I don't need any of that stuff, and nor does anyone who would use this. People who need clustered high-availability stuff are paying for PagerDuty or VictorOps.

This is for tiny shops with 4 servers. And tiny shops with 4 servers don't have time to spin up a horrendous stack like this. I was excited to see this announcement until I saw all the moving pieces. No thanks!


If you only have 4 servers, make a GitHub Action (or, hell, since we're assuming one node with SQLite, a cron job on one of your 4 servers) that curls your servers every 5 minutes and sends you a text when they're down. You don't need a Lamborghini to get groceries.
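A rough sketch of that idea, meant to run from cron every 5 minutes (the URLs and the email-to-SMS gateway address are made up, and it assumes a local mail relay):

  import smtplib
  import urllib.request
  from email.message import EmailMessage

  SERVERS = ["https://app1.example.com/healthz", "https://app2.example.com/healthz"]
  ALERT_TO = "5551234567@sms-gateway.example.com"  # hypothetical carrier gateway

  def is_up(url):
      try:
          with urllib.request.urlopen(url, timeout=10) as resp:
              return resp.status == 200
      except Exception:
          return False

  down = [url for url in SERVERS if not is_up(url)]
  if down:
      msg = EmailMessage()
      msg["Subject"] = "Servers down"
      msg["From"] = "monitor@example.com"
      msg["To"] = ALERT_TO
      msg.set_content("\n".join(down))
      with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA
          smtp.send_message(msg)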


And this is the on-prem version of those tools. Just because it isn't the tool you wanted doesn't mean it's not good.


It’s curious to see people questioning the stack choices behind apps they haven’t built and problems they haven’t faced.

They chose this stack, it works for them. They’ve put it through its paces in production.

It’s as boring as it gets.



That picture shows the complete monitoring stack with lots of optional components (pushgw, service discovery, Grafana).

When comparing to OnCall, you need OnCall AND still the rest of that Prometheus picture.

Compare with this picture, where everything in the leftmost "Alert Detection" box is what you see in the Prometheus architecture: https://grafana.com/docs/oncall/latest/getting-started/


OnCall also does nothing unless you have something external firing alerts for you. They both fill similar niches in a larger monitoring system; this does not excuse OnCall having a drastically more complex internal architecture.


That seems like a perfectly reasonable architecture. If only all of us could work on battle tested components like those during our job!


For something that is supposed to add some more features to the basic email/HTTP alerts that Grafana generates, I do wonder what extra features require an additional two databases, a message queue and a separate task queue.


probably keeps history, state, escalation flow, etc?


> Django stack with MySQL, Redis, RabbitMQ, and Celery

MySQL is a weird if not slightly disturbing choice. Other than that it's a boring, battle-tested stack that is relatively easy to scale. I agree that Go is nicer, but I'm biased by several years of dealing with horrific Flask / Django projects.


> several years of dealing with horrific Flask / Django projects.

I think you misspelled "beautiful"


That's very bad. 99% of organizations don't have a volume of alerts that justifies any of MySQL, Redis and RabbitMQ.

Complexity comes at a steep price when something critical (e.g. OnCall) breaks and you have to debug it in a hurry.

Shoving everything in a container and closing the lid does not help.


One of the most frustrating aspects of being a software engineer is dealing with others who love to over-engineer. Unfortunately, they make enough noise about complex solutions being necessary that managers get scared of taking any easier, simpler approach.


Curious as to what architecture you would have preferred, or what this pretty standard stack (which can be deployed to k8s) is not giving you.


Installation on a regular system without Kubernetes? Right now I can install Grafana, Prometheus and Alertmanager on a regular Linux system using distribution packages, and just worry about those programs themselves. If I want to install OnCall, I need not just OnCall but four other non-trivial dependencies that will still need configuration, management and troubleshooting. All for something that is going to deal with far less load than any of Grafana/Prometheus/Alertmanager. I honestly do not understand it.


you can install this stack without kubernetes no? I don't see anything k8s-specific


The problem still stands of adding dependencies, extra complexity and configuration. I’m usually happy about Grafana/Prometheus deployments because the base installation is fairly simple and self-contained, but this looks like a bit of a mess.


Yes, there is nothing Kubernetes specific here, and this can be deployed using whatever container orchestration system you want.


Not OP, but one may interpret your response as "I don't understand why you prefer a single binary over this architecture that requires 6 different services and prefers k8s".

IMHO, OP just stated that one could solve this with fewer dependencies and have the same (if not a better) result.


Yes, thank you. I would be surprised if this same product couldn't be delivered with just Python(Django) + SQLite + Redis (assuming writing everything in Go is unrealistic). Spinning up a venv and launching a local Redis instance is significantly more reasonable than having to configure MySQL, RabbitMQ, and Celery.


I missed that interpretation :(

IMHO a fat binary written from scratch would have been a far worse choice than using a standard stack, both in terms of bugs and time, let alone open source contributions or any scalability.

In terms of the number of services, what do you get rid of to produce a better result? Maybe RMQ, and use a worse queue? Celery, and write your own task manager or pull in another dependency?


For a simple low-scale app you can often do without Redis and Celery/RMQ if you just push everything into Postgres.

Far less scalable, but it is dramatically simpler to deploy. Often gets you surprisingly far though. Would be interesting to know how many monitored integrations could be supported by that flow.


I bet quite a lot, probably at least 10-50 per second without doing anything special for performance, even with multiple queries per alert, calls to different APIs, things like that. I don't know of many places that are dealing with alerts measured in "per second" as a unit.

Not to mention that having multiple components doesn't mean it's "scalable" by default, it could happen that some part of the pipeline doesn't like multiple instances of something.


How does a message queue work via Postgres? Many people (including me) use Redis to run background jobs.


This is a very confused question. The data store you keep your queued items in is completely orthogonal to what a message queue actually is.

A simple way to use an RDBMS as a message queue, one that has been in use since before most HN readers were born, is roughly:

  - enqueue an item by inserting a row into a table with a status of QUEUED
  - use a SELECT FOR UPDATE, or UPDATE...LIMIT 1, or similar, to atomically claim and return the first status=QUEUED item, while setting its status to RUNNING (setting a timestamp is also recommended)
  - when the work is complete, update the status to DONE
There are more details to it obviously but that's the outline.
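For concreteness, a rough sketch of the claim step against Postgres (table and column names are hypothetical, psycopg2 is assumed, and SKIP LOCKED is one common way to keep concurrent workers from fighting over the same row):

  import psycopg2

  conn = psycopg2.connect("dbname=queue_example")  # assumed connection string

  CLAIM_SQL = """
      UPDATE jobs
         SET status = 'RUNNING', claimed_at = now()
       WHERE id = (SELECT id FROM jobs
                    WHERE status = 'QUEUED'
                    ORDER BY id
                    LIMIT 1
                      FOR UPDATE SKIP LOCKED)
      RETURNING id, payload
  """

  with conn, conn.cursor() as cur:
      cur.execute(CLAIM_SQL)
      row = cur.fetchone()  # None if nothing is queued
      if row:
          job_id, payload = row
          # ... do the work, then mark it done ...
          cur.execute("UPDATE jobs SET status = 'DONE' WHERE id = %s", (job_id,))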

The first software company I worked for was using this basic approach to queue outbound emails (and phone and fax... it was 2005!), millions per day, on an Oracle DB that also ran the entire rest of the business. It's not hard.


Here's the option I'm familiar with (siblings have others too):

https://github.com/malthe/pq

Doesn't have all the plumbing you'd want; there is a wrapper (https://github.com/bretth/django-pq/) that seems to give you an entrypoint command more like `celery worker ...`, but I've not investigated it closely.
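From memory, basic usage of pq looks roughly like this; treat the exact names as an assumption and check the README before relying on it:

  import psycopg2
  from pq import PQ

  conn = psycopg2.connect("dbname=example")
  pq = PQ(conn)
  pq.create()              # one-time: creates the queue table

  queue = pq["alerts"]     # a named queue
  queue.put({"alert_id": 123})

  job = queue.get()        # None if the queue is empty
  if job is not None:
      print(job.data)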




Any of the following:

Python(Django)+Redis+[SQLite]

Python(Django)+Postgres

[Compiled Go binary]+[SQLite]

SQLite barely even counts as an architectural dependency TBH :)


This. I find open source projects written in Go or Rust are usually more pleasant to work with than Java, Django or Rails, etc. They have fewer clunky dependencies, are less resource-hungry, and can ship as single executables, which makes people's lives much easier.

Just think about Gitea vs GitLab.


Not sure why you include java in that, as you mostly get a standalone file. No such thing as a jre in modern java deployment.

As for python, at least getting a dockerfile helps a lot. Otherwise it's a huge mess to get running, yes.

Python is still a hassle anyways, since the lack of true multithreading means that you often need multiple deployments, which the Celery usage here for instance shows.


> Not sure why you include java in that, as you mostly get a standalone file. No such thing as a jre in modern java deployment.

Maybe I'm behind the times, but I can't figure out what you mean here. As far as I know 'java -jar' or servlets are still the most common ways of running a Java app. Are you talking graal and native image?


For deploying your own stuff, most people do as before, yes. But even then, it's at least still only a single jar file, containing all dependencies. Not like a typical python project where they ask you to run some command to fetch dependencies and you have to pray it will work on your system.

But using jlink for Java, one can package everything into a smaller runtime distributed together with the application. Then I feel it's not much different from a Go executable.

> The generated JRE with your sample application does not have any other dependencies...

> You can distribute your application bundled with the custom runtime in custom-runtime. It includes your application.

From the guide here https://access.redhat.com/documentation/en-us/openjdk/11/htm...


Python application deployments are all fun and games until suddenly the documentation starts unironically suggesting that you "write your configuration as a Python script" that gets mounted to some random specific directory within the app, as if that could ever be a sane and rational idea.


Hell no, I want stuff like OnCall packaged in a Linux distribution. I need something stable and reliable that receives security fixes.

Maintaining tens of binaries pulled from random GitHub projects over the years is a nightmare.

(Not to mention all the issues around supply chain management, licensing, phoning home, and so on.)


At this point I trust the Go modules supply chain considerably more than any free distro's packaging, which is ultimately pulling from GitHub anyway.


This is plain false. Most production-grade distributions do extensive vetting of packages, in terms of both code and legal review.

Additionally, distribution packages are tested by a significant number of users before release.

Nothing of this sort happens with any language-specific package manager. You just get whatever happens to be around on the various software forges.

Unsurprisingly, there have been many serious supply chain attacks in the last 5 years. None of which affected the usual big distros.


> None of which affected the usual big distros.

I guess we can argue about "big", but didn't both Arch (https://lists.archlinux.org/pipermail/aur-general/2018-July/...) and Gentoo (https://wiki.gentoo.org/wiki/Project:Infrastructure/Incident... and older, https://bugs.gentoo.org/show_bug.cgi?id=323691) have actual compromised packages? And going back more than five years, Fedora (https://lists.fedoraproject.org/pipermail/announce/2011-Janu...) and Debian (https://www.debian.org/News/2003/20031202) had compromises, though with no known package changes.


No, Go modules implement a global TOFU checksum database. Obviously a compromised upstream at initial pull would not be caught, but distros (other than the well-scoped commercial ones) don't do anything close to security audits of every module they package either. Real-world untargeted supply chain attacks come from compromised upstreams, not long-term bad-faith actors. Go modules protect against that (as well as other forms of upstream incompetence that break immutable artifacts / deterministic builds).

MVS also prevents unexpected upgrades just because someone deleted a lockfile.


> At this point I trust the Go modules supply chain considerably more than any free distro's packaging

What has happened in the package ecosystem to make you believe this? Is it velocity of updates or actual trust?

I haven’t heard of any malicious package maintainers.


Better automation ensuring packages are immutable, fewer humans in the packaging loop.


Generally I prefer humans in the loop, someone to actually test things. This is why some distros are stable compared to others which are more bleeding edge.


For SC security, the fewer points of attack between me and the source the better.

For other kinds of quality, I have my own tests which are much more relevant to my use cases than whatever the distro maintainers are doing.

I've been a DD and while distros do work to integrate disparate upstreams as well as possible, they rarely reject packages for being fundamentally low quality or make significant quality judgements qua their role as maintainer (only when they're a maintainer because they're also a direct user). Other distributions do even less than Debian.


I have seen scenarios where package maintainers have rejected updating packages because the upstream is compromised though.


Fedora currently packages 10646 crates. It's implausible that they're manually auditing each one at each upgrade for anything other than "test suites pass", let alone something like obfuscated security vulnerabilities.

In the end most distros will be saved by the fact they don't upgrade quickly. Which is also accomplished by MVS without putting another attack vector in the pipeline.


No single maintainer manages more than 250 packages (and the top one is a RH employee).

There are more than a hundred package maintainers (I'm not sure exactly how many), but the median is about 50 packages.

Do you think people can't keep up with the updates for 50 packages?


I think I don't want "more than a hundred" additional points of trust, especially if they're trying to audit 50+ projects with various levels of familiarity each. And no, I don't believe one person can give a real audit to 50 packages each release even if it was their actual job.

To paraphrase, all "more than a hundred" of those people need to be lucky every time.


That's a tried and true stack, and a very good one for maintaining sane levels of reliability, consistency, durability, etc. Resource-wise, at least with Celery, RabbitMQ and Django, they're also pretty lean.

It even ships in containers along with Docker Compose files and Helm charts, which would suit the deployment use cases of 99% of users. I understand that you're not using containers, but I don't think that's a limitation that many are inflicting upon themselves as of late, and if pressed, installing Docker Compose takes about 5 minutes and you don't have to think about it again.


> Docker Compose takes about 5 minutes and you don't have to think about it again

Except when you need to pin and repin versions to comply with a security policy, which may be why you aren't even running containers in the first place


not gonna argue that a single binary is the ultimate deploy solution but running a django app is not that difficult (although i am biased cause i do that for a living).

i love django projects but mysql, celery and rabbitmq -- no thanks.


Don't get me wrong, I love Django and think it's a great framework for writing internal tools like this. Redis gets a pass too since Django has native support for it in 4.0+. It's really the (IMHO unnecessary) combo of MySQL+RabbitMQ+Celery that turns me off.

Redis itself has had solid support for building reliable distributed task streaming for nearly 4 years (Redis consumer groups were introduced in 2018).
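A minimal sketch of that primitive with redis-py (stream, group, and consumer names are made up):

  import redis

  r = redis.Redis()

  # One-time setup: create the consumer group (mkstream also creates the stream).
  try:
      r.xgroup_create("alerts", "notifiers", id="0", mkstream=True)
  except redis.ResponseError:
      pass  # group already exists

  # Producer: enqueue an alert.
  r.xadd("alerts", {"alert_id": "123", "destination": "slack"})

  # Consumer: read one pending entry for this consumer, process it, then ack it.
  for stream, messages in r.xreadgroup("notifiers", "worker-1", {"alerts": ">"},
                                       count=1, block=5000):
      for msg_id, fields in messages:
          # ... deliver the notification here ...
          r.xack("alerts", "notifiers", msg_id)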



