Can somebody explain why I would use it instead of a simple Redis/SQS/Postgres queue implemented in 50 LOC (plus a Grafana panel for monitoring, which is pretty much mandatory even for a wrapper around this or any other service)? I'm not trying to mock it; it's a serious question. What is implied by "task queue" that makes it worth bothering with a dedicated service?
You're right, if all you need is a queue with a small number of workers connected at low volume, you don't need Hatchet or any other managed queue - you can get some pretty performant behavior with something like: https://github.com/abelanger5/postgres-fair-queue/blob/main/....
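For readers who haven't seen the pattern, the kind of small Postgres queue being described typically hinges on `FOR UPDATE SKIP LOCKED`, so concurrent workers never claim the same row. A minimal sketch (table name, columns, and the generic DB-API `conn` argument are illustrative assumptions, not taken from the linked repo):

```python
# Minimal Postgres work-queue sketch using FOR UPDATE SKIP LOCKED.
# Assumes a table like:
#   CREATE TABLE tasks (id serial PRIMARY KEY, payload jsonb,
#                       status text DEFAULT 'queued',
#                       created_at timestamptz DEFAULT now(),
#                       started_at timestamptz);

DEQUEUE_SQL = """
UPDATE tasks
SET status = 'running', started_at = now()
WHERE id = (
    SELECT id FROM tasks
    WHERE status = 'queued'
    ORDER BY created_at
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING id, payload;
"""

def dequeue_one(conn):
    """Claim the oldest queued task, or return None if the queue is empty."""
    with conn.cursor() as cur:
        cur.execute(DEQUEUE_SQL)
        row = cur.fetchone()
    conn.commit()
    return row  # (id, payload) or None
```

`SKIP LOCKED` is what keeps two workers from blocking on (or double-claiming) the same row; everything else is bookkeeping.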
The point of Hatchet is to support more complex behavior - like chaining tasks together, building automation around querying and retrying failed tasks, handling a lot of the fairness and concurrency use-cases you'd otherwise need to build yourself, etc - or just getting something that works out of the box and can support those use-cases in the future.
And if you are running at low volume and trying to debug user issues, a Grafana panel isn't going to give you the level of granularity or admin control you need to track down the errors in your methods (rather than just at the queue level). You'd need to integrate your task queue with Sentry and a logging system - and in our case, error tracing and logging are available in the Hatchet UI.
That caught my attention. Retrying failed tasks isn’t easy. There are all kinds of corner cases that pop up one by one. If you have some nice way to handle the common failure modes ("text me" or "retry every ten minutes" or "retry 5 times, then give up" or "keep retrying, but with exponential backoff") then that’s something I’d love to use.
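The failure modes listed above ("retry 5 times, then give up", "keep retrying, but with exponential backoff") can be sketched in a few lines of plain Python. This is a hand-rolled illustration of the policy, not Hatchet's API; the decorator name and parameters are made up:

```python
import random
import time

def retry(max_attempts=5, base_delay=1.0, max_delay=600.0, sleep=time.sleep):
    """Retry decorator: exponential backoff with jitter; give up after max_attempts.

    `sleep` is injectable so tests (and dry runs) don't actually wait.
    """
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # "retry 5 times, then give up"
                    # exponential backoff: 1s, 2s, 4s, ... capped at max_delay
                    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                    sleep(delay + random.uniform(0, delay / 10))  # jitter
        return wrapper
    return decorator
```

The corner cases the comment alludes to start right after this sketch ends: deduplicating side effects on replay, persisting attempt counts across process restarts, and deciding which exceptions are retryable at all.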
(Wiring together 40+ preemptible TPUs was a nice crucible for learning about all of these. And much like a crucible, it was as painful as it sounds. Hatchet would’ve been nice.)
I had a bunch of different queues built on SAQ (~50 LoC for the whole setup) and deployed to production. A lot of them use LLMs, and when one of them failed it was near impossible to debug. Every workflow has over a dozen connected tasks, and every task can run on over a dozen separate rows before completing... I was spending hours in log files, often unsuccessfully.
The dashboard in Hatchet has a great GUI where you can navigate between all the tasks, see how they connect, see the data passed into each one, and see the return results from each task - and each task has a log box you can print information to. You can rerun tasks, override variables, trigger identical workflows, and filter tasks by metadata.
It's dramatically reduced the amount of time it takes me to spot, identify, and fix bugs. I miss the simplicity of SAQ, but that's why I switched, and it's already paid off.
Is that a problem with the underlying infrastructure, though? I'm not seeing how using Postgres queues would solve your issue... Instead, it seems like an issue with your client lib: SAQ not providing the appropriate tooling to debug.
FWIW, I've used both dramatiq and celery with Redis in heavy prod environments and never had an issue with debugging. And I'm having a tough time understanding how switching the underlying queue infrastructure would have made my life easier.
No, it's not a problem with the underlying infrastructure. I believe the OP was asking why use this product, not why this specific infrastructure is necessary. The infrastructure before was working fine (with SAQ at least; Celery was an absolute mess of segfaults), so that wasn't really part of my decision. I actually really liked SAQ and probably preferred it from an infra perspective.
It's nice to be running on Postgres (i.e. not really having to worry about payload size, I heard some people were passing images from task to task) but for me that is just a nicety and wasn't a reason to switch.
If you're happy with your current infra, happy with the visibility, and nothing is lacking from a development perspective, then yeah, there's probably not much point in switching your infra to begin with [1]. But if you're building complicated workflows and just want your code to run with an extreme level of visibility, it's worth checking out Hatchet.
[1] I'm sure the founders would have more to say here, but as a consumer I'm not really deep in the architecture of the product. The best I could do would be to give you 100 reasons I will never use Celery again XD
You can use Celery with Postgres without issues if you want the things you don't get from a hand-rolled queue: tweakable retries, tweakable amounts of prefetch, and other important-at-scale features, plus an SDK that works out of the box with higher-level patterns for your developers. What if devs want to track how long something waited in the queue, or a metric about retries? Those are things you'd otherwise have to roll by hand.
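The "how long did this wait in the queue" metric mentioned above is a good example of what rolling it by hand looks like: stamp the enqueue time into the message, and compute the latency when a worker picks it up. A toy sketch (the in-memory deque stands in for Redis/Postgres, and all names here are illustrative):

```python
import json
import time
from collections import deque

queue = deque()     # stand-in for a real broker (Redis list, Postgres table, ...)
wait_times = []     # would feed a histogram in Prometheus/Grafana

def enqueue(task, now=time.time):
    # Stamp the enqueue time into the message itself so any worker can
    # compute queue latency without a shared clock service.
    queue.append(json.dumps({"task": task, "enqueued_at": now()}))

def dequeue(now=time.time):
    msg = json.loads(queue.popleft())
    wait_times.append(now() - msg["enqueued_at"])  # record queue wait time
    return msg["task"]
```

This is the kind of instrumentation a framework gives you for free; hand-rolled, you also have to handle clock skew between producer and worker hosts.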
I also want the answer to this question. Instinctively, I want to say that if you're asking this question, it means you don't need it (just like most people don't need Kubernetes/Snowflake/data lakes).