I wrote this because alerting on CPU% / loadavg as a machine-health indicator has burned me a few times.
The simple split I use now is:
- CPU% = how busy the cores are
- PSI = how much time tasks are stalled (CPU / memory / IO)
In an eBPF agent I am working on (Linnix), I ended up looking at CPU and PSI together. High CPU + high PSI is interesting. High CPU + low PSI is usually just “busy”.
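The CPU+PSI split can be sketched as a tiny classifier over `/proc/pressure/cpu`. The thresholds below are illustrative assumptions for the example, not values from the agent:

```rust
/// Parse the "some" avg10 value out of /proc/pressure/cpu content,
/// e.g. "some avg10=12.34 avg60=5.00 avg300=1.00 total=123456".
fn psi_some_avg10(psi: &str) -> Option<f64> {
    psi.lines()
        .find(|l| l.starts_with("some"))?
        .split_whitespace()
        .find_map(|tok| tok.strip_prefix("avg10="))
        .and_then(|v| v.parse().ok())
}

/// Illustrative classification; the 80% / 10% cutoffs are made up.
fn classify(cpu_pct: f64, psi_avg10: f64) -> &'static str {
    match (cpu_pct > 80.0, psi_avg10 > 10.0) {
        (true, true) => "contended: tasks are stalling, worth a look",
        (true, false) => "busy but healthy: cores saturated, little stall",
        (false, true) => "stalled without CPU saturation (often memory/IO)",
        (false, false) => "idle/normal",
    }
}

fn main() {
    // In the agent this would be std::fs::read_to_string("/proc/pressure/cpu").
    let sample = "some avg10=23.10 avg60=11.00 avg300=4.20 total=987654\n\
                  full avg10=9.00 avg60=3.00 avg300=1.00 total=123456";
    let stall = psi_some_avg10(sample).unwrap();
    println!("{}", classify(92.0, stall));
}
```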
This obviously doesn’t replace latency/SLO alerts at the app level. It’s only about which host metric to look at.
Right now I’m sticking to process lifecycle (sched_process_fork and sched_process_exit), mostly for correlation.
It’s much easier to grab container ID / cgroup metadata at fork time and say “this pod/image is the bad actor” than it is to reconstruct that context off a firehose of sched_switch events.
I agree that run queue latency / scheduler stats are the “better” signals for pure performance debugging. But scheduler switches generate a huge volume of events compared to forks.
So I’m starting with fork/exec/exit plus container/cgroup mapping.
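The userspace half of the fork-time cgroup mapping can be sketched like this. Note the path conventions (`docker-<id>.scope`, `crio-<id>.scope` under cgroup v2) are common but runtime-dependent assumptions; in the agent you'd read `/proc/<pid>/cgroup` for the forked PID:

```rust
/// Extract a container ID from a cgroup v2 line such as
/// "0::/system.slice/docker-4f1a2b3c9d8e.scope".
/// Naming conventions vary by runtime — treat this as a sketch, not a spec.
fn container_id_from_cgroup(line: &str) -> Option<String> {
    // Take the path after the last ':' in the /proc/<pid>/cgroup line.
    let path = line.rsplit(':').next()?;
    path.split('/')
        .rev()
        .find_map(|seg| {
            let seg = seg.strip_suffix(".scope").unwrap_or(seg);
            seg.strip_prefix("docker-")
                .or_else(|| seg.strip_prefix("crio-"))
                .map(str::to_string)
        })
}

fn main() {
    let line = "0::/kubepods.slice/kubepods-burstable.slice/docker-4f1a2b3c9d8e.scope";
    println!("{:?}", container_id_from_cgroup(line)); // Some("4f1a2b3c9d8e")
}
```

Doing this once at fork time and caching pid → container ID is what makes the later attribution cheap compared to resolving it per sched_switch event.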
If you’ve shipped scheduler-level tracing in production I’d love to hear how you handled filtering + aggregation.
OP here.
I’ve been doing backend work for ~15 years, but this was the first time I really felt why eBPF matters.
We had a latency spike that all the usual polling tools missed — top, CloudWatch, Datadog, everything looked normal. In the end it was a misconfigured cron job spawning ~50 short-lived workers every minute. Each one ran for ~500ms, burned the CPU, and exited before the next poll. So all our “snapshot” tools were basically blind.
I wrote the post to show this exact gap: polling = snapshots, tracing = event stream. For stuff that appears and disappears between polls, only tracing really sees it. Tools like execsnoop or auditd can catch this, but in our case the overhead felt too high to leave on 24/7 in production. I am currently playing with a small Rust+Aya agent that listens on ring buffers so we can run this continuously with less overhead.
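The snapshot blind spot is easy to show with a toy model (the numbers mirror the incident above: ~500 ms workers, once a minute, a poller sampling every 60 s — the offset and counts here are illustrative):

```rust
/// Toy model: a worker spawns at the top of every minute and lives
/// `lifetime_s` seconds. A poller samples at offset_s, offset_s + poll_s,
/// offset_s + 2*poll_s, ... Returns how many of `n` workers a snapshot catches.
fn workers_seen(lifetime_s: f64, poll_s: f64, offset_s: f64, n: u32) -> u32 {
    (0..n)
        .filter(|&m| {
            let start = m as f64 * 60.0;
            let end = start + lifetime_s;
            // Index of the first poll tick at or after `start`.
            let k = ((start - offset_s) / poll_s).ceil().max(0.0);
            let tick = offset_s + k * poll_s;
            tick < end // seen only if a tick lands inside the worker's lifetime
        })
        .count() as u32
}

fn main() {
    // 500 ms workers, 60 s polling offset by 1 s: the poller never sees them.
    println!("seen: {}", workers_seen(0.5, 60.0, 1.0, 60)); // seen: 0
    // An event stream (fork/exit tracepoints) would record all 60 regardless.
}
```

In general a sampler with period N can only catch a P-second process with probability about P/N — here 0.5/60, so roughly one sighting per two hours even without the unlucky alignment.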
If you just want to try the idea, the post has a few bpftrace one-liners so you can reproduce the detection logic without writing any C or Rust.
I started this as a personal project to help with monitoring my personal projects. The eBPF monitoring works well - that part is solid.
The AI part is experimental, especially the idea of running inference on CPU (can't afford GPUs and didn't want to rely on OpenAI APIs, though that's where it started). It's hit-or-miss depending on the model.
Not production-tested at scale - just sharing in case it's useful to others who want to tinker with eBPF + Rust.
Full transparency: I did use AI to help write the documentation because, honestly, writing docs feels boring. Based on your feedback, I will review it thoroughly now.
This is my first time open sourcing something, so I'm trying and learning.
It does seem super cool! But if you aren't even editing the basic README.md, the problem isn't that you used AI to help; it's that you didn't do even the most basic editing, so I don't know what to trust. If I can't trust the docs, why spend my time?
My side project is mahasherpa.com.
It's a script I wrote to find myself a job after getting laid off back in November, which I then turned into a website.
I haven't made any money from it yet.
I used an algorithm to find myself a job, and now I'm trying to build something that helps other job-seekers like me in India save their valuable time.
Please do provide feedback
Hello James,
Thanks for pointing out the bug; I believe I've fixed it. You're welcome to try again, and please let me know if you have any other suggestions.
Just a quick note:
This application only has jobs for India, and you need to sign up to see the jobs that match your profile. I give you my word that you will not receive any spam email, and I won't post anything to your social network profiles. I only use those to check whether you have a referral for any job; referrals count toward your match score, since they make you more likely to get noticed by recruiters/employers.