A question I couldn't possibly ask a more appropriate person: how were the long-running server processes managed?
I really want to use PHP to manage backend-type, long-lived stuff... in a way that is relatively lightweight and self-managing, and idiomatically tolerates the occasional bit of unparseable syntax on script load or mis-named function call, without throwing a hissy fit.
Like... the only thing that'll typically knock php-fpm completely over is a script-triggered interpreter segfault, which is Bad™, exceptional, and (given php-fpm's collective uptime on all installations everywhere) vanishingly rare. Fatal error? FCGI response message passed upstream. Script exceeded execution time? Hard timeout (and message passed upstream). CLI SAPI? Clunk; no more script execution for you. I've always felt a bit left out in the cold by this, just in terms of how the language itself handles this edge case.
I guess I should just stop whining and go set up systemd unit files or similar. That would definitely make the most sense in a production environment, and I should probably build the muscle memory.
It's just that, for slower-paced and/or prototype-stage projects that don't have a clear direction... my brain's trying to reach for an equivalent of `php -S` that isn't there, and... I guess it's really easy to get idiomaticity-sniped, heh. ("But this project isn't a thing yet... and systemd unit files go in /etc, which is like, systemwide... and I forgot the unit file syntax again..." etc etc)
TL;DR, if this made sense :), did you ever encounter this sort of scenario, and how did you handle it? A $systemd_equivalent, language-level patches, shell script wrappers, cron, ...?
Oh, another curiosity - whenever remembering how pcntl_fork() works I usually have to reach for `killall php` a few times :) (especially when the forks are all using 100% CPU... and I accidentally start slightly too many...). How was killall isolated from nuking important server processes? Shebang line? Renamed interpreter binary (...probably not)? Different user accounts?
I think your first question is asking about something different than we did at the time (PHP 5.x). There was no central process that “ran” the backups—is that what you mean? There was a cron job on each backup server that started the work (systemd would do the job nowadays). The code would figure out what database servers it was responsible for and kick off those jobs, then exit on completion. Reporting and restaging to central storage was another set of crons, and so on. So though they ran for hours or days, the PHP processes had a well-defined start and terminus. The “master” process on each host started the worker children, did some housekeeping, wait()ed for hours, and took note of the exit codes. What you’re trying to do sounds a lot harder.
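For a sense of the shape, something like this (a from-memory sketch, not the real code; `run_backup()` and the job list are stand-ins):

```php
<?php
// From-memory sketch of the master/worker shape (requires the pcntl
// extension, CLI SAPI). run_backup() and the job list are stand-ins.
function run_backup(string $job): bool {
    sleep(random_int(1, 3)); // stands in for hours of real backup I/O
    return true;
}

$jobs = ['db1', 'db2', 'db3']; // hypothetical: what this host is responsible for
$children = [];

foreach ($jobs as $job) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        fwrite(STDERR, "fork failed for $job\n");
    } elseif ($pid === 0) {
        // Child: do the (possibly hours-long) work, report via exit code.
        exit(run_backup($job) ? 0 : 1);
    } else {
        $children[$pid] = $job; // parent: remember who is doing what
    }
}

// Housekeeping: wait() for the workers and take note of the exit codes.
while ($children) {
    $pid = pcntl_wait($status);
    if ($pid > 0) {
        echo "{$children[$pid]} (pid $pid) exited with " . pcntl_wexitstatus($status) . "\n";
        unset($children[$pid]);
    }
}
```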
If you’re missing a magical command that sets up a test environment, I recommend writing it in shell, PHP, whatever, and sticking it in $HOME/bin. Or a makefile with a target so you can just run “make testserver” or the like; that way it will stay with the project. Or scripts in $PROJECT/bin or $PROJECT/scripts. Doesn’t really matter as long as it’s documented in the README and simple to execute. It’s permissible and customary to have a cleanup command, as well, if you started a background process. You could even have those start and stop commands create and then disable a systemd unit—that way you won’t have to look it up every time.
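Concretely, a hypothetical `$PROJECT/bin/testserver` needn't be more than something like this (untested sketch; the paths and port are made up, and `posix_kill()` needs the posix extension):

```php
#!/usr/bin/env php
<?php
// bin/testserver start|stop (hypothetical dev-server wrapper).
// Backgrounds `php -S`, records the PID, and kills exactly that PID on stop.
$pidFile = __DIR__ . '/../.testserver.pid';
$docRoot = __DIR__ . '/../public'; // made-up project layout

switch ($argv[1] ?? '') {
    case 'start':
        $cmd = sprintf('php -S 127.0.0.1:8080 -t %s >/dev/null 2>&1 & echo $!',
                       escapeshellarg($docRoot));
        $pid = (int) shell_exec($cmd);
        file_put_contents($pidFile, $pid);
        echo "started pid $pid\n";
        break;
    case 'stop':
        if (is_file($pidFile)) {
            $pid = (int) file_get_contents($pidFile);
            posix_kill($pid, 15); // SIGTERM: precise, no killall needed
            unlink($pidFile);
            echo "stopped pid $pid\n";
        }
        break;
    default:
        fwrite(STDERR, "usage: {$argv[0]} start|stop\n");
        exit(1);
}
```

The PID file is the whole trick: stop only ever signals the process that start created, which also answers your `killall` worry.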
> How was killall isolated from nuking important server processes?
In general we didn’t isolate our processes against signals, because there was occasionally reason to send them. When we did send them, we sent them precisely. If a few kill -9s didn’t stop a backup then there was almost certainly a disk issue on the host and it was stuck on a kernel I/O system call, and we’d “nuke” the host (send it through a self-diagnosis and reimage cycle; cloud analogy: terminate and reallocate). It was definitely a cattle-not-pets environment. Other backup hosts would take up the slack.
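("Precisely" meaning by PID, which the master already knew. Illustratively, not our actual code:)

```php
<?php
// Illustration: signal only the PIDs you spawned, never `killall php`.
// Assumes a $children map of pid => job, as in the fork sketch above.
$children = [/* pid => job */];

pcntl_async_signals(true); // PHP 7.1+; the PHP 5.x era used declare(ticks=1)
pcntl_signal(SIGTERM, function () use (&$children) {
    foreach (array_keys($children) as $pid) {
        posix_kill($pid, SIGTERM); // precise delivery, to our own workers only
    }
    exit(0);
});
```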
Thanks very much for replying! Apologies for response latency...
Cron jobs make a lot of sense for periodic batch work.
It's interesting that you used PHP to start and manage child workers (I recall reading somewhere in the docs about PHP being unable to report exit status codes correctly under certain conditions, but I can't find it right now).
Regarding the environment, I was mostly pining for a "correct"/idiomatic way for PHP to handle genuinely fatal conditions (like syntax errors) while keeping the binary alive. I'm suddenly reminded of Java's resilience to failure - it'll loudly tell you all about everything currently going sideways in stderr/logcat/elsewhere, but it's "difficult enough" to properly knock a Java process over (IIUC) that it's very commonly used for long-running server processes.
PHP-FPM has this same longevity property, but the CLI was designed to prefer crashing. I just always wished I didn't have to bolt on an afterthought extra layer to get reliability. So I wondered if I could learn anything from this particular long-running-process scenario. Cron is hard to beat though :)
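(For what it's worth, I realize the bolt-on layer can at least be written in PHP itself. Untested sketch of a respawning supervisor, with a made-up `worker.php` target:)

```php
<?php
// Untested sketch: a pure-PHP supervisor (requires pcntl). Each iteration
// runs the target in a *fresh child interpreter*, so even a parse error in
// worker.php only kills the child, never this process.
$script = $argv[1] ?? 'worker.php'; // hypothetical target script

while (true) {
    $pid = pcntl_fork();
    if ($pid === 0) {
        pcntl_exec(PHP_BINARY, [$script]); // child: replace self with fresh php
        exit(127); // only reached if exec itself failed
    }
    pcntl_waitpid($pid, $status);
    echo 'child exited with ' . pcntl_wexitstatus($status) . ", respawning\n";
    sleep(1); // crude backoff so a permanently broken script doesn't spin the CPU
}
```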
Hmm. Automating the systemd unit creation process. Hmmmm... :) <mumbling about not knowing whether su/sudo credential/config sanity will produce password prompts>
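(Partially answering my own mumble: user-level units live under `~/.config/systemd/user/`, and `systemctl --user` never asks for a password, so su/sudo stays out of it entirely. Untested, names made up:)

```ini
# ~/.config/systemd/user/myproject.service (hypothetical; no /etc, no sudo)
[Unit]
Description=myproject dev worker

[Service]
ExecStart=/usr/bin/php %h/myproject/worker.php
Restart=on-failure
RestartSec=1

[Install]
WantedBy=default.target
```

Then `systemctl --user daemon-reload && systemctl --user start myproject`; and `Restart=on-failure` is exactly the keep-the-binary-alive behaviour I was pining for.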
And... heh, that's right, `kill` exists. Need to step up my game and stop `killall php`-ing. Good point.