The problem child for classic CGI is Windows, where process creation is roughly 100× slower than on any POSIX implementation.
You can measure this easily: get a copy of busybox for Windows and write a shell script that forks off a process a few thousand times. The performance difference is stark.
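For concreteness, here is a minimal C sketch of that kind of measurement (the filename and loop count are illustrative, and the child just _exit()s rather than exec()ing a command, so the shell-script version would pay somewhat more per iteration). It assumes a POSIX environment; on Windows it would need a compatibility layer such as Cygwin or MSYS2, since native Win32 has no fork():

    /* fork-bench.c: time a few thousand fork()/wait() cycles. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int n = 5000;               /* arbitrary iteration count */
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < n; i++) {
            pid_t pid = fork();
            if (pid < 0) { perror("fork"); return 1; }
            if (pid == 0) _exit(0);             /* child exits immediately */
            if (waitpid(pid, NULL, 0) < 0) { perror("waitpid"); return 1; }
        }
        gettimeofday(&t1, NULL);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("%d forks in %.1f ms, %.1f us per fork\n", n, us / 1000, us / n);
        return 0;
    }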
That's interesting. I hadn't thought about that. Still, fork() (plus _exit() and wait()) takes 0.7ms in Termux on my phone for small processes, as measured by http://canonical.org/~kragen/sw/dev3/forkovh.c.
Are you really saying that it takes 70ms on Microsoft Windows? I don't have an installed copy here to test.
Even if it does, that would still be about 15% of the time required for `python3 -m cgi`, so it seems unlikely to be an overriding concern for CGI programs written in Python, at least on manycore servers serving fewer than tens of millions of hits per day. Or does it also fail to scale across cores?
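One way to answer the scaling question empirically: the sketch below (worker and iteration counts are arbitrary assumptions) runs the same fork()/_exit()/wait() loop from several worker processes at once. If the overall per-fork time holds roughly steady as the worker count rises toward the core count, process creation scales across cores; if wall-clock time barely improves over the serial run, it doesn't.

    /* fork-scale.c: check whether fork() throughput scales across cores. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NWORKERS 8              /* vary this up to the core count */
    #define NFORKS   2000

    static void worker(void)
    {
        for (int i = 0; i < NFORKS; i++) {
            pid_t pid = fork();
            if (pid < 0) _exit(1);
            if (pid == 0) _exit(0);         /* grandchild exits at once */
            waitpid(pid, NULL, 0);
        }
        _exit(0);
    }

    int main(void)
    {
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (int w = 0; w < NWORKERS; w++) {
            pid_t pid = fork();
            if (pid < 0) { perror("fork"); return 1; }
            if (pid == 0) worker();         /* never returns */
        }
        while (wait(NULL) > 0)              /* reap all workers */
            ;
        gettimeofday(&t1, NULL);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%d workers x %d forks in %.2f s (%.1f us per fork overall)\n",
               NWORKERS, NFORKS, s, s * 1e6 / (NWORKERS * (double)NFORKS));
        return 0;
    }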
So, that does 19999 fork()s in 224 seconds, which works out to about 11 milliseconds per fork(). (On my Linux box, it's actually doing clone(), and also does a bunch of wait4(), rt_sigaction(), exit_group(), etc., but let's assume the fork() is the bottleneck.)
This is pretty slow, but it's still about 6× faster than the 70 milliseconds I had inferred from your "100× slower than on any POSIX implementation".
Also note that your Red Hat machine is evidently forking in 39μs, which is several times faster than I've ever seen Linux fork.