Author here. I think you should go through the article again. It's quite readable, and there are no "hand-optimizations" as you say. Also, the single-core implementation was already faster than the C version; the multithreaded version was only done to explore different methods of concurrency in Go.
Hope that clarifies things.