As I elaborated elsewhere, if you focus on task parallelism to the exclusion of data parallelism, as Go tends to do, then you are not making good use of the hardware.
What does it mean for a programming language to "tend to focus on task vs data parallelism"? In my mind, Go gives you primitives from which to build either data- or task-parallel algorithms, and the programmer can choose the kind of algorithm they find appropriate (perhaps data parallelism is always faster, but perhaps some developers find task parallelism easier, and their time is not well spent squeezing out the performance delta?). Is there some set of primitives that would cater more naturally to data parallelism? If such primitives exist, are they really 'better' than Go's primitives, or do they simply make task parallelism harder without making data parallelism easier? These aren't rhetorical questions; I'm genuinely curious to hear your opinion.
You want SIMD and parallel for, parallel map/reduce, etc. Goroutines are too heavyweight for most of these tasks: you will be swamped by creation, destruction, and message-passing overhead. What you need is something like TBB or a Cilk-style scheduler.
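To make the overhead point concrete, here is a minimal sketch of a data-parallel for loop built from Go's own primitives. The `parallelFor` helper is hypothetical (not part of any standard library): it splits the index range into one chunk per CPU rather than spawning a goroutine per element, which is the amortization trick a TBB- or Cilk-style scheduler does far more aggressively (work stealing, grain-size tuning, no per-call goroutine setup).

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelFor runs body(i) for i in [0, n), splitting the range into one
// contiguous chunk per available CPU. Chunking amortizes goroutine creation
// and synchronization cost; spawning one goroutine per element would drown
// the useful work in scheduling overhead.
func parallelFor(n int, body func(i int)) {
	workers := runtime.GOMAXPROCS(0)
	chunk := (n + workers - 1) / workers // ceil(n / workers)
	var wg sync.WaitGroup
	for start := 0; start < n; start += chunk {
		end := start + chunk
		if end > n {
			end = n
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				body(i)
			}
		}(start, end)
	}
	wg.Wait()
}

func main() {
	xs := make([]float64, 1_000_000)
	parallelFor(len(xs), func(i int) { xs[i] = float64(i) * 2 })
	fmt.Println(xs[10])
}
```

So yes, you can express data parallelism with goroutines, but you end up hand-rolling (a crude version of) the scheduler that TBB or Cilk gives you, and nothing in this sketch gets you SIMD within each chunk.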