Sure -- GPUs are a good example: they go partway there by not speculating (though they still have cache hierarchies). It works because GPU workloads have massive data parallelism, so while one group of threads (a "warp") is stalled waiting for data, the cores simply execute other warps. Sun/Oracle built a number of SPARC chips along these lines too, e.g. the Niagara (Sun UltraSPARC T1) tolerates memory latency by keeping many hardware threads per core (4 on the T1; the later T2 bumped it to 8) rather than by OoO scheduling.
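To make that concrete, here's a minimal CUDA sketch of the kind of memory-bound kernel this design suits (all names and sizes here are made up for illustration). Each thread does one dependent load that will usually miss cache; the SM hides that latency not by speculating past it but by issuing instructions from other resident warps:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Gather kernel: each thread does one data-dependent load.
    __global__ void gather(const float* src, const int* idx, float* dst, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // This load likely misses cache. The warp stalls here, but the
            // scheduler just switches to another warp that is ready to issue.
            dst[i] = src[idx[i]];
        }
    }

    int main() {
        const int n = 1 << 24;  // illustrative problem size
        float *src, *dst; int *idx;
        cudaMallocManaged(&src, n * sizeof(float));
        cudaMallocManaged(&dst, n * sizeof(float));
        cudaMallocManaged(&idx, n * sizeof(int));
        for (int i = 0; i < n; i++) { src[i] = (float)i; idx[i] = (i * 97) % n; }
        // Launch far more threads than there are execution units: that
        // oversubscription is what gives the scheduler stalled-warp alternatives.
        gather<<<(n + 255) / 256, 256>>>(src, idx, dst, n);
        cudaDeviceSynchronize();
        printf("dst[12345] = %f\n", dst[12345]);
        cudaFree(src); cudaFree(dst); cudaFree(idx);
        return 0;
    }

The point is that throughput stays high only because there are thousands of threads in flight; a single thread running this code would eat every load's full DRAM latency.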
The problem is that single-thread performance is really important for a lot of workloads, because (i) parallelization is hard, (ii) even for parallelized workloads, serial bottlenecks (critical sections, etc.) still exist, and (iii) latency is often important too (one web request on one core of a server, or compiling one straggler extra-large file in a parallel build, for example).
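On (ii), the standard way to quantify that ceiling is Amdahl's law:

    S(N) = 1 / ((1 - p) + p / N)

where p is the parallelizable fraction of the work and N is the core count. Even at p = 0.95, 64 cores only get you about a 15x speedup, and the limit as N grows is 1/(1 - p) = 20x -- which is why the serial path keeps mattering no matter how many threads you can throw at the rest.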