Technically speaking, the breakthrough was also that it allowed the computation to be parallelized. Instead of stepping through a sequence word by word, with each step waiting on the one before it, the approach shifted to handling every word position at the same time: each word is compared against the other words in the sequence, and the same statistical approach to predicting the next word is applied at each position. The final output for a position is then a weighted sum of these independently computed pieces.
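To make the "weighted sum" idea concrete, here is a minimal sketch in plain Python. It assumes the mechanism being described is scaled dot-product attention, and it leaves out things a real model would have, such as learned projections, multiple heads, and masking of future words. The vectors and the names `attend`, `softmax`, and `dot` are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(i, vectors):
    """Output for position i: a weighted sum of all word vectors.

    The result depends only on the fixed inputs, never on the output of
    another position, so every position could be computed at the same time.
    """
    dim = len(vectors[i])
    scale = math.sqrt(dim)
    # Score word i against every word, then normalize the scores to weights.
    weights = softmax([dot(vectors[i], v) / scale for v in vectors])
    # Blend all word vectors using those weights.
    out = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return weights, out

# Hypothetical 2-dimensional vectors for a three-word sequence.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Positions are independent, so the order of evaluation does not matter.
forward = [attend(i, vectors) for i in range(len(vectors))]
backward = [attend(i, vectors) for i in reversed(range(len(vectors)))][::-1]
```

Because `attend` never reads another position's result, the loop over positions can be handed to parallel hardware as a single batch of matrix operations. A word-by-word recurrent model cannot do this, since step t needs the state produced by step t-1.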