Yes, certainly the vector instructions. There are a few key issues many C compilers seem to run into today:
- Loop unroll identification is really bad. For example, ICC will unroll a single-level loop with a very obvious body and turn it into streaming stores if the increment is written "i++", but not if it is written with the constant, "i+=1".
- The register allocation problem has some subtleties. The SSE registers do double duty: the same XMM register holds either several packed values or a single scalar in its low lane (much like "AX = lower 16 of EAX"), so getting four numbers into the right place without additional moves takes a little more thinking.
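To make the two points above concrete, here is a rough sketch in C. The function names (scale_inc, scale_add1, scale_sse) are made up for illustration, and the hand-vectorized version uses plain unaligned SSE loads and stores rather than streaming stores. It only shows the shape of the code a vectorizer has to recognize, not what any particular compiler emits. The first two functions differ only in how the increment is spelled and should compile identically; the third packs four floats per XMM register by hand and falls back to scalar code where SSE is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

/* Same loop, increment spelled "i++". */
void scale_inc(float *dst, const float *src, float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Same loop, increment spelled "i += 1". Semantically identical to the
   version above, so an optimizer should treat both the same way. */
void scale_add1(float *dst, const float *src, float k, size_t n) {
    for (size_t i = 0; i < n; i += 1)
        dst[i] = src[i] * k;
}

/* Hand-vectorized version (hypothetical helper): four floats per XMM
   register, with a scalar loop for the leftover elements. */
void scale_sse(float *dst, const float *src, float k, size_t n) {
    assert(dst != NULL && src != NULL);
    size_t i = 0;
#ifdef HAVE_SSE
    __m128 vk = _mm_set1_ps(k);              /* k broadcast to all 4 lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);    /* unaligned packed load */
        _mm_storeu_ps(dst + i, _mm_mul_ps(v, vk));
    }
#endif
    for (; i < n; i++)                       /* scalar tail */
        dst[i] = src[i] * k;
}
```

Comparing the compiler's assembly output for the first two functions is a quick way to check whether the spelling of the increment changes what your compiler does.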
But there's also very little control-flow analysis or global program analysis done, beyond some basic link-time code generation and profile-guided optimization. There's a lot you can do (e.g. cross-module inlining; monomorphizing to remove uniform representation; dramatic representation changes of datatypes), though admittedly some of it requires a more static language with some additional semantic guarantees.
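As a small illustration of the monomorphization point, here is a sketch in C with hypothetical names (fold_generic, add_int, fold_int_sum). The generic version uses a uniform representation: every element is reached through a void pointer and combined through an indirect call. The second function is roughly what a whole-program optimizer could produce after specializing that code for int, at which point the loop becomes an ordinary candidate for unrolling and vectorization.

```c
#include <assert.h>
#include <stddef.h>

/* Uniform representation: elements are opaque bytes, and the combining
   step is an indirect call the optimizer usually cannot see through. */
typedef void (*accum_fn)(void *acc, const void *elem);

void fold_generic(void *acc, const void *base, size_t n,
                  size_t elem_size, accum_fn f) {
    assert(f != NULL);
    const char *p = (const char *)base;
    for (size_t i = 0; i < n; i++)
        f(acc, p + i * elem_size);
}

/* One concrete instantiation of the callback: integer addition. */
void add_int(void *acc, const void *elem) {
    *(int *)acc += *(const int *)elem;
}

/* Monomorphized by hand for int: no void pointers, no indirect call,
   and the element type and size are known statically. */
int fold_int_sum(const int *xs, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += xs[i];
    return acc;
}
```

Both versions compute the same sum. The difference is how much the compiler can know about the loop body without looking across module boundaries.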
What do you think are the most promising directions to improve in this area?