You should take a look at cilk. They compile two versions of your code: one sequential, one parallel. The runtime switches between them dynamically. This avoids paying a lot of overhead managing parallelism for small subroutines, but still presents the programmer with a unified view of very fine grained concurrency.