It's pretty slow. Alternatives that use hardware performance counters (such as OProfile and Intel VTune) are far less intrusive. With long-running simulations, or when you are trying to simulate performance under real-world loads, you really don't want the VM overhead that comes with Valgrind/Cachegrind if you can avoid it.
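
If you want to see how much that overhead costs for your own program, a rough way is to time one native run against one run under Cachegrind. The sketch below is only an illustration: `time_command`, `slowdown`, and `cachegrind_slowdown` are hypothetical helper names, it assumes `valgrind` is installed and on your `PATH`, and a single wall-clock measurement is noisy, so treat the result as a ballpark figure.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        argv,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def slowdown(native_seconds, instrumented_seconds):
    """Return how many times slower the instrumented run was."""
    if native_seconds <= 0:
        raise ValueError("native_seconds must be positive")
    return instrumented_seconds / native_seconds


def cachegrind_slowdown(target_argv):
    """Time target_argv natively and under Cachegrind, return the ratio.

    Assumes valgrind is installed. Cachegrind also writes a
    cachegrind.out.<pid> file into the current directory.
    """
    native = time_command(target_argv)
    instrumented = time_command(["valgrind", "--tool=cachegrind", *target_argv])
    return slowdown(native, instrumented)
```

For example, `cachegrind_slowdown(["./my_simulation"])` (with `./my_simulation` standing in for your own binary) returns the ratio of the two run times. For very short programs the ratio mostly reflects Valgrind's startup cost, so it is more meaningful on a workload that runs for at least a few seconds natively.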