A common metric in the embedded world is DMIPS/MHz. The underlying Dhrystone benchmark is considered a bit antiquated (it dates from the VAX era!), but it is free and simple to implement. The important part is that the figure is (supposedly) independent of clock speed, so it shows the efficiency of your CPU design. For that reason it is normally run from cache to get zero wait states.
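As a rough illustration of how the figure is derived: a Dhrystone run reports iterations per second, DMIPS divides that by 1757 (the Dhrystones per second scored by the VAX 11/780, the nominal 1 MIPS reference machine), and DMIPS/MHz divides again by the clock in MHz. The function names and the measurement values below are made up for the example, not taken from any real part.

```python
# Dhrystones per second scored by the VAX 11/780, the "1 MIPS" reference.
VAX_DHRYSTONES_PER_SECOND = 1757


def dmips(dhrystones_per_second: float) -> float:
    """Convert a raw Dhrystone rate into DMIPS."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


def dmips_per_mhz(dhrystones_per_second: float, clock_hz: float) -> float:
    """Normalize DMIPS by clock frequency in MHz."""
    return dmips(dhrystones_per_second) / (clock_hz / 1e6)


# Hypothetical measurement: 1,000,000 iterations in 5.0 s on a 48 MHz core.
iterations = 1_000_000
elapsed_s = 5.0
clock_hz = 48e6

rate = iterations / elapsed_s
print(f"{dmips(rate):.1f} DMIPS, {dmips_per_mhz(rate, clock_hz):.2f} DMIPS/MHz")
```

Keep in mind that the number only means something if the compiler, its flags, and the memory setup (cache, wait states) are reported along with it.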
CoreMark[1] is a newer replacement and is becoming increasingly popular.
Once you implement or simulate your design in silicon, power usage becomes a good comparison metric. As power depends on clock frequency, DMIPS/mW is another common comparison benchmark. Since a lot of embedded applications spend most of their time in very low-power states with the core stopped, sleep current and wake/sleep time are now becoming very important. This is more of a whole-chip benchmark, and it is a very popular area for microcontroller manufacturers to fight over right now, as results can vary wildly depending on the application. The makers of CoreMark have tried to come out with a low-power benchmark[2], but it doesn't cover peripherals yet and isn't quite as popular.
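To see why sleep current and wake time can matter more than active efficiency, here is a back-of-the-envelope sketch. All names and numbers are hypothetical, and it makes a simplifying assumption: the part draws its full active current during the wake-up transition.

```python
def dmips_per_mw(dmips: float, supply_v: float, active_current_ma: float) -> float:
    """Active efficiency: DMIPS divided by active power (V * mA = mW)."""
    return dmips / (supply_v * active_current_ma)


def average_current_ua(active_ua: float, sleep_ua: float,
                       active_s: float, wake_s: float, period_s: float) -> float:
    """Average current over one wake/work/sleep cycle.

    Assumption: the wake-up transition is billed at the active current.
    """
    awake_s = active_s + wake_s
    if awake_s > period_s:
        raise ValueError("active + wake time exceeds the cycle period")
    asleep_s = period_s - awake_s
    return (active_ua * awake_s + sleep_ua * asleep_s) / period_s


# Hypothetical duty-cycled sensor node: wakes once per second, works for 2 ms.
# Part A wakes quickly but leaks more in sleep; part B is the opposite.
part_a = average_current_ua(active_ua=5000, sleep_ua=2.0,
                            active_s=0.002, wake_s=0.0001, period_s=1.0)
part_b = average_current_ua(active_ua=5000, sleep_ua=0.5,
                            active_s=0.002, wake_s=0.002, period_s=1.0)
print(f"part A: {part_a:.1f} uA average, part B: {part_b:.1f} uA average")
```

Change the period or the work time and the ranking of the two parts can flip, which is why results vary so much from one application to the next.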
[1] https://www.eembc.org/coremark/
[2] http://www.eembc.org/ulpbench/