I returned an M2 Max Mac Studio with 96GB RAM; unquantized Llama 3.1 70B was dog slow, nowhere near an interactive pace. I'm interested in offline LLMs, but I couldn't see how it was going to produce $3,000 of ROI.
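For a rough sense of why it was slow: token generation on these machines is mostly memory-bandwidth bound, since every generated token has to read the full set of weights. A back-of-envelope sketch (the bandwidth and size figures are my assumptions, not measurements):

    # Rough, hedged estimate of decode speed on a bandwidth-bound setup.
    # Assumptions (mine, not from the post): M2 Max ~400 GB/s memory
    # bandwidth; Llama 3.1 70B at fp16 ("unquantized") is 2 bytes/param.

    params = 70e9            # 70B parameters
    bytes_per_param = 2      # fp16
    bandwidth_bytes_s = 400e9  # approximate M2 Max unified memory bandwidth

    model_bytes = params * bytes_per_param           # ~140 GB
    tokens_per_sec = bandwidth_bytes_s / model_bytes  # upper bound

    print(f"model size: {model_bytes / 1e9:.0f} GB")
    print(f"bandwidth-limited ceiling: ~{tokens_per_sec:.1f} tokens/s")
    # ~2.9 tokens/s at best -- and 140 GB doesn't even fit in 96 GB of
    # RAM, so a real run swaps or falls back to a quantized variant.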
It would be really cool if there were an "are we there yet?" website for reasonable offline AI.
It could track different hardware configurations and reasonably standardized benchmark performance per model. I know there are benchmarks buried in the Llama repository on GitHub.
There seems to be a LOT of interest in such a site in the comments here. There also seem to be multiple IP issues with sharing your code repo with an online service, so I feel a lot of folks are waiting for local hardware to make this possible.
We need a SWE-bench for open-source LLMs, and for each model to have 3DMark-like benchmarks on various hardware setups.
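To make that concrete, here's a hypothetical sketch of what one entry on such a site might record; the field names and numbers are my own invention, not an existing schema:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LocalLLMBenchmark:
        model: str             # e.g. "Llama-3.1-70B"
        quantization: str      # e.g. "fp16", "Q4_K_M"
        hardware: str          # e.g. "Mac Studio M2 Max 96GB"
        prompt_tokens: int     # context length used for the run
        prefill_tok_s: float   # prompt-processing speed
        decode_tok_s: float    # generation speed: the "is it interactive?" number
        swe_bench_score: Optional[float] = None  # quality, not just speed

    # Illustrative entry using the ballpark figure estimated above.
    entry = LocalLLMBenchmark(
        model="Llama-3.1-70B", quantization="fp16",
        hardware="Mac Studio M2 Max 96GB",
        prompt_tokens=2048, prefill_tok_s=60.0, decode_tok_s=2.9,
    )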
I get why he calls it a simulator: it can simulate token output. That's an important aspect of evaluating a use case when you need a feel for what a given output rate is actually like, beyond a raw tokens-per-second figure.
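A minimal sketch of that idea, assuming all you want is to feel the difference between output rates (words stand in for tokens here, which is crude but fine for a gut check):

    import sys
    import time

    def simulate_stream(text: str, tokens_per_sec: float) -> None:
        # Replay text word-by-word at a fixed rate so you can feel
        # what, say, 3 tok/s vs 30 tok/s is like interactively.
        delay = 1.0 / tokens_per_sec
        for word in text.split():
            sys.stdout.write(word + " ")
            sys.stdout.flush()
            time.sleep(delay)
        print()

    sample = "The quick brown fox jumps over the lazy dog. " * 4
    simulate_stream(sample, tokens_per_sec=3)   # roughly the unquantized 70B pace
    simulate_stream(sample, tokens_per_sec=30)  # a quantized or smaller model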