I returned an M2 Max Mac Studio with 96GB RAM; unquantized Llama 3.1 70B was dog slow, nowhere near an interactive pace. I'm interested in offline LLMs, but I couldn't see how it was going to produce $3,000 of ROI.
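For a rough sense of why it was slow: token generation on these machines is mostly memory-bandwidth bound, since every generated token has to read the full set of weights. A back-of-envelope sketch (the bandwidth and size figures are my assumptions, not measurements):

    # Rough, hedged estimate of decode speed on a bandwidth-bound setup.
    # Assumptions (mine, not from the post): M2 Max ~400 GB/s memory
    # bandwidth; Llama 3.1 70B at fp16 ("unquantized") is 2 bytes/param.

    params = 70e9            # 70B parameters
    bytes_per_param = 2      # fp16
    bandwidth_bytes_s = 400e9  # approximate M2 Max unified memory bandwidth

    model_bytes = params * bytes_per_param           # ~140 GB
    tokens_per_sec = bandwidth_bytes_s / model_bytes  # upper bound

    print(f"model size: {model_bytes / 1e9:.0f} GB")
    print(f"bandwidth-limited ceiling: ~{tokens_per_sec:.1f} tokens/s")
    # ~2.9 tokens/s at best -- and 140 GB doesn't even fit in 96 GB of
    # RAM, so a real run swaps or falls back to a quantized variant.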
It would be really cool if there were an "are we there yet?" website for reasonable offline AI.
It could track different hardware configurations and reasonably standardized benchmark performance per model. I know there are benchmarks buried in the Llama repository on GitHub.
There seems to be a LOT of interest in such a site in the comments here. There also seem to be multiple IP issues with sharing your code repo with an online service, so I feel a lot of folks are waiting for local hardware to make this possible.
We need a SWE-bench for open-source LLMs, and for each model to have 3DMark-like benchmarks on various hardware setups.
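To make that concrete, here's a hypothetical sketch of what one entry on such a site might record; the field names and numbers are my own invention, not an existing schema:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LocalLLMBenchmark:
        model: str             # e.g. "Llama-3.1-70B"
        quantization: str      # e.g. "fp16", "Q4_K_M"
        hardware: str          # e.g. "Mac Studio M2 Max 96GB"
        prompt_tokens: int     # context length used for the run
        prefill_tok_s: float   # prompt-processing speed
        decode_tok_s: float    # generation speed: the "is it interactive?" number
        swe_bench_score: Optional[float] = None  # quality, not just speed

    # Illustrative entry using the ballpark figure estimated above.
    entry = LocalLLMBenchmark(
        model="Llama-3.1-70B", quantization="fp16",
        hardware="Mac Studio M2 Max 96GB",
        prompt_tokens=2048, prefill_tok_s=60.0, decode_tok_s=2.9,
    )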
I get why he calls it a simulator: it can simulate token output. That's an important aspect of evaluating a use case when you need a feel for what a given output rate is actually like, beyond a raw tokens-per-second figure.
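A minimal sketch of that idea, assuming all you want is to feel the difference between output rates (words stand in for tokens here, which is crude but fine for a gut check):

    import sys
    import time

    def simulate_stream(text: str, tokens_per_sec: float) -> None:
        # Replay text word-by-word at a fixed rate so you can feel
        # what, say, 3 tok/s vs 30 tok/s is like interactively.
        delay = 1.0 / tokens_per_sec
        for word in text.split():
            sys.stdout.write(word + " ")
            sys.stdout.flush()
            time.sleep(delay)
        print()

    sample = "The quick brown fox jumps over the lazy dog. " * 4
    simulate_stream(sample, tokens_per_sec=3)   # roughly the unquantized 70B pace
    simulate_stream(sample, tokens_per_sec=30)  # a quantized or smaller model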