Providers use H100s because deploying 4090s in data centers is a legal grey area: Nvidia's licensing doesn't permit it.
The paper discussed here uses 4-bit compute, which on the 4090 is 4x the BF16 throughput, while the H100 has no INT4 tensor support at all (the best you can get is 2x the throughput with FP8). So this paper evens out the difference between the two to some extent. Judging by theoretical peaks, the H100 has 1979 TFLOPS of FP8 compute and the 4090 has 1321 TOPS of INT4, which puts the 4090 at roughly 67% of the H100's performance. Given its ~$2K price versus ~$30K for an H100, that looks like a very good deal.
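The perf-per-dollar claim can be checked with quick arithmetic using the peak figures and approximate street prices quoted above (the prices are assumptions, not official MSRPs):

```python
# Rough perf-per-dollar comparison using the peak figures quoted above.
h100_tflops_fp8 = 1979    # H100 peak FP8 tensor throughput, TFLOPS
rtx4090_tops_int4 = 1321  # 4090 peak INT4 tensor throughput, TOPS
h100_price = 30_000       # USD, approximate street price (assumption)
rtx4090_price = 2_000     # USD, approximate street price (assumption)

# Raw throughput ratio: 4090 delivers about two-thirds of an H100's peak.
perf_ratio = rtx4090_tops_int4 / h100_tflops_fp8
print(f"4090 vs H100 peak throughput: {perf_ratio:.0%}")  # -> 67%

# Per dollar, the picture flips heavily in the 4090's favor.
advantage = (rtx4090_tops_int4 / rtx4090_price) / (h100_tflops_fp8 / h100_price)
print(f"perf/$ advantage of 4090: {advantage:.1f}x")  # -> 10.0x
```

Note this compares theoretical peaks only; real workloads are often bound by memory bandwidth and capacity, where the H100 holds a much larger lead.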
But again, no 4090s in data centers.