In BLAS terminology this is usually called CGEMM (single precision) or ZGEMM (double precision). Both cuBLAS and rocBLAS support ZGEMM, and the latter is open source. rocBLAS is pretty complicated though, and perhaps not such a good learning resource. Here is a more readable library that implements CGEMM, or at least a similar operation: https://git.astron.nl/RD/recruit/ccglib.
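To make the operation concrete, here is a minimal reference sketch of what ZGEMM computes in the standard BLAS definition, C = alpha*A*B + beta*C over double-precision complex numbers (ignoring the transpose options). The function name `zgemm_reference` and the list-of-rows layout are my own choices for illustration, not anything taken from cuBLAS, rocBLAS or ccglib, and a real GPU implementation would look nothing like this triple loop.

```python
def zgemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for complex matrices.

    Matrices are plain lists of rows. Python's built-in complex type
    stores two doubles, which matches the "Z" (double complex) case.
    Hypothetical helper for illustration only.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape (m, n)")

    out = []
    for i in range(m):
        out_row = []
        for j in range(n):
            acc = 0j
            for p in range(k):
                # One complex multiply-add: 4 real multiplies and 4 real adds.
                acc += a[i][p] * b[p][j]
            out_row.append(alpha * acc + beta * c[i][j])
        out.append(out_row)
    return out


if __name__ == "__main__":
    a = [[1 + 1j, 2], [0, 1j]]
    b = [[1j, 0], [1, 1 - 1j]]
    c = [[0, 0], [0, 0]]
    print(zgemm_reference(1, a, b, 0, c))
```

Something like this is mostly useful as a correctness check for small inputs when experimenting with a GPU kernel or a library call.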

The main issue is that double precision is not so interesting for AI and graphics, so on GPUs aimed at those markets the silicon is spent on lower-precision features rather than on double precision. Not so for HPC, though, and GPUs specialized for HPC usually have much better double-precision throughput. For example, the AMD MI210 has the same performance in single- and double-precision (matrix) operations, while graphics GPUs either run fp64 at some fraction of the fp32 rate (1/2, 1/4, 1/16, etc.) or have no fp64 support at all.