InfiniBand uses RDMA, which is different from ordinary DMA. Your IB card sends the data to the client point to point, and the client's IB card writes it directly into RAM. The IB driver then notifies you that the data has arrived (generally via IB-accelerated MPI), and you LOAD your data directly from the memory location [0].
IOW, your data magically appears in your application's memory, at the correct place. This is what makes Mellanox special, and what led NVIDIA to acquire them.
From the linked document:
Instead of sending the packet for processing to the kernel and copying it into the memory of the user application, the host adapter directly places the packet contents in the application buffer.
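To make the "directly places the packet contents in the application buffer" part concrete, here is a toy sketch in plain Python. It is not real verbs code: `RegisteredBuffer` and `rdma_write` are hypothetical names I made up, and a list stands in for the completion queue. It only shows the shape of the idea: the write lands in the application's buffer, and the application finds out through a completion entry.

```python
# Toy model only: plain Python objects stand in for the adapter and its queues.
# RegisteredBuffer and rdma_write are hypothetical names, not a real verbs API.

class RegisteredBuffer:
    """Application memory that has been registered with the (imaginary) adapter."""

    def __init__(self, size: int) -> None:
        self.mem = bytearray(size)   # the application's own buffer
        self.completions = []        # stand-in for a completion queue


def rdma_write(target: RegisteredBuffer, offset: int, payload: bytes) -> None:
    """Place the payload straight into the target buffer, then post a completion."""
    if offset < 0 or offset + len(payload) > len(target.mem):
        raise ValueError("write falls outside the registered region")
    view = memoryview(target.mem)                 # no intermediate staging buffer
    view[offset:offset + len(payload)] = payload  # data lands in place
    target.completions.append((offset, len(payload)))


buf = RegisteredBuffer(16)
rdma_write(buf, 4, b"data")

# The application learns about the arrival from the completion entry,
# then reads the bytes where they already are.
offset, length = buf.completions.pop(0)
received = bytes(buf.mem[offset:offset + length])
print(received)
```

On real hardware the registration, the write and the completion are handled by the adapter and its driver; the sketch only mirrors the order of events.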
In an IB network, two cards connect point to point through the switch and "beam" one's RAM contents to the other. On top of that, with accelerated MPI, certain operations (like broadcast, sum, etc.) are offloaded to the IB cards and IB switches, so the MPI library running on the host doesn't have to handle or worry about them, leaving time and processor cycles for the computation itself.
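For anyone unfamiliar with what a collective like "sum" means here, a minimal sketch of the semantics in plain Python. This does not use MPI or any IB hardware; `fabric_allreduce_sum` and `fabric_broadcast` are hypothetical names standing in for what an offloaded collective would deliver. In a real program these would be calls such as `MPI_Allreduce` with `MPI_SUM`, or `MPI_Bcast`.

```python
from typing import List

# Semantics only: each inner list is the vector one rank contributes.
# The function names are made up for illustration; nothing here talks to a network.

def fabric_allreduce_sum(contributions: List[List[float]]) -> List[List[float]]:
    """Every rank contributes a vector; every rank gets the element-wise sum back."""
    if not contributions:
        raise ValueError("need at least one rank")
    length = len(contributions[0])
    if any(len(vec) != length for vec in contributions):
        raise ValueError("all ranks must contribute vectors of the same length")
    total = [sum(column) for column in zip(*contributions)]
    # Each rank receives its own copy of the reduced vector.
    return [list(total) for _ in contributions]


def fabric_broadcast(contributions: List[List[float]], root: int) -> List[List[float]]:
    """Every rank ends up with a copy of the root rank's vector."""
    return [list(contributions[root]) for _ in contributions]


ranks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
summed = fabric_allreduce_sum(ranks)
copied = fabric_broadcast(ranks, root=1)
print(summed[0], copied[0])
```

The point of the offload is that the host only sees something like `summed[rank]` show up; the adding happens elsewhere.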
IB didn't invent RDMA, and it's not even the only way to do it today.
It's also not amazingly great, since it only solves a small fraction of the cluster-communication problem. (That is, almost no program can rely on magic RDMA getting everything where it needs to be; there will always be at least some corresponding "heavyweight" messaging, since you still need locks and other synchronization.)
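A rough illustration of the synchronization point, using threads in one Python process as a stand-in for two nodes. This is an analogy, not RDMA: the `remote_writer` thread plays the remote side, `shared` plays the target buffer, and a `threading.Event` plays the separate "it's there now" message. All names are made up for the sketch.

```python
import threading

# Analogy only: a second thread stands in for the remote node doing a one-sided write.
shared = bytearray(8)        # the buffer the "remote" side writes into
ready = threading.Event()    # the extra notification the reader still needs


def remote_writer() -> None:
    shared[:] = b"payload!"  # the data shows up in place
    ready.set()              # separate signal: the write is complete


writer = threading.Thread(target=remote_writer)
writer.start()

# Without this wait the reader could see the buffer before it is filled.
signalled = ready.wait(timeout=5)
result = bytes(shared)
writer.join()
print(signalled, result)
```

The data movement itself needs no cooperation from the reader, but knowing when it is safe to read still does.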
I’ve used other peripherals that did this. Under the hood you have a virtual mapping to a physical address and extent, where the virtual mapping is in the address space of your process. This is how DMA works in QNX, because drivers are userspace processes. The special thing here is essentially doing the math in the same process as the driver.
I agree that sounds very nice for distributed computation.
> The special thing here is essentially doing the math in the same process as the driver.
No, you're doing the MPI operations on the switch fabric and the IB ASIC itself. The CPU doesn't touch these operations; it only sees the result. NVIDIA's DPU is just a more general-purpose version of this.