Are you just trying to host a llama server?
Matching the VRAM doesn't necessarily matter; get the most you can afford on a single card. Splitting across more than 2 cards doesn't work well at the moment.
Getting a non-Nvidia card is a problem for certain backends (like exLLaMA), but should be fine for llama.cpp in the near future.
AFAIK most backends are not pipelined; the load moves sequentially from one GPU to the next, so the cards mostly take turns rather than working in parallel.
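
If you do end up hosting with llama.cpp, here's a rough sketch of what a two-GPU setup looks like in practice. This assumes a build where the server binary is called `llama-server` (older builds named it `server`), a GGUF model at `./model.gguf`, and a 60/40 VRAM split between the cards — all of those are placeholders, adjust to your setup:

```python
# Minimal sketch: launch llama.cpp's HTTP server split across two GPUs, then query it.
import json
import subprocess
import time
import urllib.request

server = subprocess.Popen([
    "llama-server",
    "-m", "./model.gguf",         # path to your GGUF model (placeholder)
    "--n-gpu-layers", "99",       # offload as many layers as fit in VRAM
    "--tensor-split", "0.6,0.4",  # rough VRAM ratio between the two cards
    "--host", "127.0.0.1",
    "--port", "8080",
])

time.sleep(10)  # crude wait for the model to load; poll /health in real use

# Hit the server's /completion endpoint.
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps({"prompt": "Hello, llama!", "n_predict": 32}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])

server.terminate()
```

Even with `--tensor-split`, the point above still applies: the layers are divided between the cards but processed one GPU at a time, so you get more total VRAM, not 2x the speed.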