If I had to handle this problem, I'd do some kind of "split on existing loaded GPUs" for new sessions, and then when some cap is hit, spool an additional GPU in the background and then transfer new sessions to that GPU as soon as the model is loaded.
I'd have to play with the configuration and load calcs, but I'm sure there's a neat, low-parameter solution to the request/service problem.
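As a rough sketch of the idea (all names, the session cap, and the rebalancing policy here are my own assumptions, not anything prescribed above): route each new session to the least-loaded GPU that already has the model, and once every loaded GPU hits the cap, start warming a spare in the background and shift sessions onto it when it's ready.

```python
from dataclasses import dataclass, field

SESSION_CAP = 4  # hypothetical per-GPU session cap


@dataclass
class Gpu:
    id: int
    model_loaded: bool = True
    sessions: set = field(default_factory=set)


class Router:
    """Toy router: new sessions go to the least-loaded ready GPU;
    when every ready GPU is at the cap, a spare GPU starts loading
    the model in the background."""

    def __init__(self, gpus):
        self.gpus = list(gpus)
        self.warming = []  # GPUs still loading the model

    def assign(self, session_id):
        # "Split on existing loaded GPUs": pick the least-loaded ready one.
        ready = [g for g in self.gpus if g.model_loaded]
        target = min(ready, key=lambda g: len(g.sessions))
        target.sessions.add(session_id)
        # Cap hit everywhere: spool an additional GPU in the background.
        if all(len(g.sessions) >= SESSION_CAP for g in ready) and not self.warming:
            spare = Gpu(id=len(self.gpus), model_loaded=False)
            self.gpus.append(spare)
            self.warming.append(spare)
        return target.id

    def on_model_loaded(self, gpu):
        # Spare is ready: move sessions off the busiest GPU until the
        # two are roughly balanced.
        gpu.model_loaded = True
        self.warming.remove(gpu)
        donor = max((g for g in self.gpus if g is not gpu),
                    key=lambda g: len(g.sessions))
        while len(donor.sessions) - len(gpu.sessions) > 1:
            gpu.sessions.add(donor.sessions.pop())
```

In a real system `on_model_loaded` would be triggered by the load completing on the worker, and you'd migrate only new or idle sessions rather than live ones, but the shape of the cap-then-warm-then-rebalance loop is the same.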