Feature Request: Splitting layers according to VRAM usage on multi GPUs setups #12654
Comments
Seems like a good idea. But you need to specifically calculate the total KV cache for these Nemotron models when deciding which layers to split at. For example, if the total KV cache calculated is 10GB and there are four cards, then you split when the cumulative KV cache up to a certain layer hits 2.5GB, 5GB and 7.5GB. For example, suppose the first 10 layers get to 2.6GB, then you split there first. When the next 9 layers get to 5GB, you split there again, and finally, 11 layers later it gets to 7.6GB, so you split there again and leave the remaining 50 layers to the last GPU. Probably the same problem also exists for the similar OpenELM models.
Ah, I think you also need to take the parameter size of each layer into account for a truly even split.
I am not very proficient in LLM architectures, but your algorithm seems exactly right. You divide the total model size in VRAM by the number of GPUs; let's call the resulting VRAM per GPU m. Then for each GPU you go layer by layer until the size on that GPU exceeds m; assume that happens at layer k. Then you compare the VRAM use at k and k-1 and see which one is closer to m, and proceed with the remaining n-k or n-k+1 layers. There is also the possibility of reordering layers, but I think it would increase inter-GPU communication with probably quite a limited positive effect on a more even distribution.
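A minimal sketch of that greedy threshold split, assuming the per-layer VRAM cost (weights plus KV cache) is already known. The names here (`split_layers`, `layer_bytes`, `n_gpus`) are hypothetical and not part of any existing llama.cpp API; this just illustrates the cut-at-k-versus-k-1 decision described above:

```cpp
#include <cstdint>
#include <vector>

// Greedy split over the cumulative per-layer VRAM cost (weights + KV cache).
// For each GPU boundary g the target is g * total / n_gpus; the cut goes at the
// layer count whose prefix sum is closest to that target (the k vs k-1 comparison).
static std::vector<int> split_layers(const std::vector<int64_t> & layer_bytes, int n_gpus) {
    const int n_layers = (int) layer_bytes.size();

    std::vector<double> prefix(n_layers + 1, 0.0);   // prefix[c] = bytes of the first c layers
    for (int i = 0; i < n_layers; ++i) {
        prefix[i + 1] = prefix[i] + (double) layer_bytes[i];
    }

    std::vector<int> cuts;   // cuts[g-1] = layers assigned to GPUs 0..g-1 combined
    int c = 0;
    for (int g = 1; g < n_gpus; ++g) {
        const double target = prefix[n_layers] * g / n_gpus;
        while (c < n_layers && prefix[c] < target) c++;             // first count reaching the target
        if (c > 0 && target - prefix[c - 1] < prefix[c] - target) { // is cutting one layer earlier closer?
            c--;
        }
        cuts.push_back(c);
    }

    std::vector<int> layers_per_gpu(n_gpus);
    int prev = 0;
    for (int g = 0; g + 1 < n_gpus; ++g) {
        layers_per_gpu[g] = cuts[g] - prev;
        prev = cuts[g];
    }
    layers_per_gpu[n_gpus - 1] = n_layers - prev;
    return layers_per_gpu;
}
```

With the 10GB / four-GPU example above, the targets come out as 2.5GB, 5GB and 7.5GB; using prefix sums keeps each boundary decision independent of the others.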
I find that you can use the "-ts" switch to manually allocate VRAM to different GPUs. Maybe you can give that a try until someone makes llama.cpp do it automatically.
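(For anyone else landing here: `-ts` / `--tensor-split` takes a comma-separated list of proportions per device, so something like `-ts 40,60` on a dual-GPU setup would push more layers onto the second card; the exact ratio that balances VRAM for this model would still have to be found by trial and error.)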
Thanks, I will try!
Feature Description
I ran into a problem running the nvidia_Llama-3_3-Nemotron-Super-49B-v1-GGUF model in LM Studio on a dual RTX 3090 setup. LM Studio splits the model evenly among GPUs (the default llama-cli option), but in the case of Nemotron, with its much bigger first layers, this leads to very unequal VRAM usage. This results in OOM when I try to increase the context size while still having plenty of free VRAM on the second GPU. I got exactly the same behavior by using llama-cli with the default even split.
Motivation
This is necessary for any model that has an unbalanced structure, e.g. first layers that are much bigger than later ones. Without this feature a downstream application cannot load the model weights evenly and use VRAM efficiently on a multi-GPU setup, since it has no information about the layer sizes.
Possible Implementation
Please add an equivalent to the --tensor-split option, or change its behavior to split according to VRAM usage rather than the number of layers, for convenient use of models with asymmetric layer sizes. A possible implementation in the case of two GPUs: solve one equation for k, where k is the number of layers offloaded to GPU0, such that the sum of the sizes of the first k layers approximately equals the sum of the sizes of the last n-k layers.
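As a rough sketch of that two-GPU equation, assuming the per-layer sizes are available (`balanced_split_point` and `layer_bytes` below are hypothetical names, not existing llama.cpp code): the split point k is simply the prefix length that minimizes the difference between the two halves, and the resulting layer counts could then be passed to --tensor-split as proportions.

```cpp
#include <cstdint>
#include <vector>

// Two-GPU case: choose k so that the first k layers and the last n-k layers
// take up as close to the same amount of VRAM as possible.
static int balanced_split_point(const std::vector<int64_t> & layer_bytes) {
    int64_t total = 0;
    for (int64_t b : layer_bytes) total += b;

    int64_t prefix    = 0;
    int64_t best_diff = total;
    int     best_k    = 0;
    for (int k = 1; k < (int) layer_bytes.size(); ++k) {
        prefix += layer_bytes[k - 1];            // size of the first k layers
        int64_t diff = 2 * prefix - total;       // prefix - (total - prefix)
        if (diff < 0) diff = -diff;
        if (diff < best_diff) { best_diff = diff; best_k = k; }
    }
    return best_k;   // e.g. usable as "-ts best_k,(n_layers - best_k)"
}
```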