Feature Request: Splitting layers according to VRAM usage on multi GPUs setups #12654
Comments
Seems like a good idea. But you need to specifically calculate the total KV cache for these Nemotron models when deciding which layers to split at. For example, if the total KV cache calculated is 10GB and there are four cards, then you split when the cumulative KV cache up to a certain layer hits 2.5GB, 5GB and 7.5GB. For example, suppose the first 10 layers get to 2.6GB, then you split there first. When the next 9 layers get to 5GB, you split there again, and finally, 11 layers later it gets to 7.6GB, so you split there again and leave the remaining 50 layers to the last GPU. Probably the same problem also exists for the similar OpenELM models.
Ah, I think you also need to take the parameter size of each layer into account for a truly even split.
I am not very proficient in LLM architectures, but your algorithm seems exactly right. You divide the total model size in VRAM by the number of GPUs; let's call the resulting VRAM per GPU m. Then for each GPU you go layer by layer until the size on that GPU exceeds m; assume that happens at layer k. Then you compare the VRAM use at k and k-1 and see which one is closer to m, and proceed with the remaining n-k or n-k+1 layers. There is also the possibility of reordering layers, but I think it would increase inter-GPU communication with probably quite a limited positive effect on a more even distribution.
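A minimal sketch of that greedy threshold split, assuming the per-layer VRAM cost (weights plus KV cache) is already known. The names here (`split_layers`, `layer_bytes`, `n_gpus`) are hypothetical and not part of any existing llama.cpp API; this just illustrates the cut-at-k-versus-k-1 decision described above:

```cpp
#include <cstdint>
#include <vector>

// Greedy split over the cumulative per-layer VRAM cost (weights + KV cache).
// For each GPU boundary g the target is g * total / n_gpus; the cut goes at the
// layer count whose prefix sum is closest to that target (the k vs k-1 comparison).
static std::vector<int> split_layers(const std::vector<int64_t> & layer_bytes, int n_gpus) {
    const int n_layers = (int) layer_bytes.size();

    std::vector<double> prefix(n_layers + 1, 0.0);   // prefix[c] = bytes of the first c layers
    for (int i = 0; i < n_layers; ++i) {
        prefix[i + 1] = prefix[i] + (double) layer_bytes[i];
    }

    std::vector<int> cuts;   // cuts[g-1] = layers assigned to GPUs 0..g-1 combined
    int c = 0;
    for (int g = 1; g < n_gpus; ++g) {
        const double target = prefix[n_layers] * g / n_gpus;
        while (c < n_layers && prefix[c] < target) c++;             // first count reaching the target
        if (c > 0 && target - prefix[c - 1] < prefix[c] - target) { // is cutting one layer earlier closer?
            c--;
        }
        cuts.push_back(c);
    }

    std::vector<int> layers_per_gpu(n_gpus);
    int prev = 0;
    for (int g = 0; g + 1 < n_gpus; ++g) {
        layers_per_gpu[g] = cuts[g] - prev;
        prev = cuts[g];
    }
    layers_per_gpu[n_gpus - 1] = n_layers - prev;
    return layers_per_gpu;
}
```

With the 10GB / four-GPU example above, the targets come out as 2.5GB, 5GB and 7.5GB; using prefix sums keeps each boundary decision independent of the others.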
I find that you can use the "-ts" switch to manually allocate VRAM to different GPUs. Maybe you can give that a try until someone makes llama.cpp do it automatically.
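(For anyone else landing here: `-ts` / `--tensor-split` takes a comma-separated list of proportions per device, so something like `-ts 40,60` on a dual-GPU setup would push more layers onto the second card; the exact ratio that balances VRAM for this model would still have to be found by trial and error.)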
Thanks, I will try!
Feature Description
I ran into a problem running the nvidia_Llama-3_3-Nemotron-Super-49B-v1-GGUF model in LM Studio on a dual RTX 3090 setup. LM Studio splits the model evenly among GPUs (the default llama-cli option), but in the case of Nemotron, with its much bigger first layers, this leads to very unequal VRAM usage. This results in OOM when I try to increase the context size while still having plenty of free VRAM on the second GPU. I got exactly the same behavior by using llama-cli with the default even split.
Motivation
This is necessary for any model that has an unbalanced structure, e.g. first layers that are much bigger than later ones. Without this feature a downstream application cannot load the model weights evenly and use VRAM efficiently on a multi-GPU setup, since it has no information about the layer sizes.
Possible Implementation
Please add an equivalent to the --tensor-split option, or change its behavior to split according to VRAM usage rather than the number of layers, for convenient use of models with asymmetric layer sizes. A possible implementation in the case of two GPUs: solve one equation for k, where k is the number of layers offloaded to GPU0, such that the sum of the sizes of the first k layers approximately equals the sum of the sizes of the last n-k layers.
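As a rough sketch of that two-GPU equation, assuming the per-layer sizes are available (`balanced_split_point` and `layer_bytes` below are hypothetical names, not existing llama.cpp code): the split point k is simply the prefix length that minimizes the difference between the two halves, and the resulting layer counts could then be passed to --tensor-split as proportions.

```cpp
#include <cstdint>
#include <vector>

// Two-GPU case: choose k so that the first k layers and the last n-k layers
// take up as close to the same amount of VRAM as possible.
static int balanced_split_point(const std::vector<int64_t> & layer_bytes) {
    int64_t total = 0;
    for (int64_t b : layer_bytes) total += b;

    int64_t prefix    = 0;
    int64_t best_diff = total;
    int     best_k    = 0;
    for (int k = 1; k < (int) layer_bytes.size(); ++k) {
        prefix += layer_bytes[k - 1];            // size of the first k layers
        int64_t diff = 2 * prefix - total;       // prefix - (total - prefix)
        if (diff < 0) diff = -diff;
        if (diff < best_diff) { best_diff = diff; best_k = k; }
    }
    return best_k;   // e.g. usable as "-ts best_k,(n_layers - best_k)"
}
```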