llama : add llama_batch_ext #11875
Conversation
@ggerganov Would you mind having a look at this initial proposal? Thank you!
include/llama.h
Outdated
struct llama_batch_ext_token_info {
    llama_token    token;
    llama_pos      pos;
    int32_t        n_seq_id;
    llama_seq_id * seq_id;
    int8_t         logits;
};
This might not be very future-proof. Mixed-modality batches would have tokens, embeddings and tensors mixed together in the same batch, so calling `llama_batch_ext_get_token_info(batch, i)` is not always well-defined because the entry at position `i` might not be a token.
Maybe we can postpone this "token_info" API. I think all usages in the examples that require reading back info from the batch can be implemented in the example code without relying on the API. This way we can focus on implementing only the API for creating batches and adding data to them. Later on, when we have a better idea of the implementation, we can add a helper API to get info back from the batches.
Yes I agree. Furthermore, this API requires doing a copy, so it won't be the best for performance. It's better to remove this API for now.
I think all usages in the examples that require to read back info from the batch can be implemented in the example code without relying on the API.
This kind of logic is currently being used inside `llama-server`; I'm not sure whether it appears in any other examples. I think I can make a thin wrapper for `llama_batch_ext` inside the example code. Feel free to tell me if you have a better idea.
This API is removed in 1d6ba97. A new `server_batch` wrapper is added to manage token logits placement in the batch.
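For illustration, here is a minimal sketch of what such a thin wrapper could look like. This is a hypothetical outline, not the actual `server_batch` code from 1d6ba97, and the `llama_batch_ext_add_text` call and its signature are assumptions:

#include <vector>
#include "llama.h"

// Hypothetical sketch: remember the batch index of every token that requests logits,
// so the logits can later be read back with llama_get_logits_ith() after decoding.
// llama_batch_ext_add_text() and its signature are assumed for illustration.
struct batch_wrapper {
    llama_batch_ext * batch = nullptr;
    int32_t n_added = 0;
    std::vector<int32_t> output_ids; // batch indices of tokens that request logits

    void add_token(llama_token tok, llama_pos pos, llama_seq_id seq, bool output) {
        llama_batch_ext_add_text(batch, tok, pos, &seq, 1, output); // assumed API
        if (output) {
            output_ids.push_back(n_added);
        }
        n_added++;
    }
};

// after llama_decode(), the logits of the k-th output token would be read as:
//   const float * logits = llama_get_logits_ith(ctx, wrapper.output_ids[k]);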
OK so I've been able to apply this to various examples (not all of them). It would be nice if you could have a quick look @ggerganov before I migrate the rest. One thing to note: the loop check over tokens in the batch (discussed in #11875 (comment)) is used by both.
It seems we rather need something to query the batch, no? How do you imagine it? I was thinking something like:

struct llama_batch_ext_part;

llama_batch_ext_part * part = llama_batch_ext_get_part(batch, i);

if (llama_batch_ext_part_is_token(part)) {
    llama_token id = llama_batch_ext_part_get_id(part);
    // ... get token id, sequence id, etc. ...
}

But since I'm not 100% sure about all the details related to multi-modal batches yet, I think it is better to postpone this API for later and handle the batch information in the user code for now.
I don't have a clear idea yet, but I'm thinking as a developer using
So when I retrieve back the logits/embeddings, I would imagine that the
Yes we can, and this will be quite similar to my point above. I'm thinking about these 2 options:
Hm, yes. Btw, this makes me wonder if we should actually move the output buffers for the logits and the embeddings to be owned by the
Now that #12181 has been merged, it should be a good time to get this merged too.
Yes, thanks for the heads-up, I'll focus on finishing this today & tomorrow.
If the output logits and embeddings are staying
@ggerganov Could you take a look at this PR this week? Thanks!
Sorry for the delay. I still have some concerns:
- I am thinking that it would be a good idea to make the `llama_batch_ext` API treat tokens and embeddings in a very similar manner. Ultimately, I think we should be able to create batches that contain both tokens and embeddings. For example, the call:
LLAMA_API struct llama_batch_ext * llama_batch_ext_init(
        int32_t n_tokens,
        int32_t n_seq_max);
might be better to define as:
// either one of these - not sure which one yet
LLAMA_API struct llama_batch_ext * llama_batch_ext_init(
        struct llama_model * model,
        int32_t n_seq_max);

// this one will figure out `n_seq_max` from the context
// maybe actually this one is the best
LLAMA_API struct llama_batch_ext * llama_batch_ext_init(
        struct llama_context * ctx);
Passing `n_tokens` to the batch init call was necessary in the past in order to pre-allocate an array with enough size. But it is technically redundant information, because we can add new tokens and embeddings and dynamically resize the batch in `libllama` as needed. So I think there is no longer a need to provide this information.
- I feel like the embeddings API should mirror the tokens API. So instead of:
LLAMA_API struct llama_batch_ext * llama_batch_ext_init_from_embd(
        const float * embd,
        size_t        n_tokens,
        size_t        n_embd,
        llama_pos     pos0,
        llama_seq_id  seq_id);

LLAMA_API int32_t llama_batch_ext_set_pos(struct llama_batch_ext * batch, llama_pos * pos, size_t n_pos);
we should probably have something like:
LLAMA_API int32_t llama_batch_ext_add_embd(
        struct llama_batch_ext * batch,
        const float * embd,            // size would be inferred from the context
        llama_pos * pos,               // multiple pos per embd
        size_t n_pos,
        const llama_seq_id * seq_ids,
        size_t n_seq_ids,
        bool output);
I think these changes would help us think about tokens and embeddings as something very similar.
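A minimal usage sketch of how the two proposals could fit together; `llama_batch_ext_add_text`, the exact signatures, and the caller-provided `n_prompt`, `prompt_tokens` and `image_embd` are all assumptions for illustration, not the final API:

// sketch only: batch sizes are inferred from the context, entries are appended one by one
struct llama_batch_ext * batch = llama_batch_ext_init(ctx);

llama_seq_id seq_id = 0;

// add the prompt tokens, requesting logits only for the last one
for (int32_t i = 0; i < n_prompt; i++) {
    const bool output = (i == n_prompt - 1);
    llama_batch_ext_add_text(batch, prompt_tokens[i], /*pos*/ i, &seq_id, 1, output); // assumed API
}

// add one image embedding entry with multiple positions (e.g. for M-RoPE-style models)
llama_pos pos[4] = { (llama_pos) n_prompt, 0, 0, 0 };
llama_batch_ext_add_embd(batch, image_embd, pos, /*n_pos*/ 4, &seq_id, /*n_seq_ids*/ 1, /*output*/ false);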
Nice, thanks for the comment. Yes, I agree with your points. For
Ok, so I implemented the new API. For now, qwen2vl-cli is broken because the data is laid out differently from what the cgraph accepts. One layout is the transposed version of the other, so it will be simple to convert. But before implementing this, I just want to check with you if my direction still looks ok.
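For reference, a small sketch of such a transpose, assuming the two layouts in question are token-major positions (all dimensions of token 0, then token 1, ...) and dimension-major positions (all x, then all y, then all t); the concrete layouts in the PR may differ:

// convert positions from token-major layout (n_dims values per token, grouped by token)
// to dimension-major layout (grouped by dimension); layouts are assumed for illustration
static void transpose_positions(const llama_pos * src, llama_pos * dst, int32_t n_tokens, int32_t n_dims) {
    for (int32_t t = 0; t < n_tokens; t++) {
        for (int32_t d = 0; d < n_dims; d++) {
            dst[d*n_tokens + t] = src[t*n_dims + d];
        }
    }
}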
src/llama-batch.cpp
Outdated
-struct llama_batch_ext * llama_batch_ext_init(int32_t n_tokens_alloc, int32_t n_seq_max) {
-    return llama_batch_ext_init_impl(n_tokens_alloc, 0, n_seq_max);
+struct llama_batch_ext * llama_batch_ext_init(struct llama_context * ctx) {
+    return llama_batch_ext_init_impl(llama_n_batch(ctx), 0, llama_n_seq_max(ctx));
Yes, for now this is a good solution. Later, we will be resizing dynamically and will not need `llama_n_batch()`.
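A rough sketch of what dynamic resizing could look like inside `libllama`; this is purely illustrative and does not reflect the actual `llama_batch_ext` internals:

#include <vector>
#include "llama.h" // llama_token, llama_pos

// illustrative only: if the batch stores its data in std::vector, adding an entry
// grows the storage on demand, so no pre-allocated n_batch capacity is required
struct llama_batch_ext_sketch {
    std::vector<llama_token> token;
    std::vector<llama_pos>   pos;
    std::vector<int8_t>      output;

    int32_t add(llama_token id, llama_pos p, bool out) {
        token .push_back(id);
        pos   .push_back(p);
        output.push_back(out ? 1 : 0);
        return (int32_t) token.size() - 1; // index of the new entry in the batch
    }
};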
I think this looks good.
I implemented the fix and tested it. A bit confused about one thing: I found a nice illustration on the qwen2vl model page on HF which shows that qwen only uses 3 pos per token. It's also confirmed by the config.json file. However, I'm not sure why in llama.cpp we use up to 4 pos.
cc @HimariO
examples/server/server.cpp
Outdated
@@ -1963,7 +1963,7 @@ struct server_context {
    const int32_t n_batch = llama_n_batch(ctx);

    // only a single seq_id per token is needed
This comment is obsolete.
Yeah, I think I removed it in one of the commits above; we don't need `n_batch` anymore, so I removed this whole code block.
The fourth position ID is mainly for future-proofing, in case a newer model that takes 3D/depth input (like SpatialLM) is added. Currently, both Qwen2 & 2.5 VL only use 3 position IDs per token.
@ngxson We discussed these changes with @slaren and he raised a good point that the batch API does not need to explicitly pass token positions; these can be inferred from the KV cache. Since not having to pass explicit token positions would simplify the batch API, it's a good idea to take this into account when redesigning it. So I will try to do some KV cache refactoring to support this. When I'm ready, I will come back to this PR and update it accordingly.
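As a rough illustration of what "inferred from the KV cache" could mean in practice (the `kv_seq_pos_max` helper below is hypothetical, standing in for whatever internal query `libllama` ends up using):

// hypothetical sketch: when a token is added without an explicit position, default it
// to one past the largest position already stored in the KV cache for that sequence
static llama_pos infer_next_pos(llama_seq_id seq_id) {
    const llama_pos p_max = kv_seq_pos_max(seq_id); // assumed helper, returns -1 for an empty sequence
    return p_max + 1;                               // an empty sequence starts at position 0
}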
Thanks for the heads-up! Yes, for text batches it would be nice if the positions could be inferred from the KV cache. Please also note that for multimodal batches we may also need the N-dimensional position. For example, in the case of Qwen2VL we have a normal position plus an additional 2D position for each image token. I think what we can do is that the 2D coordinate position can be given by the user, while the "normal" position can still be inferred from the KV cache. So, for example in this PR, the API for Qwen will accept
I think the Qwen2VL image positions could be inferred from the token position (see llama.cpp/examples/llava/qwen2vl-cli.cpp, lines 39 to 50 at 3fd072a).
So ideally, the user would not have to pass those either.
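As a rough paraphrase of that idea (not a verbatim copy of the referenced lines; the exact Qwen2VL layout may differ), the 2D positions for an image of `ph x pw` patches starting at text position `pos0` could be generated like this:

// sketch: fill M-RoPE-style positions in dimension-major order for ph x pw image patches;
// each patch gets (temporal, row, column) coordinates and the fourth component is unused
static void fill_image_positions(std::vector<llama_pos> & pos, llama_pos pos0, int ph, int pw) {
    const int n_patches = ph*pw;
    pos.resize(4*n_patches);
    for (int y = 0; y < ph; y++) {
        for (int x = 0; x < pw; x++) {
            const int i = y*pw + x;
            pos[i + 0*n_patches] = pos0;     // temporal position, shared by all patches
            pos[i + 1*n_patches] = pos0 + y; // 2D row position
            pos[i + 2*n_patches] = pos0 + x; // 2D column position
            pos[i + 3*n_patches] = 0;        // unused fourth dimension
        }
    }
}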
We still need to know the image size. In the far future (or not that far?)
Yes, the multi-dim positions are complicating things. Not sure what the best solution is.
Maybe with Gemma demonstrating that this is not necessary, new models won't need these complications and we won't have to add support at all. Are there models other than Qwen2VL that need 2D positions?
This seems like something the user code can implement (similar to how bos/eos tokens are added). Which models use this pattern? Anyway, the KV cache refactoring can be done before making a decision about how to handle the images, so we can re-discuss this after that.
Just to clarify before writing my response, there are 2 reasons why the image size is needed:
Hmm, yeah, that could be right. Because M-RoPE was invented by Qwen, I don't see anyone else using it for now. Not sure if other models will adopt it in the future. But please note that gemma 3 does not use slices. This makes working with gemma 3 vision easy, but the current problem with gemma 3 is that the image size is fixed. For bigger images, it needs to rely on a technique called "pan and zoom", which is essentially a prompting technique that allows the model to "ask" the runtime to zoom the image and then rerun the generation. This is obviously very inefficient. Models like SmolVLM, MiniCPM-V, Qwen (and maybe many others) are already using the slicing technique that I mentioned earlier, so we definitely need to support this in the API.
In fact, a better way to think about it is that this is the "chat template" for images. While it can be implemented in user code, I think it's better to make it transparent from the user's POV: this part is model-specific and even harder than a normal text chat template, so it is not something the user can easily debug. In gemma3-cli, you can see that the
@ggerganov I've been working on audio input and output recently and I think the API proposed by this PR pretty much corresponds to what I need (ofc except for the position, which would be nicer to have hidden from user code). Having this PR merged could save me some effort and, more importantly, unblock my research on the multimodal API, so I'm wondering if there is anything I could do on my side to accelerate this a bit more? Thank you! Also, now that the position is hidden from user code, I think we should also somehow modify the
My plan for the next steps was to refactor the
Ref comment: #11292 (comment)
Closes #10381
Migration patterns:
Current status:
- `llama_batch` from public API --> To be discussed
- `llama-server` works for now
- can be migrated to cpp types