
Parallelization / Batching Explanation #4130

Unanswered
lapp0 asked this question in Q&A
Nov 19, 2023 · 4 comments · 42 replies

Hi All,

I'm seeking clarity on the functionality of the --parallel option in /app/server, especially how it interacts with the --cont-batching parameter. My specific observation involves setting --ctx-size to 8192 and --parallel to 32. From the logs, it appears there are 32 slots, each handling a context segment of 256. My question is: Does this configuration imply that each slot processes a distinct segment of the context?

For instance, if I input 32 instances of an identical prompt with a length of 4096, would the first half of the slots remain idle due to the prompt already existing in the KV cache? This leads to confusion, as it seems the total number of submittable jobs is limited by the slot count. This is puzzling because different prompts might rely on the same slot for varied tasks. If the initial segment of the context is identical for multiple prompts, that segment might not require processing, as it's already in the KV cache.

I'm trying to understand the rationale behind dividing the context into segments when batching. Could you provide an explanation of how the --parallel and --cont-batching options function?


Bump- trying to understand this as well...


The --ctx-size argument actually specifies the total size of the KV cache (legacy name, --kv-size would be better). This corresponds to the total number of tokens that can be stored across all independent sequences.

For example, if we specify --ctx-size 8192, this means that we can process:

  • 2 sequences, each of max length of 4096 tokens
  • 4 sequences, each of max length of 2048 tokens
  • 8 sequences, each of max length of 1024 tokens
  • ...
  • 32 sequences, each of max length of 256 tokens
  • etc.

Simply put, if we want to handle P sequences in parallel and we know that each sequence can have at most T tokens (prompt + generated), then we want to set our KV cache size (i.e. --ctx-size) to T*P in order to be able to handle the worst-case scenario where all P sequences fill up to the maximum of T tokens.
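
For example, the configuration from the original question corresponds to something like the following (a hypothetical invocation; the model path is a placeholder and the flag spellings are those of the server example around the time of this discussion):

    # 32 parallel slots sharing an 8192-token KV cache:
    # worst case per slot is 8192 / 32 = 256 tokens (prompt + generated)
    ./server -m model.gguf --ctx-size 8192 --parallel 32 --cont-batching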

Since llama.cpp implements a "unified" cache strategy, the KV cache is actually shared across all sequences. This means an individual sequence is allowed to exceed T tokens, as long as the sum of tokens across all sequences does not exceed P*T.

From the logs, it appears there are 32 slots, each handling a context segment of 256. My question is: Does this configuration imply that each slot processes a distinct segment of the context?

No. Each sequence has its own context. The tokens from each sequence "see" only the tokens from that same sequence. This is achieved with the KQ_mask:

https://github.com/ggerganov/llama.cpp/blob/dd5ae06405c5565b99889bdb3f168f4351252cfb/llama.cpp#L6328-L6368

Each llama_decode call accepts a llama_batch. The batch can contain an arbitrary set of tokens - each token has its own position and sequence id(s). The position and the sequence ids of a token determine which other tokens (both from the batch and from the KV cache) it will attend to, by constructing the respective KQ_mask. When processed, the batch of tokens is stored in the KV cache. This way, as long as there are enough free slots in the KV cache, we can call llama_decode. In this framework, continuous batching is trivial.
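
As a concrete illustration (a minimal sketch, not code from the repo; the llama_batch fields and the llama_batch_init/llama_decode signatures follow llama.h from around the time of this discussion and may differ in later versions), here is how one prompt could be decoded as two independent sequences in a single call:

    #include "llama.h"
    #include <vector>

    // Sketch: decode the same prompt as two independent sequences in one llama_decode call.
    static bool decode_two_sequences(llama_context * ctx, const std::vector<llama_token> & prompt) {
        llama_batch batch = llama_batch_init(2*(int32_t) prompt.size(), /*embd*/ 0, /*n_seq_max*/ 1);

        for (llama_seq_id seq = 0; seq < 2; ++seq) {
            for (size_t i = 0; i < prompt.size(); ++i) {
                const int j = batch.n_tokens++;
                batch.token   [j]    = prompt[i];
                batch.pos     [j]    = (llama_pos) i;            // position within its own sequence
                batch.n_seq_id[j]    = 1;
                batch.seq_id  [j][0] = seq;                      // which sequence this token belongs to
                batch.logits  [j]    = (i == prompt.size() - 1); // request logits only where we sample
            }
        }

        // The KQ_mask built from (pos, seq_id) keeps the two sequences from attending to each other.
        const int ret = llama_decode(ctx, batch); // > 0: no free KV cache slot found, < 0: error
        llama_batch_free(batch);
        return ret == 0;
    }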

Another great benefit is that different sequences can share a common prompt without any extra compute. All it takes is to assign multiple sequence ids to the common tokens in the KV cache. A basic example is a system prompt of S tokens that can sit at the beginning of the unified KV cache and be attended to by all sequences that we process.
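
A minimal sketch of that (assuming the llama_kv_cache_seq_cp call from the API of this period; with the unified cache it only tags the existing cells with an additional sequence id rather than copying any data):

    // After decoding the system prompt once as sequence 0, let every other slot
    // reuse those cells by assigning them its sequence id as well.
    // n_parallel and n_system_tokens are assumed to come from the surrounding code.
    for (llama_seq_id seq = 1; seq < n_parallel; ++seq) {
        llama_kv_cache_seq_cp(ctx, /*seq_id_src*/ 0, /*seq_id_dst*/ seq, /*p0*/ 0, /*p1*/ n_system_tokens);
    }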

Along with the simplicity and advantages of this implementation, there are a few disadvantages:

  • The attention for each sequence is computed over the entire KV cache. This is suboptimal in the case of many sequences and very long contexts, because we compute "cross-sequence attention" that is then masked out (i.e. we throw it away)
  • The determinism of the results is now also a function of where the tokens of a sequence are placed in the KV cache

In order to resolve these, I think we should add a standard attention implementation where each sequence has its own KV cache buffer and the attention is computed separately. This way, users would be able to choose which implementation to use based on their specific use case.


If you set -np N, then at most N requests will be processed at a given time and the rest will be pending in a queue.

That's not exactly what I am describing. You shouldn't need any -np value if you have a unified cache; you should always process as many requests as you can at a time. For example, if you have a cache with size 10k, and 5 requests that require 2k cache slots, then you can process the 5 requests at the same time. But if each request requires 10k slots, then you can only process one request at a time. That way you can support any number of simultaneous clients regardless of the size of the cache.


Yes, got it. I guess we can add this when the unified KV cache is used - seems easy to support.


@yychyo proposes to support the case where the sum of the lengths of all N parallel sequences is less than C, i.e. to allow individual sequences to be longer than C/N as long as their total sum stays below C. The split KV cache does not have a good way to handle this. It can be handled with the unified KV cache. But I am not convinced it's necessary to support this.

I don't know if it is strictly necessary, but it would be a lot more flexible. For example, the case that you mentioned before, with some agentic tool that kept one big slot (with a very large context) and occasionally launched smaller requests in a different slot, would be handled perfectly with a unified cache. I think that starting with a unified cache was a smart thing to do, but we never really took advantage of it, and it is a bit disappointing that in the end it was just removed in favor of multiple independent caches.


For example, if you have a cache with size 10k, and 5 requests that require 2k cache slots, then you can process the 5 requests at the same time.

Note that we don't know in advance how many cache slots a sequence would require. It could be any number up to the total training context of the model.

So in this example, if we reach a point where the sequences have the following lengths and are still generating:

  • seq 0: 8000 tokens
  • seq 1: 1000 tokens
  • seq 2: 500 tokens
  • seq 3: 250 tokens
  • seq 4: 250 tokens

Now the cache is full and it is not clear how to proceed. The simplest solution that we can implement is to randomly terminate one of the sequences and continue. I think we can probably do better and instead of terminating the sequence, we can offload it to host memory cache and queue it for later reprocessing when another sequence finishes.

But maybe as a first step we can simply terminate the largest sequence and go from there.
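
A hypothetical sketch of that first step (this is not existing server code; llama_kv_cache_seq_rm is the call of that era for dropping a sequence's cells, and longest_running_sequence() is an assumed bookkeeping helper):

    // If llama_decode cannot find a free KV cache slot, evict the longest running
    // sequence and let the remaining ones continue.
    if (llama_decode(ctx, batch) > 0) {
        const llama_seq_id victim = longest_running_sequence();   // assumed helper
        llama_kv_cache_seq_rm(ctx, victim, /*p0*/ -1, /*p1*/ -1); // -1, -1 clears the whole sequence
        // Optionally stash the victim's tokens in host memory and re-queue it later.
    }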


I think that's likely to be the best way to maximize throughput, especially if there is a second-level cache in host memory or on disk. Maybe reserve a small amount of cache space for generation for each sequence so that you don't end up in a case where you have to drop it from the cache immediately, but otherwise just generate in parallel for as long as you can, and if you run out of cache space during generation, you drop one of the requests from the (first-level) cache and continue with the rest.


If I understand correctly, setting -c beyond the model's context window size degrades output quality. However, does it make a difference in parallel?
For example, if I want to utilize the full context size for llama3 (8192) with 16 parallel slots, would setting -c 131072 and -np 16 decrease the quality of the output?
Or, should I set -c 8192 -np 16 and only send prompts of at most 512 tokens?
What about generation tokens? Should prompt + generation be <= 512?
Then should I set -c 8192 -np 16 -n 256 and only send prompts of at most 256 tokens?


For example, if I want to utilize the full context size for llama3 (8192) with 16 parallel slots, would setting -c 131072 and -np 16 decrease the quality of the output?

It will not decrease the quality and this is the recommended way to go

What about generation tokens?

For 16 parallel slots, setting -c 131072 -np 16 would guarantee that the worst-case scenario in terms of memory consumption, where all 16 sequences are 8192 tokens long, would fit in the KV cache.

Then should I set -c 8192 -np 16 -n 256 and only send prompts of at most 256 tokens?

Yes, you can do that and it will work in the worst-case scenario where all 16 slots have prompts of 256 tokens each and generate 256 tokens each: 16*(256 + 256) = 8192
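
In command-line form the two options discussed above would look roughly like this (the model path is a placeholder; flag spellings follow the server of that period):

    # full 8192-token context per slot; worst case 16 * 8192 = 131072 KV cache cells
    ./server -m model.gguf -c 131072 -np 16

    # 512-token budget per slot (256 prompt + 256 generated): 16 * (256 + 256) = 8192
    ./server -m model.gguf -c 8192 -np 16 -n 256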


Is there a standard implementation of the KV cache that allows each request to compute its own KV cache independently, rather than sharing one for all requests? I am currently trying to observe the performance of multi-request inference while keeping the KV cache size constant for each sequence (i.e., actively adjusting n_ctx so that n_ctx / np remains constant). As I increase np, performance drops sharply. I suspect this is due to the design of the KV cache, which causes the KQV computation to grow quadratically rather than linearly.
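
A rough way to see the scaling described here, assuming attention is computed over the whole unified cache: with n_ctx = np * T cells in use, each generated token attends over up to np * T cells instead of only the T cells of its own sequence, so one decoding step across all np sequences costs on the order of np^2 * T rather than np * T - quadratic instead of linear in np.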


There isn't atm. I would like to refactor the KV cache implementation to support unified, standard, and other types of cache.
