Hi folks,
I have a question about the "group attention" / self-extend / context-shift code for hybrid memory models.
I am experimenting with several hybrid, but not fully recurrent, models: LFM2-VL-450M-Q8_0.gguf and falcon-h1-0.5b-instruct-q8_0.gguf.
In the code (main.cpp / server.cpp / passkey.cpp) I see the same call sequences used for context shifting (`llama_memory_seq_rm()` followed by `llama_memory_seq_add()`) and for group-attention context reduction / SelfExtend (`llama_memory_seq_div()` together with `llama_memory_seq_add()`), roughly as sketched below.
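To make the context-shift pattern concrete, here is a minimal sketch, lifted into a helper (seq_id 0 and the n_keep / n_discard parameters follow the usual main.cpp convention; this is my paraphrase, not a verbatim copy of main.cpp):

```cpp
#include "llama.h"

// context shift, main.cpp-style: drop [n_keep, n_keep + n_discard) from sequence 0
// and slide the remaining tokens back by n_discard positions
static void context_shift(llama_memory_t mem, int n_keep, int n_discard, int n_past) {
    llama_memory_seq_rm (mem, 0, n_keep,             n_keep + n_discard);
    llama_memory_seq_add(mem, 0, n_keep + n_discard, n_past, -n_discard);
}
```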
The LFM2-VL-450M-Q8_0.gguf model returns:
```
is_recurrent: 0
is_hybrid: 1
can_shift: 1
```
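For reference, this is roughly how I print those flags (a sketch; I am assuming the `llama_model_is_recurrent()` / `llama_model_is_hybrid()` / `llama_memory_can_shift()` getters as they exist in my tree, names may differ in older builds):

```cpp
#include <cstdio>
#include "llama.h"

// print the memory-related model/context flags (getter names as in my tree)
static void print_memory_flags(const llama_model * model, llama_context * ctx) {
    llama_memory_t mem = llama_get_memory(ctx);

    printf("is_recurrent: %d\n", llama_model_is_recurrent(model) ? 1 : 0);
    printf("is_hybrid: %d\n",    llama_model_is_hybrid(model)    ? 1 : 0);
    printf("can_shift: %d\n",    llama_memory_can_shift(mem)     ? 1 : 0);
}
```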
which is what I would expect. Reading the code:
```cpp
bool llama_memory_hybrid::get_can_shift() const {
    // Shifting is trivially supported for recurrent
    return mem_attn->get_can_shift();
}
```
one might expect `llama_memory_seq_rm()` to work for shifting the KV cache content, but it does not, because:
```cpp
bool llama_memory_seq_rm(
        llama_memory_t mem,
        llama_seq_id   seq_id,
        llama_pos      p0,
        llama_pos      p1) {
    if (!mem) {
        return true;
    }
    return mem->seq_rm(seq_id, p0, p1);
}
```
which calls:
```cpp
bool llama_memory_hybrid::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
    // Try removing from the recurrent cache first since it may fail. If it does
    // fail, the cache will not have been mutated.
    if (!mem_recr->seq_rm(seq_id, p0, p1)) {
        return false;
    }
    return mem_attn->seq_rm(seq_id, p0, p1);
}
```
which then, understandably, fails in:
```cpp
bool llama_memory_recurrent::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
    //printf("[DEBUG] calling llama_memory_recurrent::seq_rm` with `seq_id=%d, p0=%d, p1=%d`\n", seq_id, p0, p1);
    uint32_t new_head = size;

    if (p0 < 0) {
        p0 = 0;
    }
    if (p1 < 0) {
        p1 = std::numeric_limits<llama_pos>::max();
    }

    // models like Mamba or RWKV can't have a state partially erased
    if (seq_id >= (int64_t) size) {
        // could be fatal
        return false;
    }
    if (0 <= seq_id) {
        int32_t & tail_id = cells[seq_id].tail;
        if (tail_id >= 0) {
            const auto & cell = cells[tail_id];
            // partial intersection is invalid
            if ((0 < p0 && p0 < cell.pos) || (0 < p1 && p1 <= cell.pos)) {
                //printf("[DEBUG] inside `llama_memory_recurrent::seq_rm`: partial intersection is invalid, so returning false\n");
                return false;
            }
            // invalidate tails which will be cleared
            if (p0 <= cell.pos && cell.pos < p1) {
                tail_id = -1;
            }
        }
    } else {
        ...
    }
}
```
The failure comes from the "partial intersection is invalid" check: `(0 < p0 && p0 < cell.pos) || (0 < p1 && p1 <= cell.pos)`.
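To see why a typical context shift trips this check, here is a tiny standalone illustration with made-up numbers (the tail position and the n_keep / n_discard values are hypothetical, chosen only to show the arithmetic):

```cpp
#include <cstdio>

int main() {
    const int cell_pos = 512; // hypothetical: position of the recurrent tail cell for seq 0
    const int p0       = 4;   // hypothetical: n_keep
    const int p1       = 260; // hypothetical: n_keep + n_discard

    // same condition as in llama_memory_recurrent::seq_rm()
    const bool partial = (0 < p0 && p0 < cell_pos) || (0 < p1 && p1 <= cell_pos);

    // prints: "partial intersection: yes -> seq_rm() returns false"
    // i.e. a Mamba/RWKV-style state cannot be partially erased, so the shift is refused
    printf("partial intersection: %s -> seq_rm() returns %s\n",
           partial ? "yes" : "no", partial ? "false" : "true");
    return 0;
}
```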
The SelfExtend code path with `grp_attn_n` and `grp_attn_w` (roughly the sequence sketched below) silently succeeds, but it makes subsequent `init_batch()` calls fail with "unable to find_slot()".
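For reference, the group-attention update I mean follows the main.cpp shape, roughly like this (a sketch from memory, not a verbatim copy; `ga_n` / `ga_w` correspond to `grp_attn_n` / `grp_attn_w`):

```cpp
#include "llama.h"

// SelfExtend / group-attention step, main.cpp-style: compress one window of ga_w
// positions by a factor of ga_n; note that seq_add()/seq_div() return void, so
// nothing here reports whether the hybrid memory could actually honor the edit
static void self_extend_step(llama_memory_t mem, int & n_past, int & ga_i, int ga_n, int ga_w) {
    const int ib = (ga_n * ga_i) / ga_w;
    const int bd = (ga_w / ga_n) * (ga_n - 1);
    const int dd = (ga_w / ga_n) - ib*bd - ga_w;

    llama_memory_seq_add(mem, 0, ga_i,                n_past,              ib*bd);
    llama_memory_seq_div(mem, 0, ga_i + ib*bd,        ga_i + ib*bd + ga_w, ga_n);
    llama_memory_seq_add(mem, 0, ga_i + ib*bd + ga_w, n_past + ib*bd,      dd);

    n_past -= bd;
    ga_i   += ga_w / ga_n;
}
```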
Questions:
- Is this expected behavior?
- Can I detect a situation like this earlier, instead of attempting the `llama_memory_seq_rm()` call that fails and returns `false`? (See the sketch after this list for the kind of check I mean.)
- In main.cpp, server.cpp and passkey.cpp I do not see the `bool` results of `llama_memory_seq_rm()` calls checked, nor any error reporting/recovery/mitigation for failures, which probably means the surrounding code has guarantees that the call will succeed, but I have failed to find such guarantees. What am I missing there?
- I did not deeply investigate `grp_attn_n` and `grp_attn_w`, because I believe the hybrid models in question may not need them at all if I correctly detect the "cannot/should not shift or self-extend" condition. If that is incorrect, any hints on what could go wrong for hybrid models?
- Should there be checks, and at least error logging, for all `llama_memory_seq_rm()` calls that return `false` on failure?
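For the second question, the kind of early check / fallback I have in mind looks roughly like this (a sketch; the fallback policy is mine, not something main.cpp or server.cpp do today):

```cpp
#include <cstdio>
#include "llama.h"

// check llama_memory_can_shift() and the result of llama_memory_seq_rm()
// before committing to a context shift
static bool try_context_shift(llama_context * ctx, int n_keep, int n_discard, int n_past) {
    llama_memory_t mem = llama_get_memory(ctx);

    if (!llama_memory_can_shift(mem)) {
        fprintf(stderr, "context shift not supported by this memory module\n");
        return false;
    }

    if (!llama_memory_seq_rm(mem, 0, n_keep, n_keep + n_discard)) {
        // if the hybrid/recurrent memory refused the partial erase, the cache should not
        // have been mutated (per the comment in llama_memory_hybrid::seq_rm()), so we
        // skip the shift here rather than corrupt the positions with seq_add()
        fprintf(stderr, "llama_memory_seq_rm() failed, skipping context shift\n");
        return false;
    }

    llama_memory_seq_add(mem, 0, n_keep + n_discard, n_past, -n_discard);
    return true;
}
```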
Any help and clarification would be greatly appreciated.