Hi folks,
I have a question about the "group attention" / self-extend / context-shift code for hybrid memory models.
I am experimenting with several hybrid, but not fully recurrent, models: LFM2-VL-450M-Q8_0.gguf and falcon-h1-0.5b-instruct-q8_0.gguf.
In the code (main.cpp / server.cpp / passkey.cpp) I see the same call sequences used for context shifting (`llama_memory_seq_rm()` followed by `llama_memory_seq_add()`) and for group-attention context reduction / SelfExtend (`llama_memory_seq_div()` together with `llama_memory_seq_add()`), roughly as sketched below.
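To make the context-shift pattern concrete, here is a minimal sketch, lifted into a helper (seq_id 0 and the n_keep / n_discard parameters follow the usual main.cpp convention; this is my paraphrase, not a verbatim copy of main.cpp):

```cpp
#include "llama.h"

// context shift, main.cpp-style: drop [n_keep, n_keep + n_discard) from sequence 0
// and slide the remaining tokens back by n_discard positions
static void context_shift(llama_memory_t mem, int n_keep, int n_discard, int n_past) {
    llama_memory_seq_rm (mem, 0, n_keep,             n_keep + n_discard);
    llama_memory_seq_add(mem, 0, n_keep + n_discard, n_past, -n_discard);
}
```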
The LFM2-VL-450M-Q8_0.gguf model returns:
```
is_recurrent: 0
is_hybrid: 1
can_shift: 1
```
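For reference, this is roughly how I print those flags (a sketch; I am assuming the `llama_model_is_recurrent()` / `llama_model_is_hybrid()` / `llama_memory_can_shift()` getters as they exist in my tree, names may differ in older builds):

```cpp
#include <cstdio>
#include "llama.h"

// print the memory-related model/context flags (getter names as in my tree)
static void print_memory_flags(const llama_model * model, llama_context * ctx) {
    llama_memory_t mem = llama_get_memory(ctx);

    printf("is_recurrent: %d\n", llama_model_is_recurrent(model) ? 1 : 0);
    printf("is_hybrid: %d\n",    llama_model_is_hybrid(model)    ? 1 : 0);
    printf("can_shift: %d\n",    llama_memory_can_shift(mem)     ? 1 : 0);
}
```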
which is what I would expect. Reading the code:
```cpp
bool llama_memory_hybrid::get_can_shift() const {
    // Shifting is trivially supported for recurrent
    return mem_attn->get_can_shift();
}
```
one might expect `llama_memory_seq_rm()` to work for shifting the KV cache content, but it does not, because:
```cpp
bool llama_memory_seq_rm(
        llama_memory_t mem,
        llama_seq_id   seq_id,
        llama_pos      p0,
        llama_pos      p1) {
    if (!mem) {
        return true;
    }
    return mem->seq_rm(seq_id, p0, p1);
}
```
which calls:
```cpp
bool llama_memory_hybrid::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
    // Try removing from the recurrent cache first since it may fail. If it does
    // fail, the cache will not have been mutated.
    if (!mem_recr->seq_rm(seq_id, p0, p1)) {
        return false;
    }
    return mem_attn->seq_rm(seq_id, p0, p1);
}
```
which then, understandably, fails in:
```cpp
bool llama_memory_recurrent::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
    //printf("[DEBUG] calling llama_memory_recurrent::seq_rm` with `seq_id=%d, p0=%d, p1=%d`\n", seq_id, p0, p1);
    uint32_t new_head = size;

    if (p0 < 0) {
        p0 = 0;
    }
    if (p1 < 0) {
        p1 = std::numeric_limits<llama_pos>::max();
    }

    // models like Mamba or RWKV can't have a state partially erased
    if (seq_id >= (int64_t) size) {
        // could be fatal
        return false;
    }
    if (0 <= seq_id) {
        int32_t & tail_id = cells[seq_id].tail;
        if (tail_id >= 0) {
            const auto & cell = cells[tail_id];
            // partial intersection is invalid
            if ((0 < p0 && p0 < cell.pos) || (0 < p1 && p1 <= cell.pos)) {
                //printf("[DEBUG] inside `llama_memory_recurrent::seq_rm`: partial intersection is invalid, so returning false\n");
                return false;
            }
            // invalidate tails which will be cleared
            if (p0 <= cell.pos && cell.pos < p1) {
                tail_id = -1;
            }
        }
    } else {
        ...
    }
}
```
The failure comes from the "partial intersection is invalid" check: `(0 < p0 && p0 < cell.pos) || (0 < p1 && p1 <= cell.pos)`.
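To see why a typical context shift trips this check, here is a tiny standalone illustration with made-up numbers (the tail position and the n_keep / n_discard values are hypothetical, chosen only to show the arithmetic):

```cpp
#include <cstdio>

int main() {
    const int cell_pos = 512; // hypothetical: position of the recurrent tail cell for seq 0
    const int p0       = 4;   // hypothetical: n_keep
    const int p1       = 260; // hypothetical: n_keep + n_discard

    // same condition as in llama_memory_recurrent::seq_rm()
    const bool partial = (0 < p0 && p0 < cell_pos) || (0 < p1 && p1 <= cell_pos);

    // prints: "partial intersection: yes -> seq_rm() returns false"
    // i.e. a Mamba/RWKV-style state cannot be partially erased, so the shift is refused
    printf("partial intersection: %s -> seq_rm() returns %s\n",
           partial ? "yes" : "no", partial ? "false" : "true");
    return 0;
}
```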
The SelfExtend code path with `grp_attn_n` and `grp_attn_w` (roughly the sequence sketched below) silently succeeds, but it makes subsequent `init_batch()` calls fail with "unable to find_slot()".
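For reference, the group-attention update I mean follows the main.cpp shape, roughly like this (a sketch from memory, not a verbatim copy; `ga_n` / `ga_w` correspond to `grp_attn_n` / `grp_attn_w`):

```cpp
#include "llama.h"

// SelfExtend / group-attention step, main.cpp-style: compress one window of ga_w
// positions by a factor of ga_n; note that seq_add()/seq_div() return void, so
// nothing here reports whether the hybrid memory could actually honor the edit
static void self_extend_step(llama_memory_t mem, int & n_past, int & ga_i, int ga_n, int ga_w) {
    const int ib = (ga_n * ga_i) / ga_w;
    const int bd = (ga_w / ga_n) * (ga_n - 1);
    const int dd = (ga_w / ga_n) - ib*bd - ga_w;

    llama_memory_seq_add(mem, 0, ga_i,                n_past,              ib*bd);
    llama_memory_seq_div(mem, 0, ga_i + ib*bd,        ga_i + ib*bd + ga_w, ga_n);
    llama_memory_seq_add(mem, 0, ga_i + ib*bd + ga_w, n_past + ib*bd,      dd);

    n_past -= bd;
    ga_i   += ga_w / ga_n;
}
```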
Questions:
- Is this expected behavior?
- Can I detect a situation like this earlier, instead of attempting the `llama_memory_seq_rm()` call that fails and returns `false`? (See the sketch after this list for the kind of check I mean.)
- In main.cpp, server.cpp and passkey.cpp I do not see the `bool` results of `llama_memory_seq_rm()` calls checked, nor any error reporting/recovery/mitigation for failures, which probably means the surrounding code has guarantees that the call will succeed, but I have failed to find such guarantees. What am I missing there?
- I did not deeply investigate `grp_attn_n` and `grp_attn_w`, because I believe the hybrid models in question may not need them at all if I correctly detect the "cannot/should not shift or self-extend" condition. If that is incorrect, any hints on what could go wrong for hybrid models?
- Should there be checks, and at least error logging, for all `llama_memory_seq_rm()` calls that return `false` on failure?
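For the second question, the kind of early check / fallback I have in mind looks roughly like this (a sketch; the fallback policy is mine, not something main.cpp or server.cpp do today):

```cpp
#include <cstdio>
#include "llama.h"

// check llama_memory_can_shift() and the result of llama_memory_seq_rm()
// before committing to a context shift
static bool try_context_shift(llama_context * ctx, int n_keep, int n_discard, int n_past) {
    llama_memory_t mem = llama_get_memory(ctx);

    if (!llama_memory_can_shift(mem)) {
        fprintf(stderr, "context shift not supported by this memory module\n");
        return false;
    }

    if (!llama_memory_seq_rm(mem, 0, n_keep, n_keep + n_discard)) {
        // if the hybrid/recurrent memory refused the partial erase, the cache should not
        // have been mutated (per the comment in llama_memory_hybrid::seq_rm()), so we
        // skip the shift here rather than corrupt the positions with seq_add()
        fprintf(stderr, "llama_memory_seq_rm() failed, skipping context shift\n");
        return false;
    }

    llama_memory_seq_add(mem, 0, n_keep + n_discard, n_past, -n_discard);
    return true;
}
```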
Any help and clarification would be greatly appreciated.