Fix excessive memory allocation for static-shape attention ops #2636
Pranaykarvi wants to merge 1 commit into apple:main
Conversation
Pranaykarvi commented on Dec 31, 2025
Hi @TobyRoseman, just a gentle follow-up in case this slipped through.
Happy to make any adjustments if needed, thanks!
TobyRoseman left a comment
Your new unit tests don't pass.
I don't think you need this line.
This shouldn't be an f-string since no variables are used in it.
Looks like this is also an issue in several other places of this PR.
Remove this line of the comment. It doesn't really add much and can easily become outdated/inaccurate.
The added comments in this file are far too long. They need to be much more concise.
Are these constants? If so, the variable names should be in all caps.
Import statements should be at the top of the file. If for some reason they can't be, put them at the top of the function.
Looks like this is also an issue elsewhere.
Did you mean to delete this comment?
I might be wrong, but I don't think any of this is specific to a Llama-style Transformer.
There is a lot of duplicated code here with the previous method. Please create a helper function.
Summary
This PR fixes excessive memory allocation for Transformer attention ops when the
sequence length is statically known at compile time (e.g. seq_len=128).
For static-shape attention, Metal may eagerly allocate large intermediate buffers
(e.g. QK^T matrices), which can lead to multi-GB allocations and OOM on iOS devices.
The existing attention slicing pass was gated behind a high sequence-length
threshold and did not trigger for smaller static shapes.
This change enables memory-efficient attention slicing for static sequence lengths
while preserving the existing behavior for dynamic-shape models.
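As background, the slicing idea is easy to illustrate outside Core ML. The NumPy sketch below is a simplified model of Q-slicing, not the actual pass implementation: only a (chunk x seq_len) score matrix is live at any time, while the final output matches unsliced attention.

```python
import numpy as np

def sdpa(q, k, v):
    # Reference attention: materializes the full (seq_len x seq_len) score matrix.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def sdpa_sliced_q(q, k, v, num_chunks=4):
    # Sliced variant: each chunk only needs a (seq_len/num_chunks x seq_len) score matrix.
    chunks = [sdpa(q_chunk, k, v) for q_chunk in np.array_split(q, num_chunks)]
    return np.concatenate(chunks, axis=0)

seq_len, head_dim = 128, 64
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, seq_len, head_dim))
assert np.allclose(sdpa(q, k, v), sdpa_sliced_q(q, k, v))
```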
Problem
When exporting Transformer models with a statically known sequence length,
scaled_dot_product_attention may materialize large intermediate tensors during
lowering. On iOS, this can result in excessive Metal buffer allocation (observed
~10GB) and OOM during inference or benchmarking, even for relatively small models
(e.g. Llama-style models with seq_len=128).
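For illustration, a minimal static-shape export that exercises this path might look like the sketch below; the toy model, shapes, and deployment target are placeholders rather than the exact setup behind the numbers above.

```python
import torch
import coremltools as ct

class TinyAttention(torch.nn.Module):
    def forward(self, q, k, v):
        # seq_len is baked into the traced graph as a static dimension.
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)

seq_len, n_heads, head_dim = 128, 8, 64
example = tuple(torch.randn(1, n_heads, seq_len, head_dim) for _ in range(3))
traced = torch.jit.trace(TinyAttention().eval(), example)

# Static (non-flexible) input shapes; no ct.RangeDim / EnumeratedShapes.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name=n, shape=t.shape) for n, t in zip("qkv", example)],
    minimum_deployment_target=ct.target.iOS18,
)
```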
Solution
- Applies the scaled_dot_product_attention_sliced_q pass to break the computation into smaller chunks and reduce peak memory usage.
- Preserves the existing threshold (1280) and behavior for dynamic-shape models to avoid unnecessary overhead.
This approach limits the change to the pathological static-shape case and avoids
global behavior changes.
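For comparison, the existing pass can also be tuned explicitly at conversion time through the pass pipeline. The snippet below is a sketch only: the pass name comes from this PR, but the min_seq_length option name (and its 1280 default) is an assumption that may not match every coremltools version.

```python
import coremltools as ct

# Assumed option name: the slicing pass is gated by a minimum sequence
# length (1280 by default), so lower it to cover seq_len=128.
pipeline = ct.PassPipeline.DEFAULT
pipeline.set_options(
    "common::scaled_dot_product_attention_sliced_q",
    {"min_seq_length": 128},
)

# Then convert with the customized pipeline:
# mlmodel = ct.convert(traced_model, inputs=..., pass_pipeline=pipeline)
```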
Testing
- Added unit tests covering attention with a static sequence length (seq_len=128).
- Verified that the sliced computation avoids the pathological buffer materialization.
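A test along these lines might convert a small static-shape attention model and compare it against the PyTorch reference. This is a hypothetical sketch: the test name, tolerances, and deployment target are illustrative, and mlmodel.predict requires macOS.

```python
import numpy as np
import pytest
import torch
import coremltools as ct

@pytest.mark.parametrize("seq_len", [128])
def test_static_shape_attention_matches_torch(seq_len):
    class Attn(torch.nn.Module):
        def forward(self, q, k, v):
            return torch.nn.functional.scaled_dot_product_attention(q, k, v)

    example = tuple(torch.randn(1, 8, seq_len, 64) for _ in range(3))
    traced = torch.jit.trace(Attn().eval(), example)
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name=n, shape=t.shape) for n, t in zip("qkv", example)],
        minimum_deployment_target=ct.target.iOS18,
    )

    expected = traced(*example).detach().numpy()
    outputs = mlmodel.predict({n: t.numpy() for n, t in zip("qkv", example)})
    np.testing.assert_allclose(
        list(outputs.values())[0], expected, atol=1e-3, rtol=1e-3
    )
```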
Notes
- The root cause is Metal's eager buffer allocation for large static-shape intermediates.
Fixes #2590.