-
-
Notifications
You must be signed in to change notification settings - Fork 11k
[Performance][MLA][ROCm] Remove redundant D2D copy in deepseek #27457
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: ganyi <ygan@amd.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request successfully optimizes the deepseek MLA implementation by removing two redundant D2D copies, which is achieved by leveraging in-place writes within the kernels. The changes in _v_up_proj and _forward_prefill are well-aligned with the performance goals. I have one suggestion to correct a return type annotation that was missed during refactoring.
@HAIAI Please take a look
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@LucasWilkinson PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Uh oh!
There was an error while loading. Please reload this page.
Purpose
We found there are two redundant D2D copy in deepseek, which can be removed by in-place write inside the kernel.
Before
imageimage
After
imageimage
This PR replace those two index copy to inplace write inside the kernel, and we witness there are about 3% performance gain over deepseek on AMD Mi300 with 3.5k/1k inputs
Test Plan
gsm8k
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.