MOE uses more memory than dense model and is slower #166
Open
Description
I am training a ~520M-parameter model, but I have found that the megablocks MoE version uses substantially more memory and takes longer to train than a dense model of corresponding size. I am using a model embedding dimension of 1536. The MoE model has 48 experts with 8 active and an expert size of 128. I set the load-balancing loss (lbl) weight to 0.001.
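For reference, the expert parameter arithmetic implied by these numbers can be sketched as below. This is a rough estimate, not the megablocks implementation: it assumes each expert is a plain two-matrix MLP (no GLU, no biases), that "expert size" means the expert's hidden dimension, and that the router is a single linear layer. The helper `mlp_params` and all variable names are hypothetical.

```python
# Rough per-layer parameter arithmetic for the MoE configuration described above.
# Assumptions (not confirmed by the issue): two-matrix expert MLP, no GLU, no biases,
# "expert size" = expert hidden dimension, router = one linear layer.

d_model = 1536        # model embedding dimension
num_experts = 48      # total experts per MoE layer
top_k = 8             # experts active per token
expert_hidden = 128   # hidden size of each expert


def mlp_params(d_model: int, hidden: int) -> int:
    """Weights in an up-projection plus a down-projection, without biases."""
    return 2 * d_model * hidden


per_expert = mlp_params(d_model, expert_hidden)
total_expert_params = num_experts * per_expert   # stored in memory per layer
active_expert_params = top_k * per_expert        # used per token per layer
router_params = d_model * num_experts            # assumed linear router

# Hidden widths of a dense MLP matching the active or the total expert weights.
dense_hidden_active = top_k * expert_hidden
dense_hidden_total = num_experts * expert_hidden

print(f"per expert:         {per_expert:,}")
print(f"total per layer:    {total_expert_params:,}")
print(f"active per layer:   {active_expert_params:,}")
print(f"router per layer:   {router_params:,}")
print(f"dense hidden (active-matched): {dense_hidden_active}")
print(f"dense hidden (total-matched):  {dense_hidden_total}")
```

Under these assumptions each layer stores six times as many expert weights as it uses per token (48 / 8), so whether the dense baseline is matched on total or on active parameters changes what memory and speed gap to expect.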
[Three images attached]