MOE uses more memory than dense model and is slower #166

Open

Description

Opened on Mar 3, 2025

I am training a ~520M-parameter model, and I have found that the MegaBlocks MoE version uses substantially more memory and takes longer to train than a dense model of corresponding size. I am using a model embedding dimension of 1536. The MoE model has 48 experts with 8 active and an expert size of 128. I set the load-balancing loss (lbl) weight to 0.001.
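For reference, here is a rough sketch of the per-layer parameter arithmetic this configuration implies. It assumes each expert is a two-matrix MLP (up- and down-projection, no gating matrix, no biases) and that the router is a single linear layer; neither detail is stated above, and the function name `moe_ffn_params` is purely illustrative, not part of MegaBlocks.

```python
# Back-of-the-envelope parameter count for one MoE feed-forward layer.
# Assumptions (not confirmed by the issue): two weight matrices per expert
# (up- and down-projection), no biases, and a single linear router.

def moe_ffn_params(d_model: int, num_experts: int, top_k: int, expert_size: int) -> dict:
    per_expert = 2 * d_model * expert_size  # up-projection + down-projection
    router = d_model * num_experts          # linear router weights
    return {
        "stored": num_experts * per_expert + router,  # weights held in memory
        "active": top_k * per_expert + router,        # weights used per token
    }

counts = moe_ffn_params(d_model=1536, num_experts=48, top_k=8, expert_size=128)
print(counts)
print("stored / active:", round(counts["stored"] / counts["active"], 2))
```

Under these assumptions each MoE layer stores roughly six times the feed-forward weights that a single token actually uses. If the dense baseline is matched on active parameters rather than total parameters, some of the extra weight, gradient, and optimizer memory may simply come from that difference.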

[Three images attached]
