EGGROLL in C

A minimalist, dependency-free implementation of the EGGROLL (Evolution Guided General Optimization via Low-rank Learning) algorithm family.

Mission: To build the most capable cross-implementations of the EGGROLL family of algorithms across model/platform pairs, and to squeeze every possible bit of performance out of the hardware by implementing everything required from scratch in a hardware-optimized fashion.

This project demonstrates integer-only training of a language model, completely bypassing the need for standard floating-point arithmetic or heavy ML frameworks like PyTorch or JAX.

Key Features

  • Pure C / Bare Metal: Old-school and goal-oriented. Zero external dependencies on the CPU side, keeping it close to the metal.
  • Apple Silicon Optimized: Vectorized operations using ARM NEON intrinsics and parallelized via Grand Central Dispatch (GCD).
  • NVIDIA CUDA Optimized: Custom GPU kernels utilizing Warp-level primitives, Shared Memory, and CUB/Thrust for maximum throughput.
  • Integer Only: Operates primarily on int8 weights/activations with int32 (CPU) or int64 (GPU) accumulation—sticking to integer math as long as it yields the best performance for the hardware.
  • Gradient Free: Uses Evolution Strategies (ES) with low-rank perturbations instead of backpropagation (a sketch follows this list). It's both wisdom and freedom!
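
To make the gradient-free idea concrete, here is a minimal sketch of one low-rank ES step on a single int8 weight matrix. The toy shapes, the hyperparameters (POP, SIGMA, LR), and the stand-in fitness() are illustrative assumptions, not the repository's actual API:

#include <stdint.h>
#include <math.h>

#define OUT 8      /* rows (assumed toy size)      */
#define IN  8      /* cols (assumed toy size)      */
#define POP 16     /* population size (assumed)    */
#define SIGMA 4.0f /* perturbation scale (assumed) */
#define LR 0.5f    /* learning rate (assumed)      */

/* Assumed stand-in: a forward pass over the training text; higher is better. */
extern float fitness(int8_t W[OUT][IN]);

static int8_t clamp_i8(int v) { return (int8_t)(v > 127 ? 127 : v < -128 ? -128 : v); }

/* Deterministic Gaussian noise from a seed (Box-Muller over a tiny LCG):
 * the property that lets every worker regenerate identical perturbations. */
static float gauss(unsigned *s) {
    *s = *s * 1664525u + 1013904223u; float u1 = ((*s >> 8) + 1.0f) / 16777218.0f;
    *s = *s * 1664525u + 1013904223u; float u2 = ((*s >> 8) + 1.0f) / 16777218.0f;
    return sqrtf(-2.0f * logf(u1)) * cosf(6.2831853f * u2);
}

void es_step(int8_t W[OUT][IN], unsigned epoch_seed) {
    float f[POP], u[POP][OUT], v[POP][IN], mean = 0.0f;
    for (int k = 0; k < POP; k++) {
        unsigned s = epoch_seed + (unsigned)k;          /* per-member seed */
        for (int i = 0; i < OUT; i++) u[k][i] = gauss(&s);
        for (int j = 0; j < IN;  j++) v[k][j] = gauss(&s);

        int8_t Wp[OUT][IN];                  /* W' = W + round(sigma * u v^T) */
        for (int i = 0; i < OUT; i++)
            for (int j = 0; j < IN; j++)
                Wp[i][j] = clamp_i8(W[i][j] + (int)lroundf(SIGMA * u[k][i] * v[k][j]));
        f[k] = fitness(Wp);
        mean += f[k] / POP;
    }
    /* Combine: weight each member's rank-1 noise by its centered fitness. */
    for (int i = 0; i < OUT; i++)
        for (int j = 0; j < IN; j++) {
            float g = 0.0f;
            for (int k = 0; k < POP; k++) g += (f[k] - mean) * u[k][i] * v[k][j];
            W[i][j] = clamp_i8(W[i][j] + (int)lroundf(LR * g / POP));
        }
}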

Quick Start

1. Prepare Data

Ensure you have a text dataset named input.txt in the current directory.

2. Compile & Run

Apple Silicon / CPU

clang -O3 full_trained_egg.c -o egg
./egg

NVIDIA GPU (CUDA)

nvcc -O3 full_cuda_train_egg.cu -o egg_cuda
./egg_cuda

Training Output

(Sample training output screenshot.)

Advanced Implementations

Int8NativeFormer (full_cuda_train_transformer_adam_mgpu.cu)

A transformer language model that runs natively on int8 weights and activations.

  • Native int8 Architecture: Operates on raw bytes with a compact N-layer, H-dim topology.
  • Quantized Sigmoid Self-Attention: sigmoid-based attention with an int32/int64 accumulation scheme and quantized weighting (see the sketch after this list).
  • Auto-Norm & Entropy Monitoring: Adaptive normalization layers.
  • EGG DEBUG: a debug-printing tool for monitoring entropy flow through the network, along with weight distribution and saturation.
  • Information-Regulated Optimizer: A hybrid ES-AdamW approach where the optimizer (float32) regulates the amount of updates applied to the integer weights, ensuring stable learning.
  • Performance: Achieves ~300k tokens/second with a population of 40,000+ (×ばつ5) on a single 4090 GPU, reaching a loss of ~1.45 bits/byte.
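
As a concrete illustration of the quantized attention path, here is a minimal sketch of sigmoid attention over int8 vectors with int32 accumulation. The SHIFT rescale, the 17-entry sigmoid lookup table, and the normalization scheme are illustrative assumptions rather than the model's exact arithmetic:

#include <stdint.h>

#define DIM 64
#define SHIFT 10   /* rescale of the raw int32 score before lookup (assumed) */

/* int8 dot product with int32 accumulation. */
static int32_t dot_i8(const int8_t *q, const int8_t *k, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++) acc += (int32_t)q[i] * (int32_t)k[i];
    return acc;
}

/* Precomputed sigmoid(x) * 255 for x = -8..8: integer-only at runtime. */
static const uint8_t SIG_LUT[17] = {
    0, 0, 1, 2, 5, 12, 30, 69, 128, 186, 225, 243, 250, 253, 254, 255, 255
};

static uint8_t sigmoid_q(int32_t score) {
    int x = (int)(score >> SHIFT);      /* arithmetic shift assumed */
    if (x < -8) x = -8;
    if (x >  8) x =  8;
    return SIG_LUT[x + 8];
}

/* One query attending over seq keys/values, int32 accumulation throughout. */
void attend(const int8_t *q, const int8_t (*K)[DIM], const int8_t (*V)[DIM],
            int seq, int8_t *out) {
    int32_t acc[DIM] = {0};
    int32_t wsum = 1;                   /* avoid division by zero */
    for (int t = 0; t < seq; t++) {
        uint8_t w = sigmoid_q(dot_i8(q, K[t], DIM));
        wsum += w;
        for (int i = 0; i < DIM; i++) acc[i] += (int32_t)w * (int32_t)V[t][i];
    }
    for (int i = 0; i < DIM; i++) {     /* normalize and clamp back to int8 */
        int32_t v = acc[i] / wsum;
        out[i] = (int8_t)(v > 127 ? 127 : v < -128 ? -128 : v);
    }
}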

Multi-GPU Strategy

The system employs a Synchronous Replicated Model with Sharded Evaluation:

  • Sharded Evaluation: The population is split across GPUs, with each evaluating a subset of perturbations in parallel.
  • Implicit Synchronization: Instead of exchanging gradients (All-Reduce), GPUs exchange only fitness scores. Since the noise is deterministic, each GPU independently reconstructs the same update, keeping replicas synchronized with negligible bandwidth (sketched below).
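
A minimal sketch of that reconstruction, with hypothetical helpers (exchange_fitness, evaluate_member) standing in for the real communication and evaluation code:

#include <stdint.h>

#define POP 8

/* Assumed stand-ins: exchange_fitness() fills in every member's score
 * (e.g. via an all-gather); evaluate_member() runs a local forward pass. */
extern void exchange_fitness(float f[POP]);
extern float evaluate_member(int k);

/* Deterministic noise: same seed => same perturbation on every replica. */
static int noise(unsigned seed, int idx) {
    unsigned s = (seed + (unsigned)idx) * 2654435761u;   /* cheap hash */
    s ^= s >> 16; s *= 2246822519u; s ^= s >> 13;
    return (int)(s % 7) - 3;                             /* small int in [-3, 3] */
}

void sync_step(int8_t *w, int n, unsigned epoch_seed, int my_lo, int my_hi) {
    float f[POP] = {0};
    for (int k = my_lo; k < my_hi; k++)   /* 1. evaluate only this GPU's shard */
        f[k] = evaluate_member(k);
    exchange_fitness(f);                  /* 2. only POP floats cross the wire */
    for (int i = 0; i < n; i++) {         /* 3. identical update on every GPU  */
        long acc = 0;
        for (int k = 0; k < POP; k++)
            acc += (long)(f[k] * 64.0f) * noise(epoch_seed + (unsigned)k, i);
        int v = w[i] + (int)(acc / (64L * POP));
        w[i] = (int8_t)(v > 127 ? 127 : v < -128 ? -128 : v);
    }
}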

Compile & Run (Multi-GPU)

nvcc -O3 -arch=native full_cuda_train_transformer_adam_mgpu.cu -o egg_transformer_mgpu
./egg_transformer_mgpu

Debugging (egg_debug_printer.h)

A lightweight header-only tool for monitoring integer model stability and detecting saturation or mode collapse.

  • Metrics: Tracks Mean, StdDev, bit-level Entropy (0.00-8.00), and Saturation percentages per layer (the entropy metric is sketched after this list).
  • Usage: Define EGG_DEBUG during compilation to enable ANSI-colored logs for activations and attention scores.
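
For reference, here is a minimal sketch of what a 0.00-8.00 bit-level entropy and a saturation percentage can look like for an int8 tensor; the exact formulas in egg_debug_printer.h may differ:

#include <stdint.h>
#include <stddef.h>
#include <math.h>

/* Shannon entropy over the 256 possible byte values: 8.00 means a uniform
 * byte distribution, values near 0.00 indicate mode collapse. */
float byte_entropy(const int8_t *x, size_t n) {
    size_t hist[256] = {0};
    for (size_t i = 0; i < n; i++) hist[(uint8_t)x[i]]++;
    float h = 0.0f;
    for (int b = 0; b < 256; b++) {
        if (!hist[b]) continue;
        float p = (float)hist[b] / (float)n;
        h -= p * log2f(p);
    }
    return h;   /* in [0.0, 8.0] bits */
}

/* Saturation: fraction of values pinned at the int8 extremes. */
float saturation_pct(const int8_t *x, size_t n) {
    size_t s = 0;
    for (size_t i = 0; i < n; i++) s += (x[i] == 127 || x[i] == -128);
    return 100.0f * (float)s / (float)n;
}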

Distributed Implementation (d-eggs)

A robust, distributed training system designed to scale EGGROLL across multiple nodes and GPUs.

For full documentation, architecture details, and usage instructions, please see d-eggs/README.md.

  • Architecture: Coordinator-Worker model with fault tolerance.
  • Key Features: Custom Binary Protocol, CUDA Graphs, Ternary Packing (sketched below), Muon Optimizer.
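
As one illustration, ternary packing can fit five base-3 digits (-1/0/+1) into a single byte, since 3^5 = 243 <= 256, cutting bandwidth roughly 5x versus one byte per value. Whether d-eggs packs in base 3 or with 2-bit fields is an assumption here; see d-eggs/README.md for the actual format:

#include <stdint.h>

/* Pack five ternary values t[i] in {-1, 0, +1} into one byte (base 3). */
uint8_t pack5(const int8_t t[5]) {
    uint8_t b = 0;
    for (int i = 4; i >= 0; i--)
        b = (uint8_t)(b * 3 + (t[i] + 1));   /* shift each digit to {0,1,2} */
    return b;
}

/* Inverse: recover the five ternary digits from one byte. */
void unpack5(uint8_t b, int8_t t[5]) {
    for (int i = 0; i < 5; i++) {
        t[i] = (int8_t)(b % 3) - 1;
        b /= 3;
    }
}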

Configuration

Community & Contributing

We are friendly and welcome all sorts of contributions!

  • Testers: Open an issue describing your available compute, or join existing issues if you have access to the platforms described there.
  • Moderators: Help keep all of this under control.
  • Creatives: Even a nice creative idea for the README design is welcome.
