Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

itsdevcoffee/mojo-audio

Repository files navigation

mojo-audio

High-performance audio DSP and ML inference library for voice conversion — built in Mojo and Python, runs on NVIDIA DGX Spark ARM64 with zero PyTorch CUDA dependency.

Mojo License Performance


What it is

mojo-audio has two layers:

DSP layer (Mojo) — low-level audio processing: FFT, mel spectrogram, resampling, VAD, pitch shifting, iSTFT. 20–40% faster than librosa through SIMD vectorization and multi-core parallelization.

ML inference layer (Python + MAX Graph) — GPU-accelerated neural network inference without PyTorch CUDA. Runs natively on DGX Spark SM_121 ARM64 via MAX Engine:

Model Purpose Backend
AudioEncoder HuBERT / ContentVec content features MAX Graph GPU
PitchExtractor RMVPE pitch (F0) estimation MAX Graph GPU + numpy BiGRU

Together these form the core of a voice conversion pipeline (content extraction → pitch extraction → synthesis) that runs fully on Spark without cloud or PyTorch.


ML Inference

AudioEncoder — HuBERT / ContentVec

Extracts content feature vectors from raw audio. Supports facebook/hubert-base-ls960 and lengyue233/content-vec-best. Automatically uses GPU if available.

from models import AudioEncoder
model = AudioEncoder.from_pretrained("facebook/hubert-base-ls960")
features = model.encode(audio_np) # [1, N] float32 @16kHz → [1, T, 768]

GPU pipeline: CNN feature extractor + positional conv (numpy, avoids MAX conv2d groups bug) + ×ばつ transformer blocks.

PitchExtractor — RMVPE

Extracts F0 (fundamental frequency) per 10ms frame. No PyTorch CUDA needed — runs on DGX Spark ARM64.

from models import PitchExtractor
model = PitchExtractor.from_pretrained() # downloads lj1995/VoiceConversionWebUI/rmvpe.pt
f0_hz = model.extract(audio_np) # [1, N] float32 @16kHz → [T] float32 Hz, 0=unvoiced

Architecture: U-Net MAX Graph (5-level encoder + bottleneck + 5-level decoder) → numpy BiGRU → pitch salience bins → Hz per frame.

Running the models

# Fast tests (no download)
pixi run test-models
pixi run test-pitch-extractor
# Full correctness tests (downloads model weights ~180–360MB)
pixi run test-models-full
pixi run test-pitch-extractor-full
# GPU benchmark
pixi run bench-models

DSP Layer

Mel Spectrogram (Mojo)

Whisper-compatible mel spectrogram preprocessing — 20–40% faster than librosa.

from audio import mel_spectrogram
var mel = mel_spectrogram(audio) // (80, 2998) for 30s @16kHz, ~12ms with -O3

Performance:

30-second audio @16kHz:
librosa (Python): 15ms (1993x realtime)
mojo-audio (-O3): 12ms (2457x realtime) ← 20–40% faster

Optimization journey: 476ms (naive) → 12ms (-O3) = 40x total speedup through iterative FFT, RFFT, twiddle caching, sparse mel filterbank, SIMD float32, radix-4 butterflies, and multi-core parallelization.

Other DSP components

Module What it does
resample.mojo Lanczos resampler (48kHz → 16kHz)
vad.mojo Voice activity detection / silence trimming
pitch.mojo Phase vocoder pitch shifting
wav_io.mojo WAV file I/O
ffi/ C-compatible shared library (libmojo_audio.so)

Running DSP tests

# All Mojo DSP tests
pixi run test
# Individual
pixi run test-pitch
pixi run bench-optimized # mel spectrogram benchmark
pixi run bench-python # librosa baseline comparison

Installation

Requirements: pixi, Mojo 0.26+, Linux x86_64 or aarch64

git clone https://github.com/itsdevcoffee/mojo-audio.git
cd mojo-audio
pixi install

Build FFI shared library (for C/Rust/Python DSP integration):

pixi run build-ffi-optimized # → libmojo_audio.so (Linux) or .dylib (macOS)

See macOS Build Guide for macOS-specific setup.


Project Structure

mojo-audio/
├── src/
│ ├── audio.mojo # Mel spectrogram, FFT, STFT, windowing
│ ├── pitch.mojo # Phase vocoder pitch shifting
│ ├── resample.mojo # Lanczos resampler
│ ├── vad.mojo # Voice activity detection
│ ├── wav_io.mojo # WAV I/O
│ ├── ffi/ # C-compatible shared library exports
│ └── models/ # MAX Graph ML inference (Python)
│ ├── audio_encoder.py # HuBERT / ContentVec via MAX Graph
│ ├── pitch_extractor.py # RMVPE pitch extraction via MAX Graph
│ ├── _rmvpe.py # U-Net graph + numpy BiGRU
│ ├── _rmvpe_weight_loader.py
│ └── _weight_loader.py # HuBERT/ContentVec weight loader
├── tests/
│ ├── test_audio_encoder.py # AudioEncoder tests (pytest)
│ ├── test_pitch_extractor.py # PitchExtractor tests (pytest)
│ ├── test_fft.mojo # FFT correctness
│ ├── test_mel.mojo # Mel spectrogram
│ └── ... # Other Mojo DSP tests
├── experiments/
│ ├── hubert-max/ # HuBERT MAX Graph experiments
│ ├── contentvec-max/ # ContentVec benchmarks
│ └── max-bug-repro/ # MAX Engine bug reproductions
├── docs/
│ ├── plans/ # Implementation plans
│ ├── context/ # Architecture reference
│ └── project/ # Roadmap
└── pixi.toml

Platform Support

Platform DSP ML Inference
Linux x86_64 (NVIDIA RTX) ✅ GPU
Linux aarch64 (DGX Spark SM_121) ✅ GPU
macOS Apple Silicon ✅ CPU
macOS Intel ✅ CPU

Roadmap

The next steps are tracked in docs/project/03-06-2026-roadmap.md:

  • Sprint 2: Full GPU AudioEncoder (remove numpy bridge once MAX conv2d groups bug is fixed), phase-locked phase vocoder
  • Sprint 3: HiFiGAN vocoder in MAX Graph
  • Sprint 4: Full VITS synthesis — end-to-end voice conversion on Spark
  • Sprint 5: Shade integration and demo

Comparison

Feature mojo-audio librosa torchaudio RVC / Applio pyworld
DSP
Mel spectrogram via librosa
FFT / STFT via librosa partial
Resampling via librosa
Voice activity detection via silero
Phase vocoder pitch shift
iSTFT / Griffin-Lim
WAV I/O
C FFI / shared library
ML Inference
HuBERT content features ✅ MAX Graph ✅ PyTorch
ContentVec content features ✅ MAX Graph ✅ PyTorch
RMVPE pitch extraction ✅ MAX Graph ✅ PyTorch
WORLD pitch extraction via pyworld
GPU inference ✅ MAX Engine ✅ CUDA ✅ CUDA
Platform
Linux x86_64
DGX Spark ARM64
macOS Apple Silicon partial
PyTorch CUDA required
Performance
Mel spec vs librosa +20–40% baseline ~parity baseline
GPU inference without CUDA

Known Issues

MAX Engine conv2d groups bug (v26.1): ops.conv2d returns incorrect results when groups > 1 and kernel size is large (K≥128). Filed as modular/modular#6129. Workaround: HuBERT's pos_conv layer runs outside the MAX Graph via numpy.


Citation

@software{mojo_audio_2026,
 author = {Dev Coffee},
 title = {mojo-audio: Audio DSP and ML inference for voice conversion},
 year = {2026},
 url = {https://github.com/itsdevcoffee/mojo-audio}
}

GitHub | Issues | Roadmap

AltStyle によって変換されたページ (->オリジナル) /