Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, a...

Rust 6,060 549 Updated Feb 17, 2026

black-forest-labs / flux

Official inference repo for FLUX.1 models

Python 25,209 1,854 Updated Jul 31, 2025

meta-llama / llama-models

Utilities intended for use with Llama models.

Python 7,482 1,332 Updated Feb 11, 2026

cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

Python 1,986 135 Updated Nov 7, 2025

rom1504 / img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.

Python 4,363 373 Updated Oct 19, 2025

naklecha / llama3-from-scratch

llama3 implementation one matrix multiplication at a time

Jupyter Notebook 15,234 1,286 Updated May 23, 2024

meta-llama / llama3

The official Meta Llama 3 GitHub site

Python 29,250 3,512 Updated Jan 26, 2025

facebookresearch / lightplane

Lightplane implements a highly memory-efficient differentiable radiance field renderer, and a module for unprojecting features from images to 3D grids.

Python 285 9 Updated Aug 6, 2024

pytorch / torchtitan

A PyTorch native platform for training generative AI models

Python 5,076 707 Updated Feb 17, 2026

rom1504 / cc2dataset

Easily convert Common Crawl to a dataset of captions and documents. Image/text, Audio/text, Video/text, ...

Python 320 26 Updated Dec 9, 2023

hammoudhasan / SynthCLIP

Code base of SynthCLIP: CLIP training with purely synthetic text-image pairs from LLMs and TTIs.

Python 102 2 Updated Mar 23, 2025

facebookresearch / MetaCLIP

NeurIPS 2025 Spotlight; ICLR 2024 Spotlight; CVPR 2024; EMNLP 2024

Python 1,813 75 Updated Nov 27, 2025

karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

Python 10,318 1,004 Updated Jul 1, 2024

gautierdag / bpeasy

Fast bare-bones BPE for modern tokenizer training

Python 176 6 Updated Jun 23, 2025

openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Python 17,299 1,379 Updated Feb 8, 2026

huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Python 2,893 245 Updated Feb 17, 2026

google-research-datasets / wit

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

1,099 45 Updated Sep 27, 2024