Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

RightNow-AI/picolm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

History

7 Commits

Repository files navigation

C11 Binary Size RAM Zero Dependencies MIT License

PicoLM

Run a 1-billion parameter LLM on a 10ドル board with 256MB RAM.
Pure C. Zero dependencies. One binary. No Python. No cloud.

echo "Explain gravity" | ./picolm model.gguf -n 100 -j 4


The Perfect Match: PicoLM + PicoClaw

PicoLM was built as the local brain for PicoClaw — an ultra-lightweight AI assistant in Go that runs on 10ドル hardware. Together, they form a fully offline AI agent — no cloud, no API keys, no internet, no monthly bills.

Every other LLM provider needs the internet. PicoLM doesn't.

The Hardware The Architecture
9ドル.90 — that's the entire server PicoLM powers the LLM box in PicoClaw's agent loop

Why they're a perfect fit

Cloud Provider (OpenAI, etc.) PicoLM (Local)
Cost Pay per token, forever Free forever
Privacy Your data sent to servers Everything stays on-device
Internet Required for every request Not needed at all
Latency Network round-trip + inference Inference only
Hardware Needs a 599ドル Mac Mini Runs on a 10ドル board
Binary N/A ~80KB single file
RAM N/A 45 MB total

How it works

PicoClaw's agent loop spawns PicoLM as a subprocess. Messages come in from Telegram, Discord, or CLI — PicoClaw formats them into a chat template, pipes the prompt to picolm via stdin, and reads the response from stdout. When tools are needed, --json grammar mode guarantees valid JSON even from a 1B model.

Telegram / Discord / CLI
 │
 ▼
 ┌──────────┐ stdin: prompt ┌───────────┐
 │ PicoClaw │ ──────────────────► │ picolm │
 │ (Go) │ ◄────────────────── │ (C) │
 └──────────┘ stdout: response │ + model │
 │ └───────────┘
 ▼ 45 MB RAM
 User gets reply No internet

Quick setup

# 1. Build PicoLM
cd picolm && make native # or: make pi (Raspberry Pi)
# 2. Download model (one-time, 638 MB)
make model
# 3. Build PicoClaw
cd ../picoclaw && make deps && make build
# 4. Configure (~/.picoclaw/config.json)
{
 "agents": {
 "defaults": {
 "provider": "picolm",
 "model": "picolm-local"
 }
 },
 "providers": {
 "picolm": {
 "binary": "~/.picolm/bin/picolm",
 "model": "~/.picolm/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
 "max_tokens": 256,
 "threads": 4,
 "template": "chatml"
 }
 }
}
# 5. Chat — fully offline!
picoclaw agent -m "What is photosynthesis?"

Or install everything in one line

curl -sSL https://raw.githubusercontent.com/RightNow-AI/picolm/main/install.sh | bash

Performance on real hardware

Device Price Generation Speed RAM Used
Pi 5 (4-core) 60ドル ~10 tok/s 45 MB
Pi 4 (4-core) 35ドル ~8 tok/s 45 MB
Pi 3B+ 25ドル ~4 tok/s 45 MB
Pi Zero 2W 15ドル ~2 tok/s 45 MB
LicheeRV Nano 10ドル ~1 tok/s 45 MB

JSON tool calling

PicoClaw automatically activates --json grammar mode when it needs structured output. This guarantees syntactically valid JSON even from a 1B parameter model — essential for reliable tool calling on tiny hardware:

picoclaw agent -m "Search for weather in Tokyo"
# → PicoLM generates: {"tool_calls": [{"function": {"name": "web_search", "arguments": "{\"query\": \"weather Tokyo\"}"}}]}

For the full PicoClaw documentation, see the PicoClaw README.


What is PicoLM?

PicoLM is a minimal, from-scratch LLM inference engine written in ~2,500 lines of C11. It runs TinyLlama 1.1B (and other LLaMA-architecture models in GGUF format) on hardware that most inference frameworks won't even consider:

  • Raspberry Pi Zero 2W (15,ドル 512MB RAM, ARM Cortex-A53)
  • Sipeed LicheeRV (12,ドル 512MB RAM, RISC-V)
  • Raspberry Pi 3/4/5 (1-8GB RAM, ARM NEON SIMD)
  • Any Linux/Windows/macOS x86-64 machine

The model file (638MB) stays on disk. PicoLM memory-maps it and streams one layer at a time through RAM. Total runtime memory: ~45MB including the FP16 KV cache.

 ┌──────────────────────────────────────────┐
 What goes │ 45 MB Runtime RAM │
 in RAM │ ┌─────────┐ ┌──────────┐ ┌───────────┐ │
 │ │ Buffers │ │ FP16 KV │ │ Tokenizer │ │
 │ │ 1.2 MB │ │ Cache │ │ 4.5 MB │ │
 │ │ │ │ ~40 MB │ │ │ │
 │ └─────────┘ └──────────┘ └───────────┘ │
 └──────────────────────────────────────────┘
 ┌──────────────────────────────────────────┐
 What stays │ 638 MB Model on Disk │
 on disk │ (mmap — OS pages in layers │
 (via mmap) │ as needed, ~1 at a time) │
 └──────────────────────────────────────────┘

Features

Feature Description
GGUF Native Reads GGUF v2/v3 files directly — no conversion needed
K-Quant Support Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, Q4_0, F16, F32
mmap Layer Streaming Model weights stay on disk; OS pages in one layer at a time
FP16 KV Cache Halves KV cache memory (44MB vs 88MB for 2048 context)
Flash Attention Online softmax — no O(seq_len) attention buffer needed
Pre-computed RoPE cos/sin lookup tables eliminate transcendentals from hot loop
SIMD Acceleration ARM NEON (Pi 3/4/5) and x86 SSE2 (Intel/AMD) auto-detected
Fused Dot Products Dequantize + dot-product in one pass — no intermediate buffer
Multi-threaded matmul Parallel matrix-vector multiply across CPU cores
Grammar-Constrained JSON --json flag forces valid JSON output (for tool calling)
KV Cache Persistence --cache saves/loads prompt state — skip prefill on re-runs
BPE Tokenizer Score-based byte-pair encoding, loaded from GGUF metadata
Top-p Sampling Temperature + nucleus sampling with configurable seed
Pipe-friendly Reads prompts from stdin: echo "Hello" | ./picolm model.gguf
Zero Dependencies Only libc, libm, libpthread. No external libraries.
Cross-platform Linux, Windows (MSVC), macOS. ARM, x86-64, RISC-V.

Quick Start

One-liner install (Raspberry Pi / Linux)

curl -sSL https://raw.githubusercontent.com/RightNow-AI/picolm/main/install.sh | bash

This will:

  1. Detect your platform (ARM64, ARMv7, x86-64)
  2. Install build dependencies (gcc, make, curl)
  3. Build PicoLM with optimal SIMD flags for your CPU
  4. Download TinyLlama 1.1B Q4_K_M (638 MB)
  5. Run a quick test
  6. Generate PicoClaw config
  7. Add picolm to your PATH

Build from source

git clone https://github.com/rightnow-ai/picolm.git
cd picolm/picolm
# Auto-detect CPU (enables SSE2/AVX on x86, NEON on ARM)
make native
# Download a model
make model
# Run it
./picolm /opt/picolm/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
 -p "The meaning of life is" -n 100

Build on Windows (MSVC)

cd picolm
build.bat
picolm.exe model.gguf -p "Hello world" -n 50

Platform-specific builds

make native # x86/ARM auto-detect (recommended for local machine)
make pi # Raspberry Pi 3/4/5 (64-bit ARM + NEON SIMD)
make pi-arm32 # Pi Zero / Pi 1 (32-bit ARM)
make cross-pi # Cross-compile for Pi from x86 (static binary)
make riscv # RISC-V (Sipeed LicheeRV, etc.)
make static # Static binary for single-file deployment
make debug # Debug build with symbols, no optimization

Usage

PicoLM — ultra-lightweight LLM inference engine
Usage: picolm <model.gguf> [options]
Generation options:
 -p <prompt> Input prompt (or pipe via stdin)
 -n <int> Max tokens to generate (default: 256)
 -t <float> Temperature (default: 0.8, 0=greedy)
 -k <float> Top-p / nucleus sampling (default: 0.9)
 -s <int> RNG seed (default: 42)
 -c <int> Context length override
 -j <int> Number of threads (default: 4)
Advanced options:
 --json Grammar-constrained JSON output mode
 --cache <file> KV cache file (saves/loads prompt state)

Examples

Basic generation:

./picolm model.gguf -p "Once upon a time" -n 200

Greedy decoding (deterministic, temperature=0):

./picolm model.gguf -p "The capital of France is" -n 20 -t 0
# Output: Paris. It is the largest city in France and...

Chat with TinyLlama (ChatML format):

./picolm model.gguf -n 200 -t 0.7 -p "<|user|>
What is photosynthesis?</s>
<|assistant|>
"

Force JSON output (for tool calling / structured data):

./picolm model.gguf --json -t 0.3 -n 100 -p "<|user|>
Return the current time as JSON.</s>
<|assistant|>
"
# Output: {"time": "12:00 PM"}

Pipe from stdin:

echo "Explain quantum computing in one sentence" | ./picolm model.gguf -n 50

KV cache — skip repeated prefill:

# First run: processes prompt + saves cache
./picolm model.gguf --cache prompt.kvc -p "Long system prompt here..." -n 50
# Second run: loads cache, skips prompt prefill (74% faster)
./picolm model.gguf --cache prompt.kvc -p "Long system prompt here..." -n 50
# Output: "Skipping 25 cached prompt tokens"

Multi-threaded on a Pi 4 (4 cores):

./picolm model.gguf -p "Hello" -n 100 -j 4

Performance

Measured on TinyLlama 1.1B Q4_K_M (638 MB model):

Metric x86-64 (8 threads) Pi 4 (4 cores, NEON) Pi Zero 2W
Prefill ~11 tok/s ~6 tok/s ~1.5 tok/s
Generation ~13 tok/s ~8 tok/s* ~2 tok/s*
Runtime RAM 45 MB 45 MB 45 MB
First token ~2.3s ~4s ~16s
Binary size ~80 KB ~70 KB ~65 KB

*Estimated with NEON SIMD enabled. Actual numbers depend on SD card speed and thermal throttling.

What makes it fast

 Raw C inference ████████████░░░░░░░░ 13.5 tok/s (baseline: 1.6)
 + Fused dot products ████████████████░░░░ (eliminate dequant buffer)
 + Multi-threaded matmul █████████████████░░░ (4-8 cores in parallel)
 + FP16 KV cache █████████████████░░░ (halve memory bandwidth)
 + Pre-computed RoPE ██████████████████░░ (no sin/cos in hot loop)
 + Flash attention ██████████████████░░ (no O(n) attention alloc)
 + NEON/SSE2 SIMD ███████████████████░ (4-wide vector ops)
 + KV cache persistence ████████████████████ (skip prefill entirely)

Architecture

 ┌─────────────────────────────────┐
 │ picolm.c │
 │ CLI + Generation Loop │
 └──────┬──────────────┬───────────┘
 │ │
 ┌────────────┘ └────────────┐
 │ │
 ┌────────┴────────┐ ┌──────────┴──────────┐
 │ model.h/c │ │ sampler.h/c │
 │ GGUF Parser │ │ Temperature + │
 │ mmap Layer │ │ Top-p Sampling │
 │ Streaming │ └──────────┬──────────┘
 │ Forward Pass │ │
 │ KV Cache I/O │ ┌──────────┴──────────┐
 └───┬────────┬────┘ │ grammar.h/c │
 │ │ │ JSON Constraint │
 ┌────────┘ └───────┐ │ Logit Masking │
 │ │ └─────────────────────┘
┌─────┴──────┐ ┌───────┴────────┐
│ tensor.h/c │ │ tokenizer.h/c │
│ matmul │ │ BPE Encode │
│ rmsnorm │ │ Decode │
│ softmax │ │ Vocab Lookup │
│ rope │ └────────────────┘
│ silu │
│ threading │
└─────┬──────┘
 │
┌─────┴──────┐
│ quant.h/c │
│ Q4_K, Q6_K │
│ Q3_K, Q2_K │
│ FP16, F32 │
│ NEON + SSE │
│ Fused Dots │
└────────────┘

The LLaMA Forward Pass (what happens for each token)

×ばつ22 layers │ RMSNorm │─────────────────────────────────────────┐ │ │ │ │ Q = xb @ Wq │ Matrix-vector multiply (quantized) │ │ K = xb @ Wk │ Store K,V in FP16 KV cache │ │ V = xb @ Wv │ │ │ │ │ │ RoPE(Q, K) │ Rotary position encoding (table lookup)│ │ │ │ │ Attention │ Flash attention with online softmax │ │ (GQA 32→4) │ Grouped-query: 32 Q heads, 4 KV heads │ │ │ │ │ x += Out@Wo │ Output projection + residual │ │ │ │ │ RMSNorm │ │ │ │ │ │ SwiGLU FFN │ gate=SiLU(xb@Wg), up=xb@Wu │ │ │ x += (gate*up) @ Wd │ └───────┬───────┘─────────────────────────────────────────┘ │ ▼ ┌───────────────┐ │ Final RMSNorm │ │ x @ W_output │─→ logits[32000] └───────┬───────┘ │ ▼ ┌───────────────┐ │ Grammar Mask │ (if --json: force valid JSON structure) │ Sample Token │ temperature → softmax → top-p → pick └───────────────┘">
Input Token
 │
 ▼
┌───────────────┐
│ Embedding │ Dequantize row from token_embd → x[2048]
│ Lookup │
└───────┬───────┘
 │
 ▼
┌───────────────┐ ×ばつ22 layers
│ RMSNorm │─────────────────────────────────────────┐
│ │ │
│ Q = xb @ Wq │ Matrix-vector multiply (quantized) │
│ K = xb @ Wk │ Store K,V in FP16 KV cache │
│ V = xb @ Wv │ │
│ │ │
│ RoPE(Q, K) │ Rotary position encoding (table lookup)│
│ │ │
│ Attention │ Flash attention with online softmax │
│ (GQA 32→4) │ Grouped-query: 32 Q heads, 4 KV heads │
│ │ │
│ x += Out@Wo │ Output projection + residual │
│ │ │
│ RMSNorm │ │
│ │ │
│ SwiGLU FFN │ gate=SiLU(xb@Wg), up=xb@Wu │
│ │ x += (gate*up) @ Wd │
└───────┬───────┘─────────────────────────────────────────┘
 │
 ▼
┌───────────────┐
│ Final RMSNorm │
│ x @ W_output │─→ logits[32000]
└───────┬───────┘
 │
 ▼
┌───────────────┐
│ Grammar Mask │ (if --json: force valid JSON structure)
│ Sample Token │ temperature → softmax → top-p → pick
└───────────────┘

Memory Budget

For TinyLlama 1.1B Q4_K_M with 2048 context length:

Component Size Notes
FP16 KV cache ~40 MB 22 layers x 2 x 2048 x 256 x 2 bytes
Tokenizer ~4.5 MB 32K vocab strings + scores + sorted index
Activation buffers ~0.14 MB x, xb, xb2, q, hb, hb2
Logits buffer ~0.12 MB 32000 x 4 bytes
Dequant scratch ~0.02 MB Max(n_embd, n_ffn) floats
Norm weights (pre-dequant) ~0.35 MB 45 norm vectors x 2048 x 4 bytes
RoPE tables ~0.03 MB cos + sin x 2048 x 32 entries
Total runtime ~45 MB
Model file (on disk) 638 MB Memory-mapped, ~1 layer in RAM at a time

With 512 context (for constrained devices):

Component Size
FP16 KV cache ~10 MB
Everything else ~5 MB
Total ~15 MB

Optimizations Deep-Dive

PicoLM implements 9 optimizations that brought generation speed from 1.6 tok/s to 13.5 tok/s on x86, with even larger gains expected on ARM with NEON:

1. ARM NEON SIMD

4-wide float vector operations for all hot paths. Example: dequantizing Q4_K nibbles with vmovl_u8vmovl_u16vcvtq_f32_u32, and RoPE with interleaved vld2q_f32 / vst2q_f32.

2. x86 SSE2 SIMD

Auto-detected on Intel/AMD. 4-wide __m128 operations for dot products, RMSNorm, and vector operations.

3. FP16 KV Cache

Key and value vectors stored as 16-bit floats instead of 32-bit. Halves KV cache memory from ~88MB to ~44MB. Conversion uses software fp32_to_fp16() / fp16_to_fp32() — no hardware FP16 support required.

4. Pre-computed RoPE Tables

Sine and cosine values for all positions computed once at model load. The forward pass does a table lookup instead of calling sinf() / cosf() / powf() 64 times per token.

5. Flash Attention (Online Softmax)

Single-pass attention with running maximum rescaling. Eliminates the O(seq_len) attention score buffer — critical for long contexts on memory-constrained devices.

6. Fused Dequantize + Dot Product

vec_dot_q4_K_f32() dequantizes and accumulates in one pass. No intermediate float buffer for the weight row. Reduces memory traffic by ~50% for matmul.

7. Multi-threaded Matrix Multiply

matmul() distributes output rows across threads using pthreads. Each thread processes its chunk independently with fused dot products. Scales linearly up to ~8 cores.

8. Grammar-Constrained JSON

The --json mode pre-analyzes every token in the vocabulary at load time (brace delta, bracket delta, quote parity). During generation, it masks logits to guarantee syntactically valid JSON — essential for tool-calling with small models.

9. KV Cache Persistence

--cache file.kvc saves the FP16 KV cache state after prompt processing. On the next run with the same prompt, it loads the cache and skips prefill entirely. 74% latency reduction for repeated system prompts.


Supported Models

PicoLM supports any LLaMA-architecture model in GGUF format:

Model Parameters GGUF Size (Q4_K_M) RAM Needed
TinyLlama 1.1B 1.1B 638 MB ~45 MB
Llama 2 7B 7B 4.1 GB ~200 MB
Phi-2 2.7B 1.6 GB ~90 MB

Recommended for embedded: TinyLlama 1.1B Q4_K_M — fits comfortably on devices with 256MB+ RAM.

Supported quantization formats

Q2_K Q3_K Q4_K Q4_0 Q5_K Q6_K Q8_0 F16 F32


File Structure

PicoLM/
├── README.md ← you are here
├── BLOG.md ← technical deep-dive blog post
├── install.sh ← one-liner Pi installer
│
├── picolm/ ← the inference engine (pure C)
│ ├── picolm.c ← CLI entry point, generation loop (273 lines)
│ ├── model.h/c ← GGUF parser, mmap, forward pass (146 + 833 lines)
│ ├── tensor.h/c ← matmul, rmsnorm, softmax, rope (44 + 298 lines)
│ ├── quant.h/c ← dequantization, SIMD kernels (140 + 534 lines)
│ ├── tokenizer.h/c ← BPE tokenizer (32 + ~200 lines)
│ ├── sampler.h/c ← temperature + top-p sampling (19 + ~100 lines)
│ ├── grammar.h/c ← JSON grammar constraints (64 + 175 lines)
│ ├── Makefile ← build targets for all platforms
│ └── build.bat ← Windows MSVC build script
│
└── tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf ← model file (638 MB, not in git)

Total C source: ~2,500 lines. That's the entire inference engine — GGUF parsing, mmap, dequantization, matrix math, attention, tokenization, sampling, and grammar constraints.


How It Works

The mmap trick

Traditional inference engines load the entire model into RAM. PicoLM doesn't. Instead:

  1. The model file is memory-mapped (mmap on Linux/macOS, MapViewOfFile on Windows)
  2. Weight pointers point directly into the mapped file — no copying
  3. During the forward pass, each layer's weights are accessed sequentially
  4. The OS automatically pages in the needed weights and evicts old ones
  5. madvise(MADV_SEQUENTIAL) hints the access pattern to the kernel

Result: A 638MB model runs on a device with 256MB RAM. Only ~30MB of the model is in physical memory at any time.

Quantization

Weights are stored in 4-bit quantized format (Q4_K_M). For TinyLlama:

  • Original: 1.1B parameters x 4 bytes = 4.4 GB
  • Q4_K: 1.1B parameters x ~0.56 bytes = 638 MB
  • Quality loss: Minimal — Q4_K preserves 6-bit scales per 32-weight sub-block

Grouped-Query Attention (GQA)

TinyLlama uses 32 query heads but only 4 key/value heads. Each KV head is shared by 8 query heads. This reduces KV cache size by 8x compared to full multi-head attention.


Building & Testing

Prerequisites

Platform Requirements
Linux/Pi gcc, make (install via apt install build-essential)
macOS Xcode Command Line Tools (xcode-select --install)
Windows Visual Studio Build Tools (cl.exe)

Verify your build

# Build
make native
# Test with greedy decoding (deterministic output)
./picolm model.gguf -p "The capital of France is" -n 20 -t 0
# Expected: "Paris. It is the largest city in France..."
# Test JSON mode
./picolm model.gguf --json -p "Return JSON with name and age" -n 50 -t 0.3
# Expected: valid JSON like {"name": "...", "age": ...}
# Test KV cache
./picolm model.gguf --cache test.kvc -p "Hello" -n 10 -t 0
./picolm model.gguf --cache test.kvc -p "Hello" -n 10 -t 0
# Second run should say "Skipping N cached prompt tokens"

Memory verification

PicoLM prints memory stats to stderr:

Memory: 1.17 MB runtime state (FP16 KV cache separate)

Total = runtime state + FP16 KV cache. For TinyLlama with 2048 context: ~45 MB.


FAQ

Q: Can this run Llama 2 7B? A: Yes, if you have enough RAM for the KV cache (~1.4 GB for 7B with 4096 context). The model file stays on disk via mmap. On a Pi 4 with 4GB RAM, it works but is slow (~1-2 tok/s).

Q: Why not use llama.cpp? A: llama.cpp is excellent but requires ~200MB+ for the runtime on small models, has complex build dependencies, and targets desktop/server use cases. PicoLM is purpose-built for embedded: 45MB RAM, 80KB binary, zero dependencies.

Q: Is the output quality good? A: TinyLlama 1.1B is a small model — it handles simple tasks (Q&A, summarization, basic reasoning, JSON generation) well. It won't match GPT-4, but it runs on a 10ドル board with no internet. For structured output, the --json grammar mode guarantees valid JSON regardless of model quality.

Q: What about GPU acceleration? A: PicoLM is CPU-only by design. The target hardware (10ドル-15 boards) doesn't have GPUs. On x86/ARM CPUs, SIMD (NEON/SSE2) provides meaningful speedup.

Q: Can I use a different model? A: Any LLaMA-architecture GGUF model works. Download from HuggingFace and point PicoLM at it. Recommended quantizations: Q4_K_M (best quality/size balance) or Q2_K (smallest, lower quality).


Roadmap

  • AVX2/AVX-512 kernels for x86 (2-4x generation speed on modern CPUs)
  • Speculative decoding with a draft model
  • Context sliding window (infinite generation beyond max_seq_len)
  • Weight pruning for further memory reduction
  • Continuous batching for server mode
  • Mistral / Phi architecture support

Technical Blog

For a detailed writeup of the optimization journey (with code snippets and war stories), see BLOG.md.


License

MIT License. See LICENSE for details.


PicoLM — because intelligence shouldn't require a data center.

Releases

No releases published

Packages

Contributors

AltStyle によって変換されたページ (->オリジナル) /