Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

lukecarr/litmus

Use this GitHub action with your project
Add this Action to an existing workflow or create a new one
View on Marketplace

Repository files navigation

Litmus

CI Release

Specification testing for structured LLM outputs.

Litmus lets you define test cases with input strings and expected JSON outputs, run them against LLM models through providers like OpenRouter, Cloudflare AI Gateway, OpenAI, Google Gemini, xAI, and Anthropic, and compare accuracy, latency, and throughput across models.

Example output

$ litmus run --tests example/tests.json --schema example/schema.json --prompt-file example/prompt.txt --model openai/gpt-4.1-nano --model mistralai/mistral-nemo 
Running 2 tests against openai/gpt-4.1-nano...
Running 2 tests against mistralai/mistral-nemo...
Litmus Test Report
──────────────────────────────────────────────────
Timestamp: 2025εΉ΄12月27ζ—₯T16:19:30Z
Test File: example/tests.json
Schema: example/schema.json
Model: openai/gpt-4.1-nano
──────────────────────────────────────────────────
Provider: OpenAI
Results: 2 passed / 0 failed (100.0% accuracy)
Tokens: 148 in / 34 out
Latency: P50=363ms P95=454ms P99=462ms
Duration: 2.11s (16.1 tok/s)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ TEST β”‚ STATUS β”‚ LATENCY β”‚ TOKENS β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Extract person info β”‚ βœ“ PASS β”‚ 263ms β”‚ 74/17 β”‚
β”‚ Extract another person β”‚ βœ“ PASS β”‚ 464ms β”‚ 74/17 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Model: mistralai/mistral-nemo
──────────────────────────────────────────────────
Provider: Mistral
Results: 2 passed / 0 failed (100.0% accuracy)
Tokens: 64 in / 56 out
Latency: P50=254ms P95=262ms P99=263ms
Duration: 763ms (73.4 tok/s)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ TEST β”‚ STATUS β”‚ LATENCY β”‚ TOKENS β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Extract person info β”‚ βœ“ PASS β”‚ 246ms β”‚ 32/28 β”‚
β”‚ Extract another person β”‚ βœ“ PASS β”‚ 263ms β”‚ 32/28 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Model Comparison
──────────────────────────────────────────────────
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ MODEL β”‚ PROVIDER β”‚ ACCURACY β”‚ P 50 LATENCY β”‚ TOK / S β”‚ TOKENS β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ openai/gpt-4.1-nano β”‚ OpenAI β”‚ 100.0% β”‚ 363ms β”‚ 16.1 β”‚ 182 β”‚
β”‚ mistralai/mistral-nemo β”‚ Mistral β”‚ 100.0% β”‚ 254ms β”‚ 73.4 β”‚ 120 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Installation

Download a pre-built binary from the latest release, or install with Go:

go install go.carr.sh/litmus@latest

Or compile from source:

git clone https://github.com/lukecarr/litmus.git
cd litmus
go build -o litmus .

Quick Start

  1. Set your OpenRouter API key:
export OPENROUTER_API_KEY="your-api-key"
  1. Create a test file (tests.json):
[
 {
 "name": "Extract person info",
 "input": "John Smith is 30 years old and works at Acme Corp",
 "expected": {
 "name": "John Smith",
 "age": 30,
 "company": "Acme Corp"
 }
 },
 {
 "name": "Extract another person",
 "input": "Jane Doe, age 25, is employed by TechStart Inc",
 "expected": {
 "name": "Jane Doe",
 "age": 25,
 "company": "TechStart Inc"
 }
 }
]
  1. Create a JSON schema (schema.json):
{
 "type": "object",
 "properties": {
 "name": { "type": "string" },
 "age": { "type": "integer" },
 "company": { "type": "string" }
 },
 "required": ["name", "age", "company"],
 "additionalProperties": false
}
  1. Create a prompt file (prompt.txt):
Extract the person's name, age, and company from the given text.
  1. Run tests:
litmus run --tests tests.json --schema schema.json --prompt-file prompt.txt --model openai/gpt-4.1-nano

GitHub Action

Run Litmus in a GitHub Actions workflow. The action annotates failing tests inline on the test file and writes a results table to the job summary:

- uses: lukecarr/litmus@v0.3.0
 with:
 tests: example/tests.json
 schema: example/schema.json
 prompt-file: example/prompt.txt
 model: openai/gpt-4.1-nano
 api-key: ${{ secrets.OPENROUTER_API_KEY }}

Each input maps to a litmus run flag, and output defaults to github. The tag you pin is the Litmus version that runs (@v0.3.0 runs Litmus v0.3.0; a branch or SHA runs the latest release). The step exits non-zero when any test fails. See the GitHub Actions guide for all inputs and Cloudflare setup.

Usage

Basic Command

litmus run --tests <test-file> --schema <schema-file> --prompt <prompt> --model <model>

Providers

Litmus sends requests through a provider selected with --provider.

OpenRouter

The default provider. Set your key with --api-key or the OPENROUTER_API_KEY environment variable:

export OPENROUTER_API_KEY="your-api-key"
litmus run --tests tests.json --schema schema.json --prompt-file prompt.txt --model openai/gpt-4.1-nano

OpenAI

Call the OpenAI API directly with --provider openai. Set your key with --api-key or the OPENAI_API_KEY environment variable. Direct providers use the bare model name, without a provider/ prefix:

export OPENAI_API_KEY="your-api-key"
litmus run --provider openai --tests tests.json --schema schema.json --prompt-file prompt.txt --model gpt-4o

Google Gemini

Call the Gemini API directly with --provider google (alias gemini), through Google's OpenAI-compatible endpoint. Set your key with --api-key, GEMINI_API_KEY, or GOOGLE_API_KEY:

export GEMINI_API_KEY="your-api-key"
litmus run --provider google --tests tests.json --schema schema.json --prompt-file prompt.txt --model gemini-2.5-flash

xAI (Grok)

Call the xAI API directly with --provider xai (alias grok). Set your key with --api-key or the XAI_API_KEY environment variable:

export XAI_API_KEY="your-api-key"
litmus run --provider xai --tests tests.json --schema schema.json --prompt-file prompt.txt --model grok-4

Anthropic (Claude)

Call the Anthropic API directly with --provider anthropic (alias claude). Set your key with --api-key or the ANTHROPIC_API_KEY environment variable:

export ANTHROPIC_API_KEY="your-api-key"
litmus run --provider anthropic --tests tests.json --schema schema.json --prompt-file prompt.txt --model claude-opus-4-8

Anthropic uses its native Messages API rather than an OpenAI-compatible endpoint. Litmus enforces your schema by forcing a tool call whose input is the structured response.

Cloudflare AI Gateway

Pass --provider cloudflare and point Litmus at your gateway with --cf-account-id and --cf-gateway. Models use the same provider/model names as OpenRouter.

There are two ways to supply credentials, and you can combine them:

  • A downstream provider key via --api-key (or CLOUDFLARE_API_KEY). Litmus sends it as the Authorization header. This is the key for the model's own provider, for example your OpenAI key.
  • A gateway token via --cf-token (or CF_AIG_TOKEN). Litmus sends it as the cf-aig-authorization header. It is required when the gateway has authentication enabled, and it is sufficient on its own when the gateway stores provider keys for you.
export CLOUDFLARE_ACCOUNT_ID="your-account-id"
export CLOUDFLARE_GATEWAY_ID="your-gateway"
litmus run \
 --provider cloudflare \
 --api-key "$OPENAI_API_KEY" \
 --tests tests.json \
 --schema schema.json \
 --prompt-file prompt.txt \
 --model openai/gpt-4.1-nano

A single --api-key is sent as the upstream Authorization header on every request, so it only works when all the models you compare share one upstream provider. To compare models from different upstream providers in one run, store the provider keys in the gateway and authenticate with --cf-token alone.

Flags

Flag Short Description
--tests -t Path to test cases JSON file (required)
--schema -s Path to JSON schema file (required)
--prompt -p System prompt for the LLM
--prompt-file Path to file containing system prompt
--model -m Model to test against (required, can be repeated)
--parallel -P Number of parallel requests per model (default: 1)
--output -o Output format: terminal, json, html, or github (default: terminal)
--provider LLM provider: openrouter (default), cloudflare, openai, google, xai, or anthropic
--api-key Provider API key. OpenRouter: OPENROUTER_API_KEY. Cloudflare: the downstream provider key, or CLOUDFLARE_API_KEY. OpenAI: OPENAI_API_KEY. Google: GEMINI_API_KEY. xAI: XAI_API_KEY. Anthropic: ANTHROPIC_API_KEY
--cf-account-id Cloudflare account ID (or CLOUDFLARE_ACCOUNT_ID), used with --provider cloudflare
--cf-gateway Cloudflare AI Gateway ID (or CLOUDFLARE_GATEWAY_ID), used with --provider cloudflare
--cf-token Cloudflare AI Gateway token for authenticated gateways (or CF_AIG_TOKEN)

Examples

Single model:

litmus run \
 --tests tests.json \
 --schema schema.json \
 --prompt-file prompt.txt \
 --model openai/gpt-4.1-nano

Multiple models for comparison:

litmus run \
 --tests tests.json \
 --schema schema.json \
 --prompt "Extract entities from the text" \
 --model openai/gpt-4.1-nano \
 --model mistralai/mistral-nemo

Parallel execution:

litmus run \
 --tests tests.json \
 --schema schema.json \
 --prompt-file prompt.txt \
 --model openai/gpt-4.1-nano \
 --parallel 5

JSON output for CI/CD:

litmus run \
 --tests tests.json \
 --schema schema.json \
 --prompt-file prompt.txt \
 --model openai/gpt-4.1-nano \
 --output json > results.json

HTML report:

litmus run \
 --tests tests.json \
 --schema schema.json \
 --prompt-file prompt.txt \
 --model openai/gpt-4.1-nano \
 --output html > report.html

Test File Format

The test file is a JSON array of test cases:

[
 {
 "name": "Test name (for display)",
 "input": "The input text to send to the LLM",
 "expected": {
 "field1": "expected value",
 "field2": 123
 }
 }
]
  • name: A human-readable name for the test case
  • input: The user message sent to the LLM
  • expected: The expected JSON output (must match the schema)

JSON Schema

The schema file should be a valid JSON Schema. It is passed to the provider's response_format parameter to enforce structured output from the LLM.

Example schema:

{
 "type": "object",
 "properties": {
 "sentiment": {
 "type": "string",
 "enum": ["positive", "negative", "neutral"]
 },
 "confidence": {
 "type": "number",
 "minimum": 0,
 "maximum": 1
 }
 },
 "required": ["sentiment", "confidence"],
 "additionalProperties": false
}

Output

Litmus supports four output formats via the --output flag:

  • terminal (default): Colored, formatted output for the terminal
  • json: Machine-readable JSON for CI/CD pipelines
  • html: Self-contained HTML report for sharing and archiving
  • github: GitHub Actions workflow commands with inline annotations and a job summary

Terminal Output

The terminal output includes:

  • Provider used for each model
  • Summary metrics (pass/fail counts, accuracy %)
  • Token usage and throughput (tokens/second)
  • Latency percentiles (P50, P95, P99)
  • Detailed test results table
  • Field-level diff for failures
  • Model comparison table (when testing multiple models)

JSON Output

Use --output json to get machine-readable output:

{
 "timestamp": "2025εΉ΄12月27ζ—₯T16:19:30Z",
 "prompt": "Extract entities...",
 "schema_file": "schema.json",
 "test_file": "tests.json",
 "models": [
 {
 "model": "openai/gpt-4.1-nano",
 "results": [...],
 "metrics": {
 "total_tests": 10,
 "passed": 9,
 "failed": 1,
 "accuracy": 90.0,
 "latency_p50_ms": 450,
 "throughput_tps": 25.5
 }
 }
 ]
}

HTML Output

Use --output html to generate a self-contained HTML report:

litmus run \
 --tests tests.json \
 --schema schema.json \
 --prompt-file prompt.txt \
 --model openai/gpt-4.1-nano \
 --output html > report.html

The HTML report includes all the same information as the terminal output, formatted for viewing in a browser. It's self-contained with no external dependencies, making it easy to share or archive.

HTML Report Screenshot

GitHub Actions Output

Use --output github inside a GitHub Actions workflow:

litmus run \
 --tests tests.json \
 --schema schema.json \
 --prompt-file prompt.txt \
 --model openai/gpt-4.1-nano \
 --output github

Each failed or errored test becomes an inline annotation on the test file, at the line where the test is defined, and Litmus appends a results table to the run's job summary. litmus run exits non-zero when any test fails, so the step fails on a regression. See Output Formats for details.

Exit Codes

  • 0: All tests passed
  • 1: One or more tests failed or errored

Supported Models

With OpenRouter, Litmus works with any model in the OpenRouter catalog. With Cloudflare AI Gateway, it works with any model your gateway routes to, named in the same provider/model form. See the Cloudflare AI Gateway docs for the providers it supports.

License

Litmus is licensed under the MIT License.

About

Specification testing for structured LLM responses.

Topics

Resources

License

Stars

Watchers

Forks

Contributors

AltStyle γ«γ‚ˆγ£γ¦ε€‰ζ›γ•γ‚ŒγŸγƒšγƒΌγ‚Έ (->γ‚ͺγƒͺγ‚ΈγƒŠγƒ«) /