Token Optimization
OCC is designed to minimize LLM token usage. This guide covers every technique available.
Traditional approaches stuff everything into one giant prompt. OCC decomposes work into focused steps, each receiving only the context it needs.

    Traditional: 1 prompt × 40K tokens = 40K tokens
    OCC: 6 prompts × ~2.5K tokens = ~15K tokens

The savings come from isolation — each step only pays for the tokens it actually needs.
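The comparison above is plain multiplication, and it can be checked with a few lines of Python. This is only an illustrative calculation using the example figures from this guide; the function name is hypothetical and not part of OCC.

```python
# Illustrative arithmetic only; the figures are the example numbers quoted above.

def total_tokens(prompts: int, tokens_per_prompt: int) -> int:
    """Total input tokens for a run of equally sized prompts."""
    return prompts * tokens_per_prompt


traditional = total_tokens(1, 40_000)  # one giant prompt
occ = total_tokens(6, 2_500)           # six focused steps
savings = 1 - occ / traditional        # fraction of tokens avoided

print(f"traditional={traditional} occ={occ} savings={savings:.1%}")
```

With these example figures the decomposed chain uses 15K tokens against 40K, a reduction of a little over 60%.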
Each step gets its own prompt with only its dependencies injected. No conversation history accumulation.

    # Step A outputs 3000 tokens of research
    # Step B outputs 2000 tokens of analysis
    # Step C only needs Step B's output, so it never sees Step A's 3000 tokens
    steps:
      - id: a
        prompt: "Research {input.topic}"
        output_var: research
      - id: b
        depends_on: [a]
        prompt: "Analyze: {research}"
        output_var: analysis
      - id: c
        depends_on: [b]  # Only depends on B, not A
        prompt: "Summarize: {analysis}"  # Receives ~2000 tokens, not 5000
        output_var: summary

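To make the isolation idea concrete, here is a minimal Python sketch of how a prompt might be resolved from direct dependencies only. It is not OCC's actual implementation: `visible_vars` and `resolve_prompt` are hypothetical helpers, and the placeholder syntax is assumed to be `{name}` or `{input.name}` as in the chain above.

```python
import re

STEPS = [
    {"id": "a", "depends_on": [], "prompt": "Research {input.topic}", "output_var": "research"},
    {"id": "b", "depends_on": ["a"], "prompt": "Analyze: {research}", "output_var": "analysis"},
    {"id": "c", "depends_on": ["b"], "prompt": "Summarize: {analysis}", "output_var": "summary"},
]


def visible_vars(step, steps, outputs, inputs):
    """Collect the chain inputs plus the outputs of direct dependencies only."""
    by_id = {s["id"]: s for s in steps}
    scope = {f"input.{key}": value for key, value in inputs.items()}
    for dep_id in step["depends_on"]:
        var = by_id[dep_id]["output_var"]
        scope[var] = outputs[var]
    return scope


def resolve_prompt(step, steps, outputs, inputs):
    """Fill placeholders from the step's scope; anything out of scope raises KeyError."""
    scope = visible_vars(step, steps, outputs, inputs)
    return re.sub(r"\{([\w.]+)\}", lambda m: scope[m.group(1)], step["prompt"])


# Pretend outputs: 3000 characters of research, 2000 characters of analysis.
outputs = {"research": "R" * 3000, "analysis": "A" * 2000}
prompt_c = resolve_prompt(STEPS[2], STEPS, outputs, {"topic": "caching"})
print(len(prompt_c))
```

Step C's resolved prompt contains the analysis text but none of the research text, because `research` never enters its scope.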
Transform steps manipulate data without LLM calls:

    # Research step returns 5000 tokens of JSON
    - id: research
      prompt: "Research and return JSON with key_findings, sources, raw_data"
      output_var: raw_research

    # Extract only key_findings: costs 0 tokens
    - id: extract
      type: transform
      operation: json_extract
      input_var: raw_research
      json_path: "key_findings"
      prompt: "extract"
      output_var: findings  # Now ~500 tokens

    # Next step receives 500 tokens instead of 5000
    - id: report
      depends_on: [extract]
      prompt: "Write report from: {findings}"
      output_var: report

Available zero-token operations: json_extract, regex_match, template, split, merge, truncate, replace, filter, map, join, to_json, from_json
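A transform such as `json_extract` costs no tokens because it is ordinary local code. The following Python sketch shows the general idea; it assumes a simple dot-separated key path, which may differ from the path syntax OCC actually accepts.

```python
import json


def json_extract(raw: str, json_path: str):
    """Follow a dot-separated path of keys through parsed JSON (assumed syntax)."""
    node = json.loads(raw)
    for key in json_path.split("."):
        node = node[key]
    return node


# Stand-in for a large LLM response; the content is made up for illustration.
raw_research = json.dumps({
    "key_findings": ["finding 1", "finding 2"],
    "sources": ["..."] * 50,
    "raw_data": "x" * 4000,
})

findings = json_extract(raw_research, "key_findings")
print(len(raw_research), len(json.dumps(findings)))
```

Only the extracted value is passed downstream, so the next prompt is much smaller than the raw response.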
Use cheap models for simple tasks, expensive models for critical ones:

    steps:
      # Simple classification: use Haiku ($0.25/M input)
      - id: classify
        model: claude-haiku-4-5
        prompt: "Classify this text: {input.text}. Return: positive, negative, or neutral."
        output_var: sentiment

      # Complex analysis: use Sonnet ($3/M input)
      - id: analyze
        model: claude-sonnet-4-6
        depends_on: [classify]
        prompt: "Deep analysis of {input.text} (classified as {sentiment})"
        output_var: analysis

      # Critical synthesis: use Opus ($15/M input)
      - id: synthesize
        model: claude-opus-4-6
        depends_on: [analyze]
        prompt: "Produce final executive report from: {analysis}"
        output_var: report

Cost breakdown:
| Step | Model | Input tokens | Cost |
|---|---|---|---|
| classify | Haiku | ~200 | $0.00005 |
| analyze | Sonnet | ~2000 | $0.006 |
| synthesize | Opus | ~3000 | $0.045 |
| Total | | ~5200 | $0.051 |
Compare with running everything on Opus: ~8000 tokens × $15/M = $0.12 (about 2.4x more expensive).
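The cost figures above can be reproduced with a short calculation. This sketch uses the per-million input prices quoted in this guide, which may not match current provider pricing; the dictionary and function names are hypothetical.

```python
# USD per million input tokens, as quoted in this guide (verify against current pricing).
PRICE_PER_M = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}


def input_cost(model: str, tokens: int) -> float:
    """Input cost in USD for a given model tier and token count."""
    return tokens * PRICE_PER_M[model] / 1_000_000


routed = [("haiku", 200), ("sonnet", 2000), ("opus", 3000)]
routed_cost = sum(input_cost(model, tokens) for model, tokens in routed)
opus_only_cost = input_cost("opus", 8000)
ratio = opus_only_cost / routed_cost

print(f"routed=${routed_cost:.5f} opus_only=${opus_only_cost:.2f} ratio={ratio:.2f}x")
```

The ratio comes out slightly above 2.3, which the text rounds to about 2.4x.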
Identical prompts skip the LLM entirely:

    - id: expensive_research
      cache:
        enabled: true
        ttl_minutes: 120  # Cache for 2 hours
      prompt: "Research {input.topic}"
      output_var: research

Cache key = hash of (step ID + resolved prompt + model). If the same chain runs with the same inputs within the TTL, the cached result is returned instantly (0 tokens, 0 cost, ~1ms).
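As a rough illustration of the cache-key idea, the sketch below hashes the step ID, resolved prompt, and model together and stores results with a time-to-live. The choice of SHA-256, the JSON serialization of the key parts, and the `TTLCache` class are all assumptions made for this example, not a description of OCC's internals.

```python
import hashlib
import json
import time


def cache_key(step_id: str, resolved_prompt: str, model: str) -> str:
    """Hash the three key parts; JSON encoding keeps the field boundaries unambiguous."""
    payload = json.dumps([step_id, resolved_prompt, model])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory cache whose entries expire after ttl_minutes."""

    def __init__(self, ttl_minutes: float, clock=time.monotonic):
        self.ttl_seconds = ttl_minutes * 60
        self.clock = clock
        self._store = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl_seconds:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._store[key] = (self.clock(), value)


cache = TTLCache(ttl_minutes=120)
key = cache_key("expensive_research", "Research caching", "claude-sonnet-4-6")
if cache.get(key) is None:
    cache.put(key, "result from the LLM call")  # the real call would happen here
print(cache.get(key))
```

Because the resolved prompt is part of the key, changing any input that appears in the prompt produces a different key and therefore a cache miss.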
Skip expensive steps when they're not needed:

    - id: quick_check
      model: claude-haiku-4-5
      prompt: "Does this code have security issues? Answer YES or NO: {input.code}"
      output_var: has_issues

    - id: deep_audit
      depends_on: [quick_check]
      condition: '{has_issues} contains "YES"'  # Only runs if issues found
      model: claude-opus-4-6
      prompt: "Full security audit of: {input.code}"
      output_var: audit

If has_issues is "NO", the deep_audit step is skipped — saving potentially thousands of tokens.
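One way to picture the condition check is as a tiny expression evaluator. The sketch below handles only the `{var} contains "text"` form used above and treats the match as case-sensitive; both the grammar and the `should_run` helper are assumptions for illustration, and OCC may support more operators.

```python
import re

# Matches conditions of the form: {variable} contains "literal"
CONDITION = re.compile(r'^\{(\w+)\}\s+contains\s+"(.*)"$')


def should_run(condition: str, variables: dict[str, str]) -> bool:
    """Return True when the named variable's value contains the quoted literal."""
    match = CONDITION.match(condition.strip())
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    var_name, needle = match.groups()
    return needle in variables[var_name]


condition = '{has_issues} contains "YES"'
print(should_run(condition, {"has_issues": "YES"}))
print(should_run(condition, {"has_issues": "NO"}))
```

A skipped step makes no LLM call at all, so its cost is zero regardless of how large its prompt would have been.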
Stop the chain when the answer is found:

    - id: check_cache
      prompt: "Is this answer in the knowledge base? {input.question}"
      output_var: cached_answer
      early_exit_if: '{cached_answer} != "NOT_FOUND"'

    - id: research
      depends_on: [check_cache]
      prompt: "Research: {input.question}"  # Never runs if cache hit
      output_var: research

    - id: synthesize
      depends_on: [research]
      prompt: "Answer from research: {research}"  # Never runs if cache hit
      output_var: answer

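The control flow behind early exit can be sketched as a loop that stops as soon as a step's exit test passes. In this simplified Python version the exit test is a plain callable rather than an expression string, and `run_chain` and `fake_llm` are hypothetical stand-ins, not OCC functions.

```python
def run_chain(steps, call_llm):
    """Run steps in order; stop after any step whose early-exit test passes."""
    outputs = {}
    for step in steps:
        outputs[step["output_var"]] = call_llm(step["prompt"])
        exit_test = step.get("early_exit_if")
        if exit_test is not None and exit_test(outputs):
            break  # remaining steps never call the LLM
    return outputs


calls = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call that records every prompt it receives."""
    calls.append(prompt)
    return "Paris" if prompt.startswith("Is this answer") else "..."


steps = [
    {
        "prompt": "Is this answer in the knowledge base? capital of France",
        "output_var": "cached_answer",
        "early_exit_if": lambda out: out["cached_answer"] != "NOT_FOUND",
    },
    {"prompt": "Research: capital of France", "output_var": "research"},
    {"prompt": "Answer from research", "output_var": "answer"},
]

result = run_chain(steps, fake_llm)
print(len(calls), result)
```

In this run only the first step reaches the model; the research and synthesis prompts are never sent.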
Control how dependency outputs are compressed:

    - id: final_step
      depends_on: [big_research, small_facts]
      context_strategy:
        big_research: "truncate:2000"  # Trim to 2000 chars
        small_facts: "full"            # Keep as-is
      prompt: |
        Research (condensed): {big_research}
        Key facts: {small_facts}
      output_var: result

Options: full (default), summarize (LLM compression), truncate:N (hard cut at N chars)
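The two zero-cost options are simple string operations. The sketch below parses a strategy string and applies `full` or `truncate:N`; it deliberately leaves `summarize` unimplemented because that option needs an LLM call. The `apply_strategy` helper is hypothetical.

```python
def apply_strategy(text: str, strategy: str = "full") -> str:
    """Apply a context strategy string to one dependency's output."""
    if strategy == "full":
        return text
    if strategy.startswith("truncate:"):
        limit = int(strategy.split(":", 1)[1])
        return text[:limit]  # hard cut at N characters
    if strategy == "summarize":
        raise NotImplementedError("summarize needs an LLM call; not sketched here")
    raise ValueError(f"unknown strategy: {strategy!r}")


big_research = "x" * 5000
small_facts = "fact one; fact two"

condensed = apply_strategy(big_research, "truncate:2000")
kept = apply_strategy(small_facts, "full")
print(len(condensed), len(kept))
```

Note that `truncate:N` counts characters rather than tokens, so the token saving depends on the text.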
When combining parallel outputs, choose the cheapest strategy:

    - id: combine
      type: merge
      inputs: [research_a, research_b, research_c]
      strategy: pick_best  # LLM picks 1 of 3; cheaper than summarizing all 3
      prompt: "Pick the most relevant research."
      output_var: best_research

| Strategy | LLM cost | Output size |
|---|---|---|
| concatenate | 0 tokens | Sum of all inputs |
| json_array | 0 tokens | Sum of all inputs |
| pick_best | ~input size | 1 input's worth |
| llm_summarize | ~input size | Compressed |
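The two zero-token strategies in the table are ordinary local operations, which is why they cost nothing but also do not shrink the output. A minimal sketch, assuming a blank-line separator for `concatenate` (the separator OCC actually uses is not stated here) and a hypothetical `merge` helper:

```python
import json


def merge(inputs: list[str], strategy: str) -> str:
    """Combine parallel outputs without calling an LLM."""
    if strategy == "concatenate":
        return "\n\n".join(inputs)  # separator is an assumption
    if strategy == "json_array":
        return json.dumps(inputs)
    raise ValueError(f"strategy {strategy!r} needs an LLM call or is unknown")


parts = ["research A", "research B", "research C"]
joined = merge(parts, "concatenate")
as_array = merge(parts, "json_array")
print(len(joined), len(as_array))
```

Both results are at least as long as the inputs combined, so they suit small outputs; for large ones, `pick_best` or `llm_summarize` trades an LLM call for a smaller downstream prompt.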
Bad — Asking Claude to search inside the prompt (costs tokens for the tool call overhead):

    prompt: "Search the web for {input.topic} and then analyze the results"

Good — Pre-tool does the search, data is ready:

    pre_tools:
      - type: web_search
        query: "{input.topic} latest research"
        inject_as: data
    prompt: "Analyze this data: {data}"

The pre-tool approach is cleaner AND cheaper because the LLM doesn't waste tokens on tool-calling overhead.
| Chain type | Steps | Avg tokens/step | Total | Est. cost (Sonnet) |
|---|---|---|---|---|
| Simple (3 steps) | 3 | 1500 | 4.5K | $0.014 |
| Medium (6 steps) | 6 | 2500 | 15K | $0.045 |
| Complex (12 steps) | 12 | 3000 | 36K | $0.108 |
| Pipeline (3 chains × 5 steps) | 15 | 2000 | 30K | $0.090 |
Caching can reduce repeat executions to $0.
- Chain Format — Context strategy and caching configuration
- Step Types — Transform steps (zero-token operations)
- Architecture — How step isolation works internally