Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Analyze token counts for all documents in a folder using HuggingFace tokenizers. Supports 15+ models, nested folders, zip files, PDF, DOCX, images (OCR), CLI and Streamlit UI.

License

Notifications You must be signed in to change notification settings

sgciv2805/folder-tokenizer

Repository files navigation

Folder Tokenizer

Analyze token counts for all documents in a folder. Supports nested folders, zip files, and various document types.

Features

  • Multiple tokenizers: Use any HuggingFace tokenizer (GPT-2, BERT, LLaMA, etc.)
  • Recursive scanning: Handles nested folders and zip archives
  • Wide format support: Text, code, PDF, DOCX, images (OCR), and more
  • Interactive UI: Streamlit-based interface for easy use
  • Detailed reports: Export results as CSV or JSON

Supported File Types

  • Text: .txt, .md, .json, .csv, .xml, .yaml, .yml
  • Code: .py, .js, .ts, .java, .c, .cpp, .go, .rs, .rb, etc.
  • Documents: .pdf, .docx
  • Images (OCR): .png, .jpg, .jpeg, .tiff, .bmp
  • Archives: .zip (automatically extracted and processed)

Tokenizer Accuracy

Different LLMs use different tokenization mechanisms. Token counts are model-specific:

Model Accuracy Notes
LLaMA 2/3, Mistral, Gemma Accurate Official tokenizer repos from Meta, Mistral, Google
GPT-4o Close Community HuggingFace port of OpenAI's tiktoken vocabulary
Claude Approximate Anthropic hasn't published their tokenizer; counts are estimated

Important notes:

  • LLaMA and Gemma models require a HuggingFace access token (gated models)
  • GPT-3.5 is not in the supported list; use GPT-4o for OpenAI model counts

Installation

# Clone the repository
git clone https://github.com/sgciv2805/folder-tokenizer
cd folder-tokenizer
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install core dependencies (CLI only)
pip install -e .
# Or install with the Streamlit web UI
pip install -e ".[ui]"

OCR Support (Optional)

For image OCR support, install Tesseract:

# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki

Usage

Web UI

Streamlit (cross-platform):

streamlit run src/folder_tokenizer/app.py

Node.js (macOS, includes native folder picker):

node server.js # opens http://localhost:3000

Command Line

Basic usage:

folder-tokenizer /path/to/folder --model gpt2

Common options:

# Show top 10 files by token count
folder-tokenizer /path/to/folder --verbose
# Quiet mode: output only the integer token count (useful for scripts)
folder-tokenizer /path/to/folder -q
# Export results as JSON or CSV
folder-tokenizer /path/to/folder --output results.json
folder-tokenizer /path/to/folder --csv results.csv
# Example: pipe token count to a variable
TOKEN_COUNT=$(folder-tokenizer /path/to/folder -q)
echo "Total tokens: $TOKEN_COUNT"

Python API

Use folder-tokenizer as a library:

from folder_tokenizer import FolderTokenizer
tokenizer = FolderTokenizer(model_name="gpt2")
result = tokenizer.process_folder("/path/to/folder")
print(f"Total tokens: {result.total_tokens}")
print(f"Total characters: {result.total_chars}")
print(f"Files processed: {len(result.files)}")

When You're Over the Context Window

If a corpus is too large for your context window, use --verbose to identify the heaviest files, then:

  • Exclude large generated files: node_modules/, dist/, .git/
  • Exclude lock files: *.lock, yarn.lock, package-lock.json
  • Use --csv to export results and manually select the most relevant files in a spreadsheet

License

MIT

About

Analyze token counts for all documents in a folder using HuggingFace tokenizers. Supports 15+ models, nested folders, zip files, PDF, DOCX, images (OCR), CLI and Streamlit UI.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

AltStyle によって変換されたページ (->オリジナル) /