sgciv2805/folder-tokenizer

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
public		public
src/folder_tokenizer		src/folder_tokenizer
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
server.js		server.js

Repository files navigation

Folder Tokenizer

Analyze token counts for all documents in a folder. Supports nested folders, zip files, and various document types.

Features

Multiple tokenizers: Use any HuggingFace tokenizer (GPT-2, BERT, LLaMA, etc.)
Recursive scanning: Handles nested folders and zip archives
Wide format support: Text, code, PDF, DOCX, images (OCR), and more
Interactive UI: Streamlit-based interface for easy use
Detailed reports: Export results as CSV or JSON

Supported File Types

Text: .txt, .md, .json, .csv, .xml, .yaml, .yml
Code: .py, .js, .ts, .java, .c, .cpp, .go, .rs, .rb, etc.
Documents: .pdf, .docx
Images (OCR): .png, .jpg, .jpeg, .tiff, .bmp
Archives: .zip (automatically extracted and processed)

Tokenizer Accuracy

Different LLMs use different tokenization mechanisms. Token counts are model-specific:

Model	Accuracy	Notes
LLaMA 2/3, Mistral, Gemma	Accurate	Official tokenizer repos from Meta, Mistral, Google
GPT-4o	Close	Community HuggingFace port of OpenAI's tiktoken vocabulary
Claude	Approximate	Anthropic hasn't published their tokenizer; counts are estimated

Important notes:

LLaMA and Gemma models require a HuggingFace access token (gated models)
GPT-3.5 is not in the supported list; use GPT-4o for OpenAI model counts

Installation

# Clone the repository
git clone https://github.com/sgciv2805/folder-tokenizer
cd folder-tokenizer
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install core dependencies (CLI only)
pip install -e .
# Or install with the Streamlit web UI
pip install -e ".[ui]"

OCR Support (Optional)

For image OCR support, install Tesseract:

# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki

Usage

Web UI

Streamlit (cross-platform):

streamlit run src/folder_tokenizer/app.py

Node.js (macOS, includes native folder picker):

node server.js # opens http://localhost:3000

Command Line

Basic usage:

folder-tokenizer /path/to/folder --model gpt2

Common options:

# Show top 10 files by token count
folder-tokenizer /path/to/folder --verbose
# Quiet mode: output only the integer token count (useful for scripts)
folder-tokenizer /path/to/folder -q
# Export results as JSON or CSV
folder-tokenizer /path/to/folder --output results.json
folder-tokenizer /path/to/folder --csv results.csv
# Example: pipe token count to a variable
TOKEN_COUNT=$(folder-tokenizer /path/to/folder -q)
echo "Total tokens: $TOKEN_COUNT"

Python API

Use folder-tokenizer as a library:

from folder_tokenizer import FolderTokenizer
tokenizer = FolderTokenizer(model_name="gpt2")
result = tokenizer.process_folder("/path/to/folder")
print(f"Total tokens: {result.total_tokens}")
print(f"Total characters: {result.total_chars}")
print(f"Files processed: {len(result.files)}")

When You're Over the Context Window

If a corpus is too large for your context window, use --verbose to identify the heaviest files, then:

Exclude large generated files: node_modules/, dist/, .git/
Exclude lock files: *.lock, yarn.lock, package-lock.json
Use --csv to export results and manually select the most relevant files in a spreadsheet

License

MIT

About

Analyze token counts for all documents in a folder using HuggingFace tokenizers. Supports 15+ models, nested folders, zip files, PDF, DOCX, images (OCR), CLI and Streamlit UI.

Releases

No releases published

Packages

No packages published

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

sgciv2805/folder-tokenizer

Folders and files

Latest commit

History

Repository files navigation

Folder Tokenizer

Features

Supported File Types

Tokenizer Accuracy

Installation

OCR Support (Optional)

Usage

Web UI

Command Line

Python API

When You're Over the Context Window

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Contributors 3

Uh oh!

Languages

License

sgciv2805/folder-tokenizer

Folders and files

Latest commit

History

Repository files navigation

Folder Tokenizer

Features

Supported File Types

Tokenizer Accuracy

Installation

OCR Support (Optional)

Usage

Web UI

Command Line

Python API

When You're Over the Context Window

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages