rhowardstone/casestack

CaseStack

Turn any document dump into a searchable evidence database.

Built by the team behind epstein-data.com, where we turned the 218GB DOJ Epstein file release into a fully searchable, entity-linked, citation-backed research database.

License

This project is licensed under PolyForm Noncommercial License 1.0.0.

  • Noncommercial use is allowed under the terms in LICENSE.
  • Commercial use requires a separate commercial license from the project owner.
  • Required attribution notice is in NOTICE.
  • Third-party dependency notices are in THIRD_PARTY_NOTICES.md.

Install

pip install -e ".[pymupdf,nlp]"
python -m spacy download en_core_web_sm

Quickstart

# Point at a folder of PDFs, get a searchable database
casestack ingest ./my-documents --name "City Council FOIA"
# Serve it locally
casestack serve
# Check status
casestack status

Configuration

Copy case.yaml.example to case.yaml and customize it. See the example file for all available options.

How It Works

  1. OCR — Extract text from PDFs (Docling or PyMuPDF)
  2. Entity Extraction — Find people, orgs, dates, money, phone numbers (spaCy NER)
  3. Deduplication — Identify duplicate documents (content hash + fuzzy matching)
  4. Export — SQLite database with FTS5 full-text search
  5. Serve — Datasette web interface with search, filtering, and AI Q&A
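
To illustrate the deduplication step (content hash + fuzzy matching), here is a minimal sketch using only the Python standard library. It is not CaseStack's actual implementation: the function names, the whitespace/case normalization, the choice of SHA-256 and difflib, and the 0.9 similarity threshold are all assumptions made for this example.

```python
import hashlib
from difflib import SequenceMatcher


def content_hash(text: str) -> str:
    """SHA-256 of lowercased, whitespace-normalized text (hypothetical normalization)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(docs: dict[str, str], threshold: float = 0.9) -> list[tuple[str, str, str]]:
    """Return (kept_id, duplicate_id, reason) triples.

    Pass 1: an identical content hash means an exact duplicate.
    Pass 2: a difflib similarity ratio >= threshold means a fuzzy duplicate.
    """
    seen_hashes: dict[str, str] = {}  # digest -> id of the first document with that digest
    kept: list[str] = []              # ids of documents judged unique so far
    duplicates: list[tuple[str, str, str]] = []

    for doc_id, text in docs.items():
        digest = content_hash(text)
        if digest in seen_hashes:
            duplicates.append((seen_hashes[digest], doc_id, "exact"))
            continue

        # Compare against every kept document; stop at the first close match.
        match = next(
            (k for k in kept if SequenceMatcher(None, docs[k], text).ratio() >= threshold),
            None,
        )
        if match is not None:
            duplicates.append((match, doc_id, "fuzzy"))
            continue

        seen_hashes[digest] = doc_id
        kept.append(doc_id)

    return duplicates


sample_docs = {
    "a": "The council met on Monday.",
    "b": "the  council met on   monday.",   # same text after normalization
    "c": "The council met on Monday!",      # one character differs
    "d": "Budget report for fiscal year 2020.",
}
dupes = find_duplicates(sample_docs)
```

The pairwise comparison in the fuzzy pass is quadratic in the number of unique documents, so this shape only suits small collections. A release-scale corpus would need a sublinear candidate search such as MinHash, which is not shown here.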

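The export step's SQLite FTS5 search can be sketched with Python's built-in sqlite3 module. The table name `pages`, its columns, and the sample rows below are hypothetical and do not reflect CaseStack's real schema. The sketch also assumes the local SQLite build has FTS5 enabled, which is common but not guaranteed.

```python
import sqlite3

# In-memory database for illustration; a real export would write to a file.
conn = sqlite3.connect(":memory:")

# FTS5 virtual table: doc_id is stored but not indexed, body is full-text indexed.
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(doc_id UNINDEXED, body)")
conn.executemany(
    "INSERT INTO pages (doc_id, body) VALUES (?, ?)",
    [
        ("memo-001", "The council approved the zoning variance on Elm Street."),
        ("memo-002", "Budget hearing rescheduled to March."),
        ("memo-003", "Zoning appeal filed by the Elm Street residents."),
    ],
)
conn.commit()


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list[tuple[str, str]]:
    """Return (doc_id, snippet) pairs for an FTS5 query, best matches first."""
    return conn.execute(
        # snippet() column index 1 is `body`; matched terms are wrapped in [ ].
        "SELECT doc_id, snippet(pages, 1, '[', ']', '...', 8) "
        "FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()


zoning_hits = search(conn, "zoning")
```

The query string uses FTS5 syntax, so `search(conn, "elm AND zoning")` narrows results to rows containing both terms. The default tokenizer is case-insensitive but does not stem, so "variance" will not match "variances".
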
Case Presets

Pre-configured case files for known document sets:

  • presets/epstein.yaml — DOJ Jeffrey Epstein File Release (218GB, 1.38M PDFs)
