This is a small, runnable demo that:
- Ingests a sample patent dataset (CSV)
- Builds a hybrid retrieval index (BM25 + vectors via Chroma + Sentence Transformers)
- Generates a cited brief with inline `[doc_id]` references
- Shows a Streamlit UI to test queries
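
The hybrid retrieval step listed above combines a keyword ranking (BM25) with a vector ranking. How the demo merges the two is not stated here; one common option is reciprocal rank fusion, sketched below under that assumption. The function name, the constant `k=60`, and the document ids are illustrative and are not taken from the project code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one list (best first).

    Assumption: each ranking is ordered best-first, e.g. one list from
    BM25 and one from the vector store. k dampens the weight of top ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical ids standing in for rows of the sample patent CSV.
bm25_ranking = ["p3", "p1", "p2"]
vector_ranking = ["p1", "p4", "p3"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that appear near the top of both lists rise to the front, so a patent matched by both keywords and embeddings outranks one matched by only a single retriever.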
```bash
# 1) Create and activate a virtual environment
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate

# 2) Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

# 3) Build the index (uses sample CSV in data/raw)
python -m src.ingest

# 4) Run the demo UI
streamlit run app/streamlit_app.py
```
Example queries to try:

- LLM watermarking methods
- drone swarming computer vision
- synthetic data generation patents
- transformer optimization energy efficiency
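
For a query like those above, the generated brief attaches an inline `[doc_id]` reference to its statements. The sketch below shows one way such a brief could be assembled and checked; `format_brief`, `unknown_citations`, and the ids `P001` and `P002` are hypothetical and do not come from the project code.

```python
import re


def format_brief(snippets):
    """Join (doc_id, sentence) pairs into one brief with inline [doc_id] citations."""
    return " ".join(f"{sentence} [{doc_id}]" for doc_id, sentence in snippets)


def unknown_citations(brief, known_ids):
    """Return cited ids that are not among the retrieved documents."""
    cited = set(re.findall(r"\[([^\]\s]+)\]", brief))
    return sorted(cited - set(known_ids))


# Hypothetical retrieved snippets for the query "LLM watermarking methods".
snippets = [
    ("P001", "Claims a token-level watermark."),
    ("P002", "Describes a statistical detector."),
]
brief = format_brief(snippets)
print(brief)
print(unknown_citations(brief, ["P001", "P002"]))
```

Checking citations against the retrieved ids is a cheap guard against a brief that cites a document the retriever never returned.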
- If `nltk` complains about missing data, the code has a fallback sentence splitter (no internet required).
- If `chromadb` install is problematic on your system, try updating pip and setuptools: `pip install --upgrade pip setuptools wheel`.
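
The `nltk` fallback mentioned above can be pictured roughly as follows. This is a minimal sketch, not the project's actual splitter: the function names and the regular expression are assumptions, and the regex is deliberately naive (it will split after abbreviations such as "Fig.").

```python
import re


def regex_split(text):
    """Naive offline splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_sentences(text):
    """Use nltk when it and its tokenizer data are available, else fall back."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # ImportError: nltk is not installed.
        # LookupError: nltk is installed but the tokenizer data is missing.
        return regex_split(text)


print(split_sentences("A drone swarm is claimed. It uses onboard vision."))
```

Because the fallback relies only on the standard library, it works without downloading any tokenizer data.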
Security: This starter uses public, synthetic sample data. Do not ingest client or restricted data.