Production-style multi-agent research automation over private documents.
The project covers agent orchestration, tool boundaries, retrieval, critique, report generation, tracing, evaluation, CI, and reproducible artifacts.
agentic-research-ops turns a broad research request into a cited technical
report using a graph of specialized agents:
| Agent | Responsibility | Output |
|---|---|---|
| Planner | Decomposes the user request into focused research tasks | Typed task plan |
| Retriever | Searches a private markdown corpus with BM25 | Ranked evidence cards |
| Critic | Checks source coverage, citation density, and missing support | Pass/revise verdict |
| Writer | Produces a report using only approved evidence IDs | Markdown report |
| Evaluator | Scores grounding, evidence count, trace metadata, and latency | JSON metrics |
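The table above can be read as a linear graph over a shared state object. The following is a minimal sketch of that idea, not the project's actual code: `GraphState`, `run_graph`, the node function names, and the `E1`-style evidence IDs are all hypothetical, and the retriever is a stub rather than a real BM25 search.

```python
from dataclasses import dataclass, field
from time import perf_counter


@dataclass
class GraphState:
    """Shared state handed from node to node (hypothetical shape)."""
    query: str
    tasks: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    verdict: str = ""
    report: str = ""
    metrics: dict = field(default_factory=dict)
    traces: list = field(default_factory=list)


def planner(state):
    # Toy decomposition: one focused task per aspect of the request.
    state.tasks = [f"{state.query} :: {a}" for a in ("architecture", "reliability")]
    return state


def retriever(state):
    # Stand-in for BM25 search: one evidence card per task, with a stable ID.
    state.evidence = [
        {"id": f"E{i}", "task": task, "text": f"stub evidence for {task}"}
        for i, task in enumerate(state.tasks, start=1)
    ]
    return state


def critic(state):
    # Pass only if every planned task is backed by at least one evidence card.
    covered = {card["task"] for card in state.evidence}
    state.verdict = "pass" if state.tasks and covered == set(state.tasks) else "revise"
    return state


def writer(state):
    # Critic gate: no report is written unless the verdict is "pass".
    if state.verdict == "pass":
        state.report = "\n".join(f"- {c['text']} [{c['id']}]" for c in state.evidence)
    return state


def evaluator(state):
    state.metrics = {"evidence_items": len(state.evidence), "critic_verdict": state.verdict}
    return state


def run_graph(query, nodes=(planner, retriever, critic, writer, evaluator)):
    """Run the nodes in order and record one trace entry per node."""
    state = GraphState(query=query)
    for node in nodes:
        start = perf_counter()
        state = node(state)
        latency_ms = (perf_counter() - start) * 1000
        state.traces.append({"node": node.__name__, "latency_ms": latency_ms})
    return state
```

Calling `run_graph("private docs assistant")` returns a state whose `traces` list has one entry per node, which is the kind of per-node record a JSON trace file could be built from.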
The default runtime is dependency-light and offline-friendly, so recruiters can clone and run it without API keys. The architecture is intentionally compatible with LangGraph-style orchestration, hosted LLMs, local transformer models, and vector databases.
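To illustrate why BM25 retrieval needs no external dependencies, here is a standard-library sketch of Okapi BM25 scoring over an in-memory corpus. It is an assumption about how such a retriever might look, not the project's implementation: `tokenize`, `bm25_rank`, the parameter defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant are illustrative choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75, top_k=3):
    """Rank documents (a dict of doc_id -> text) against a query with BM25.

    Returns up to top_k (doc_id, score) pairs, best first. Documents that
    share no term with the query are omitted.
    """
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / max(n_docs, 1)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))

    query_terms = set(tokenize(query))
    scored = []
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

Each returned pair could be wrapped into an evidence card that carries the document ID, so later stages can cite it.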
- Source-grounded report writing with stable citation IDs.
- Critic gate before final answer delivery.
- Reproducible benchmark over a fixed private-doc corpus.
- JSON traces for every agent node.
- CI test suite covering retrieval, orchestration, and evaluation.
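The citation and critic-gate items above can be sketched as a small check over the report text. This is a hedged illustration only: the `[E1]` citation format, the function names, and the definition of the rate (cited IDs that belong to the approved evidence set, divided by all citations) are assumptions, not the project's documented metric.

```python
import re

# Hypothetical citation format: evidence IDs in square brackets, e.g. [E3].
CITATION = re.compile(r"\[(E\d+)\]")


def grounded_citation_rate(report, approved_ids):
    """Fraction of citations in the report that point at approved evidence."""
    cited = CITATION.findall(report)
    if not cited:
        return 0.0  # a report with no citations is treated as ungrounded
    grounded = sum(1 for cid in cited if cid in approved_ids)
    return grounded / len(cited)


def critic_verdict(report, approved_ids, min_rate=1.0):
    """Gate delivery: pass only when the grounding rate meets the threshold."""
    rate = grounded_citation_rate(report, approved_ids)
    return "pass" if rate >= min_rate else "revise"
```

With `min_rate=1.0`, a single citation to an unapproved ID is enough to send the report back for revision.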
agentic-research-ops/
├── examples/corpus/ # Private technical knowledge base
├── src/agentic_research_ops/ # Agent graph, retrieval, CLI, evaluation
├── tests/ # Unit and end-to-end tests
├── docs/architecture.md # System design notes
├── runs/ # Generated benchmark and report artifacts
├── scripts/run_demo.py # One-command demo entry point
├── pyproject.toml
└── README.md
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
pytest -q
Run one report:
agentic-research run \
  --query "Design a reliable multi-agent research assistant for private technical documents." \
  --corpus examples/corpus \
  --output-dir runs \
  --stem demo_report
Run the benchmark:
agentic-research benchmark \
  --corpus examples/corpus \
  --output-dir runs
Record CUDA hardware metadata and a lightweight FP16 matmul probe:
agentic-research gpu-probe --output-dir runs
If you do not install the package, use:
PYTHONPATH=src python -m agentic_research_ops benchmark --corpus examples/corpus --output-dir runs
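For readers curious what a lightweight FP16 matmul probe might involve, the sketch below times repeated square matrix multiplications on the GPU and converts the elapsed time into a throughput estimate. It is an assumption about the approach, not the code behind `gpu-probe`: the function names, the matrix size, and the iteration count are hypothetical, and the output keys simply mirror the field names used in the probe's JSON. PyTorch is imported lazily, so the arithmetic helper works without it.

```python
def estimate_tflops(n, iters, seconds):
    """Throughput of `iters` n-by-n matmuls, each costing about 2 * n**3 FLOPs."""
    return 2 * n**3 * iters / seconds / 1e12


def fp16_matmul_probe(n=4096, iters=20):
    """Time FP16 matmuls on the first CUDA device (requires PyTorch)."""
    import torch  # lazy import: only needed when the probe actually runs

    if not torch.cuda.is_available():
        return {"cuda_available": False}

    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    torch.matmul(a, b)  # warm-up so one-time setup cost is not timed
    torch.cuda.synchronize()

    # CUDA events measure time on the device rather than on the host.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1000.0  # elapsed_time returns milliseconds
    return {
        "cuda_available": True,
        "device_count": torch.cuda.device_count(),
        "device_name": torch.cuda.get_device_name(0),
        "benchmark": "fp16_matmul",
        "estimated_tflops": round(estimate_tflops(n, iters, seconds), 3),
        "max_memory_allocated_gb": round(torch.cuda.max_memory_allocated() / 1e9, 3),
    }
```

A short probe like this measures a best-case kernel on small inputs, so the resulting figure is a rough indicator rather than a sustained-throughput benchmark.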
The benchmark summary below was generated on this server with Python 3.13.5 and PyTorch 2.12.0+cu130. The unsandboxed runtime sees 8 NVIDIA RTX 6000 Ada Generation GPUs.
{
  "queries": 3,
  "device": "cuda:0",
  "avg_evidence_items": 12.0,
  "avg_grounded_citation_rate": 1.0,
  "avg_critic_coverage_score": 1.0,
  "total_latency_ms": 11.595
}
CUDA probe excerpt:
{
  "cuda_available": true,
  "device_count": 8,
  "device_name": "NVIDIA RTX 6000 Ada Generation",
  "benchmark": "fp16_matmul",
  "estimated_tflops": 196.466,
  "max_memory_allocated_gb": 0.102
}
Key artifacts:
- `runs/demo_report.md`: generated cited report.
- `runs/demo_report.json`: full graph state, metrics, evidence, and traces.
- `runs/benchmark_summary.json`: aggregate benchmark results.
- `runs/benchmark_*.json`: per-query benchmark traces.
- `runs/gpu_probe.json`: CUDA device metadata and lightweight matmul probe.
The demo query produces a report with sections for architecture, grounding, reliability controls, deployment, and a source map. Example metric excerpt:
{
  "report_words": 721,
  "evidence_items": 12,
  "unique_sources": 5,
  "grounded_citation_rate": 1.0,
  "critic_coverage_score": 1.0,
  "critic_verdict": "pass",
  "device": "cuda:0"
}
- Replace BM25 with hybrid retrieval using embeddings and a vector database.
- Swap the deterministic writer for an LLM-backed writer while preserving the critic and evaluator interfaces.
- Add a LangGraph adapter around the existing node contracts.
- Add human approval nodes for high-impact reports or automation actions.
- Add OpenTelemetry traces and a small FastAPI service layer.
pytest -q
Current status:
5 passed