Monitoring

Hindsight provides comprehensive observability through Prometheus metrics, OpenTelemetry distributed tracing, and pre-built Grafana dashboards.

Local Development​

For local observability, use the Grafana LGTM (Loki, Grafana, Tempo, Mimir) all-in-one stack:

./scripts/dev/start-monitoring.sh

This starts the full stack in a single Docker container.

Enable tracing in your API:

export HINDSIGHT_API_OTEL_TRACES_ENABLED=true
export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

Production Deployment

The local monitoring stack is for development only. In production, deploy Grafana LGTM separately or use commercial platforms (Grafana Cloud, DataDog, New Relic, etc.).

Grafana Dashboards​

Pre-built dashboards are available in monitoring/grafana/dashboards/. Import these JSON files into your Grafana instance:

| Dashboard | Description |
| --- | --- |
| Hindsight Operations | Operation rates, latency percentiles, per-bank metrics |
| Hindsight LLM Metrics | LLM calls, token usage, latency by scope/provider |
| Hindsight API Service | HTTP requests, error rates, DB pool, process metrics |

The dashboards are automatically provisioned when using the monitoring stack script.

Metrics Endpoint​

Hindsight exposes Prometheus metrics at /metrics:

curl http://localhost:8888/metrics
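
To inspect the endpoint from a script rather than with curl, the following is a minimal sketch that filters the Prometheus text output down to Hindsight series. The function names `parse_metrics` and `fetch_metrics` and the `SAMPLE` text are illustrative assumptions, not part of Hindsight; the parser also assumes sample lines carry no trailing timestamp.

```python
from urllib.request import urlopen

# Illustrative sample only; real output will contain different series and values.
SAMPLE = """\
# HELP hindsight_operation_total Total number of operations executed
# TYPE hindsight_operation_total counter
hindsight_operation_total{operation="recall",source="api",success="true"} 42
hindsight_db_pool_size 10
python_gc_collections_total 7
"""


def parse_metrics(text, prefix="hindsight_"):
    """Return {series: value} for sample lines whose name starts with prefix."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no timestamp, so the value is the last space-separated token.
        series, _, value = line.rpartition(" ")
        if series.startswith(prefix):
            samples[series] = float(value)
    return samples


def fetch_metrics(url="http://localhost:8888/metrics"):
    """Download the raw metrics text from a running API instance."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Swap SAMPLE for fetch_metrics() to read from a live instance.
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```

For anything beyond a quick check, a maintained Prometheus text-format parser is a safer choice than this line-splitting approach.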

Available Metrics​

Operation Metrics​

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| hindsight.operation.duration | Histogram | operation, bank_id, source, budget, max_tokens, success | Duration of operations in seconds |
| hindsight.operation.total | Counter | operation, bank_id, source, budget, max_tokens, success | Total number of operations executed |

Labels:

  • operation: Operation type (retain, recall, reflect, plus async worker task types such as consolidation)
  • bank_id: Memory bank identifier
  • source: Where the operation was triggered from (api, reflect, internal, worker)
  • budget: Budget level if specified (low, mid, high)
  • max_tokens: Max tokens if specified
  • success: Whether the operation succeeded (true, false)

The source label allows distinguishing between:

  • api: Direct API calls from clients
  • reflect: Internal recall calls made during reflect operations
  • internal: Other internal operations
  • worker: Async worker completions recorded when a claimed task reaches a terminal outcome

For source="worker", the success label is a completion-throughput signal: false means the task raised out to the poller after retries were exhausted or an unexpected error occurred. Failures handled inside the executor and returned normally still record success="true" here; use hindsight_async_operations{status="failed"} for authoritative async operation failure status.

LLM Metrics​

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| hindsight.llm.duration | Histogram | provider, model, scope, success | Duration of LLM API calls in seconds |
| hindsight.llm.calls.total | Counter | provider, model, scope, success | Total number of LLM API calls |
| hindsight.llm.tokens.input | Counter | provider, model, scope, success, token_bucket | Input tokens for LLM calls |
| hindsight.llm.tokens.output | Counter | provider, model, scope, success, token_bucket | Output tokens from LLM calls |

Labels:

  • provider: LLM provider (openai, anthropic, gemini, groq, ollama, lmstudio, bedrock, litellm)
  • model: Model name (e.g., gpt-4, claude-3-sonnet)
  • scope: What the LLM call is for (memory, reflect, consolidation, answer)
  • success: Whether the call succeeded (true, false)
  • token_bucket: Token count bucket for cardinality control (0-100, 100-500, 500-1k, 1k-5k, 5k-10k, 10k-50k, 50k+)
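
To illustrate how a token count could map to one of the token_bucket values listed above, here is a minimal sketch. The function name `token_bucket` is hypothetical, and the boundary handling (lower bound inclusive, upper bound exclusive) is an assumption; the list above does not say which bucket a count of exactly 100 falls into.

```python
import bisect

# Upper bounds separating the seven bucket labels listed above.
_UPPER_BOUNDS = [100, 500, 1_000, 5_000, 10_000, 50_000]
_LABELS = ["0-100", "100-500", "500-1k", "1k-5k", "5k-10k", "10k-50k", "50k+"]


def token_bucket(count: int) -> str:
    """Map a token count to a bucket label (assumed: lower bound inclusive)."""
    if count < 0:
        raise ValueError("token count must be non-negative")
    # bisect_right places a count equal to a boundary into the higher bucket.
    return _LABELS[bisect.bisect_right(_UPPER_BOUNDS, count)]


if __name__ == "__main__":
    for n in (0, 99, 100, 750, 50_000):
        print(n, token_bucket(n))
```

Bucketing keeps the label to seven possible values, so the number of time series stays bounded no matter how widely token counts vary.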

HTTP Request Metrics​

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| hindsight.http.duration | Histogram | method, endpoint, status_code, status_class | Duration of HTTP requests in seconds |
| hindsight.http.requests.total | Counter | method, endpoint, status_code, status_class | Total number of HTTP requests |
| hindsight.http.requests.in_progress | UpDownCounter | method, endpoint | Number of HTTP requests currently being processed |

Labels:

  • method: HTTP method (GET, POST, PUT, DELETE)
  • endpoint: Request path (normalized to reduce cardinality - UUIDs replaced with {id})
  • status_code: HTTP status code (200, 400, 500, etc.)
  • status_class: Status code class (2xx, 4xx, 5xx)
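
The endpoint normalization described above can be sketched as a regular-expression substitution. This is an assumption about the approach, not Hindsight's actual code; `normalize_endpoint` and the example path are hypothetical.

```python
import re

# Matches the canonical 8-4-4-4-12 hexadecimal UUID form.
_UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def normalize_endpoint(path: str) -> str:
    """Replace every UUID in a request path with the literal {id}."""
    return _UUID.sub("{id}", path)


if __name__ == "__main__":
    # Hypothetical path, used only to show the substitution.
    print(normalize_endpoint("/v1/items/123e4567-e89b-12d3-a456-426614174000/details"))
```

Without this step, every distinct UUID would create its own endpoint label value and therefore its own time series.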

Database Pool Metrics​

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| hindsight.db.pool.size | Gauge | - | Current number of connections in the pool |
| hindsight.db.pool.idle | Gauge | - | Number of idle connections in the pool |
| hindsight.db.pool.min | Gauge | - | Minimum pool size |
| hindsight.db.pool.max | Gauge | - | Maximum pool size |

Process Metrics​

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| hindsight.process.cpu.seconds | Gauge | type | Process CPU time in seconds |
| hindsight.process.memory.bytes | Gauge | type | Process memory usage in bytes |
| hindsight.process.open_fds | Gauge | - | Number of open file descriptors |
| hindsight.process.threads | Gauge | - | Number of active threads |

Labels:

  • type (CPU): user or system
  • type (Memory): rss_max (maximum resident set size)

Histogram Buckets​

Custom bucket boundaries are configured for better percentile accuracy:

Operation Duration Buckets (seconds):

0.1, 0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0, 30.0, 60.0, 120.0

LLM Duration Buckets (seconds):

0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 15.0, 30.0, 60.0, 120.0

HTTP Duration Buckets (seconds):

0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0
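
To see why bucket boundaries affect percentile accuracy, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation inside the matching bucket, which is the general approach Prometheus takes in histogram_quantile. The function name `estimate_quantile` and the request counts are made up for illustration; only the bucket boundaries come from the list above.

```python
import math

# HTTP duration bucket upper bounds in seconds, as listed above.
HTTP_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0]


def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate quantile q from cumulative counts.

    cumulative_counts has one entry per finite upper bound plus a final
    entry for the +Inf bucket (the total number of observations).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if len(cumulative_counts) != len(upper_bounds) + 1:
        raise ValueError("expected one count per bound plus the +Inf bucket")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, count in enumerate(cumulative_counts[:-1]):
        if count >= rank:
            lower = upper_bounds[i - 1] if i > 0 else 0.0
            prev = cumulative_counts[i - 1] if i > 0 else 0
            if count == prev:
                return lower
            # Assume observations are spread evenly within the bucket.
            return lower + (upper_bounds[i] - lower) * (rank - prev) / (count - prev)
    # The quantile falls in the +Inf bucket: report the highest finite bound.
    return upper_bounds[-1]


if __name__ == "__main__":
    # Made-up cumulative counts for 100 requests.
    counts = [0, 10, 30, 60, 80, 90, 98, 100, 100, 100, 100, 100, 100]
    print(estimate_quantile(0.95, HTTP_BUCKETS, counts))
```

With these sample counts the 95th percentile lands in the 0.25 to 0.5 second bucket, so the estimate can be no more precise than that bucket's width. Narrower buckets around the latencies you care about give tighter estimates.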

Prometheus Configuration​

scrape_configs:
  - job_name: 'hindsight'
    static_configs:
      - targets: ['localhost:8888']

Example Queries​

Average operation latency by type​

sum by (operation) (rate(hindsight_operation_duration_sum[5m])) / sum by (operation) (rate(hindsight_operation_duration_count[5m]))

LLM calls per minute by provider​

sum by (provider) (rate(hindsight_llm_calls_total[1m])) * 60

P95 LLM latency​

histogram_quantile(0.95, rate(hindsight_llm_duration_bucket[5m]))

Total tokens consumed by model​

sum by (model) (hindsight_llm_tokens_input_total) + sum by (model) (hindsight_llm_tokens_output_total)

Internal vs API recall operations​

sum by (source) (rate(hindsight_operation_total{operation="recall"}[5m]))

HTTP requests per second by endpoint​

sum by (endpoint) (rate(hindsight_http_requests_total[1m]))

HTTP error rate (5xx)​

sum(rate(hindsight_http_requests_total{status_class="5xx"}[5m])) / sum(rate(hindsight_http_requests_total[5m]))

P95 HTTP latency​

histogram_quantile(0.95, sum by (le) (rate(hindsight_http_duration_seconds_bucket[5m])))

Database pool utilization​

hindsight_db_pool_size / hindsight_db_pool_max

Active database connections​

hindsight_db_pool_size - hindsight_db_pool_idle

CPU usage rate​

rate(hindsight_process_cpu_seconds{type="user"}[1m])
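
The queries above use underscore-separated names, while the metric tables use dotted names. The sketch below captures the naming pattern those queries suggest: dots become underscores, counters end in `_total`, and histograms expose `_bucket`, `_sum` and `_count` series. The helper `prometheus_names` is hypothetical, and the rule is inferred from the examples rather than confirmed. Note that the P95 HTTP latency query uses an extra `_seconds` segment that this sketch does not model, so check the `/metrics` output for the exact names your deployment exports.

```python
def prometheus_names(metric_name: str, metric_type: str) -> list[str]:
    """Guess the Prometheus series names for a dotted metric name.

    Inferred from the example queries; not a confirmed naming rule.
    """
    base = metric_name.replace(".", "_")
    kind = metric_type.lower()
    if kind == "counter":
        # Avoid doubling the suffix for names that already end in "total".
        return [base if base.endswith("_total") else base + "_total"]
    if kind == "histogram":
        return [f"{base}_bucket", f"{base}_sum", f"{base}_count"]
    # Gauges and up/down counters are assumed to keep the plain name.
    return [base]


if __name__ == "__main__":
    print(prometheus_names("hindsight.llm.tokens.input", "Counter"))
    print(prometheus_names("hindsight.operation.duration", "Histogram"))
    print(prometheus_names("hindsight.db.pool.size", "Gauge"))
```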

Distributed Tracing​

Hindsight supports OpenTelemetry distributed tracing for memory operations and LLM calls, following GenAI semantic conventions v1.37+.

Configuration​

See Configuration - OpenTelemetry Tracing for environment variables.

Quick Start:

# Enable tracing
export HINDSIGHT_API_OTEL_TRACES_ENABLED=true
export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

# View traces with Grafana LGTM (local dev)
./scripts/dev/start-monitoring.sh
# Open http://localhost:3000 → Explore → Tempo

Hindsight supports any OTLP-compatible backend (Grafana LGTM, Langfuse, OpenLIT, DataDog, New Relic, Honeycomb, Pydantic Logfire, etc.).

Span Hierarchy​

Parent Spans (Operations):

  • hindsight.retain - Memory ingestion
  • hindsight.recall - Memory retrieval
    • hindsight.recall_embedding - Query embedding
    • hindsight.recall_retrieval - Parallel search (semantic, BM25, graph, temporal)
    • hindsight.recall_fusion - Reciprocal Rank Fusion
    • hindsight.recall_rerank - Cross-encoder reranking
  • hindsight.reflect - Agentic reasoning
    • hindsight.reflect_tool_call - Tool execution (recall, lookup, etc.)
  • hindsight.consolidation - Observation synthesis
  • hindsight.mental_model_refresh - Mental model updates

Child Spans (LLM Calls):

  • Named by scope (e.g., hindsight.memory, hindsight.reflect)
  • Contain full prompts/completions as events
  • Follow GenAI semantic conventions for attributes

Span Attributes​

Operation Spans:

  • hindsight.operation - Operation type
  • hindsight.bank_id - Memory bank ID
  • hindsight.query - Query text (truncated to 100 chars)
  • hindsight.fact_types - Fact types for recall
  • hindsight.thinking_budget - Budget allocation
  • hindsight.max_tokens - Token limit
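
As a rough illustration of how the operation span attributes above fit together, the sketch below builds an attribute dictionary and applies the 100-character truncation to the query text. The function `operation_span_attributes` is hypothetical and covers only three of the listed attributes; it is not Hindsight's implementation.

```python
def operation_span_attributes(operation, bank_id, query=None, max_query_chars=100):
    """Build a span attribute dict, truncating the query text.

    max_query_chars defaults to the 100-character limit stated above.
    """
    attributes = {
        "hindsight.operation": operation,
        "hindsight.bank_id": bank_id,
    }
    if query is not None:
        # Truncation keeps long user queries from bloating trace payloads.
        attributes["hindsight.query"] = query[:max_query_chars]
    return attributes


if __name__ == "__main__":
    attrs = operation_span_attributes("recall", "demo-bank", query="x" * 250)
    print(len(attrs["hindsight.query"]))
```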

LLM Spans (GenAI Semantic Conventions):

  • gen_ai.operation.name - Always "chat"
  • gen_ai.provider.name - Provider (openai, anthropic, google, etc.)
  • gen_ai.request.model - Model name
  • gen_ai.usage.input_tokens - Input tokens
  • gen_ai.usage.output_tokens - Output tokens
  • hindsight.scope - LLM call purpose (memory, reflect, consolidation, etc.)

Events:

  • gen_ai.client.inference.operation.details - Full prompts and completions
