Animesh Chowdhury

AI & Data Product Leader · Product, Evaluation & Quality (AI/ML) · Conversational & Agentic AI

AI/Data product leader with 10+ years building data & GenAI products end-to-end. Currently data product lead for Walmart's AI shopping assistant, owning the evaluation, experimentation, and quality systems that steer the roadmap.

LinkedIn · Streamlit · Tableau Public · RPubs · Email

👋 About

I take customer-facing AI products from ambiguous problem to launch, and own the evaluation, experimentation, and quality systems that decide what ships next. My edge is hands-on technical depth — LLM evaluation, RAG, observability, experimentation infrastructure — paired with the product judgment to weigh customer experience, safety, cost, and scale in one call.

Currently data product lead for Sparky, Walmart's AI shopping assistant (used by ~50% of Walmart app users; a publicly cited driver of ~35% larger orders), where I defined the platform's first standardized quality KPI and its greenfield evaluation standards from zero.

🧪 Featured projects

Five runnable apps spanning the AI-product lifecycle — build → evaluate → experiment → monitor → explain. Each is live, self-contained, and built on synthetic or real public data.

Project	What it shows	Demo
🛰️ LLM Observability & Evals	Model-health monitoring across quality, safety, performance, cost & drift — SQL-backed pipeline, alerting, and PDF/PPTX export	Live ↗
💬 Chat Quality Score (CQS)	LLM-as-a-judge evaluation scoring conversations on a 4-dimension rubric, calibrated against human labels	Live ↗
🛒 Product Recommendation Quality	Tracks AI recommendation relevance week over week and surfaces the drivers behind any change	Live ↗
🧪 A/B Experimentation Framework	Hypothesis design, randomization, guardrail metrics, and ship / iterate / stop decisioning	Live ↗
🔎 LedgerIQ — Finance RAG Agent	Finance-ops RAG over two sources — real SEC EDGAR filings and FP&A planning documents — grounded, cited answers that refuse when out-of-corpus, with token-minimization controls and MCP retrieval servers	Live ↗

LedgerIQ runs on real public SEC EDGAR data (SEC source) plus synthetic FP&A documents (FP&A source); the other apps use fabricated or synthetic data — no proprietary, confidential, or employer-specific information.

Built with Streamlit · RAG & MCP · SQLite · LLM-as-a-judge · Python

🛠️ What I work with

Product: product strategy & roadmap · feature prioritization · MVP scoping · PRDs & requirements · experimentation & A/B testing · KPI ownership · stakeholder management
GenAI & AI/ML: LLM evaluation (LLM-as-a-judge) · RAG & grounding · agentic AI & tool use (MCP) · prompt evaluation · conversational & agentic AI · retrieval / recommendation relevance · human-in-the-loop governance · model observability · AI safety evaluation · token & cost–quality optimization
Data & Platform: SQL · Python · R · BigQuery · Snowflake · PostgreSQL · Kafka · telemetry & experimentation infrastructure
BI & Tools: Tableau · Power BI · Streamlit · Jira · Miro

🏆 Selected recognition

Bravo Award (×2): for GenAI initiatives delivering ~$1M in annual savings, and for analytics spanning 30+ conversational-AI domains
Innovation Challenge Winner: top RPA solution selected from 218 ideas across 340 professionals, funded and rolled out across the US, Europe, and India

Open to AI/GenAI Product Management roles. Let's talk → chowdhuryanimesh1@gmail.com

Pinned

llm-observability-dashboard Public

SQL-backed LLM observability & evals dashboard for a conversational AI assistant — model-health monitoring across quality, safety, performance, cost, and drift, with executive summary, alerting, an...

Python

cqs-evaluation Public

LLM-as-a-judge evaluation demo for conversational AI: scores chats on a 4-dimension rubric into a single 0–100 quality score and calibrates the automated judge against human labels. Synthetic demo ...

Python

product-recommendation-quality Public

A measurement framework and live report for the quality of an AI shopping assistant's product recommendations — graded relevance, weekly tracking, root-cause analysis, and judge calibration. Built ...

HTML

ab-experimentation-framework Public

Interactive Streamlit dashboard for A/B experiment design, statistical analysis, guardrails, and launch decisions. Worked example: a delivery-estimate chatbot. Fabricated demo data.

Python

sec-filings-rag-agent Public

Finance-ops RAG agent over two sources — real SEC EDGAR filings and FP&A planning documents — with answers grounded in retrieved passages, citations to the exact section, and a refusal when out-of-...

Python

Animesh Chowdhury animesh01
