Animesh Chowdhury animesh01

AI & GenAI Product Leader · LLM evaluation, observability & experimentation · building trustworthy AI at consumer scale


Animesh Chowdhury

AI & Data Product Leader · Product, Evaluation & Quality (AI/ML) · Conversational & Agentic AI

AI/Data product leader, 10+ years building data & GenAI products end-to-end — data product lead for Walmart's AI shopping assistant, owning the evaluation, experimentation, and quality systems that steer the roadmap.

LinkedIn · Streamlit · Tableau Public · RPubs · Email


👋 About

I take customer-facing AI products from ambiguous problem to launch, and own the evaluation, experimentation, and quality systems that decide what ships next. My edge is hands-on technical depth — LLM evaluation, RAG, observability, experimentation infrastructure — paired with the product judgment to weigh customer experience, safety, cost, and scale in one call.

Currently data product lead for Sparky, Walmart's AI shopping assistant (used by ~50% of Walmart app users; a publicly cited driver of ~35% larger orders), where I defined the platform's first standardized quality KPI and its greenfield evaluation standards from zero.


🧪 Featured projects

Five runnable apps spanning the AI-product lifecycle — build → evaluate → experiment → monitor → explain. Each is live, self-contained, and built on synthetic or real public data.

| Project | What it shows | Demo |
| --- | --- | --- |
| 🛰️ LLM Observability & Evals | Model-health monitoring across quality, safety, performance, cost & drift — SQL-backed pipeline, alerting, and PDF/PPTX export | Live ↗ |
| 💬 Chat Quality Score (CQS) | LLM-as-a-judge evaluation scoring conversations on a 4-dimension rubric, calibrated against human labels | Live ↗ |
| 🛒 Product Recommendation Quality | Tracks AI recommendation relevance week over week and surfaces the drivers behind any change | Live ↗ |
| 🧪 A/B Experimentation Framework | Hypothesis design, randomization, guardrail metrics, and ship / iterate / stop decisioning | Live ↗ |
| 🔎 LedgerIQ — Finance RAG Agent | Finance-ops RAG over two sources — real SEC EDGAR filings and FP&A planning documents — grounded, cited answers that refuse when out-of-corpus, with token-minimization controls and MCP retrieval servers | Live ↗ |

LedgerIQ runs on real public SEC EDGAR data (SEC source) plus synthetic FP&A documents (FP&A source); the other apps use fabricated or synthetic data — no proprietary, confidential, or employer-specific information.
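
As a rough illustration of the Chat Quality Score idea in the table above, the sketch below collapses four rubric ratings into a single 0–100 score and measures how often an automated judge lands near a human label. It is not the code behind the live app: the dimension names, the 1–5 rating scale, the equal weighting, and the 10-point agreement tolerance are all assumptions chosen for the example.

```python
from statistics import mean

# Hypothetical rubric dimensions: the project description only says there are four.
DIMENSIONS = ("relevance", "accuracy", "helpfulness", "safety")


def quality_score(ratings: dict[str, int], scale_max: int = 5) -> float:
    """Collapse per-dimension ratings (1..scale_max) into one 0-100 score."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    # Rescale each rating so 1 maps to 0.0 and scale_max maps to 1.0,
    # then take an unweighted average (equal weights are an assumption).
    normalized = [(ratings[d] - 1) / (scale_max - 1) for d in DIMENSIONS]
    return round(100 * mean(normalized), 1)


def agreement_rate(judge: list[float], human: list[float],
                   tolerance: float = 10.0) -> float:
    """Share of conversations where judge and human scores differ by at most `tolerance`."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge, human))
    return hits / len(judge)
```

Calling `quality_score` on one conversation's ratings gives its score, and `agreement_rate` over a labeled sample gives one simple calibration signal. A real calibration would likely also look at correlation and per-dimension disagreement.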

Built with Streamlit · RAG & MCP · SQLite · LLM-as-a-judge · Python
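
The ship / iterate / stop decisioning listed for the A/B Experimentation Framework can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the framework's actual logic: it uses a pooled two-proportion z-test on a conversion metric, a single boolean guardrail check, and a 0.05 significance threshold, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(conv_a: int, n_a: int, conv_b: int, n_b: int,
           guardrail_ok: bool, alpha: float = 0.05) -> str:
    """Return 'ship', 'iterate', or 'stop' for control A versus treatment B."""
    if not guardrail_ok:
        return "stop"  # assumed rule: a guardrail breach overrides any lift
    lift = conv_b / n_b - conv_a / n_a
    p_value = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
    if p_value < alpha:
        # Significant result: ship a win, stop a loss.
        return "ship" if lift > 0 else "stop"
    return "iterate"  # inconclusive: refine the variant or collect more data
```

For example, `decide(100, 1000, 150, 1000, guardrail_ok=True)` compares a 10% control rate with a 15% treatment rate. Treating a guardrail breach as an automatic stop is a design choice made for this sketch; a production framework might distinguish soft and hard guardrails.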


🛠️ What I work with

Product: product strategy & roadmap · feature prioritization · MVP scoping · PRDs & requirements · experimentation & A/B testing · KPI ownership · stakeholder management

GenAI & AI/ML: LLM evaluation (LLM-as-a-judge) · RAG & grounding · agentic AI & tool use (MCP) · prompt evaluation · conversational & agentic AI · retrieval / recommendation relevance · human-in-the-loop governance · model observability · AI safety evaluation · token & cost–quality optimization

Data & Platform: SQL · Python · R · BigQuery · Snowflake · PostgreSQL · Kafka · telemetry & experimentation infrastructure

BI & Tools: Tableau · Power BI · Streamlit · Jira · Miro


🏆 Selected recognition

  • Bravo Award (×2) — for GenAI initiatives delivering ~$1M in annual savings, and for analytics spanning 30+ conversational-AI domains
  • Innovation Challenge Winner — top RPA solution selected from 218 ideas across 340 professionals, funded and rolled out across the US, Europe, and India

Open to AI/GenAI Product Management roles. Let's talk → chowdhuryanimesh1@gmail.com

Pinned

  1. llm-observability-dashboard (Public)

    SQL-backed LLM observability & evals dashboard for a conversational AI assistant — model-health monitoring across quality, safety, performance, cost, and drift, with executive summary, alerting, an...

    Python

  2. cqs-evaluation (Public)

    LLM-as-a-judge evaluation demo for conversational AI: scores chats on a 4-dimension rubric into a single 0–100 quality score and calibrates the automated judge against human labels. Synthetic demo ...

    Python

  3. product-recommendation-quality (Public)

    A measurement framework and live report for the quality of an AI shopping assistant's product recommendations — graded relevance, weekly tracking, root-cause analysis, and judge calibration. Built ...

    HTML

  4. ab-experimentation-framework (Public)

    Interactive Streamlit dashboard for A/B experiment design, statistical analysis, guardrails, and launch decisions. Worked example: a delivery-estimate chatbot. Fabricated demo data.

    Python

  5. sec-filings-rag-agent (Public)

    Finance-ops RAG agent over two sources — real SEC EDGAR filings and FP&A planning documents — with answers grounded in retrieved passages, citations to the exact section, and a refusal when out-of-...

    Python
