LLM-as-a-judge evaluation demo for conversational AI: scores each chat on a 4-dimension rubric, combines the dimension scores into a single 0–100 quality score, and calibrates the automated judge against human labels. Uses synthetic demo data.
nlp model-evaluation human-in-the-loop product-analytics conversational-ai streamlit chatbot-evaluation ai-evaluation llm-evaluation llm-as-a-judge rubric-scoring chat-quality
Updated Jun 10, 2026 - Python
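As a rough illustration of the scoring and calibration described above, here is a minimal Python sketch. It is not the repository's code: the dimension names, the 1-5 rating scale, the equal weighting, and the helper names `quality_score` and `calibration_report` are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical rubric: the dimension names, the 1-5 scale, and the equal
# weighting are assumptions for illustration, not taken from the repository.
DIMENSIONS = ("helpfulness", "accuracy", "clarity", "safety")
SCALE_MIN, SCALE_MAX = 1, 5


def quality_score(ratings):
    """Map four per-dimension ratings onto a single 0-100 score."""
    # Rescale each rating to the 0-1 range, then average with equal weights.
    normalized = [
        (ratings[d] - SCALE_MIN) / (SCALE_MAX - SCALE_MIN) for d in DIMENSIONS
    ]
    return 100 * mean(normalized)


def calibration_report(judge_scores, human_scores):
    """Compare judge scores with human labels for the same chats."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("judge and human score lists must have equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        "bias": mean(diffs),                  # > 0: judge scores higher than humans
        "mae": mean([abs(d) for d in diffs]),  # mean absolute disagreement
    }
```

Under these assumptions, a chat rated 3 on every dimension gets `quality_score(...) == 50.0`. A large `bias` or `mae` in the report would suggest the judge prompt or rubric needs adjusting before its scores are trusted.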