Goal

Understand how to implement AI evals, using Coursera as a case study.

1. Scaling Automation

Coursera Coach is available 24x7 and has a 90% satisfaction rating.
Users who engage with Coach complete courses faster and finish more courses overall, and Coach has become an integral part of learning.

Assessment by teachers or peers is difficult to scale (I personally didn't like scoring a fellow course-taker).
Teachers need to be paid, and peers give variable feedback, so neither approach scales.

The automated system now provides consistent, fair assessment with actionable feedback, significantly reducing grading time while maintaining educational quality.
Grading now happens within a minute and learners get 45x more feedback, which led to a 16.7% increase in course completion.
NOTE: There's a "How useful was this feedback" prompt with 👍🏼/👎🏼 buttons.

Define what "good enough" means; to the AI, the output is just tokens.
Evaluation rubrics for the AI cover criteria like:
- Appropriateness
- Content relevance
- Performance standards
Based on these criteria, the AI makes an evaluation.
To test how effective the AI evaluation system is, the following metrics are used:
- Human evaluation feedback: What does a human think about the AI's evaluation?
- Feedback effectiveness: Is the feedback helpful, actionable, and clear?
- Clarity of assessment criteria: Are the rules or standards being applied transparently and understandably?
- Equitable evaluation across diverse submissions: Is the evaluation fair, unbiased, and consistent regardless of content, style, or background?
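Two of these metrics can be turned into numbers fairly directly. The sketch below is only an illustration, not Coursera's method: helpful_rate and max_group_gap are hypothetical helpers, and the grouping of submissions (for example by submission type) is an assumption. A large gap between groups is a signal to investigate, not proof of bias, since groups can differ in real quality.

```python
from collections import defaultdict
from statistics import mean


def helpful_rate(ratings):
    """Share of thumbs-up among explicit ratings (True = up, False = down)."""
    return sum(ratings) / len(ratings) if ratings else None


def max_group_gap(scored):
    """Largest difference between group mean scores.

    `scored` is an iterable of (group, score) pairs; the grouping itself
    is a hypothetical choice made for this sketch.
    """
    by_group = defaultdict(list)
    for group, score in scored:
        by_group[group].append(score)
    means = [mean(scores) for scores in by_group.values()]
    return max(means) - min(means) if means else 0.0
```

For example, helpful_rate([True, True, False, True]) covers the "feedback effectiveness" metric using the thumbs buttons, while max_group_gap gives a rough first check on "equitable evaluation".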

Dataset quality and evaluation quality/reliability go hand in hand.
The team manually reviews anonymized chatbot transcripts and human-graded assignments, paying special attention to interactions with explicit user feedback (like thumbs up/down ratings).
They supplement this real data with synthetic datasets generated by LLMs to test edge cases, and extract challenging real-world examples that might expose weaknesses.
This covers both typical use cases and edge cases.
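A minimal sketch of how such a dataset could be assembled, assuming the transcripts are already anonymized. The Example class, its fields, and build_eval_set are hypothetical names for illustration; the idea is only to put explicitly rated interactions (thumbs down first) ahead of unrated ones and then append the synthetic edge cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    text: str
    source: str                        # "production" or "synthetic"
    thumbs_up: Optional[bool] = None   # None means no explicit rating


def build_eval_set(production, synthetic, max_production=100):
    """Rank production examples by feedback signal, then add synthetic ones."""

    def priority(example):
        if example.thumbs_up is False:
            return 0   # thumbs down: most likely to expose a weakness
        if example.thumbs_up is True:
            return 1   # thumbs up: known-good reference behaviour
        return 2       # unrated: lowest priority

    ranked = sorted(production, key=priority)  # sorted() is stable
    return ranked[:max_production] + list(synthetic)
```

Usage: build_eval_set(transcripts, llm_generated_cases, max_production=500) returns one list that mixes typical traffic with edge cases.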

Heuristic checks provide deterministic evaluation of objective criteria, like format and response structure.
For more subjective responses, LLM-as-a-judge is used to assess across multiple sections.
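The two layers can be sketched as below. Everything here is an assumption made for illustration: the JSON response schema (score and feedback keys), the 0 to 100 score range, and stub_judge, which stands in for a real LLM call. The point is the ordering: the cheap deterministic checks run first, and the judge is only called when they pass.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"score", "feedback"}  # hypothetical response schema


def heuristic_check(raw_response: str) -> list:
    """Deterministic checks on format and structure; returns a list of problems."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    if not isinstance(data, dict):
        return ["response is not a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if "score" in data:
        score = data["score"]
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not (is_number and 0 <= score <= 100):
            problems.append("score must be a number between 0 and 100")
    if "feedback" in data:
        feedback = data["feedback"]
        if not (isinstance(feedback, str) and feedback.strip()):
            problems.append("feedback must be a non-empty string")
    return problems


def evaluate(raw_response: str, judge: Callable[[str], float]) -> dict:
    """Run deterministic checks first; call the judge only if they pass."""
    problems = heuristic_check(raw_response)
    if problems:
        return {"passed": False, "problems": problems, "judge_score": None}
    feedback = json.loads(raw_response)["feedback"]
    return {"passed": True, "problems": [], "judge_score": judge(feedback)}


def stub_judge(feedback: str) -> float:
    """Placeholder for an LLM call: rewards feedback that suggests a next step."""
    return 1.0 if "try" in feedback.lower() else 0.5
```

In a real setup, stub_judge would be replaced by a function that sends the feedback and a rubric to a model and parses its rating.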

Their online monitoring logs production traffic through evaluation scorers, tracking real-time performance against established metrics and alerting on significant deviations.
Batch evaluations are mainly for testing model performance, experimenting with prompts, and checking for regressions before deployment.
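A rough sketch of both modes, under stated assumptions: scores are floats where higher is better, ScoreMonitor and batch_regression are hypothetical names, and the window size and thresholds are arbitrary illustrative values rather than anything from the case study.

```python
from collections import deque
from statistics import mean


class ScoreMonitor:
    """Online mode: rolling mean of production scores, compared to a baseline."""

    def __init__(self, baseline: float, window: int = 50, max_drop: float = 0.1):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically

    def record(self, score: float) -> bool:
        """Log one score; return True if an alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # wait for a full window before alerting
        return self.baseline - mean(self.scores) > self.max_drop


def batch_regression(baseline_scores, candidate_scores, tolerance=0.02):
    """Batch mode: True if the candidate's mean score drops by more than tolerance."""
    return mean(baseline_scores) - mean(candidate_scores) > tolerance
```

The monitor would sit behind the production scorers, while batch_regression would run on a fixed evaluation set before a new prompt or model is deployed.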

Create a mix of deterministic checks and AI-based evaluations to balance strict requirements with nuanced quality assessment.

Establish both real-time monitoring and batch testing processes to continuously validate AI performance.

Start with clear success criteria: Define what "good" looks like before building, not after.
Balance evaluation methods: Use both deterministic checks for non-negotiable requirements and AI-based evaluation for more subjective quality aspects.
Build realistic test data: Invest in dataset curation that reflects actual use cases, including edge cases where AI typically struggles.
Consider the full spectrum of metrics: Evaluate not just output quality but also operational aspects like latency and resource usage.
Integrate evaluation throughout development: Make testing a continuous process, not just a final validation step.