Goal
Understand how AI evals are implemented in practice, using Coursera as a case study.
1. Scaling Automation
1.1. Pre-Framework Challenges
- Fragmented offline jobs in spreadsheets
- Human labeling
- Manual data review to find errors
- Difficult collaboration between teams
1.2. Impact of Coursera Coach
- Coursera Coach is available 24/7 and has a 90% satisfaction rating.
- Users who engage with Coach complete courses faster and finish more courses overall; Coach has become an integral part of learning.
2. Automating Grading
2.1. Pre-AI Era
- Assessment was done by teachers or peers, which was difficult to scale. (I personally didn't like scoring a fellow course-taker.)
- Teachers needed to be paid and peers gave inconsistent feedback, so neither approach scaled.
2.2. AI-Powered Grading System
- The automated system now provides consistent, fair assessment with actionable feedback, significantly reducing grading time while maintaining educational quality.
- Grading completes within a minute and produces 45x more feedback, which led to a 16.7% increase in course completion.
- NOTE - The feedback comes with a "How useful was this feedback?" prompt and 👍🏼/👎🏼 buttons.
3. Evaluating AI Features with Braintrust
3.1. Define Success Before Building
- Define what success looks like before building, not after.
3.1.1. Clear Evaluation Criteria
- Define what "good enough" means. To the AI, the output is just tokens, so the quality bar has to be spelled out.
- Evaluation rubrics for an AI cover criteria like:
- Appropriateness
- Content relevance
- Performance standards
- Based on these criteria, the AI produces an evaluation.
- To test how effective the AI evaluation system is, the following metrics are used:
- Human evaluation feedback: What does a human reviewer think of the AI's evaluation?
- Feedback effectiveness: Is the feedback helpful, actionable, and clear?
- Clarity of assessment criteria: Are the rules or standards being applied transparently and understandably?
- Equitable evaluation across diverse submissions: Is the evaluation fair, unbiased, and consistent regardless of content, style, or background?
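A minimal sketch of how a rubric like the one above could be turned into a number, and how agreement with human reviewers could be measured. The criterion names come from these notes; the weights, the [0, 1] rating scale, and the function names are assumptions made for illustration, not Coursera's actual setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance (hypothetical value)


# Criterion names are from the notes; the weights are made up for illustration.
RUBRIC = [
    Criterion("appropriateness", 0.3),
    Criterion("content_relevance", 0.4),
    Criterion("performance_standards", 0.3),
]


def weighted_score(ratings: dict, rubric: list = RUBRIC) -> float:
    """Combine per-criterion ratings (each in [0, 1]) into one score in [0, 1]."""
    missing = [c.name for c in rubric if c.name not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * ratings[c.name] for c in rubric) / total_weight


def human_agreement(ai_pass: list, human_pass: list) -> float:
    """Fraction of submissions where the AI verdict matches the human verdict."""
    if len(ai_pass) != len(human_pass) or not ai_pass:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(a == h for a, h in zip(ai_pass, human_pass))
    return matches / len(ai_pass)
```

Computing `human_agreement` separately for different groups of submissions (by style, language, or topic) is one simple way to probe the "equitable evaluation" metric: a large gap between groups would be worth investigating.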
3.1.2. Curate Targeted Datasets
- Dataset quality and evaluation quality/reliability go hand in hand.
- The team manually reviews anonymized chatbot transcripts and human-graded assignments, paying special attention to interactions with explicit user feedback (like thumbs up/down ratings).
- They supplement this real-world data with LLM-generated synthetic datasets to test edge cases, and pull out challenging real-world examples that might expose weaknesses.
- This balances typical use cases with edge-case scenarios.
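The curation steps above could be sketched roughly as follows. The record layout (`input`, `output`, `rating`) and the function name `curate_dataset` are hypothetical; the real pipeline also involves manual review and anonymization, which this sketch does not attempt.

```python
import random


def curate_dataset(transcripts, synthetic_cases, sample_size, seed=0):
    """Build an eval set from rated transcripts, a sample of unrated ones,
    and synthetic edge cases.

    Each transcript is a dict with hypothetical keys "input", "output",
    and "rating" ("up", "down", or None when the user gave no feedback).
    """
    # Keep every interaction with explicit user feedback.
    rated = [t for t in transcripts if t.get("rating") in ("up", "down")]
    unrated = [t for t in transcripts if t.get("rating") is None]

    # A fixed seed keeps the sampled dataset reproducible across runs.
    rng = random.Random(seed)
    sampled = rng.sample(unrated, min(sample_size, len(unrated)))

    dataset = []
    for t in rated + sampled:
        dataset.append({"input": t["input"], "output": t["output"],
                        "rating": t["rating"], "source": "production"})
    # Synthetic cases have no recorded output yet; they only supply inputs.
    for case in synthetic_cases:
        dataset.append({"input": case, "output": None,
                        "rating": None, "source": "synthetic"})
    return dataset
```

Tagging each row with its `source` makes it possible to report scores separately for typical production traffic and for synthetic edge cases.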
3.1.3. Heuristic and Model-Based Scorers
- Heuristic checks provide deterministic evaluation of objective criteria, like format and response structure.
- For more subjective responses, an LLM-as-a-judge is used to assess quality across multiple dimensions.
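To make the two scorer types concrete, here is a rough sketch of one of each. It does not use the Braintrust SDK; `format_scorer`, `judge_scorer`, the prompt wording, the 1-5 scale, and the `call_model` callable (any function that takes a prompt string and returns the model's text reply) are all assumptions for illustration.

```python
import json
from typing import Callable, Dict, Optional


def format_scorer(output: str, required_keys=("score", "feedback")) -> float:
    """Heuristic check: 1.0 if output is a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if all(key in parsed for key in required_keys) else 0.0


# Hypothetical judge prompt; a real one would include the rubric text.
JUDGE_PROMPT = (
    "Rate the answer on '{dimension}' from 1 (poor) to 5 (excellent).\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with a single integer."
)


def judge_scorer(question: str, answer: str, dimensions,
                 call_model: Callable[[str], str]) -> Dict[str, Optional[float]]:
    """Ask a judge model for a 1-5 rating per dimension, normalized to [0, 1].

    A dimension maps to None when the reply cannot be parsed as a valid rating.
    """
    scores = {}
    for dimension in dimensions:
        prompt = JUDGE_PROMPT.format(
            dimension=dimension, question=question, answer=answer)
        reply = call_model(prompt).strip()
        try:
            rating = int(reply)
        except ValueError:
            rating = None
        if rating is None or not 1 <= rating <= 5:
            scores[dimension] = None
        else:
            scores[dimension] = (rating - 1) / 4
    return scores
```

Passing the model call in as a parameter keeps the scorer testable with a stub, for example `judge_scorer("q", "a", ["accuracy"], lambda prompt: "4")`.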
3.1.4. Run Evaluations and Iterate Rapidly
- Their online monitoring logs production traffic through evaluation scorers, tracking real-time performance against established metrics and alerting on significant deviations.
- Offline batch evaluations are mainly used to compare model performance, experiment with prompts, and check for regressions before deployment.
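The "alerting on significant deviations" part of online monitoring might look something like this rolling-window check. The class name, the baseline, and the tolerance are hypothetical; the notes do not say how Coursera defines a significant deviation.

```python
from collections import deque


class ScoreMonitor:
    """Track a rolling mean of production scores and flag drift from a baseline."""

    def __init__(self, baseline: float, tolerance: float, window: int = 100):
        self.baseline = baseline    # expected mean score (assumed known)
        self.tolerance = tolerance  # allowed absolute deviation
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Add one score; return True if an alert should fire.

        No alert fires until the window is full, to avoid noisy
        alerts from the first few samples.
        """
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False
        mean = sum(self.scores) / len(self.scores)
        return abs(mean - self.baseline) > self.tolerance
```

Each scored production response would be passed to `record`, and a `True` return would trigger whatever alerting channel the team uses.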
3.1.5. Mix of Deterministic and AI-Based Evaluations
- Create a mix of deterministic checks and AI-based evaluations to balance strict requirements with nuanced quality assessment.
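One way to combine the two kinds of checks is to treat the deterministic ones as hard gates and only consult the AI-based score when all of them pass. This is a sketch of that design choice, not necessarily how Coursera combines them; `combined_score`, `HARD_CHECKS`, and the specific checks are hypothetical.

```python
def combined_score(output, hard_checks, soft_scorer):
    """Return score 0.0 if any deterministic check fails, else the soft score.

    hard_checks: dict mapping a check name to a function(output) -> bool.
    soft_scorer: function(output) -> float in [0, 1], e.g. an LLM judge.
    """
    failed = [name for name, check in hard_checks.items() if not check(output)]
    if failed:
        # Skip the (slower, costlier) AI-based scorer when a strict rule fails.
        return {"score": 0.0, "failed_checks": failed}
    return {"score": float(soft_scorer(output)), "failed_checks": []}


# Hypothetical non-negotiable requirements for a feedback message.
HARD_CHECKS = {
    "non_empty": lambda text: bool(text.strip()),
    "under_1000_chars": lambda text: len(text) <= 1000,
}
```

Returning the names of the failed checks alongside the score makes failures easier to triage than a bare 0.0.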
3.1.6. Deploying New Features Faster
- Build test cases
- Test in playground to catch issues early
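As an illustration of catching issues before release, the sketch below runs a baseline and a candidate over the same test cases and flags a regression. It is plain Python rather than the Braintrust playground; the test cases, the exact-match scorer, and the `max_drop` threshold are made up for the example.

```python
def exact_match(output: str, expected: str) -> float:
    """Deterministic scorer: 1.0 on a case-insensitive exact match."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(task, cases, scorer=exact_match) -> float:
    """Mean score of `task` (a function input -> output) over the test cases."""
    if not cases:
        raise ValueError("need at least one test case")
    scores = [scorer(task(case["input"]), case["expected"]) for case in cases]
    return sum(scores) / len(scores)


def has_regressed(baseline: float, candidate: float, max_drop: float = 0.02) -> bool:
    """True if the candidate scores worse than the baseline by more than max_drop."""
    return (baseline - candidate) > max_drop


# Hypothetical test cases; real ones would come from the curated dataset.
CASES = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
```

A deployment script could call `run_eval` for the current and the new prompt or model, then block the release when `has_regressed` returns `True`.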
3.1.7. Continuous Validation
- Establish both real-time monitoring and batch testing processes to continuously validate AI performance.
4. Practical Lessons
- Start with clear success criteria: Define what "good" looks like before building, not after.
- Balance evaluation methods: Use both deterministic checks for non-negotiable requirements and AI-based evaluation for more subjective quality aspects.
- Build realistic test data: Invest in dataset curation that reflects actual use cases, including edge cases where AI typically struggles.
- Consider the full spectrum of metrics: Evaluate not just output quality but also operational aspects like latency and resource usage.
- Integrate evaluation throughout development: Make testing a continuous process, not just a final validation step.
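For the lesson about operational metrics, a small sketch of recording latency next to quality scores in the same evaluation run. The function names and the nearest-rank p95 are assumptions for illustration; resource usage such as token counts would need provider-specific data and is left out.

```python
import math
import statistics
import time


def timed_eval(task, inputs, scorer):
    """Run `task` on each input, recording its score and wall-clock latency.

    scorer is a function(input, output) -> float.
    """
    rows = []
    for item in inputs:
        start = time.perf_counter()
        output = task(item)
        latency = time.perf_counter() - start
        rows.append({"input": item, "score": scorer(item, output),
                     "latency_s": latency})
    return rows


def summarize(rows):
    """Report mean score and 95th-percentile latency (nearest-rank method)."""
    if not rows:
        raise ValueError("need at least one row")
    latencies = sorted(row["latency_s"] for row in rows)
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "mean_score": statistics.mean(row["score"] for row in rows),
        "p95_latency_s": latencies[p95_index],
    }
```

Reporting both numbers together makes it visible when a prompt or model change improves quality at the cost of slower responses.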