Sujay Kapadnis

Notes on Coursera building learning tools


Goal

Understand how AI evals are implemented in practice, using Coursera as a case study.

1. Scaling Automation

1.1. Pre-Framework Challenges

  • Fragmented offline jobs run in spreadsheets
  • Manual human labeling
  • Manual data review to find errors
  • Difficult collaboration between teams

1.2. Impact of Coursera Coach

  • Coursera Coach is available 24x7 and has a 90% satisfaction rating.
  • Users who engage with Coach complete courses faster and finish more courses overall. Coach has become an integral part of learning.

2. Automating Grading

2.1. Pre-AI Era

  • Assessment was done by teachers or peers, which was difficult to scale. I personally didn't like scoring a fellow course-taker.
  • Teachers needed to be paid, and peers gave variable feedback.

2.2. AI-Powered Grading System

  • The automated system now provides consistent, fair assessment with actionable feedback, significantly reducing grading time while maintaining educational quality.
  • Grading now completes within a minute and provides 45x more feedback, which led to a 16.7% increase in course completion.
  • NOTE: The grading feedback includes a "How useful was this feedback" prompt with 👍🏼/👎🏼 buttons.

3. Evaluating AI Features with Braintrust

3.1. Define Success Before Building

  • Define what success looks like before building, not after.

3.1.1. Clear Evaluation Criteria

  • Define what "good enough" means. To the AI, the output is just tokens, so the quality bar has to be stated explicitly.
  • Evaluation rubrics for an AI include criteria such as:
    • Appropriateness
    • Content relevance
    • Performance standards
  • Based on these criteria, the AI makes an evaluation.
  • To test how effective the AI evaluation system is, the following metrics are used:
    • Human evaluation feedback: What do human reviewers think of the AI's evaluation?
    • Feedback effectiveness: Is the feedback helpful, actionable, and clear?
    • Clarity of assessment criteria: Are the rules or standards being applied transparently and understandably?
    • Equitable evaluation across diverse submissions: Is the evaluation fair, unbiased, and consistent regardless of content, style, or background?
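
One way to make the "human evaluation feedback" metric concrete is to measure how often the AI grade matches a human grade on the same submission. The sketch below is only an illustration: the function name `agreement_rate`, the integer grade scale, and the `tolerance` parameter are assumptions, not something the Coursera write-up describes.

```python
from typing import Sequence


def agreement_rate(
    ai_grades: Sequence[int],
    human_grades: Sequence[int],
    tolerance: int = 0,
) -> float:
    """Fraction of submissions where the AI grade is within `tolerance`
    points of the human grade (hypothetical metric, for illustration)."""
    if len(ai_grades) != len(human_grades):
        raise ValueError("grade lists must have the same length")
    if not ai_grades:
        raise ValueError("need at least one graded submission")
    matches = sum(
        1 for ai, human in zip(ai_grades, human_grades)
        if abs(ai - human) <= tolerance
    )
    return matches / len(ai_grades)


# Example: four submissions graded by both the AI and a human.
ai = [4, 3, 5, 2]
human = [4, 2, 5, 5]
exact = agreement_rate(ai, human)               # exact matches only
close = agreement_rate(ai, human, tolerance=1)  # allow a one-point gap
```

Computing the same rate separately for different groups of submissions (by topic, length, or writing style) is a simple first check on the "equitable evaluation" criterion.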

3.1.2. Curate Targeted Datasets

  • Dataset quality and evaluation quality/reliability go hand in hand.
  • The team manually reviews anonymized chatbot transcripts and human-graded assignments, paying special attention to interactions with explicit user feedback (like thumbs up/down ratings).
  • They supplement this real-world data with LLM-generated synthetic datasets that test edge cases, and they extract challenging real-world examples that might expose weaknesses.
  • The resulting dataset balances typical use cases with edge-case scenarios.
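
The curation steps above can be sketched as a small function that puts production examples with explicit feedback first and then appends synthetic edge cases. Everything here is hypothetical: the `Example` fields, the `curate` function, and the ordering rule (thumbs down, then thumbs up, then unrated) are assumptions made for illustration, not Coursera's actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    prompt: str
    response: str
    source: str                        # "production" or "synthetic"
    thumbs_up: Optional[bool] = None   # explicit user feedback, if any


def curate(
    production: List[Example],
    synthetic: List[Example],
    max_production: int = 100,
) -> List[Example]:
    """Keep the most informative production examples, then add synthetic ones."""

    def priority(example: Example) -> int:
        # Thumbs-down interactions are the most likely to expose weaknesses.
        if example.thumbs_up is False:
            return 0
        if example.thumbs_up is True:
            return 1
        return 2  # no explicit feedback

    ranked = sorted(production, key=priority)  # stable: ties keep input order
    return ranked[:max_production] + list(synthetic)


dataset = curate(
    production=[
        Example("What is recursion?", "...", "production"),
        Example("Explain p-values", "...", "production", thumbs_up=False),
    ],
    synthetic=[Example("", "...", "synthetic")],  # edge case: empty prompt
)
```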

3.1.3. Heuristic and Model-Based Scorers

  • Heuristic checks provide deterministic evaluation of objective criteria, like format and response structure.
  • For more subjective qualities, an LLM-as-a-judge is used to assess the response across multiple dimensions.
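
The two scorer types can be sketched as plain Python functions. This is a minimal illustration and does not use Braintrust's SDK: `format_score`, `judge_score`, the JSON keys, the prompt wording, and the 1 to 5 rating scale are all assumptions. The model call is passed in as `call_model` so the sketch runs without any API; in real use it would wrap whichever LLM client is available.

```python
import json
from typing import Callable, Sequence


def format_score(
    output: str,
    required_keys: Sequence[str] = ("score", "feedback"),
) -> float:
    """Heuristic scorer: 1.0 if `output` is a JSON object containing
    every required key, otherwise 0.0. Fully deterministic."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if all(key in parsed for key in required_keys) else 0.0


JUDGE_PROMPT = (
    "Rate how helpful the answer is on a scale from 1 to 5.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with a single integer."
)


def judge_score(
    question: str,
    answer: str,
    call_model: Callable[[str], str],
) -> float:
    """Model-based scorer: ask a judge model for a 1-5 rating and
    normalize it to the range 0.0-1.0. Unparseable replies score 0.0."""
    reply = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        rating = int(reply.strip())
    except ValueError:
        return 0.0
    rating = min(max(rating, 1), 5)  # clamp out-of-range ratings
    return (rating - 1) / 4


# Usage with a stub standing in for a real judge model.
structure_ok = format_score('{"score": 4, "feedback": "Clear explanation."}')
helpfulness = judge_score("What is a p-value?", "It is ...", lambda prompt: "4")
```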

3.1.4. Run Evaluations and Iterate Rapidly

  • Online monitoring runs logged production traffic through evaluation scorers, tracks real-time performance against established metrics, and alerts on significant deviations.
  • Batch evaluations are mainly used to test model performance, experiment with prompts, and check for regressions before deployment.
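
Both modes can be sketched in a few lines. The names `ScoreMonitor` and `has_regressed`, the rolling-window approach, and every threshold are assumptions chosen for illustration. The source does not say how Coursera defines a "significant deviation".

```python
from collections import deque
from statistics import mean
from typing import Sequence


class ScoreMonitor:
    """Online monitoring: track a rolling mean of scorer outputs and
    flag when it falls too far below an established baseline."""

    def __init__(self, baseline: float, tolerance: float = 0.1, window: int = 50):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # oldest scores drop off automatically

    def record(self, score: float) -> bool:
        """Add one production score; return True if an alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # wait for a full window before alerting
        return mean(self.scores) < self.baseline - self.tolerance


def has_regressed(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    max_drop: float = 0.02,
) -> bool:
    """Batch check before deployment: does the candidate prompt or model
    score more than `max_drop` lower on average than the baseline?"""
    return mean(candidate_scores) < mean(baseline_scores) - max_drop


monitor = ScoreMonitor(baseline=0.9, tolerance=0.1, window=3)
alerts = [monitor.record(score) for score in (0.9, 0.9, 0.9, 0.5)]
```

In this example only the last score triggers an alert, because it pulls the three-score rolling mean below 0.8.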

3.1.5. Mix of Deterministic and AI-Based Evaluations

  • Create a mix of deterministic checks and AI-based evaluations to balance strict requirements with nuanced quality assessment.
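
One possible way to combine the two is to treat deterministic checks as a gate and only spend a model call on outputs that pass. This is a sketch under assumptions: `combined_score`, the example checks, and the gating rule are hypothetical, and a weighted average would be an equally valid design.

```python
from typing import Callable, Sequence


def combined_score(
    output: str,
    hard_checks: Sequence[Callable[[str], bool]],
    quality_scorer: Callable[[str], float],
) -> float:
    """Return 0.0 if any deterministic check fails; otherwise return
    the (typically model-based) quality score."""
    for check in hard_checks:
        if not check(output):
            return 0.0  # strict requirement violated: skip the judge entirely
    return quality_scorer(output)


# Hypothetical strict requirements for a feedback message.
def is_non_empty(text: str) -> bool:
    return bool(text.strip())


def is_short_enough(text: str) -> bool:
    return len(text) <= 500


# A stub stands in for an LLM-as-a-judge scorer.
score = combined_score(
    "Good structure, but cite your sources.",
    hard_checks=[is_non_empty, is_short_enough],
    quality_scorer=lambda text: 0.8,
)
```

Gating keeps the non-negotiable requirements strict and avoids paying for judge calls on outputs that would be rejected anyway.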

3.1.6. Deploying New Features Faster

  • Build test cases
  • Test in playground to catch issues early

3.1.7. Continuous Validation

  • Establish both real-time monitoring and batch testing processes to continuously validate AI performance.

4. Practical Lessons

  • Start with clear success criteria: Define what "good" looks like before building, not after.
  • Balance evaluation methods: Use both deterministic checks for non-negotiable requirements and AI-based evaluation for more subjective quality aspects.
  • Build realistic test data: Invest in dataset curation that reflects actual use cases, including edge cases where AI typically struggles.
  • Consider the full spectrum of metrics: Evaluate not just output quality but also operational aspects like latency and resource usage.
  • Integrate evaluation throughout development: Make testing a continuous process, not just a final validation step.
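
The point about operational metrics can be illustrated by recording latency in the same loop that scores output quality. This is a minimal sketch: `evaluate_with_latency`, the result keys, and the toy task are assumptions, and resource usage (for example token counts) could be captured the same way.

```python
import time
from statistics import mean
from typing import Any, Callable, Dict, Sequence


def evaluate_with_latency(
    task: Callable[[str], str],
    scorer: Callable[[str, str], float],
    inputs: Sequence[str],
) -> Dict[str, Any]:
    """Run `task` on each input, score the output, and time each call.
    Expects at least one input."""
    rows = []
    for item in inputs:
        start = time.perf_counter()
        output = task(item)
        latency = time.perf_counter() - start
        rows.append(
            {"input": item, "score": scorer(item, output), "latency_s": latency}
        )
    return {
        "mean_score": mean([row["score"] for row in rows]),
        "max_latency_s": max(row["latency_s"] for row in rows),
        "rows": rows,
    }


# Toy task and scorer standing in for a real AI feature and its evaluation.
report = evaluate_with_latency(
    task=str.upper,
    scorer=lambda item, output: 1.0 if output == item.upper() else 0.0,
    inputs=["hello", "world"],
)
```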