How GitHub evaluates AI models for GitHub Copilot: offline evaluation methodology
With many AI models available from proprietary and open-source providers, GitHub needed a rigorous evaluation process to determine which models to support in GitHub Copilot, since newer models do not always perform better for specific use cases.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases rather than tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Pre-production evaluation trigger
Offline evaluations are run before making any change to the production environment.
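A pre-production gate of this kind can be sketched as follows. This is a minimal illustration and not GitHub's implementation: the `EvalResult` structure, the `may_promote` function, and the `max_regression` tolerance are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Outcome of one offline test case (hypothetical structure)."""
    test_id: str
    passed: bool


def pass_rate(results: list[EvalResult]) -> float:
    """Fraction of test cases that passed; 0.0 when there are no results."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def may_promote(candidate: list[EvalResult],
                baseline: list[EvalResult],
                max_regression: float = 0.0) -> bool:
    """Allow a production change only if the candidate does not regress.

    max_regression is an assumed tolerance for this sketch, not a
    documented threshold.
    """
    return pass_rate(candidate) >= pass_rate(baseline) - max_regression
```

The gate runs entirely on recorded offline results, so a candidate that fails it never reaches the production environment.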
Tools used
GitHub Actions, Apache Kafka, Microsoft Azure
Outcome
GitHub built an offline evaluation system with more than 4,000 automated tests, around 100 containerized repositories, and more than 1,000 technical chat questions, enabling rapid model iteration without product code changes.
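One way to iterate on models without touching product code is to keep the test cases fixed and pass each model in behind a common interface. The sketch below assumes this design for illustration only. The `Model` type, `run_suite`, and `compare_models` are hypothetical names, and the exact-match check is a stand-in for whatever the real tests assert.

```python
from typing import Callable

# A "model" here is any callable mapping a prompt to a completion
# (an assumption made for this illustration).
Model = Callable[[str], str]


def run_suite(model: Model, cases: dict[str, str]) -> dict[str, bool]:
    """Run every case against the model.

    A case passes when the stripped output equals the expected string;
    this exact-match check is a placeholder for a real test.
    """
    return {prompt: model(prompt).strip() == expected
            for prompt, expected in cases.items()}


def compare_models(models: dict[str, Model],
                   cases: dict[str, str]) -> dict[str, float]:
    """Return the pass rate per model name over the same fixed cases."""
    rates: dict[str, float] = {}
    for name, model in models.items():
        outcomes = run_suite(model, cases)
        rates[name] = sum(outcomes.values()) / len(outcomes) if outcomes else 0.0
    return rates
```

Adding a candidate model then means adding one entry to the `models` mapping, while the cases and the harness stay unchanged.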
Results
Volume: more than 4,000
Grounding & classification
Source type: technical build writeup
20 fields verified against source quotes.
ai agent, code generation, code diff pr, human review described, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, software, accuracy improvement, technical build writeup, quality assurance, monitor detect alert