How GitHub evaluates AI models for GitHub Copilot: offline evaluation methodology

With many AI models available from proprietary and open-source providers, GitHub needed a rigorous evaluation process to determine which models to support in GitHub Copilot, since newer models do not always perform better for specific use cases.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack.

Stage 1 · Pre-production evaluation trigger

Offline evaluations are run before making any change to the production environment.
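
The gating idea in this stage can be illustrated with a minimal sketch. It is not GitHub's implementation: EvalCase, pass_rate, may_promote, and the max_regression tolerance are hypothetical names introduced here, and a "model" is assumed to be any callable that maps a prompt string to an output string. The sketch scores a candidate and a baseline on the same offline cases and allows promotion only if the candidate does not regress.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical type alias: a model is anything that turns a prompt into text.
Model = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    """One offline test case: a prompt plus a checker for the model's output."""
    prompt: str
    check: Callable[[str], bool]


def pass_rate(model: Model, cases: Sequence[EvalCase]) -> float:
    """Return the fraction of cases whose output passes its checker."""
    if not cases:
        raise ValueError("no evaluation cases supplied")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def may_promote(
    candidate: Model,
    baseline: Model,
    cases: Sequence[EvalCase],
    max_regression: float = 0.0,
) -> bool:
    """Allow a production change only if the candidate's offline pass rate
    is within max_regression of the baseline's (assumed policy)."""
    return pass_rate(candidate, cases) >= pass_rate(baseline, cases) - max_regression
```

In a setup like this, the check would run in a pre-merge or pre-deploy job, so a failing result blocks the change before it reaches production. The zero-regression default is an assumption; a real policy might weigh several metrics rather than a single pass rate.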

Tools used

GitHub Actions, Apache Kafka, Microsoft Azure

Outcome

GitHub built an offline evaluation system with more than 4,000 automated tests, around 100 containerized repositories, and more than 1,000 technical chat questions, enabling rapid model iteration without product code changes.

Results

Volume: more than 4,000 automated tests

Source

https://github.blog/ai-and-ml/generative-ai/how-we-evaluate-models-for-github-copilot/

Grounding & classification

Source type: technical build writeup

20 fields verified against source quotes.

Tags: ai agent, code generation, code diff pr, human review described, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, software, accuracy improvement, technical build writeup, quality assurance, monitor detect alert