How GitHub built Copilot: a globally-distributed LLM code completion service serving 400M+ requests at under 200ms
GitHub needed to serve LLM-based code completions at latency competitive with locally run IDE autocomplete, despite network round trips, shared server resources, and cloud outages. Authenticating users at scale and cancelling requests efficiently were also unsolved problems.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · User pauses typing in IDE
Whenever the user stops typing, Copilot initiates a completion request.
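The pause-then-request behavior can be sketched as a debounce: each keystroke cancels whatever request is pending, and a new request fires only once the user has been idle for a short interval. This is a minimal illustration, not Copilot's actual client code; the class name `DebouncedCompleter`, the `fetch` coroutine, and the 50 ms threshold are all hypothetical.

```python
import asyncio


class DebouncedCompleter:
    """Calls `fetch` only after the user has been idle for `pause_s` seconds."""

    def __init__(self, fetch, pause_s=0.05):
        self._fetch = fetch        # hypothetical coroutine: prefix -> completion
        self._pause_s = pause_s    # hypothetical idle threshold, in seconds
        self._task = None
        self.results = []

    def on_keystroke(self, prefix):
        # A new keystroke makes any pending or in-flight request stale.
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = asyncio.ensure_future(self._run(prefix))

    async def _run(self, prefix):
        await asyncio.sleep(self._pause_s)              # wait for the typing pause
        self.results.append(await self._fetch(prefix))  # then request a completion


async def demo():
    async def fake_fetch(prefix):
        # Stand-in for the network call to the completion service.
        return prefix + "<completion>"

    completer = DebouncedCompleter(fake_fetch, pause_s=0.05)
    for prefix in ("d", "de", "def"):   # three quick keystrokes, 10 ms apart
        completer.on_keystroke(prefix)
        await asyncio.sleep(0.01)
    await asyncio.sleep(0.2)            # the user pauses
    return completer.results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Only the last prefix produces a request in this sketch. The earlier two are cancelled before their idle timer expires, which is why cheap cancellation matters so much for this workload.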
GitHub Copilot serves more than 400 million completion requests a day with a mean response time under 200 milliseconds, peaking at 8,000 requests per second. It achieves global resilience through regional proxy colocation and self-healing DNS routing.
What failed first
The alpha required users to supply their own OpenAI API keys and scaled to only dozens of users. Under HTTP/1, the only way to cancel an in-flight request is to close the connection, so every cancelled request forced a costly TCP reconnection. A point-of-presence model caused traffic tromboning and a high operational burden. Most cloud load balancers downgraded HTTP/2 to HTTP/1 on the backend, which undermined stream-level cancellation.
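The cost difference between the two cancellation styles can be made concrete with a toy model. Under HTTP/1, aborting an in-flight request means closing the socket, so the next request pays for a new connection; under HTTP/2, the client sends a RST_STREAM frame for that one stream and keeps the connection open. The sketch below only counts events and performs no networking; `ConnStats` and `simulate` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConnStats:
    handshakes: int = 0  # connection setups (TCP, plus TLS in practice)
    resets: int = 0      # HTTP/2 RST_STREAM frames sent


def simulate(requests, http2):
    """Count connection setups for a sequence of requests.

    `requests` is a list of booleans: True means the request was
    cancelled while in flight.
    """
    stats = ConnStats()
    connected = False
    for cancelled in requests:
        if not connected:
            stats.handshakes += 1  # pay for a fresh connection
            connected = True
        if cancelled:
            if http2:
                stats.resets += 1  # cancel one stream, connection survives
            else:
                connected = False  # HTTP/1: abort by closing the socket
    return stats


if __name__ == "__main__":
    pattern = [True, True, False, True, False]
    print("HTTP/1:", simulate(pattern, http2=False))
    print("HTTP/2:", simulate(pattern, http2=True))
```

With three of five requests cancelled, the HTTP/1 model opens four connections while the HTTP/2 model opens one. The same reasoning explains why a load balancer that downgrades to HTTP/1 on the backend loses the benefit: the cancellation can no longer travel to the model server as a stream reset.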