When you type a prompt into Copilot and watch a response stream back, a surprisingly sophisticated system springs into action beneath the surface. Understanding how Copilot is wired together — from the moment you hit Enter to the last token streamed to your editor — makes you a sharper user and helps you build better integrations on top of it.
Availability: All GitHub Copilot plans, including Free.
The three-tier architecture
Copilot's architecture spans three tiers, each with a distinct job:
| Tier | What lives here | Examples |
|---|---|---|
| Client | Surface-specific extensions and CLI tooling that capture your intent and render responses | VS Code extension, JetBrains plugin, gh copilot CLI, GitHub.com UI |
| Copilot Proxy | GitHub's backend layer that authenticates requests, selects the model, assembles context, and routes to the AI provider | api.githubcopilot.com |
| AI Provider | The underlying large language model that generates completions and tool-call decisions | OpenAI (GPT series), Anthropic (Claude), Google (Gemini), xAI (Grok), Alibaba (Qwen) |
The Copilot Proxy is the hidden workhorse. It shields you from provider complexity, handles authentication via your GitHub token, and ensures all model calls conform to GitHub's content policies regardless of which model is selected.
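The proxy's shielding role can be sketched as a routing table: the client always talks to one endpoint, and provider differences stay server-side. A minimal sketch, assuming an illustrative model-to-provider mapping (the names below are examples, not GitHub's actual routing):

```python
# Illustrative sketch only: the client sees one endpoint (api.githubcopilot.com);
# resolving a model to its upstream provider happens behind the proxy.
PROVIDER_BY_MODEL = {
    "gpt-4.1": "openai",
    "claude-sonnet": "anthropic",
    "gemini-flash": "google",
}

def route_model(model: str) -> str:
    """Resolve which upstream provider serves a requested model."""
    provider = PROVIDER_BY_MODEL.get(model)
    if provider is None:
        raise ValueError(f"unknown model: {model}")
    return provider
```

Because the mapping lives in the proxy, the client never needs provider-specific credentials or API shapes.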
The request flow, step by step
Here is what actually happens when you submit a prompt:
YOU (prompt)
│
▼
CLIENT EXTENSION
│ • Captures editor context (open file, cursor, selection)
│ • Attaches #file / @workspace / #fetch references
│ • Sends authenticated request to Copilot Proxy
▼
COPILOT PROXY
│ • Validates GitHub OAuth token
│ • Applies content safety filters (pre-flight)
│ • Selects model (explicit choice, or Auto)
│ • Assembles the final prompt (system prompt + history + context)
│ • Forwards to AI Provider via provider-specific API
▼
AI PROVIDER (LLM)
│ • Generates tokens one by one
│ • May emit tool-call requests (function calling)
│ • Streams response back through the Proxy
▼
COPILOT PROXY
│ • Applies content safety filters (post-flight)
│ • Streams tokens to the client
▼
CLIENT EXTENSION
• Renders streamed tokens in the editor / terminal / chat panel
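The flow above can be condensed into a toy pipeline. Every function name, the token check, and the canned "model" output below are assumptions for illustration, not Copilot's actual internals:

```python
# A minimal sketch of the client → proxy → provider → proxy → client round trip.

def client_request(prompt: str, context: dict) -> dict:
    """Client tier: bundle the prompt with editor context and an auth token."""
    return {"prompt": prompt, "context": context, "token": "gho_example"}

def provider_generate(assembled_prompt: str):
    """Provider tier: pretend to generate tokens one by one."""
    for word in ["def", " add", "(a,", " b):", " return", " a", " +", " b"]:
        yield word

def proxy_handle(request: dict):
    """Proxy tier: validate, assemble the final prompt, forward, stream back."""
    assert request["token"].startswith("gho_")          # auth check (sketch)
    assembled = f"[system]\n[context] {request['context']}\n[user] {request['prompt']}"
    for token in provider_generate(assembled):          # forward to the LLM
        yield token                                     # post-flight filtering would sit here

response = "".join(proxy_handle(client_request("write add()", {"file": "math.py"})))
```

The generator chain mirrors the streaming behaviour: tokens pass through the proxy to the client as they are produced, not in one final batch.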
Model selection: Auto vs. explicit
Copilot supports a growing roster of models from multiple providers. You can pick one explicitly, or let Copilot choose for you with Auto mode.
| Selection mode | How it works | Best for |
|---|---|---|
| Auto | Copilot selects a model for you based on current availability and the type of task | Most everyday workflows — set it and forget it |
| General-purpose (e.g., GPT-4.1, Claude Sonnet) | Balanced speed and quality | Code completions, explanations, reviews |
| Deep reasoning (e.g., GPT-5.1, Claude Opus, Gemini Pro) | Extended reasoning chains before answering | Architecture decisions, complex debugging, multi-step refactors |
| Agentic (e.g., GPT-5.1 Codex Max, GPT-5.2-Codex) | Optimised for multi-step tool use and autonomous task execution | Coding agent tasks, long-running edits |
| Fast / lightweight (e.g., Claude Haiku, Gemini Flash) | Minimal latency, lower premium cost | Repetitive edits, quick lookups, high-volume usage |
Note that different models carry different premium request multipliers. If you're on a plan with a monthly usage allowance, choosing a deep-reasoning model consumes more of your budget per request than a general-purpose one.
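The budget arithmetic is simple to model. The multiplier values below are hypothetical placeholders — check GitHub's pricing documentation for the real per-model figures:

```python
# Hypothetical premium-request multipliers, for illustration only.
MULTIPLIER = {"general": 1.0, "deep-reasoning": 10.0, "lightweight": 0.33}

def premium_requests_used(calls: dict) -> float:
    """Total premium requests consumed: request count x multiplier per model class."""
    return sum(MULTIPLIER[model_class] * count for model_class, count in calls.items())

# With these assumed multipliers, 5 deep-reasoning calls cost as much as
# 50 general-purpose ones.
usage = premium_requests_used({"general": 30, "deep-reasoning": 5, "lightweight": 60})
```
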
How the prompt is assembled
Before any tokens reach the LLM, the Copilot Proxy constructs a composite prompt from several sources — most of which are invisible to you:
- System prompt — GitHub's instructions to the model (persona, content policy, output format expectations). You can extend this with custom instructions in a `.github/copilot-instructions.md` file in your repository.
- Conversation history — previous turns in the chat session, subject to the context window limit.
- Explicit context — files, symbols, URLs, or terminal output you attach via `#file`, `#fetch`, `@workspace`, or similar references.
- Implicit context — the active editor file, cursor position, and surrounding code that the extension automatically includes.
- Tool definitions — the function signatures for any tools the model is allowed to call (repo search, terminal, MCP servers, etc.).
All of this is packed into a single token-counted payload and sent to the AI provider. The context window — the total token capacity of the model — determines how much of this payload can fit.
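Assembly under a token budget can be sketched as greedy packing in priority order. The part names mirror the list above; the priority order and the crude one-word-per-token count are assumptions for illustration:

```python
# A sketch of budget-constrained prompt assembly: pack parts in priority
# order and drop whatever no longer fits the context window.

def assemble_prompt(parts: list[tuple[str, str]], budget: int) -> str:
    kept, used = [], 0
    for name, text in parts:
        cost = len(text.split())        # crude token count, for illustration
        if used + cost <= budget:
            kept.append(f"[{name}]\n{text}")
            used += cost
    return "\n".join(kept)

parts = [
    ("system", "You are a coding assistant"),
    ("history", "user asked about null checks earlier"),
    ("explicit context", "contents of utils.py attached via #file"),
    ("implicit context", "active editor file and cursor region"),
]
```

With a small budget, lower-priority parts are simply absent from the payload — which is why long conversations eventually "forget" earlier context.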
Tool calls and the ReAct loop
Modern LLMs support function calling: instead of answering immediately, the model can emit a structured tool-call request. The Copilot runtime intercepts this, executes the tool, and feeds the result back into the context window. The model then continues from where it left off. This cycle repeats until the model emits a final response — the ReAct loop (Reason → Act → Observe → Reason…).
LLM decides to call a tool
│
▼
Tool executed by runtime
(e.g. read_file, run_terminal, search_codebase, MCP server call)
│
▼
Result appended to context window
│
▼
LLM resumes generation
│
├── another tool call? → repeat
└── final answer? → stream to client
This is why Copilot can perform multi-step tasks like "find all usages of `getUserById`, check for null checks, and write a test" — it issues multiple tool calls within a single response cycle without any additional prompting from you.
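The loop above can be sketched with a stand-in model and one toy tool. The tool name `search_codebase` comes from the diagram; the fake model's decision rule and everything it returns are illustrative assumptions:

```python
# A minimal ReAct loop: call tools until the model produces a final answer.

def fake_model(transcript: list[str]) -> dict:
    """Stand-in LLM: request a tool until an observation exists, then answer."""
    if not any(line.startswith("observation:") for line in transcript):
        return {"tool_call": {"name": "search_codebase", "args": "getUserById"}}
    return {"final": "getUserById is used in 3 places."}

TOOLS = {"search_codebase": lambda query: f"3 usages of {query} found"}

def react_loop(prompt: str) -> str:
    transcript = [f"user: {prompt}"]
    while True:
        step = fake_model(transcript)
        if "final" in step:                        # final answer -> stream to client
            return step["final"]
        call = step["tool_call"]                   # tool call -> execute and observe
        result = TOOLS[call["name"]](call["args"])
        transcript.append(f"observation: {result}")
```

The real runtime works the same way structurally: each observation is appended to the context window, so the model's next generation step can condition on it.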
Streaming and latency
Copilot uses server-sent events to stream tokens from the AI provider through the Proxy to your client the moment they are generated. This is why you see responses appear word-by-word rather than all at once. The trade-off is that deep-reasoning models introduce a longer thinking delay before the first token appears — the model silently reasons before writing. General-purpose models start streaming faster.
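On the wire, server-sent events are `data:` lines separated by blank lines. A sketch of the client-side consumption, assuming a `[DONE]` end-of-stream sentinel (a common convention in LLM streaming APIs, not something the article specifies for Copilot):

```python
# Parse SSE-formatted lines and yield token payloads as they arrive.

def sse_tokens(raw_stream: list[str]):
    for line in raw_stream:
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":       # assumed end-of-stream sentinel
                return
            yield payload

frames = ["data: Hel", "", "data: lo wor", "", "data: ld", "", "data: [DONE]"]
streamed = "".join(sse_tokens(frames))
```

In a real client the frames arrive incrementally over an open HTTP response, which is what lets the editor render each token the moment it is generated.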
Content safety: two checkpoints
GitHub runs content safety filters at two points:
- Pre-flight — the incoming prompt is checked before it reaches the LLM. Requests that violate GitHub's usage policies are rejected before any tokens are consumed.
- Post-flight — the model's response is checked before it is streamed to the client. Outputs that fail the safety check are blocked even if the model produced them.
Both checkpoints are invisible to the end user — you simply see a refusal message if either check fires.
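The two checkpoints wrap the model call symmetrically. The block-list filter below is a toy stand-in for GitHub's real, far more sophisticated content classifiers:

```python
# Sketch of the two safety checkpoints around a model call.
BLOCKED_TERMS = {"malware"}

def passes_safety(text: str) -> bool:
    return not any(term in text.lower() for term in BLOCKED_TERMS)

def handle(prompt: str, generate) -> str:
    if not passes_safety(prompt):         # pre-flight: reject before any tokens
        return "Sorry, I can't help with that."
    response = generate(prompt)
    if not passes_safety(response):       # post-flight: block unsafe output
        return "Sorry, I can't help with that."
    return response
```

Note the pre-flight rejection happens before `generate` runs at all — matching the article's point that blocked prompts consume no tokens.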
Custom instructions: your hook into the system prompt
The one place where you can directly influence the assembled prompt is the custom instructions file. Drop a `.github/copilot-instructions.md` in your repository and Copilot will automatically append its contents to the system prompt for every request made in that repo. This is the recommended way to enforce project conventions, coding styles, or tool preferences without manually including them in every chat message.
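An illustrative `.github/copilot-instructions.md` might look like this — the conventions listed are example content, not recommendations from GitHub:

```markdown
# Copilot instructions for this repository

- Use TypeScript strict mode for all new code.
- Prefer `pnpm` over `npm` in any suggested commands.
- Follow the existing error-handling pattern in `src/errors.ts`.
- Write unit tests with Vitest; place them next to the source file.
```

Keep it short: everything in this file is added to every request, so it competes with your code and conversation history for context-window space.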