Servers, in Earnest
Four layers sit between the loop and your handler. Each one adds opinions; each one leaks under load.
Part 2 ended with a scheduler that fits in fifty lines. A real production server adds a protocol (ASGI), a web server that speaks it (uvicorn), a framework that hides the protocol (Starlette), and a framework that hides the framework (FastAPI). That is four layers between the socket and your handler. Each one is small in isolation. The trouble lives where they meet — and the trouble is rarely exotic. It is a thread pool you did not know about, a queue you did not see, a shutdown that doesn't actually shut down. This post is about the seams.
Each layer is small. The interesting failures live at the joins — and most of this post is a walk down them.
§1 ASGI is three messages
The first piece worth knowing about, just below FastAPI, is the protocol uvicorn uses to talk to your framework. It is called ASGI, and it is shockingly small. Reading it once makes the rest of the stack legible — every layer above is a different opinion about what to do with three pieces of data.
An ASGI application is an async callable. It takes three arguments: a scope dict describing the connection, an async receive() that returns the next inbound event, and an async send() that emits an outbound event. That is the entire interface.
```python
async def app(scope, receive, send):
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({
        "type": "http.response.body",
        "body": b"hello\n",
    })
```
A complete ASGI HTTP application. Save it as app.py, run uvicorn app:app, curl it.
The scope is a plain dict — method, path, headers, query string, client address. receive yields events like http.request (with a body and a more_body flag for streaming). send takes events like http.response.start and http.response.body. The whole specification is around five pages.
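The receive side is easiest to see in a handler that wants the whole request body. The sketch below loops on receive() until more_body is false. The names read_body, echo_app and call_app are illustrative, and call_app is a hypothetical in-memory driver standing in for a server, not part of any library.

```python
import asyncio

async def read_body(receive):
    """Collect http.request events until more_body is false."""
    chunks = []
    while True:
        event = await receive()
        if event["type"] != "http.request":
            break  # e.g. http.disconnect: the client went away
        chunks.append(event.get("body", b""))
        if not event.get("more_body", False):
            break
    return b"".join(chunks)

async def echo_app(scope, receive, send):
    assert scope["type"] == "http"
    body = await read_body(receive)
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"application/octet-stream")],
    })
    await send({"type": "http.response.body", "body": body})

async def call_app(app, chunks):
    """Hypothetical driver: feed body chunks in, record what the app sends."""
    pending = [
        {"type": "http.request", "body": chunk, "more_body": i < len(chunks) - 1}
        for i, chunk in enumerate(chunks)
    ]
    sent = []

    async def receive():
        return pending.pop(0)

    async def send(event):
        sent.append(event)

    scope = {"type": "http", "method": "POST", "path": "/", "headers": []}
    await app(scope, receive, send)
    return sent

if __name__ == "__main__":
    for event in asyncio.run(call_app(echo_app, [b"hel", b"lo"])):
        print(event["type"])
```

Because the whole interface is dicts and two callables, the app can be exercised without a socket or a server process.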
Starlette is a few thousand lines on top of that contract — routing, body parsing, middleware, the request and response objects you actually program against. FastAPI is a few thousand more on top of Starlette: dependency injection, Pydantic validation, OpenAPI generation. Eventually, both of them hand you exactly the three-argument callable that uvicorn invokes. The protocol is dict-passing all the way down. There are no base classes to inherit, no metaclasses. A framework is just a different opinion about what to do with scope, receive, and send.
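To make "a framework is an opinion" concrete, here is a toy router in a couple of dozen lines. It is a sketch only: TinyRouter and its route decorator are hypothetical names, and it handles none of what Starlette actually handles (methods, path parameters, middleware, errors).

```python
import asyncio

class TinyRouter:
    """A toy 'framework': maps paths to handlers and is itself an ASGI app."""

    def __init__(self):
        self._routes = {}

    def route(self, path):
        def register(handler):
            self._routes[path] = handler
            return handler
        return register

    async def __call__(self, scope, receive, send):
        handler = self._routes.get(scope["path"])
        if handler is None:
            status, text = 404, "not found\n"
        else:
            status, text = 200, await handler(scope)
        await send({
            "type": "http.response.start",
            "status": status,
            "headers": [(b"content-type", b"text/plain")],
        })
        await send({"type": "http.response.body", "body": text.encode()})

app = TinyRouter()

@app.route("/")
async def index(scope):
    return "hello\n"

async def call(asgi_app, path):
    """Hypothetical in-memory driver, standing in for a server."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(event):
        sent.append(event)

    scope = {"type": "http", "method": "GET", "path": path, "headers": []}
    await asgi_app(scope, receive, send)
    return sent
```

The router instance is what a server would be pointed at: from the outside it is still just a callable taking scope, receive and send.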
The Go analogue is http.Handler: a single interface with one method, ServeHTTP(w, r). Everything in net/http and every Go web framework reduces to that signature. ASGI is the same idea, async, and dict-shaped rather than struct-shaped. Frameworks compete on ergonomics above a tiny contract.
§2 What uvicorn does between the socket and your code
Above the loop and below ASGI is uvicorn, and its job is to turn TCP bytes into the events the protocol describes. The mechanism is unromantic, which is the point.
At startup, uvicorn opens a listening socket and hands it to the asyncio loop via loop.create_server(...); on the default selector loop, that registers the fd for readability, the same mechanism as loop.add_reader(fd, callback). Each readability event on that fd is a pending accept(). The new connection's fd is added to the selector. As bytes arrive, h11 (or httptools, the C parser, when it is enabled) parses them incrementally into HTTP messages. When request headers are complete, uvicorn assembles a scope dict, builds receive and send callables bound to this connection, and does this:
```python
loop.create_task(app(scope, receive, send))
```
That Task lives until your handler returns. While it is parked on a database query, the loop services every other connection's Task. One thread, ten thousand fds, ten thousand small Task objects — the multiplexer from Part 2 §1 underneath, doing its job.
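The shape of that work can be sketched with the standard library alone. This is not how uvicorn is implemented: asyncio.start_server stands in for its protocol classes, the header parsing is deliberately naive, request bodies are ignored, and connection_handler and demo are illustrative names. What it does show is the one-connection, one-Task pattern, since start_server runs the callback as a Task for each accepted connection.

```python
import asyncio

async def hello_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": b"hello\n"})

def connection_handler(app):
    async def handle(reader, writer):
        # Read the request head and split it naively (no body support).
        head = (await reader.readuntil(b"\r\n\r\n")).decode("latin-1")
        request_line, *header_lines = head.split("\r\n")
        method, target, _version = request_line.split(" ", 2)
        path, _, query = target.partition("?")
        headers = [
            (name.strip().lower().encode("latin-1"), value.strip().encode("latin-1"))
            for name, _, value in (line.partition(":") for line in header_lines if line)
        ]
        scope = {"type": "http", "method": method, "path": path,
                 "query_string": query.encode("latin-1"), "headers": headers,
                 "client": writer.get_extra_info("peername")}

        async def receive():
            return {"type": "http.request", "body": b"", "more_body": False}

        async def send(message):
            if message["type"] == "http.response.start":
                # Naive status line: the reason phrase is always "OK".
                writer.write(b"HTTP/1.1 %d OK\r\n" % message["status"])
                for name, value in message["headers"]:
                    writer.write(name + b": " + value + b"\r\n")
                writer.write(b"connection: close\r\n\r\n")
            elif message["type"] == "http.response.body":
                writer.write(message.get("body", b""))
                await writer.drain()

        await app(scope, receive, send)
        writer.close()
        await writer.wait_closed()
    return handle

async def demo():
    server = await asyncio.start_server(connection_handler(hello_app), "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"GET /hi?x=1 HTTP/1.1\r\nhost: test\r\n\r\n")
    await writer.drain()
    raw = await reader.read()  # the toy server closes the connection when done
    writer.close()
    await writer.wait_closed()
    server.close()
    return raw

if __name__ == "__main__":
    print(asyncio.run(demo()).decode("latin-1"))
```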
The useful mental model is one-to-one: one TCP connection is one fd, one parser, one Task. Memory cost scales with active connections, not with maximum theoretical concurrency. Ten thousand idle keep-alive sockets is on the order of tens of megabytes total — most of which is kernel-side socket buffers, not Python objects. The reason async exists, in one line.
§3 Sync handlers, and the thread pool you didn't ask for
FastAPI lets you write handlers as either async def or def. The convenience hides a switch. Starlette inspects the function when you register the route. If it is async def, the framework awaits it directly — the handler runs on the loop. If it is def, the framework wraps the call in anyio.to_thread.run_sync — the handler runs in a thread pool, and your async code awaits the Future the threadpool produces.
That is the FastAPI sync-route mechanism. It is not magic. It is a to_thread.run_sync.
```python
# Starlette/_utils.py, paraphrased
if asyncio.iscoroutinefunction(handler):
    response = await handler(request)
else:
    response = await anyio.to_thread.run_sync(handler, request)
```
One branch in route dispatch. The whole sync-handler story.
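A runnable stand-in for that branch, using only the standard library: loop.run_in_executor plays the role of anyio.to_thread.run_sync here, which is an assumption made for the sake of a dependency-free sketch. The dispatch function and both handlers are illustrative names. Each handler returns the id of the thread it ran on, so the difference is observable.

```python
import asyncio
import inspect
import threading

async def dispatch(handler, request):
    """Await coroutine handlers on the loop; push plain functions to a thread."""
    if inspect.iscoroutinefunction(handler):
        return await handler(request)
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, handler, request)

async def async_handler(request):
    return threading.get_ident()  # runs on the loop thread

def sync_handler(request):
    return threading.get_ident()  # runs on a worker thread

async def main():
    loop_thread = threading.get_ident()
    on_loop = await dispatch(async_handler, {})
    in_pool = await dispatch(sync_handler, {})
    return loop_thread, on_loop, in_pool

if __name__ == "__main__":
    loop_thread, on_loop, in_pool = asyncio.run(main())
    print("async handler on loop thread:", on_loop == loop_thread)
    print("sync handler on loop thread:", in_pool == loop_thread)
```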
The pool is bounded. In current anyio the default cap is forty threads: anyio.to_thread.current_default_thread_limiter().total_tokens is raisable, but forty is what you get out of the box. Forty concurrent sync handlers and the forty-first waits on the limiter. While it waits, the request makes no progress at all. And the forty threads that are running pay the GIL tax from Part 1: only one thread executes Python bytecode at a time, and they take turns at the switch interval, which is five milliseconds by default.
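The waiting behaviour is easy to reproduce at small scale. The sketch below is a stand-in, not anyio: a ThreadPoolExecutor with CAP workers plays the role of the limiter, CAP is set to 4 instead of 40 to keep the run short, and sync_handler is a hypothetical handler that sleeps. One request more than the cap is submitted, and each records when it actually started.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

CAP = 4             # stand-in for the thread cap (anyio's default is 40)
WORK_SECONDS = 0.1  # how long each hypothetical sync handler blocks

def sync_handler(started_at, index, t0):
    started_at[index] = time.monotonic() - t0  # when this request began running
    time.sleep(WORK_SECONDS)

async def main():
    loop = asyncio.get_running_loop()
    started_at = {}
    t0 = time.monotonic()
    with ThreadPoolExecutor(max_workers=CAP) as pool:
        await asyncio.gather(*(
            loop.run_in_executor(pool, sync_handler, started_at, i, t0)
            for i in range(CAP + 1)  # one more request than there are threads
        ))
    return started_at

if __name__ == "__main__":
    for index, seconds in sorted(asyncio.run(main()).items()):
        print(f"request {index} started after {seconds * 1000:.0f} ms")
```

The request beyond the cap cannot start until one of the earlier ones finishes, so its start time is at least one full handler duration late.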
The right choice is not aesthetic; it is dictated by your libraries. asyncpg, httpx, aioredis are async-native: write async def, stay on the loop. psycopg2, requests, hashlib, pandas are sync-only: write def and accept that you have forty slots. The bad outcomes are the mixes: a def handler that does no blocking work occupies a thread for no reason, and an async def handler that calls a blocking C library freezes every other request on the loop, which is the cooperative trap from Part 2 §6 wearing a different hat.
| handler | where it runs | concurrency cap | fails when |
|---|---|---|---|
| async def, async-only I/O | event loop | ~10k (memory) | any sync blocking call inside |
| def, sync libraries | anyio thread pool | 40 (default) | more than 40 in flight |
| async def, sync blocking call | event loop, frozen | 1, effectively | immediately, but quietly |
The worst row is the last. An async def handler that calls requests.get(...) or time.sleep(...) looks correct, passes review, runs in tests. Under load it stops the entire server every time it is called. You discover this in production because p99 latency jumps and p50 doesn't move — Part 2 §6 again. The defence is mechanical: in async def handlers, every blocking-shaped call is either async-native or wrapped in run_in_executor.
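A small experiment makes the last row visible. In this sketch a ticker coroutine stands in for "every other request", and time.sleep stands in for any blocking call such as requests.get. The helper max_tick_gap and both handlers are illustrative names.

```python
import asyncio
import contextlib
import time

async def max_tick_gap(handler, tick=0.01):
    """Run handler once beside a ticker; return the longest gap between ticks."""
    gaps = []

    async def ticker():
        last = time.monotonic()
        while True:
            await asyncio.sleep(tick)
            now = time.monotonic()
            gaps.append(now - last)
            last = now

    task = asyncio.create_task(ticker())
    await asyncio.sleep(tick * 3)  # let the ticker settle
    await handler()
    await asyncio.sleep(tick * 3)
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return max(gaps)

async def frozen_handler():
    time.sleep(0.3)  # blocks the loop thread: nothing else runs

async def offloaded_handler():
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, time.sleep, 0.3)  # same wait, on a worker thread

if __name__ == "__main__":
    print("blocking call on the loop:", asyncio.run(max_tick_gap(frozen_handler)))
    print("same call in the executor:", asyncio.run(max_tick_gap(offloaded_handler)))
```

Both handlers take the same wall-clock time. Only the first one makes the ticker miss its schedule for the whole duration.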
§4 The connection pool is a queue
Production async handlers usually talk to a database, and the database does not have ten thousand free connections. The pool — asyncpg's is the canonical Python one — is the accommodation: a fixed number of physical connections, lent out to whichever Task needs one next. The internals are smaller than the surface suggests.
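```python
import asyncio
from collections import deque

class Pool:
    def __init__(self, connections):
        self._available = deque(connections)  # idle connections
        self._waiters = deque()               # Futures of parked Tasks, FIFO

    async def acquire(self):
        if self._available:
            return self._available.popleft()
        # Nothing free: park this Task on a Future until release() fills it.
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        return await fut

    def release(self, conn):
        if self._waiters:
            fut = self._waiters.popleft()
            fut.set_result(conn)  # wake the longest waiter
        else:
            self._available.append(conn)
```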
asyncpg/pool.py, paraphrased. The waiters list is FIFO. The whole queue is around fifty lines including the rest of the surface.
Acquire returns immediately if a connection is free; otherwise it parks the Task on a Future and queues it. Release either hands the connection to the longest-waiting Task or returns it to the pool. Producer/consumer over Futures, with fairness. The mechanism is benign; the dynamics under load are what bite.
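The fairness claim can be checked directly. The sketch below uses a toy pool in the same shape as the paraphrase, with one string standing in for a connection. TinyPool and worker are illustrative names, and the toy ignores details a real pool has to handle, such as cancelled waiters and timeouts.

```python
import asyncio
from collections import deque

class TinyPool:
    def __init__(self, connections):
        self._available = deque(connections)
        self._waiters = deque()

    async def acquire(self):
        if self._available:
            return self._available.popleft()
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        return await fut

    def release(self, conn):
        if self._waiters:
            self._waiters.popleft().set_result(conn)  # longest waiter first
        else:
            self._available.append(conn)

async def main():
    pool = TinyPool(["conn-0"])  # one connection, three contenders
    order = []

    async def worker(name):
        conn = await pool.acquire()
        order.append(name)
        await asyncio.sleep(0)  # pretend to run a query
        pool.release(conn)

    await asyncio.gather(*(worker(name) for name in ("a", "b", "c")))
    return order

if __name__ == "__main__":
    print(asyncio.run(main()))  # ['a', 'b', 'c']
```

Workers are served in the order they asked, because release() always wakes the Future at the head of the waiters deque.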
Picture a burst: a hundred requests arrive in the same hundred milliseconds, each needing roughly fifty milliseconds of database time, against a pool of max_size=10. The first ten run immediately. The next ninety park on the waiters list. By the time request one hundred reaches Postgres, it has spent around four hundred and fifty milliseconds in the queue alone (nine batches of ten ahead of it at fifty milliseconds each, treating the burst as simultaneous). Aggregate p50 latency rises roughly fivefold, from fifty milliseconds to around two hundred and fifty, but the loop is unstressed, with negligible stall, and the database is unstressed because every connection it has is busy, not slow. The bottleneck is inside a Python object that nothing on a typical dashboard reports.
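The burst arithmetic can be reproduced with a few lines of simulation. It rests on two simplifying assumptions: all hundred requests arrive at t=0, and every query takes exactly fifty milliseconds. The function simulate_burst is an illustrative name, and no real pool or database is involved.

```python
import heapq
import statistics

def simulate_burst(requests=100, pool_size=10, db_ms=50):
    """Queue wait per request in ms, assuming FIFO and all arrivals at t=0."""
    free_at = [0] * pool_size  # min-heap: when each connection next frees up
    heapq.heapify(free_at)
    waits = []
    for _ in range(requests):
        start = heapq.heappop(free_at)  # earliest available connection
        waits.append(start)             # arrival is t=0, so wait equals start time
        heapq.heappush(free_at, start + db_ms)
    return waits

waits = simulate_burst()
latencies = [wait + 50 for wait in waits]  # queue time, then DB time
print(waits[-1], statistics.median(latencies))  # 450 275.0
```

The last request waits 450 ms, and the median request takes 275 ms end to end against an unloaded 50 ms, all of it spent in the waiters queue rather than in Postgres.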
Twelve requests, a pool of ten. The last two are parked in _waiters until a slot opens. Their latency is queue time, then DB time — and queue time is invisible to most dashboards.
The first instinct is to raise max_size. Sometimes this is right. Often it is not. Postgres charges per connection — each backend is a forked process holding around ten megabytes of resident memory plus its shared-buffer share. A hundred application replicas each carrying twenty connections is two thousand Postgres backends, and at that scale Postgres itself starts to suffer. The classical fix is PgBouncer in transaction mode, which sits between the application and the database and multiplexes many client connections onto few server connections; its own dynamics deserve a separate post. The practical lesson here is that pool sizing is a coupling between two queues — yours and the database's — and the right answer is rarely the largest one.
§5 The blocking that hides in libraries
Three classes of accident catch otherwise careful code. They are worth naming together because they share a detector and a fix.
The first is DNS resolution. socket.getaddrinfo is a blocking libc call: CPython releases the GIL around it, but the calling thread is stuck until it returns, and if that thread is the loop thread nothing else runs. asyncio is aware of this and routes its own name lookups through loop.getaddrinfo, which punts the call to the default thread-pool executor. asyncpg and httpx both go through that path. The trap is requests, which uses urllib3, which calls getaddrinfo directly on whatever thread invoked it. A single requests.get inside an async def handler stalls the loop for the duration of the DNS lookup: usually low single-digit milliseconds, occasionally seconds when your resolver is unhappy.
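For code that needs a lookup of its own, the loop-friendly path is a single call. This is a minimal sketch: resolve is an illustrative helper, and the demo passes a numeric address so that it runs without a working resolver. Substitute a real hostname to see an actual lookup.

```python
import asyncio
import socket

async def resolve(host, port):
    """Resolve host without blocking the loop thread."""
    loop = asyncio.get_running_loop()
    # loop.getaddrinfo runs the blocking lookup in the default executor.
    infos = await loop.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [sockaddr[0] for _family, _type, _proto, _canonname, sockaddr in infos]

if __name__ == "__main__":
    print(asyncio.run(resolve("127.0.0.1", 80)))
```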
The second is parsing or serialising large payloads. json.loads on a fifty-megabyte body is around two hundred milliseconds of uninterrupted CPU on the loop thread, and it does not yield in the middle. orjson is faster but still atomic. If your handler ingests big bodies, the right shape is to read the body into bytes asynchronously and then punt the parse to a thread.
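Punting the parse to a thread is one line once the body is in hand. A minimal sketch, with parse_json_off_loop as an illustrative name and a small payload standing in for a large one:

```python
import asyncio
import json

async def parse_json_off_loop(body: bytes):
    """Parse a JSON body on a worker thread so the loop thread stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, json.loads, body)

if __name__ == "__main__":
    payload = json.dumps({"items": list(range(1000))}).encode()
    parsed = asyncio.run(parse_json_off_loop(payload))
    print(len(parsed["items"]))
```

The parse still holds the GIL while it runs, so this does not make it faster. What changes is that the loop thread gets its turns back at the interpreter's switch interval instead of waiting for the whole parse to finish.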
The third is regular-expression backtracking. A user-supplied input matched against an unwise pattern can pin a CPU for seconds, and the re module does not yield mid-match. The defence is to design patterns deliberately, prefer anchored variants where possible, and reach for the re2 package when your inputs come from outside.
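A small illustration of an unwise pattern, using a textbook case rather than anything from a real codebase. The nested quantifier in (a+)+$ makes the re module try exponentially many ways to split a run of a's before it gives up on a near-miss input. The flat pattern a+$ accepts the same strings without that blow-up. The input is kept short so the demo finishes quickly, and time_match is an illustrative helper.

```python
import re
import time

NESTED = re.compile(r"(a+)+$")  # nested quantifier: backtracks exponentially on a near-miss
FLAT = re.compile(r"a+$")       # accepts the same strings with linear work

def time_match(pattern, text):
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

hostile = "a" * 20 + "b"  # almost matches, then fails at the last character

if __name__ == "__main__":
    for name, pattern in (("nested", NESTED), ("flat", FLAT)):
        result, seconds = time_match(pattern, hostile)
        print(f"{name}: match={result} in {seconds:.6f}s")
```

Each extra "a" in the hostile input roughly doubles the nested pattern's running time, which is how a short user-supplied string turns into seconds of loop stall.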
These three look unrelated. Underneath, they are the same shape: an operation whose running time is proportional to its input, executing on the loop thread, with no await inside. Every such operation can become a loop stall. The detector from Part 2 §6 — measuring how late a periodic timer callback actually fires — catches all of them. The fix is the same as it was there: run_in_executor, or the async-native version of the library.
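A minimal sketch of such a detector, assuming nothing beyond asyncio: a coroutine sleeps for a fixed interval and records how late it woke up. The name monitor_loop_lag and the ten-millisecond interval are illustrative choices, and the time.sleep call in main is a deliberate stall to give the monitor something to catch.

```python
import asyncio
import contextlib
import time

async def monitor_loop_lag(samples, interval=0.01):
    """Append, once per interval, how late the loop woke this coroutine (seconds)."""
    loop = asyncio.get_running_loop()
    while True:
        expected = loop.time() + interval
        await asyncio.sleep(interval)
        samples.append(max(0.0, loop.time() - expected))

async def main():
    samples = []
    monitor = asyncio.create_task(monitor_loop_lag(samples))
    await asyncio.sleep(0.05)  # healthy period
    time.sleep(0.2)            # deliberate stall on the loop thread
    await asyncio.sleep(0.05)
    monitor.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await monitor
    return samples

if __name__ == "__main__":
    samples = asyncio.run(main())
    print(f"worst lag: {max(samples) * 1000:.0f} ms over {len(samples)} samples")
```

In a real service the samples would feed a histogram rather than a list, so that p99 can be read off a dashboard.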
§6 Shutdown, and the request that doesn't know it's over
Eventually the orchestrator decides your pod is done. SIGTERM arrives. Kubernetes gives the process a grace window — thirty seconds by default — and then SIGKILL. A well-behaved server uses the window to drain in-flight work, and the mechanism in uvicorn is a five-step sequence:
1. A SIGTERM handler sets should_exit = True.
2. The accept loop sees the flag and closes the listening socket. No new connections.
3. uvicorn emits the ASGI lifespan.shutdown event. Starlette runs your registered shutdown handlers: closing the asyncpg pool, flushing logs, draining Kafka producers.
4. uvicorn waits for in-flight request Tasks to finish on their own.
5. If they do not finish within --timeout-graceful-shutdown seconds, uvicorn calls task.cancel() on each. The option is unset by default, in which case uvicorn waits indefinitely. Cancellation, as in Part 2 §7, is cooperative.

The happy case is clean. The failure modes are not exotic.

A handler in the middle of a slow database query gets cancelled at its next await, which is the query itself. CancelledError propagates up. Since Python 3.8 it inherits from BaseException, so a plain except Exception lets it through; the handlers that swallow it are the ones with a bare except:, an except BaseException, or an except asyncio.CancelledError that logs and carries on. On Python 3.7 or earlier, any except Exception, the common shape for "log and return 500", swallows it too. Either way the cancellation is lost. The Task keeps running. The shutdown timeline blows past its budget.

The connection pool, meanwhile, was closed back in step three. A handler still using a connection in step four hits a "pool is closing" error rather than a clean cancellation, and the shape of that error is library-specific and rarely covered by tests. Worse, the application's grace period and the orchestrator's grace period are configured independently. Kubernetes ships with terminationGracePeriodSeconds: 30; many helm charts override it to five. uvicorn's --timeout-graceful-shutdown is unset unless you pass it, which means no limit at all. The smaller number wins silently; the larger one never gets a chance.
SIGTERM at t=0. Socket closes immediately, lifespan shutdown runs, in-flight requests drain — most cleanly, one not. The grey region is where SIGKILL would land if grace expired.
Practical consequences: align terminationGracePeriodSeconds with uvicorn's --timeout-graceful-shutdown, with the latter slightly smaller so that the application, not the orchestrator, controls the shutdown. Audit handlers for broad except clauses around await points (bare except:, except BaseException, and except Exception on Python 3.7 or earlier) and re-raise asyncio.CancelledError explicitly in each. Close pools after handler drainage rather than before, in a finally-style shutdown handler that runs last.
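The difference between a handler that swallows cancellation and one that cooperates fits in a short experiment. Both handlers below are hypothetical, and asyncio.sleep(10) stands in for a slow query. The helper cancel_and_report cancels the Task the way a server would at the end of its grace period and reports whether the cancellation took.

```python
import asyncio

async def swallowing_handler(log):
    try:
        await asyncio.sleep(10)  # stands in for a slow query
    except BaseException:        # too broad: also catches CancelledError
        log.append("swallowed")
    await asyncio.sleep(0.05)    # the Task carries on after cancel()
    log.append("finished anyway")

async def cooperative_handler(log):
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        log.append("cleaning up")
        raise                    # let the cancellation propagate
    except Exception:
        log.append("returning 500")

async def cancel_and_report(handler):
    log = []
    task = asyncio.create_task(handler(log))
    await asyncio.sleep(0.01)    # let the handler reach its await
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return task.cancelled(), log

if __name__ == "__main__":
    print(asyncio.run(cancel_and_report(swallowing_handler)))
    print(asyncio.run(cancel_and_report(cooperative_handler)))
```

The first Task ends up not cancelled and keeps doing work after the server asked it to stop. The second does its cleanup and exits as cancelled.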
§7 What to watch
The metrics that expose this stack's failures are specific, and most aggregate dashboards do not show them. Four numbers, between them, surface everything in this post:
- Event-loop stall. Schedule a callback for every ten milliseconds; measure how late it actually fires. Healthy is under two milliseconds p99. Anything in §3, §5, or a misbehaving extension shows up here first.
- Connection-pool wait time. Time between calling pool.acquire() and receiving the connection. Healthy is essentially zero. Pool saturation (§4) shows up here, never in database CPU.
- Active threads in the anyio pool. A gauge counting how many threads are currently running sync handlers. Approaching forty means the next sync request is waiting in the limiter.
- Per-handler request count and duration, split by def versus async def. They scale differently, fail differently, and aggregating them hides which kind is in trouble.
If the only thing on the wall is p50 latency, none of the failure modes in this post are visible until they are already serious. Two of the four — loop stall and pool wait — are five lines of code to instrument and the most valuable five lines you can add.
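Pool wait can be instrumented with a small wrapper around acquire. The sketch assumes a pool whose acquire() and release() are both awaitable, as asyncpg's are. The names timed_acquire and record_wait are illustrative, and FakePool is a stand-in so the example runs without a database.

```python
import asyncio
import time
from contextlib import asynccontextmanager

@asynccontextmanager
async def timed_acquire(pool, record_wait):
    """Acquire a connection and report how long the caller sat in the queue."""
    start = time.perf_counter()
    conn = await pool.acquire()
    record_wait(time.perf_counter() - start)  # seconds spent waiting
    try:
        yield conn
    finally:
        await pool.release(conn)

class FakePool:
    """Stand-in pool for the demo: a queue of strings."""

    def __init__(self, size):
        self._queue = asyncio.Queue()
        for i in range(size):
            self._queue.put_nowait(f"conn-{i}")

    async def acquire(self):
        return await self._queue.get()

    async def release(self, conn):
        self._queue.put_nowait(conn)

async def main():
    pool = FakePool(size=1)
    waits = []

    async def handler():
        async with timed_acquire(pool, waits.append):
            await asyncio.sleep(0.05)  # pretend to run a query

    await asyncio.gather(handler(), handler())
    return waits

if __name__ == "__main__":
    print([f"{wait * 1000:.0f} ms" for wait in asyncio.run(main())])
```

With one connection and two handlers, the first wait is near zero and the second is roughly one query long. That second number is the one a dashboard built on database metrics never sees.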
§8 Things to take with you
- ASGI is a three-argument async callable: scope, receive, send. Everything above is opinion; everything below is uvicorn and asyncio.
- def handlers in FastAPI run in a forty-thread anyio pool. async def handlers run on the loop. They are not interchangeable, and the worst mistake is mixing them: a sync blocking call inside an async handler freezes everything.
- The connection pool is a FIFO queue over Futures. Watch wait time, not connection count. Raising max_size shifts the queue to Postgres rather than removing it.
- Anything that runs in linear time on input without an await can become a loop stall. The detector from Part 2 §6 catches them all.
- Graceful shutdown is cooperative: your handlers have to participate. Handlers that swallow CancelledError defeat it silently.
Three posts of mechanism, and the stack is finally small enough to hold in one head. The destinations from here, in rough order of value:
- Starlette's source: starlette/middleware/base.py and starlette/routing.py. A few hundred lines each. Trace a request through and the ASGI contract becomes concrete.
- asyncpg's pool.py. The waiters queue is fifty lines including comments.
- The ASGI specification at asgi.readthedocs.io. Thirty minutes, no jargon.
- Nathaniel J. Smith's Notes on Structured Concurrency (2018): the philosophical underpinning of TaskGroup, and worth the read regardless of language.
- PEP 703 and the free-threaded Python rollout. When the GIL goes, the calculus for sync handlers changes: there is no longer a reason to prefer async-everything if CPU work can finally parallelise.
None of this stack is exotic in isolation. Each piece fits in a head. The complexity emerges at the joins — which is, in the end, the only place worth looking when something is wrong.