Python Concurrency · Part 3 of 3

Servers, in Earnest

Four layers sit between the loop and your handler. Each one adds opinions; each one leaks under load.

~15 min read · Python 3.11+ · Assumes: Parts 1 & 2

Part 2 ended with a scheduler that fits in fifty lines. A real production server adds a protocol (ASGI), a web server that speaks it (uvicorn), a framework that hides the protocol (Starlette), and a framework that hides the framework (FastAPI). That is four layers between the socket and your handler. Each one is small in isolation. The trouble lives where they meet — and the trouble is rarely exotic. It is a thread pool you did not know about, a queue you did not see, a shutdown that doesn't actually shut down. This post is about the seams.

The stack at a glance

Each layer is small. The interesting failures live at the joins — and most of this post is a walk down them.

FastAPI: DI · validation · OpenAPI
Starlette: routing · middleware · request/response
ASGI: scope · receive · send (a contract, not a layer)
uvicorn: accept · h11/httptools · Task per request
asyncio: _run_once · selectors · Futures
epoll / kqueue: kernel · the only place we block

(a request travels down)

§1 ASGI is three messages

The first piece worth knowing about, just below FastAPI, is the protocol uvicorn uses to talk to your framework. It is called ASGI, and it is shockingly small. Reading it once makes the rest of the stack legible — every layer above is a different opinion about what to do with three pieces of data.

An ASGI application is an async callable. It takes three arguments: a scope dict describing the connection, an async receive() that returns the next inbound event, and an async send() that emits an outbound event. That is the entire interface.

async def app(scope, receive, send):
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({
        "type": "http.response.body",
        "body": b"hello\n",
    })

A complete ASGI HTTP application. Save it as app.py, run uvicorn app:app, curl it.

The scope is a plain dict — method, path, headers, query string, client address. receive yields events like http.request (with a body and a more_body flag for streaming). send takes events like http.response.start and http.response.body. The whole specification is around five pages.
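To make the streaming shape concrete, here is a minimal sketch of draining a request body from receive and echoing it back. The names read_body, echo_app, and call_app are invented for illustration and belong to no library; the event keys (type, body, more_body) are the ones described above. The call_app helper plays the server's role by hand, so nothing here needs uvicorn.

```python
import asyncio

async def read_body(receive):
    """Collect the request body by draining http.request events."""
    chunks = []
    while True:
        event = await receive()
        if event["type"] != "http.request":
            break  # e.g. http.disconnect: the client went away
        chunks.append(event.get("body", b""))
        if not event.get("more_body", False):
            break  # that was the last chunk
    return b"".join(chunks)

async def echo_app(scope, receive, send):
    """Hypothetical app: echo the request body back to the client."""
    assert scope["type"] == "http"
    body = await read_body(receive)
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"application/octet-stream")]})
    await send({"type": "http.response.body", "body": body})

def call_app(app, chunks):
    """Drive an ASGI app by hand with a fake scope, receive, and send."""
    events = [{"type": "http.request", "body": c, "more_body": i < len(chunks) - 1}
              for i, c in enumerate(chunks)]
    sent = []

    async def receive():
        return events.pop(0)

    async def send(event):
        sent.append(event)

    asyncio.run(app({"type": "http", "method": "POST", "path": "/"}, receive, send))
    return sent

sent = call_app(echo_app, [b"hel", b"lo"])
```

Two chunks go in, one joined body comes out in the http.response.body event. A real server would also deliver http.disconnect, which is why the loop checks the event type rather than assuming it.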

Starlette is a few thousand lines on top of that contract: routing, body parsing, middleware, the request and response objects you actually program against. FastAPI is a few thousand more on top of Starlette: dependency injection, Pydantic validation, OpenAPI generation. In the end, both of them hand uvicorn exactly the three-argument callable it invokes. The protocol is dict-passing all the way down. There are no base classes to inherit, no metaclasses. A framework is just a different opinion about what to do with scope, receive, and send.
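Because the contract is only a callable, a middleware is a callable that wraps another one. The sketch below appends a response header by intercepting send. AddHeader, hello, and run_once are hypothetical names written for this example; this is not how Starlette's own middleware classes are implemented, only the same idea at its smallest.

```python
import asyncio

class AddHeader:
    """Hypothetical ASGI middleware: wraps an app and appends one response header."""

    def __init__(self, app, name: bytes, value: bytes):
        self.app = app
        self.header = (name, value)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)  # pass other scope types through
            return

        async def send_with_header(event):
            if event["type"] == "http.response.start":
                headers = list(event.get("headers", [])) + [self.header]
                event = {**event, "headers": headers}
            await send(event)

        await self.app(scope, receive, send_with_header)

async def hello(scope, receive, send):
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": b"hello\n"})

def run_once(app):
    """Call the app once with a fake scope and collect what it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(event):
        sent.append(event)

    asyncio.run(app({"type": "http", "method": "GET", "path": "/"}, receive, send))
    return sent

sent = run_once(AddHeader(hello, b"x-served-by", b"sketch"))
```

The wrapped app never learns it was wrapped, and the wrapper never inherits from anything. Stacking several of these is how a middleware chain is built.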

From Go

The analogue is http.Handler: a single interface with one method, ServeHTTP(w, r). Everything in net/http and every Go web framework reduces to that signature. ASGI is the same idea, async, and dict-shaped rather than struct-shaped. Frameworks compete on ergonomics above a tiny contract.

§2 What uvicorn does between the socket and your code

Above the loop and below ASGI is uvicorn, and its job is to turn TCP bytes into the events the protocol describes. The mechanism is unromantic, which is the point.

At startup, uvicorn binds a listening socket and hands it to asyncio through loop.create_server(...); on the selector event loop that ends up as an add_reader registration on the listening fd. Each readability event on that fd is a pending accept(), and each accepted connection's fd is added to the selector in turn. As bytes arrive, h11 (or httptools, the C parser, when it is enabled) parses them incrementally into HTTP messages. When request headers are complete, uvicorn assembles a scope dict, builds receive and send callables bound to this connection, and does, in effect, this:

loop.create_task(app(scope, receive, send))

That Task lives until your handler returns. While it is parked on a database query, the loop services every other connection's Task. One thread, ten thousand fds, ten thousand small Task objects — the multiplexer from Part 2 §1 underneath, doing its job.
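A rough way to see the multiplexing without a server, assuming asyncio.sleep as a stand-in for the database wait. The handler and serve_many names are illustrative only.

```python
import asyncio
import time

async def handler(i):
    await asyncio.sleep(0.1)   # stands in for a database query; the Task is parked here
    return i

async def serve_many(n):
    start = time.perf_counter()
    tasks = [asyncio.create_task(handler(i)) for i in range(n)]
    results = await asyncio.gather(*tasks)
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(serve_many(1000))
print(f"{len(results)} handlers finished in {elapsed:.2f} s on one thread")
```

Run one after another, a thousand 0.1-second waits would take a hundred seconds. Parked on the loop they overlap, so the total should land near a single wait.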

The useful mental model is one-to-one: one TCP connection is one fd, one parser, one Task. Memory cost scales with active connections, not with maximum theoretical concurrency. Ten thousand idle keep-alive sockets is on the order of tens of megabytes total — most of which is kernel-side socket buffers, not Python objects. The reason async exists, in one line.

One TCP connection is one fd is one Task. Above that, every layer is an opinion about what to do with three messages.

§3 Sync handlers, and the thread pool you didn't ask for

FastAPI lets you write handlers as either async def or def. The convenience hides a switch. Starlette inspects the function when you register the route. If it is async def, the framework awaits it directly — the handler runs on the loop. If it is def, the framework wraps the call in anyio.to_thread.run_sync — the handler runs in a thread pool, and your async code awaits the Future the threadpool produces.

That is the FastAPI sync-route mechanism. It is not magic. It is a to_thread.run_sync.

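# Starlette/_utils.py, paraphrased
import asyncio
import anyio

async def dispatch(handler, request):
    if asyncio.iscoroutinefunction(handler):
        # async def: awaited directly, runs on the event loop
        return await handler(request)
    # def: runs in a worker thread drawn from anyio's bounded pool
    return await anyio.to_thread.run_sync(handler, request)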

One branch in route dispatch. The whole sync-handler story.

The pool is bounded. In current anyio the default cap is forty threads: anyio.to_thread.current_default_thread_limiter().total_tokens is raisable, but forty is what you get out of the box. Forty concurrent sync handlers fill it, and the forty-first waits on the limiter. While it waits, the request makes no progress at all. And the forty threads that are running pay the GIL tax from Part 1: only one thread executes Python bytecode at a time, and they take turns at the interpreter's switch interval, five milliseconds by default.
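The effect of the cap is easy to reproduce with the standard library. The sketch below is a stand-in, not anyio: it uses a ThreadPoolExecutor with four workers in place of the forty-token limiter, and five blocking handlers in place of forty-one. sync_handler and run_burst are hypothetical names.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def sync_handler():
    time.sleep(0.2)                 # a blocking call, e.g. a synchronous database query
    return time.perf_counter()      # when this handler finished

async def run_burst(cap, n):
    """Submit n blocking handlers to a pool of `cap` threads; return sorted latencies."""
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=cap) as pool:
        finished = await asyncio.gather(
            *(loop.run_in_executor(pool, sync_handler) for _ in range(n))
        )
    return sorted(t - start for t in finished)

latencies = asyncio.run(run_burst(cap=4, n=5))
print([round(x, 2) for x in latencies])
```

Four handlers finish after roughly one sleep; the fifth needs roughly two, because it spent the first one waiting for a free thread. In a real FastAPI process the knob is the limiter named above: assigning a larger value to its total_tokens raises the cap, at the price of more threads taking turns on the GIL.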

The right choice is not aesthetic; it is dictated by your libraries. asyncpg, httpx, aioredis (now redis.asyncio in redis-py) are async-native: write async def, stay on the loop. psycopg2, requests, hashlib, pandas are sync-only: write def and accept that you have forty slots. The bad outcomes are the mixes: a def handler that makes no blocking call occupies a thread for no reason, and an async def handler that makes a blocking call freezes every other request on the loop, which is the cooperative trap from Part 2 §6 wearing a different hat.

handler | where it runs | concurrency cap | fails when
async def, async-only I/O | event loop | ~10k (memory) | any sync blocking call inside
def, sync libraries | anyio thread pool | 40 (default) | more than 40 in flight
async def, sync blocking call | event loop, frozen | 1, effectively | immediately, but quietly

Trap

The worst row is the last. An async def handler that calls requests.get(...) or time.sleep(...) looks correct, passes review, runs in tests. Under load it stops the entire server every time it is called. You discover this in production because p99 latency jumps and p50 doesn't move — Part 2 §6 again. The defence is mechanical: in async def handlers, every blocking-shaped call is either async-native or wrapped in run_in_executor.
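A small sketch of the trap and its mechanical fix, using time.sleep as the blocking call. The handler names and the max_tick_gap helper are made up for this example; the helper runs a 10 ms ticker next to the handler and reports the longest gap between ticks, which is a crude view of what every other request experiences.

```python
import asyncio
import time

async def bad_handler():
    time.sleep(0.2)                                      # blocks the loop thread itself

async def good_handler():
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, time.sleep, 0.2)    # blocks a worker thread instead

async def max_tick_gap(handler):
    """Run handler alongside a 10 ms ticker; return the longest gap between ticks."""
    gaps = []

    async def ticker():
        last = time.perf_counter()
        while True:
            await asyncio.sleep(0.01)
            now = time.perf_counter()
            gaps.append(now - last)
            last = now

    tick_task = asyncio.create_task(ticker())
    await asyncio.sleep(0.05)        # let the ticker settle
    await handler()
    await asyncio.sleep(0.05)        # give the ticker a chance to record the damage
    tick_task.cancel()
    return max(gaps)

bad_gap = asyncio.run(max_tick_gap(bad_handler))
good_gap = asyncio.run(max_tick_gap(good_handler))
print(f"blocking on the loop: {bad_gap:.3f} s gap; in the executor: {good_gap:.3f} s gap")
```

Both handlers take the same 0.2 seconds from the caller's point of view. Only the first one makes everybody else wait for it.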

§4 The connection pool is a queue

Production async handlers usually talk to a database, and the database does not have ten thousand free connections. The pool — asyncpg's is the canonical Python one — is the accommodation: a fixed number of physical connections, lent out to whichever Task needs one next. The internals are smaller than the surface suggests.

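import asyncio
from collections import deque

class Pool:
    def __init__(self, conns):
        self._available = deque(conns)   # idle connections
        self._waiters = deque()          # Futures of parked Tasks, FIFO

    async def acquire(self):
        if self._available:
            return self._available.popleft()
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        return await fut                 # park until release() hands over a connection

    def release(self, conn):
        while self._waiters:
            fut = self._waiters.popleft()
            if not fut.done():           # skip waiters cancelled while parked
                fut.set_result(conn)     # wake the longest waiter
                return
        self._available.append(conn)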

asyncpg/pool.py, paraphrased. The waiters list is FIFO. The whole queue is around fifty lines including the rest of the surface.

Acquire returns immediately if a connection is free; otherwise it parks the Task on a Future and queues it. Release either hands the connection to the longest-waiting Task or returns it to the pool. Producer/consumer over Futures, with fairness. The mechanism is benign; the dynamics under load are what bite.

Picture a burst: a hundred requests arrive in the same hundred milliseconds, each needing roughly fifty milliseconds of database time, against a pool of max_size=10. The first ten run immediately. The next ninety park on the waiters list. By the time request one hundred reaches Postgres, it has spent around four hundred and fifty milliseconds in the queue alone. Median latency across the burst rises roughly fivefold, but the loop is unstressed (its stall is negligible) and the database is unstressed because every connection it has is busy, not slow. The bottleneck is inside a Python object that nothing on a typical dashboard reports.
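The queue arithmetic can be checked in a few lines. This is a simplified model, not a measurement: it assumes all requests arrive at the same instant and every query takes exactly the same time, so requests are served in waves of pool_size. The function name is illustrative.

```python
def queue_waits(n_requests, pool_size, db_ms):
    """Queue time in ms per request, in arrival order, under the wave model above."""
    return [(i // pool_size) * db_ms for i in range(n_requests)]

waits = queue_waits(n_requests=100, pool_size=10, db_ms=50)
last_wait = waits[-1]                      # queue time of request one hundred
queued = sum(1 for w in waits if w > 0)    # requests that had to park at all

# The same burst against a pool twice the size.
last_wait_bigger_pool = queue_waits(100, 20, 50)[-1]

print(last_wait, queued, last_wait_bigger_pool)
```

Under these assumptions the last request waits 450 ms, ninety requests park, and doubling the pool cuts the worst wait to 200 ms. Real arrivals are spread out and real queries vary, so treat the numbers as the shape of the problem rather than a prediction.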

Diagram · the pool under burst load

Twelve requests, a pool of ten. The last two are parked in _waiters until a slot opens. Their latency is queue time, then DB time — and queue time is invisible to most dashboards.

[Figure: timeline from t=0 to 100 ms for req 1 through req 12. Legend: holding a connection (DB time); parked in _waiters (queue time); finally running.]

The first instinct is to raise max_size. Sometimes this is right. Often it is not. Postgres charges per connection — each backend is a forked process holding around ten megabytes of resident memory plus its shared-buffer share. A hundred application replicas each carrying twenty connections is two thousand Postgres backends, and at that scale Postgres itself starts to suffer. The classical fix is PgBouncer in transaction mode, which sits between the application and the database and multiplexes many client connections onto few server connections; its own dynamics deserve a separate post. The practical lesson here is that pool sizing is a coupling between two queues — yours and the database's — and the right answer is rarely the largest one.

§5 The blocking that hides in libraries

Three classes of accident catch otherwise careful code. They are worth naming together because they share a detector and a fix.

The first is DNS resolution. socket.getaddrinfo is a libc call that does not yield to the event loop: CPython releases the GIL around it, but the calling thread stays blocked until the resolver answers. asyncio is aware of this and routes its own name lookups through loop.getaddrinfo, which punts the call to the default thread-pool executor. asyncpg and httpx both go through that path. The trap is requests, which uses urllib3, which calls getaddrinfo directly on whatever thread invoked it. A single requests.get inside an async def handler stalls the loop for the duration of the DNS lookup: usually low single-digit milliseconds, occasionally seconds when your resolver is unhappy.
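For code that needs a lookup of its own, the non-blocking path is one call. A minimal sketch, with resolve as a hypothetical helper name; it uses a numeric address so that it runs without touching a real resolver.

```python
import asyncio
import socket

async def resolve(host, port):
    """Resolve without blocking the loop: the lookup runs in the default executor."""
    loop = asyncio.get_running_loop()
    infos = await loop.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return [info[4][0] for info in infos]

addresses = asyncio.run(resolve("127.0.0.1", 80))
print(addresses)
```

While the worker thread waits on the resolver, the loop keeps running other Tasks. The lookup itself is no faster; it has only stopped being everybody's problem.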

The second is parsing or serialising large payloads. json.loads on a fifty-megabyte body is around two hundred milliseconds of CPU on the loop thread, and it does not yield in the middle. orjson is faster but still atomic. If your handler ingests big bodies, the right shape is to read the body into bytes asynchronously and then punt the parse to a thread.
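Punting the parse to a thread is a one-liner with asyncio.to_thread. The sketch assumes the body has already been read into bytes; parse_json_off_loop is an invented name and the payload is a small stand-in for a large one.

```python
import asyncio
import json

async def parse_json_off_loop(body: bytes):
    """Parse in a worker thread so the loop thread is free to run other Tasks."""
    return await asyncio.to_thread(json.loads, body)

payload = json.dumps({"items": list(range(1000))}).encode()
parsed = asyncio.run(parse_json_off_loop(payload))
print(len(parsed["items"]))
```

The parse still holds the GIL while it runs, so this is not free parallelism: the loop gets its turns at the interpreter's switch interval instead of being locked out for the whole parse. For bodies large enough that even that hurts, a process pool is the heavier alternative.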

The third is regular-expression backtracking. A user-supplied input matched against an unwise pattern can pin a CPU for seconds, and the re module does not yield mid-match. The defence is to design patterns deliberately, prefer anchored variants where possible, and reach for the re2 package when your inputs come from outside.

These three look unrelated. Underneath, they are the same shape: an operation whose running time is proportional to its input, executing on the loop thread, with no await inside. Every such operation can become a loop stall. The detector from Part 2 §6 — measuring how late a periodic timer callback actually fires — catches all of them. The fix is the same as it was there: run_in_executor, or the async-native version of the library.
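A detector of that kind fits in a small class. What follows is a sketch of the timer-lateness idea, not the code from Part 2: LoopStallMonitor and its attributes are hypothetical names, and a production version would export the lag to a metrics system instead of keeping a single maximum.

```python
import asyncio
import time

class LoopStallMonitor:
    """Hypothetical helper: records how late a periodic timer callback fires."""

    def __init__(self, interval=0.05):
        self.interval = interval
        self.max_lag = 0.0
        self._expected = 0.0
        self._handle = None

    def start(self):
        loop = asyncio.get_running_loop()
        self._expected = loop.time() + self.interval
        self._handle = loop.call_later(self.interval, self._tick, loop)

    def _tick(self, loop):
        lag = loop.time() - self._expected        # how late did the timer fire?
        self.max_lag = max(self.max_lag, lag)
        self._expected = loop.time() + self.interval
        self._handle = loop.call_later(self.interval, self._tick, loop)

    def stop(self):
        if self._handle is not None:
            self._handle.cancel()

async def demo():
    monitor = LoopStallMonitor()
    monitor.start()
    await asyncio.sleep(0.12)
    time.sleep(0.3)               # a deliberate stall on the loop thread
    await asyncio.sleep(0.12)
    monitor.stop()
    return monitor.max_lag

max_lag = asyncio.run(demo())
print(f"worst timer lateness: {max_lag:.3f} s")
```

On a healthy loop the lag stays near zero. The deliberate 0.3-second stall shows up as a lag of a few hundred milliseconds, whichever of the three accidents caused it.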

§6 Shutdown, and the request that doesn't know it's over

Eventually the orchestrator decides your pod is done. SIGTERM arrives. Kubernetes gives the process a grace window — thirty seconds by default — and then SIGKILL. A well-behaved server uses the window to drain in-flight work, and the mechanism in uvicorn is a five-step sequence:

  1. A SIGTERM handler sets should_exit = True.
  2. The main loop sees the flag and closes the listening socket. No new connections.
  3. uvicorn waits for in-flight request Tasks to finish on their own.
  4. If they do not finish within --timeout-graceful-shutdown seconds (unset by default, in which case uvicorn waits indefinitely), uvicorn calls task.cancel() on each. Cancellation, as in Part 2 §7, is cooperative.
  5. uvicorn emits the ASGI lifespan.shutdown event. Starlette runs your registered shutdown handlers: closing the asyncpg pool, flushing logs, draining Kafka producers.

The happy case is clean. The failure modes are not exotic.

A handler in the middle of a slow database query gets cancelled at its next await, which is the query itself. CancelledError propagates up. If the handler wraps that await in a bare except or except BaseException, the blunt shape of "log and return 500", it swallows the cancellation. Since Python 3.8 CancelledError inherits from BaseException, so a plain except Exception lets it through, but code that catches CancelledError to clean up and forgets to re-raise has the same effect as the broad clause. The cancellation is lost. The Task keeps running. The shutdown timeline blows past its budget.

The connection pool is the other hazard. uvicorn itself runs lifespan shutdown only after the drain, but a pool closed anywhere earlier (in your own signal handler, or in a background task that reacts to SIGTERM) disappears from under handlers that are still running. A handler still using a connection then hits a "pool is closing" error rather than a clean cancellation, and the shape of that error is library-specific and rarely covered by tests. Worse, the application's grace period and the orchestrator's grace period are configured independently. Kubernetes ships with terminationGracePeriodSeconds: 30; many helm charts override it to five. uvicorn has no timeout at all unless you set one. The smaller number wins silently, and the larger one never gets a chance.
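The wait-then-cancel part of the sequence can be sketched with plain asyncio. This is not uvicorn's code; drain and its timeout parameter are illustrative names, and the two sleeping Tasks stand in for a fast and a slow request.

```python
import asyncio

async def drain(tasks, timeout):
    """Wait up to `timeout` seconds for tasks to finish, then cancel the stragglers."""
    if not tasks:
        return set(), set()
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()                  # cooperative: CancelledError is raised at the next await
    if pending:
        # Wait for the cancellations to land. A task that swallows the cancel
        # keeps this line waiting for as long as it keeps running.
        await asyncio.wait(pending)
    return done, pending

async def demo():
    fast = asyncio.create_task(asyncio.sleep(0.05, result="fast"))
    slow = asyncio.create_task(asyncio.sleep(10, result="slow"))
    done, cancelled = await drain({fast, slow}, timeout=0.2)
    return fast.result(), slow.cancelled(), len(done), len(cancelled)

outcome = asyncio.run(demo())
print(outcome)
```

The fast request finishes inside the window and keeps its result; the slow one is cancelled when the window closes. Anything that owns shared resources, the pool included, belongs after this function returns.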

Diagram · the shutdown timeline

SIGTERM at t=0. The listening socket closes immediately and in-flight requests drain, most cleanly, one not. The grey region is where SIGKILL would land if grace expired.

[Figure: a 30-second grace period from t=0 to 30 s, marked SIGTERM, socket close, lifespan.shutdown, and SIGKILL (if needed). Requests A, B, and C finish (done); request D receives cancel(). Note on the figure: a handler that swallows the cancel can overrun.]

Practical consequences: set uvicorn's --timeout-graceful-shutdown explicitly and align it with terminationGracePeriodSeconds, keeping the former slightly smaller so the application, not the orchestrator, controls the shutdown. Audit handlers for broad except clauses around await points, and make sure anything that catches asyncio.CancelledError re-raises it. Close pools after handler drainage rather than before, in a finally-style shutdown handler that runs last.
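The audit is easier with a picture of what a swallowed cancel looks like from the outside. In this sketch both handlers are hypothetical, asyncio.sleep stands in for the slow query, and cancel_and_report plays the server: it cancels the Task and reports whether the cancellation actually took.

```python
import asyncio

async def swallowing_handler():
    try:
        await asyncio.sleep(10)            # the slow query
    except BaseException:                  # broad clause: CancelledError lands here too
        return "500"                       # "log and return 500", cancellation lost

async def reraising_handler():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        raise                              # cleanup could go here, then re-raise
    except BaseException:
        return "500"

async def cancel_and_report(handler):
    """Cancel a running handler; return True if the Task ended up cancelled."""
    task = asyncio.create_task(handler())
    await asyncio.sleep(0.01)              # let it reach its await
    task.cancel()
    await asyncio.wait({task})
    return task.cancelled()

swallowing_was_cancelled = asyncio.run(cancel_and_report(swallowing_handler))
reraising_was_cancelled = asyncio.run(cancel_and_report(reraising_handler))
print(swallowing_was_cancelled, reraising_was_cancelled)
```

The first handler finishes "successfully" with a 500 and the server never learns that its cancel was ignored. Here that costs nothing because the handler returns straight away; a handler that loops or retries inside the same try block is the one that overruns the grace period.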

§7 What to watch

The metrics that fail this stack are specific, and most aggregate dashboards do not show them. Four numbers, between them, surface everything in this post:

If the only thing on the wall is p50 latency, none of the failure modes in this post are visible until they are already serious. Two of the four — loop stall and pool wait — are five lines of code to instrument and the most valuable five lines you can add.
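Pool wait is cheap to measure because it is just the time spent inside acquire. The sketch below uses a hypothetical TimedPool built on asyncio.Semaphore as a stand-in for a real connection pool; with a real pool the same two timestamps go around its own acquire call, and the list would be a histogram in your metrics client.

```python
import asyncio
import time
from contextlib import asynccontextmanager

class TimedPool:
    """Hypothetical stand-in for a connection pool that records queue time."""

    def __init__(self, size):
        self._slots = asyncio.Semaphore(size)
        self.wait_times = []               # seconds each caller spent queued

    @asynccontextmanager
    async def acquire(self):
        start = time.perf_counter()
        await self._slots.acquire()        # parks here when the pool is exhausted
        self.wait_times.append(time.perf_counter() - start)
        try:
            yield
        finally:
            self._slots.release()

async def query(pool):
    async with pool.acquire():
        await asyncio.sleep(0.05)          # stands in for database time

async def burst(n, size):
    pool = TimedPool(size)                 # created inside the running loop
    await asyncio.gather(*(query(pool) for _ in range(n)))
    return pool.wait_times

wait_times = asyncio.run(burst(n=6, size=2))
print([round(w, 2) for w in wait_times])
```

Six requests against two slots produce three waves: two callers wait nothing, two wait one query's worth, two wait two. That staircase is the number a p50 latency graph cannot show you.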

§8 Things to take with you

End of series · further reading
Where the rabbit hole goes

Three posts of mechanism, and the stack is finally small enough to hold in one head. The destinations from here, in rough order of value:

  • Starlette's source — starlette/middleware/base.py and starlette/routing.py. A few hundred lines each. Trace a request through and the ASGI contract becomes concrete.
  • asyncpg's pool.py. The waiters queue is fifty lines including comments.
  • The ASGI specification at asgi.readthedocs.io. Thirty minutes, no jargon.
  • Nathaniel J. Smith's Notes on Structured Concurrency (2018) — the philosophical underpinning of TaskGroup, and worth the read regardless of language.
  • PEP 703 and the free-threaded Python rollout. When the GIL goes, the calculus for sync handlers changes: there is less reason to prefer async-everything if CPU work can finally parallelise.

None of this stack is exotic in isolation. Each piece fits in a head. The complexity emerges at the joins — which is, in the end, the only place worth looking when something is wrong.