Python Concurrency · Part 2 of 3

The Loop is the Trick

asyncio is a hand-rolled scheduler over one syscall and one language feature. The rest is mechanism — and the mechanism is roughly fifty lines.

~16 min read Python 3.11+ Assumes: Part 1, epoll, generators

In Part 1 we established that the GIL releases on I/O syscalls, which means CPython threads can already do I/O concurrency. So why does asyncio exist? Take ten thousand idle HTTP connections. With threads, that is ten thousand kernel-scheduled stacks; the 8 MB-per-thread number people quote is a pthread default, lazily allocated, and configurable via threading.stack_size() — so the tens of gigabytes of virtual address space is a number a 64-bit kernel shrugs off. The binding cost is the scheduler. Ten thousand mostly-idle threads means ten thousand wakeups competing for one GIL, paid in context switches and cache-line traffic on a mutex that only one of them can hold. With asyncio, the same ten thousand connections are ten thousand small Python objects sitting on a single OS thread, and the kernel is asked exactly one question: "of these file descriptors, which are ready?"

That is the trade. asyncio swaps preemption for memory, and kernel scheduling for a tighter inner loop. Everything else in this post is mechanism — the specific shape of the scheduler Python ships in Lib/asyncio/. Once you can see the mechanism, the famous gotchas stop being mystical. They become predictable consequences of the design.

So we'll take a tour: down to the syscall, up through the coroutine, into the event loop, and out the other side with a clear map of what blocks, what doesn't, and why. By the end you should be able to read Lib/asyncio/base_events.py and recognise every moving part.

§1 One syscall holds it up

The entire edifice rests on a single Linux primitive: epoll_wait. (On macOS and BSD it's kqueue; on Windows, I/O completion ports. Python's selectors module picks the best one for the platform.) The shape of the syscall is what matters. You register a set of file descriptors with the kernel, and then you ask: of all these, which are readable, writable, or errored right now? The kernel answers in time proportional to the number of ready fds, not the number registered. It maintains a ready list internally as events arrive; epoll_wait just drains it.

This is why one Python thread can shepherd ten thousand sockets. The work the thread does is proportional to the events that actually happened, not the connections that exist. An idle asyncio server, under strace, looks exactly like this:

# strace -p <pid> on an idle asyncio HTTP server
epoll_wait(3,                                # parked: timeout is -1, so the call has not returned
# ... later, a connection arrives and the same call completes ...
epoll_wait(3, [{EPOLLIN, fd=7}], 64, -1) = 1

Simplified strace output. The point: a whole asyncio server, idling, is one process parked in one syscall.
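The same "register many, get back only the ready ones" shape can be seen from Python. This is a minimal sketch using the standard selectors module; the helper name ready_indices, the pair count, and the chosen indices are illustrative assumptions, and local socket pairs stand in for real client connections.

```python
import selectors
import socket


def ready_indices(n_pairs=50, active=(3, 17), timeout=1.0):
    """Register n_pairs idle sockets, make a few readable, ask which are ready."""
    sel = selectors.DefaultSelector()  # picks epoll, kqueue, etc. for the platform
    pairs = [socket.socketpair() for _ in range(n_pairs)]
    try:
        for index, (reader, _writer) in enumerate(pairs):
            reader.setblocking(False)
            # data= is handed back with each event; here it is just the index
            sel.register(reader, selectors.EVENT_READ, data=index)
        for index in active:
            pairs[index][1].send(b"x")  # makes exactly these readers readable
        events = sel.select(timeout)    # the one question asked of the kernel
        return sorted(key.data for key, _mask in events)
    finally:
        sel.close()
        for reader, writer in pairs:
            reader.close()
            writer.close()


print(ready_indices())
```

With fifty sockets registered and two written to, the call should report only those two indices; the other forty-eight never appear in the result.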

Diagram A · the one syscall

The kernel watches N file descriptors and returns only the ones with pending events. The scheduler's per-iteration work is O(ready), not O(N).

[Diagram: N registered fds (fd 7 socket, fd 8 socket, fd 9 socket, fd 10 pipe, fd 11 socket, ... 9,995 more) go into the kernel's epoll_wait, which costs O(ready) and returns a ready list of just the 2 that fired: fd 7 EPOLLIN and fd 11 EPOLLIN.]

If you have written Go, you have already paid for this primitive — you just never saw it. Go's runtime netpoller is epoll/kqueue under the hood, hidden behind goroutines that look like they do blocking I/O. asyncio is the same machinery without the curtain. The loop is in your code; you can read it.

§2 A coroutine is a resumable function

The other primitive asyncio needs is a function that can be paused mid-execution and resumed later, with its local variables intact. CPython has had this since 2001, in the form of generators. async def is a thin syntactic skin over the same protocol.

Watch carefully:

>>> async def f():
...     return 42
...
>>> coro = f()
>>> coro
<coroutine object f at 0x10...>
>>> coro.send(None)
Traceback (most recent call last):
  ...
StopIteration: 42

Calling an async function does not execute it. It returns a coroutine object — a paused frame waiting to be driven.

Two things to absorb. First, f() does not run the body. It allocates a coroutine object — a heap-allocated frame containing the function's bytecode pointer, locals, and value stack. Second, the way you "run" it is to call .send(), the same method generators have had for two decades. When the coroutine returns, it does so by raising StopIteration with the value attached. This is not an error; it is the generator protocol. CPython has used exceptions for control flow here since before it was fashionable.

Now look at what await compiles to:

>>> import dis
>>> async def f():
...     await g()
...
>>> dis.dis(f)
   RESUME       0
   LOAD_GLOBAL  g
   CALL         0
   GET_AWAITABLE  0
   LOAD_CONST   None
   SEND         3      # <-- the suspension point
   YIELD_VALUE  2
   RESUME       3
   ...

SEND resumes the awaitable's iterator; if that iterator yields, YIELD_VALUE passes the value up to whoever called .send() on us. The frame is parked. Locals stay on the heap.

A coroutine, then, is two things: a chunk of bytecode, and a heap-allocated frame that knows where in that bytecode it is. The frame is a few hundred bytes. Ten thousand parked coroutines is a few megabytes. Ten thousand parked OS threads is gigabytes of address space the kernel must track. That gap is the whole game.
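The parked frame can be inspected directly. The following is a small sketch, not part of asyncio: pause, counter, and drive are hypothetical names, and types.coroutine is used only to get a bare yield point without involving an event loop.

```python
import inspect
import types


@types.coroutine
def pause():
    # A bare yield inside a generator-based coroutine hands control
    # back to whoever called .send() on the outermost coroutine.
    received = yield "paused"
    return received


async def counter():
    total = 10
    total += await pause()
    return total


def drive():
    coro = counter()
    states = [inspect.getcoroutinestate(coro)]      # before the first send
    yielded = coro.send(None)                       # run to the yield inside pause()
    states.append(inspect.getcoroutinestate(coro))  # parked at the await
    parked_locals = dict(coro.cr_frame.f_locals)    # locals survive on the heap
    try:
        coro.send(5)                                # 5 becomes the value of the await
    except StopIteration as stop:
        result = stop.value
    states.append(inspect.getcoroutinestate(coro))  # finished
    return states, yielded, parked_locals, result


print(drive())
```

Between the two send calls nothing is running: the coroutine is an object with a saved instruction pointer and a dictionary's worth of locals.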

From Go

A goroutine is a real (growable) stack the runtime can park — starts ~2 KB and grows. A coroutine is a state-machine struct on the heap with no stack at all. Same observable behaviour (park, resume), very different memory shape. Goroutines feel like cheap threads. Coroutines feel like fancy callbacks. Both descriptions are correct.

§3 What await actually does

Most tutorials say "await waits for the result." This is a lie of convenience. await x does not wait. It hands control to x, and whatever x yields travels up the call chain: all the way out, through every await frame between here and the function that originally called .send(). That function is the driver. In production, the driver is the event loop. But there is nothing magical about it; you can write one yourself in under ten lines.

class Awaitable:
    def __await__(self):
        yield self          # this value travels up to the driver
        return 42            # this becomes the result of `await`

async def child():
    x = await Awaitable()
    return x + 1

async def parent():
    return await child()

# --- the manual driver ---
coro = parent()
while True:
    try:
        yielded = coro.send(None)    # run until next yield
    except StopIteration as e:
        print(e.value)              # 43 — we're done
        break
    # `yielded` is the value Awaitable yielded up the stack.
    # The real loop inspects it: is it a Future? schedule a callback.
    # Here we just resume immediately.
    print("got:", yielded)         # <Awaitable object>

A complete asyncio program with no asyncio. The driver is eight lines.

Look at what happened. parent() awaits child() which awaits Awaitable(). When Awaitable.__await__ yields self, that value travels through both intermediate frames and lands back at our manual driver — the outermost .send(). The two intermediate coroutines are now parked. Their frames live on the heap. When we call .send(None) a second time, execution resumes inside Awaitable.__await__ at the line after yield, runs the return 42, which becomes the value of the await expression in child, which adds one and returns 43, which parent returns, which raises StopIteration(43).

The lie tutorials tell

"await waits for the result." It doesn't. await is a yield. The thing that does the waiting is the driver — whoever is at the top calling .send(). The coroutine itself is unconscious during the wait, sitting in memory.

So what does await asyncio.sleep(1) actually yield? It yields a Future. Before suspending, sleep itself asks the loop (via call_later) to resolve that Future one second from now. The driver (strictly, the Task that the loop runs on the coroutine's behalf; see §5) sees the Future, attaches a wake-up callback to it, and hands control back to the event loop, which moves on to whatever else needs running. When the timer fires, the Future is resolved, which schedules the callback that resumes the coroutine, which gets None as the value of the await expression and continues. There is no waiting happening inside the coroutine. It is dead to the world. The waiting is happening in the loop, and the loop is waiting on epoll_wait.
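That an awaited Future is what reaches the driver can be checked by hand, without running a loop at all. A minimal sketch, assuming only the documented Future API; the names waiter and drive_by_hand are illustrative, and the loop here is created purely as a factory for the Future and is never run.

```python
import asyncio


async def waiter(fut):
    value = await fut
    return f"got {value}"


def drive_by_hand():
    loop = asyncio.new_event_loop()      # created but never run
    try:
        fut = loop.create_future()
        coro = waiter(fut)
        yielded = coro.send(None)        # runs until `await fut` suspends
        same_object = yielded is fut     # the pending Future came up to us
        fut.set_result("hello")          # play the role of the timer or I/O event
        try:
            coro.send(None)              # resume: the await now has a value
        except StopIteration as stop:
            return same_object, stop.value
    finally:
        loop.close()


print(drive_by_hand())
```

Here the manual driver resolves the Future itself. In a running program that job belongs to a timer or an I/O callback, and the resume is performed by a Task.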

The event loop is just coro.send(None) in a while loop, with opinions about what to do with the values it gets back.

§4 The event loop, dissected

If you open Lib/asyncio/base_events.py and find the method called _run_once, you are looking at the entire scheduler. It is six steps, around fifty lines of Python. Stripped down, it looks like this:

def _run_once(self):
    # 1. Decide how long to block in the selector.
    #    - work pending? don't sleep at all.
    #    - timer pending? sleep until it fires.
    #    - neither? sleep forever (wake on I/O).
    if self._ready:
        timeout = 0
    elif self._scheduled:
        timeout = max(0, self._scheduled[0]._when - self.time())
    else:
        timeout = None

    # 2. The syscall. This is where the thread actually parks.
    event_list = self._selector.select(timeout)

    # 3. Translate I/O events into ready callbacks.
    for key, mask in event_list:
        self._ready.append(key.data)      # a Handle

    # 4. Promote expired timers to ready.
    now = self.time()
    while self._scheduled and self._scheduled[0]._when <= now:
        handle = heapq.heappop(self._scheduled)
        self._ready.append(handle)

    # 5. Snapshot ready (new ones go to next iteration).
    ntodo = len(self._ready)

    # 6. Run each callback. They may schedule more work.
    for _ in range(ntodo):
        handle = self._ready.popleft()
        handle._run()

cpython/Lib/asyncio/base_events.py — _run_once, paraphrased. This is the whole scheduler.

Two data structures: a deque _ready of callbacks to run right now, and a heap _scheduled of timers ordered by deadline. One organ: the _selector, which is epoll_wait in disguise. Six steps per turn of the loop. That is all asyncio is, at the bottom. Everything else — Tasks, Futures, gather, TaskGroup, transports, protocols — is built on top.
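The two data structures and the six steps fit in a toy that actually runs. This is a sketch under stated assumptions, not asyncio's implementation: ToyLoop and its turn counter are hypothetical, there is no selector, and time.sleep stands in for the blocking select call, so step 3 (I/O events) is omitted.

```python
import heapq
import time
from collections import deque


class ToyLoop:
    def __init__(self):
        self._ready = deque()   # callbacks to run now
        self._scheduled = []    # heap of (deadline, sequence, callback)
        self._seq = 0           # tie-breaker so callbacks are never compared
        self.turn = 0

    def call_soon(self, callback):
        self._ready.append(callback)

    def call_later(self, delay, callback):
        self._seq += 1
        heapq.heappush(self._scheduled, (time.monotonic() + delay, self._seq, callback))

    def run_once(self):
        self.turn += 1
        # 1. Decide how long to block.
        if self._ready:
            timeout = 0
        elif self._scheduled:
            timeout = max(0, self._scheduled[0][0] - time.monotonic())
        else:
            return  # a real loop would block on I/O here; the toy has none
        # 2. Stand-in for selector.select(timeout).
        time.sleep(timeout)
        # 4. Promote expired timers to ready.
        now = time.monotonic()
        while self._scheduled and self._scheduled[0][0] <= now:
            _when, _seq, callback = heapq.heappop(self._scheduled)
            self._ready.append(callback)
        # 5. Snapshot, then 6. run only what was ready at the snapshot.
        for _ in range(len(self._ready)):
            self._ready.popleft()()

    def run(self):
        while self._ready or self._scheduled:
            self.run_once()


def demo():
    loop, log = ToyLoop(), []

    def first():
        log.append((loop.turn, "first"))
        loop.call_soon(lambda: log.append((loop.turn, "chained")))

    loop.call_later(0.10, lambda: log.append((loop.turn, "timer 100ms")))
    loop.call_later(0.05, lambda: log.append((loop.turn, "timer 50ms")))
    loop.call_soon(first)
    loop.run()
    return log


print(demo())
```

The callback scheduled from inside first lands one turn later than first itself, which is the snapshot in step 5 at work, and the timers fire in deadline order regardless of the order they were registered.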

Diagram B · one turn of the loop

The six steps of _run_once. The selector is the only place the thread actually blocks. Everything else is in-process bookkeeping.

[Diagram: _ready (a FIFO deque holding cb₁, cb₂, cb₃) and _scheduled (a heap by deadline: t+0.01s, t+0.4s, t+1.0s) sit around selector.select(timeout), the epoll_wait call that is the only place the loop sleeps. Steps: 1 decide timeout; 2 block in selector; 3 I/O events (EPOLLIN, EPOLLOUT, ...) → _ready; 4 expired timers (deadline ≤ now) → _ready; 5 snapshot _ready (ntodo); 6 run each callback (handle._run() × ntodo). Each callback can schedule more; those queue for the next turn.]

Three details worth lingering on. First, if _ready has callbacks in it already, timeout is zero — the selector returns immediately without sleeping. The loop never blocks if there is work to do. Second, the snapshot in step 5 matters: callbacks that schedule new callbacks during this iteration do not get to run in the same turn. They wait for the next one. This bounds the work per turn and prevents one chatty callback from starving the selector.

Third (the inverse of the first): when both queues are empty, timeout is None and the selector sleeps forever. This is not a bug; it is the point. In a single-threaded loop, the only things that can produce new work are (1) the kernel telling us an fd is ready, and (2) one of our own timers firing. The kernel is watching the fds for us, so we are not missing I/O by being asleep. And there are no timers, by assumption. There is nothing else: no other thread can call loop.call_soon because there are no other threads. Sleeping forever is safe by construction. The timeout argument exists at all because of case (2): if a timer is due in 0.4 seconds, the selector needs to wake up then even if no I/O arrives. Take away timers and you would never pass a timeout to epoll_wait. [1]

[1] This is also why loop.call_soon_threadsafe exists and is implemented with a self-pipe the loop is registered to read from (in CPython's selector loop, one end of a socket.socketpair()). When another thread schedules a callback, it writes a byte to the other end; the kernel marks the fd readable; epoll_wait returns; the loop runs the new callback. Without that mechanism, the loop would sleep through cross-thread work, which is exactly the case the "nothing else can wake us" argument depends on excluding.
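A short sketch of the cross-thread wake-up, using only documented asyncio and threading calls; the function names and the two-second safety timeout are illustrative choices.

```python
import asyncio
import threading


async def wait_for_worker():
    loop = asyncio.get_running_loop()
    done = loop.create_future()

    def worker():
        # Runs on another thread. Touching the Future directly from here
        # would be unsafe; call_soon_threadsafe queues the call and wakes
        # the loop out of its selector.
        loop.call_soon_threadsafe(done.set_result, "from worker thread")

    thread = threading.Thread(target=worker)
    thread.start()
    # With nothing else scheduled, the loop is parked in the selector
    # until the worker's wake-up arrives.
    result = await asyncio.wait_for(done, timeout=2.0)
    thread.join()
    return result


print(asyncio.run(wait_for_worker()))
```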

§5 Future, Coroutine, Task: three things constantly confused

The single biggest tax of learning asyncio is keeping these three straight. They are not interchangeable, and the documentation often uses them as if they are. Here is the shape of each:

name | what it is | what it holds | who resolves it
Future | A placeholder for a value-not-yet-known. | state, result, exception, callback list | Anyone with the object can call set_result.
Coroutine | A paused function frame. | bytecode pointer, locals, value stack | You drive it with .send() until it raises StopIteration.
Task | A Future that drives a Coroutine. | everything a Future has, plus the coro it owns | It resolves itself when its coroutine returns.
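The distinctions in the table can be checked at runtime. A small sketch with illustrative names (compute, three_kinds); it relies only on documented asyncio calls.

```python
import asyncio


async def compute():
    await asyncio.sleep(0)
    return 42


async def three_kinds():
    loop = asyncio.get_running_loop()

    # Future: a bare placeholder. Somebody else has to resolve it.
    fut = loop.create_future()
    loop.call_soon(fut.set_result, "set by a callback")

    # Coroutine: a paused frame. Nothing runs until something drives it.
    coro = compute()
    facts = {
        "coro_is_coroutine": asyncio.iscoroutine(coro),
        "coro_is_future": isinstance(coro, asyncio.Future),
    }

    # Task: a Future that drives the coroutine and resolves itself.
    task = asyncio.create_task(coro)
    facts["task_is_future"] = isinstance(task, asyncio.Future)

    facts["future_value"] = await fut
    facts["task_value"] = await task
    return facts


print(asyncio.run(three_kinds()))
```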

The Task is the glue. It is what you actually get back from asyncio.create_task(coro), and it is what makes "fire and forget" work. Here is what it does, in a form short enough to read in one sitting:

class Task(Future):
    def __init__(self, coro, loop):
        super().__init__(loop=loop)      # Future takes loop as keyword-only
        self._coro = coro
        loop.call_soon(self._step)       # schedule first tick

    def _step(self, exc=None):
        try:
            if exc is None:
                result = self._coro.send(None)        # run until next await
            else:
                result = self._coro.throw(exc)
        except StopIteration as e:
            self.set_result(e.value)              # coro returned: I am done
        except BaseException as e:
            self.set_exception(e)
        else:
            # `result` is the Future the coro yielded. Resume me when it's done.
            result.add_done_callback(self._wakeup)

    def _wakeup(self, fut):
        try:
            value = fut.result()
        except BaseException as e:
            self._step(exc=e)
        else:
            self._step()                         # next .send(None)

A Task in twenty lines. The full version in asyncio handles cancellation, context, and exception groups — but the spine is this.

Walk it once. Task gets a coroutine. It schedules _step for the next loop turn. _step calls .send(None) on the coroutine. The coroutine runs until it hits an await on something not-yet-ready, which yields a Future. _step catches that Future and registers _wakeup as its done-callback. The Task is now dormant until the Future resolves. When the Future resolves (because the selector saw an I/O event, or a timer fired, or another task called set_result), _wakeup runs, which calls _step again, which calls .send(None) again. Repeat until the coroutine raises StopIteration, at which point the Task sets its own result. Done.

This is the entire scheduler-to-coroutine bridge. Every await in your program is a step through this state machine, one tick of _run_once at a time.

Sidebar — Queue

asyncio.Queue is the producer/consumer primitive built directly on Futures. Internally, get() checks if the queue is empty; if it is, it creates a Future, parks it on a waiters list, and awaits it, which suspends the caller. put() appends to the deque, then pops the oldest waiting Future and calls set_result on it, which resumes whichever coroutine was parked in get(). A Go channel without the type system.
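The mechanism in the sidebar is small enough to sketch. MiniQueue below is a hypothetical, unbounded stand-in written for illustration; it is not asyncio.Queue's source and it ignores cancellation of waiters beyond skipping finished ones.

```python
import asyncio
from collections import deque


class MiniQueue:
    def __init__(self):
        self._items = deque()
        self._getters = deque()   # Futures of coroutines parked in get()

    async def get(self):
        while not self._items:
            waiter = asyncio.get_running_loop().create_future()
            self._getters.append(waiter)
            await waiter          # parked until put_nowait resolves it
        return self._items.popleft()

    def put_nowait(self, item):
        self._items.append(item)
        while self._getters:
            waiter = self._getters.popleft()
            if not waiter.done():
                waiter.set_result(None)   # wake exactly one parked getter
                break


async def collect(queue, count):
    return [await queue.get() for _ in range(count)]


async def demo_queue():
    queue = MiniQueue()
    consumer = asyncio.create_task(collect(queue, 2))
    await asyncio.sleep(0)        # let the consumer park inside get()
    queue.put_nowait("a")
    queue.put_nowait("b")
    return await consumer


print(asyncio.run(demo_queue()))
```

The Future carries no payload here; it is only a doorbell. The item itself travels through the deque, and the woken getter re-checks the deque before returning.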

§6 The cooperative trap, with numbers

Here is the single most important fact about the loop: nothing preempts a running callback. Once a coroutine starts executing between two await points, it owns the loop until it yields again. If the work between yields takes ten milliseconds of CPU, every other task on that loop waits ten milliseconds. If it takes two hundred, every other task waits two hundred.

This is the cost of cooperative scheduling and it is the easiest mistake to make. Python code in a handler does not advertise that it blocks. hashlib.pbkdf2_hmac looks innocent. So does json.loads on a 50 MB payload. So does a regex on user input. None of them await. None of them yield. All of them freeze the loop.

A concrete demo. Two endpoints, served by one asyncio loop:

import hashlib

async def fast(request):
    return {"ok": True}

async def slow(request):
    # Roughly 150-200 ms of CPU, depending on the machine. No I/O. No await.
    hashlib.pbkdf2_hmac("sha256", b"p", b"s", 1_000_000)
    return {"ok": True}

The honest way to measure the trap isn't average request latency — most requests will be fine, since they don't overlap with the blocking call. The right metric is loop stall: how late a callback that wanted to fire on a 10 ms tick actually fires. A healthy loop has stalls on the order of timer-resolution noise. A loop in the trap has one enormous stall per blocking call. Measured on my laptop, with pbkdf2_hmac taking ~150 ms:

scenario | p50 stall | p99 stall | worst
loop idle | 1.1 ms | 1.2 ms | 1.2 ms
blocking /slow (as written) | 1.1 ms | 173.8 ms | 173.8 ms
/slow via run_in_executor | 1.0 ms | 6.0 ms | 6.0 ms

Look at the p50 column. It is essentially identical across the three scenarios — which is exactly how this trap kills you in production. Aggregate latency dashboards look fine. The single request that happened to land while the loop was frozen sees the entire blocking duration as its latency. The next request after that is fine again. p99 explodes; p50 doesn't move. If you are oncall and the only dashboard you watch is the average, you will not see this until someone notices the timeouts piling up.
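A loop-stall probe of the kind described above can be sketched in a few lines. This is an illustrative harness, not the one behind the table: the function names, the 10 ms interval, and the busy-wait that stands in for CPU-bound handler code are all assumptions.

```python
import asyncio
import time


async def measure_stalls(stalls, interval=0.01):
    """Record how late each tick fires relative to when it asked to fire."""
    loop = asyncio.get_running_loop()
    while True:
        expected = loop.time() + interval
        await asyncio.sleep(interval)
        stalls.append(loop.time() - expected)


async def blocking_handler(duration):
    # Stand-in for CPU-bound work: burns time without ever awaiting.
    end = time.perf_counter() + duration
    while time.perf_counter() < end:
        pass


async def run_demo(block_for=0.1):
    stalls = []
    monitor = asyncio.create_task(measure_stalls(stalls))
    await asyncio.sleep(0.05)            # a few healthy ticks
    await blocking_handler(block_for)    # the loop is frozen in here
    await asyncio.sleep(0.05)            # healthy again
    monitor.cancel()
    await asyncio.gather(monitor, return_exceptions=True)
    return stalls


stalls = asyncio.run(run_demo())
print(f"ticks={len(stalls)}  worst stall={max(stalls) * 1000:.1f} ms")
```

Most ticks should land within scheduling noise of their deadline, and one tick absorbs nearly the whole blocking duration, which is the same shape as the p50 and p99 columns.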

The fix is one call:

import asyncio
import hashlib

async def slow(request):
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(
        None,                              # default ThreadPoolExecutor
        hashlib.pbkdf2_hmac, "sha256", b"p", b"s", 1_000_000,
    )
    return {"ok": True}

run_in_executor punts the work to a thread pool and returns a Future the coroutine can await. The worker thread runs the function; when it returns, the Future resolves and the coroutine resumes. The loop keeps cycling the whole time. This works because pbkdf2_hmac is a C routine that releases the GIL for the duration of the hash: the worker thread is running native code with no interpreter to lock, so the loop thread is free to take the GIL and process other requests. If you replace pbkdf2 with the equivalent in pure Python, this fix stops working. The next two paragraphs are about that case, because it is the one every server author runs into and the one the diagram below quietly assumes away.

Diagram C · the trap, and the way out

Above: the loop is frozen while /slow burns CPU. Below: run_in_executor punts to a worker thread; the loop keeps serving /fast the whole time.

[Diagram, top lane (blocking call): the loop serves /fast, then freezes for ~200 ms while /slow runs pbkdf2; 7 /fast requests wait in the queue and none can run. Bottom lane (run_in_executor): the loop punts pbkdf2 (~200 ms) to a thread-pool worker, keeps serving /fast continuously, and resumes /slow when the worker finishes.]
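One way to convince yourself the loop stays live during the executor call is to count ticks while the hash runs. A rough sketch: count_ticks, hash_off_loop, the 5 ms tick, and the iteration count are illustrative assumptions, and the exact tick count will vary by machine.

```python
import asyncio
import hashlib


async def count_ticks(counter, interval=0.005):
    # Each increment is proof that the loop got to run a callback.
    while True:
        await asyncio.sleep(interval)
        counter["ticks"] += 1


async def hash_off_loop(iterations=300_000):
    loop = asyncio.get_running_loop()
    counter = {"ticks": 0}
    ticker = asyncio.create_task(count_ticks(counter))
    digest = await loop.run_in_executor(
        None,  # default ThreadPoolExecutor
        hashlib.pbkdf2_hmac, "sha256", b"p", b"s", iterations,
    )
    ticker.cancel()
    await asyncio.gather(ticker, return_exceptions=True)
    return counter["ticks"], len(digest)


ticks, digest_len = asyncio.run(hash_off_loop())
print(f"loop ticked {ticks} times while the worker hashed")
```

Calling pbkdf2_hmac directly inside the coroutine instead would leave the counter at zero until the hash finished.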

Split the CPU work into two camps. C code that releases the GIL: hash functions, compression, numpy array ops, image codecs, most database drivers' parsing. These call Py_BEGIN_ALLOW_THREADS before the long compute and reacquire on the way out. For these, run_in_executor with the default thread pool is the right answer. The worker thread holds no GIL while computing; the loop thread is unimpeded.

Pure-Python CPU work: a parser written in Python, a business-logic loop, a hand-rolled compression routine, anything you would profile with cProfile and see Python frames at the top. The interpreter holds the GIL the whole time it is executing bytecode. Putting it on a worker thread does not give the loop air: the worker thread takes the GIL, runs for about 5 ms (the default switch interval), the interpreter forces a handoff, the loop thread gets a brief slice, hands back, and so on. The loop is not frozen, but it is sharing one CPU's worth of execution with the worker, contending on a mutex on every switch. Latency for the served requests goes up; throughput collapses. You have moved the problem, not solved it.

The fix for pure-Python CPU work is to leave the process entirely:

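import asyncio
from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor(max_workers=4)  # at startup, long-lived

async def parse(request):
    loop = asyncio.get_running_loop()
    payload = request.body  # assumption: however your framework exposes the raw bytes
    # expensive_pure_python_parse must be a module-level function so it can be pickled
    result = await loop.run_in_executor(executor, expensive_pure_python_parse, payload)
    return result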

A separate process has its own interpreter and its own GIL. Two processes run pure-Python bytecode in actual parallel. The price is paid at the boundary: arguments and return values are pickled, written through a pipe, unpickled on the other side. For small inputs and outputs this is cheap; for a 50 MB payload it is a copy and a serialise. Workers are long-lived: you do not spawn one per request, you keep a pool and submit to it. On Linux the workers are forked by default through Python 3.13 (cheap, copy-on-write; 3.14 changes the default to forkserver), while on macOS and Windows they are spawned (slower, fresh import of your code). And there is no shared state: changes to globals, connection pools, and in-memory caches in the parent are not visible in the worker, and under spawn they never existed there in the first place. Anything the worker needs, you send.

Trap

FastAPI's sync-handler convenience makes this worse, not better. A def handler runs in anyio's thread pool — forty slots by default. If the handler does pure-Python CPU work, "forty concurrent requests" means forty threads contending on one GIL, doing roughly one CPU's worth of work between them, plus context-switch tax. The dashboard says the pool is healthy; the latency dashboard says it is not. The threadpool is the right tool for sync I/O-bound libraries (psycopg2, requests — they release the GIL on the socket call). It is the wrong tool for CPU work. Move CPU to a ProcessPoolExecutor, or, for anything heavier than a few hundred milliseconds, off the request path entirely into a job queue (Celery, RQ, dramatiq) so the web process does not own the work at all.

The medium-term answer is free-threaded Python — PEP 703, the python3.13t build, where the GIL is gone and pure-Python threads finally run in parallel. When that lands in production the calculus changes: ProcessPoolExecutor becomes a tool for isolation rather than parallelism, and pure-Python CPU work in a thread pool becomes a real option. That world is not here yet. The ecosystem (C extensions, refcounting assumptions, package wheels) is still catching up. For code you write today: C extensions that release the GIL go to the thread pool, pure-Python CPU work goes to a process pool or out-of-process worker, and the question to ask before either is "does this need to be in the request path at all?"

From Go

Go has it easier here, in a narrow sense. Since Go 1.14 the runtime preempts goroutines mid-CPU-burn — it sends SIGURG to the OS thread, and the signal handler reschedules the goroutine. CPython has no equivalent. The bytecode interpreter cannot be safely interrupted mid-instruction without breaking refcount invariants (see Part 1), and the event loop has no signal of its own. If your coroutine doesn't await, the loop cannot make it.

§7 Cancellation

You have a Task. You want it to stop. task.cancel() does not stop it immediately. It sets a flag. The next time the loop steps the Task, instead of calling coro.send(None), it calls coro.throw(CancelledError()), injecting an exception at the suspension point. The coroutine can catch it to do some cleanup, but should then re-raise; or it can just let it propagate.

This means a Task that is not currently parked at an await — say, because it's in the middle of a CPU-bound loop — cannot be cancelled until it reaches the next yield. Cancellation, like everything else in cooperative scheduling, is cooperative.
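The sequence is easy to observe. A minimal sketch with illustrative names; the log records the order in which things actually happen.

```python
import asyncio


async def worker(log):
    try:
        log.append("started")
        await asyncio.sleep(10)          # parked here when cancel() is called
        log.append("never reached")
    except asyncio.CancelledError:
        log.append("cleanup")            # runs when the loop next steps the task
        raise                            # re-raise so the task ends up cancelled


async def cancel_demo():
    log = []
    task = asyncio.create_task(worker(log))
    await asyncio.sleep(0)               # let the worker reach its first await
    task.cancel()                        # a request, not an immediate stop
    log.append("cancel requested")
    try:
        await task
    except asyncio.CancelledError:
        log.append("observed cancellation")
    return log, task.cancelled()


print(asyncio.run(cancel_demo()))
```

Note that "cancel requested" is logged before "cleanup": the call to cancel() returns first, and the exception is only delivered once the loop gets around to stepping the task.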

Trap

Since Python 3.8, CancelledError inherits from BaseException, not Exception. This is deliberate. It means try: ... except Exception: blocks will not swallow cancellations — they propagate up through your generic handlers and stop the task as intended. If you find yourself catching BaseException in async code, ask yourself whether you really mean to suppress cancellation. Almost never.
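A quick check of that behaviour, as a sketch with illustrative names: a handler wrapped in a generic except Exception still gets cancelled.

```python
import asyncio


async def handler(log):
    try:
        await asyncio.sleep(10)
    except Exception:
        # CancelledError is not an Exception subclass, so this does not run.
        log.append("swallowed")
    log.append("kept running")           # not reached when cancelled


async def swallow_demo():
    log = []
    task = asyncio.create_task(handler(log))
    await asyncio.sleep(0)               # let the handler park at its await
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return log, task.cancelled()


print(issubclass(asyncio.CancelledError, Exception))
print(asyncio.run(swallow_demo()))
```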

The companion primitive is asyncio.timeout, added in 3.11:

async with asyncio.timeout(5.0):
    response = await fetch(url)

It schedules a cancel() on the current task five seconds from now. If the block exits first, the cancel is removed. If the timer fires first, a CancelledError is injected; the context manager catches it and re-raises as TimeoutError. Same machinery, friendlier surface.

And occasionally you have a coroutine you genuinely want to protect — usually a cleanup that must finish even if its caller is being cancelled. That's asyncio.shield(coro): cancellations on the outer task do not propagate into the shielded coroutine. Use it sparingly; shielded work that runs forever is a common source of hung shutdowns.
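A sketch of shield in action, with illustrative names and durations. The inner work is created as its own Task and a reference to it is kept, so it can be awaited after the outer task has been cancelled.

```python
import asyncio


async def cleanup(log):
    await asyncio.sleep(0.05)
    log.append("cleanup finished")


async def outer(log, inner):
    try:
        await asyncio.shield(inner)      # cancellation stops here
    except asyncio.CancelledError:
        log.append("outer cancelled")
        raise


async def shield_demo():
    log = []
    inner = asyncio.create_task(cleanup(log))
    task = asyncio.create_task(outer(log, inner))
    await asyncio.sleep(0.01)            # both tasks are now parked
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    await inner                          # the shielded work still completes
    return log, inner.cancelled()


print(asyncio.run(shield_demo()))
```

The outer task sees the cancellation immediately; the shielded task does not see it at all, which is also why a shielded task that never finishes will hold up shutdown.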

§8 TaskGroup: structured concurrency, finally

For most of asyncio's life, the way to run several coroutines concurrently was asyncio.gather. It has a famously rough edge: if one task raises, the others keep running. You could opt into return_exceptions=True, or wrap things in asyncio.wait with a clever FIRST_EXCEPTION mode, but the default left tasks running after their siblings had already failed. Easy to leak. Easy to misread.

Python 3.11 borrowed Trio's nursery pattern and called it TaskGroup:

async with asyncio.TaskGroup() as tg:
    tg.create_task(fetch("https://a"))
    tg.create_task(fetch("https://b"))
    tg.create_task(fetch("https://c"))
# control reaches here only when all three are done — or all three are cancelled.

The invariant the async with guarantees is strong: when the block exits, every task created inside it has finished, errored, or been cancelled. If one task raises (other than CancelledError), every sibling is cancelled and the exceptions are collected into an ExceptionGroup (PEP 654) and re-raised. No tasks survive the block. No exceptions are lost.

This is the structured-concurrency idea: a function's concurrent children belong to the function, not to the program, and they cannot outlive it. Nathaniel J. Smith's Notes on Structured Concurrency (2018) is the canonical write-up — the analogy to goto-versus-block-scope is genuinely useful. Old gather code, where Tasks could leak past the function that created them, is now technical debt. Reach for TaskGroup by default.

§9 A short list to take with you

Up next · Part 3 of 3
Servers, in earnest

We have a scheduler. Now: what does a real server do inside it? Part 3 is the production stack — FastAPI on Starlette on uvicorn on asyncio, plus asyncpg, Redis, JWT, and the specific places where the abstraction leaks. Connection pools that pretend to be infinite. Blocking calls hiding in libraries you trust. Graceful shutdown that almost never is. The interesting questions are no longer about asyncio mechanics; they are about where the mechanics meet code that did not necessarily expect them.