Python Concurrency · Part 1 of 3

The GIL is Not Your Enemy
(But It Is Your Problem)

A systems engineer's tour of CPython's Global Interpreter Lock — why it exists, what it actually protects, and where the rough edges are.

~14 min read · Python 3.12+ · Assumes: OS threads, Go concurrency

Take a CPU-bound loop, split it across four Python threads on a four-core machine, and watch it run slower than the single-threaded version. Then take the same loop, replace it with reading from four sockets in parallel, and watch the threads work beautifully. The piece of machinery responsible for both outcomes is the same: a single mutex inside CPython called the Global Interpreter Lock.

If you've come from a language where threads do what you'd expect — Go, Rust, Java, C — Python's behaviour feels wrong on first encounter. It isn't. It's the consequence of a few reasonable engineering decisions made under real constraints, the kind every implementer faces. The constraints are still there; the decisions just locked in early.

So this post isn't a defence of the GIL, exactly. It's an unwinding. We'll look at what CPython chose, why it chose that, what the GIL actually guards (and, more interestingly, what it doesn't), and how it shapes the way you must reason about parallelism in Python. By the end you should be able to explain the GIL to a skeptic at a whiteboard, and predict — without running it — which of your "parallel" Python programs will speed up and which will silently slow down.

§1 A PyObject is a refcounted box

To understand the GIL, you first have to understand what a Python value is. Every Python object — every int, every list, every function, every None — is a heap-allocated C struct whose header looks like this:

// Include/object.h — simplified
typedef struct _object {
    Py_ssize_t     ob_refcnt;   // reference count
    PyTypeObject *ob_type;     // pointer to type
} PyObject;

cpython/Include/object.h — every object in Python begins with this header.

The ob_refcnt field is the original sin and the entire reason this post exists. CPython does not have a tracing garbage collector as its primary memory reclamation mechanism. It has reference counting, with a cycle-detecting GC bolted on later to handle the cases refcounting can't.

Every time you bind a name, pass an argument, store something in a list, or even look up an attribute, CPython has to Py_INCREF the involved object. Every time a reference goes away, it has to Py_DECREF, and if the count hits zero, the object is freed immediately. This has nice properties: it's deterministic, it's cache-friendly, dead objects are reclaimed the moment they die, and there are no stop-the-world pauses for most workloads. Go traded all of that for a concurrent tracing collector with write barriers. CPython didn't.

It's the kind of choice that makes sense for a 1991 interpreter written for single-core Unix workstations, when an atomic compare-and-swap on every name binding would have absolutely tanked performance. The catch shows up only when you try to share the interpreter between two threads.
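The refcounting behaviour described above can be observed from Python itself. The following is a minimal sketch, assuming a standard CPython build; `Box`, `refcount_delta`, and `freed_immediately` are hypothetical names introduced only for illustration.

```python
import sys
import weakref


class Box:
    """Plain object that supports weak references."""


def refcount_delta():
    """Return how much two extra references raise an object's refcount."""
    obj = Box()
    before = sys.getrefcount(obj)   # includes the temporary argument reference
    alias = obj                     # name binding: one more reference
    holder = [obj]                  # list slot: one more reference
    after = sys.getrefcount(obj)
    return after - before


def freed_immediately():
    """Return True if the object is reclaimed when its last reference dies."""
    obj = Box()
    probe = weakref.ref(obj)        # weak references do not raise the count
    del obj                         # count reaches zero, object is freed here
    return probe() is None


print(refcount_delta())      # 2
print(freed_immediately())   # True
```

The second function is the "deterministic" property in action: no collector pass is needed, the object is gone as soon as `del` drops the last strong reference.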

§2 The race condition that justifies a lock

Picture two OS threads (real kernel-scheduled threads, the same kind your operating system would use for any program), both running Python code that does x = some_global. From Python's perspective, that's a global name load followed by a name bind. From CPython's perspective, it's Py_INCREF(some_global) twice, once per thread, and each one is a read of ob_refcnt, an add, and a write back.

Two reads, two adds, two writes, against one shared 64-bit field. If the two threads land on different CPU cores and run truly in parallel, the writes can interleave. This is the classic shared-memory hazard that every C, Java, or Go programmer has seen — same race, just one layer deeper.

Diagram 1 · refcount race

Two threads each try to Py_INCREF the same object. Without a lock, their read-add-write sequences interleave and one increment is lost. The object's true refcount diverges from what Python believes.

Thread A: read (sees 5) → add 1 (local = 6) → write (stores 6)
Thread B: read (sees 5) → add 1 (local = 6) → write (stores 6)
ob_refcnt: 5 → 6 → 6 (expected 7 · got 6 · one increment lost)

Two increments, one survives. The other thread's view of the world has been silently overwritten. The refcount is now one too low, which means the object will eventually be freed while a live reference still exists — a classic use-after-free, courtesy of an arithmetic race. The fix, in normal C code, is a __atomic_add_fetch or a mutex.
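The interleaving in Diagram 1 can be replayed deterministically. This is only a simulation: the two "threads" are generators stepped by hand, and `incref` and `simulate` are hypothetical helpers, not CPython internals.

```python
def incref(obj):
    """A non-atomic increment split into its three machine-level steps."""
    local = obj["ob_refcnt"]      # step 1: read
    yield "read"
    local += 1                    # step 2: add
    yield "add"
    obj["ob_refcnt"] = local      # step 3: write back
    yield "write"


def simulate(schedule, start=5):
    """Run two increments, advancing thread A or B one step per letter."""
    obj = {"ob_refcnt": start}
    threads = {"A": incref(obj), "B": incref(obj)}
    for name in schedule:
        next(threads[name])
    return obj["ob_refcnt"]


print(simulate("AAABBB"))   # 7: each increment completes before the next starts
print(simulate("ABABAB"))   # 6: both read 5, both write 6, one increment is lost
```

Any schedule in which both reads happen before either write produces the lost update; the lock's job is to rule those schedules out.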

CPython does have a fix. It's just a much, much bigger lock than you'd guess.

§3 One coarse lock instead of a million fine ones

The pragmatic question for the early CPython implementers was: how do we make refcounts safe without paying for atomics on every single name binding? You can imagine the alternatives: make every refcount update atomic, give every object its own fine-grained lock, or put one coarse lock around the whole interpreter.

CPython picked the third one in 1992 and called it the Global Interpreter Lock. The decision predates SMP-as-default by more than a decade. It made Python easy to embed, made the C API trivial to write extensions against (numpy, lxml, Pillow: none of them would exist in their current form without it), and made the interpreter blisteringly fast on the single-core hardware of the time.

The thing to internalise is what the GIL is actually a lock on: it's a mutex that any thread must hold to execute Python bytecode or touch interpreter state. It is not a lock on "Python" the language, or on user data structures, or on anything you can name from inside Python. It's a lock on the interpreter loop itself.

Python threads are real OS threads. On Linux, threading.Thread.start() creates a kernel thread through the standard POSIX threads library — the same primitive the C runtime uses. The OS scheduler treats them like any other thread. They just happen to spend most of their time blocked waiting for one specific mutex.

Python threads are real OS threads. They just take turns holding a token, and only the token-holder is allowed to touch the interpreter.
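One way to see that these are kernel threads is to ask the OS for each thread's native ID. A small sketch, assuming Python 3.8+ on a platform where `threading.get_native_id()` is available; `collect_native_ids` is a hypothetical helper.

```python
import threading


def collect_native_ids(n_threads=3):
    """Start n threads and record the kernel-assigned ID of each one."""
    ids = []
    ids_lock = threading.Lock()
    barrier = threading.Barrier(n_threads)

    def worker():
        with ids_lock:
            ids.append(threading.get_native_id())
        # Keep every thread alive until all have reported, so the OS
        # cannot recycle an ID while we are still collecting.
        barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ids


main_id = threading.get_native_id()
worker_ids = collect_native_ids()
print(main_id, worker_ids)
```

On Linux these are the same thread IDs that tools such as `top -H` display, which is a quick way to confirm the scheduler sees them as ordinary threads.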

§4 The release dance: where threads actually help

If the GIL were held continuously, threading in Python would be useless — worse than useless, because you'd pay the OS scheduling overhead and gain nothing. It isn't. The GIL is released around any operation where the interpreter is going to wait on something external.

The most important case: blocking I/O syscalls. When you call socket.recv(4096) in Python, the eventual C path looks like this, paraphrased:

// Modules/socketmodule.c — heavily simplified
static PyObject* sock_recv(...) {
    Py_BEGIN_ALLOW_THREADS     // release the GIL
    n = recv(fd, buf, len, flags);
    Py_END_ALLOW_THREADS       // reacquire
    return PyBytes_FromStringAndSize(buf, n);
}

cpython/Modules/socketmodule.c — the BEGIN/END macros bracket the blocking syscall.

Those two macros are the secret. Py_BEGIN_ALLOW_THREADS saves the current thread state and releases the GIL. The thread then sits in recv(), which is just a kernel call — no Python state involved, nothing for the GIL to protect. While it blocks, any other Python thread is free to grab the GIL and run. When recv() returns, Py_END_ALLOW_THREADS reacquires the GIL (potentially waiting if another thread holds it) and we're back in business.

Anything that calls into a blocking syscall does this: read, write, send, recv, select, poll, epoll_wait, connect, accept. So does time.sleep(). So do well-behaved CPU-heavy C extensions: numpy releases the GIL around many of its array operations, hashlib releases it around sha256.update() on large buffers, zlib releases it around compression. This is why numpy code can actually saturate multiple cores from Python: the heavy lifting happens with the GIL not held.

Mental model

Treat the GIL as a token at a help desk. Only the holder can speak to the interpreter. If you want to wait on the kernel for a syscall, you put the token back on the desk while you wait. When the kernel answers, you queue up to grab the token again before you continue. C extensions can also voluntarily put the token down while doing pure C work.
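The token hand-back is easy to observe with time.sleep(), which the text lists among the calls that release the GIL. A rough sketch; `timed_sleeps` is a hypothetical helper and the exact timing will vary by machine.

```python
import threading
import time


def timed_sleeps(n_threads=4, seconds=0.2):
    """Run n sleeping threads at once and return the wall-clock time taken."""

    def worker():
        time.sleep(seconds)   # the GIL is released while the thread sleeps

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


elapsed = timed_sleeps()
print(f"{elapsed:.2f}s for 4 x 0.2s sleeps")
```

If the GIL were held across the sleep, the four waits would run back to back and take about 0.8 s; because it is released, the total stays close to a single 0.2 s wait.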

§5 The handoff: cooperative, not preemptive

What about pure-Python CPU work? A thread that's running for i in range(10**8): x = x * 2 never makes a syscall. It would, in principle, hold the GIL forever. CPython has to force handoffs.

Pre-3.2, the interpreter checked a counter every 100 bytecodes and if it tripped, released and reacquired the GIL. This had pathological behaviour on multicore — the released GIL would often be grabbed back by the same thread before any other OS thread woke up, leading to long convoys.

Python 3.2 (Antoine Pitrou's redesign) replaced this with a time-based system. The interpreter doesn't decide on its own; instead, when a thread is waiting for the GIL, it sets a timer (default 5 ms, tunable via sys.setswitchinterval()). If the holding thread hasn't released by then, the waiter sets a flag asking it to drop. The holder sees the flag at the next bytecode boundary and yields.

Diagram 2 · GIL handoff between two CPU-bound threads

Thread A holds the GIL and runs Python bytecode. After ~5 ms, thread B's request is honoured at the next safe point. The handoff is voluntary on A's part, but B is the one that arms the timer.

[Diagram 2: timeline. Thread A holds the GIL and runs bytecode while Thread B waits and arms a 5 ms timer. At 5 ms B posts a drop request, the GIL is handed off, B runs while A waits, and the cycle repeats. Legend: solid block = holds GIL · faded block = waiting on GIL.]

Two things are worth noting. First, this is not preemption: Go 1.14 added genuine async preemption via signals, but CPython still relies on the holder cooperatively checking the flag. A C extension that holds the GIL through a long pure-C loop, without ever releasing it via Py_BEGIN_ALLOW_THREADS, can stall every other thread indefinitely. Second, the handoff happens at bytecode boundaries, not between any two C instructions. The unit of atomicity, from Python's perspective, is one bytecode op.
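The switch interval mentioned in this section can be read and changed at runtime through the sys module. A minimal sketch; `with_switch_interval` is a hypothetical helper that restores the previous value afterwards.

```python
import sys


def with_switch_interval(seconds, fn):
    """Call fn() with a temporary switch interval, then restore the old one."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        return fn()
    finally:
        sys.setswitchinterval(previous)


default_interval = sys.getswitchinterval()   # 0.005 unless something changed it
observed = with_switch_interval(0.001, sys.getswitchinterval)
print(default_interval, observed)
```

A shorter interval makes waiting threads more responsive at the price of more handoffs; it does not make CPU-bound threads run in parallel.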

§6 What the GIL does not protect

Here's where most people get burned, including people who think they understand it. The GIL serialises bytecode execution. It does not give you atomicity over the operations you care about in Python.

import threading

counter = 0

def work():
    global counter
    for _ in range(1_000_000):
        counter += 1

threads = [threading.Thread(target=work) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()

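print(counter)  # not guaranteed to be 8_000_000

A "trivially parallel" counter. Nothing guarantees the total: when the race fires you get a different (smaller) number each run, and how often it fires depends on where your CPython version checks for GIL switch requests.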

The problem is that counter += 1 is not one bytecode. dis.dis tells the truth:

$ python -c "import dis; dis.dis('counter += 1')"
  0 RESUME                   0
  2 LOAD_NAME                counter
  4 LOAD_CONST               1
  6 BINARY_OP                13 (+=)
 10 STORE_NAME               counter

Three operations on the shared variable, with two potential switch points between them. The GIL switch can land anywhere in that gap, between LOAD_NAME and STORE_NAME, and another thread can sneak in its own read-add-write before yours completes. The interleaving race is identical to the C version on ob_refcnt we drew earlier: same shape, different abstraction layer.
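The disassembly above is compiled at module level, which is why it shows LOAD_NAME and STORE_NAME. Inside a function that declares the counter global, CPython emits LOAD_GLOBAL and STORE_GLOBAL instead, with the same gap between them. A small sketch to check this on your own interpreter; `bump` is a hypothetical stand-in for the body of work(), and the exact opcode list varies by version.

```python
import dis

counter = 0


def bump():
    global counter
    counter += 1


def opnames(func):
    """Return the opcode names of a function, in execution order."""
    return [ins.opname for ins in dis.get_instructions(func)]


ops = opnames(bump)
load_at = ops.index("LOAD_GLOBAL")
store_at = ops.index("STORE_GLOBAL")

print(ops)
print(f"{store_at - load_at - 1} instruction(s) sit between the load and the store")
```

Whatever sits between the load and the store on your version, the read and the write-back are separate instructions, which is all the race needs.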

The trap

"The GIL means I don't need locks" is the single most expensive misconception in Python concurrency. The GIL protects the interpreter's internals. Your own state still needs threading.Lock, queue.Queue, or atomic primitives — same as any other shared-memory language. The only thing the GIL gives you for free is that individual bytecode ops are atomic, which is a much weaker guarantee than people remember.

Certain operations do happen to be atomic in practice: d[k] = v on a built-in dict is a single STORE_SUBSCR bytecode, and the mutation in L.append(x) happens inside a single C call made with the GIL held. These are atomic on current CPython as an implementation detail. They are not guaranteed by the language spec, and they don't compose: d[k] = d[k] + 1 is two atomic ops with a window in between, and is exactly as racy as the counter example.

Before reading the next section: predict. Each of these is a real expression from real code. Which are atomic across threads, which are racy, and which are folklore you shouldn't rely on?

L.append(item) (L is a list)
The expression is several bytecodes (load L, look up append, load item, call), but the mutation itself happens inside a single C call made with the GIL held. Atomic on current CPython, but it's an implementation detail, not a language guarantee. FOLKLORE · don't build on it

counter += 1
Three bytecodes: LOAD, BINARY_OP, STORE. Two interleaving points. Lost updates whenever threads cross. RACY · always

d[k] = v (d is a built-in dict)
One STORE_SUBSCR bytecode. Atomic on CPython today. Same caveat as list.append: folklore, not spec. FOLKLORE · don't build on it

d[k] = d[k] + 1
Two atomic ops with a window between. Read-modify-write, classic lost update. Same race as counter += 1. RACY · always

q.put(x) (q is queue.Queue)
queue.Queue takes an internal mutex on every method. Safe for multi-producer, multi-consumer use by design. SAFE · designed for this

x = obj.field (simple attribute read)
Single LOAD_ATTR. Atomic for the load itself, but says nothing about consistency with anything else. Reads of a field a writer is mid-updating are fine in isolation but stale in context. FOLKLORE · trickier than it looks

Treat the dict/list "atomicity" as folklore. Don't build on it. When you need atomicity over operations you name, use threading.Lock, queue.Queue, or — for monotonic counters — itertools.count(), whose __next__ is a single C call that holds the GIL throughout.
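For the counter example, the threading.Lock version looks roughly like this. It is a sketch with smaller numbers than the original so it runs quickly; `count_with_lock` is a hypothetical name.

```python
import threading


def count_with_lock(n_threads=4, n_iters=10_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def work():
        nonlocal counter
        for _ in range(n_iters):
            with lock:          # the whole read-add-write is now one critical section
                counter += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


total = count_with_lock()
print(total)   # 40000
```

The lock is held only around the increment itself. Taking it once per iteration is slow but obviously correct; batching local work and merging under the lock is the usual next step when that cost matters.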

§7 So what's threading actually good for?

For pure-Python CPU work: nothing. Worse than nothing — you'll pay GIL handoff overhead and convoy effects, and the program will slow down. The famous "multiprocessing is the only path" advice exists because of this.

For I/O-bound work: threading is fine, often the simplest choice. Each thread spends most of its life in recv() with the GIL released, so they don't actually contend much. A hundred threads serving a hundred slow database queries is a perfectly reasonable design, and historically it's what every Python web server did before the async era.

For CPU work in C extensions: threading is great, because the heavy loops drop the GIL. This is how numpy parallelises matrix operations across cores, how Pillow can resize images concurrently, how scrypt and bcrypt can hash passwords without blocking the interpreter. You get true parallelism for the C portion; the Python orchestration layer is still single-threaded but it doesn't matter because it's not the bottleneck.
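As an illustration of the C-extension case, here is a sketch that hashes several large buffers on a thread pool, relying on the hashlib behaviour described earlier (the GIL is dropped while hashing large inputs). The buffer sizes and the `digest` helper are arbitrary choices for the example, and no speedup figure is claimed since that depends on core count and build.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(buf):
    """Hash one buffer; the bulk of the work happens inside C code."""
    return hashlib.sha256(buf).hexdigest()


# Four distinct 4 MiB buffers, large enough for the hashing to dominate.
buffers = [bytes([i]) * (4 * 1024 * 1024) for i in range(4)]

serial = [digest(b) for b in buffers]

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, buffers))   # results keep input order

print(threaded == serial)   # True
```

The Python side only dispatches work and collects results, which matches the description above: single-threaded orchestration around C code that runs with the GIL released.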

Workload | threading | multiprocessing | asyncio
Pure Python · CPU-bound | Slower than serial. GIL contention dominates. | True parallelism. Pay fork/spawn + pickle cost. | Slower than serial. Single-threaded by design.
I/O-bound · few connections | Works well. Simple. Familiar locks/queues. | Overkill. IPC for every result. | Works well, slightly more setup.
I/O-bound · many thousands | Memory cost of stacks bites at ~1000+. | Will not scale this way. | The point of asyncio. Part 2 →
C extension heavy lifting | Parallel if the extension releases the GIL. | Parallel, with serialization overhead. | Parallel via run_in_executor + threads.

§8 The future: free-threaded Python

The GIL has been the subject of removal proposals for at least 25 years. The serious one is PEP 703, accepted in 2023 and shipping as an experimental build mode in Python 3.13 — python3.13t, the "t" for "no-GIL", because the names nogil and free-threaded were both somehow worse.
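To find out which kind of interpreter a script is running on, something like the following should work. It is a sketch: `gil_status` is a hypothetical helper, `sys._is_gil_enabled()` only exists on Python 3.13 and later (hence the guarded lookup), and the `Py_GIL_DISABLED` config variable is unset or 0 on ordinary builds.

```python
import sys
import sysconfig


def gil_status():
    """Return (free_threaded_build, gil_enabled) for the running interpreter."""
    # Set to 1 only when CPython was compiled in free-threaded mode.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # Present on 3.13+; older versions always run with the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = bool(check()) if check is not None else True

    return free_threaded_build, gil_enabled


free_threaded_build, gil_enabled = gil_status()
print(f"free-threaded build: {free_threaded_build}, GIL enabled: {gil_enabled}")
```

The two values are reported separately because a free-threaded build can still end up running with the GIL switched back on.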

Removing the GIL means making refcount updates on every PyObject thread-safe without a global lock (PEP 703 uses biased reference counting for this), giving key built-in containers (dict, list, set) fine-grained per-object locking, and rewriting any extension module that relied on "the GIL is held when my code runs", which is most of them. It's a multi-year ecosystem migration, not a flag flip. Single-threaded performance regressed by around 5–10% in early measurements, has been clawed back significantly, and is expected to converge on parity by 3.15 or so.

The interesting design tension is biased reference counting (Choi et al., 2018): each object is biased toward the thread that created it. In the common case where an object is only ever touched by that thread, refcount updates hit a plain non-atomic counter, the fast path. Only references taken from other threads pay for atomic updates, on a separate shared counter. It's the kind of trick that recognises something real about how programs actually use objects, and it's why PEP 703 can hit performance numbers at all.
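The owner-thread fast path can be sketched as a toy model in Python. This is not CPython's implementation: `BiasedRefcount` is a hypothetical class, and a threading.Lock stands in for the atomic operations a real runtime would use.

```python
import threading


class BiasedRefcount:
    """Toy model: the creating thread counts cheaply, everyone else pays."""

    def __init__(self):
        self.owner = threading.get_ident()      # thread that created the object
        self.local = 1                          # owner-only, no synchronisation
        self.shared = 0                         # updated by all other threads
        self._shared_lock = threading.Lock()    # stand-in for an atomic add

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                     # fast path: plain increment
        else:
            with self._shared_lock:             # slow path: synchronised increment
                self.shared += 1

    def total(self):
        with self._shared_lock:
            return self.local + self.shared


rc = BiasedRefcount()
rc.incref()                                     # owner thread: fast path
rc.incref()                                     # owner thread: fast path

other = threading.Thread(target=rc.incref)      # another thread: slow path
other.start()
other.join()

print(rc.local, rc.shared, rc.total())          # 3 1 4
```

The point of the model is the branch in incref: objects that never leave their creating thread never touch the synchronised path.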

None of this matters yet for production code. The ecosystem isn't there. But it's the direction, and once you understand why the GIL exists — refcounting was load-bearing — you can read the no-GIL design and see exactly what each piece is solving.

§9 Things to take with you

Coming next · Part 2
Cooperative scheduling, an event loop, and a thousand connections on one thread

Threading is fine for tens of connections; processes are fine for CPU work. Neither scales to thousands of slow network conversations. The next post is about asyncio — what an event loop actually does (epoll with a task queue, basically), what async/await compiles down to, and where it bites.