Python Global Interpreter Lock
GIL Confusion
The Python Global Interpreter Lock (GIL) is the source of much confusion. You will see code like this:
# The following sleep is here only to allow other threads the
# opportunity to grab the Python GIL. (see pyepics/pyepics#171)
time.sleep(0)
Calling time.sleep releases the CPU the thread is running on at the kernel level, but it does not release the GIL, per se. That time.sleep(0) hands control back to the OS scheduler, but the GIL remains held by the interpreter. The confusion stems from mixing up parallelism and concurrency: the GIL prevents parallel execution of Python bytecode, yet allows threads to interleave execution cooperatively. A simple example illustrates this.
CPU Intense Example
import threading, time

def _cpu_intense_function():
    def _is_prime(n):
        for i in range(2, int(n**0.5) + 1):
            if n % i == 0:
                return False
        return True
    for n in range(3, 10000):
        _is_prime(n)

def _thread(inc):
    global shared_value
    for _ in range(500):
        _cpu_intense_function()
        shared_value += inc

shared_value = 0

ones = threading.Thread(target=_thread, args=(1,))
ones.start()
millions = threading.Thread(target=_thread, args=(1000000,))
millions.start()

while shared_value < 500000000:
    time.sleep(1)
    print(shared_value)
The following output demonstrates that shared_value increases in the ones and the millions places simultaneously:
$ python cpu_intense.py
117000121
237000241
355000358
475000475
500000500
The GIL allows the two threads to interleave so shared_value changes (almost) simultaneously in the ones and millions places. Neither thread “releases” the CPU explicitly. The GIL allows the two threads to run concurrently without corrupting the Python interpreter. However, the GIL does not guarantee atomicity of operations like shared_value += inc or any other single Python statement.
Therefore, the program is not guaranteed to stop. There is a race condition between the time shared_value is loaded and the time it is stored after being incremented (three Python opcodes). If a lost update leaves the final value below 500000000, the while loop never exits. Since the window is about 200 nanoseconds on modern processors, the program terminates almost all of the time, especially since the majority of the time is spent in _cpu_intense_function.
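You can see those separate opcodes yourself with the standard library's dis module (a quick sketch; the exact opcode names vary across CPython versions):

import dis

def _increment(inc):
    global shared_value
    shared_value += inc

# The disassembly shows separate load, add, and store opcodes for the
# += statement; a thread switch can occur between any two of them.
dis.dis(_increment)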
You would need to use a synchronization primitive like threading.Lock to update shared_value atomically and to guarantee the program will always stop. Again, the GIL does not ensure atomic execution of Python statements.
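Here is a minimal sketch of that fix, changing only _thread from the example above:

import threading

lock = threading.Lock()

def _thread(inc):
    global shared_value
    for _ in range(500):
        _cpu_intense_function()  # as defined in the example above
        with lock:  # serializes the load/add/store of the increment
            shared_value += inc

With the lock held around the increment, no update can be lost, so shared_value always reaches 500000500 and the program always stops.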
Parallel vs Concurrent
The GIL does guarantee that only one Python thread is executing Python statements at any one time. To demonstrate this, we’ll time the code as written:
$ time python cpu_intense.py
<snip>
real 0m4.418s
Then, with the ones.start() commented out:
$ time python cpu_intense.py
<snip>
real 0m2.366s
The program runs in half the time, which means the two threads do not run in parallel, even though they are running concurrently. On my multicore laptop, a similar C program would run in about the same amount of time in both cases. In other words, the C program scales linearly and the Python program does not scale at all. This is why the GIL is bandied about in the scientific programming community, and why I’m writing this article: HPC is just one part of the story in experimental physics software.
Interpreter is Shared State
In order to understand the GIL, we need to understand what the Python interpreter is and isn’t. Firstly, the interpreter is not a thread; it’s just code and data. The code is the CPython program and all its libraries, and at first, the Python program it reads in is its only data. The interpreter compiles the Python program and its libraries into opcodes to be executed in the Python virtual machine.
When you execute python cpu_intense.py, the kernel starts a process, which is the code, data, and a thread, known as the main thread in Python. In the example above, the main thread executes the Python code which starts the other two threads (ones and millions). In Linux, the code is read-only and the data is, of course, writable. The GIL is what keeps this data (the program, stacks, heap, etc.) from getting corrupted.
In other words, the GIL is a synchronization primitive, made rather complex for efficiency reasons, that protects the shared state (the Python program and its data) of the Python interpreter.
Preemption vs Cooperation
I have tried to use coroutines, and decided that threading is the way to go in Python; that's why I'm writing this. I'm working on device control code right now, which uses EPICS, and the first example code I showed is from PyEpics, the Python wrapper for the EPICS library, which is written in C.
By and large, device control code is not CPU intense. Rather, it's asynchronous, so it needs to be concurrent. In Python, there are three types of concurrency: multiprocessing, threading, and asyncio. Confusingly, the title of the threading library is “Thread-based parallelism”, which is incorrect. The GIL, as noted above, prevents parallelism in Python: two Python threads cannot execute in parallel. More confusingly, asyncio does not support I/O in any sense, concurrently or otherwise. asyncio would have been better called coroutines, because that's all it supports. multiprocessing is correctly described as “Process-based parallelism”, which is one way to execute Python in parallel.
asyncio supports cooperative multitasking using coroutines. When a coroutine calls await, asyncio can switch to another coroutine, if one is ready. Without await, the currently running coroutine hogs the CPU, especially when it makes calls to C libraries. To demonstrate this to yourself, create two coroutines, start them running, and have one call time.sleep(5). The other coroutine will not run for 5 seconds. However, change that to await asyncio.sleep(5), and the other coroutine will run.
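Here is a minimal sketch of that experiment (the coroutine names are mine):

import asyncio
import time

async def blocker():
    time.sleep(5)             # blocks the whole event loop for 5 seconds
    # await asyncio.sleep(5)  # swap this in and ticker keeps running

async def ticker():
    for _ in range(5):
        print("tick")
        await asyncio.sleep(1)

async def main():
    await asyncio.gather(blocker(), ticker())

asyncio.run(main())

With time.sleep(5), no “tick” appears until blocker finishes; with await asyncio.sleep(5), the ticks print once per second while blocker waits.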
threading supports kernel-level threads, which are preemptively multitasked. The kernel controls when threads run, not the Python interpreter. That's what the cpu_intense.py example demonstrates: the two threads run whenever the kernel decides. However, the GIL prevents two threads from accessing their (shared) Python interpreter at the same time. The GIL is designed to release itself very frequently, not on every opcode, but close enough for our purposes. Even the simplest Python statement is made up of several opcodes.
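In current CPython, that handoff is driven by a timed switch interval, which you can inspect and tune from the sys module:

import sys

print(sys.getswitchinterval())  # the default is 0.005 seconds (5 ms)
# sys.setswitchinterval(0.001)  # ask for more frequent GIL handoffs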
Python opcodes are to threading as CPU-level instructions are to Python's multiprocessing module, that is, to parallel execution of forked instances of the Python interpreter. Just as the GIL switches Python threads at the opcode level, the kernel switches processes (and threads) at the CPU-instruction level. On single-core computers (rare nowadays), this was just fine. Today, of course, we want to take advantage of all the cores on a processor when we need them. A Python opcode takes many CPU instructions to execute, which is why multiprocessing is used to achieve CPU-level parallelism. Python multiprocessing programs scale while threading programs do not.
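To make that concrete, here is a sketch of the earlier threaded example rewritten with multiprocessing. Since processes do not share memory by default, each worker returns a partial sum instead of mutating shared_value:

import multiprocessing

def _cpu_intense_function():
    def _is_prime(n):
        for i in range(2, int(n**0.5) + 1):
            if n % i == 0:
                return False
        return True
    for n in range(3, 10000):
        _is_prime(n)

def _worker(inc):
    # Each worker runs in its own process, with its own interpreter
    # and its own GIL, so the two workers can execute in parallel.
    total = 0
    for _ in range(500):
        _cpu_intense_function()
        total += inc
    return total

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        print(sum(pool.map(_worker, [1, 1000000])))  # prints 500000500

On a multicore machine, this version finishes in roughly half the wall-clock time of the threaded one.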
Asynchrony vs Polling
Our example program would run in parallel if we used multiprocessing instead of threading, as the sketch above shows. As noted above, I'm writing this article because I'm writing device control code, which needs to be asynchronous, not necessarily parallel. For the most part, device control code ends up waiting for device events or control requests, that is, unless the program is implemented using polling.
Much Python EPICS control code uses polling, even though EPICS is designed to be fully asynchronous. The PyEpics documentation, in “Advanced Topic with Python Channel Access”, suggests code be written like this:
pvnamelist = read_list_pvs()
for i in range(10):
    values = epics.caget_many(pvnamelist)
    time.sleep(0.01)
While this does work, it's inefficient from a CPU utilization perspective, and it creates unnecessary latency. In this case, the code blocks for 100 milliseconds even when all the devices respond immediately to their “channel access” requests (caget). Moreover, it's likely that this code will make too many caget requests, because the actual values have not changed since the last caget.
The EPICS protocol is fully asynchronous, just like the vast majority of device drivers in operating system kernels. Another term for asynchronous programming is event-driven programming. In the kernel, events are device interrupts. In EPICS, events are asynchronous messages from the programs managing one or more devices. At the Python level, PyEpics allows Python code to register callbacks to receive these messages, so no polling is necessary. When the callback occurs, new data is available from the device. This is why polling should be considered an anti-pattern in EPICS.
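Registering a callback looks roughly like this (a sketch based on the PyEpics PV API; the PV name is made up):

import epics

def on_change(pvname=None, value=None, **kwargs):
    # Invoked by PyEpics whenever the device publishes a new value;
    # no polling loop and no sleep are needed.
    print(pvname, value)

pv = epics.PV('XXX:m1.VAL')  # hypothetical PV name
pv.add_callback(on_change)
# The program goes on with its own work; callbacks arrive
# asynchronously on PyEpics' channel-access monitor thread.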
GIL Aware
While polling can reduce latency in certain situations, in Python code it is an anti-pattern because of the GIL and because, as noted, each Python opcode requires many CPU instructions to execute. When Python is waiting on events from the operating system, it is like any other program, compiled or interpreted. This is why writing device control code in Python can be as efficient as writing it in C. (Note: I'm not speaking about kernel device drivers, where C is a better choice.)
The GIL prevents scientific code from executing in parallel. EPICS is the acronym for Experimental Physics and Industrial Control System, so the folks who use it are scientists. I think this may be one source of the confusion about the GIL. Often, the reason to use EPICS is to perform intense computation with the results from EPICS requests. If the computations need to be parallel, Python threading is the wrong tool. For HPC applications, it's important to use packages like multiprocessing, dask, and MPI in Python to avoid issues with the GIL.
Device control code can be written in Python with threading. For asynchronous Python, you can generally ignore the GIL.