Notes on Chapter 19 of *Fluent Python: Clear, Concise, and Effective Programming* by Luciano Ramalho.
Chapter 19. Concurrency Models in Python
concurrency vs parallelism; informally speaking
concurrency: dealing with multiple things at once \(\implies\) it’s about the structure of a solution
the structure provided by a concurrent solution may help solve a problem (though not necessarily) in a parallelized fashion.
parallelism: doing lots of things at once \(\implies\) it’s about the execution of the solution
in this informal view, parallelism is a special case of concurrency, so parallel \(\implies\) concurrent
Python’s three approaches to concurrency: threads, processes, and native coroutines.
python’s fitness for concurrent and parallel computing is not limited to what the std lib provides. Python can scale.
What’s New in This Chapter #
The Big Picture #
factor of difficulty when writing concurrent programs: starting threads or processes is easy enough, but how do you keep track of them?
in non-concurrent programs, a function call is blocking: the caller waits and receives the result (or error) directly, which is convenient for us
in concurrent programs, calls are non-blocking, so we need to rely on some form of communication to get back results or errors
starting a thread or process is not cheap \(\implies\) amortize costs by reusing “worker” threads/procs \(\implies\) coordinating them is tough, e.g. how do we terminate them?
this is still resolved using messages and queues
coroutines are useful:
- cheap to start
- returns values
- can be safely cancelled
- specific area to catch exceptions
But they have problems:
they’re handled by the async framework \(\implies\) harder to monitor than threads/procs
not good for CPU-intensive tasks
A Bit of Jargon #
Concurrency: ability to handle multiple pending tasks (each eventually succeeding or failing) \(\implies\) can multitask
Parallelism: ability to carry out multiple computations at the same time \(\implies\) multicore CPU, multiple CPUs, GPU, multiple computers in a cluster
Execution Unit: objects executing concurrent code. Each has independent state and call stack
Python execution units:
processes
definition:
an instance of a computer program while it’s running, using memory and CPU time-slices; each process has its own private memory space
communication:
objects must be serialised to raw bytes to pass from one proc to another; communication happens via pipes, sockets, or memory-mapped files
spawning:
can spawn child procs which are all isolated from the parent
scheduling:
can be pre-emptively scheduled; the idea is that a frozen proc won’t freeze the whole system
threads
definition:
execution unit within a single process
consumes fewer resources than a process doing the same job
lifecycle:
@ start of process, there’s a single thread. Procs can create more threads by calling OS APIs
Shared Memory management:
Threads within a process share the same memory space, which holds live Python objects. Shared memory may be corrupted via read/write race conditions
Supervision:
also supervised by the OS scheduler; threads enable pre-emptive multitasking
coroutines
Definition:
A function that can suspend itself and resume later.
Classic Coroutines: built from generator functions
Native Coroutines: defined using `async def`
Supervising coroutines:
Typically, coroutines run within a single thread, supervised by an event loop that is in the same thread.
Async frameworks provide an event loop and supporting libraries that enable nonblocking, coroutine-based I/O
Scheduling & Cooperative Multitasking:
each coroutine must explicitly cede control with the `yield` or `await` keyword, so that another may proceed concurrently (but not in parallel)
so if there’s any blocking code in a coroutine, it blocks the execution of the event loop and hence all other coroutines
this contrasts with the preemptive multitasking supported by procs and threads.
nevertheless, a coroutine consumes fewer resources than a thread or proc doing the same job
Mechanisms useful to us:
Queue:
purpose:
allow separate execution units to exchange application data and control messages, such as error codes and signals to terminate.
implementation:
depends on concurrency model:
the stdlib `queue` module provides queue classes to support threads, including non-FIFO queues like `LifoQueue` and `PriorityQueue`
the `multiprocessing` and `asyncio` packages have their own queue classes; `asyncio` also provides non-FIFO queues like `LifoQueue` and `PriorityQueue`
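A minimal sketch of the three stdlib `queue` disciplines mentioned above (single-threaded here, just to show ordering):

```python
import queue

# FIFO: items come out in insertion order
fifo = queue.Queue()
for n in (1, 2, 3):
    fifo.put(n)
fifo_order = [fifo.get() for _ in range(3)]   # first in, first out

# LIFO: last in, first out
lifo = queue.LifoQueue()
for n in (1, 2, 3):
    lifo.put(n)
lifo_order = [lifo.get() for _ in range(3)]

# PriorityQueue: smallest item comes out first (tuples compare element-wise)
prio = queue.PriorityQueue()
for item in ((2, 'mid'), (3, 'low'), (1, 'high')):
    prio.put(item)
prio_order = [prio.get() for _ in range(3)]
```

All three classes are thread-safe, which is the point: producers and consumers in different threads can share them without extra locking.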
Lock:
purpose:
a synchronization object that execution units use to coordinate their actions and avoid corrupting data
While updating a shared data structure, the running code should hold an associated lock.
implementation:
depends on the concurrency model
simplest form of a lock is just a mutex
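A minimal sketch of lock usage with `threading.Lock` (the `increment` helper is made up for illustration): each thread holds the lock while updating the shared counter, so no updates are lost.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times: int) -> None:
    global counter
    for _ in range(times):
        with lock:          # hold the lock while mutating shared state
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# with the lock, the result is deterministic: 4 * 10_000 == 40_000
```

Without the `with lock:` line, the read-modify-write of `counter += 1` can interleave between threads and silently drop increments.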
Contention: dispute over a limited asset
Resource Contention
When multiple execution units try to access a shared resource (e.g. a lock or storage)
CPU Contention
Compute-intensive procs / threads must wait for the OS scheduler to give them a share of CPU time
Processes, Threads, and Python’s Infamous GIL #
Here are 10 points that consolidate info about Python’s concurrency support:
Instance of python interpreter \(\implies\) a process
We can create additional Python processes \(\leftarrow\) use the `multiprocessing` or `concurrent.futures` libraries
We can also start sub-processes that run any other external programs \(\leftarrow\) using the `subprocess` library
The interpreter runs the user program and the GC in a single thread. We can start additional threads using the `threading` or `concurrent.futures` libraries.
GIL (Global Interpreter Lock) controls internal interpreter state (process state shared across threads) and access to object ref counts.
Only one python thread can hold the GIL at any time \(\implies\) only one thread can execute Python code at any time, regardless the number of CPU cores.
GIL is NOT part of the Python language definition; it’s a CPython implementation detail, which is critical to remember for portability reasons.
Default release of the GIL @ an interval:
Prevents any particular thread from holding the GIL indefinitely.
It’s the bytecode interpreter that pauses the current thread every 5 ms by default (configurable), and the OS scheduler picks which thread gets the GIL next (possibly the same thread that just released it).
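That 5 ms interval is exposed through `sys`; a minimal sketch of reading and tuning it:

```python
import sys

# the default switch interval is 0.005 s (5 ms) on stock CPython
default = sys.getswitchinterval()

# a longer interval means fewer forced thread switches (less overhead,
# but a CPU-bound thread can hog the GIL for longer)
sys.setswitchinterval(0.01)

sys.setswitchinterval(default)   # restore the default
```

Changing this is rarely needed; it only shifts the trade-off between switching overhead and responsiveness of competing threads.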
Python source code can’t control the GIL but extension / builtin written in C (or lang that interfaces at the Python/C API level) can release the GIL when it’s running time-consuming tasks.
Every Python stdlib function that makes a syscall (for kernel services) releases the GIL. This avoids contention for resources (memory as well as CPU). Examples:
functions that perform I/O operations (disk, network, sleep)
CPU-intensive functions in C extensions (e.g. `NumPy`/`SciPy`, and compressing/decompressing functions like `zlib`, `bz2`) also release the GIL
GIL-free threads:
can only be launched by extensions that integrate at the Python/C API level
can’t change Python objects in general, but can read/write to objects that support the buffer protocol (`bytearray`, `array.array`, `NumPy` arrays)
GIL-free Python is under experimentation at the moment (but not mainstream)
Network I/O is GIL-insensitive
GIL minimally affects network programming because Network I/O is higher latency than memory I/O.
each individual thread would have spent a long time waiting anyway, so interleaving their execution doesn’t majorly impact the overall throughput.
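A minimal sketch of why this works: `time.sleep()` stands in for a network wait (both release the GIL), and the `fake_request` helper is made up for illustration. Five 0.1 s waits overlap across threads instead of adding up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i: int) -> int:
    time.sleep(0.1)   # stand-in for a network wait; releases the GIL
    return i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fake_request, range(5)))
elapsed = time.perf_counter() - start
# elapsed is close to 0.1 s, not the 0.5 s a sequential loop would take
```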
Compute-intensive python threads \(\implies\) will be slowed down by GIL contention.
Better to use sequential, single-threaded code here. Faster and simpler.
running CPU-intensive Python code on multiple cores requires multiple Python processes.
Extra Notes: #
Coroutines are not affected by the GIL
by default they share the same Python thread among themselves and with the supervising event loop provided by an asynchronous framework, therefore the GIL does not affect them.
We technically can use multiple threads in an async program. This is not best practice.
Typically, we have one coordinating thread running the event loops, which delegates to additional threads that carry out specific tasks.
KIV “delegating tasks to executors”
A Concurrent Hello World #
- a demo of how python can “walk and chew gum”, using multiple approaches:
`multiprocessing`, `threading`, `asyncio`
Spinner with Threads #
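The code listing didn’t survive the capture; below is a reconstruction along the lines of the book’s `spinner_thread.py` (assumption: the sleep in `slow` is shortened here to 1 s; the book uses 3 s):

```python
import itertools
import time
from threading import Thread, Event

def spin(msg: str, done: Event) -> None:
    for char in itertools.cycle(r'\|/-'):
        status = f'\r{char} {msg}'        # \r moves the cursor back to column 0
        print(status, end='', flush=True)
        if done.wait(.1):                 # returns True as soon as done is set
            break
    blanks = ' ' * len(status)
    print(f'\r{blanks}\r', end='')        # wipe the status line

def slow() -> int:
    time.sleep(1)   # sleep releases the GIL, so the spinner thread keeps running
    return 42

def supervisor() -> int:
    done = Event()
    spinner = Thread(target=spin, args=('thinking!', done))  # <3> spawn thread
    spinner.start()
    result = slow()     # blocks the main thread while the spinner animates
    done.set()          # signal spin to exit its loop
    spinner.join()      # wait until the spinner thread finishes
    return result

def main() -> None:
    result = supervisor()
    print(f'Answer: {result}')

if __name__ == '__main__':
    main()
```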
Notes:
within `slow()`, `time.sleep()` blocks the calling thread but releases the GIL, so other Python threads (in this case our secondary thread for the spinner) can run
`spin` and `slow` execute concurrently; the supervisor coordinates the threads using an instance of `threading.Event`
creating threads:
create a new `Thread`, provide a function as the `target` keyword argument, and positional arguments to the target as a tuple passed via `args`:
`spinner = Thread(target=spin, args=('thinking!', done))  # <3> spawn thread`
we can also pass in kwargs using the `kwargs` named parameter of the `Thread` constructor
`threading.Event`: Python’s simplest signalling mechanism to coordinate threads.
an Event instance has an internal boolean flag that starts as `False`; calling `Event.set()` sets the flag to `True`
when the flag is `False` (unset):
if a thread calls `Event.wait()`, it is blocked until another thread calls `Event.set()`, at which point `Event.wait()` returns `True`
if a timeout is given, `Event.wait(s)` returns `False` when the timeout elapses
as soon as another thread calls `Event.set()`, the wait call returns `True`
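A minimal sketch of those `Event` semantics (the `setter` helper is made up for illustration):

```python
import threading

event = threading.Event()

# flag starts False: wait(timeout) returns False once the timeout elapses
timed_out = event.wait(0.01)

def setter() -> None:
    event.set()   # flips the internal flag to True

threading.Thread(target=setter).start()

# now wait() returns True (immediately, once the flag is set)
result = event.wait(1.0)
```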
TRICK: for text-mode animation: move the cursor back to the start of the line with the carriage return ASCII control character (
'\r').
Spinner with Processes #
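The listing is missing here; below is a reconstruction along the lines of the book’s `spinner_proc.py` (assumption: the sleep in `slow` is shortened here to 1 s; the book uses 3 s). Note how little changes versus the thread version: `Thread` becomes `Process`, and the `Event` comes from `multiprocessing`.

```python
import itertools
import time
from multiprocessing import Process, Event      # Event here is a factory function
from multiprocessing import synchronize         # needed only for the type hint

def spin(msg: str, done: synchronize.Event) -> None:
    for char in itertools.cycle(r'\|/-'):
        status = f'\r{char} {msg}'
        print(status, end='', flush=True)
        if done.wait(.1):
            break
    blanks = ' ' * len(status)
    print(f'\r{blanks}\r', end='')

def slow() -> int:
    time.sleep(1)
    return 42

def supervisor() -> int:
    done = Event()
    spinner = Process(target=spin, args=('thinking!', done))  # child process
    spinner.start()
    result = slow()
    done.set()          # the Event state crosses the process boundary
    spinner.join()
    return result

if __name__ == '__main__':      # required on platforms that spawn processes
    print(f'Answer: {supervisor()}')
```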
the `multiprocessing` package supports running concurrent tasks in separate Python processes instead of threads
each instance has its own Python interpreter; procs will be working in the background
each proc has its own GIL \(\implies\) we can exploit our multicore CPU well because of this (depends on the OS scheduler though)
the `multiprocessing` API emulates the `threading` API \(\implies\) can easily convert between them
Comparing the `multiprocessing` and `threading` APIs:
similarities:
Event objects are similar in how they function with the bit setting / unsetting
Event objects can wait on timeouts
differences:
the Event type differs between them:
`multiprocessing.Event` is a function (not a class like `threading.Event`)
`multiprocessing` has a larger API because it’s more complex
e.g. Python objects that need to be communicated across processes must be serialized/deserialized, because processes are isolated at the OS level; this adds overhead
the `Event` state is the only cross-process state being shared; it’s implemented via an OS semaphore
memory sharing can be done via `multiprocessing.shared_memory`. Only raw bytes; we can use a `ShareableList` (a mutable sequence) with a fixed number of items of some primitive types, up to 10 MB per item.
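A minimal single-process sketch of `ShareableList` (in real use, a second process would attach to the block by its `shm.name`):

```python
from multiprocessing import shared_memory

# fixed number of items; each item is a primitive (str, bytes, int, float, bool, None)
sl = shared_memory.ShareableList(['prime', 42, 3.14, True, None])
try:
    first = sl[0]       # read an item back
    sl[1] = 43          # items are mutable, but the list length is fixed
    second = sl[1]
finally:
    sl.shm.close()      # release this process's handle
    sl.shm.unlink()     # free the underlying shared memory block
```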
Spinner with Coroutines #
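The listing is missing here; below is a reconstruction along the lines of the book’s `spinner_async.py` (assumption: the sleep in `slow` is shortened here to 1 s; the book uses 3 s):

```python
import asyncio
import itertools

async def spin(msg: str) -> None:
    for char in itertools.cycle(r'\|/-'):
        status = f'\r{char} {msg}'
        print(status, flush=True, end='')
        try:
            await asyncio.sleep(.1)          # nonblocking sleep cedes control
        except asyncio.CancelledError:       # raised here by spinner.cancel()
            break
    blanks = ' ' * len(status)
    print(f'\r{blanks}\r', end='')

async def slow() -> int:
    await asyncio.sleep(1)   # never use time.sleep() here: it would block the loop
    return 42

async def supervisor() -> int:
    spinner = asyncio.create_task(spin('thinking!'))   # schedule spin, don't wait
    result = await slow()                              # hand control over to slow
    spinner.cancel()                                   # raise CancelledError in spin
    return result

def main() -> None:
    result = asyncio.run(supervisor())   # start the event loop, drive supervisor
    print(f'Answer: {result}')

if __name__ == '__main__':
    main()
```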
who manages the event loop?
for threads and processes, it’s the OS Scheduler
for coroutines, it’s an application-level event loop
drives coroutines one by one, manages queue of pending coroutines, passes control back to corresponding coroutine when each event happens
all of these execute in a single thread: event loop, library coroutines, user coroutines
that’s why blocking code inside a coroutine blocks everything
Concurrency is achieved by control passing from one coroutine to another.
Python code using `asyncio` has only one flow of execution, unless you’ve explicitly started additional threads or processes. This means only one coroutine executes at any point in time.
Concurrency is achieved by control passing from one coroutine to another. This happens when we use the `await` keyword.
Remember: when using `asyncio` coroutines, if we ever need some idle time, use the non-blocking sleep (`asyncio.sleep(DELAY)`) instead of the blocking one (`time.sleep()`)
explaining the example:
`asyncio.run` starts the event loop and drives the coroutine (`supervisor`) that sets the other coroutines in motion. `supervisor` blocks `main` until it’s done; `asyncio.run` returns what `supervisor` returns
`await` calls `slow` and blocks `supervisor` until `slow` returns
I think it’s easier to see it as a control-flow handover to `slow`. That’s why it’s blocking, and why, when control flow returns, we carry on with the assignment.
`Task.cancel()` raises `CancelledError` inside the coro task
NOTE: if we directly invoke a coro like `coro()`, it immediately returns a coroutine object but doesn’t run the body of the `coro` function
the coro needs to be driven by an event loop
We see 3 ways to run a coro (driven by an event loop):
`asyncio.run(coro())`:
called from a regular function
usually the first coro is the entry point, the supervisor
the return value of `run` is whatever the body of `coro` returns
`asyncio.create_task(coro())`:
called from a coroutine; returns a `Task` instance that wraps the coro and provides methods to control and query its state
schedules another coroutine to eventually run
does not suspend the current coroutine
`await coro()`:
- transfers control from the current coro to the coroutine object returned by `coro()`
- suspends the current coro until the other coro returns
- the value of the `await` expression is whatever the body of `coro` returns
Supervisors Side-by-Side #
`asyncio.Task` vs `threading.Thread` (roughly equivalent): a `Task` drives a coroutine object, a `Thread` invokes a callable
yielding control: a coroutine yields explicitly with `await`
we don’t instantiate `Task` objects ourselves; we get them by calling `asyncio.create_task()`
explicit scheduling: `create_task` gives a `Task` object that is already scheduled to run; a `Thread` instance must be explicitly told to run via `start()`
Termination:
threads can’t be terminated from the outside; we can only send a signal (e.g. setting the `done` `Event`)
tasks can be cancelled from the outside with `Task.cancel()`, which raises `CancelledError` at the `await` expression where the coro body is currently suspended
this works because coros are always in sync: only one of them runs at any time, so the outside can safely cancel one, vs. merely suggesting termination via a signal
Instead of holding locks to synchronize the operations of multiple threads, coroutines are “synchronized” by definition: only one of them is running at any time.
with coroutines, code is protected against interruption by default, because a context switch can only happen at an explicit `await`
The Real Impact of the GIL #
Quick Quiz #
the main question here is whether the mechanisms are interruptible by the entity that coordinates the control flow.
processes are controlled by the OS scheduler, so they are interruptible \(\implies\) the `multiprocessing` version will still carry on as usual
threads are also controlled by the OS scheduler, and the GIL is released at a default interval, so this works in our favour \(\implies\) the threading approach will show no noticeable difference
this effect is negligible only because the number of threads was minimal (2). With more threads, it may become visible.
the asyncio coroutine version will be blocked by this compute-intensive call.
we can try this hack though: make `is_prime` a coroutine and `await asyncio.sleep(0)` periodically to yield control flow. This is slow though.
```python
# spinner_prime_async_nap.py
# credits: Example by Luciano Ramalho inspired by
# Michele Simionato's multiprocessing example in the python-list:
# https://mail.python.org/pipermail/python-list/2009-February/675659.html

import asyncio
import itertools
import math
import functools

# tag::PRIME_NAP[]
async def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False

    root = math.isqrt(n)
    for i in range(3, root + 1, 2):
        if n % i == 0:
            return False
        if i % 100_000 == 1:
            await asyncio.sleep(0)  # <1>
    return True
# end::PRIME_NAP[]

async def spin(msg: str) -> None:
    for char in itertools.cycle(r'\|/-'):
        status = f'\r{char} {msg}'
        print(status, flush=True, end='')
        try:
            await asyncio.sleep(.1)
        except asyncio.CancelledError:
            break
    blanks = ' ' * len(status)
    print(f'\r{blanks}\r', end='')

async def check(n: int) -> int:
    return await is_prime(n)

async def supervisor(n: int) -> int:
    spinner = asyncio.create_task(spin('thinking!'))
    print('spinner object:', spinner)
    result = await check(n)
    spinner.cancel()
    return result

def main() -> None:
    n = 5_000_111_000_222_021
    result = asyncio.run(supervisor(n))
    msg = 'is' if result else 'is not'
    print(f'{n:,} {msg} prime')

if __name__ == '__main__':
    main()
```

Using `await asyncio.sleep(0)` should be considered a stopgap measure before you refactor your asynchronous code to delegate CPU-intensive computations to another process.
A Homegrown Process Pool #
Process-Based Solution #
- starts a number of worker processes equal to the number of CPU cores, as determined by `multiprocessing.cpu_count()`
- some overhead in spinning up processes and in inter-process communication
Understanding the Elapsed Times #
Code for the Multicore Prime Checker #
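The listing is missing here; the sketch below condenses the shape of the book’s `procs.py` pattern (worker processes, job/result queues, poison pills). Assumptions: `0` is used as the poison pill, the sample numbers are small stand-ins for the book’s large primes, and timing code is omitted.

```python
import math
from multiprocessing import Process, SimpleQueue, cpu_count
from typing import NamedTuple

class PrimeResult(NamedTuple):
    n: int
    prime: bool

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    root = math.isqrt(n)
    return all(n % i for i in range(3, root + 1, 2))

def worker(jobs, results) -> None:
    while n := jobs.get():                      # 0 (the poison pill) ends the loop
        results.put(PrimeResult(n, is_prime(n)))
    results.put(PrimeResult(0, False))          # acknowledge the pill to the main proc

def start_jobs(procs, numbers, jobs, results) -> None:
    for n in numbers:
        jobs.put(n)
    for _ in range(procs):
        Process(target=worker, args=(jobs, results)).start()
        jobs.put(0)                             # one poison pill per worker

def check_all(numbers):
    procs = cpu_count()                         # one worker per CPU core
    jobs, results = SimpleQueue(), SimpleQueue()
    start_jobs(procs, numbers, jobs, results)
    checked, done = [], 0
    while done < procs:                         # stop once every worker acknowledged
        result = results.get()
        if result.n == 0:
            done += 1
        else:
            checked.append(result)
    return checked

if __name__ == '__main__':
    for res in sorted(check_all([2, 3, 4, 5, 9_999_991])):
        print(res)
```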
when delegating computing to threads / procs, code doesn’t call the worker function directly
the worker is driven by the thread or proc library
the worker eventually produces a result that is stored somewhere
worker coordination & result collection are common uses of queues in concurrent programming
IDIOM: loops, sentinels and poison pills:
the `worker` function shows a common concurrent programming pattern:
we loop indefinitely, taking items from a queue and processing each with a fn that does the actual work (`check`)
we end the loop when the queue produces a sentinel value
the sentinel value that shuts down a worker is often called a poison pill
TRICK/IDIOM: poison pilling to signal the worker to finish
notice the use of the poison-pill in point 8 of the code above
common sentinels (here’s a comment thread on sentinels):
`None`: but it may not work if the data stream can legitimately produce `None`
`object()`: a common sentinel, but Python objects must be serialised for IPC, so when we `pickle.dump` and `pickle.load` an object, the unpickled instance is distinct from the original and doesn’t compare equal
⭐️ the `...` (`Ellipsis`) builtin is a good option: it survives serialisation without losing its identity
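A minimal sketch of why identity matters for sentinels crossing a pickle boundary (as in IPC):

```python
import pickle

# a plain object() sentinel loses its identity across pickling:
marker = object()
object_survives = pickle.loads(pickle.dumps(marker)) is marker   # distinct instance

# Ellipsis is a singleton, so it keeps its identity through pickling:
ellipsis_survives = pickle.loads(pickle.dumps(...)) is ...

# None survives too, but may collide with legitimate None values in the stream:
none_survives = pickle.loads(pickle.dumps(None)) is None
```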
Debugging concurrent code is always hard, and debugging multiprocessing is even harder because of all the complexity behind the thread-like façade.
Experimenting with More or Fewer Processes #
- typically, beyond the number of cores available to us, we should expect runtime to increase because of CPU contention
Thread-Based Nonsolution #
Due to the GIL and the compute-intensive nature of `is_prime`, the threaded version is slower than the sequential code
it gets slower as the number of threads increase, because of CPU contention and the cost of context switching.
context-switching overhead: all the stack-frame and register changes required are what cause the extra overhead
KIV: managing threads and processes using `concurrent.futures` (chapter 20) and doing async programming using `asyncio` (chapter 21)
Python in the Multicore World #
GIL makes the interpreter faster when running on a single core, and its implementation simpler. It was a no-brainer when CPU performance didn’t hinge on concurrency.
Despite the GIL, Python is thriving in applications that require concurrent or parallel execution, thanks to libraries and software architectures that work around the limitations of CPython.
System Administration #
use cases: manage hardware like NAS, use it for SDN (software defined networking), hacking
python scripts help with these tasks, commanding remote machines \(\implies\) aren’t really CPU bound operations \(\implies\) Threads & Coroutines are Good for this
we can use `concurrent.futures` to perform the same operation on multiple remote machines at the same time without much complexity
Data Science #
- compute-intensive applications, supported by an ecosystem of libs that can leverage multicore machines, GPUs, and distributed parallel computing in heterogeneous clusters
- some libs:
- project jupyter
- tensorflow (Google) and pytorch (Facebook)
- dask: parallel computing lib to coordinate work on clusters
Server-Side Web/Mobile Development #
- both for app caches and HTTP caches (CDNs)
WSGI Application Servers #
WSGI is a standard API for a Python framework or application to receive requests from an HTTP server and send responses to it.
WSGI application servers manage one or more procs running your application, maximising the use of available CPUs
main point: all of these application servers can potentially use all CPU cores on the server by forking multiple Python processes to run traditional web apps written in good old sequential code in Django, Flask, Pyramid, etc. This explains why it’s been possible to earn a living as a Python web developer without ever studying the threading, multiprocessing, or asyncio modules: the application server handles concurrency transparently.
Distributed Task Queues #
Distributed Task Queues wrap a message queue and offer a high-level API for delegating tasks to workers, possibly running on different machines.
use cases:
run background jobs
trigger jobs after responding to the web request
async retries to ensure something is done
scheduled jobs
e.g. Django view handler produces job requests which are put in the queue to be consumed by one or more PDF rendering processes
Supports horizontal scalability
producers and consumers are decoupled
I’ve used Celery before!!
Chapter Summary #
the demo on the effect of the GIL
demonstrated graphically that CPU-intensive functions must be avoided in asyncio, as they block the event loop.
the prime demo highlighted the difference between multiprocessing and threading, proving that only processes allow Python to benefit from multicore CPUs.
GIL makes threads worse than sequential code for heavy computations.
Further Reading #
Concurrency with Threads and Processes #
this was the introduction of the `multiprocessing` library via a PEP, one of the longer PEPs written
divide-and-conquer approaches for splitting jobs on clusters, vs. server-side systems where it’s simpler and more efficient to let each process work on one computation from start to finish, reducing the overhead from IPC
this will likely be a useful read for high-performance Python