diff --git a/docs/56-concurrency-faster.md b/docs/56-concurrency-faster.md
index 2ba5018..df8d381 100644
--- a/docs/56-concurrency-faster.md
+++ b/docs/56-concurrency-faster.md
@@ -9,13 +9,13 @@ hide:
 
 ![](img/56-concurrency-faster.png)
 
-A misconception among many developers is believing that a concurrent solution is always faster than a sequential one. This couldn’t be more wrong. The overall performance of a solution depends on many factors, such as the efficiency of our code structure (concurrency), which parts can be tackled in parallel, and the level of contention among the computation units. This post reminds us about some fundamental knowledge of concurrency in Go; then we will see a concrete example where a concurrent solution isn’t necessarily faster.
+A misconception among many developers is that a concurrent solution is always faster than a sequential one. This couldn’t be more wrong. The overall performance of a solution depends on many factors, such as the efficiency of our code structure (concurrency), which parts can be tackled in parallel, and the level of contention among the computation units. This post reminds us about some fundamental knowledge of concurrency in Go; then we will see a concrete example where a concurrent solution isn’t necessarily faster.
 
 ## Go Scheduling
 
 A thread is the smallest unit of processing that an OS can perform. If a process wants to execute multiple actions simultaneously, it spins up multiple threads. These threads can be:
 
-* _Concurrent_ — Two or more threads can start, run, and complete in overlapping time periods.
+* _Concurrent_ — Two or more threads can start, run, and complete in overlapping periods.
 * _Parallel_ — The same task can be executed multiple times at once.
 
 The OS is responsible for scheduling the thread’s processes optimally so that:
@@ -27,7 +27,7 @@ The OS is responsible for scheduling the thread’s processes optimally so that:
 
 The word thread can also have a different meaning at a CPU level. Each physical core can be composed of multiple logical cores (the concept of [hyper-threading](https://en.wikipedia.org/wiki/Hyper-threading)), and a logical core is also called a thread. In this post, when we use the word thread, we mean the unit of processing, not a logical core.
 
-A CPU core executes different threads. When it switches from one thread to another, it executes an operation called _context switching_. The active thread consuming CPU cycles was in an _executing_ state and moves to a _runnable_ state, meaning it’s ready to be executed pending an available core. Context switching is considered an expensive operation because the OS needs to save the current execution state of a thread before the switch (such as the current register values).
+A CPU core executes different threads. When it switches from one thread to another, it executes an operation called _context switching_. The active thread consuming CPU cycles was in an _executing_ state and moved to a _runnable_ state, meaning it was ready to be executed pending an available core. Context switching is considered an expensive operation because the OS needs to save the current execution state of a thread before the switch (such as the current register values).
 
 As Go developers, we can’t create threads directly, but we can create goroutines, which can be thought of as application-level threads. However, whereas an OS thread is context-switched on and off a CPU core by the OS, a goroutine is context-switched on and off an OS thread by the Go runtime. Also, compared to an OS thread, a goroutine has a smaller memory footprint: 2 KB for goroutines from Go 1.4. An OS thread depends on the OS, but, for example, on Linux/x86–32, the default size is 2 MB (see https://man7.org/linux/man-pages/man3/pthread_create.3.html). Having a smaller size makes context switching faster.
 
@@ -46,21 +46,21 @@ Each OS thread (M) is assigned to a CPU core (P) by the OS scheduler. Then, each
 A goroutine has a simpler lifecycle than an OS thread. It can be doing one of the following:
 
 * _Executing_ — The goroutine is scheduled on an M and executing its instructions.
-* _Runnable_ — The goroutine is waiting to be in an executing state.
+* _Runnable_ — The goroutine is waiting to be executed.
 * _Waiting_ — The goroutine is stopped and pending something completing, such as a system call or a synchronization operation (such as acquiring a mutex).
 
 There’s one last stage to understand about the implementation of Go scheduling: when a goroutine is created but cannot be executed yet; for example, all the other Ms are already executing a G. In this scenario, what will the Go runtime do about it? The answer is queuing. The Go runtime handles two kinds of queues: one local queue per P and a global queue shared among all the Ps.
 
-Figure 1 shows a given scheduling situation on a four-core machine with GOMAXPROCS equal to 4. The parts are the logical cores (Ps), goroutines (Gs), OS threads (Ms), local queues, and global queue:
+Figure 1 shows a given scheduling situation on a four-core machine with GOMAXPROCS equal to 4. The parts are the logical cores (Ps), goroutines (Gs), OS threads (Ms), local queues, and global queues:
 ![](img/go-scheduler.png)
-Figure 1: An example of the current state of a Go application executed on a four-core machine. Goroutines that aren’t in an executing state are either runnable (pending being executed) or waiting (pending a blocking operation)
+Figure 1: An example of the current state of a Go application executed on a four-core machine. Goroutines that aren’t executing are either runnable (pending being executed) or waiting (pending a blocking operation)
 First, we can see five Ms, whereas GOMAXPROCS is set to 4. But as we mentioned, if needed, the Go runtime can create more OS threads than the GOMAXPROCS value.
 
-P0, P1, and P3 are currently busy executing Go runtime threads. But P2 is presently idle as M3 is switched off P2, and there’s no goroutine to be executed. This isn’t a good situation because six runnable goroutines are pending being executed, some in the global queue and some in other local queues. How will the Go runtime handle this situation? Here’s the scheduling implementation in pseudocode (see [proc.go](https://github.com/golang/go/blob/go1.17.6/src/runtime/proc.go#L3291)):
+P0, P1, and P3 are currently busy executing Go runtime threads. But P2 is presently idle as M3 is switched off P2, and there’s no goroutine to be executed. This isn’t a good situation because six runnable goroutines are pending execution, some in the global queue and some in other local queues. How will the Go runtime handle this situation? Here’s the scheduling implementation in pseudocode (see [proc.go](https://github.com/golang/go/blob/go1.17.6/src/runtime/proc.go#L3291)):
 
 ```
 runtime.schedule() {
@@ -73,9 +73,9 @@ runtime.schedule() {
 }
 ```
 
-Every sixty-first execution, the Go scheduler will check whether goroutines from the global queue are available. If not, it will check its local queue. Meanwhile, if both the global and local queues are empty, the Go scheduler can pick up goroutines from other local queues. This principle in scheduling is called _work stealing_, and it allows an underutilized processor to actively look for another processor’s goroutines and _steal_ some.
+For every sixty-first execution, the Go scheduler will check whether goroutines from the global queue are available. If not, it will check its local queue. Meanwhile, if both the global and local queues are empty, the Go scheduler can pick up goroutines from other local queues. This principle in scheduling is called _work stealing_, and it allows an underutilized processor to actively look for another processor’s goroutines and _steal_ some.
 
-One last important thing to mention: prior to Go 1.14, the scheduler was cooperative, which meant a goroutine could be context-switched off a thread only in specific blocking cases (for example, channel send or receive, I/O, waiting to acquire a mutex). Since Go 1.14, the Go scheduler is now preemptive: when a goroutine is running for a specific amount of time (10 ms), it will be marked preemptible and can be context-switched off to be replaced by another goroutine. This allows a long-running job to be forced to share CPU time.
+One last important thing to mention: before Go 1.14, the scheduler was cooperative, which meant a goroutine could be context-switched off a thread only in specific blocking cases (for example, channel send or receive, I/O, waiting to acquire a mutex). Since Go 1.14, the Go scheduler is now preemptive: when a goroutine is running for a specific amount of time (10 ms), it will be marked preemptible and can be context-switched off to be replaced by another goroutine. This allows a long-running job to be forced to share CPU time.
 
 Now that we understand the fundamentals of scheduling in Go, let’s look at a concrete example: implementing a merge sort in a parallel manner.
 
@@ -83,7 +83,7 @@ Now that we understand the fundamentals of scheduling in Go, let’s look at a c
 
 First, let’s briefly review how the merge sort algorithm works. Then we will implement a parallel version. Note that the objective isn’t to implement the most efficient version but to support a concrete example showing why concurrency isn’t always faster.
 
-The merge sort algorithm works by breaking a list repeatedly into two sublists until each sublist consists of a single element and then merging these sublists so that the result is a sorted list (see figure 2). Each split operation splits the list into two sublists, whereas the merge operation merges two sublists into a sorted list.
+The merge sort algorithm works by breaking a list repeatedly into two sublists until each sublist consists of a single element and then merging these sublists so that the result is a sorted list (see Figure 2). Each split operation splits the list into two sublists, whereas the merge operation merges two sublists into a sorted list.
 ![](img/mergesort.png)
@@ -154,7 +154,7 @@ If the workload that we want to parallelize is too small, meaning we’re going
 
 So what can we conclude from this result? Does it mean the merge sort algorithm cannot be parallelized? Wait, not so fast.
 
-Let’s try another approach. Because merging a tiny number of elements within a new goroutine isn’t efficient, let’s define a threshold. This threshold will represent how many elements a half should contain in order to be handled in a parallel manner. If the number of elements in the half is fewer than this value, we will handle it sequentially. Here’s a new version:
+Let’s try another approach. Because merging a tiny number of elements within a new goroutine isn’t efficient, let’s define a threshold. This threshold will represent how many elements a half should contain to be handled in a parallel manner. If the number of elements in the half is fewer than this value, we will handle it sequentially. Here’s a new version:
 
 ```go
 const max = 2048 // Defines the threshold
@@ -166,7 +166,7 @@ func parallelMergesortV2(s []int) {
 
     if len(s) <= max {
         sequentialMergesort(s) // Calls our initial sequential version
-    } else { // If bigger than the threshold, keeps the parallel version
+    } else { // If bigger than the threshold, keep the parallel version
         middle := len(s) / 2
 
         var wg sync.WaitGroup
@@ -188,7 +188,7 @@ func parallelMergesortV2(s []int) {
 }
 ```
 
-If the number of elements in the s slice is smaller than max, we call the sequential version. Otherwise, we keep calling our parallel implementation. Does this approach impact the result? Yes, it does:
+If the number of elements in the s slice is smaller than the max, we call the sequential version. Otherwise, we keep calling our parallel implementation. Does this approach impact the result? Yes, it does:
 
 ```
 Benchmark_sequentialMergesort-4 2278993555 ns/op