Update 56-concurrency-faster.md

Siddharth Warrier 2023-10-14 20:36:05 +05:30 committed by GitHub
parent 668d6165fe
commit 0491611a56

@@ -9,13 +9,13 @@ hide:
![](img/56-concurrency-faster.png)
-A misconception among many developers is believing that a concurrent solution is always faster than a sequential one. This couldn't be more wrong. The overall performance of a solution depends on many factors, such as the efficiency of our code structure (concurrency), which parts can be tackled in parallel, and the level of contention among the computation units. This post reminds us about some fundamental knowledge of concurrency in Go; then we will see a concrete example where a concurrent solution isn't necessarily faster.
+A misconception among many developers is that a concurrent solution is always faster than a sequential one. This couldn't be more wrong. The overall performance of a solution depends on many factors, such as the efficiency of our code structure (concurrency), which parts can be tackled in parallel, and the level of contention among the computation units. This post reminds us about some fundamental knowledge of concurrency in Go; then we will see a concrete example where a concurrent solution isn't necessarily faster.
## Go Scheduling
A thread is the smallest unit of processing that an OS can perform. If a process wants to execute multiple actions simultaneously, it spins up multiple threads. These threads can be:
-* _Concurrent_ — Two or more threads can start, run, and complete in overlapping time periods.
+* _Concurrent_ — Two or more threads can start, run, and complete in overlapping periods.
* _Parallel_ — The same task can be executed multiple times at once.
The OS is responsible for scheduling the thread's processes optimally so that:
@@ -27,7 +27,7 @@ The OS is responsible for scheduling the thread's processes optimally so that:
The word thread can also have a different meaning at a CPU level. Each physical core can be composed of multiple logical cores (the concept of [hyper-threading](https://en.wikipedia.org/wiki/Hyper-threading)), and a logical core is also called a thread. In this post, when we use the word thread, we mean the unit of processing, not a logical core.
-A CPU core executes different threads. When it switches from one thread to another, it executes an operation called _context switching_. The active thread consuming CPU cycles was in an _executing_ state and moves to a _runnable_ state, meaning it's ready to be executed pending an available core. Context switching is considered an expensive operation because the OS needs to save the current execution state of a thread before the switch (such as the current register values).
+A CPU core executes different threads. When it switches from one thread to another, it executes an operation called _context switching_. The active thread consuming CPU cycles was in an _executing_ state and moved to a _runnable_ state, meaning it was ready to be executed pending an available core. Context switching is considered an expensive operation because the OS needs to save the current execution state of a thread before the switch (such as the current register values).
As Go developers, we can't create threads directly, but we can create goroutines, which can be thought of as application-level threads. However, whereas an OS thread is context-switched on and off a CPU core by the OS, a goroutine is context-switched on and off an OS thread by the Go runtime. Also, compared to an OS thread, a goroutine has a smaller memory footprint: 2 KB for goroutines from Go 1.4. An OS thread depends on the OS, but, for example, on Linux/x86-32, the default size is 2 MB (see https://man7.org/linux/man-pages/man3/pthread_create.3.html). Having a smaller size makes context switching faster.
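To make the point about goroutines being application-level threads concrete, here is a minimal sketch that starts many goroutines and waits for them. The helper name `runMany` and the count of 10,000 are illustrative assumptions, not part of the original post.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runMany is a hypothetical helper: it starts n goroutines, each of which
// increments a shared counter once, and returns the final count after all
// of them have finished.
func runMany(n int) int64 {
	var (
		wg    sync.WaitGroup
		count int64
	)
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			atomic.AddInt64(&count, 1) // atomic to avoid a data race
		}()
	}
	wg.Wait() // blocks until every goroutine has called Done
	return atomic.LoadInt64(&count)
}

func main() {
	// Starting this many OS threads would be costly; goroutines are
	// multiplexed onto a small number of threads by the Go runtime.
	fmt.Println(runMany(10000))
}
```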
@@ -46,21 +46,21 @@ Each OS thread (M) is assigned to a CPU core (P) by the OS scheduler. Then, each
A goroutine has a simpler lifecycle than an OS thread. It can be doing one of the following:
* _Executing_ — The goroutine is scheduled on an M and executing its instructions.
-* _Runnable_ — The goroutine is waiting to be in an executing state.
+* _Runnable_ — The goroutine is waiting to be executed.
* _Waiting_ — The goroutine is stopped and pending something completing, such as a system call or a synchronization operation (such as acquiring a mutex).
There's one last stage to understand about the implementation of Go scheduling: when a goroutine is created but cannot be executed yet; for example, all the other Ms are already executing a G. In this scenario, what will the Go runtime do about it? The answer is queuing. The Go runtime handles two kinds of queues: one local queue per P and a global queue shared among all the Ps.
-Figure 1 shows a given scheduling situation on a four-core machine with GOMAXPROCS equal to 4. The parts are the logical cores (Ps), goroutines (Gs), OS threads (Ms), local queues, and global queue:
+Figure 1 shows a given scheduling situation on a four-core machine with GOMAXPROCS equal to 4. The parts are the logical cores (Ps), goroutines (Gs), OS threads (Ms), local queues, and global queues:
<figure markdown>
![](img/go-scheduler.png)
-<figcaption>Figure 1: An example of the current state of a Go application executed on a four-core machine. Goroutines that aren't in an executing state are either runnable (pending being executed) or waiting (pending a blocking operation)</figcaption>
+<figcaption>Figure 1: An example of the current state of a Go application executed on a four-core machine. Goroutines that aren't executing are either runnable (pending being executed) or waiting (pending a blocking operation)</figcaption>
</figure>
First, we can see five Ms, whereas GOMAXPROCS is set to 4. But as we mentioned, if needed, the Go runtime can create more OS threads than the GOMAXPROCS value.
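For readers who want to check the GOMAXPROCS value on their own machine, the following sketch queries it through the standard `runtime` package. The wrapper name `schedulerInfo` is an assumption made for illustration; `runtime.NumCPU` and `runtime.GOMAXPROCS` are the standard library calls.

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerInfo is a hypothetical helper that returns the number of logical
// CPUs visible to the process and the current GOMAXPROCS value. Passing 0 to
// runtime.GOMAXPROCS queries the setting without changing it.
func schedulerInfo() (numCPU, maxProcs int) {
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	numCPU, maxProcs := schedulerInfo()
	fmt.Printf("logical CPUs: %d, GOMAXPROCS: %d\n", numCPU, maxProcs)
}
```

The printed values depend on the machine and on the `GOMAXPROCS` environment variable, so no expected output is shown here.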
-P0, P1, and P3 are currently busy executing Go runtime threads. But P2 is presently idle as M3 is switched off P2, and there's no goroutine to be executed. This isn't a good situation because six runnable goroutines are pending being executed, some in the global queue and some in other local queues. How will the Go runtime handle this situation? Here's the scheduling implementation in pseudocode (see [proc.go](https://github.com/golang/go/blob/go1.17.6/src/runtime/proc.go#L3291)):
+P0, P1, and P3 are currently busy executing Go runtime threads. But P2 is presently idle as M3 is switched off P2, and there's no goroutine to be executed. This isn't a good situation because six runnable goroutines are pending execution, some in the global queue and some in other local queues. How will the Go runtime handle this situation? Here's the scheduling implementation in pseudocode (see [proc.go](https://github.com/golang/go/blob/go1.17.6/src/runtime/proc.go#L3291)):
```
runtime.schedule() {
@@ -73,9 +73,9 @@ runtime.schedule() {
}
```
-Every sixty-first execution, the Go scheduler will check whether goroutines from the global queue are available. If not, it will check its local queue. Meanwhile, if both the global and local queues are empty, the Go scheduler can pick up goroutines from other local queues. This principle in scheduling is called _work stealing_, and it allows an underutilized processor to actively look for another processor's goroutines and _steal_ some.
+For every sixty-first execution, the Go scheduler will check whether goroutines from the global queue are available. If not, it will check its local queue. Meanwhile, if both the global and local queues are empty, the Go scheduler can pick up goroutines from other local queues. This principle in scheduling is called _work stealing_, and it allows an underutilized processor to actively look for another processor's goroutines and _steal_ some.
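The pick order described above can be illustrated with a toy model. This is not the Go runtime's code: the types `p` and `sched`, the helper `pop`, and the method `next` are all hypothetical. The model also takes a single goroutine when stealing and ignores network polling, both of which are simplifying assumptions.

```go
package main

import "fmt"

// p is a toy model of a logical processor with its own local run queue.
type p struct {
	local []string // names of runnable goroutines queued on this P
	tick  int      // number of scheduling decisions made so far
}

// sched is a toy model of the scheduler state: one global queue plus the Ps.
type sched struct {
	global []string
	ps     []*p
}

// pop removes and returns the first element of the queue, if any.
func pop(q *[]string) (string, bool) {
	if len(*q) == 0 {
		return "", false
	}
	g := (*q)[0]
	*q = (*q)[1:]
	return g, true
}

// next picks the goroutine that processor i should run next.
func (s *sched) next(i int) (string, bool) {
	cur := s.ps[i]
	cur.tick++
	// Every sixty-first decision, look at the global queue first so that
	// goroutines queued there are not starved by busy local queues.
	if cur.tick%61 == 0 {
		if g, ok := pop(&s.global); ok {
			return g, true
		}
	}
	if g, ok := pop(&cur.local); ok {
		return g, true
	}
	if g, ok := pop(&s.global); ok {
		return g, true
	}
	// Work stealing: both queues are empty, so take from another P.
	for j, other := range s.ps {
		if j == i {
			continue
		}
		if g, ok := pop(&other.local); ok {
			return g, true
		}
	}
	return "", false
}

func main() {
	s := &sched{
		global: []string{"G1"},
		ps: []*p{
			{local: []string{"G2"}}, // P0 has G2 queued locally
			{},                      // P1 is idle with an empty local queue
		},
	}
	for k := 0; k < 3; k++ {
		g, ok := s.next(1) // ask on behalf of the idle P1
		fmt.Println(g, ok)
	}
}
```

In this run, the idle processor first drains the global queue, then steals from P0, and finally finds nothing left to run.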
-One last important thing to mention: prior to Go 1.14, the scheduler was cooperative, which meant a goroutine could be context-switched off a thread only in specific blocking cases (for example, channel send or receive, I/O, waiting to acquire a mutex). Since Go 1.14, the Go scheduler is now preemptive: when a goroutine is running for a specific amount of time (10 ms), it will be marked preemptible and can be context-switched off to be replaced by another goroutine. This allows a long-running job to be forced to share CPU time.
+One last important thing to mention: before Go 1.14, the scheduler was cooperative, which meant a goroutine could be context-switched off a thread only in specific blocking cases (for example, channel send or receive, I/O, waiting to acquire a mutex). Since Go 1.14, the Go scheduler is now preemptive: when a goroutine is running for a specific amount of time (10 ms), it will be marked preemptible and can be context-switched off to be replaced by another goroutine. This allows a long-running job to be forced to share CPU time.
Now that we understand the fundamentals of scheduling in Go, let's look at a concrete example: implementing a merge sort in a parallel manner.
@@ -83,7 +83,7 @@ Now that we understand the fundamentals of scheduling in Go, let's look at a c
First, let's briefly review how the merge sort algorithm works. Then we will implement a parallel version. Note that the objective isn't to implement the most efficient version but to support a concrete example showing why concurrency isn't always faster.
-The merge sort algorithm works by breaking a list repeatedly into two sublists until each sublist consists of a single element and then merging these sublists so that the result is a sorted list (see figure 2). Each split operation splits the list into two sublists, whereas the merge operation merges two sublists into a sorted list.
+The merge sort algorithm works by breaking a list repeatedly into two sublists until each sublist consists of a single element and then merging these sublists so that the result is a sorted list (see Figure 2). Each split operation splits the list into two sublists, whereas the merge operation merges two sublists into a sorted list.
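The split and merge steps described above can be sketched as follows. The name `sequentialMergesort` appears later in this diff, but its body is not shown here, so the implementation below, including the `merge` helper, is an assumption written for illustration rather than the post's own code.

```go
package main

import "fmt"

// sequentialMergesort sorts s in place by recursively sorting each half
// and then merging the two sorted halves.
func sequentialMergesort(s []int) {
	if len(s) <= 1 {
		return // a slice of zero or one element is already sorted
	}
	middle := len(s) / 2
	sequentialMergesort(s[:middle]) // sort the first half
	sequentialMergesort(s[middle:]) // sort the second half
	merge(s, middle)
}

// merge is a hypothetical helper: it assumes s[:middle] and s[middle:] are
// each sorted and merges them back into s.
func merge(s []int, middle int) {
	// Copy both halves because s is overwritten during the merge.
	left := append([]int(nil), s[:middle]...)
	right := append([]int(nil), s[middle:]...)

	i, j, k := 0, 0, 0
	for i < len(left) && j < len(right) {
		if left[i] <= right[j] {
			s[k] = left[i]
			i++
		} else {
			s[k] = right[j]
			j++
		}
		k++
	}
	// Copy whatever remains in either half.
	for ; i < len(left); i, k = i+1, k+1 {
		s[k] = left[i]
	}
	for ; j < len(right); j, k = j+1, k+1 {
		s[k] = right[j]
	}
}

func main() {
	s := []int{5, 2, 9, 1, 5, 6}
	sequentialMergesort(s)
	fmt.Println(s) // prints [1 2 5 5 6 9]
}
```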
<figure markdown>
![](img/mergesort.png)
@@ -154,7 +154,7 @@ If the workload that we want to parallelize is too small, meaning we're going
So what can we conclude from this result? Does it mean the merge sort algorithm cannot be parallelized? Wait, not so fast.
-Let's try another approach. Because merging a tiny number of elements within a new goroutine isn't efficient, let's define a threshold. This threshold will represent how many elements a half should contain in order to be handled in a parallel manner. If the number of elements in the half is fewer than this value, we will handle it sequentially. Here's a new version:
+Let's try another approach. Because merging a tiny number of elements within a new goroutine isn't efficient, let's define a threshold. This threshold will represent how many elements a half should contain to be handled in a parallel manner. If the number of elements in the half is fewer than this value, we will handle it sequentially. Here's a new version:
```go
const max = 2048 // Defines the threshold
@@ -166,7 +166,7 @@ func parallelMergesortV2(s []int) {
	if len(s) <= max {
		sequentialMergesort(s) // Calls our initial sequential version
-	} else { // If bigger than the threshold, keeps the parallel version
+	} else { // If bigger than the threshold, keep the parallel version
		middle := len(s) / 2
		var wg sync.WaitGroup
@@ -188,7 +188,7 @@ func parallelMergesortV2(s []int) {
}
```
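Because the diff above shows only fragments of the function, here is a self-contained sketch of the threshold approach. The names `max`, `parallelMergesortV2`, and `sequentialMergesort` come from the fragments; everything elided by the diff, including the goroutine bodies and the `merge` helper, is filled in as an assumption and may differ from the post's actual code.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

const max = 2048 // threshold below which sorting stays sequential

// parallelMergesortV2 sorts s in place, spawning goroutines only for
// halves larger than the threshold.
func parallelMergesortV2(s []int) {
	if len(s) <= 1 {
		return
	}
	if len(s) <= max {
		sequentialMergesort(s) // small input: goroutine overhead is not worth it
		return
	}
	middle := len(s) / 2

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // sort the first half concurrently
		defer wg.Done()
		parallelMergesortV2(s[:middle])
	}()
	go func() { // sort the second half concurrently
		defer wg.Done()
		parallelMergesortV2(s[middle:])
	}()
	wg.Wait() // both halves must be sorted before merging

	merge(s, middle)
}

// sequentialMergesort is the plain recursive version (assumed body).
func sequentialMergesort(s []int) {
	if len(s) <= 1 {
		return
	}
	middle := len(s) / 2
	sequentialMergesort(s[:middle])
	sequentialMergesort(s[middle:])
	merge(s, middle)
}

// merge is a hypothetical helper that merges the sorted halves
// s[:middle] and s[middle:] back into s.
func merge(s []int, middle int) {
	left := append([]int(nil), s[:middle]...)
	right := append([]int(nil), s[middle:]...)
	i, j, k := 0, 0, 0
	for i < len(left) && j < len(right) {
		if left[i] <= right[j] {
			s[k] = left[i]
			i++
		} else {
			s[k] = right[j]
			j++
		}
		k++
	}
	k += copy(s[k:], left[i:]) // leftover elements from the left half
	copy(s[k:], right[j:])     // leftover elements from the right half
}

func main() {
	// Build a descending slice large enough to exceed the threshold.
	s := make([]int, 10000)
	for i := range s {
		s[i] = len(s) - i
	}
	parallelMergesortV2(s)
	fmt.Println(sort.IntsAreSorted(s))
}
```

The two halves are disjoint sub-slices of the same backing array, so the goroutines never write to the same elements, and `wg.Wait()` ensures that the merge only starts once both are done.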
-If the number of elements in the s slice is smaller than max, we call the sequential version. Otherwise, we keep calling our parallel implementation. Does this approach impact the result? Yes, it does:
+If the number of elements in the s slice is smaller than the max, we call the sequential version. Otherwise, we keep calling our parallel implementation. Does this approach impact the result? Yes, it does:
```
Benchmark_sequentialMergesort-4 2278993555 ns/op