False sharing sections

This commit is contained in:
Teiva Harsanyi 2024-03-05 18:55:11 +01:00
parent 9b132ca0db
commit 256481ca0c
32 changed files with 1542 additions and 3 deletions

docs/92-false-sharing.md Normal file

@ -0,0 +1,118 @@
---
title: Writing concurrent code that leads to false sharing (#92)
comments: true
hide:
- toc
---
# Writing concurrent code that leads to false sharing
In previous sections, we have discussed the fundamental concepts of CPU caching. We have seen that some specific caches (typically, L1 and L2) aren't shared among all the logical cores but are specific to a physical core. This specificity has concrete consequences for concurrent code, such as false sharing, which can lead to a significant performance decrease. Let's look at what false sharing is via an example and then see how to prevent it.
In this example, we use two structs, `Input` and `Result`:
```go
type Input struct {
	a int64
	b int64
}

type Result struct {
	sumA int64
	sumB int64
}
```
The goal is to implement a `count` function that receives a slice of `Input` and computes the following:
* The sum of all the `Input.a` fields into `Result.sumA`
* The sum of all the `Input.b` fields into `Result.sumB`
For the sake of the example, we implement a concurrent solution with one goroutine that computes `sumA` and another that computes `sumB`:
```go
func count(inputs []Input) Result {
	wg := sync.WaitGroup{}
	wg.Add(2)

	result := Result{} // Init the result struct

	go func() {
		for i := 0; i < len(inputs); i++ {
			result.sumA += inputs[i].a // Computes sumA
		}
		wg.Done()
	}()

	go func() {
		for i := 0; i < len(inputs); i++ {
			result.sumB += inputs[i].b // Computes sumB
		}
		wg.Done()
	}()

	wg.Wait()
	return result
}
```
We spin up two goroutines: one that iterates over each `a` field and another that iterates over each `b` field. This example is fine from a concurrency perspective. For instance, it doesn't lead to a data race, because each goroutine increments its own variable. But this example illustrates the false sharing concept that degrades expected performance.
Let's look at the main memory. Because `sumA` and `sumB` are allocated contiguously, in most cases (seven out of eight), both variables are allocated to the same memory block:
<figure markdown>
![](img/false-sharing-1.svg)
<figcaption>In this example, sumA and sumB are part of the same memory block.</figcaption>
</figure>
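The "seven out of eight" figure can be checked with a small sketch. It assumes a 64-byte cache line and assumes that the struct may start at any 8-byte-aligned offset within a line (the real allocator may be stricter). The names `cacheLineSize`, `wordSize`, `sameLine`, and `countShared` are illustrative and not part of the original example.

```go
package main

import "fmt"

const (
	cacheLineSize = 64 // assumed cache line size in bytes
	wordSize      = 8  // size and alignment of an int64
)

// sameLine reports whether two adjacent int64 fields fall in the same
// cache line when the first field starts at byte address addr.
func sameLine(addr uintptr) bool {
	return addr/cacheLineSize == (addr+wordSize)/cacheLineSize
}

// countShared walks every 8-byte-aligned starting offset inside one cache
// line and counts the offsets where sumA and sumB share that line.
func countShared() (shared, total int) {
	for off := uintptr(0); off < cacheLineSize; off += wordSize {
		total++
		if sameLine(off) {
			shared++
		}
	}
	return shared, total
}

func main() {
	shared, total := countShared()
	fmt.Printf("%d out of %d\n", shared, total) // prints "7 out of 8"
}
```

Only the last slot (offset 56) pushes `sumB` into the next line, which is where the one-in-eight exception comes from under these assumptions.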
Now, let's assume that the machine contains two cores. In most cases, we should eventually have two threads scheduled on different cores. So if the CPU decides to copy this memory block to a cache line, it is copied twice:
<figure markdown>
![](img/false-sharing-2.svg)
<figcaption>Each block is copied to a cache line on both core 0 and core 1.</figcaption>
</figure>
Both cache lines are replicated because L1D (L1 data) is per core. Recall that in our example, each goroutine updates its own variable: `sumA` on one side, and `sumB` on the other side:
<figure markdown>
![](img/false-sharing-3.svg)
<figcaption>Each goroutine updates its own variable.</figcaption>
</figure>
Because these cache lines are replicated, one of the goals of the CPU is to guarantee cache coherency. For example, if one goroutine updates `sumA` and another reads `sumA` (after some synchronization), we expect our application to get the latest value.
However, our example doesn't do exactly this. Both goroutines access their own variables, not a shared one. We might expect the CPU to know about this and understand that it isn't a conflict, but this isn't the case. When we write a variable that's in a cache, the granularity tracked by the CPU isn't the variable: it's the cache line.
When a cache line is shared across multiple cores and at least one goroutine is a writer, the entire cache line is invalidated. This happens even if the updates are logically independent (for example, `sumA` and `sumB`). This is the problem of false sharing, and it degrades performance.
???+ note
    Internally, a CPU uses the [MESI protocol](https://en.wikipedia.org/wiki/MESI_protocol) to guarantee cache coherency. It tracks each cache line, marking it modified, exclusive, shared, or invalid (MESI).

One of the most important aspects to understand about memory and caching is that sharing memory across cores isn't real; it's an illusion. This understanding comes from the fact that we don't consider a machine a black box; instead, we try to have mechanical sympathy with underlying levels.
So how do we solve false sharing? There are two main solutions.
The first solution is to use the same approach we've shown but ensure that `sumA` and `sumB` aren't part of the same cache line. For example, we can update the `Result` struct to add _padding_ between the fields. Padding is a technique to allocate extra memory. Because an `int64` requires an 8-byte allocation and a cache line is 64 bytes long, we need 64 − 8 = 56 bytes of padding:
```go
type Result struct {
	sumA int64
	_    [56]byte // Padding
	sumB int64
}
```
The next figure shows a possible memory allocation. Using padding, `sumA` and `sumB` will always be part of different memory blocks and hence different cache lines.
<figure markdown>
![](img/false-sharing-4.svg)
<figcaption>sumA and sumB are part of different memory blocks.</figcaption>
</figure>
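One way to double-check the padded layout is to ask the compiler for the field offsets. The sketch below assumes a 64-byte cache line; the `layout` helper and the `cacheLineSize` constant are illustrative names, not part of the original example.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Result is the padded struct from the text.
type Result struct {
	sumA int64
	_    [56]byte // Padding
	sumB int64
}

const cacheLineSize = 64 // assumed cache line size in bytes

// layout returns the byte offsets of sumA and sumB and the struct size.
func layout() (offA, offB, size uintptr) {
	var r Result
	return unsafe.Offsetof(r.sumA), unsafe.Offsetof(r.sumB), unsafe.Sizeof(r)
}

func main() {
	offA, offB, size := layout()
	fmt.Println(offA, offB, size) // prints "0 64 72"

	// The two fields are a full cache line apart, so they cannot
	// land in the same 64-byte line wherever the struct starts.
	fmt.Println(offB-offA >= cacheLineSize) // prints "true"
}
```

The cost is visible in the size: the struct grows from 16 to 72 bytes, which is the price of keeping the two counters on separate lines.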
If we benchmark both solutions (with and without padding), we see that the padding solution is significantly faster (about 40% on my machine). This is an important improvement that results from the addition of padding between the two fields to prevent false sharing.
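A comparison along these lines could be sketched as follows. This is not the benchmark used for the figure above; it is a minimal sketch that uses `testing.Benchmark` from a regular program, with hypothetical helpers (`sumConcurrently`, `countUnpadded`, `countPadded`, `makeInputs`) and an arbitrary input size. Results depend heavily on the machine, the scheduler, and the allocator, so no particular speedup should be expected from it.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

type Input struct {
	a int64
	b int64
}

// Result keeps sumA and sumB adjacent, so they usually share a cache line.
type Result struct {
	sumA int64
	sumB int64
}

// PaddedResult separates the fields (assumes 64-byte cache lines).
type PaddedResult struct {
	sumA int64
	_    [56]byte
	sumB int64
}

// sumConcurrently runs two goroutines, each accumulating into its own target.
func sumConcurrently(inputs []Input, sumA, sumB *int64) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := range inputs {
			*sumA += inputs[i].a
		}
	}()
	go func() {
		defer wg.Done()
		for i := range inputs {
			*sumB += inputs[i].b
		}
	}()
	wg.Wait()
}

func countUnpadded(inputs []Input) Result {
	var r Result
	sumConcurrently(inputs, &r.sumA, &r.sumB)
	return r
}

func countPadded(inputs []Input) PaddedResult {
	var r PaddedResult
	sumConcurrently(inputs, &r.sumA, &r.sumB)
	return r
}

// makeInputs builds n inputs with a = i and b = 2*i.
func makeInputs(n int) []Input {
	inputs := make([]Input, n)
	for i := range inputs {
		inputs[i] = Input{a: int64(i), b: int64(2 * i)}
	}
	return inputs
}

func main() {
	inputs := makeInputs(1_000_000) // arbitrary size for illustration
	unpadded := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			countUnpadded(inputs)
		}
	})
	padded := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			countPadded(inputs)
		}
	})
	fmt.Println("without padding:", unpadded)
	fmt.Println("with padding:   ", padded)
}
```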
The second solution is to rework the structure of the algorithm. For example, instead of having both goroutines share the same struct, we can make them communicate their local results via channels. The benchmark result is roughly the same as with padding.
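A channel-based version might look like the sketch below. It is one possible shape under stated assumptions, not necessarily the implementation that was benchmarked: each goroutine sums into a local variable and sends the total once, and the channel names `chA` and `chB` are illustrative.

```go
package main

import "fmt"

type Input struct {
	a int64
	b int64
}

type Result struct {
	sumA int64
	sumB int64
}

// count sums the a and b fields in two goroutines. Each goroutine
// accumulates into its own local variable and sends the total once,
// so the hot loops never write to memory shared with the other goroutine.
func count(inputs []Input) Result {
	chA := make(chan int64, 1)
	chB := make(chan int64, 1)

	go func() {
		var sum int64 // local to this goroutine
		for i := range inputs {
			sum += inputs[i].a
		}
		chA <- sum
	}()

	go func() {
		var sum int64 // local to this goroutine
		for i := range inputs {
			sum += inputs[i].b
		}
		chB <- sum
	}()

	// Receiving from both channels also waits for both goroutines.
	return Result{sumA: <-chA, sumB: <-chB}
}

func main() {
	inputs := []Input{{a: 1, b: 10}, {a: 2, b: 20}, {a: 3, b: 30}}
	fmt.Println(count(inputs)) // prints "{6 60}"
}
```

The channel receives replace the `sync.WaitGroup`, and the `Result` struct is only written once, by the caller, after both sums are known.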
In summary, we must remember that sharing memory across goroutines is an illusion at the lowest memory levels. False sharing occurs when a cache line is shared across two cores when at least one goroutine is a writer. If we need to optimize an application that relies on concurrency, we should check whether false sharing applies, because this pattern is known to degrade application performance. We can prevent false sharing with either padding or communication.

@ -2080,6 +2080,8 @@ Credits: [@jeromedoucet](https://github.com/jeromedoucet)
Knowing that lower levels of CPU caches aren't shared across all the cores helps avoid performance-degrading patterns such as false sharing while writing concurrent code. Sharing memory is an illusion.
Read the full section [here](92-false-sharing.md).
[:simple-github: Source code](https://github.com/teivah/100-go-mistakes/tree/master/src/12-optimizations/92-false-sharing/)
### Not taking into account instruction-level parallelism (#93)

@ -73,6 +73,7 @@ nav:
- 28-maps-memory-leaks.md
- 56-concurrency-faster.md
- 89-benchmarks.md
- 92-false-sharing.md
- 98-profiling-execution-tracing.md
- community.md
markdown_extensions:

@ -742,6 +742,27 @@
<li class="md-nav__item">
<a href="../92-false-sharing/" class="md-nav__link">
<span class="md-ellipsis">
Writing concurrent code that leads to false sharing (#92)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../98-profiling-execution-tracing/" class="md-nav__link">

@ -676,6 +676,27 @@
<li class="md-nav__item">
<a href="/92-false-sharing/" class="md-nav__link">
<span class="md-ellipsis">
Writing concurrent code that leads to false sharing (#92)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="/98-profiling-execution-tracing/" class="md-nav__link">

@ -799,6 +799,27 @@
<li class="md-nav__item">
<a href="../92-false-sharing/" class="md-nav__link">
<span class="md-ellipsis">
Writing concurrent code that leads to false sharing (#92)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../98-profiling-execution-tracing/" class="md-nav__link">

@ -14,7 +14,7 @@
<link rel="prev" href="../56-concurrency-faster/">
<link rel="next" href="../98-profiling-execution-tracing/">
<link rel="next" href="../92-false-sharing/">
<link rel="icon" href="../img/Go-Logo_LightBlue.svg">
@ -817,6 +817,27 @@
<li class="md-nav__item">
<a href="../92-false-sharing/" class="md-nav__link">
<span class="md-ellipsis">
Writing concurrent code that leads to false sharing (#92)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../98-profiling-execution-tracing/" class="md-nav__link">

@ -11,7 +11,7 @@
<link rel="canonical" href="https://100go.co/98-profiling-execution-tracing/">
<link rel="prev" href="../89-benchmarks/">
<link rel="prev" href="../92-false-sharing/">
<link rel="next" href="../community/">
@ -730,6 +730,27 @@
<li class="md-nav__item">
<a href="../92-false-sharing/" class="md-nav__link">
<span class="md-ellipsis">
Writing concurrent code that leads to false sharing (#92)
</span>
</a>
</li>

@ -803,6 +803,27 @@
<li class="md-nav__item">
<a href="../92-false-sharing/" class="md-nav__link">
<span class="md-ellipsis">
Writing concurrent code that leads to false sharing (#92)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../98-profiling-execution-tracing/" class="md-nav__link">

@ -874,6 +874,27 @@
<li class="md-nav__item">
<a href="../92-false-sharing/" class="md-nav__link">
<span class="md-ellipsis">
Writing concurrent code that leads to false sharing (#92)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../98-profiling-execution-tracing/" class="md-nav__link">

@ -724,6 +724,27 @@
<li class="md-nav__item">
<a href="../92-false-sharing/" class="md-nav__link">
<span class="md-ellipsis">
Writing concurrent code that leads to false sharing (#92)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../98-profiling-execution-tracing/" class="md-nav__link">

@ -953,6 +953,27 @@
<li class="md-nav__item">
<a href="../92-false-sharing/" class="md-nav__link">
<span class="md-ellipsis">
Writing concurrent code that leads to false sharing (#92)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../98-profiling-execution-tracing/" class="md-nav__link">

@ -1934,6 +1934,27 @@
<li class="md-nav__item">
<a href="92-false-sharing/" class="md-nav__link">
<span class="md-ellipsis">
Writing concurrent code that leads to false sharing (#92)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="98-profiling-execution-tracing/" class="md-nav__link">
@ -4756,6 +4777,7 @@ the use case. However, we should see the two options as complementary. </p>
<summary>TL;DR</summary>
<p>Knowing that lower levels of CPU caches aren't shared across all the cores helps avoid performance-degrading patterns such as false sharing while writing concurrent code. Sharing memory is an illusion.</p>
</details>
<p>Read the full section <a href="92-false-sharing/">here</a>.</p>
<p><a href="https://github.com/teivah/100-go-mistakes/tree/master/src/12-optimizations/92-false-sharing/"><span class="twemoji"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 .297c-6.63 0-12 5.373-12 12 0 5.303 3.438 9.8 8.205 11.385.6.113.82-.258.82-.577 0-.285-.01-1.04-.015-2.04-3.338.724-4.042-1.61-4.042-1.61C4.422 18.07 3.633 17.7 3.633 17.7c-1.087-.744.084-.729.084-.729 1.205.084 1.838 1.236 1.838 1.236 1.07 1.835 2.809 1.305 3.495.998.108-.776.417-1.305.76-1.605-2.665-.3-5.466-1.332-5.466-5.93 0-1.31.465-2.38 1.235-3.22-.135-.303-.54-1.523.105-3.176 0 0 1.005-.322 3.3 1.23.96-.267 1.98-.399 3-.405 1.02.006 2.04.138 3 .405 2.28-1.552 3.285-1.23 3.285-1.23.645 1.653.24 2.873.12 3.176.765.84 1.23 1.91 1.23 3.22 0 4.61-2.805 5.625-5.475 5.92.42.36.81 1.096.81 2.22 0 1.606-.015 2.896-.015 3.286 0 .315.21.69.825.57C20.565 22.092 24 17.592 24 12.297c0-6.627-5.373-12-12-12"/></svg></span> Source code</a></p>
<h3 id="not-taking-into-account-instruction-level-parallelism-93">Not taking into account instruction-level parallelism (#93)</h3>
<details class="info" open="open">

@ -718,6 +718,27 @@
<li class="md-nav__item">
<a href="../92-false-sharing/" class="md-nav__link">
<span class="md-ellipsis">
Writing concurrent code that leads to false sharing (#92)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../98-profiling-execution-tracing/" class="md-nav__link">

@ -720,6 +720,27 @@
<li class="md-nav__item">
<a href="../92-false-sharing/" class="md-nav__link">
<span class="md-ellipsis">
Writing concurrent code that leads to false sharing (#92)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../98-profiling-execution-tracing/" class="md-nav__link">

@ -30,6 +30,11 @@
<lastmod>2024-03-05</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://100go.co/92-false-sharing/</loc>
<lastmod>2024-03-05</lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>https://100go.co/98-profiling-execution-tracing/</loc>
<lastmod>2024-03-05</lastmod>
