It’s pretty common for Go developers to mix up slice length and capacity or not understand them thoroughly. Understanding these two concepts is essential for efficiently handling core operations such as slice initialization, adding elements with append, copying, or slicing. Misunderstanding them can lead to using slices suboptimally or even to memory leaks.
+
In Go, a slice is backed by an array. That means the slice’s data is stored contiguously in an array data structure. Together with the append built-in, a slice also handles the logic of adding an element when the backing array is full by moving the data to a larger array; the backing array is never shrunk automatically.
+
Internally, a slice holds a pointer to the backing array plus a length and a capacity. The length is the number of elements the slice contains, whereas the capacity is the number of elements in the backing array, counting from the first element in the slice. Let’s go through a few examples to make things clearer. First, let’s initialize a slice with a given length and capacity:
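A minimal sketch of such an initialization, assuming the three-length, six-capacity slice of int that the figures in this section describe. The describe helper is hypothetical and only reports the two values:

```go
package main

import "fmt"

// describe is a hypothetical helper that reports a slice's length and capacity.
func describe(s []int) (length, capacity int) {
	return len(s), cap(s)
}

func main() {
	s := make([]int, 3, 6) // Length 3, capacity 6
	l, c := describe(s)
	fmt.Println(s, l, c) // [0 0 0] 3 6

	t := make([]int, 3) // Capacity omitted: it defaults to the length
	l, c = describe(t)
	fmt.Println(t, l, c) // [0 0 0] 3 3
}
```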
The first argument, representing the length, is mandatory. However, the second argument representing the capacity is optional. Figure 1 shows the result of this code in memory.
+
+
+
+Figure 1: A three-length, six-capacity slice.
+
+
In this case, make creates an array of six elements (the capacity). But because the length was set to 3, only the first three elements are part of the slice. Also, because the slice is an []int type, these elements are initialized to the zero value of an int: 0. The grayed elements are allocated but not yet used.
+
If we print this slice, we get the elements within the range of the length, [0 0 0]. If we set s[1] to 1, the second element of the slice updates without impacting its length or capacity. Figure 2 illustrates this.
+
+
+
+Figure 2: Updating the slice’s second element: s[1] = 1.
+
+
However, accessing an element outside the length range is forbidden, even though it’s already allocated in memory. For example, s[4] = 0 would lead to the following panic:
+
panic: runtime error: index out of range [4] with length 3
+
+
How can we use the remaining space of the slice? By using the append built-in function:
+
s = append(s, 2)
+
+
This code appends a new element to the existing s slice. It uses the first grayed element (which was allocated but not yet used) to store element 2, as figure 3 shows.
+
+
+
+Figure 3: Appending an element to s.
+
+
The length of the slice is updated from 3 to 4 because the slice now contains four elements. Now, what happens if we add three more elements so that the backing array isn’t large enough?
+
s = append(s, 3, 4, 5)
+fmt.Println(s)
+
+
If we run this code, we see that the slice was able to cope with our request:
+
[0 1 0 2 3 4 5]
+
+
Because an array is a fixed-size structure, the backing array can store the new elements only up to element 4. When we want to insert element 5, the array is already full: Go internally creates another array by doubling the capacity, copying all the elements, and then inserting element 5. (Doubling applies to small slices; for large ones, the runtime grows the capacity by a smaller factor.) Figure 4 shows this process.
+
+
+
+Figure 4: Because the initial backing array is full, Go creates another array and copies all the elements.
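The two append calls above can be replayed in a short sketch. The appendAll helper is hypothetical: it compares the address of the first element before and after the call to detect whether append moved the data to a new backing array. The exact capacity after growth is left to the runtime, so the sketch only checks that it is large enough:

```go
package main

import "fmt"

// appendAll is a hypothetical helper that appends vals to s and reports
// whether append had to move the data to a new backing array.
func appendAll(s []int, vals ...int) (out []int, moved bool) {
	out = append(s, vals...)
	// If the first elements no longer share an address, the data was copied.
	moved = len(s) > 0 && &out[0] != &s[0]
	return out, moved
}

func main() {
	s := make([]int, 3, 6)
	s[1] = 1

	s, moved := appendAll(s, 2) // Fits in the existing backing array
	fmt.Println(s, len(s), cap(s), moved) // [0 1 0 2] 4 6 false

	s, moved = appendAll(s, 3, 4, 5) // Exceeds the capacity of 6
	fmt.Println(s, len(s), moved) // [0 1 0 2 3 4 5] 7 true
	fmt.Println(cap(s) >= 7)      // true
}
```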
+
+
The slice now references the new backing array. What will happen to the previous backing array? If it’s no longer referenced, it’s eventually freed by the garbage collector (GC) if allocated on the heap. (We discuss heap memory in mistake #95, “Not understanding stack vs. heap,” and we look at how the GC works in mistake #99, “Not understanding how the GC works.”)
+
What happens with slicing? Slicing is an operation done on an array or a slice, providing a half-open range; the first index is included, whereas the second is excluded. The following example shows the impact, and figure 5 displays the result in memory:
+
s1 := make([]int, 3, 6) // Three-length, six-capacity slice
+s2 := s1[1:3] // Slicing from indices 1 to 3
+
+
+
+
+Figure 5: The slices s1 and s2 reference the same backing array with different lengths and capacities.
+
+
First, s1 is created as a three-length, six-capacity slice. When s2 is created by slicing s1, both slices reference the same backing array. However, s2 starts from a different index, 1. Therefore, its length and capacity (a two-length, five-capacity slice) differ from those of s1. If we update s1[1] or s2[0], the change is made to the same array and is therefore visible in both slices, as figure 6 shows.
+
+
+
+Figure 6: Because s1 and s2 are backed by the same array, updating a common element makes the change visible in both slices.
+
+
Now, what happens if we append an element to s2? Does the following code change s1 as well?
+
s2 = append(s2, 2)
+
+
The shared backing array is modified, but only the length of s2 changes. Figure 7 shows the result of appending an element to s2.
+
+
+
+Figure 7: Appending an element to s2.
+
+
s1 remains a three-length, six-capacity slice. Therefore, if we print s1 and s2, the added element is visible only in s2:
+
s1=[0 1 0], s2=[1 0 2]
+
+
It’s important to understand this behavior so that we don’t make wrong assumptions while using append.
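The steps above can be replayed in a few lines. This is a minimal sketch; the shareAndAppend function name is hypothetical. The last statement reslices s1 within its capacity to show that the appended element really was written to the shared backing array:

```go
package main

import "fmt"

// shareAndAppend reproduces the steps above and returns both slices.
func shareAndAppend() (s1, s2 []int) {
	s1 = make([]int, 3, 6) // [0 0 0], length 3, capacity 6
	s2 = s1[1:3]           // [0 0], length 2, capacity 5
	s1[1] = 1              // Same element as s2[0]
	s2 = append(s2, 2)     // Written into the shared array, past len(s1)
	return s1, s2
}

func main() {
	s1, s2 := shareAndAppend()
	fmt.Printf("s1=%v, s2=%v\n", s1, s2) // s1=[0 1 0], s2=[1 0 2]

	// Reslicing s1 within its capacity reveals the element appended through s2.
	fmt.Println(s1[:4]) // [0 1 0 2]
}
```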
+
+Note
+
In these examples, the backing array is internal and not available directly to the Go developer. The only exception is when a slice is created from slicing an existing array.
+
+
One last thing to note: what if we keep appending elements to s2 until the backing array is full? What will the state be, memory-wise? Let’s add three more elements so that the backing array will not have enough capacity:
+
s2 = append(s2, 3)
+s2 = append(s2, 4) // At this stage, the backing array is already full
+s2 = append(s2, 5)
+
+
This code leads to creating another backing array. Figure 8 displays the results in memory.
+
+
+
+Figure 8: Appending elements to s2 until the backing array is full.
+
+
s1 and s2 now reference two different arrays. As s1 is still a three-length, six-capacity slice, it still has some available buffer, so it keeps referencing the initial array. Also, the new backing array was made by copying the initial one from the first index of s2. That’s why the new array starts with element 1, not 0.
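A short sketch of this final state, with a hypothetical detach function that repeats the whole scenario. After the last append, writing through s2 no longer affects s1, while the original array still holds the values that were written before the reallocation:

```go
package main

import "fmt"

// detach is a hypothetical helper that repeats the scenario above: it appends
// to s2 until its capacity is exceeded and returns both slices.
func detach() (s1, s2 []int) {
	s1 = make([]int, 3, 6)
	s2 = s1[1:3]
	s1[1] = 1
	s2 = append(s2, 2) // Length 3, capacity 5
	s2 = append(s2, 3) // Length 4
	s2 = append(s2, 4) // Length 5: the shared backing array is full
	s2 = append(s2, 5) // Forces a new backing array for s2
	return s1, s2
}

func main() {
	s1, s2 := detach()
	fmt.Println(s1, s2) // [0 1 0] [1 0 2 3 4 5]

	s2[0] = 42                // Written to the new array only
	fmt.Println(s1[1], s2[0]) // 1 42

	fmt.Println(s1[:6]) // [0 1 0 2 3 4]: the initial array is untouched by the last append
}
```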
+
To summarize, the slice length is the number of available elements in the slice, whereas the slice capacity is the number of elements in the backing array, counting from the first element of the slice. Adding an element to a full slice (length == capacity) leads to creating a new backing array with a new capacity, copying all the elements from the previous array, and updating the slice pointer to the new array.
When working with maps in Go, we need to understand some important characteristics of how a map grows and shrinks. Let’s delve into this to prevent issues that can cause memory leaks.
+
First, to view a concrete example of this problem, let’s design a scenario where we will work with the following map:
+
m := make(map[int][128]byte)
+
+
Each value of m is an array of 128 bytes. We will do the following:
+
+
Allocate an empty map.
+
Add 1 million elements.
+
Remove all the elements, and run a Garbage Collection (GC).
+
+
After each step, we want to print the size of the heap (using a printAlloc utility function). This shows us how this example behaves memory-wise:
+
package main
+
+import (
+	"fmt"
+	"runtime"
+)
+
+func main() {
+	n := 1_000_000
+	m := make(map[int][128]byte)
+	printAlloc()
+
+	for i := 0; i < n; i++ { // Adds 1 million elements
+		m[i] = [128]byte{}
+	}
+	printAlloc()
+
+	for i := 0; i < n; i++ { // Deletes 1 million elements
+		delete(m, i)
+	}
+
+	runtime.GC() // Triggers a manual GC
+	printAlloc()
+	runtime.KeepAlive(m) // Keeps a reference to m so that the map isn’t collected
+}
+
+func printAlloc() {
+	var m runtime.MemStats
+	runtime.ReadMemStats(&m)
+	fmt.Printf("%d MB\n", m.Alloc/(1024*1024)) // Heap size in MB
+}
+
+
We allocate an empty map, add 1 million elements, remove 1 million elements, and then run a GC. We also make sure to keep a reference to the map using runtime.KeepAlive so that the map isn’t collected as well. Let’s run this example:
+
0 MB <-- After m is allocated
+461 MB <-- After we add 1 million elements
+293 MB <-- After we remove 1 million elements
+
+
What can we observe? At first, the heap size is minimal. Then it grows significantly after having added 1 million elements to the map. But if we expected the heap size to decrease after removing all the elements, this isn’t how maps work in Go. In the end, even though the GC has collected all the elements, the heap size is still 293 MB. So the memory shrank, but not as much as we might have expected. What’s the rationale? We need to delve into how a map works in Go.
+
A map provides an unordered collection of key-value pairs in which all the keys are distinct. In Go, a map is based on the hash table data structure: an array where each element is a pointer to a bucket of key-value pairs, as shown in figure 1.
+
+
+
+Figure 1: A hash table example with a focus on bucket 0.
+
+
Each bucket is a fixed-size array of eight elements. In the case of an insertion into a bucket that is already full (a bucket overflow), Go creates another bucket of eight elements and links the previous one to it. Figure 2 shows an example:
+
+
+
+Figure 2: In case of a bucket overflow, Go allocates a new bucket and links the previous bucket to it.
+
+
Under the hood, a Go map is a pointer to a runtime.hmap struct. This struct contains multiple fields, including a B field, giving the base-2 logarithm of the number of buckets in the map:
+
type hmap struct {
+	B uint8 // log_2 of # of buckets
+	// (can hold up to loadFactor * 2^B items)
+	// ...
+}
+
+
After adding 1 million elements, the value of B equals 18, which means 2¹⁸ = 262,144 buckets. When we remove 1 million elements, what’s the value of B? Still 18. Hence, the map still contains the same number of buckets.
+
The reason is that the number of buckets in a map cannot shrink. Therefore, removing elements from a map doesn’t impact the number of existing buckets; it just zeroes the slots in the buckets. A map can only grow and have more buckets; it never shrinks.
+
In the previous example, we went from 461 MB to 293 MB because the elements were collected, but running the GC didn’t impact the map itself. Even the number of extra buckets (the buckets created because of overflows) remains the same.
+
Let’s take a step back and discuss when the fact that a map cannot shrink can be a problem. Imagine building a cache using a map[int][128]byte. For each customer ID (the int), this map holds a sequence of 128 bytes. Now, suppose we want to save the last 1,000 customers. The map size will remain constant, so we shouldn’t worry about the fact that a map cannot shrink.
+
However, let’s say we want to store one hour of data. Meanwhile, our company has decided to have a big promotion for Black Friday: in one hour, we may have millions of customers connected to our system. But a few days after Black Friday, our map will contain the same number of buckets as during the peak time. This explains why we can experience high memory consumption that doesn’t significantly decrease in such a scenario.
+
What are the solutions if we don’t want to manually restart our service to clean the amount of memory consumed by the map? One solution could be to re-create a copy of the current map at a regular pace. For example, every hour, we can build a new map, copy all the elements, and release the previous one. The main drawback of this option is that following the copy and until the next garbage collection, we may consume twice the current memory for a short period.
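The first solution can be sketched as follows. The rebuild function is a hypothetical helper, not part of any library: it copies the live entries into a freshly allocated map sized for the current number of elements, so that the old map and its buckets can be collected once nothing references them:

```go
package main

import "fmt"

// rebuild is a hypothetical helper that copies m into a freshly allocated map
// sized for the current number of elements. Once the caller drops the old map,
// the garbage collector can reclaim its buckets.
func rebuild(m map[int][128]byte) map[int][128]byte {
	fresh := make(map[int][128]byte, len(m))
	for k, v := range m {
		fresh[k] = v
	}
	return fresh
}

func main() {
	m := make(map[int][128]byte)
	for i := 0; i < 1000; i++ {
		m[i] = [128]byte{}
	}
	for i := 10; i < 1000; i++ {
		delete(m, i) // The buckets stay allocated
	}

	m = rebuild(m)      // The old, oversized map becomes unreachable
	fmt.Println(len(m)) // 10
}
```

In a long-running service, such a function would typically be called from a periodic task while writers are blocked or redirected; how to synchronize that swap depends on the application.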
+
Another solution would be to change the map type to store an array pointer: map[int]*[128]byte. It doesn’t solve the fact that we will have a significant number of buckets; however, each bucket entry will reserve the size of a pointer (8 bytes on 64-bit systems and 4 bytes on 32-bit systems) for the value instead of 128 bytes.
+
Coming back to the original scenario, let’s compare the memory consumption for each map type following each step. The following table shows the comparison.
+
+| Step | map[int][128]byte | map[int]*[128]byte |
+|---|---|---|
+| Allocate an empty map | 0 MB | 0 MB |
+| Add 1 million elements | 461 MB | 182 MB |
+| Remove all the elements and run a GC | 293 MB | 38 MB |
+
+Note
+
If a key or a value is over 128 bytes, Go won’t store it directly in the map bucket. Instead, Go stores a pointer to reference the key or the value.
+
+
As we have seen, adding n elements to a map and then deleting all the elements means keeping the same number of buckets in memory. So, we must remember that because the number of buckets in a Go map can only grow, so does the memory the map retains. There is no automated strategy to shrink it. If this leads to high memory consumption, we can try different options, such as re-creating the map at regular intervals or storing pointers as values to reduce the size of each bucket entry.
\ No newline at end of file
diff --git a/site/404.html b/site/404.html
index 6834d85..3be8016 100644
--- a/site/404.html
+++ b/site/404.html
@@ -40,6 +40,8 @@
+
+
@@ -553,6 +555,66 @@
+
In general, we should never guess about performance. When writing optimizations, so many factors may come into play that even if we have a strong opinion about the results, it’s rarely a bad idea to test them. However, writing benchmarks isn’t straightforward. It can be pretty simple to write inaccurate benchmarks and make wrong assumptions based on them. The goal of this post is to examine four common and concrete traps leading to inaccuracy:
+
+
Not resetting or pausing the timer
+
Making wrong assumptions about micro-benchmarks
+
Not being careful about compiler optimizations
+
Being fooled by the observer effect
+
+
General concepts
+
Before discussing these traps, let’s briefly review how benchmarks work in Go. The skeleton of a benchmark is as follows:
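A minimal sketch of that skeleton. The body of foo is an assumption made only so that the example compiles, and main relies on testing.Benchmark so that the sketch runs on its own; in a real project, BenchmarkFoo would live in a _test.go file and be run with go test -bench:

```go
package main

import (
	"fmt"
	"testing"
)

// foo is a placeholder for the function under test.
func foo() int {
	sum := 0
	for i := 0; i < 1000; i++ {
		sum += i
	}
	return sum
}

// BenchmarkFoo calls the function under test b.N times.
func BenchmarkFoo(b *testing.B) {
	for i := 0; i < b.N; i++ {
		foo()
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside of go test.
	res := testing.Benchmark(BenchmarkFoo)
	fmt.Println(res.N > 0) // true
}
```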
The function name starts with the Benchmark prefix. The function under test (foo) is called within the for loop. b.N represents a variable number of iterations. When running a benchmark, Go tries to make its duration match the requested benchmark time, which defaults to 1 second and can be changed with the -benchtime flag. b.N starts at 1; if the benchmark completes in under 1 second, b.N is increased, and the benchmark runs again until its duration roughly matches benchtime:
+
$ go test -bench=.
+cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
+BenchmarkFoo-4 73 16511228 ns/op
+
+
Here, the benchmark took about 1 second, and foo was executed 73 times, for an average execution time of 16,511,228 nanoseconds. We can change the benchmark time using -benchtime:
+
$ go test -bench=. -benchtime=2s
+BenchmarkFoo-4 150 15832169 ns/op
+
+
foo was executed roughly twice as many times as during the previous benchmark.
+
Next, let’s look at some common traps.
+
Not resetting or pausing the timer
+
In some cases, we need to perform operations before the benchmark loop. These operations may take quite a while (for example, generating a large slice of data) and may significantly impact the benchmark results:
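A minimal sketch of this situation and its usual remedy, a call to b.ResetTimer between the setup and the loop. expensiveSetup and functionUnderTest are placeholder names, and their bodies are assumptions made only so that the example runs:

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// expensiveSetup is a placeholder for a costly preparation step.
func expensiveSetup() {
	time.Sleep(10 * time.Millisecond)
}

// functionUnderTest is a placeholder for the code being measured.
func functionUnderTest() int {
	sum := 0
	for i := 0; i < 100; i++ {
		sum += i
	}
	return sum
}

func BenchmarkFoo(b *testing.B) {
	expensiveSetup()
	b.ResetTimer() // Discard the time spent so far, including the setup
	for i := 0; i < b.N; i++ {
		functionUnderTest()
	}
}

func main() {
	// testing.Benchmark lets the sketch run outside of go test.
	res := testing.Benchmark(BenchmarkFoo)
	fmt.Println(res)
}
```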
Calling ResetTimer zeroes the elapsed benchmark time and memory allocation counters since the beginning of the test. This way, an expensive setup can be discarded from the test results.
+
What if we have to perform an expensive setup not just once but within each loop iteration?
We can’t reset the timer, because that would be executed during each loop iteration. But we can stop and resume the benchmark timer, surrounding the call to expensiveSetup:
+
func BenchmarkFoo(b *testing.B) {
+	for i := 0; i < b.N; i++ {
+		b.StopTimer() // Pause the benchmark timer
+		expensiveSetup()
+		b.StartTimer() // Resume the benchmark timer
+		functionUnderTest()
+	}
+}
+
+
Here, we pause the benchmark timer to perform the expensive setup and then resume the timer.
+
+Note
+
There’s one catch to remember about this approach: if the function under test is very fast compared to the setup function, the benchmark may take too long to complete. The reason is that the benchmark time is calculated solely from the execution time of functionUnderTest, so the wall-clock time needed to reach benchtime can be much longer than 1 second. If we wait a significant time in each loop iteration, the benchmark will take much longer than 1 second to run. If we want to keep the benchmark, one possible mitigation is to decrease benchtime.
+
+
We must be sure to use the timer methods to preserve the accuracy of a benchmark.
+
Making wrong assumptions about micro-benchmarks
+
A micro-benchmark measures a tiny computation unit, and it can be extremely easy to make wrong assumptions about it. Let’s say, for example, that we aren’t sure whether to use atomic.StoreInt32 or atomic.StoreInt64 (assuming that the values we handle will always fit in 32 bits). We want to write a benchmark to compare both functions:
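A minimal sketch of such a pair of benchmarks. The function names match those in the output later in this section; the bodies are an assumption about how the two calls would typically be measured. main uses testing.Benchmark only so that the sketch runs outside of go test:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// BenchmarkAtomicStoreInt32 measures repeated atomic stores to an int32.
func BenchmarkAtomicStoreInt32(b *testing.B) {
	var v int32
	for i := 0; i < b.N; i++ {
		atomic.StoreInt32(&v, 1)
	}
}

// BenchmarkAtomicStoreInt64 measures repeated atomic stores to an int64.
func BenchmarkAtomicStoreInt64(b *testing.B) {
	var v int64
	for i := 0; i < b.N; i++ {
		atomic.StoreInt64(&v, 1)
	}
}

func main() {
	// The printed timings depend on the machine and vary between runs.
	fmt.Println("StoreInt32:", testing.Benchmark(BenchmarkAtomicStoreInt32))
	fmt.Println("StoreInt64:", testing.Benchmark(BenchmarkAtomicStoreInt64))
}
```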
We could easily take this benchmark for granted and decide to use atomic.StoreInt64 because it appears to be faster. Now, for the sake of doing a fair benchmark, we reverse the order and test atomic.StoreInt64 first, followed by atomic.StoreInt32. Here is some example output:
This time, atomic.StoreInt32 has better results. What happened?
+
In the case of micro-benchmarks, many factors can impact the results, such as machine activity while running the benchmarks, power management, thermal scaling, and better cache alignment of a sequence of instructions. We must remember that many factors, even outside the scope of our Go project, can impact the results.
+
+Note
+
We should make sure the machine executing the benchmark is idle. However, external processes may run in the background, which may affect benchmark results. For that reason, tools such as perflock can limit how much CPU a benchmark can consume. For example, we can run a benchmark with 70% of the total available CPU, giving 30% to the OS and other processes and reducing the impact of the machine activity factor on the results.
+
+
One option is to increase the benchmark time using the -benchtime option. Similar to the law of large numbers in probability theory, if we run a benchmark a large number of times, it should tend to approach its expected value (assuming we omit the benefits of instructions caching and similar mechanics).
+
Another option is to use external tools on top of the classic benchmark tooling. For instance, the benchstat tool, which is part of the golang.org/x repository, allows us to compute and compare statistics about benchmark executions.
+
Let’s run the benchmark 10 times using the -count option and pipe the output to a specific file:
+
$ go test -bench=. -count=10 | tee stats.txt
+cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
+BenchmarkAtomicStoreInt32-4 234935682 5.124 ns/op
+BenchmarkAtomicStoreInt32-4 235307204 5.112 ns/op
+// ...
+BenchmarkAtomicStoreInt64-4 235548591 5.107 ns/op
+BenchmarkAtomicStoreInt64-4 235210292 5.090 ns/op
+// ...
+
Running benchstat on this file shows that the results are the same: both functions take on average 5.10 nanoseconds to complete. We also see the percent variation between the executions of a given benchmark: ± 1%. This metric tells us that both benchmarks are stable, giving us more confidence in the computed average results. Therefore, instead of concluding that atomic.StoreInt32 is faster or slower, we can conclude that its execution time is similar to that of atomic.StoreInt64 for the usage we tested (in a specific Go version on a particular machine).
+
In general, we should be cautious about micro-benchmarks. Many factors can significantly impact the results and potentially lead to wrong assumptions. Increasing the benchmark time or repeating the benchmark executions and computing stats with tools such as benchstat can be an efficient way to limit external factors and get more accurate results, leading to better conclusions.
+
Let’s also highlight that we should be careful about using the results of a micro-benchmark executed on a given machine if another system ends up running the application. The production system may act quite differently from the one on which we ran the micro-benchmark.
+
Not being careful about compiler optimizations
+
Another common mistake related to writing benchmarks is being fooled by compiler optimizations, which can also lead to wrong benchmark assumptions. In this section, we look at Go issue 14813 (https://github.com/golang/go/issues/14813, also discussed by Go project member Dave Cheney) with a population count function (a function that counts the number of bits set to 1):
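A sketch of such a function and of a naive benchmark for it. The popcnt body below is the classic bit-twiddling population count, assumed here to be what the issue uses; BenchmarkPopcnt1 is the name the rest of this section gives to the naive version, which discards the result of each call:

```go
package main

import (
	"fmt"
	"testing"
)

const (
	m1  = 0x5555555555555555
	m2  = 0x3333333333333333
	m4  = 0x0f0f0f0f0f0f0f0f
	h01 = 0x0101010101010101
)

// popcnt returns the number of bits set to 1 in x.
func popcnt(x uint64) uint64 {
	x -= (x >> 1) & m1             // Count bits in each pair
	x = (x & m2) + ((x >> 2) & m2) // Count bits in each nibble
	x = (x + (x >> 4)) & m4        // Count bits in each byte
	return (x * h01) >> 56         // Sum the bytes into the top byte
}

// BenchmarkPopcnt1 is the naive version: the result of popcnt is discarded.
func BenchmarkPopcnt1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		popcnt(uint64(i))
	}
}

func main() {
	fmt.Println(popcnt(0), popcnt(255), popcnt(1<<63)) // 0 8 1
}
```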
A duration of 0.28 nanoseconds is roughly one clock cycle, so this number is unreasonably low. The problem is that the developer wasn’t careful enough about compiler optimizations. In this case, the function under test is simple enough to be a candidate for inlining: an optimization that replaces a function call with the body of the called function, saving the small overhead of the call. Once the function is inlined, the compiler notices that the call has no side effects and replaces it with the following benchmark:
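A sketch of what the benchmark effectively becomes once the call has been inlined and eliminated. This is an illustration of the resulting behavior written as source code, not actual compiler output:

```go
package main

import (
	"fmt"
	"testing"
)

// BenchmarkPopcnt1 as it effectively behaves after inlining and elimination
// of the unused result: the loop body does nothing.
func BenchmarkPopcnt1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// Empty
	}
}

func main() {
	// testing.Benchmark lets the sketch run outside of go test.
	res := testing.Benchmark(BenchmarkPopcnt1)
	fmt.Println(res.N > 0) // true
}
```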
The benchmark is now empty — which is why we got a result close to one clock cycle. To prevent this from happening, a best practice is to follow this pattern:
+
+
During each loop iteration, assign the result to a local variable (local in the context of the benchmark function).
+
Assign the latest result to a global variable.
+
+
In our case, we write the following benchmark:
+
var global uint64 // Define a global variable
+
+func BenchmarkPopcnt2(b *testing.B) {
+	var v uint64 // Define a local variable
+	for i := 0; i < b.N; i++ {
+		v = popcnt(uint64(i)) // Assign the result to the local variable
+	}
+	global = v // Assign the result to the global variable
+}
+
+
global is a global variable, whereas v is a local variable whose scope is the benchmark function. During each loop iteration, we assign the result of popcnt to the local variable. Then we assign the latest result to the global variable.
+
+Note
+
Why not assign the result of the popcnt call directly to global to simplify the test? Writing to a global variable is slower than writing to a local variable (these concepts are discussed in 100 Go Mistakes, mistake #95: “Not understanding stack vs. heap”). Therefore, we should write each result to a local variable to limit the footprint during each loop iteration.
+
+
If we run these two benchmarks, we now get a significant difference in the results:
BenchmarkPopcnt2 is the accurate version of the benchmark. It guarantees that we avoid the inlining optimizations, which can artificially lower the execution time or even remove the call to the function under test. Relying on the results of BenchmarkPopcnt1 could have led to wrong assumptions.
+
Let’s remember the pattern to avoid compiler optimizations fooling benchmark results: assign the result of the function under test to a local variable, and then assign the latest result to a global variable. This best practice also prevents us from making incorrect assumptions.
+
Being fooled by the observer effect
+
In physics, the observer effect is the disturbance of an observed system by the act of observation. This effect can also be seen in benchmarks and can lead to wrong assumptions about results. Let’s look at a concrete example and then try to mitigate it.
+
We want to implement a function receiving a matrix of int64 elements. This matrix has a fixed number of 512 columns, and we want to compute the total sum of the first eight columns, as shown in figure 1.
+
+
+
+Figure 1: Computing the sum of the first eight columns.
+
+
For the sake of optimizations, we also want to determine whether varying the number of columns has an impact, so we also implement a second function with 513 columns. The implementation is the following:
+
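func calculateSum512(s [][512]int64) int64 {
+	var sum int64
+	for i := 0; i < len(s); i++ { // Iterate over each row
+		for j := 0; j < 8; j++ { // Iterate over the first eight columns
+			sum += s[i][j] // Increment sum
+		}
+	}
+	return sum
+}
+
+func calculateSum513(s [][513]int64) int64 {
+	// Same implementation as calculateSum512
+	var sum int64
+	for i := 0; i < len(s); i++ {
+		for j := 0; j < 8; j++ {
+			sum += s[i][j]
+		}
+	}
+	return sum
+}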
+
+
We iterate over each row and then over the first eight columns, and we increment a sum variable that we return. The implementation in calculateSum513 remains the same.
+
We want to benchmark these functions to decide which one is the most performant given a fixed number of rows:
+
const rows = 1000
+
+var res int64
+
+func BenchmarkCalculateSum512(b *testing.B) {
+	var sum int64
+	s := createMatrix512(rows) // Create a matrix of 512 columns
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		sum = calculateSum512(s) // Calculate the sum
+	}
+	res = sum
+}
+
+func BenchmarkCalculateSum513(b *testing.B) {
+	var sum int64
+	s := createMatrix513(rows) // Create a matrix of 513 columns
+	b.ResetTimer()
+	for i := 0; i < b.N; i++ {
+		sum = calculateSum513(s) // Calculate the sum
+	}
+	res = sum
+}
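The createMatrix512 and createMatrix513 helpers are not shown in the text. A hypothetical sketch, assuming they simply allocate the requested number of rows and fill every cell with an arbitrary value:

```go
package main

import "fmt"

// createMatrix512 is a hypothetical implementation: it allocates r rows of
// 512 columns and fills each cell with an arbitrary value.
func createMatrix512(r int) [][512]int64 {
	m := make([][512]int64, r)
	for i := range m {
		for j := range m[i] {
			m[i][j] = int64(i + j)
		}
	}
	return m
}

// createMatrix513 does the same with 513 columns per row.
func createMatrix513(r int) [][513]int64 {
	m := make([][513]int64, r)
	for i := range m {
		for j := range m[i] {
			m[i][j] = int64(i + j)
		}
	}
	return m
}

func main() {
	a, b := createMatrix512(1000), createMatrix513(1000)
	fmt.Println(len(a), len(a[0]), len(b), len(b[0])) // 1000 512 1000 513
}
```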
+
+
We want to create the matrix only once, to limit the footprint on the results. Therefore, we call createMatrix512 and createMatrix513 outside of the loop. We may expect the results to be similar as again we only want to iterate on the first eight columns, but this isn’t the case (on my machine):
The second benchmark with 513 columns is about 50% faster. Again, because we iterate only over the first eight columns, this result is quite surprising.
+
To understand this difference, we need to understand the basics of CPU caches. In a nutshell, a CPU has several levels of cache (usually L1, L2, and L3). These caches reduce the average cost of accessing data from the main memory. In some conditions, the CPU can fetch data from the main memory and copy it to L1. In this case, the CPU tries to fetch into L1 the matrix’s subset that calculateSum is interested in (the first eight columns of each row). However, this subset fits in the cache in one case (513 columns) but not in the other case (512 columns).
+
+Note
+
Explaining why is outside the scope of this post, but we look at this problem in 100 Go Mistakes, mistake #91: “Not understanding CPU caches.”
+
+
Coming back to the benchmark, the main issue is that we keep reusing the same matrix in both cases. Because the function is repeated thousands of times, we don’t measure the function’s execution when it receives a plain new matrix. Instead, we measure a function that gets a matrix that already has a subset of the cells present in the cache. Therefore, because calculateSum513 leads to fewer cache misses, it has a better execution time.
+
This is an example of the observer effect. Because we keep observing a repeatedly called CPU-bound function, CPU caching may come into play and significantly affect the results. In this example, to prevent this effect, we should create a matrix during each test instead of reusing one:
+
func BenchmarkCalculateSum512(b *testing.B) {
+	var sum int64
+	for i := 0; i < b.N; i++ {
+		b.StopTimer()
+		s := createMatrix512(rows) // Create a new matrix during each loop iteration
+		b.StartTimer()
+		sum = calculateSum512(s)
+	}
+	res = sum
+}
+
+
A new matrix is now created during each loop iteration. If we run the benchmark again (and adjust benchtime — otherwise, it takes too long to execute), the results are closer to each other:
Instead of making the incorrect assumption that calculateSum513 is faster, we see that both benchmarks lead to similar results when receiving a new matrix.
+
As we have seen in this post, because we were reusing the same matrix, CPU caches significantly impacted the results. To prevent this, we had to create a new matrix during each loop iteration. In general, we should remember that observing a function under test may lead to significant differences in results, especially in the context of micro-benchmarks of CPU-bound functions where low-level optimizations matter. Forcing a benchmark to re-create data during each iteration can be a good way to prevent this effect.
\ No newline at end of file
diff --git a/site/9-generics/index.html b/site/9-generics/index.html
index 1517bd1..a8aca38 100644
--- a/site/9-generics/index.html
+++ b/site/9-generics/index.html
@@ -14,7 +14,7 @@
-
+
@@ -46,6 +46,8 @@
+
+
@@ -89,7 +91,18 @@
-
+
@@ -657,6 +670,66 @@
+