Stack traces

This commit is contained in:
Felix Geisendörfer 2021-05-24 11:46:00 +02:00
parent 9e083542ad
commit c746a22cb9

View file

@ -1,6 +1,6 @@
🚧 This note is still work in progress, please come back later! 🚧
[Introduction](#intrudction) - [Goroutine Stack](#goroutine-stack) ([Stack Layout](#stack-layout), [Real Example](#real-example), [cgo](#cgo)) - [Unwinding](#unwinding) ([Frame Pointer](#frame-pointer), [gopclntab](#gopclntab), [DWARF](#dwarf)) - [Symbolization](#symbolization)
This document was last updated for `go1.16.3` but probably still applies to older/newer versions for the most parts.
This document was last updated for `go1.16` but probably still applies to older/newer versions for the most parts.
# Stack Traces in Go
@ -34,7 +34,7 @@ As the name implies, stack traces originate from "the stack". Even so the detail
Platforms like `x86-64` define a [stack layout](https://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64) and [calling convention](https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-1.0.pdf) for C and encourage other programming languages to adopt it for interoperability. Go doesn't follow these conventions, and instead uses its own idiosyncratic [calling convention](https://dr-knz.net/go-calling-convention-x86-64.html). Future versions of Go (1.17?) will adopt a more traditional [register-based](https://go.googlesource.com/proposal/+/refs/changes/78/248178/1/design/40724-register-calling.md) convention that will improve performance. However even the new convention won't be platform-compatible as that would negatively impact goroutine scalability.
Go's stack layout is slightly different on different platforms. To keep things manageable, we'll assume that we're on `x86-64` for the remainder of this note.
Go's stack layout is slightly different on different platforms. To keep things manageable, we'll assume that we're on `x86-64` for the remainder of this document.
### Stack Layout
@ -46,7 +46,7 @@ The picture below shows the stack of a sample goroutine that is currently callin
There is a lot going on in this picture, but for now let's focus on the things highlighted in red. To get a stack trace, the first thing we need is the current program counter (`pc`). This is found in a CPU register called `rip` (instruction pointer register) and points to another region of memory that holds the executable machine code of our program. Since we're currently calling `main.foo()` `rip` is pointing to an instruction within that function. If you're not familiar with registers, you can think of them as special CPU variables that are incredibly fast to access. Some of them, like `rip`, `rsp` or `rbp` have special purposes, while others can be used by compilers as they see fit.
Now that we know the program counter of the current function, it's time to find pc values of our callers, i.e. all the `return address (pc)` values that are also highlighted in red. There are various techniques for doing this, which are described in the [Unwinding](#unwinding) section. The end result is a list of program counters that represent a stack trace just like the one you can get from [`runtime.Callers()`](https://golang.org/pkg/runtime/#Callers). Last but not least, these `pc` values are usually translated into human readable file/line/function names as described in the [Symbolization](#symbolization) section below. In Go itself you can simply calll [`runtime.CallerFramers()`](https://golang.org/pkg/runtime/#CallersFrames) to symbolize a list of `pc` values.
Now that we know the program counter of the current function, it's time to find `pc` values of our callers, i.e. all the `return address (pc)` values that are also highlighted in red. There are various techniques for doing this, which are described in the [Unwinding](#unwinding) section. The end result is a list of program counters that represent a stack trace just like the one you can get from [`runtime.Callers()`](https://golang.org/pkg/runtime/#Callers). Last but not least, these `pc` values are usually translated into human readable file/line/function names as described in the [Symbolization](#symbolization) section below. In Go itself you can simply calll [`runtime.CallerFramers()`](https://golang.org/pkg/runtime/#CallersFrames) to symbolize a list of `pc` values.
### Real Example
@ -112,7 +112,7 @@ If you want to try it out yourself, perhaps modify the example program to spawn
### cgo
Go's stack implementation described above is making an important tradeoff when it comes to interacting with code written in languages that follow platform calling conventions such as C. Instead of being able to directly call such functions directly, Go has to perform [complicated rituals](https://golang.org/src/runtime/cgocall.go) for switching between goroutine stacks and OS-allocated stacks that can run C code. This comes with a certain amount of performance overhead and also poses complex issues for capturing stack traces during profiling, see [runtime.SetCgoTraceback()](https://golang.org/pkg/runtime/#SetCgoTraceback).
Go's stack implementation described above is making an important tradeoff when it comes to interacting with code written in languages that follow platform calling conventions such as C. Instead of being able to call such functions directly, Go has to perform [complicated rituals](https://golang.org/src/runtime/cgocall.go) for switching between goroutine stacks and OS-allocated stacks that can run C code. This comes with a certain amount of performance overhead and also poses complex issues for capturing stack traces during profiling, see [runtime.SetCgoTraceback()](https://golang.org/pkg/runtime/#SetCgoTraceback).
🚧 I'll try to describe this in more detail in the future.
@ -120,7 +120,7 @@ Go's stack implementation described above is making an important tradeoff when i
Unwinding (or stack walking) is the process of collecting all the return addresses (see red elements in [Stack Layout](#stack-layout)) from the stack. Together with the current instruction pointer register (`rip`) they form a list of program counter (`pc`) values that can be turned into a human readable stack trace via [Symbolization](#symbolization).
The Go runtime, including the builtin profilers, exclusively use [.gopclntab](#gopclntab) for unwinding. However, we'll start with describing [Frame Pointer](#frame-pointer) unwinding first, because it is much easier to understand and might [become supported in the future](https://github.com/golang/go/issues/16638). After that we'll also discuss [DWARF](#dwarf) which is yet another way to unwind go stacks.
The Go runtime, including the builtin profilers, exclusively use [gopclntab](#gopclntab) for unwinding. However, we'll start with describing [Frame Pointer](#frame-pointer) unwinding first, because it is much easier to understand and might [become supported in the future](https://github.com/golang/go/issues/16638). After that we'll also discuss [DWARF](#dwarf) which is yet another way to unwind Go stacks.
For those not familiar with it, below is a simple diagram showing the relevant sections of a typical Go binary file that we'll be discussing here. They are always wrapped inside either the ELF, Mach-O or PE container format, depending on the operating system.
@ -139,7 +139,7 @@ In Go you don't even need this advice. Since Go 1.7 frame pointers are enabled b
If you'd like to see a simple frame pointer unwinding implementation, you can check out [this toy project](https://github.com/felixge/gounwind) which has a light-weight alternative to `runtime.Callers()`. The simplicity should speak for itself when compared to the other unwinding methods described below. It should also be clear that frame pointer unwinding has `O(N)` time complexity where `N` is the number of stack frames that need to be traversed.
Despite the apparent simplicity, frame pointer unwinding is no panacea. Frame pointers are pushed to the stack by the callee, so for interrupt based profiling there is an inherent race condition that might cause you to miss the caller of the current function in your stack trace. Additionally frame pointer unwinding alone can't identify inlined function calls. So at least some of the complexity of [.gopclntab](#gopclntab) or [DWARF](#dwarf) is essential to enable accurate unwinding.
Despite the apparent simplicity, frame pointer unwinding is no panacea. Frame pointers are pushed to the stack by the callee, so for interrupt based profiling there is an inherent race condition that might cause you to miss the caller of the current function in your stack trace. Additionally frame pointer unwinding alone can't identify inlined function calls. So at least some of the complexity of [gopclntab](#gopclntab) or [DWARF](#dwarf) is essential to enable accurate unwinding.
### gopclntab
@ -149,7 +149,7 @@ As far as unwinding is concerned, the general idea is to embed a "virtual frame
Russ Cox initially described some of the involved data structures in his [Go 1.2 Runtime Symbol Information](https://golang.org/s/go12symtab) document, but it's very outdated by now and it's probably better to look at the current implementation directly. The relevant files are [runtime/traceback.go](https://github.com/golang/go/blob/go1.16.3/src/runtime/traceback.go) and [runtime/symtab.go](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go), so let's dive in.
There are various use cases for stack traces in Go, but they all end up hitting the [`gentraceback()`](https://github.com/golang/go/blob/go1.16.3/src/runtime/traceback.go#L76-L86) function. If the caller is e.g. `runtime.Callers()` the function only needs to do unwinding, but e.g. `panic()` wants text output, which requires symbolization as well. Additionally the code has to deal with the difference between [link register architectures](https://en.wikipedia.org/wiki/Link_register) such as ARM that work a little different from x86. This combination of unwinding, symbolization, support for different architectures and bespoke data structures might just be a regular day in the shop for the system developers on the Go team, but it's definitely been tricky for me, so please watch out for potential inaccuracies in my description below.
The core of Go's stack trace implementation is in the [`gentraceback()`](https://github.com/golang/go/blob/go1.16.3/src/runtime/traceback.go#L76-L86) function which is being called from various places. If the caller is e.g. `runtime.Callers()` the function only needs to do unwinding, but e.g. `panic()` wants text output, which requires symbolization as well. Additionally the code has to deal with the difference between [link register architectures](https://en.wikipedia.org/wiki/Link_register) such as ARM that work a little different from x86. This combination of unwinding, symbolization, support for different architectures and bespoke data structures might just be a regular day in the shop for the system developers on the Go team, but it's definitely been tricky for me, so please watch out for potential inaccuracies in my description below.
Each frame lookup begins with the current `pc` which is passed to [`findfunc()`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L671) which looks up the meta data for the function that contains the `pc`. Historically this was done using `O(log N)` binary search, but [nowadays](https://go-review.googlesource.com/c/go/+/2097/) there is a hash-map-like index of [`findfuncbucket`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L671) structs that usually directly guides us to the right entry using an `O(1)` algorithm.
@ -157,59 +157,52 @@ The [_func](https://github.com/golang/go/blob/9baddd3f21230c55f0ad2a10f5f20579dc
Now that that we have the stack pointer delta, we are almost ready to locate the next `return address (pc)` value of the caller and do the same lookup for it until we reach the "bottom" of the stack. But before that, we need to check if the current `pc` is part of one or more inlined function calls. This is done by checking the `_FUNCDATA_InlTree` data for the current `_func` and doing another linear search over the (`pc`, `inline index`) pairs in that table. Any inlined call found this way gets a virtual stack frame `pc` added to the list. Then we continue with `return address (pc)` as mentioned in the beginning of the paragraph.
Putting it all together, under reasonable assumptions, the effective time complexity of `gocplntab` unwinding is the same as frame pointer unwinding, i.e. `O(N)` where `N` is the number of frames on the stack, but with higher constant overheads. This can be validated [experimentally](https://github.com/DataDog/go-profiler-notes/tree/main/examples/stack-unwind-overhead), but for most applications a good rule of thumb is to assume a cost of `~1µs` per stack trace. So if you're aiming for < 1% CPU overhead in production, you should try to configure your profilers to not track more than ~10k events per second per core. That's a decent amount of data, for some tools (like the [built-in tracer](https://golang.org/pkg/runtime/trace/)) stack unwinding can become a significant bottleneck. In the future this could be overcome by the Go adding [support for frame pointer unwinding](https://github.com/golang/go/issues/16638) which might be up to [50x faster](https://github.com/felixge/gounwind) than the current `gopclntab` implementation.
Another concern when it comes to `.gopclntab` overhead is the way it increases the file size of your binary. Up until Go 1.2 this data was stored in a compressed format which negatively impacted startup time. Then the implementation was changed to eliminate the startup cost at an increase of binary size. Raphael Poss has written a [great article](https://dr-knz.net/go-executable-size-visualization-with-d3.html#what-s-this-runtime-pclntab-anyway) about how this design choice is becoming a superlinear problem for CockroachDB's growing code base.
Putting it all together, under reasonable assumptions, the effective time complexity of `gocplntab` unwinding is the same as frame pointer unwinding, i.e. `O(N)` where `N` is the number of frames on the stack, but with higher constant overheads. This can be validated [experimentally](https://github.com/DataDog/go-profiler-notes/tree/main/examples/stack-unwind-overhead), but for most applications a good rule of thumb is to assume a cost of `~1µs` to unwind a stack trace. So if you're aiming for < 1% CPU profiling overhead in production, you should try to configure your profilers to not track more than ~10k events per second per core. That's a decent amount of data, but for some tools like the [built-in tracer](https://golang.org/pkg/runtime/trace/) stack unwinding can be a significant bottleneck. In the future this could be overcome by the Go core adding [support for frame pointer unwinding](https://github.com/golang/go/issues/16638) which might be up to [50x faster](https://github.com/felixge/gounwind) than the current `gopclntab` implementation.
Last but not least, it's worth noting that Go ships with two `.gopclntab` implementations. In addition to the one I've just described, there is another one in the [debug/gosym](https://golang.org/pkg/debug/gosym/) package that seems to be used by the linker, `go tool addr2line` and others. If you want, you can use it yourself in combination with [debug/elf](./examples/pclnttab/linux.go) or ([debug/macho](./examples/pclnttab/darwin.go)) as a starting point for your own [gopclntab adventures](./examples/pclnttab) for good or [evil](https://tuanlinh.gitbook.io/ctf/golang-function-name-obfuscation-how-to-fool-analysis-tools).
### DWARF
> An unwinder that is several hundred lines long is simply not even *remotely* interesting to me.
[DWARF](https://en.wikipedia.org/wiki/DWARF) is a standardized debugging format that is understood by many debuggers (e.g. [delve](https://github.com/go-delve/delve)) and profilers (e.g. Linux [perf](http://www.brendangregg.com/perf.html)). It enables a superset of features found in `gopclntab`, including unwinding and symbolization, but has a reputation for being very complex. The Linux kernel has famously refused to adopt DWARF unwinding for kernel stack traces:
> The whole (and *only*) point of unwinders is to make debugging easy when a bug occurs [...]. An unwinder that is several hundred lines long is simply not even *remotely* interesting to me.
> [Linus Torvalds](https://lkml.org/lkml/2012/2/10/356)
> ... so as long as the Go compiler emits valid DWARF symbols, then anything is possible. Erm, by "anything" I mean, specificially, "unwinding". Many things are still not possible, even with DWARF (and more frequently than you might think, many things are not possible precisely because DWARF exists-- true love, for instance)
> [David Sanchez](https://www.linkedin.com/in/david-sanchez-34289130/) (my colleague, who knows more about DWARF than any sane person probably should)
This lead to the [creation](https://lwn.net/Articles/728339/) of the [ORC unwinder](https://www.kernel.org/doc/html/latest/x86/orc-unwinder.html) which is now available in the kernel as yet another unwinding mechanism. However, ORCs play no role for Go stack traces, we only have to fight with ELFs and DWARFs here.
The Go compiler always emits DWARF (v4) information for the binaries it produces. The format is standardized, so unlike `gopclntab`, external tools can rely on it. However, the DWARF data is also largely redundant with `gopclntab` and negatively impacts build times and binary sizes. Because of this Rob Pike is proposing to [disable it by default](https://github.com/golang/go/issues/26074), but it's still under discussion.
Unlike `gopclntab`, DWARF information can easily be stripped from binaries at build time like this:
### ORC, LBR, etc.
```
go build -ldflags=-w <pkg>
```
Just like `-fomit-frame-pointers` this is a bit of a devil's bargain, but some people don't believe in the distinction between DWARF and the devil. So if you're willing to sign a waiver of liability to your colleagues, you may proceed. Seriously so, I'd advise you to only strip DWARF symbols if it solves an important problem for you. Once DWARF information has been stripped, you won't be able to use perf, delve or other tools to profile or debug your applications in production.
As far as the inner workings of DWARF are concerned, the [official spec](http://dwarfstd.org/) has 460 pages of wisdom on the matter. For our purposes it's probably sufficient to say that DWARF is a superset of [gopclntab](#gopclntab) and works very similar. I.e. you've got tables mapping `pc` addresses to stack pointer deltas which allow you to unwind the stack without the need for frame pointers.
## Symbolization
To be written ...
Symbolization is the process of taking one or more program counter (`pc`) address and turning them into human readable symbols such a function names, file names and line numbers. For example if you have two `pc` values like this:
## Overhead
```
0x1064ac1
0x1035683
```
To be written ...
You may use symbolization to turn them into a human readable stack trace like shown below:
- Frame Pointers: 2% for Go, Linux 5-10%
```
main.foo()
/path/to/go-profiler-notes/examples/stack-trace/main.go:9
main.main()
/path/to/go-profiler-notes/examples/stack-trace/main.go:5
```
In fact, [until recently](https://lwn.net/Articles/727553/), the Linux Kernel rejected anything other than frame pointers for stack unwinding despite causing up to 5-10% slowdown for some workloads. Go reported the overhead as 2% in the Go 1.7 release notes.
In the Go runtime, symbolization always uses the symbol information contained in the [gopclntab](#gopclntab) section. This information can also be access via [`runtime.CallerFramers()`](https://golang.org/pkg/runtime/#CallersFrames).
File size.
## Accuracy
To be written ...
### Function Inlining
### Frame Pointer Race Condition
To be written ...
### Goroutine Stack Truncation
To be written ...
### cgo
To be written ...
### pprof Labels
To be written ...
3rd party profilers such a Linux perf can't use [gopclntab](#gopclntab) and have to rely on [DWARF](#dwarf) for symbolization instead.
## History