First version of gopclntab that doesn't suck

Felix Geisendörfer 2021-04-12 12:50:47 +02:00
parent ed2fa54799
commit f0b434a394

@@ -112,7 +112,7 @@ If you want to try it out yourself, perhaps modify the example program to spawn
### cgo
Go's stack implementation described above is making an important tradeoff when it comes to interacting with code written in languages that follow platform calling conventions such as C. Instead of being able to call such functions directly, Go has to perform [complicated rituals](https://golang.org/src/runtime/cgocall.go) for switching between goroutine stacks and OS-allocated stacks that can run C code. This comes with a certain amount of performance overhead and also poses complex issues for capturing stack traces during profiling.
Go's stack implementation described above is making an important tradeoff when it comes to interacting with code written in languages that follow platform calling conventions such as C. Instead of being able to call such functions directly, Go has to perform [complicated rituals](https://golang.org/src/runtime/cgocall.go) for switching between goroutine stacks and OS-allocated stacks that can run C code. This comes with a certain amount of performance overhead and also poses complex issues for capturing stack traces during profiling, see [runtime.SetCgoTraceback()](https://golang.org/pkg/runtime/#SetCgoTraceback).
🚧 I'll try to describe this in more detail in the future.
@@ -133,27 +133,29 @@ The main downside to frame pointers is that pushing them onto the stack adds som
In Go you don't even need this advice. Since Go 1.7 frame pointers are enabled by default for 64 bit binaries, and there is no `-fomit-frame-pointers` footgun available. This allows Go to be compatible with third party debuggers and profilers such as [Linux perf](http://www.brendangregg.com/perf.html) out of the box.
If you'd like to see frame pointer unwinding in action, you can check out [this tiny code snippet](https://github.com/felixge/gounwind/blob/5cc8505361807a22169595999689bd793ed6d391/gounwind.go) which is a fast alternative to the official `runtime.Callers()` implementation. The simplicity should speak for itself when compared to the other unwinding methods described below. It should also be clear that frame pointer unwinding has `O(N)` time complexity where `N` is the number of stack frames that need to be traversed.
If you'd like to see frame pointer unwinding in action, you can check out [this toy project](https://github.com/felixge/gounwind) which is a faster and simpler alternative to the `runtime.Callers()` implementation. The simplicity should speak for itself when compared to the other unwinding methods described below. It should also be clear that frame pointer unwinding has `O(N)` time complexity where `N` is the number of stack frames that need to be traversed.
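To make the `O(N)` walk concrete, here is a minimal sketch that simulates frame pointer unwinding over a fake stack modeled as a slice of words. It is not how gounwind or the runtime does it (real unwinding reads memory starting from the frame pointer register). The `unwindFP` name and the toy stack layout are assumptions made for illustration.

```go
package main

import "fmt"

// unwindFP walks a simulated stack. Each frame stores the caller's frame
// pointer at index fp and the return address at index fp+1, mirroring the
// x86-64 layout. A saved frame pointer of 0 marks the outermost frame.
// Hypothetical helper for illustration only.
func unwindFP(stack []uint64, fp uint64) []uint64 {
	var pcs []uint64
	for fp != 0 && fp+1 < uint64(len(stack)) {
		pcs = append(pcs, stack[fp+1]) // return address sits above the saved fp
		fp = stack[fp]                 // follow the link to the caller's frame
	}
	return pcs
}

func main() {
	// Index 0 is unused; the frames live at indexes 1, 4 and 7.
	stack := []uint64{0, 4, 0x1003, 0, 7, 0x1002, 0, 0, 0x1001}
	fmt.Printf("%#x\n", unwindFP(stack, 1))
}
```

Each loop iteration does a constant amount of work per frame, which is where the linear complexity comes from.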
Despite the apparent simplicity, frame pointer unwinding is no panacea. Frame pointers are pushed to the stack by the callee, so for interrupt based profiling there is a race condition that might cause you to miss the caller of the current function in your stack trace. Additionally frame pointer unwinding can't unwind inlined function calls. So at least some of the complexity of [.gopclntab](#gopclntab) or [DWARF](#dwarf) is essential to enable accurate unwinding.
Despite the apparent simplicity, frame pointer unwinding is no panacea. Frame pointers are pushed to the stack by the callee, so for interrupt based profiling there is an inherent race condition that might cause you to miss the caller of the current function in your stack trace. Additionally frame pointer unwinding alone can't identify inlined function calls. So at least some of the complexity of [.gopclntab](#gopclntab) or [DWARF](#dwarf) is essential to enable accurate unwinding.
### .gopclntab
### gopclntab
Despite frame pointers being available on 64bit platforms, Go is not leveraging them for unwinding ([this might change](https://github.com/golang/go/issues/16638)). Instead Go ships with its own idiosyncratic unwinding tables that are embedded in the `.gopclntab` section of any Go binary. `.gopclntab` stands for "go program counter line table", but this is a bit of a misnomer as it contains various tables and meta data required for unwinding and symbolization. For unwinding, the general idea is to embed a table that maps every program counter (`pc`) to the current distance (delta) of the stack pointer (`rsp`) from the nearest `return address (pc)` above it. The initial lookup uses the `pc` from the `rip` instruction pointer register and then uses the `return address (pc)` for the next lookup and so on.
Despite frame pointers being available on 64bit platforms, Go is not leveraging them for unwinding ([this might change](https://github.com/golang/go/issues/16638)). Instead Go ships with its own idiosyncratic unwinding tables that are embedded in the `gopclntab` section of any Go binary. `gopclntab` stands for "go program counter line table", but this is a bit of a misnomer as it contains various tables and meta data required for unwinding and symbolization.
As far as unwinding is concerned, the general idea is to embed a "virtual frame pointer table" (called `pctab`) inside of `gopclntab` that maps program counters (`pc`) to the distance (aka `sp delta`) between the stack pointer (`rsp`) and the `return address (pc)` above it. The initial lookup in this table uses the `pc` from the `rip` instruction pointer register and then uses the `return address (pc)` for the next lookup and so on. This way you can always unwind regardless of whether or not you have physical frame pointers on the stack.
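The lookup loop described above can be sketched roughly as follows. This is a toy model, not the runtime's code: the stack is a slice of words, and the `spDeltas` map stands in for `pctab`, which in reality is a compactly encoded per-function table. All names here are hypothetical.

```go
package main

import "fmt"

// spDeltas is a toy stand-in for pctab: for each pc it stores how many
// words lie between the stack pointer and the return address above it.
// Using a map is an assumption made to keep the sketch short.
type spDeltas map[uint64]int

// unwindTable collects the current pc and every return address found by
// repeatedly looking up the sp delta for the current pc.
func unwindTable(stack []uint64, pc uint64, sp int, tab spDeltas) []uint64 {
	pcs := []uint64{pc}
	for {
		delta, ok := tab[pc]
		if !ok || sp+delta >= len(stack) {
			return pcs // unknown pc or end of stack: stop unwinding
		}
		pc = stack[sp+delta] // return address of the current frame
		sp += delta + 1      // the caller's sp is just above the return address
		pcs = append(pcs, pc)
	}
}

func main() {
	// The stack grows towards index 0, and sp starts at index 0.
	stack := []uint64{0xaa, 0xbb, 0x2002, 0xcc, 0x2001}
	tab := spDeltas{0x2003: 2, 0x2002: 1, 0x2001: 0}
	fmt.Printf("%#x\n", unwindTable(stack, 0x2003, 0, tab))
}
```

Note that no frame pointer is ever read from the stack; the table alone tells us where each return address lives.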
Russ Cox initially described some of the involved data structures in his [Go 1.2 Runtime Symbol Information](https://golang.org/s/go12symtab) document, but it's very outdated by now and it's probably better to look at the current implementation directly. The relevant files are [runtime/traceback.go](https://github.com/golang/go/blob/go1.16.3/src/runtime/traceback.go) and [runtime/symtab.go](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go), so let's dive in.
There are various use cases for stack traces in Go, but they all end up hitting the [`gentraceback()`](https://github.com/golang/go/blob/go1.16.3/src/runtime/traceback.go#L76-L86) function. If the caller is e.g. `runtime.Callers()` the function only needs to do unwinding, but e.g. `panic()` wants text output, which requires symbolization as well. Additionally the code has to deal with the difference between [link register architectures](https://en.wikipedia.org/wiki/Link_register) such as ARM that work a little differently from x86. This combination of unwinding, symbolization, support for different architectures and bespoke data structures might just be a regular day in the shop for the system developers on the Go team, but it's definitely been tricky for me, so please watch out for potential inaccuracies in my description below.
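The split between unwinding and symbolization is also visible in the public API. The sketch below first collects raw program counters with `runtime.Callers()` and only afterwards resolves them with `runtime.CallersFrames()`. The `callerNames` helper is a made-up name for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// callerNames captures raw pcs first (unwinding only) and then resolves
// them to function names in a separate step (symbolization).
func callerNames() []string {
	pcs := make([]uintptr, 32)
	n := runtime.Callers(1, pcs) // skip=1 omits runtime.Callers itself
	frames := runtime.CallersFrames(pcs[:n])
	var names []string
	for {
		frame, more := frames.Next()
		names = append(names, frame.Function)
		if !more {
			break
		}
	}
	return names
}

func main() {
	fmt.Println(strings.Join(callerNames(), "\n"))
}
```

A profiler that records many stacks can store only the `pcs` slice and defer the more expensive symbolization step until the data is reported.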
Each frame lookup begins with the current `pc` which is passed to [`findfunc()`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L671) which looks up the meta data for the function that contains the `pc`. Historically this was done using `O(log N)` binary search, but [nowadays](https://go-review.googlesource.com/c/go/+/2097/) there is a hash-map-like index of [`findfuncbucket`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L671) structs that usually directly guides us to the right entry using an `O(1)` algorithm. So at this point the overall complexity is still the same as frame pointer unwinding, but it's worth noting that the constant overheads are already significantly higher.
Each frame lookup begins with the current `pc` which is passed to [`findfunc()`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L671) which looks up the meta data for the function that contains the `pc`. Historically this was done using `O(log N)` binary search, but [nowadays](https://go-review.googlesource.com/c/go/+/2097/) there is a hash-map-like index of [`findfuncbucket`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L671) structs that usually directly guides us to the right entry using an `O(1)` algorithm.
The [_func](https://github.com/golang/go/blob/9baddd3f21230c55f0ad2a10f5f20579dcf0a0bb/src/runtime/runtime2.go#L825) meta data that we just retrieved contains a `pcsp` offset into the `pctab` table that maps program counters to stack pointer deltas. To decode this information, we call [`funcspdelta()`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L903) which does an `O(N)` linear search over all program counters that change the `sp delta` of the function until it finds the closest (`pc`, `sp delta`) pair. For stacks with recursive call cycles, a tiny program counter cache is used to avoid doing lots of duplicated work.
The [_func](https://github.com/golang/go/blob/9baddd3f21230c55f0ad2a10f5f20579dcf0a0bb/src/runtime/runtime2.go#L825) meta data that we just retrieved contains a `pcsp` offset into the `pctab` table that maps program counters to stack pointer deltas. To decode this information, we call [`funcspdelta()`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L903) which does a linear search over all program counters that change the `sp delta` of the function until it finds the closest (`pc`, `sp delta`) pair. For stacks with recursive call cycles, a tiny program counter cache is used to avoid doing lots of duplicated work.
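The linear search can be illustrated with a simplified model. In this sketch each entry holds a value and the first `pc` for which that value no longer applies; the runtime stores this information in a delta-encoded varint format rather than as structs. The `pcValue` type and the `spDeltaAt` function are hypothetical names.

```go
package main

import "fmt"

// pcValue says that value applies to every pc below upTo that was not
// already covered by an earlier entry. Simplified for illustration.
type pcValue struct {
	upTo  uint64
	value int32
}

// spDeltaAt scans the entries of a single function in order and returns
// the value of the first range that contains pc.
func spDeltaAt(entries []pcValue, pc uint64) (int32, bool) {
	for _, e := range entries {
		if pc < e.upTo {
			return e.value, true
		}
	}
	return 0, false // pc lies beyond the end of the function
}

func main() {
	// A made-up function occupying pcs 0x1000 to 0x1023: the prologue
	// grows the frame by 24 bytes and the epilogue shrinks it again.
	entries := []pcValue{{0x1004, 0}, {0x1020, 24}, {0x1024, 0}}
	delta, ok := spDeltaAt(entries, 0x1010)
	fmt.Println(delta, ok)
}
```

The cost of a lookup grows with the number of `sp delta` changes in the function, not with the number of functions in the binary.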
Now that we have the stack pointer delta, we are almost ready to locate the next `return address (pc)` value of the caller and do the same lookup for it until we reach the "bottom" of the stack. But before that, we need to check if the current `pc` is part of one or more inlined function calls. This is done by checking the `_FUNCDATA_InlTree` data for the current `_func` and doing another linear search over the (`pc`, `inline index`) pairs in that table. Any inlined call found this way gets a virtual stack frame `pc` added to the list. Then we continue with `return address (pc)` as mentioned in the beginning of the paragraph.
Now that we have the stack pointer delta, we are almost ready to locate the next `return address (pc)` value of the caller and do the same lookup for it until we reach the "bottom" of the stack. But before that, we need to check if the current `pc` is part of one or more inlined function calls. This is done by checking the `_FUNCDATA_InlTree` data for the current `_func` and doing another linear search over the (`pc`, `inline index`) pairs in that table. Any inlined call found this way gets a virtual stack frame `pc` added to the list. Then we continue with `return address (pc)` as mentioned in the beginning of the paragraph.
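As a rough illustration of how one physical frame can expand into several virtual ones, the sketch below walks a toy inline tree from the innermost inlined call outwards. The `inlinedCall` struct here is a simplified assumption; it keeps only a parent index and a name, and `expandInlined` is a hypothetical helper.

```go
package main

import "fmt"

// inlinedCall is a simplified inline tree node.
type inlinedCall struct {
	parent int    // index of the enclosing inlined call, -1 if none
	name   string // name of the inlined function
}

// expandInlined returns the virtual frames for the inline index found for
// the current pc, innermost first, followed by the physical function.
func expandInlined(tree []inlinedCall, idx int, physical string) []string {
	var frames []string
	for idx >= 0 && idx < len(tree) {
		frames = append(frames, tree[idx].name)
		idx = tree[idx].parent
	}
	return append(frames, physical)
}

func main() {
	// main.a contains an inlined call to main.b, which in turn contains
	// an inlined call to main.c.
	tree := []inlinedCall{{-1, "main.b"}, {0, "main.c"}}
	fmt.Println(expandInlined(tree, 1, "main.a"))
}
```

An inline index of `-1` means the `pc` is not part of any inlined call, so only the physical function is reported.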
Putting it all together, for non-recursive call stacks without inlining, the complexity for `gopclntab` unwinding is `O(N*M)` where `N` is the number of frames on the stack, and `M` is the average size of the generated machine code per function. This can be validated [experimentally](https://github.com/DataDog/go-profiler-notes/tree/main/examples/stack-unwind-overhead), but in the real world I'd expect the average `N` and `M` to be fairly similar for most non-trivial Go applications, so unwinding a stack (without symbolization) will generally cost `1-10µs`. That being said, naive frame pointer unwinding appears to be [50x faster](https://github.com/felixge/gounwind), and does less cache thrashing, so high-resolution profiling and tracing use cases would likely benefit from seeing [support for it in the core](https://github.com/golang/go/issues/16638).
Putting it all together, under reasonable assumptions, the effective time complexity of `gopclntab` unwinding is the same as frame pointer unwinding, i.e. `O(N)` where `N` is the number of frames on the stack. This can be validated [experimentally](https://github.com/DataDog/go-profiler-notes/tree/main/examples/stack-unwind-overhead), but for most applications a good rule of thumb is to assume a cost of `~1µs` per stack trace. So if you're aiming for < 1% CPU overhead in production, you should try to configure your profilers to not track more than ~10k events per second per core. That's a decent amount of data, but for some tools (like the [built-in tracer](https://golang.org/pkg/runtime/trace/)) stack unwinding can become a significant bottleneck. In the future this could be overcome by Go adding [support for frame pointer unwinding](https://github.com/golang/go/issues/16638) which might be up to [50x faster](https://github.com/felixge/gounwind) than the current `gopclntab` implementation.
Another aspect of `.gopclntab` is the way it increases the file size of your binary. Up until Go 1.2 this unwinding and symbolization table was stored in compressed form which negatively impacted startup time. Then the implementation was changed to eliminate the startup cost at an increase of binary size. Raphael Poss has written a [great article](https://dr-knz.net/go-executable-size-visualization-with-d3.html#what-s-this-runtime-pclntab-anyway) about how this design choice is becoming a superlinear problem for CockroachDB's growing code base.
Another concern when it comes to `.gopclntab` overhead is the way it increases the file size of your binary. Up until Go 1.2 this data was stored in a compressed format which negatively impacted startup time. Then the implementation was changed to eliminate the startup cost at an increase of binary size. Raphael Poss has written a [great article](https://dr-knz.net/go-executable-size-visualization-with-d3.html#what-s-this-runtime-pclntab-anyway) about how this design choice is becoming a superlinear problem for CockroachDB's growing code base.
Last but not least, it's worth noting that Go ships with two `.gopclntab` implementations. In addition to the one I've just described, there is another one in the [debug/gosym](https://golang.org/pkg/debug/gosym/) package that seems to be used by the linker, `go tool addr2line` and others. If you want, you can use it yourself in combination with [debug/elf](./examples/pclnttab/linux.go) (or [debug/macho](./examples/pclnttab/darwin.go)) as a starting point for your own [gopclntab adventures](./examples/pclnttab) for good or [evil](https://tuanlinh.gitbook.io/ctf/golang-function-name-obfuscation-how-to-fool-analysis-tools).
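For a rough idea of what such a starting point could look like on Linux, here is a minimal sketch that is independent of the linked examples. It assumes the binary is an ELF file whose sections are named `.gopclntab` and `.text`, which may not hold for every build mode. The `openTable` helper is a made-up name.

```go
package main

import (
	"debug/elf"
	"debug/gosym"
	"errors"
	"fmt"
	"os"
)

// openTable loads the gopclntab section of an ELF Go binary into a
// gosym.Table. The section names are assumptions, see the text above.
func openTable(path string) (*gosym.Table, error) {
	f, err := elf.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	pclntab, text := f.Section(".gopclntab"), f.Section(".text")
	if pclntab == nil || text == nil {
		return nil, errors.New("missing .gopclntab or .text section")
	}
	data, err := pclntab.Data()
	if err != nil {
		return nil, err
	}
	return gosym.NewTable(nil, gosym.NewLineTable(data, text.Addr))
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: pclntab <elf-binary>")
		return
	}
	table, err := openTable(os.Args[1])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Symbolize the entry pc of main.main, if the binary has one.
	if fn := table.LookupFunc("main.main"); fn != nil {
		file, line, _ := table.PCToLine(fn.Entry)
		fmt.Printf("main.main starts at %s:%d\n", file, line)
	}
}
```

Passing `nil` as the symbol table to `gosym.NewTable()` relies on the function information stored in the line table itself.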
@@ -209,7 +211,9 @@ To be written ...
In order to support third-party profilers such as [perf](http://www.brendangregg.com/perf.html) the [Go 1.7](https://golang.org/doc/go1.7) (2016-08-15) release started to enable frame pointers by default for [64bit binaries](https://sourcegraph.com/search?q=framepointer_enabled+repo:%5Egithub%5C.com/golang/go%24+&patternType=literal).
## Credits
A big thanks goes to [Michael Pratt](https://github.com/prattmic) for [reviewing](https://github.com/DataDog/go-profiler-notes/commit/6a62d5908079ddac9c92d319f49fde846f329c55#r49179154) parts of the `gopclntab` section in this document and catching some significant errors in my analysis.
## Disclaimers