mirror of
https://github.com/DataDog/go-profiler-notes.git
synced 2026-06-21 00:46:51 +08:00
Wording
This commit is contained in:
parent
1be84098ce
commit
1503f608a0
1 changed files with 9 additions and 5 deletions
|
|
@ -120,20 +120,24 @@ Go's stack implementation described above is making an important tradeoff when i
|
|||
|
||||
Unwinding (or stack walking) is the process of collecting all the return addresses (see red elements in [Stack Layout](#stack-layout)) from the stack. Together with the current instruction pointer register (`rip`) they form a list of program counter (`pc`) values that can be turned into a human readable stack trace via [Symbolization](#symbolization).
|
||||
|
||||
The Go runtime, including the builtin profilers, exclusively use [.gopclntab](#gopclntab) for unwinding. However, we'll start with describing [Frame Pointer](#frame-pointer) unwinding first, because it is much easier to understand and might [become supported in the future](https://github.com/golang/go/issues/16638).
|
||||
The Go runtime, including the builtin profilers, exclusively use [.gopclntab](#gopclntab) for unwinding. However, we'll start with describing [Frame Pointer](#frame-pointer) unwinding first, because it is much easier to understand and might [become supported in the future](https://github.com/golang/go/issues/16638). After that we'll also discuss [DWARF](#dwarf) which is yet another way to unwind go stacks.
|
||||
|
||||
For those not familiar with it, below is a simple diagram showing the relevant sections of a typical Go binary file that we'll be discussing here. They are always wrapped inside either the ELF, Mach-O or PE container format, depending on the operating system.
|
||||
|
||||
<img src="./go-binary.png" width="200"/>
|
||||
|
||||
### Frame Pointer
|
||||
|
||||
Frame pointer unwinding is the simple process of following the base pointer register (`rbp`) to the first frame pointer on the stack which points to the next frame pointer and so on. In other words, it is following the orange lines in the [Stack Layout](#stack-layout) graphic. For each visited frame pointer, the return address (pc) sitting 8 bytes above the frame pointer is collected along the way. That's it : ).
|
||||
|
||||
The main downside to frame pointers is that pushing them onto the stack adds some performance overhead to every function call during normal program execution. The Go authors estimated an average 2% execution overhead for an average program in the [Go 1.7 release notes](https://golang.org/doc/go1.7). Because of this compilers such as `gcc` offer options such as `-fomit-frame-pointers` to omit them for better performance. However, it's a devil's bargain: It gives you small performance win right away, but it reduces your ability to debug and diagnose performance issues in the future. Because of this the general advice is:
|
||||
The main downside to frame pointers is that pushing them onto the stack adds some performance overhead to every function call during normal program execution. The Go authors estimated an average 2% execution overhead for an average program in the [Go 1.7 release notes](https://golang.org/doc/go1.7). Another data point is the Linux kernel where overheads of [5 - 10% were observed](https://lore.kernel.org/lkml/20170602104048.jkkzssljsompjdwy@suse.de/T/#u) for e.g. sqlite and pgbench. Because of this compilers such as `gcc` offer options such as `-fomit-frame-pointers` to omit them for better performance. However, it's a devil's bargain: It gives you small performance win right away, but it reduces your ability to debug and diagnose performance issues in the future. Because of this the general advice is:
|
||||
|
||||
> Always compile with frame pointers. Omitting frame pointers is an evil compiler optimization that breaks debuggers, and sadly, is often the default.
|
||||
> – [Brendan Gregg](http://www.brendangregg.com/perf.html)
|
||||
|
||||
In Go you don't even need this advice. Since Go 1.7 frame pointers are enabled by default for 64 bit binaries, and there is no `-fomit-frame-pointers` footgun available. This allows Go to be compatible with third party debuggers and profilers such as [Linux perf](http://www.brendangregg.com/perf.html) out of the box.
|
||||
|
||||
If you'd like to see frame pointer unwinding in action, you can check out [this toy project](https://github.com/felixge/gounwind) which a faster and simpler alternative to the `runtime.Callers()` implementation. The simplicity should speak for itself when compared to the other unwinding methods described below. It should also be clear that frame pointer unwinding has `O(N)` time complexity where `N` is the number of stack frames that need to be traversed.
|
||||
If you'd like to see a simple frame pointer unwinding implementation, you can check out [this toy project](https://github.com/felixge/gounwind) which has a light-weight alternative to `runtime.Callers()`. The simplicity should speak for itself when compared to the other unwinding methods described below. It should also be clear that frame pointer unwinding has `O(N)` time complexity where `N` is the number of stack frames that need to be traversed.
|
||||
|
||||
Despite the apparent simplicity, frame pointer unwinding is no panacea. Frame pointers are pushed to the stack by the callee, so for interrupt based profiling there is an inherent race condition that might cause you to miss the caller of the current function in your stack trace. Additionally frame pointer unwinding alone can't identify inlined function calls. So at least some of the complexity of [.gopclntab](#gopclntab) or [DWARF](#dwarf) is essential to enable accurate unwinding.
|
||||
|
||||
|
|
@ -151,9 +155,9 @@ Each frame lookup begins with the current `pc` which is passed to [`findfunc()`]
|
|||
|
||||
The [_func](https://github.com/golang/go/blob/9baddd3f21230c55f0ad2a10f5f20579dcf0a0bb/src/runtime/runtime2.go#L825) meta data that we just retrieved contains a `pcsp` offset into the `pctab` table that maps program counters to stack pointer deltas. To decode this information, we call [`funcspdelta()`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L903) which does a linear search over all program counters that change the `sp delta` of the function until it finds the closest (`pc`, `sp delta`) pair. For stacks with recursive call cycles, a tiny program counter cache is used to avoid doing lots of duplicated work.
|
||||
|
||||
Now that that we have the stack pointer delta, we we are almost ready to locate the next `return address (pc)` value of the caller and do the same lookup for it until we reach the "bottom" of the stack. But before that, we need to check if the current `pc` is part of one or more inlined function calls. This is done by checking the `_FUNCDATA_InlTree` data for the current `_func` and doing another linear search over the (`pc`, `inline index`) pairs in that table. Any inlined call found this way gets a virtual stack frame `pc` added to the list. Then we continue with `return address (pc)` as mentioned in the beginning of the paragraph.
|
||||
Now that that we have the stack pointer delta, we are almost ready to locate the next `return address (pc)` value of the caller and do the same lookup for it until we reach the "bottom" of the stack. But before that, we need to check if the current `pc` is part of one or more inlined function calls. This is done by checking the `_FUNCDATA_InlTree` data for the current `_func` and doing another linear search over the (`pc`, `inline index`) pairs in that table. Any inlined call found this way gets a virtual stack frame `pc` added to the list. Then we continue with `return address (pc)` as mentioned in the beginning of the paragraph.
|
||||
|
||||
Putting it all together, under reasonable assumptions, the effective time complexity of `gocplntab` unwinding is the same as frame pointer unwinding, i.e. `O(N)` where `N` is the number of frames on the stack. This can be validated [experimentally](https://github.com/DataDog/go-profiler-notes/tree/main/examples/stack-unwind-overhead), but for most applications a good rule of thumb is to assume a cost of `~1µs` per stack trace. So if you're aiming for < 1% CPU overhead in production, you should try to configure your profilers to not track more than ~10k events per second per core. That's a decent amount of data, for some tools (like the [built-in tracer](https://golang.org/pkg/runtime/trace/)) stack unwinding can become a significant bottleneck. In the future this could be overcome by the Go adding [support for frame pointer unwinding](https://github.com/golang/go/issues/16638) which might be up to [50x faster](https://github.com/felixge/gounwind) than the current `gopclntab` implementation.
|
||||
Putting it all together, under reasonable assumptions, the effective time complexity of `gocplntab` unwinding is the same as frame pointer unwinding, i.e. `O(N)` where `N` is the number of frames on the stack, but with higher constant overheads. This can be validated [experimentally](https://github.com/DataDog/go-profiler-notes/tree/main/examples/stack-unwind-overhead), but for most applications a good rule of thumb is to assume a cost of `~1µs` per stack trace. So if you're aiming for < 1% CPU overhead in production, you should try to configure your profilers to not track more than ~10k events per second per core. That's a decent amount of data, for some tools (like the [built-in tracer](https://golang.org/pkg/runtime/trace/)) stack unwinding can become a significant bottleneck. In the future this could be overcome by the Go adding [support for frame pointer unwinding](https://github.com/golang/go/issues/16638) which might be up to [50x faster](https://github.com/felixge/gounwind) than the current `gopclntab` implementation.
|
||||
|
||||
Another concern when it comes to `.gopclntab` overhead is the way it increases the file size of your binary. Up until Go 1.2 this data was stored in a compressed format which negatively impacted startup time. Then the implementation was changed to eliminate the startup cost at an increase of binary size. Raphael Poss has written a [great article](https://dr-knz.net/go-executable-size-visualization-with-d3.html#what-s-this-runtime-pclntab-anyway) about how this design choice is becoming a superlinear problem for CockroachDB's growing code base.
|
||||
|
||||
|
|
|
|||
Loading…
Reference in a new issue