mirror of
https://github.com/pkivolowitz/asm_book.git
synced 2026-06-21 02:16:49 +08:00
284 lines
14 KiB
Markdown
284 lines
14 KiB
Markdown
# Section 1 / Atomics
|
|
|
|

|
|
|
|
David Lynch's Dune.
|
|
|
|
## Threads
|
|
|
|
Suppose you run two copies of the same program at the same time. They
|
|
both have a variable named `i`. Can a change to `i` made by one copy of
|
|
the program impact the value of `i` in the other running copy of the
|
|
program? Of course not.
|
|
|
|
Think of running two copies of a program at the same time as having
|
|
two identical, but distinct, homes. What happens inside one house does
|
|
not impact what happens inside the house next door.
|
|
|
|
Threads are a different way of getting more than one "copy" of a program
|
|
to run at the same time. Threads are different, however, in that they
|
|
all live within the same household. All of the housemates share the
|
|
living space and all housemates have access to any global or shared
|
|
resource. This makes for great gains in performance for a broad class
|
|
of problems but also introduces great hazards.
|
|
|
|
Suppose you buy a carton of milk and place it in the fridge. If you were
|
|
the only member of the household you would expect that when we next went
|
|
to the fridge your milk would still be there, right? If you share the
|
|
household with other people, this might not be the case.
|
|
|
|
Consider the following program:
|
|
|
|
```c++
|
|
#include <iostream> // 1
|
|
#include <thread> // 2
|
|
#include <atomic> // 3
|
|
#include <vector> // 4
|
|
// 5
|
|
using std::cout; // 6
|
|
using std::endl; // 7
|
|
using std::atomic; // 8
|
|
using std::vector; // 9
|
|
using std::thread; // 10
|
|
// 11
|
|
const uint32_t MAX_LOOPS = 10000; // 12
|
|
const uint32_t NUM_THREADS = 16; // 13
|
|
// 14
|
|
/* volatile is necessary if any use of the optimizer // 15
|
|
is to be made. // 16
|
|
*/ // 17
|
|
volatile uint32_t naked_int; // 18
|
|
atomic<uint32_t> atomic_integer(0); // 19
|
|
// 20
|
|
extern "C" void LoadLinkedStoreConditional(uint32_t * value); // 21
|
|
// 22
|
|
void LLSCWorker() { // 23
|
|
extern volatile uint32_t naked_int; // 24
|
|
// 25
|
|
for (uint32_t i = 0; i < MAX_LOOPS; i++) { // 26
|
|
LoadLinkedStoreConditional((uint32_t *) &naked_int); // 27
|
|
} // 28
|
|
} // 29
|
|
// 30
|
|
void NakedWorker() { // 31
|
|
extern volatile uint32_t naked_int; // 32
|
|
// 33
|
|
for (uint32_t i = 0; i < MAX_LOOPS; i++) { // 34
|
|
naked_int++; // 35
|
|
} // 36
|
|
} // 37
|
|
// 38
|
|
void AtomicWorker() { // 39
|
|
extern atomic<uint32_t> atomic_integer; // 40
|
|
// 41
|
|
for (uint32_t i = 0; i < MAX_LOOPS; i++) { // 42
|
|
atomic_integer++; // 43
|
|
} // 44
|
|
} // 45
|
|
// 46
|
|
void DoNaked() { // 47
|
|
vector<thread *> threads; // 48
|
|
// 49
|
|
naked_int = 0; // 50
|
|
for (uint32_t i = 0; i < NUM_THREADS; i++) { // 51
|
|
threads.push_back(new thread(NakedWorker)); // 52
|
|
} // 53
|
|
// 54
|
|
for (auto &t : threads) { // 55
|
|
t->join(); // 56
|
|
} // 57
|
|
} // 58
|
|
// 59
|
|
void DoLLSC() { // 60
|
|
vector<thread *> threads; // 61
|
|
naked_int = 0; // 62
|
|
// 63
|
|
for (uint32_t i = 0; i < NUM_THREADS; i++) { // 64
|
|
threads.push_back(new thread(LLSCWorker)); // 65
|
|
} // 66
|
|
// 67
|
|
for (auto &t : threads) { // 68
|
|
t->join(); // 69
|
|
} // 70
|
|
} // 71
|
|
// 72
|
|
void DoAtomic() { // 73
|
|
vector<thread *> threads; // 74
|
|
// 75
|
|
for (uint32_t i = 0; i < NUM_THREADS; i++) { // 76
|
|
threads.push_back(new thread(AtomicWorker)); // 77
|
|
} // 78
|
|
// 79
|
|
for (auto &t : threads) { // 80
|
|
t->join(); // 81
|
|
} // 82
|
|
} // 83
|
|
// 84
|
|
int main() { // 85
|
|
// 86
|
|
DoNaked(); // 87
|
|
DoAtomic(); // 88
|
|
// 89
|
|
cout << "Correct sum is: "; // 90
|
|
cout << NUM_THREADS * MAX_LOOPS << endl; // 91
|
|
cout << "Naked sum: " << naked_int << endl; // 92
|
|
cout << "Atomic sum: " << atomic_integer << endl; // 93
|
|
// 94
|
|
DoLLSC(); // 95
|
|
cout << "LLSC sum: " << naked_int << endl; // 96
|
|
return 0; // 97
|
|
} // 98
|
|
```
|
|
|
|
This program will spawn 16 threads which will each loop 10,000 times,
|
|
adding one to a zero-initialized integer each loop. At the end, when all
|
|
the threads complete, the integer should have the value 160,000.
|
|
|
|
Alas, this is an example of the class "Hidden Update" bug. The shared
|
|
resource, the integer, will get clobbered in unpredictable ways.
|
|
|
|
For example, multiple runs might produce (snipped to show only the
|
|
output from `NakedWorker()`):
|
|
|
|
- Naked sum: 74291
|
|
- Naked sum: 79390
|
|
- Naked sum: 89115
|
|
- etc
|
|
|
|
Not only are the results wrong, they are wrong in a different way each
|
|
time.
|
|
|
|
## Serializing Access to Integer Types
|
|
|
|
C++11 introduced the notion of *atomic integers*. These do not glow.
|
|
Rather, access to them is guaranteed to be atomic... as in, cannot be
|
|
broken down.
|
|
|
|
The hidden update problem's root cause is that adding (for example) to a
|
|
value in memory involves three instructions at the assembly language
|
|
level. A load, an addition, and a store. A hidden update occurs when a
|
|
thread is yanked from the CPU in the middle of these instructions. When
|
|
the thread returns to the CPU, the store causes its stale data to
|
|
overwrite (hide) correct data.
|
|
|
|
There are many ways to avoid the hidden update problem including a large
|
|
array of synchronization mechanisms. Alternatively, one can avoid the
|
|
hidden update problem by ensuring the three instruction sequence isn't
|
|
interrupted. This can be done using atomic integer types.
|
|
|
|
## Using Atomics
|
|
|
|
First, make the appropriate include:
|
|
|
|
`#include <atomic>`
|
|
|
|
Next, make the appropriate declaration and initialize the variable.
|
|
Here, replace "integral type" with some integer type:
|
|
|
|
`atomic<integral type> atomic_integer(0);`
|
|
|
|
Notice how the initial value is provided to the atomic variable's
|
|
constructor.
|
|
|
|
Finally, use the atomic variable as you would any other integer.
|
|
|
|
## A General Implementation
|
|
|
|
The newer ARM architectures provide a single instruction solution for
|
|
addition, subtraction and various bitwise operations. These will be
|
|
described below.
|
|
|
|
For ARMv8 and for later ARM versions (to perform operations other than
|
|
those listed above), there is a general solution: an is an example of
|
|
Load Linked / Store Conditional. It involves a loop but starvation is
|
|
unlikely because a rescheduled thread is unlikely to be descheduled
|
|
again within the same loop owing to the relatively long time alloted
|
|
to each freshly scheduled entity.
|
|
|
|
```asm
|
|
.text // 1
|
|
.p2align 2 // 2
|
|
// 3
|
|
#if defined(__APPLE__) // 4
|
|
.global _LoadLinkedStoreConditional // 5
|
|
_LoadLinkedStoreConditional: // 6
|
|
#else // 7
|
|
.global LoadLinkedStoreConditional // 8
|
|
LoadLinkedStoreConditional: // 9
|
|
#endif // 10
|
|
1: ldaxr w1, [x0] // 11
|
|
add w1, w1, 1 // 12
|
|
stlxr w2, w1, [x0] // 13
|
|
cbnz w2, 1b // 14
|
|
ret // 15
|
|
```
|
|
|
|
Lines 1 and 2 are boilerplate.
|
|
|
|
The conditional assembly block from line 4 through line 10 declare the
|
|
label `LoadLinkedStoreConditional` as global for both Linux and Apple
|
|
assemblers. The label itself is also stated.
|
|
|
|
Labels marked as global must have an underscore prefix for Apple
|
|
assembly. This code does not make use of the macro package we have
|
|
written for sake of brevity and clarity.
|
|
|
|
This function is passed the address of an `int32_t`.
|
|
|
|
Line 11 loads the value found at that address into `w1` and also marks
|
|
the address as needing watching (by the hardware).
|
|
|
|
Line 12 can be expanded and / or replaced with whatever operation needed
|
|
to be done on the value.
|
|
|
|
Line 13 puts the potato on the fork. It is a store conditional which may
|
|
or may not actually write anything to memory.
|
|
|
|
To understand this instruction, [Kristien et
|
|
al](../../reference_material/USENIX2020.pdf) et al. provide this helpful
|
|
picture:
|
|
|
|

|
|
|
|
The value returned in `w2` will be 0 of the store actually took place
|
|
and will have the value of 1 if the store was rejected.
|
|
|
|
Imagine the following:
|
|
|
|
| T1 | T2 |
|
|
| -- | -- |
|
|
| Executes line 11. Gets value *N*. Location is marked. | |
|
|
| T1 is descheduled. The code to implement the context switch clears the watch.| |
|
|
| | Executes line 11. Gets value *N*. Location is marked. |
|
|
| | `w1` goes up to *N + 1* on line 12. |
|
|
| | Line 13 succeeds in storing *N + 1* to memory and the location marking is cleared |
|
|
| | T2 is descheduled. |
|
|
| T1 is scheduled - recall it has stale values. | |
|
|
| Executes line 12 making *N + 1* which is now wrong. | |
|
|
| Executes line 13 which fails because the marking is now gone. | |
|
|
| Loops around, this time picking up *N + 1* | |
|
|
| Correctly makes *N + 2* | |
|
|
|
|
Switching contexts is but one of the ways in which any watches held by
|
|
the descheduled thread are cleared. This is OK and causes the thread to
|
|
loop when it is rescheduled. It is likely to succeed the next time
|
|
around.
|
|
|
|
## Implementation of ARMv8.1A and Newer
|
|
|
|
Implementations of operations on atomic variables were improved in the
|
|
second version of ARMv8, called ARMv8.1. The load linked and store
|
|
conditional instructions are still available but several new
|
|
instructions were added which perform certain operations such as
|
|
addition, subtraction and various bitwise operations in a single atomic
|
|
instruction.
|
|
|
|
For example:
|
|
|
|
```asm
|
|
mov w1, 1
|
|
ldaddal w1, w0, [x0]
|
|
```
|
|
|
|
does the same work of atomically adding one to the value in memory
|
|
pointed to by `x0`.
|