# Another use for the instructions used in **atomics** In the section on **atomics** we saw how the ARM V8 load linked / store conditional instructions can be used to create atomic operations on variables in memory. Here, for review, we present an atomic increment: ```text .text // 1 .p2align 2 // 2 // 3 #if defined(__APPLE__) // 4 .global _LoadLinkedStoreConditional // 5 _LoadLinkedStoreConditional: // 6 #else // 7 .global LoadLinkedStoreConditional // 8 LoadLinkedStoreConditional: // 9 #endif // 10 1: ldaxr w1, [x0] // 11 add w1, w1, 1 // 12 stlxr w2, w1, [x0] // 13 cbnz w2, 1b // 14 ret // 15 ``` The nonsense between lines 4 and 10 declare the label in ways compatible with both Apple M and Linux. The interesting part happens from line 11 through line 14. Line 11 dereferences a pointer to an `int32_t` putting its current value into `w1`. Line 12 is the increment. Notice the dereference instruction is not the usual `ldr`. Instead it is `ldaxr` which is a dereference that marks the memory location in `x0` as a load for which we're hoping for exclusivity. Hoping. We don't actually know if we had exclusive access to the memory location until the `stlxr` returns 0, meaning no one else has attempted to change the value at the location. If `stlxr` doesn't return 0, then the value WE have is stale. So, we try again. ## Making a spin-lock When one has a shared resource used by more than one thread it must be protected. This is the nugget to be aware of when working with threads. Take a look at this thread worker: ```text void Worker(int32_t id) { // 1 int32_t counter = 0; // 2 while (counter < 4) { // 3 Lock(&lock_variable); // 4 counter++; // 5 cout << "thread: " << id << " counter: " << counter << endl;// 6 std::this_thread::sleep_for(chrono::milliseconds(5)); // 7 Unlock(&lock_variable); // 8 sched_yield(); // 9 } // 10 } ``` The purpose of the worker is to print something to the console 4 times then exit. The shared resource is the console itself. Without protecting the console, threads will step over each other trying to print to it. Here is a sample of what could happen without our spin-lock: ```text thread: 0thread: 3 counter: 1 thread: 7 counter: 1 counter: thread: thread: thread: 10thread: 5 counter: 1 thread: counter: thread: 121 counter: thread: 8 counter: 113 thread: thread: 2thread: counter: 151 counter: ``` With our spin-lock, here's what we might get: ```text thread: 12 counter: 3 thread: 4 counter: 2 thread: 7 counter: 4 thread: 3 counter: 2 thread: 1 counter: 4 thread: 2 counter: 4 thread: 13 counter: 3 thread: 12 counter: 4 ``` Line 7 stresses the lock. Line 9 causes the currently running thread to voluntarily deschedule. This makes the output more interesting. With out it, after unlocking, the same thread may regain the lock immediately. Now let's look at the spin-lock. But first, a spin-lock is called a spin-lock because a thread that doesn't get the lock will `spin` trying to get it. This wastes time and generates heat, using electricity. Bummer. Here is the source code to the spin-lock for ARM V8. ```text #if defined(__APPLE__) // 1 _Lock: // 2 #else // 3 Lock: // 4 #endif // 5 START_PROC // 6 mov w3, 1 // 7 1: ldaxr w1, [x0] // 8 cbnz w1, 1b // lock taken - spin. // 9 stlxr w2, w3, [x0] // 10 cbnz w2, 1b // shucks - somebody meddled. // 11 ret // 12 END_PROC // 13 ``` Line 8 does a `ldaxr` dereferencing the lock itself (once again an `int32_t`) and marks the location of the lock as being hopefully, exclusive. Having gotten the value of the lock, on line 8, its value is inspected and if found to be non-zero, we branch back to attempting to get it again - this is the spin. If the contents of the lock is 0, its value in `w1` is changed to non-zero. Note, this could be made a bit better if a value of 1 was stored in another `w` register and simply used directly on line 10. Line 10 conditionally stores the changed value back to the location of the lock. If the `stlxr` returns 0, we got the lock. If not, we start over - somebody else got in there ahead of us. Perhaps this happened because we were descheduled. Perhaps we lost the lock to another thread running on a different core. The unlock looks like this: ```text #if defined(__APPLE__) // 1 _Unlock: // 2 #else // 3 Unlock: // 4 #endif // 5 START_PROC // 6 str wzr, [x0] // 7 dmb ish // 8 ret // 9 END_PROC // 10 ``` All it does is set to value of the lock to zero. The correct operation of the lock requires that no bad actor simply stomps on the lock by calling `Unlock` without first owning the lock. Just say no to lock stompers. Line 8 sets up a data memory barrier across each processor - it makes sure threads running on different cores see the update correctly. This code seemed to work without this line but intuition suggests it could be important. In `Lock()` the `stlxr` instruction has an implied data memory barrier. Please see the source code located [here](./spin_lock.S) for some additional comments regarding the implementation.