added chapter on regs

2026-06-21 01:56:47 +08:00 · 2022-05-20 09:52:55 -05:00 · 2022-05-20 09:52:55 -05:00 · 6ed5185eba
commit 6ed5185eba
parent 531b3dd2ff
30 changed files with 1621 additions and 0 deletions
--- a/section_1/regs/Neon-lanes-and-elements.png
+++ b/section_1/regs/Neon-lanes-and-elements.png
--- a/section_1/regs/README.md
+++ b/section_1/regs/README.md
@ -0,0 +1,75 @@
+# Section 1 / Chapter 5 / Interlude - Registers
+
+We have discussed and used registers in the previous chapters without explanation. This chapter
+introduces the concept of registers and explains why registers are critical.
+
+## Types of Registers
+
+Of general interest, the ARM 64 bit ISA offers a large register set for integer types and another
+for floating point types.
+
+The register set designed for integer types is indicated by `x` and `w` variants. The two
+variations are coincident - `w0` for example, is the same underlying register as `x0`. The
+choice of letter (`x` or `w`) determines how the register is interpreted.
+
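+* The `x` registers are 64 bits wide and are used for `long` integers and addresses.
+* The `w` registers are the lower 32 bits of the `x` registers and are used for the narrower integer types (`int`, `short` and `char`).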
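+
+A minimal sketch of what "coincident" means in practice, assuming the program is stepped through in a debugger so that `x0` can be inspected after each instruction. Writing to a `w` register also clears the upper 32 bits of the matching `x` register.
+
+```asm
+        .global main
+        .text
+        .align  2
+
+main:   mov     x0, -1                  // x0 is 0xFFFFFFFFFFFFFFFF
+        mov     w0, 1                   // x0 is now 0x0000000000000001
+        mov     w0, wzr                 // return 0
+        ret
+
+        .end
+```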
+
+The registers used for floating point types (and vector operations) are coincident:
+
+* `q` registers are a massive 16 bytes wide.
+* `v` registers are also 16 bytes wide and are synonyms for the `q` registers.
+* `d` registers for `doubles` which are 8 bytes wide. 2 per `v`.
+* `s` registers for `floats` which are 4 bytes wide. 4 per `v`.
+* `h` registers for `half precision floats` which are 2 bytes wide. 8 per `v`.
+* `b` registers for byte operations. 16 per `v`.
+
+## Why Registers
+
+In most (all?) of the programming you may have done prior to learning assembly language, you've taken it for granted that
+variables are located somewhere in RAM. This has been a convenient fiction. In reality, virtually all interaction with a
+variable takes place in a register rather than in RAM. Indeed, a well written assembly language program can often do away
+with RAM to a large degree by careful use of registers.
+
+What you think of as:
+
+```c++
+i = i + 1;
+```
+
+is really:
+
+```text
+1. load the address of `i` into an `x` register
+2. dereference the `x` register to get the value of `i` from RAM into another register
+3. add one to it
+4. use the address previously loaded to store the value back to RAM 
+```
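+
+The four steps above can be sketched in assembly language. This is only a sketch, assuming `i` is a 32-bit `int` living in the `.data` section; the label `i` and its initial value are chosen purely for illustration.
+
+```asm
+        .global main
+        .text
+        .align  2
+
+main:   ldr     x1, =i                  // 1. load the address of i
+        ldr     w0, [x1]                // 2. fetch the value of i from RAM
+        add     w0, w0, 1               // 3. add one to it
+        str     w0, [x1]                // 4. store the value back to RAM
+        mov     w0, wzr                 // return 0
+        ret
+
+        .data
+i:      .word   41                      // illustrative initial value
+
+        .end
+```
+
+Only the `add` does the work the `C++` statement describes. The other three instructions exist to move data between RAM and registers.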
+
+Modern processors that are not slaves to backwards compatibility have fewer and fewer instructions that operate directly upon data in RAM. This is largely because of the stupendous difference in speed at which a processor can access its registers versus the speed at which a processor can access memory.
+
+The following two images are from the [Formulus Black Blog](https://formulusblack.com/blog/compute-performance-distance-of-data-as-a-measure-of-latency/) and are quite informative.
+
+This image relates typical latency (delay) in gaining access to data in various places.
+
+![Latency](./latency.png)
+
+This says that if we liken accessing a register (which can be done at *least* once per CPU Clock Cycle) to one second, accessing RAM would be like a 3.5 to 5.5 minute wait.
+
+In the next image, the relative latencies within a computer are expressed in a different way: What is the *effective distance* of a device from the CPU expressed in terms of the *speed of light.* Here, registers can be thought of as being less than 4 inches away from the CPU. Main memory, on the other hand, would be 70 to 100 feet away.
+
+![Latency 2](./latency2.png)
+
+Resist the urge to cling tightly to the idea of data being only found in RAM.
+
+In order to manipulate data, the data must be loaded into registers. 
+
+**YOU ARE THE HUMAN!** With planning and forethought YOU can arrange for the data you need most to be resident in registers rather than in RAM. In fact, ideally, you can organize your code and algorithms to minimize the dependence upon RAM and in some cases, you can write whole sophisticated programs using RAM for little more than a place to store string literals.
+
+## Summary
+
+Registers are truly "where the action is."
+
+Very few instructions on the ARM 64 bit ISA can manipulate the contents of RAM directly because of the enormous disparity between processor and memory speeds.
+
+Foreshadowing a more advanced topic, whenever CS people encounter devices operating at hugely different speeds, they feel compelled to create `caching systems`. Caches are, to use a technical term, *good*.
--- a/section_1/regs/align.s
+++ b/section_1/regs/align.s
@ -0,0 +1,17 @@
+        .global	main
+        .text
+        .align	2
+
+main:	mov     x0, xzr
+        ldr     x1, =ram
+        strb    w0, [x1]
+        strh    w0, [x1]
+        str     w0, [x1]
+        str     x0, [x1]
+        ret
+    
+        .data
+ram:    .quad   0xFFFFFFFFFFFFFFFF
+
+        .end
+
--- a/section_1/regs/array01.c
+++ b/section_1/regs/array01.c
@ -0,0 +1,9 @@
+long Sum(long * values, long length)
+{
+    long sum = 0;
+    for (long i = 0; i < length; i++)
+    {
+        sum += values[i];
+    }
+    return sum;
+}
--- a/section_1/regs/array01.s
+++ b/section_1/regs/array01.s
@ -0,0 +1,42 @@
+	.arch armv8-a
+	.file	"array01.c"
+	.text
+	.align	2
+	.global	Sum
+	.type	Sum, %function
+Sum:
+.LFB0:
+	.cfi_startproc
+	sub	sp, sp, #32
+	.cfi_def_cfa_offset 32
+	str	x0, [sp, 8]
+	str	x1, [sp]
+	str	xzr, [sp, 16]
+	str	xzr, [sp, 24]
+	b	.L2
+.L3:
+	ldr	x0, [sp, 24]
+	lsl	x0, x0, 3
+	ldr	x1, [sp, 8]
+	add	x0, x1, x0
+	ldr	x0, [x0]
+	ldr	x1, [sp, 16]
+	add	x0, x1, x0
+	str	x0, [sp, 16]
+	ldr	x0, [sp, 24]
+	add	x0, x0, 1
+	str	x0, [sp, 24]
+.L2:
+	ldr	x1, [sp, 24]
+	ldr	x0, [sp]
+	cmp	x1, x0
+	blt	.L3
+	ldr	x0, [sp, 16]
+	add	sp, sp, 32
+	.cfi_def_cfa_offset 0
+	ret
+	.cfi_endproc
+.LFE0:
+	.size	Sum, .-Sum
+	.ident	"GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0"
+	.section	.note.GNU-stack,"",@progbits
--- a/section_1/regs/array02.c
+++ b/section_1/regs/array02.c
@ -0,0 +1,10 @@
+long Sum(long * values, long length)
+{
+    long sum = 0;
+    long * end = values + length;
+    while (values < end)
+    {
+        sum += *(values++);
+    }
+    return sum;
+}
--- a/section_1/regs/array02.s
+++ b/section_1/regs/array02.s
@ -0,0 +1,28 @@
+	.arch armv8-a
+	.file	"array02.c"
+	.text
+	.align	2
+	.p2align 3,,7
+	.global	Sum
+	.type	Sum, %function
+Sum:
+.LFB0:
+	.cfi_startproc
+	mov	x2, x0
+	add	x1, x0, x1, lsl 3
+	cmp	x2, x1
+	mov	x0, 0
+	bcs	.L1
+	.p2align 3,,7
+.L3:
+	ldr	x3, [x2], 8
+	add	x0, x0, x3
+	cmp	x1, x2
+	bhi	.L3
+.L1:
+	ret
+	.cfi_endproc
+.LFE0:
+	.size	Sum, .-Sum
+	.ident	"GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0"
+	.section	.note.GNU-stack,"",@progbits
--- a/section_1/regs/array03.s
+++ b/section_1/regs/array03.s
@ -0,0 +1,23 @@
+    .global Sum
+    .text
+    .align  2
+
+//  x0 is the pointer to data
+//  x1 is the length and is reused as `end`
+//  x2 is the sum
+//  x3 is the current dereferenced value
+
+Sum:
+    mov     x2, xzr
+    add     x1, x0, x1, lsl 3
+    b       2f
+
+1:  ldr     x3, [x0], 8
+    add     x2, x2, x3
+2:  cmp     x0, x1
+    blt     1b
+
+    mov     x0, x2
+    ret
+
+    .end
--- a/section_1/regs/array10.c
+++ b/section_1/regs/array10.c
@ -0,0 +1,51 @@
+#include <stdio.h>
+
+struct Person
+{
+    char * fname;
+    char * lname;
+    int age;
+};
+
+extern int rand();
+extern struct Person * FindOldestPerson(struct Person *, int);
+
+struct Person * OriginalFindOldestPerson(struct Person * people, int length)
+{
+    int oldest_age = 0;
+    struct Person * oldest_ptr = NULL;
+
+    if (people)
+    {
+        struct Person * end_ptr = people + length;
+        while (people < end_ptr)
+        {
+            if (people->age > oldest_age)
+            {
+                oldest_age = people->age;
+                oldest_ptr = people;
+            }
+            people++;
+        }
+    }
+    return oldest_ptr;
+}
+
+#define LENGTH  20
+
+int main()
+{
+    struct Person array[LENGTH];
+    for (int i = 0; i < LENGTH; i++)
+    {
+        array[i].age = rand() & 5000;
+    }
+    struct Person * oldest = FindOldestPerson(array, LENGTH);
+    for (int i = 0; i < LENGTH; i++)
+    {
+        printf("%d", array[i].age);
+        if (oldest == &array[i])
+            printf("*");
+        printf("\n");
+    }
+}
--- a/section_1/regs/array10.s
+++ b/section_1/regs/array10.s
@ -0,0 +1,142 @@
+	.arch armv8-a
+	.file	"array10.c"
+	.text
+	.align	2
+	.p2align 3,,7
+	.global	OriginalFindOldestPerson
+	.type	OriginalFindOldestPerson, %function
+OriginalFindOldestPerson:
+.LFB23:
+	.cfi_startproc
+	mov	x2, x0
+	cbz	x0, .L5
+	mov	w4, 24
+	smaddl	x4, w1, w4, x0
+	mov	x0, 0
+	cmp	x2, x4
+	bcs	.L1
+	mov	w1, 0
+	.p2align 3,,7
+.L4:
+	ldr	w3, [x2, 16]
+	cmp	w1, w3
+	csel	x0, x0, x2, ge
+	add	x2, x2, 24
+	csel	w1, w1, w3, ge
+	cmp	x4, x2
+	bhi	.L4
+.L1:
+	ret
+	.p2align 2,,3
+.L5:
+	mov	x0, 0
+	ret
+	.cfi_endproc
+.LFE23:
+	.size	OriginalFindOldestPerson, .-OriginalFindOldestPerson
+	.section	.rodata.str1.8,"aMS",@progbits,1
+	.align	3
+.LC0:
+	.string	"%d"
+	.section	.text.startup,"ax",@progbits
+	.align	2
+	.p2align 3,,7
+	.global	main
+	.type	main, %function
+main:
+.LFB24:
+	.cfi_startproc
+	sub	sp, sp, #560
+	.cfi_def_cfa_offset 560
+	stp	x29, x30, [sp]
+	.cfi_offset 29, -560
+	.cfi_offset 30, -552
+	mov	x29, sp
+	stp	x23, x24, [sp, 48]
+	.cfi_offset 23, -512
+	.cfi_offset 24, -504
+	adrp	x23, :got:__stack_chk_guard
+	add	x24, sp, 568
+	ldr	x0, [x23, #:got_lo12:__stack_chk_guard]
+	stp	x21, x22, [sp, 32]
+	.cfi_offset 21, -528
+	.cfi_offset 22, -520
+	add	x21, sp, 72
+	ldr	x1, [x0]
+	str	x1, [sp, 552]
+	mov	x1,0
+	stp	x19, x20, [sp, 16]
+	.cfi_offset 19, -544
+	.cfi_offset 20, -536
+	add	x20, x21, 16
+	mov	x19, x21
+	mov	w22, 5000
+	.p2align 3,,7
+.L10:
+	bl	rand
+	and	w0, w0, w22
+	str	w0, [x20], 24
+	cmp	x20, x24
+	bne	.L10
+	mov	x0, x21
+	mov	w1, 20
+	adrp	x22, .LC0
+	bl	FindOldestPerson
+	add	x21, x21, 480
+	mov	x20, x0
+	add	x22, x22, :lo12:.LC0
+	.p2align 3,,7
+.L14:
+	ldr	w2, [x19, 16]
+	mov	x1, x22
+	mov	w0, 1
+	bl	__printf_chk
+	cmp	x20, x19
+	beq	.L18
+	add	x19, x19, 24
+	mov	w0, 10
+	bl	putchar
+	cmp	x19, x21
+	bne	.L14
+.L13:
+	ldr	x23, [x23, #:got_lo12:__stack_chk_guard]
+	ldr	x0, [sp, 552]
+	ldr	x1, [x23]
+	subs	x0, x0, x1
+	mov	x1, 0
+	bne	.L19
+	mov	w0, 0
+	ldp	x29, x30, [sp]
+	ldp	x19, x20, [sp, 16]
+	ldp	x21, x22, [sp, 32]
+	ldp	x23, x24, [sp, 48]
+	add	sp, sp, 560
+	.cfi_remember_state
+	.cfi_restore 29
+	.cfi_restore 30
+	.cfi_restore 23
+	.cfi_restore 24
+	.cfi_restore 21
+	.cfi_restore 22
+	.cfi_restore 19
+	.cfi_restore 20
+	.cfi_def_cfa_offset 0
+	ret
+	.p2align 2,,3
+.L18:
+	.cfi_restore_state
+	mov	w0, 42
+	bl	putchar
+	add	x19, x20, 24
+	mov	w0, 10
+	bl	putchar
+	cmp	x19, x21
+	bne	.L14
+	b	.L13
+.L19:
+	bl	__stack_chk_fail
+	.cfi_endproc
+.LFE24:
+	.size	main, .-main
+	.ident	"GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0"
+	.section	.note.GNU-stack,"",@progbits
--- a/section_1/regs/array11.s
+++ b/section_1/regs/array11.s
@ -0,0 +1,39 @@
+        .global FindOldestPerson
+        .text
+        .align  2
+
+//  x0  has struct Person * people
+//      will be used for oldest_ptr as this is the return value
+//  w1  has int length
+//  w2  used for oldest_age
+//  x3  used for Person *
+//  x4  used for end_ptr
+//  w5  used for scratch
+
+FindOldestPerson:
+        cbz     x0, 99f             // short circuit
+        mov     w2, wzr             // initial oldest age is 0
+        mov     x3, x0              // initialize loop pointer
+        mov     x0, xzr             // initialize return value
+        mov     w5, 24              // struct is 24 bytes wide
+        smaddl	x4, w1, w5, x3      // initialize end_ptr
+        b       10f                 // enter loop
+
+1:      ldr     w5, [x3, p.age]     // fetch loop ptr -> age
+        cmp     w2, w5              // compare to oldest_age
+        csel    w2, w2, w5, ge      // keep oldest_age unless age is strictly greater
+        csel    x0, x0, x3, ge      // keep oldest_ptr unless age is strictly greater
+        add     x3, x3, 24          // increment loop ptr
+10:     cmp     x3, x4              // has loop ptr reached end_ptr?
+        blt     1b                  // no, not yet
+
+99:     ret
+
+        .data
+        .struct 0
+p.fn:   .skip   8
+p.ln:   .skip   8
+p.age:  .skip   4
+p.pad:  .skip   4
+
+        .end
--- a/section_1/regs/backup.md
+++ b/section_1/regs/backup.md
@ -0,0 +1,124 @@
+# Backing up and Restoring Registers
+
+While it is true that there are 31 general purpose registers (the `x` and coincident `w` registers), they aren't all equally general purpose. First, let's call out the different categories of general purpose registers:
+
+* `x30` is the **Link Register**
+* `x0` through `x7` are truly scratch registers.
+* `x9` through `x15`  are registers you can't count on being preserved by functions called by your function. If you need them to be preserved, you must preserve and restore them yourself.
+* `x19` through `x28` are registers you must back up and restore if you use them.
+* `x8`, `x16` through `x18` and `x29` are used by compilers in support of the magic they do.
+
+![regs](./regs.png)
+
+The above image is due to ARM and is found [here](https://documentation-service.arm.com/static/5fbd26f271eff94ef49c7018).
+
+## What Does "Preserving" a Register Mean?
+
+To back up (or "preserve") a register means copying its contents onto the stack (i.e. storing a copy of the register in RAM). The `str` and `stp` instructions are used for this.
+
+## What Does "Restoring" a Register Mean?
+
+To restore a register means copying a previously stored copy of the contents of a register from the stack (RAM) back to the register. The `ldr` and `ldp` instructions are used for this.
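+
+A minimal sketch of preserving and restoring, assuming the C library function `puts` is available at link time. The label `msg` and its contents are illustrative only. Because `bl` overwrites `x30`, this `main` stores `x30` (paired with `x29`) before the call and loads it back afterwards.
+
+```asm
+        .global main
+        .text
+        .align  2
+
+main:   stp     x29, x30, [sp, -16]!    // preserve: copy x29 and x30 to the stack
+        ldr     x0, =msg                // first argument to puts
+        bl      puts                    // bl replaces the contents of x30
+        ldp     x29, x30, [sp], 16      // restore: copy both back from the stack
+        mov     w0, wzr                 // return 0
+        ret
+
+        .data
+msg:    .asciz  "hello"                 // illustrative string
+
+        .end
+```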
+
+## Always Manipulate the Stack in Multiples of 16
+
+Any change you make to `sp` must be a multiple of 16.
+
+Here is an example:
+
+```asm
+        .global main                                                    // 1 
+        .text                                                           // 2 
+        .align  2                                                       // 3 
+                                                                        // 4 
+main:   str     x30, [sp, -8]!                                          // 5 
+        ldr     x30, [sp], 8                                            // 6 
+        ret                                                             // 7 
+```
+
+which produces:
+
+```text
+regs > ./a.out
+Bus error (core dumped)
+regs >
+```
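+
+For comparison, here is a sketch of the same program with the stack adjusted by 16 instead of 8. Assuming the misaligned `sp` is the only cause of the bus error, this version should return normally.
+
+```asm
+        .global main
+        .text
+        .align  2
+
+main:   str     x30, [sp, -16]!         // sp moves down by 16, staying aligned
+        ldr     x30, [sp], 16           // sp moves back up by 16
+        ret
+```
+
+Eight of the sixteen bytes go unused. That is the price of keeping `sp` aligned.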
+
+## Link Register
+
+`x30` is the Link Register, a register which is automatically used to store the return address when a function call is made (i.e. the `bl` instruction). Indeed, the `bl` instruction is called Branch with Link.
+
+Nearly all functions have the implicit assumption that, at some point, they will return to whence they were called. Exceptions to this, of course, are functions like `exit()` and the `exec()` family of functions.
+
+Here's a fun program:
+
+```asm
+        .global main                                                    // 1 
+        .text                                                           // 2 
+        .align  2                                                       // 3 
+                                                                        // 4 
+main:   mov     x30, xzr                                                // 5 
+        ret                                                             // 6 
+                                                                        // 7 
+        .end                                                            // 8 
+```
+
+```text
+regs > ./a.out
+Segmentation fault (core dumped)
+regs >
+```
+
+So short. So sweet. So lethal.
+
+Manipulating the `x30` register is done automatically by the `bl` and `ret` instructions.
+
+The `bl` performs these steps:
+
+* Compute the address of the instruction following the `bl`
+* Put that address into `x30`
+* Put the address of the function being called into the Program Counter
+
+Note that the Program Counter always contains the address of the next instruction to be executed. Loading a new value into the Program Counter causes a branch to take place. The Program Counter is a register but it is not one of the general purpose registers. Its mnemonic is `pc`.
+
+The `ret` instruction copies the contents of `x30` into the `pc`, causing a branch to that address (which ought to be where the function was called from).
+
+The program above crashes because `line 5` obliterates the address to which `main()` was supposed to return.
+
+### Exception to Needing to Protect `x30`
+
+If your function:
+
+* does not itself modify `x30` and
+* does not itself call any other functions
+
+then you do not need to back up and restore `x30`.
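+
+A sketch of a function that meets both conditions. The name `AddThree` is hypothetical; it is assumed to take three `long` arguments and return their sum. Since it never writes `x30` and never executes `bl`, it touches neither the stack nor the link register.
+
+```asm
+        .global AddThree
+        .text
+        .align  2
+
+//  x0, x1 and x2 hold the three long arguments
+//  x0 also carries the return value
+
+AddThree:
+        add     x0, x0, x1              // x0 = a + b
+        add     x0, x0, x2              // x0 = a + b + c
+        ret                             // x30 is untouched since the caller's bl
+
+        .end
+```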
+
+## `x0` through `x7`
+
+These registers are truly scratch.
+
+* You can modify them at any time
+* You cannot count on their values surviving any function call you make
+
+Note these registers are used to convey up to 8 parameters to functions.
+
+## `x9` through `x15`
+
+These registers are free for you to use **but** if you call other functions, you cannot count on them being what they were when a function returns. If you need these values to be preserved across function calls, you have to preserve them.
+
+## `x19` through `x28`
+
+These registers are free for you to use **but** if you modify them, you **must** preserve them.
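+
+A sketch of putting one of these registers to work. The function name `PrintTwice` is hypothetical and `puts` is assumed to come from the C library. The pointer argument arrives in `x0`, which is scratch, so it is parked in `x19` to survive the first call. Because `x19` is modified, its old value is preserved on entry and restored before returning.
+
+```asm
+        .global PrintTwice
+        .text
+        .align  2
+
+//  x0  has char * s on entry
+//  x19 keeps a copy of s across the calls to puts
+
+PrintTwice:
+        stp     x19, x30, [sp, -16]!    // preserve x19 and the link register
+        mov     x19, x0                 // park s in a register puts must preserve
+        bl      puts                    // x0 is scratch: it may not survive this call
+        mov     x0, x19                 // recover s
+        bl      puts
+        ldp     x19, x30, [sp], 16      // restore x19 and the link register
+        ret
+
+        .end
+```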
+
+## `x8`, `x16` through `x18` and `x29`
+
+While these registers *are* general purpose registers and you *can* use them, compilers use these to facilitate certain functions they do including easing the use of a debugger. If you're flitting back and forth from assembly language to higher level languages you might think about avoiding their use. `x29` in particular is known as the `frame pointer` and is used for debugging. It will be explained in more detail elsewhere.
+
+## Restatement
+
+| Registers | Preserve? |
+| --------- | --------- |
+| `x0`...`x7` | All bets are off - no promises made |
+| `x9`...`x15` | If you are counting on their value across function calls, you must preserve them |
+| `x19`...`x28` | If you use them you must preserve them |
--- a/section_1/regs/badstack.s
+++ b/section_1/regs/badstack.s
@ -0,0 +1,7 @@
+        .global main
+        .text
+        .align  2
+
+main:   str     x30, [sp, -8]!
+        ldr     x30, [sp], 8
+        ret
--- a/section_1/regs/bar.c
+++ b/section_1/regs/bar.c
@ -0,0 +1,4 @@
+char Foo(char a, char b)
+{
+	return a + b;
+}
--- a/section_1/regs/cast.c
+++ b/section_1/regs/cast.c
@ -0,0 +1,36 @@
+#include <stdio.h>
+
+void Foo()
+{
+    unsigned char c = 1;
+    unsigned short s = 2;
+    unsigned int i = 4;
+    unsigned long l = 8;
+
+    s += (unsigned short) c;
+    i += (unsigned int) s;
+    l += (unsigned long) i;
+}
+
+void Bar()
+{
+    int i = 4;
+    long l = 8;
+
+    i += (int) l;
+}
+
+int main()
+{
+    char c = 1;
+    short s = 2;
+    int i = 4;
+    long l = 8;
+
+    s += (short) c;
+    i += (int) s;
+    l += (long) i;
+    l += c;
+    c += l;
+    return 0;
+}
--- a/section_1/regs/cast.s
+++ b/section_1/regs/cast.s
@ -0,0 +1,171 @@
+	.arch armv8-a
+	.file	"cast.c"
+// GNU C17 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (aarch64-linux-gnu)
+//	compiled by GNU C version 9.3.0, GMP version 6.2.0, MPFR version 4.0.2, MPC version 1.1.0, isl version isl-0.22.1-GMP
+
+// GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
+// options passed:  -imultiarch aarch64-linux-gnu cast.c -mlittle-endian
+// -mabi=lp64 -Wall -fverbose-asm -fasynchronous-unwind-tables
+// -fstack-protector-strong -Wformat-security -fstack-clash-protection
+// options enabled:  -fPIC -fPIE -faggressive-loop-optimizations
+// -fassume-phsa -fasynchronous-unwind-tables -fauto-inc-dec -fcommon
+// -fdelete-null-pointer-checks -fdwarf2-cfi-asm -fearly-inlining
+// -feliminate-unused-debug-types -ffp-int-builtin-inexact -ffunction-cse
+// -fgcse-lm -fgnu-runtime -fgnu-unique -fident -finline-atomics
+// -fipa-stack-alignment -fira-hoist-pressure -fira-share-save-slots
+// -fira-share-spill-slots -fivopts -fkeep-static-consts
+// -fleading-underscore -flifetime-dse -flto-odr-type-merging -fmath-errno
+// -fmerge-debug-strings -fomit-frame-pointer -fpeephole -fplt
+// -fprefetch-loop-arrays -freg-struct-return
+// -fsched-critical-path-heuristic -fsched-dep-count-heuristic
+// -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
+// -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
+// -fsched-stalled-insns-dep -fschedule-fusion -fsemantic-interposition
+// -fshow-column -fshrink-wrap-separate -fsigned-zeros
+// -fsplit-ivs-in-unroller -fssa-backprop -fstack-clash-protection
+// -fstack-protector-strong -fstdarg-opt -fstrict-volatile-bitfields
+// -fsync-libcalls -ftrapping-math -ftree-cselim -ftree-forwprop
+// -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon
+// -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop
+// -ftree-reassoc -ftree-scev-cprop -funit-at-a-time -funwind-tables
+// -fverbose-asm -fzero-initialized-in-bss -mfix-cortex-a53-835769
+// -mfix-cortex-a53-843419 -mglibc -mlittle-endian
+// -momit-leaf-frame-pointer -mpc-relative-literal-loads
+
+	.text
+	.align	2
+	.global	Foo
+	.type	Foo, %function
+Foo:
+.LFB0:
+	.cfi_startproc
+	sub	sp, sp, #16	//,,
+	.cfi_def_cfa_offset 16
+// cast.c:5:     unsigned char c = 1;
+	mov	w0, 1	// tmp93,
+	strb	w0, [sp, 1]	// tmp93, c
+// cast.c:6:     unsigned short s = 2;
+	mov	w0, 2	// tmp94,
+	strh	w0, [sp, 2]	// tmp94, s
+// cast.c:7:     unsigned int i = 4;
+	mov	w0, 4	// tmp95,
+	str	w0, [sp, 4]	// tmp95, i
+// cast.c:8:     unsigned long l = 8;
+	mov	x0, 8	// tmp96,
+	str	x0, [sp, 8]	// tmp96, l
+// cast.c:10:     s += (unsigned short) c;
+	ldrb	w0, [sp, 1]	// tmp97, c
+	and	w0, w0, 65535	// _1, tmp97
+	ldrh	w1, [sp, 2]	// tmp98, s
+	add	w0, w0, w1	// tmp99, _1, tmp100
+	strh	w0, [sp, 2]	// tmp101, s
+// cast.c:11:     i += (unsigned int) s;
+	ldrh	w0, [sp, 2]	// _2, s
+// cast.c:11:     i += (unsigned int) s;
+	ldr	w1, [sp, 4]	// tmp103, i
+	add	w0, w1, w0	// tmp102, tmp103, _2
+	str	w0, [sp, 4]	// tmp102, i
+// cast.c:12:     l += (unsigned long) i;
+	ldr	w0, [sp, 4]	// _3, i
+// cast.c:12:     l += (unsigned long) i;
+	ldr	x1, [sp, 8]	// tmp105, l
+	add	x0, x1, x0	// tmp104, tmp105, _3
+	str	x0, [sp, 8]	// tmp104, l
+// cast.c:13: }
+	nop	
+	add	sp, sp, 16	//,,
+	.cfi_def_cfa_offset 0
+	ret	
+	.cfi_endproc
+.LFE0:
+	.size	Foo, .-Foo
+	.align	2
+	.global	Bar
+	.type	Bar, %function
+Bar:
+.LFB1:
+	.cfi_startproc
+	sub	sp, sp, #16	//,,
+	.cfi_def_cfa_offset 16
+// cast.c:17:     int i = 4;
+	mov	w0, 4	// tmp91,
+	str	w0, [sp, 4]	// tmp91, i
+// cast.c:18:     long l = 8;
+	mov	x0, 8	// tmp92,
+	str	x0, [sp, 8]	// tmp92, l
+// cast.c:20:     i += (int) l;
+	ldr	x0, [sp, 8]	// tmp93, l
+	mov	w1, w0	// _1, tmp93
+// cast.c:20:     i += (int) l;
+	ldr	w0, [sp, 4]	// tmp95, i
+	add	w0, w0, w1	// tmp94, tmp95, _1
+	str	w0, [sp, 4]	// tmp94, i
+// cast.c:21: }
+	nop	
+	add	sp, sp, 16	//,,
+	.cfi_def_cfa_offset 0
+	ret	
+	.cfi_endproc
+.LFE1:
+	.size	Bar, .-Bar
+	.align	2
+	.global	main
+	.type	main, %function
+main:
+.LFB2:
+	.cfi_startproc
+	sub	sp, sp, #16	//,,
+	.cfi_def_cfa_offset 16
+// cast.c:25:     char c = 1;
+	mov	w0, 1	// tmp99,
+	strb	w0, [sp, 1]	// tmp99, c
+// cast.c:26:     short s = 2;
+	mov	w0, 2	// tmp100,
+	strh	w0, [sp, 2]	// tmp100, s
+// cast.c:27:     int i = 4;
+	mov	w0, 4	// tmp101,
+	str	w0, [sp, 4]	// tmp101, i
+// cast.c:28:     long l = 8;
+	mov	x0, 8	// tmp102,
+	str	x0, [sp, 8]	// tmp102, l
+// cast.c:30:     s += (short) c;
+	ldrb	w0, [sp, 1]	// tmp103, c
+	and	w1, w0, 65535	// _1, tmp103
+	ldrh	w0, [sp, 2]	// s.0_2, s
+	add	w0, w1, w0	// tmp104, _1, s.0_2
+	and	w0, w0, 65535	// _3, tmp104
+	strh	w0, [sp, 2]	// tmp105, s
+// cast.c:31:     i += (int) s;
+	ldrsh	w0, [sp, 2]	// _4, s
+// cast.c:31:     i += (int) s;
+	ldr	w1, [sp, 4]	// tmp107, i
+	add	w0, w1, w0	// tmp106, tmp107, _4
+	str	w0, [sp, 4]	// tmp106, i
+// cast.c:32:     l += (long) i;
+	ldrsw	x0, [sp, 4]	// _5, i
+// cast.c:32:     l += (long) i;
+	ldr	x1, [sp, 8]	// tmp109, l
+	add	x0, x1, x0	// tmp108, tmp109, _5
+	str	x0, [sp, 8]	// tmp108, l
+// cast.c:33:     l += c;
+	ldrb	w0, [sp, 1]	// _6, c
+	ldr	x1, [sp, 8]	// tmp111, l
+	add	x0, x1, x0	// tmp110, tmp111, _6
+	str	x0, [sp, 8]	// tmp110, l
+// cast.c:34:     c += l;
+	ldr	x0, [sp, 8]	// tmp112, l
+	and	w0, w0, 255	// _7, tmp112
+	ldrb	w1, [sp, 1]	// tmp113, c
+	add	w0, w0, w1	// tmp114, _7, tmp115
+	strb	w0, [sp, 1]	// tmp116, c
+// cast.c:35:     return 0;
+	mov	w0, 0	// _17,
+// cast.c:36: }
+	add	sp, sp, 16	//,,
+	.cfi_def_cfa_offset 0
+	ret	
+	.cfi_endproc
+.LFE2:
+	.size	main, .-main
+	.ident	"GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0"
+	.section	.note.GNU-stack,"",@progbits
--- a/section_1/regs/casting.s
+++ b/section_1/regs/casting.s
@ -0,0 +1,23 @@
+        .global     main
+        .text
+        .align      2
+
+main:   mov     x0, -1
+        ldr     x1, =ram
+        ldrb    w0, [x1]
+        // p/x $x0 here
+        mov     x0, -1
+        mov     x1, 1
+        mov     w0, w1
+        // p/x $x0 here
+        mov     x0, -1
+        movk    w0, 1
+        // p/x $x0 here
+
+        mov     x0, xzr
+        ret
+
+        .data
+ram:    .quad   0x01
+
+        .end
--- a/section_1/regs/crash01.s
+++ b/section_1/regs/crash01.s
@ -0,0 +1,8 @@
+        .global main
+        .text
+        .align  2
+
+main:   mov     x30, xzr
+        ret
+
+        .end
--- a/section_1/regs/eggs.jpeg
+++ b/section_1/regs/eggs.jpeg
--- a/section_1/regs/endiness.s
+++ b/section_1/regs/endiness.s
@ -0,0 +1,11 @@
+        .global	main
+        .text
+        .align	2
+
+main:	mov     x0, xzr
+        ret
+    
+        .data
+ram:    .quad   0xAABBCCDDEEFF0011
+        .end
+
--- a/section_1/regs/foo.s
+++ b/section_1/regs/foo.s
@ -0,0 +1,14 @@
+	.global	foo
+	.text
+	.align	2
+
+foo:	
+	ret
+
+	.section	.rodata
+	.quad		0xff
+
+	.bss
+	.quad		0xff
+
+	.end
--- a/section_1/regs/latency.png
+++ b/section_1/regs/latency.png
--- a/section_1/regs/latency2.png
+++ b/section_1/regs/latency2.png
--- a/section_1/regs/ldr.md
+++ b/section_1/regs/ldr.md
@ -0,0 +1,456 @@
+# Section 1 / Chapter 6 / Interlude - Load and Store
+
+In this section we will review the `ldr` family of instructions. By extension, this section covers `ldp`, `str` and `stp` instructions indirectly.
+
+Several example programs will be presented.
+
+As has been explained previously, modern CPUs are so much faster than RAM that fewer and fewer instructions are designed to operate on RAM directly. Instead, values are loaded from RAM into registers where they are used and possibly
+modified. If a value has been modified and needs to be kept, it is stored from the register back to RAM.
+
+## Loading Data From RAM into Registers
+
+The instructions most commonly used to retrieve information from memory are `ldr` and `ldp`. The characters `ld` in these mnemonics bring to mind `load`. `ldr` is "load a register" while `ldp` is "load a pair of registers".
+
+Both of these instructions possess many variations, only a few of which will be described here. Common to all variations of the `ldr` and `ldp` instructions are the notions of *where to fetch from* and *where to store what's been fetched*.
+
+Like many AARCH64 instructions, the most basic form of the load instructions is read right to left as in:
+
+```asm
+    ldr    x0, [x1]
+```
+
+which means "go to the location in RAM specified by `x1` and load what's there into `x0`."
+
+Similarly,
+
+```asm
+    ldp    x0, x1, [sp]
+```
+
+loads a *pair* of registers from RAM at the address specified by the stack pointer.
+
+## Offsets
+
+To facilitate dereferencing `structs` and for accessing `arrays`, an offset may be specified.
+
+There are significant restrictions placed on offsets because (among other reasons) the entire instruction (including the encoding of the offset) must fit within the constant 4 byte width of all AARCH64 instructions.
+
+Here is text from an [ARM manual](https://developer.arm.com/documentation/dui0801/h/A64-Data-Transfer-Instructions/LDR--immediate-):
+
+```text
+1) LDR Xt, [Xn|SP{, #pimm}] ; 64-bit general registers
+2) LDR Xt, [Xn|SP], #simm ; 64-bit general registers, Post-index
+3) LDR Xt, [Xn|SP, #simm]! ; 64-bit general registers, Pre-index
+```
+
+These say you can load an `x` register (for simplicity we have ignored `w` registers) by dereferencing another `x` register or the stack pointer (i.e. `[Xn|SP]`).
+
+Line 1 says you can *optionally* specify an offset.
+
+Lines 2 and 3 say you can specify a *change* to the dereferenced register either before the actual fetch or after.
+
+Assume `ptr` is a pointer to a `long`:
+
+* Line 2 corresponds to: `*(ptr++)`. 
+* Line 3 corresponds to: `*(++ptr)`.
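+
+The three forms can be seen side by side in a short sketch. The label `array` and its contents are illustrative only; the comments track what `x0` and `x1` hold after each instruction.
+
+```asm
+        .global main
+        .text
+        .align  2
+
+main:   ldr     x1, =array              // x1 points at array[0]
+        ldr     x0, [x1, 8]             // offset: x0 = 20, x1 is unchanged
+        ldr     x0, [x1], 8             // post-index: x0 = 10, then x1 points at array[1]
+        ldr     x0, [x1, 8]!            // pre-index: x1 points at array[2], then x0 = 30
+        mov     w0, wzr                 // return 0
+        ret
+
+        .data
+array:  .quad   10, 20, 30              // illustrative values
+
+        .end
+```
+
+Only the post-index and pre-index forms modify `x1`. The plain offset form leaves the base register alone.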
+
+Concerning the restrictions placed on the offsets:
+
+* `simm` can be in the range of -256 to 255 (a 9-bit signed value).
+* `pimm` can be in the range of 0 to 32760 in multiples of 8.
+
+`w` registers are used for `int`, `short` and `char`. When working with `int`, `pimm` must be a multiple of 4 (0 to 16380). When working with `short`, `pimm` must be even (0 to 8190).
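+
+When the data lies beyond the range an offset can express, one option is to compute the address with a separate instruction first. The following is a sketch only; the label `big` and the sizes are chosen for illustration.
+
+```asm
+        .global main
+        .text
+        .align  2
+
+main:   ldr     x1, =big
+        ldr     x0, [x1, 32760]         // largest pimm for an x register
+        add     x2, x1, 32768           // 32768 does not fit as an offset...
+        ldr     x0, [x2]                // ...so compute the address first
+        mov     w0, wzr                 // return 0
+        ret
+
+        .bss
+big:    .skip   32776                   // room for both loads
+
+        .end
+```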
+
+## Examples
+
+### Array Indexing 1 - Wasteful
+
+Consider this code to sum up the values in an array:
+
+```c
+long Sum(long * values, long length)                                    /* 1 */
+{                                                                       /* 2 */
+    long sum = 0;                                                       /* 3 */
+    for (long i = 0; i < length; i++)                                   /* 4 */
+    {                                                                   /* 5 */
+        sum += values[i];                                               /* 6 */
+    }                                                                   /* 7 */
+    return sum;                                                         /* 8 */
+}                                                                       /* 9 */
+```
+
+We're not going to translate this to assembly language. Instead, we will call out how inefficient this code is. Notice we're using the index variable `i` for nothing more than traipsing through the array. This is fantastically inefficient (in this case).
+
+### Array Indexing 2 - More Efficiently
+
+Consider the following code that performs the same function:
+
+```c
+long Sum(long * values, long length)                                    /* 1 */
+{                                                                       /* 2 */
+    long sum = 0;                                                       /* 3 */
+    long * end = values + length;                                       /* 4 */
+    while (values < end)                                                /* 5 */
+    {                                                                   /* 6 */
+        sum += *(values++);                                             /* 7 */
+    }                                                                   /* 8 */
+    return sum;                                                         /* 9 */
+}                                                                       /* 10 */
+```
+
+Notice we don't use an index variable any longer. Instead, we use the pointer itself for both the dereferencing *and* to tell us when to stop the loop.
+
+`values` begins as the address of the first `long` in the array. On `line 4` we leverage *address arithmetic* to determine where to stop. `end` gets the address of the `long` just beyond the end of the array. When we get there, we stop.
+
+This approach, which avoids the overhead of a loop variable, works well in both `C` and `C++`. It is *similar in spirit*
+to this in `C++`:
+
+```c++
+    vector<Foo> foov;
+    for (auto it = foov.begin(); it < foov.end(); it++)
+```
+
+Here is a hand translation of the above `C` code for function `Sum()`:
+
+```asm
+    .global Sum                                                         // 1 
+    .text                                                               // 2 
+    .align  2                                                           // 3 
+                                                                        // 4 
+//  x0 is the pointer to data                                           // 5 
+//  x1 is the length and is reused as `end`                             // 6 
+//  x2 is the sum                                                       // 7 
+//  x3 is the current dereferenced value                                // 8 
+                                                                        // 9 
+Sum:                                                                    // 10 
+    mov     x2, xzr                                                     // 11 
+    add     x1, x0, x1, lsl 3                                           // 12 
+    b       2f                                                          // 13 
+                                                                        // 14 
+1:  ldr     x3, [x0], 8                                                 // 15 
+    add     x2, x2, x3                                                  // 16 
+2:  cmp     x0, x1                                                      // 17 
+    blt     1b                                                          // 18 
+                                                                        // 19 
+    mov     x0, x2                                                      // 20 
+    ret                                                                 // 21 
+                                                                        // 22 
+    .end                                                                // 23 
+```
+
+Recall that `Sum(long * values, long length)` means that `x0` has the address of the first long in the array.
+
+* We know it's an `x` register because it is an address.
+* We know it is the `0` register because it is the first argument.
+
+`x1` contains `length`.
+
+* We know it is an `x` register because it is a `long`.
+* We know it is the `1` register because it is the second argument.
+
+`Line 11` shows the first use of a "zero register," in this case `xzr`. Reading from a zero register always 
+returns zero. Writing to a zero register is ignored.
+
+`Line 12` is the first really interesting line. It implements `line 4` of the higher level language. That is:
+
+```c
+    long * end = values + length;
+```
+
+is implemented as:
+
+```asm
+    add     x1, x0, x1, lsl 3 
+```
+
+We are performing address arithmetic on `longs`. Each `long` is 8 bytes wide. `x1, lsl 3` means "before adding the value of `x1` to `x0`, multiply `x1` by 8." Eight is 2 raised to the power of 3. `lsl 3` means shift left by 3 bits ... shifting left is a fast way to multiply an integer by a power of 2 (shifting right divides).
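+
+The equivalence between the shift and ordinary pointer arithmetic can be sketched in `C`. This is only an illustration and not part of the chapter's source: the helper name `EndByShift` is hypothetical, and `int64_t` stands in for an 8 byte `long`.
+
+```c
+#include <stdint.h>
+
+/* Hypothetical helper: computes values + length by hand, shifting
+   the element count left by 3 bits (multiplying by 8) and adding
+   the resulting byte count to the base address. */
+const int64_t * EndByShift(const int64_t * values, long length)
+{
+    uintptr_t base  = (uintptr_t) values;           /* like x0        */
+    uintptr_t bytes = (uintptr_t) length << 3;      /* like x1, lsl 3 */
+    return (const int64_t *) (base + bytes);
+}
+```
+
+For an array `a` of `int64_t`, `EndByShift(a, n)` should produce the same address as `a + n`.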
+
+`Line 13` is the branch to the *bottom* of the loop where the decision code is written. We saw how this can save an instruction [here](../for/README.md).
+
+`Line 15` is the `ldr` instruction which performs not only the load (dereference) but also the *post increment* of the pointer.
+
+```c
+    sum += *(values++);                                                 /* 7 */
+```
+
+is implemented by both `lines 15` and `16` in the assembly language.
+
+```asm
+1:  ldr     x3, [x0], 8                                                 // 15 
+    add     x2, x2, x3                                                  // 16 
+```
+
+`Line 17` compares our current position in the array (`x0`) to the address just past the end of the array (`x1`).
+
+`Line 18` says that as long as `x0` (or "where we are now") is less than the end of the array (in `x1`), we keep looping.
+
+`Line 20` copies the accumulated sum into `x0` where values returned from functions are expected to be found.
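+
+To see the shape of the assembly language in more familiar terms, here is a sketch of the same function written in `C` with `goto`, so that the decision test sits at the bottom just as it does above. The name `SumBottomTested` is hypothetical. Nobody would write `C` this way on purpose; it is only meant to line up with the instructions one for one.
+
+```c
+/* Same behavior as Sum(), restructured to mirror the assembly:
+   jump to the test first and keep the loop body above it. */
+long SumBottomTested(const long * values, long length)
+{
+    long sum = 0;                           /* mov  x2, xzr           */
+    const long * end = values + length;     /* add  x1, x0, x1, lsl 3 */
+    goto test;                              /* b    2f                */
+body:
+    sum += *(values++);                     /* ldr then add           */
+test:
+    if (values < end)                       /* cmp  x0, x1            */
+        goto body;                          /* blt  1b                */
+    return sum;                             /* mov  x0, x2 then ret   */
+}
+```
+
+An empty array (a `length` of 0) falls straight through the test and returns 0 without ever touching memory.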
+
+### Fast Memory Copy
+
+This is a heavily contrived example. In reality it is a fun challenge to write an optimal general purpose `memcpy` function. Or, you can just use `memcpy`.
+
+For the purposes of this discussion, ignore issues relating to alignment.
+
+Suppose you needed to copy 16 bytes of memory from one place to another. You might do it like this:
+
+```c++
+void SillyMove16(uint8_t * dest, uint8_t * src)
+{
+    for (int i = 0; i < 16; i++)
+        *(dest++) = *(src++);
+}
+```
+
+This is especially silly. Why go through 16 iterations of a loop when you could simply do this:
+
+```c++
+void SillyCopy16(uint64_t * dest, uint64_t * src)
+{
+    *(dest++) = *(src++); // 3
+    *dest = *src;         // 4
+}
+```
+
+`Line 3` dereferences `src`, holds the value that's there and increments `src` by the size of a `long`. The assignment puts the value dereferenced from `src` into the location specified by `dest` and increments the pointer afterwards.
+
+`Line 4` is simplified because this silly move is only two `longs` long. Since this is the second copy of 8 out of 16
+bytes, we have no need to increment the pointers.
+
+In assembly language, this could be written:
+
+```asm
+SillyCopy16:              // x0 is dest, x1 is src
+    ldr    x2, [x1], 8    // 2
+    str    x2, [x0], 8    // 3
+    ldr    x2, [x1]       // 4
+    str    x2, [x0]       // 5
+    ret
+```
+
+`Lines 2` and `3` increment `x1` and `x0` to the next `long` **after** dereferencing them.
+
+Then again, what about the *pair* load and store instructions? Can these help? Yes!
+
+```asm
+SillyCopy16:
+    ldp    x2, x3, [x1]
+    stp    x2, x3, [x0]
+    ret
+```
+
+As an interesting aside, remember the `q` registers? They are 16 bytes wide by themselves.
+
+```asm
+SillyCopy16:
+    ldr    q2, [x1]
+    str    q2, [x0]
+    ret
+```
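+
+For comparison, the "just use `memcpy`" route can be sketched as follows. The wrapper name `Copy16` is hypothetical. Optimizing compilers commonly expand a small, constant-size `memcpy` inline instead of calling the library, but which instructions you actually get depends on the compiler and its flags, so check the generated assembly language rather than assuming.
+
+```c
+#include <string.h>
+
+/* Hypothetical wrapper: copies exactly 16 bytes from src to dest.
+   The two buffers must not overlap. */
+void Copy16(void * dest, const void * src)
+{
+    memcpy(dest, src, 16);
+}
+```
+
+Because `memcpy` works on bytes, this version makes no assumption about the alignment of either pointer.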
+
+### Indexing Through An Array of `struct`
+
+You should read the chapter on `struct` found [here](where?).
+
+Here is a more elaborate case study. Given this:
+
+```c
+#include <stdio.h>                                                      /* 1 */
+                                                                        /* 2 */
+struct Person                                                           /* 3 */
+{                                                                       /* 4 */
+    char * fname;                                                       /* 5 */
+    char * lname;                                                       /* 6 */
+    int age;                                                            /* 7 */
+};                                                                      /* 8 */
+                                                                        /* 9 */
+extern int rand();                                                      /* 10 */
+extern struct Person * FindOldestPerson(struct Person *, int);          /* 11 */
+                                                                        /* 12 */
+struct Person * OriginalFindOldestPerson(struct Person * people, int length) /* 13 */
+{                                                                       /* 14 */
+    int oldest_age = 0;                                                 /* 15 */
+    struct Person * oldest_ptr = NULL;                                  /* 16 */
+                                                                        /* 17 */
+    if (people)                                                         /* 18 */
+    {                                                                   /* 19 */
+        struct Person * end_ptr = people + length;                      /* 20 */
+        while (people < end_ptr)                                        /* 21 */
+        {                                                               /* 22 */
+            if (people->age > oldest_age)                               /* 23 */
+            {                                                           /* 24 */
+                oldest_age = people->age;                               /* 25 */
+                oldest_ptr = people;                                    /* 26 */
+            }                                                           /* 27 */
+            people++;                                                   /* 28 */
+        }                                                               /* 29 */
+    }                                                                   /* 30 */
+    return oldest_ptr;                                                  /* 31 */
+}                                                                       /* 32 */
+                                                                        /* 33 */
+#define LENGTH  20                                                      /* 34 */
+                                                                        /* 35 */
+int main()                                                              /* 36 */
+{                                                                       /* 37 */
+    struct Person array[LENGTH];                                        /* 38 */
+    for (int i = 0; i < LENGTH; i++)                                    /* 39 */
+    {                                                                   /* 40 */
+        array[i].age = rand() % 5000;                                   /* 41 */
+    }                                                                   /* 42 */
+    struct Person * oldest = FindOldestPerson(array, LENGTH);           /* 43 */
+    for (int i = 0; i < LENGTH; i++)                                    /* 44 */
+    {                                                                   /* 45 */
+        printf("%d", array[i].age);                                     /* 46 */
+        if (oldest == &array[i])                                        /* 47 */
+            printf("*");                                                /* 48 */
+        printf("\n");                                                   /* 49 */
+    }                                                                   /* 50 */
+}                                                                       /* 51 */
+```
+
+`FindOldestPerson()` will march through the array of `struct Person` to find the oldest individual returning a pointer to that `struct`. In case of a tie, the first person found will be returned. The array is checked against `NULL` and if found to be so, `NULL` is returned.
+
+`gcc` with `-O2` or `-O3` optimization rendered `OriginalFindOldestPerson()` into 18 lines of assembly language.
+
+This example is more "real world" in that it offers us the chance to work with `w` registers (`int`). It also demonstrates `csel` which is like the `C` and `C++` `ternary operator`.
+
+```asm
+        .global FindOldestPerson                                        // 1 
+        .text                                                           // 2 
+        .align  2                                                       // 3 
+                                                                        // 4 
+//  x0  has struct Person * people                                      // 5 
+//      will be used for oldest_ptr as this is the return value         // 6 
+//  w1  has int length                                                  // 7 
+//  w2  used for oldest_age                                             // 8 
+//  x3  used for Person *                                               // 9 
+//  x4  used for end_ptr                                                // 10 
+//  w5  used for scratch                                                // 11 
+                                                                        // 12 
+FindOldestPerson:                                                       // 13 
+        cbz     x0, 99f             // short circuit                    // 14 
+        mov     w2, wzr             // initial oldest age is 0          // 15 
+        mov     x3, x0              // initialize loop pointer          // 16 
+        mov     x0, xzr             // initialize return value          // 17 
+        mov     w5, 24              // struct is 24 bytes wide          // 18 
+        smaddl  x4, w1, w5, x3      // initialize end_ptr               // 19 
+        b       10f                 // enter loop                       // 20 
+                                                                        // 21 
+1:      ldr     w5, [x3, p.age]     // fetch loop ptr -> age            // 22 
+        cmp     w2, w5              // compare to oldest_age            // 23 
+        csel    w2, w2, w5, ge      // update based on cmp              // 24 
+        csel    x0, x0, x3, ge      // update based on cmp              // 25 
+        add     x3, x3, 24          // increment loop ptr               // 26 
+10:     cmp     x3, x4              // has loop ptr reached end_ptr?    // 27 
+        blt     1b                  // no, not yet                      // 28 
+                                                                        // 29 
+99:     ret                                                             // 30 
+                                                                        // 31 
+        .data                                                           // 32 
+        .struct 0                                                       // 33 
+p.fn:   .skip   8                                                       // 34 
+p.ln:   .skip   8                                                       // 35 
+p.age:  .skip   4                                                       // 36 
+p.pad:  .skip   4                                                       // 37 
+                                                                        // 38 
+        .end                                                            // 39 
+```
+
+Before we get to the explanation, permit us a small pat on the back. The above version, written by us humans, rendered `FindOldestPerson()` in 15 lines.
+
+`Lines 5` through `11` are vitally important comments. You should always write comments like these as they will serve as your "dictionary" to help you keep track of what particular registers will be used for.
+
+`x0` begins as the pointer to `struct Person` being passed to us. `x0` is also used for returning values from a function so we'll copy `x0` to `x3` on `line 16`. This will save us an instruction later as we won't have to copy the intended return value back to `x0` prior to the `ret` on `line 30`.
+
+`w1` is passed to us as the length of the array. It is in a `w` register because we defined it as an `int`.
+
+`w2` will hold the oldest age found so far. It is a `w` register because we defined age as an `int`.
+
+`x3` is described above under `x0`.
+
+`x4` will be set to the address after the end of the array and will be used to stop our loop.
+
+`w5` is used for scratch.
+
+Recall that registers 0 through 7 are scratch registers and do not have to be backed up or restored.
+
+`Line 14` is a combination compare AND branch instruction.
+
+```asm
+       cbz     x0, 99f
+```
+
+says "Check `x0`. If it is zero, branch forward to temporary label 99." The `cbz` mnemonic means "compare and branch if zero." There is also a `cbnz` instruction branching if not zero.
+
+The `cbz` and `cbnz` instructions exist because testing against zero is so common.
+
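+This instruction branches under the same condition as the following pair, although `cbz` does not alter the status bits the way `cmp` does: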
+
+```asm
+        cmp    x0, xzr
+        beq    99f
+```
+
+We did not choose the value 99 at random. This line was written at a time when we did not yet know what other temporary labels we might be needing. 99 is a large number that we would be unlikely to run into with other temporary labels. You might establish the habit of using a value like 99 to be your favorite value for the last label in a function.
+
+`Line 14` implements `line 18` of the `C` code.
+
+The closing brace found on `line 30` of the `C` code is implemented on `line 30` of the assembly language code. A coincidence, surely.
+
+`Line 15` establishes the oldest age found so far as being 0.
+
+`Line 16` copies the base address of the array to `x3` from `x0`. The value arrives in `x0` because it is the first parameter to the function. It must be an `x` register because it is a pointer. We need a pointer to march through the array. `x0` serves double duty as holding the first parameter but also as the place where function return values are found.
+
+We copy `x0` out to `x3` so that we can use `x0` to store a pointer to the array element representing the oldest person found so far. If we iterated over the array using `x0`, we would still a) need another `x` register to hold the pointer to the oldest person so far and b) have to copy this register to `x0` before we return anyway. Marching through the array in a register *other* than `x0` saves us one instruction.
+
+`Line 17` initializes `x0` after we've preserved its original value in `x3`.
+
+`Line 18` puts the value of 24 into `w5`. This register is used for scratch or intermediate calculation purposes. We're setting up the calculation which ends with the pointer to just beyond the end of the array. The size of the `struct Person` is 24 bytes (not 20): two 8 byte pointers and a 4 byte `int` add up to 20 bytes, and 4 bytes of padding follow so that every element of an array stays 8 byte aligned. We considered allowing the assembler to compute this for us but chose instead to hard code the value.
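+
+If you would rather ask the compiler than count bytes by hand, a sketch like the following reports the layout it chose. The helper names `PersonSize` and `AgeOffset` are hypothetical. On a platform with 8 byte pointers and a 4 byte `int` (which is what we assume throughout this chapter) we would expect 24 and 16, but the values are whatever your compiler decides.
+
+```c
+#include <stddef.h>
+
+struct Person
+{
+    char * fname;
+    char * lname;
+    int age;
+};
+
+/* Hypothetical helpers exposing the layout chosen by the compiler. */
+size_t PersonSize(void)
+{
+    return sizeof(struct Person);           /* includes tail padding */
+}
+
+size_t AgeOffset(void)
+{
+    return offsetof(struct Person, age);    /* bytes before age      */
+}
+```
+
+The difference between `PersonSize()` and `AgeOffset() + sizeof(int)` is the padding that the `p.pad` line accounts for in the assembly language.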
+
+`Line 19` is a mouthful. The mnemonic `smaddl` means *signed multiply add long*. Here is the instruction:
+
+```asm
+        smaddl  x4, w1, w5, x3      // initialize end_ptr               // 19 
+```
+
+`w1` (the length) will be multiplied by `w5` (the size of each array member), added to `x3` (the base address of the array) and the result will be placed into `x4`. This assembly language instruction implements this:
+
+```c
+        struct Person * end_ptr = people + length;                      /* 20 */
+```
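+
+The arithmetic performed by `smaddl` can be modeled in `C` as shown below. This is a sketch of our understanding of the instruction, and the function name `SignedMultiplyAddLong` is made up for the illustration. The important detail is that both 32 bit operands are widened to 64 bits *before* the multiply, so the product cannot overflow 32 bits.
+
+```c
+#include <stdint.h>
+
+/* Hypothetical model of: smaddl xd, wn, wm, xa
+   Widen both 32 bit operands to 64 bits, multiply, then add. */
+int64_t SignedMultiplyAddLong(int32_t wn, int32_t wm, int64_t xa)
+{
+    return xa + (int64_t) wn * (int64_t) wm;
+}
+```
+
+With a length of 20, an element size of 24 and a base address of 1000, the result is 1480, the address just past the last `struct Person`.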
+
+`Line 20` branches to the `while` loop's decision test. Putting the decision test of a loop at the loop's bottom rather than the top has previously been shown to save one instruction.
+
+`Line 22` begins the main loop of this function. `w5` is loaded with the `int` found 16 bytes away from the address pointed to by `x3`. In this case, we allowed the assembler to compute the 16 for us - you can see this on `lines 33` through `37`. A `w` register is used because `age` is an `int`.
+
+`Line 23` compares the current age to the largest age found so far. This is a key line in that the `cmp` sets *status bits* which are used by the next two, very cool, instructions.
+
+`Lines 24` and `25` both make use of the `csel` instruction. The mnemonic means "conditional select". The comparison **has already been made** (on `line 23`), setting the CPU's status bits to record whether the comparison produced a less than zero, zero, or more than zero result.
+
+`Lines 24` and `25` read:
+
+```asm
+        csel    w2, w2, w5, ge      // update based on cmp              // 24 
+        csel    x0, x0, x3, ge      // update based on cmp              // 25 
+```
+
+These are equivalent to this:
+
+```c
+        w2 = (w5 > w2) ? w5 : w2;
+        x0 = (w5 > w2) ? x3 : x0;
+```
+
+**Remember that the status bits were set by the single `cmp` on `line 23`, which compared the oldest age found so far to the current age. Both of the `csel` instructions leverage the outcome of that one comparison.**
+
+`csel`, like the `C` and `C++` ternary operator, is quite cool in that we get the results of an `if` statement without the overhead of branching instructions!
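+
+The idea of one comparison feeding two selects can be sketched in `C` like this. The function name `FindOldestTernary` is hypothetical and the `struct` is repeated so the example stands alone. Whether a compiler turns the two ternary expressions into `csel` instructions is up to the compiler; this is only meant to show the shape of the logic.
+
+```c
+#include <stddef.h>
+
+struct Person
+{
+    char * fname;
+    char * lname;
+    int age;
+};
+
+/* Hypothetical variant: compare once, then select twice. */
+struct Person * FindOldestTernary(struct Person * people, int length)
+{
+    int oldest_age = 0;
+    struct Person * oldest_ptr = NULL;
+
+    if (people)
+    {
+        struct Person * end_ptr = people + length;
+        for (; people < end_ptr; people++)
+        {
+            int older = people->age > oldest_age;           /* like cmp  */
+            oldest_ptr = older ? people : oldest_ptr;       /* like csel */
+            oldest_age = older ? people->age : oldest_age;  /* like csel */
+        }
+    }
+    return oldest_ptr;
+}
+```
+
+Because the comparison is a strict "greater than", a tie keeps the person found first, matching `OriginalFindOldestPerson()`.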
+
+`Line 26` increments the loop pointer to the next array member or to the end of the array.
+
+`Line 27` compares the new value of the loop pointer to the address coming after the array.
+
+`Line 28` will branch to the next iteration of the loop if `x3` has not yet advanced as far as `x4` sitting past the end of the array.
+
+`Line 30` is simply a `ret` with no other bookkeeping because the value we want to return has been sitting in `x0` all along! A reminder that we did not need to preserve the value of `x30`, for example, because this function makes no function calls. `x30`, our return address, remains safely unchanged.
--- a/section_1/regs/regs.png
+++ b/section_1/regs/regs.png
--- a/section_1/regs/regtest.c
+++ b/section_1/regs/regtest.c
@ -0,0 +1,30 @@
+#include <stdio.h>
+
+void Foo(long a, long b, long c, long d, long e, long f, long g, long h, long i, long j)
+{
+    printf("%ld %ld %ld %ld %ld %ld %ld %ld %ld %ld\n",
+        a, b, c, d, e, f, g, h, i, j);
+}
+
+struct S
+{
+    long a;
+    long b;
+    long c;
+    long a1;
+    long b1;
+    long c1;
+    long a2;
+    long b2;
+    long c2;
+    long a3;
+    long b3;
+    long c3;
+};
+
+void Bum(long z, long a, long b);
+
+void Bar(long z, struct S s)
+{
+    Bum(z, s.a, s.b);
+}
--- a/section_1/regs/regvar.md
+++ b/section_1/regs/regvar.md
@ -0,0 +1,13 @@
+# Registers Versus Variables
+
+A key to mastering assembly language programming is thinking about your variables differently from how you may think about them presently. In your previous experience, variables are things that live in RAM. Apart from dynamically allocated memory, you've really had few other worries to consider.
+
+*Your job as an assembly language programmer is to plan out how to keep as much of your computational needs in registers for as long as possible.*
+
+As seen in the previous [section](./regs.md), reaching out to RAM is enormously costly in terms of time and delay. Learning to maximize register use is critical.
+
+There are 31 general purpose registers of which you can use a majority (a small number are reserved, in general, for particular purposes). There are also 32 floating point and vector processing registers with fewer of these reserved for special purposes.
+
+Consider the typical function you have written. Chances are there are no more than a handful of local variables. Typically you can plan out how to keep all of your local variables in registers for the entire life of your function. Exceptions include local arrays since arrays don't map well to registers.
+
+There **are** some constraints placed on your register use. These are described in the next [section](./backup.md).
--- a/section_1/regs/spare.md
+++ b/section_1/regs/spare.md
@ -0,0 +1,5 @@
+# stuff that's unused as yet
+
+It is important that you become familiar with all the different kinds and uses of registers available on the AARCH64 ISA. We'll dive into this soon. But first, it is vitally important to expand upon your notion of "variable."
+
+So far we have restricted our discussion of registers almost exclusively to the `x` registers. These are 8 bytes wide and are used for both signed and unsigned integer operations.
--- a/section_1/regs/text.txt
+++ b/section_1/regs/text.txt
@ -0,0 +1,31 @@
+(gdb) b main
+Breakpoint 1 at 0x740: file c.s, line 5.
+(gdb) run
+Starting program: a.out 
+
+Breakpoint 1, main () at c.s:5
+5	main:   mov     x0, -1
+(gdb) n
+6	        ldr     x1, =ram
+(gdb) n
+7	        ldrb    w0, [x1]
+(gdb) p/x $x0
+$1 = 0xffffffffffffffff
+(gdb) n
+9	        mov     x0, -1
+(gdb) p/x $x0
+$2 = 0x1
+(gdb) n
+10	        mov     x1, 1
+(gdb) n
+11	        mov     w0, w1
+(gdb) n
+13	        mov     x0, -1
+(gdb) p/x $x0
+$3 = 0x1
+(gdb) n
+14	        movk    w0, 1
+(gdb) n
+17	        mov     x0, xzr
+(gdb) p/x $x0
+$4 = 0xffff0001
--- a/section_1/regs/widths.md
+++ b/section_1/regs/widths.md
@ -0,0 +1,252 @@
+# Register Widths
+
+## Overview
+
+In each of the various sets of registers, each register can be referred to by different synonyms which determine how wide the register operation will be.
+
+## General Purpose Registers
+
+| Intended Width | Register Prefix | Instruction Postfix |
+| -------------- | --------------- | ------------------- |
+| 8 bytes | `x` | NA |
+| 4 bytes | `w` | NA |
+| 2 bytes | `w` | `h` |
+| 1 byte | `w` | `b` |
+
+### `ldr` (and `ldp`)
+
+```asm
+    ldr    x0, [sp]   // load 8 bytes from address specified by sp
+    ldr    w0, [sp]   // load 4 bytes from address specified by sp
+    ldrh   w0, [sp]   // load 2 bytes from address specified by sp
+    ldrb   w0, [sp]   // load 1 byte  from address specified by sp
+```
+
+The address from which a load is taken should match the alignment of what is being loaded.
+
+### `str` (and `stp`)
+
+```asm
+    str    x0, [sp]   // store 8 bytes to address specified by sp
+    str    w0, [sp]   // store 4 bytes to address specified by sp
+    strh   w0, [sp]   // store 2 bytes to address specified by sp
+    strb   w0, [sp]   // store 1 byte  to address specified by sp
+```
+
+The address to which a store is made should match the alignment of what is being stored.
+
+### Example
+
+Let's look at this program:
+
+```asm
+        .global    main                                                 // 1 
+        .text                                                           // 2 
+        .align    2                                                     // 3 
+                                                                        // 4 
+main:   mov     x0, xzr                                                 // 5 
+        ldr     x1, =ram                                                // 6 
+        strb    w0, [x1]                                                // 7 
+        strh    w0, [x1]                                                // 8 
+        str     w0, [x1]                                                // 9 
+        str     x0, [x1]                                                // 10 
+        ret                                                             // 11 
+                                                                        // 12 
+        .data                                                           // 13 
+ram:    .quad   0xFFFFFFFFFFFFFFFF                                      // 14 
+                                                                        // 15 
+        .end                                                            // 16 
+                                                                        // 17 
+```
+
+`Line 14` puts an identifiable pattern into 8 bytes of RAM and gives them the symbol `ram`.
+
+`Line 6` gets the address of these bytes into `x1`.
+
+The next four lines put zeros into that memory using progressively wider store instructions.
+
+The following is a `gdb` session running the above program. Line numbers have been added to assist with the description of the session. Rather than describe all after a wall of text, descriptions will be provided inline.
+
+```text
+(gdb) b main                                                            // 1 
+Breakpoint 1 at 0x740: file align.s, line 5.                            // 2 
+```
+
+Immediately after entering `gdb` we set a breakpoint at `main`.
+
+```text
+(gdb) run                                                               // 3 
+Starting program: /media/psf/Home/buffet/3510/pk_do/regs/a.out          // 4 
+                                                                        // 5 
+Breakpoint 1, main () at align.s:5                                      // 6 
+5    main:    mov     x0, xzr                                           // 7 
+```
+
+We launched the program and `gdb` stops its execution upon reaching the breakpoint.
+
+```text
+(gdb) p/x $x0                                                           // 8 
+$1 = 0x1                                                                // 9 
+```
+
+Before putting zero into `x0`, let's see what it currently holds... the value 1. Recall this is `argc`. The `p` command means `print` and is used to print the values in registers. The modifier `/x` says to print in hexadecimal.
+
+```text
+(gdb) n                                                                 // 10 
+6            ldr     x1, =ram                                           // 11 
+(gdb) p/x $x0                                                           // 12 
+$2 = 0x0                                                                // 13 
+```
+
+After putting zero into `x0`, we confirm its contents.
+
+```text
+(gdb) p/x $x1                                                           // 14 
+$3 = 0xfffffffff028                                                     // 15 
+```
+
+Prior to loading the address of the 8 bytes labeled `ram`, we print out the value already sitting in `x1`. This is `argv`, the address of an array of pointers to `C`-strings. The first pointer in that array leads to the name of the program being run.
+
+```text
+(gdb) n                                                                 // 16 
+7            strb    w0, [x1]                                           // 17 
+(gdb) p/x $x1                                                           // 18 
+$4 = 0xaaaaaaab1010                                                     // 19 
+```
+
+After loading the address of `ram` into `x1`, we confirm its contents.
+
+```text
+(gdb) p/x &ram                                                          // 20 
+$5 = 0xaaaaaaab1010                                                     // 21 
+```
+
+Just for kicks, we confirm that the previous instruction really did get the address correctly.
+
+```text
+(gdb) x/x &ram                                                          // 22 
+0xaaaaaaab1010:    0xffffffff                                           // 23 
+```
+
+We shift from `p`rint to e`x`amine to reach into memory and see what is found at `ram`.
+
+```text
+(gdb) x/gx &ram                                                         // 24 
+0xaaaaaaab1010:    0xffffffffffffffff                                   // 25 
+```
+
+Adding the `g` (for `g`iant) we can see all 8 bytes.
+
+```text
+(gdb) n                                                                 // 26 
+8            strh    w0, [x1]                                           // 27 
+(gdb) x/gx &ram                                                         // 28 
+0xaaaaaaab1010:    0xffffffffffffff00                                   // 29 
+```
+
+We just did a `strb` and looking at memory, we see one byte's worth of zeros.
+
+*Note: this brings up an interesting question... which byte is actually sitting at the address of `ram`? We will have to look into this more later.*
+
+```text
+(gdb) n                                                                 // 30 
+9            str     w0, [x1]                                           // 31 
+(gdb) x/gx &ram                                                         // 32 
+0xaaaaaaab1010:    0xffffffffffff0000                                   // 33 
+```
+
+After storing a `short`.
+
+```text
+(gdb) n                                                                 // 34 
+10            str     x0, [x1]                                          // 35 
+(gdb) x/gx &ram                                                         // 36 
+0xaaaaaaab1010:    0xffffffff00000000                                   // 37 
+```
+
+After storing an `int`.
+
+```text
+(gdb) n                                                                 // 38 
+11            ret                                                       // 39 
+(gdb) x/gx &ram                                                         // 40 
+0xaaaaaaab1010:    0x0000000000000000                                   // 41 
+(gdb) quit                                                              // 42 
+```
+
+And finally, after storing a `long`.
+
+Let's circle back to the question asked above: Which byte is actually at the address `ram`? When we examined the `long` just after putting in one byte of zero, we saw this:
+
+```text
+(gdb) x/gx &ram                                                         // 28 
+0xaaaaaaab1010:    0xffffffffffffff00                                   // 29 
+```
+
+Notice the zeros come at the end. Keep in mind, these bytes are printed as a `long`.
+
+But what if we look at these 8 bytes individually?
+
+```text
+(gdb) x/gx &ram
+0xaaaaaaabb010: 0xffffffffffffff00
+(gdb) x/8bx &ram
+0xaaaaaaabb010: 0x00    0xff    0xff    0xff    0xff    0xff    0xff    0xff
+```
+
+Look at that... the *least significant* byte of a `long` comes **first**.
+
+This is the definition of `little endian`.
+
+The following image is from [here](https://medium.com/worldsensing-techblog/big-endian-or-little-endian-37c3ed008c94):
+
+![eggs](./eggs.jpeg)
+
+### Little Endian in More Detail
+
+Given this program (not intended for meaningful execution... just e`x`amining memory):
+
+```asm
+        .global main
+        .text
+        .align  2
+
+main:   mov     x0, xzr                     // return 0
+        ret
+
+        .data
+ram:    .quad   0xAABBCCDDEEFF0011          // the 8 bytes examined below
+        .end
+```
+
+let's take a look at the memory at location `ram` in two ways. Once interpreted as a `long`:
+
+```text
+(gdb) x/gx &ram
+0x11010:    0xaabbccddeeff0011
+```
+
+and then interpreted as 8 bytes appearing in order from lowest address to highest:
+
+```text
+(gdb) x/8bx &ram
+0x11010:    0x11    0x00    0xff    0xee    0xdd    0xcc    0xbb    0xaa
+```
+
+Compare the order of the bytes. In memory, they run from least significant (lowest address) to most significant (highest address). Specifically:
+
+* within a `long` the least significant `int` comes first
+* within an `int`, the least significant `short` comes first
+* within a `short` the least significant byte comes first
+
+Endianness isn't an issue unless you're exchanging data with a computer that has a different endianness, and then only if the data being transferred is wider than 1 byte in its native form. Text expressed in single bytes is immune to endianness issues: it is an array of bytes and is laid out the same on all platforms.
+
+### What Happens to the Rest of a Register When Only a Portion is Affected?
+
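+Whenever a narrower portion of a register is written to, the remainder of the register is zeroed out. For example, `ldrb w0, [x1]` loads one byte into the least significant byte of `x0` and zeros out the upper 7 bytes. (The sign-extending loads, such as `ldrsb`, are the exception: they fill the upper bits with copies of the sign bit.)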
+
+*There are dedicated instructions for manipulating bits in the middle of registers*.
+
+### Casting Between `int` Types
+
+Casting between integer types is in some cases accomplished by `and`ing with `255` or `65535` (to narrow to `char` or `short`, respectively). Otherwise, see the previous section (What Happens to the Rest of a Register...).