diff --git a/README.md b/README.md index efe25ca..32e09e1 100644 --- a/README.md +++ b/README.md @@ -319,6 +319,7 @@ In this section, we present miscellaneous material. | 5 | [Determining string literal lengths for C functions](./more/strlen_for_c/README.md) | [Link](./more/strlen_for_c/README.pdf) | | 6 | [Calling Assembly Language From Python](./python/) | [Link](./python/README.pdf) | | 7 | [Atomic Operations](./more/atomics/README.md) | [Link](./more/atomics/README.pdf) | +| 8 | [Jump Tables](./more/jump_tables/README.md) | [Link](./more/jump_tables/README.pdf) | ## Macro Suite diff --git a/README.pdf b/README.pdf index a2f8e18..f93c2b1 100644 Binary files a/README.pdf and b/README.pdf differ diff --git a/more/jump_tables/README.md b/more/jump_tables/README.md new file mode 100644 index 0000000..61c788f --- /dev/null +++ b/more/jump_tables/README.md @@ -0,0 +1,275 @@ +# Jump or Branch Tables + +A jump or branch table is a powerful instruction saving technique that +can be used to switch between multiple single instructions or even +choose one of a series of functions to call (or branches to take). + +This concept can be found as the implementation of some `switch` +statements and is found at the very very lowest end of an Operating +System (interrupt vectors, for example). + +The + +## Single Instructions a la Duff's Device + +[Duff's Device](https://en.wikipedia.org/wiki/Duff%27s_device) shoe +horned a jump table into the middle of a `while` loop. At the same +time, it also correctly demonstrates a simple case of *loop unrolling*. +It's very creative. + +Let's expand on Duff's Device. + +The full source code for this example can be found +[here](./branch_table.S). It demonstrates a branch table consisting of +instructions which are meant to be executed in sequence after jumping +into the middle of the sequence. + +Here: + +```asm + mov x6, 8 + MOD x2, x6, x4, x5 // x4 gets l % 8 + cbz x4, 10f // Handle evenly divisible case. + sub x4, x6, x4 // Invert sense of x4 e.g. 3 becomes 5 +``` + +we are performing this: *x4 is getting the result of modding the +number of times we want the instructions executed by the number of +times we unrolled the loop*. + +Specifically, this example does `length % 8`. However, the AARCH64 ISA +does not include a *mod* instruction. The `MOD` macro used above is +defined as: + +```asm +.macro MOD src_a, src_b, dest, scratch + sdiv \scratch, \src_a, \src_b + msub \dest, \scratch, \src_b, \src_a +.endm +``` + +`msub` is a cool instruction. It does this: + +```d = c - (b * a)``` + +Example: 13 % 8 == 5. First the `sdiv`: 13 / 8 is 1. Then, the `msub`: +13 - (1 * 8) is 5. + +Next: + +```asm + cbz x4, 10f // Handle evenly divisible case. + sub x4, x6, x4 // Invert sense of x4 e.g. 5 becomes 3 +``` + +This code is key. + +If the result of the `mod` is 0, then the entire table must be executed. +This is implemented by the `cbz`. + +If the result of the `mod` is not 0, then its value must be *flipped*. +This is the `sub` instruction. See the comment above. + +Finally, we have the computation of the address to where we jump into +the middle of the table. + +```asm + LLD_ADDR x5, 10f + add x5, x5, x4, lsl 2 + br x5 +``` + +Each of the lines above bears description: + +The `LLD_ADDR` is from the [*convergence +macros*](./apple-linux-convergence.S). It loads the address of the +beginning of the table. + +Next, the `add` instruction multiplies the flipped result of the `mod` +by 4 (the length of one instruction) THEN adds it to the base address +of the table. We have calculated *instruction addresses* exactly the +way we would with array dereferences. Thank you John von Neumann. + +Finally, we `br` which means branch to an address contained in a +register. + +```asm +10: str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + // loop code not shown +``` + +## Performing Multiple Instructions + +If you need to execute more than one instruction you have two choices: + +### Multiple Instructions by Address Arithmetic + +Suppose you needed two instructions in each step of the sequence. +Simply multiply the index by 8 instead of 4 (i.e. the length of two +instructions). The same technique works with a larger number. E.g. +you need three instructions per step: multiply by 12. + +Suppose some need 3 instruction and some need 2. You must handle this +because using this technique requires that all steps in the sequence +of steps must be the same length so that the address arithmetic holds. + +Simply insert the occasional `nop` instruction in the indexes that are +shorter than the others. + +### Multiple Instructions by Branch Branch + +Here's another [example of code](./jmptbl.s) that implements a branch or +jump table: + +```asm +jt: b 0f + b 1f + b 2f + b 3f + b 4f + b 5f + b 6f + b 7f +``` + +You jump into the middle of the table and then immediately jump some +place else. This is like: + +```c +if (blah) { + blah +} else if (blah) { + blah +} else if (blah) { + blah +} +etc. +``` + +### Multiple Instructions by Branch Call + +You can easily modify the above techniques to make something like: + +```asm +jt: br func_0 + br func_1 + br func_2 + br func_3 + br func_4 + br func_5 + br func_6 + br func_7 +``` + +or: + +```asm +jt: br func_0 + b common_label + br func_1 + b common_label + br func_2 + b common_label + br func_3 + b common_label + br func_4 + b common_label + br func_5 + b common_label + br func_6 + b common_label + br func_7 + b common_label + // perhaps some loop control... if none, the preceding + // b can be removed since can fall through to the common + // label. +common: +``` + +The above looks like a `switch` statement where each case is terminated +with a `break` statement. + +## Small Gaps in Sequential Indexes + +Suppose your range of indexes was 0 through 8 inclusive (notice there +are 9 integers in the range) but index 7 is skipped. That is, your +potential indexes are 0 through 6 inclusive and then 8 but never +7. + +In a `switch` statement, this would look like: + +```c++ +switch (index) { + case 0: blah blah; + break; + case 1: blah blah; + break; + case 2: blah blah; + break; + case 3: blah blah; + break; + case 4: blah blah; + break; + case 5: blah blah; + break; + case 6: blah blah; + break; + case 8: blah blah; + break; +} +``` + +Gaps in the potential indexes presents a surmountable problem if the +gaps are few. + +In the case where there are a small number of gaps simple fill them +with a branch to a common, otherwise "do nothing", label. For example, +you might have: + +```asm +b_table: b label0 + b label1 + b label2 + b label3 + b label4 + b label5 + b label6 + b do_nothing + b label8 +``` + +in a Duff's Device where you are executing sequential single +instructions, it might loop like this: + +```asm +x_fer: str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + nop + str w1, [x0], 1 +``` + +Here, the `nop` instruction means "no operation". It does nothing but +is a valid instruction meant to take up space (and decades ago, take +up time). + +In a high level language this might look like this: + +```c +for (int i = 0; i <= 8; i++) { + if (i == 7) + continue; + blah blah +} +``` diff --git a/more/jump_tables/README.pdf b/more/jump_tables/README.pdf new file mode 100644 index 0000000..6eb14d7 Binary files /dev/null and b/more/jump_tables/README.pdf differ diff --git a/more/jump_tables/apple-linux-convergence.S b/more/jump_tables/apple-linux-convergence.S new file mode 100644 index 0000000..8827423 --- /dev/null +++ b/more/jump_tables/apple-linux-convergence.S @@ -0,0 +1,156 @@ +/* Macros to permit the "same" assembly language to build on ARM64 + Linux systems as well as Apple Silicon systems. + + See the fuller documentation at: + https://github.com/pkivolowitz/asm_book/blob/main/macros/README.md + + Perry Kivolowitz + A Gentle Introduction to Assembly Language +*/ + +.macro GLD_PTR xreg, label +#if defined(__APPLE__) + adrp \xreg, _\label@GOTPAGE + ldr \xreg, [\xreg, _\label@GOTPAGEOFF] +#else + ldr \xreg, =\label + ldr \xreg, [\xreg] +#endif +.endm + +.macro GLD_ADDR xreg, label // Get a global address +#if defined(__APPLE__) + adrp \xreg, _\label@GOTPAGE + add \xreg, \xreg, _\label@GOTPAGEOFF +#else + ldr \xreg, =\label +#endif +.endm + +.macro LLD_ADDR xreg, label +#if defined(__APPLE__) + adrp \xreg, \label@PAGE + add \xreg, \xreg, \label@PAGEOFF +#else + ldr \xreg, =\label +#endif +.endm + +.macro LLD_DBL xreg, dreg, label +#if defined(__APPLE__) + adrp \xreg, \label@PAGE + add \xreg, \xreg, \label@PAGEOFF + ldur \dreg, [\xreg] +// fmov \dreg, \xreg +#else + ldr \xreg, =\label + ldur \dreg, [\xreg] +#endif +.endm + +.macro LLD_FLT xreg, sreg, label +#if defined(__APPLE__) + adrp \xreg, \label@PAGE + add \xreg, \xreg, \label@PAGEOFF + ldur \sreg, [\xreg] +#else + ldr \xreg, =\label + ldur \sreg, [\xreg] +#endif +.endm + +.macro GLABEL label +#if defined(__APPLE__) + .global _\label +#else + .global \label +#endif +.endm + +.macro MAIN +#if defined(__APPLE__) +_main: +#else +main: +#endif +.endm + +/* Fetching the address of the externally defined errno is quite + different on Apple and Linux. This macro leaves the address of + errno in x0. +*/ +.macro ERRNO_ADDR +#if defined(__APPLE__) + bl ___error +#else + bl __errno_location +#endif +.endm + +.macro CRT label +#if defined(__APPLE__) + bl _\label +#else + bl \label +#endif +.endm + +.macro START_PROC // after starting label + .cfi_startproc +.endm + +.macro END_PROC // after the return + .cfi_endproc +.endm + +.macro PUSH_P a, b + stp \a, \b, [sp, -16]! +.endm + +.macro PUSH_R a + str \a, [sp, -16]! +.endm + +.macro POP_P a, b + ldp \a, \b, [sp], 16 +.endm + +.macro POP_R a + ldr \a, [sp], 16 +.endm + +/* The smaller of src_a and src_b is put into dest. A cmp instruction + or other instruction that sets the flags must be performed first. + This macro makes it easy to remember which register does what in the + csel. + + Thank you to u/TNorthover for nudge to add the cmp. +*/ + +.macro MIN src_a, src_b, dest + cmp \src_a, \src_b + csel \dest, \src_a, \src_b, LT +.endm + +/* The larger of src_a and src_b is put into dest. A cmp instruction + or other instruction that sets the flags must be performed first. + This macro makes it easy to remember which register does what in the + csel. + + Thank you to u/TNorthover for nudge to add the cmp. +*/ + +.macro MAX src_a, src_b, dest + cmp \src_a, \src_b + csel \dest, \src_a, \src_b, GT +.endm + +.macro AASCIZ label, string + .p2align 2 +\label: .asciz "\string" +.endm + +.macro MOD src_a, src_b, dest, scratch + sdiv \scratch, \src_a, \src_b + msub \dest, \scratch, \src_b, \src_a +.endm diff --git a/more/jump_tables/branch_table.S b/more/jump_tables/branch_table.S new file mode 100644 index 0000000..84d1b49 --- /dev/null +++ b/more/jump_tables/branch_table.S @@ -0,0 +1,57 @@ +#include "apple-linux-convergence.S" + + .p2align 2 + .text + GLABEL MyMemSet + +/* MyMemSet(unsigned char * b, unsigned char v, long l) + x0 w1 x2 + + The length is first checked against less than or equal to 0. If + so, the body of the function is skipped. + + The loop will be unrolled 8x. The length (x2) modulo 8 gets turned + into the number of instructions to jump to or beyond the initial + str. A modulo of 0 is handled separately - it causes a branch to the + initial str. + + This code can be dramatically improved by copying more than one byte + at a time. You will have to figure out how to do this optimally in + P6 - MemCpy +*/ +#if defined(__APPLE__) +_MyMemSet: +#else +MyMemSet: +#endif + START_PROC + PUSH_P x29, x30 + mov x29, sp + cmp x2, xzr // Test for bad length. + ble 99f // Take branch of 0 or less. + + add x3, x2, x0 // x3 gets address of one beyond buffer + mov x6, 8 + MOD x2, x6, x4, x5 // x4 gets l % 8 + cbz x4, 10f // Handle evenly divisible case. + sub x4, x6, x4 // Invert sense of x4 e.g. 3 becomes 5 + + LLD_ADDR x5, 10f + add x5, x5, x4, lsl 2 + br x5 + +10: str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + str w1, [x0], 1 + cmp x3, x0 + bgt 10b + +99: POP_P x29, x30 + ret + END_PROC + diff --git a/more/jump_tables/jmptbl.s b/more/jump_tables/jmptbl.s new file mode 100644 index 0000000..97e272a --- /dev/null +++ b/more/jump_tables/jmptbl.s @@ -0,0 +1,83 @@ + .text + .align 4 + .global main + +main: str x30, [sp, -16]! + mov x0, xzr // set up call to time(nullptr) + bl time // call time setting up srand + bl srand // call srand setting up rand + bl rand // get a random number + and x0, x0, 7 // ensure its range is 0 to 7 + // note use of x register is on purpose + lsl x0, x0, 2 // multiply by 4 + ldr x1, =jt // load base address of jump table + add x1, x1, x0 // add offset to base address + br x1 + +// If, as in this case, all the "cases" have the same number of +// instructions then this intermediate jump table can be omitted saving +// some space and a tiny amount of time. To omit the intermediate jump +// table, you'd multiply by 12 above and not 4. Twelve because each +// "case" has 3 instructions (3 x 4 == 12). + +// Question for you: If you did omit the jump table, relative to what +// would you jump (since "jt" would be gone). + +jt: b 0f + b 1f + b 2f + b 3f + b 4f + b 5f + b 6f + b 7f + +0: ldr x0, =ZR + bl puts + b 99f + +1: ldr x0, =ON + bl puts + b 99f + +2: ldr x0, =TW + bl puts + b 99f + +3: ldr x0, =TH + bl puts + b 99f + +4: ldr x0, =FR + bl puts + b 99f + +5: ldr x0, =FV + bl puts + b 99f + +6: ldr x0, =SX + bl puts + b 99f + +7: ldr x0, =SV + bl puts + b 99f + +99: mov w0, wzr + ldr x30, [sp], 16 + ret + + .data + .section .rodata + +ZR: .asciz "0 returned" +ON: .asciz "1 returned" +TW: .asciz "2 returned" +TH: .asciz "3 returned" +FR: .asciz "4 returned" +FV: .asciz "5 returned" +SX: .asciz "6 returned" +SV: .asciz "7 returned" + + .end diff --git a/more/jump_tables/jt.c b/more/jump_tables/jt.c new file mode 100644 index 0000000..e4a56d1 --- /dev/null +++ b/more/jump_tables/jt.c @@ -0,0 +1,55 @@ +#include +#include +#include + +/* This is the prototype for the assembly language version. You may + have always thought that switch statements are implemented as a long + chain of if / else. Well, sometimes they are. Sometimes they are + implemented using binary search and still other times they are + implemented as jump tables. + + My assembly language version is found in jmptbl.s. +*/ + +int main() +{ + int r; + + srand(time(0)); + r = rand() & 7; + switch (r) + { + case 0: + puts("0 returned"); + break; + + case 1: + puts("1 returned"); + break; + + case 2: + puts("2 returned"); + break; + + case 3: + puts("3 returned"); + break; + + case 4: + puts("4 returned"); + break; + + case 5: + puts("5 returned"); + break; + + case 6: + puts("6 returned"); + break; + + case 7: + puts("7 returned"); + break; + } + return 0; +} \ No newline at end of file diff --git a/more/jump_tables/test_interop.cpp b/more/jump_tables/test_interop.cpp new file mode 100644 index 0000000..02de5ea --- /dev/null +++ b/more/jump_tables/test_interop.cpp @@ -0,0 +1,31 @@ +#include + +extern "C" void MyMemSet(unsigned char *, unsigned char v, long length); + +/* MyMemSet(unsigned char *, unsigned char v, long length); +*/ + +/* +void MyMemSet(unsigned char * b, unsigned char v, long l) { + for (long i = 0; i < l; i++) { + b[i] = v; + } +} +*/ +const long BUFFER_SIZE = 1000; + +unsigned char buffer[BUFFER_SIZE]; + +int main() { + unsigned char before = buffer[-1]; + unsigned char after = buffer[BUFFER_SIZE]; + + MyMemSet(buffer, 0xF0, BUFFER_SIZE); + + if (before != buffer[-1]) + printf("Bytes prior to buffer are smashed.\n"); + if (after != buffer[BUFFER_SIZE]) + printf("Bytes after buffer are smashed.\n"); + + return 0; +}