# Starting Systems Programming, Pt 1: Programmers Write Programs

A software article by Efron Licht

**MAR 2025**

> This is the first of four articles on the fundamentals of systems programming. It will cover many of the essentials, such as bit manipulation, parsing, filesystems, input/output, syscalls, memory management, and signals. Like many of my article series, this is more of a grab bag than a comprehensive guide - but I hope it will be useful to you.

- [1. Series Introduction](#1-series-introduction)
  - [1.1. Programmers Write Programs](#11-programmers-write-programs)
  - [1.2. Some final caveats:](#12-some-final-caveats)
  - [1.3. Series Overview](#13-series-overview)
- [2. What is systems programming?](#2-what-is-systems-programming)
- [3. Peeking into the black box: what is a program, anyways?](#3-peeking-into-the-black-box-what-is-a-program-anyways)
  - [3.1. hello.go](#31-hellogo)
  - [3.2. buildhello.bash](#32-buildhellobash)
- [4. Investigating the data segment](#4-investigating-the-data-segment)
  - [4.1. finding strings with `findoffset.go`](#41-finding-strings-with-findoffsetgo)
  - [4.2. writing simple files with `echo.go`](#42-writing-simple-files-with-echogo)
  - [4.3. printing files with `cat.go`](#43-printing-files-with-catgo)
  - [4.4. investigating the `hello` program with `findoffset`, `echo`, and `cat`](#44-investigating-the-hello-program-with-findoffset-echo-and-cat)
  - [4.5. Basic Hacking w/ `binpatch.go`](#45-basic-hacking-w-binpatchgo)
  - [4.6. reaching into files with `torso.go`](#46-reaching-into-files-with-torsogo)
  - [4.7. Investigation: What's with `3814697265625`?](#47-investigation-whats-with-3814697265625)
- [5. Investigating the code segment.](#5-investigating-the-code-segment)
  - [5.1. reading binary files with `shexdump.go`](#51-reading-binary-files-with-shexdumpgo)
  - [5.2. deserializing hexdumps with `unhexdump.go`](#52-deserializing-hexdumps-with-unhexdumpgo)
  - [5.3. hexdump\_test.bash](#53-hexdump_testbash)
  - [5.4. Investigating the ELF header of `hello`](#54-investigating-the-elf-header-of-hello)
- [6. Conclusion: The Spirit of Systems Programming.](#6-conclusion-the-spirit-of-systems-programming)

## 1. Series Introduction

### 1.1. [Programmers Write Programs](./startingsystems1.md#programmers-write-programs)

When's the last time you wrote a program from scratch? For a shocking number of programmers, the answer is 'in school'. This is a pervasive problem in the industry, and it's only getting worse.

I interview a lot of candidates, and I've run into people with titles like 'Technical Lead @ Tesla' (or worse yet, Principal Engineer) who can't program their way out of a paper bag. My ordinary interview question is "write me `grep`" - a problem which _should_ be appropriate for first or second-year computer science students - and the overwhelming majority of candidates fail it. I don't think they're _dumb_ - they usually aren't - but they don't have the grounding in programming fundamentals - systems programming fundamentals - that they need to "really" program.

This is bad for the candidates, bad for the industry, and bad for our increasingly computerized world. Why? Making reliable, intuitive, and efficient software is about minimizing complexity. If all you can do is _add on_ to existing programs, you're glued to pre-existing complexity. If you can't **write a program** from scratch, you're stuck in a world of other people's code and other people's mistakes.

**The way to get good at something is by doing it. Pitchers pitch, painters paint, and programmers program.**

So this article will be about **writing programs** - dozens of them. As such, while reading the text of this article might teach you a few things, to really get the most out of it, you'll need to understand the programs. I've provided exercises to help you practice.

#### [Note on style & environment](./startingsystems1.md#note-on-style--environment)

Where possible, the code in this series will use as few libraries as possible.
This is not because you shouldn't use libraries - but because you shouldn't _need_ them. I want to show you that you can make practical tools out of simple primitives.

You'll see a number of code blocks throughout this article. These are either go programs or bash shell scripts. If you've had experience with a mainstream programming language like Python, JavaScript, or C, you should be able to follow along, but you might want to review pointers a bit. Go uses `//` for comments, and bash and python use `#`. I'll start each code block with a comment indicating the language: `// filename.go` or `#!/usr/bin/env bash`.

#### [note for python programmers](./startingsystems1.md#note-for-python-programmers)

> **I have provided python implementations of many programs in this series**. See my [`gitlab` in the `articles/startingsystems/cmd/pythonports` directory](https://gitlab.com/efronlicht/blog/-/tree/master/articles/startingsystems/cmd/pythonports?ref_type=heads).
> I'll try to link the specific files at the head of each go program. The go program should always be treated as the 'canonical' version. I may or may not keep doing this in the next article - it's a lot of work.

---

#### example go block

Go programs will start with `// filename.go`.

```go
// minimal.go is an example go program.
// see https://gitlab.com/efronlicht/blog/-/blob/58fb4c13f870a73514284617c71027bbe0a76e2a/articles/startingsystems/cmd/pythonports/minimal.py for the python version.
package main

import "fmt"

func main() {
	fmt.Println("this is a go program")
}
```

#### example bash block

Bash scripts will start with `#!/usr/bin/env bash`. They usually contain a series of commands that you might run in a terminal after the `# IN` comment. The `# OUT` comment shows the expected output of the commands.

```bash
#!/usr/bin/env bash
# example.bash demonstrates a simple bash script with an # IN and # OUT section.
# IN:
echo "this is a bash script"
# OUT:
this is a bash script
```

> #### lemma: sidenotes
>
> Sidenotes will show up in indented boxes, like this. A 'lemma' is a small digression that clarifies a point.

> #### lemma: shebang (`#!`)
>
> The shebang (`#!`) at the beginning of a file tells the operating system what program to use to run it. For example, `#!/usr/bin/bash` will run the file with the bash shell located at `/usr/bin/bash`. `#!/usr/bin/env bash` tells the OS to use whatever `bash` is in your `PATH` environment variable to run the script. We'll talk more about all of these things in a later article.

### 1.2. Some final caveats:

- I'm way more familiar with Linux than Windows, Darwin, or BSD, so this article will be linux-centric. I will occasionally point out differences between Linux and other operating systems - but when I say "the OS", I might just mean Linux.
- This cannot be a comprehensive guide. Ideally, you'll have some basic knowledge of computer architecture. If you run into terms you don't know, like "nibble", "register" or "file descriptor", _don't panic_ - in general, you should be able to follow along. I'll provide a glossary for the things I can think of, but I'm sure I'll miss some.
- This series of articles provides the source code for many, many programs. **Reading the code is the core of this series** - the code is usually _more_ important than the text. I strongly encourage you to read and modify the code as you go along.

OK, enough ceremony. Let's get started.

### 1.3. Series Overview

1. #### [Programmers Write Programs](./startingsystems1.html) <--- you are here

   In this article, we'll talk about what systems programming is, what a program is, and how to interact with the data inside a program. We'll build a program and dig into the data inside it, hack it to change its behavior, and build a series of software tools to help us understand it that we'll use throughout the series.

2.
   #### [Your program and the outside world](./startingsystems2.html): Command-line arguments, environment variables, and syscalls

   How do programs interact with the outside world? We'll cover the fundamentals of the UNIX programming environment, including command-line arguments, environment variables, and syscalls, building up to a simple command-line interpreter (aka shell).

3. #### Execution Counts: Hardware, Memory, & Software Performance (COMING SOON)

   How do programs interact with hardware? We'll cover the fundamentals of storage and access - registers, memory management, and cache - talk about what _actually happens_ when you call a function or system call - and give a crash course on performant programming in general.

4. #### Wait, it's all `gotwo` - the fundamentals of programming, virtual machines, assembly, debugging, and ABIs. (COMING SOON)

   Wait, it's all `goto`? When it comes down to it, programming is some memory, an instruction pointer, and a series of conditional jumps. We'll use our new systems programming skills to build a virtual machine & assembly language that's a valid subset of go, then use it to explore the fundamentals of programming & debugging and to illustrate how debuggers and ABIs work.

## 2. What is systems programming?

There's no clear line between "systems programming" and other kinds. A problem might be 'systems programming' if it:

- interacts with the operating system or hardware
- has tight performance constraints
- operates at a 'low level', dealing with individual bytes or registers

A **systems programmer** sees a computer as a _physical machine_ that can be completely understood, rather than a mathematical or formal abstraction. They understand the _hardware_ and _software_ of a computer system, and they can **write programs** that interact with both.
A systems programmer is unafraid to tear something apart, confident that they can put it back together again.

## 3. Peeking into the black box: what is a program, anyways?

A _program_ is an executable file that your operating system can interpret as a series of machine instructions. That is, it's a combination of code and data that the operating system can load into memory and execute. Programs come in two main types:

1. the kind that takes input and produces output (the focus of this article)
2. the kind that runs indefinitely, waiting for interaction from the outside world (daemons, servers, etc)

When we say 'input' and 'output', we mean bytes. To warm up, let's **write** a program that takes no input but produces output: the nearly 50-year-old classic, "hello, world!".

### 3.1. hello.go

[hello.py: click here](https://gitlab.com/efronlicht/blog/-/blob/073f60b6e7c057961fd9344c766cca6b63ff9900/articles/startingsystems/cmd/pythonports/helloworld.py)

#### overview

- print a string to standard output

```go
// hello.go
package main

import "fmt"

func main() {
	fmt.Println("hello, world!")
}
```

### 3.2. buildhello.bash

#### Overview

1. call `go build` to compile the program
2. run the program

```bash
#!/usr/bin/env bash
# buildhello.bash builds and runs the hello program.
# IN
go build -o hello hello.go # 1. call 'go build' to compile the program
./hello                    # 2. run the program
# OUT
hello, world!
```

Great. What's actually _in_ the hello program? We expect

- data - including the string "hello, world!"
- code - the instructions to print that string
- maybe some other stuff?

We'll take a poke around and see what we can find. Let's start with the **data** - that is, any of the bytes in the file that aren't executable instructions.

## 4. Investigating the data segment

Regardless of your operating system or architecture, there's one thing we can be sure will exist in the file: the string "hello, world!". Let's look for it.
Better yet, let's **write a program** to look for it - we'll call it `findoffset`.

### 4.1. finding strings with `findoffset.go`

#### Overview

We want to look for a specific string in a file and print the offset of the first occurrence. When it comes down to it, a string is just a sequence of bytes in some character encoding, so we can find it by comparing the bytes in the file to the bytes in the string, one-by-one. That is, we'll:

1. parse the command line arguments
2. read the file into memory
3. compare the bytes in the file to the bytes in the string, one-by-one
   1. no match: continue at next offset
   2. match: print and exit 0 (ok)
4. exit 1 (error)

> #### Lemma: standard I/O streams
>
> All programs are connected to three files by default. This is often called "standard i/o", sometimes just `stdio`. They are:
>
> | FILE | Name | R/W? | Note | Python | JS | Go |
> | --- | --- | --- | --- | --- | --- | --- |
> | STDIN | standard input | R | what you type in the terminal goes here. | sys.stdin | process.stdin | os.Stdin |
> | STDOUT | standard output | W | what the program writes goes here. intended for other programs. | sys.stdout | process.stdout | os.Stdout |
> | STDERR | standard error | W | where the program writes errors. intended for humans. | sys.stderr | process.stderr | os.Stderr |

#### Used in this example:

| function or variable | type | description |
| --- | --- | --- |
| `os.Args` | `[]string` | command line arguments |
| `os.Exit(int)` | | exit the program with the given status code |
| `fmt.Fprintf(io.Writer, string, ...interface{})` | `(int, error)` | write formatted output to a stream (files, memory buffers, etc) |

#### [findoffset.py: click here](https://gitlab.com/efronlicht/blog/-/blob/58fb4c13f870a73514284617c71027bbe0a76e2a/articles/startingsystems/cmd/pythonports/findoffset.py)

```go
// findoffset.go is a command line tool that finds the offset of the first occurrence of a string in a file and prints it to stdout.
package main

import (
	"fmt"
	"os"
)

func main() {
	// 1. parse the command line arguments
	// the operating system provides command line arguments to your program.
	// os.Args[0] is the name of the program, and the rest are the 'real' arguments.
	if len(os.Args) != 3 {
		fmt.Fprintf(os.Stderr, "Usage: findoffset FILE PATTERN\n")
		os.Exit(1)
	}
	filepath, pattern := os.Args[1], os.Args[2]
	// 2. read the file into memory
	// it's inefficient to read the entire file into memory, but it's simple and works well for small files
	b, err := os.ReadFile(filepath) // we'll talk about how reading files works more later, too!
	if err != nil {
		fmt.Fprintf(os.Stderr, "read %s: %v\n", filepath, err) // HUMAN-READABLE DEBUG INFO should go to STDERR
		os.Exit(1)
	}
	// 3. compare the bytes in the file to the bytes in the string, one-by-one
	// note the <=: a match may end exactly at the last byte of the file.
	for i := 0; i <= len(b)-len(pattern); i++ {
		// byte-by-byte comparison. we index the string directly, since `range` over a
		// string steps through runes (characters), not bytes.
		for j := 0; j < len(pattern); j++ {
			// 3.1. no match: continue at next offset
			if b[i+j] != pattern[j] {
				break
			}
			// 3.2. match: print and exit 0 (ok)
			if j == len(pattern)-1 {
				// found it! print the offset & newline & exit
				fmt.Fprintf(os.Stdout, "%d\n", i) // MACHINE-READABLE OUTPUT should go to STDOUT
				os.Exit(0)
			}
		}
	}
	// 4. exit 1 (error)
	os.Exit(1)
}
```

Looks good. But how do we test it? It would be easier to test `findoffset` if we had a way to create files with specific contents. Let's **write a program** to do that - following the unix tradition, we'll call it `echo`.

### 4.2. writing simple files with `echo.go`

#### Overview

1. iterate over the command line arguments
2. print each argument to standard output, separated by spaces
3. terminate with a newline

```go
// echo prints its arguments to standard output, separated by spaces and terminated by a newline.
// usage: echo [ARG ...]
// see the python port at https://gitlab.com/efronlicht/blog/-/blob/0d2327696c01d6a46551fac21521937ee9f6fbe3/articles/startingsystems/cmd/pythonports/echo.py
package main

import (
	"fmt"
	"os"
)

func main() {
	// 1. iterate over the command line arguments
	for i, arg := range os.Args[1:] {
		if i > 0 {
			fmt.Print(" ")
		}
		// 2. print each argument to standard output, separated by spaces
		fmt.Print(arg)
	}
	// 3. terminate with a newline
	fmt.Println()
}
```

Now we can write a simple file, but how do we know what's in it? We can **write a program** to read it - following the unix tradition, we'll call it `cat`.

---

### 4.3. printing files with `cat.go`

> `cat` - short for con**cat**enate - combines files and prints them to standard output. But it's more often used to just read a single file and send it to the terminal or another program.

[`cat.py`: click here](https://gitlab.com/efronlicht/blog/-/blob/0d2327696c01d6a46551fac21521937ee9f6fbe3/articles/startingsystems/cmd/pythonports/cat.py)

#### overview

We want to

1. open each file specified on the command line
2. read it into memory
3. copy that memory to standard output

```go
// cat reads each file specified on the command line and writes its contents to standard output.
// usage: cat FILE [FILE ...]
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	for _, file := range os.Args[1:] {
		// 1. open each file specified on the command line
		f, err := os.Open(file)
		if err != nil {
			fmt.Fprintf(os.Stderr, "open %s: %v\n", file, err)
			os.Exit(1)
		}
		// performance note: it's better to use `io.Copy`, but I want to illustrate the process.
		b, err := io.ReadAll(f) // 2. read it into memory
		// close the file as soon as we're done with it: a `defer` inside a loop
		// wouldn't run until main returns, and os.Exit skips deferred calls entirely.
		f.Close()
		if err != nil {
			fmt.Fprintf(os.Stderr, "read %s: %v\n", file, err)
			os.Exit(1)
		}
		os.Stdout.Write(b) // 3. copy that memory to standard output
	}
}
```

Let's write a pair of files with `echo` and read them with `cat`.

```bash
#!/usr/bin/env bash
# IN
echo "the quick brown fox" > fox.txt
echo "jumps over the lazy dog" > dog.txt
cat fox.txt dog.txt
# OUT
the quick brown fox
jumps over the lazy dog
```

#### exercises

- Write a program to number the lines of a file, `numberlines`.
- Write a program to replace non-printable characters in a file, `escapetext`.

---

### 4.4. investigating the `hello` program with `findoffset`, `echo`, and `cat`

Let's write a simple file, `fox.txt`, and read it with `cat`.

#### bash script: `catfox.bash`

```bash
#!/usr/bin/env bash
# IN
echo "the quick brown fox jumps over the lazy dog" > fox.txt
cat fox.txt
# OUT
the quick brown fox jumps over the lazy dog
```

Looks good. Let's use `findoffset` to find the offset of "brown" in `fox.txt`.

```bash
#!/usr/bin/env bash
# findbrown.bash looks for the string "brown" in the file "fox.txt" and prints the offset.
# IN:
echo "the quick brown fox jumps over the lazy dog" > fox.txt
findoffset fox.txt "brown"
# OUT
10
```

Seems like it works. Let's find the offset of "hello, world!" in our `hello` program.

```bash
#!/usr/bin/env bash
# IN
findoffset hello "hello, world!"
# OUT
721335
```

> #### exercises
>
> - modify `findoffset` to take an additional argument which specifies the occurrence of the string to find. For example, `findoffset hello "hello, world!" 2` should find the second occurrence of "hello, world!" in the `hello` program.
> - allow a negative occurrence number in `findoffset` to search from the end of the file. For example, `findoffset hello "hello, world!" -1` should find the last occurrence of "hello, world!" in the `hello` program.

---

### 4.5. Basic Hacking w/ `binpatch.go`

Suppose we want to change the behavior of a _compiled_ program and don't have access to the source code. We know the following facts:

- Our program is just a bunch of bytes.
- We can _read_ and _write_ those bytes.

This is all we need to know to change the behavior. Let's change our program to write "hello, efron!" instead of "hello, world!" _without_ recompiling it. We can do this by **patching** the binary. Let's **write a program**, `binpatch`, to do so.

#### Overview

We want to copy everything over _except_ a specific chunk of bytes. We need to:

1. Parse the arguments
2. Copy everything before the replacement from `file` to standard output (that is, `offset` bytes of the file)
3.
   Write `replacement` to standard output
4. Skip over the bytes we're replacing
5. copy the rest of the file to standard output

[`binpatch.py`: click here](https://gitlab.com/efronlicht/blog/-/blob/master/articles/startingsystems/cmd/pythonports/binpatch.py?ref_type=heads)

```go
// binpatch replaces a sequence of bytes in file starting at offset with a replacement string,
// and writes the result to standard output.
// Usage: binpatch FILE OFFSET REPLACEMENT
package main

import (
	"fmt"
	"io"
	"os"
	"strconv"
)

func main() {
	// 1. Parse the arguments
	// the first argument is the name of the program, so we need to check for 4 arguments.
	// we'll talk more about arguments later.
	if len(os.Args) != 4 {
		// having the name of the program is useful for error messages, like this one.
		// error messages are written to stderr, so they don't interfere with the output.
		fmt.Fprintf(os.Stderr, "Usage: %s FILE OFFSET REPLACEMENT\n", os.Args[0])
		os.Exit(1)
	}
	var (
		file        = os.Args[1]
		offset, err = strconv.ParseInt(os.Args[2], 0, 64)
		replacement = os.Args[3]
	)
	if err != nil || offset < 0 {
		fatalf("invalid offset %q: %v\nUsage: %s FILE OFFSET REPLACEMENT\n", os.Args[2], err, os.Args[0])
	}
	// open the file for reading. the patched copy goes to standard output,
	// so we never need to write to the original.
	f, err := os.Open(file)
	if err != nil {
		fatalf("open %s: %v\n", file, err)
	}
	defer f.Close()

	// 2. Copy everything before the replacement from `file` to standard output (that is, `offset` bytes of the file)
	_, err = io.CopyN(os.Stdout, f, offset)
	if err != nil {
		fatalf("copy: %v\n", err)
	}
	// we're now at the offset where we want to write the replacement chunk.
	// 3. Write `replacement` to standard output
	_, err = os.Stdout.Write([]byte(replacement))
	if err != nil {
		fatalf("write: %v\n", err)
	}
	// 4. Skip over the bytes we're replacing by throwing them away.
	if _, err := io.CopyN(io.Discard, f, int64(len(replacement))); err != nil {
		fatalf("copy: %v\n", err)
	}
	// 5. copy the rest of the file to standard output
	_, err = io.Copy(os.Stdout, f)
	if err != nil {
		fatalf("copy: %v\n", err)
	}
}

// fatalf prints an error message to stderr with fmt.Fprintf, then exits with status 1.
func fatalf(format string, args ...interface{}) {
	fmt.Fprintf(os.Stderr, format, args...)
	os.Exit(1)
}
```

Let's try changing "brown" to "green" in our `fox.txt` file.

```sh
# IN
findoffset fox.txt "brown"
# OUT
10
```

And use that offset to patch the file...

```sh
# IN
binpatch fox.txt 10 "green"
# OUT
the quick green fox jumps over the lazy dog
```

#### Hacking a binary

It works! Let's try it on our `hello` program.

> **bash note**: the `$(...)` syntax in the bash shell is called "command substitution" - it runs the command inside the parentheses and replaces the expression with the output of the command. We can use this to feed the output of `findoffset` into `binpatch`.

```bash
# IN
binpatch hello $(findoffset hello "hello, world!") "hello, efron!" > hackedhello
./hackedhello
# OUT
bash: ./hackedhello: Permission denied
```

Whoops, forgot to make `hackedhello` executable. Let's fix that.

> We'll cover file permissions later. For now, know that files can be READABLE (`r`), WRITABLE (`w`), and EXECUTABLE (`x`). The `chmod` command changes these permissions. `+x` makes a file executable, `-x` makes it non-executable, and `+r` makes it readable.

```sh
# IN
chmod +x hackedhello
./hackedhello
# OUT
hello, efron!
```

It works! We've successfully changed the behavior of our program without recompiling it. **This is more hacking than most 'programmers' ever do in their entire careers - and we're just warming up**.

> BTW, anyone can do this. You probably should verify the integrity of your executables against some kind of checksum after you download them... but you probably don't.
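
#### sketch: checking a checksum

The integrity check mentioned in the note above is another program you can write yourself. What follows is a minimal sketch, not part of the original series: the name `sha256sum.go` and the helper `sha256Hex` are assumptions made for illustration, and SHA-256 is just one reasonable choice of checksum. It reads the whole file into memory, like `findoffset` does, and prints the digest as hexadecimal. Run it on `hello` and on `hackedhello` and the two checksums should differ, even though the files are the same length and differ in only five bytes.

```go
// sha256sum.go prints the SHA-256 checksum of a file as hexadecimal.
// this is an illustrative sketch: the program name and helper are assumptions, not part of the series.
// usage: sha256sum FILE
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// sha256Hex returns the SHA-256 digest of data as a lowercase hexadecimal string.
func sha256Hex(data []byte) string {
	sum := sha256.Sum256(data) // a [32]byte array
	return hex.EncodeToString(sum[:])
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintf(os.Stderr, "Usage: %s FILE\n", os.Args[0])
		os.Exit(1)
	}
	// reading the whole file into memory is simple, and fine for small files.
	b, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintf(os.Stderr, "read %s: %v\n", os.Args[1], err)
		os.Exit(1)
	}
	fmt.Println(sha256Hex(b))
}
```
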
---

**You have the power to look inside programs and see what's there - or even to change them!** If you take nothing else away from this, remember that - it's _not_ a black box, it's _not_ magic. They're just bytes.

When it comes down to it, programming is about **transforming data**. It's not about theories, paradigms, patterns - those things are handy, but they miss the point. Programming is about taking data, changing it, and producing new data. You can do this with a text editor, a hex editor, or a program - you can do it with the help of an IDE, a library, GitHub Copilot, whatever - but there's no substitute for that fundamental understanding.

---

Let's keep looking at the data inside our program. What's in the binary _around_ what used to be "hello, world!"? You know the drill - let's **write a program**, `torso`, to find out.

> The unix coreutils `head` and `tail` programs read the beginning and end of a file, respectively. Usually you'd use those - torso is a play on those words, reading the 'middle' of a file.

> We'll cover arguments and flags in more detail later. For now, know that `-flag` is a common way to specify flags in unix programs.

### 4.6. reaching into files with `torso.go`

#### Overview

We need to:

1. define and parse command line flags:
   - choose a file to read `-from`
   - choose an `-offset` to read from
   - choose how many bytes to read `-before` and `-after` the offset
2. skip to the first byte we want to read (`offset` - `before`)
3. copy (`before` + `after`) bytes to standard output
4. add a newline if requested

[`torso.py`: click here](https://gitlab.com/efronlicht/blog/-/blob/dfbbfbbee27a90fb11187a72be43a19113fd6287/articles/startingsystems/cmd/pythonports/torso.py)

```go
// torso reads the 'middle' of a file - the bytes around a given offset.
// it's not the head of the file, and it's not the tail - it's the torso.
// usage:
//
//	torso -offset n -before [b=128] -after [a=128] -from file [-newline]
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

func main() {
	var offset, before, after int
	var from string
	var newline bool
	{ // 1. define and parse command line flags
		flag.IntVar(&offset, "offset", -1, "offset to read from: must be specified")             // -offset is required
		flag.IntVar(&before, "before", 128, "bytes to read before offset: will be clamped to 0") // defaults to -before 128
		flag.IntVar(&after, "after", 128, "bytes to read after offset: will be clamped to 0")    // defaults to -after 128
		flag.StringVar(&from, "from", "", "file to read from: must be specified")                // -from is required
		flag.BoolVar(&newline, "newline", false, "append a newline to the output")               // -newline is optional
		flag.Parse()
	}
	{ // bounds checking and normalization
		if offset < 0 { // check this first: the clamping below assumes a valid offset
			fmt.Fprintf(os.Stderr, "missing or invalid -offset\n")
			os.Exit(1)
		}
		before = max(before, 0)      // can't be negative
		before = min(before, offset) // can't go past the beginning
		after = max(after, 0)
	}
	start := offset - before // where to start?
	n := before + after      // total number of bytes to read
	if n == 0 {
		return // nothing to do
	}
	buf := make([]byte, n)
	// read from a file
	f, err := os.Open(from)
	if err != nil {
		fmt.Fprintf(os.Stderr, "open: %s: %v\n", from, err)
		os.Exit(1)
	}
	// 2. skip to the first byte we want to read (offset - before)
	_, err = f.Seek(int64(start), io.SeekStart)
	if err != nil {
		fmt.Fprintf(os.Stderr, "seek: %s: %v\n", from, err)
		// make sure to close the file before exiting!
		// experienced go programmers will use `defer`, but I want this to be accessible to non-go programmers.
		f.Close()
		os.Exit(1)
	}
	// 3. copy (before + after) bytes to standard output
	// first read them into memory...
	n, err = io.ReadFull(f, buf)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		fmt.Fprintf(os.Stderr, "read: %s: %v\n", from, err)
		f.Close()
		os.Exit(1)
	}
	buf = buf[:n]
	// then write them to standard output
	_, err = os.Stdout.Write(buf)
	if err != nil {
		fmt.Fprintf(os.Stderr, "write: %v\n", err)
		f.Close()
		os.Exit(1)
	}
	// 4. add a newline if requested
	if newline {
		fmt.Println()
	}
	f.Close()
}
```

```bash
#!/usr/bin/env bash
# IN
torso -offset $(findoffset hello "hello, world!") -before 128 -after 128 -from hello
# OUT
mismatchwrong timersillegal seekinvalid slothost is downnot pollablegotypesaliashttpmuxgo121multipathtcprandautoseedtlsunsafeekmhello, world!3814697265625wakeableSleepprofMemActiveprofMemFuturetraceStackTabexecRInternaltestRInternalGC sweep waitSIGQUIT: qu
```

Looks like a mixture of error messages ("illegal seek", "host is down", "not pollable"), internal go runtime messages ("profMemActive", "profMemFuture"), and some kind of mysterious number, 3814697265625. In the next section, we'll investigate where that number comes from.

> #### exercises
>
> - modify `torso` to work on lines instead of bytes with a `-lines` flag.
> - modify `torso` to work on words instead of bytes with a `-words` flag.
> - modify `torso` to work on utf-8 codepoints instead of bytes with a `-runes` flag.
> - modify `torso` to read from standard input if no `-from` file is given.

---

### 4.7. Investigation: What's with `3814697265625`?

If the string appeared in our **source code**, it's got to be in the **binary**. If it's in the binary... it's probably in the **source code**. Maybe in one of our imported packages? Let's look for it in the go language source code.

For that, we need a program that finds files that contain a string. The classic unix tool `grep` is perfect for this. We'll **write a program** to do this ourselves... but that's for next time, when we've talked a bit more about files.
Let's use it to find the first appearance of "3814697265625" in the go source code for our version of go.

```sh
# IN
git clone https://github.com/golang/go
cd go/src                # the standard library lives under src/
git checkout go1.23.0    # or whatever version you're using
grep -r "3814697265625" .
# OUT
math/big/floatconv.go: 3814697265625,
strconv/decimal.go: {6, "3814697265625"}, // * 262144
```

Which one shows up in our program, though? Let's use `findoffset` to find out.

```sh
# IN
# is it math/big?
findoffset hello "math/big" || echo "math/big not found"
findoffset hello "strconv" || echo "strconv not found"
# OUT
math/big not found
745064
```

It's **strconv**. Let's use `cat` again to see what's around it.

#### IN

```sh
#!/usr/bin/env bash
cat strconv/decimal.go
```

#### OUT

```go
// Cheat sheet for left shift: table indexed by shift count giving
// number of new digits that will be introduced by that shift.
//
// For example, leftcheats[4] = {2, "625"}.  That means that
// if we are shifting by 4 (multiplying by 16), it will add 2 digits
// when the string prefix is "625" through "999", and one fewer digit
// if the string prefix is "000" through "624".
//
// Credit for this trick goes to Ken.

type leftCheat struct {
	delta  int    // number of new digits
	cutoff string // minus one digit if original < a.
}

var leftcheats = []leftCheat{
	// Leading digits of 1/2^i = 5^i.
	// 5^23 is not an exact 64-bit floating point number,
	// so have to use bc for the math.
	// Go up to 60 to be large enough for 32bit and 64bit platforms.
	/*
		seq 60 | sed 's/^/5^/' | bc |
		awk 'BEGIN{ print "\t{ 0, \"\" }," }
		{
			log2 = log(2)/log(10)
			printf("\t{ %d, \"%s\" },\t// * %d\n",
				int(log2*NR+1), $0, 2**NR)
		}'
	*/
	{0, ""},
	/* many entries omitted for space ...*/
	{6, "3814697265625"}, // * 262144
	/* many entries omitted for space ...*/
	{19, "867361737988403547205962240695953369140625"}, // * 1152921504606846976
}
```

This turns out to be a sophisticated bit of systems programming by the master Ken Thompson himself!
We'll take another look at this later... consider it a sneak peek ;).

> #### Ken Thompson
>
> Kenneth Lane Thompson (b. 1943) created Unix, `grep`, `Go`, and a zillion other tools that form the foundation of nearly every computer system in the world. He didn't do it all by himself, but he's legendarily productive - he's said to have implemented unix pipes in a single night. If you've ever used a regular expression, a mac or linux computer, or a smart phone, you've used his work.

That's enough fiddling with the data segment (for now). Let's see if we can get some insight into the rest of the program.

## 5. Investigating the code segment.

Let's use `torso` to take a peek at the first 256 bytes of our program.

```sh
# IN
torso -offset 0 -after 256 -from hello
# OUT
ELF
```

That seems like a lot fewer bytes than we expected. What's going on?

The terminal isn't sure how to display binary data. **ASCII** - the character encoding understood by nearly every terminal since the 1960s - covers 128 characters. Some of these are printable, like 'a' or '|' - but many are not. The first 32 characters are **control characters** - they _control_ the terminal, rather than printing anything. The first 256 bytes of our program must contain a lot of these - and they're confusing the terminal.

To look deeper, we need to **serialize** the bytes into a human-readable format. The most common way to do this is to print the bytes as **hexadecimal**. Let's **write a program** to do this - a take on the classic unix tool `hexdump`.

### 5.1. reading binary files with `shexdump.go`

This tool will be our first foray into manipulating binary data, a core systems programming skill. We'll read the file in chunks, convert each byte to a pair of hexadecimal digits, and print them to standard output. We'll call it `shexdump` (**s**imple **hex**adecimal **dump**) to distinguish it from the classic `hexdump`.
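
The byte-to-hex step is the heart of the tool, so it's worth seeing in isolation first. This is a small sketch rather than a piece of `shexdump` itself: `hexByte` is a hypothetical helper name chosen for illustration. A byte is 8 bits, and a hexadecimal digit covers 4 bits (a "nibble"), so every byte becomes exactly two digits: one for the high nibble and one for the low nibble.

```go
// hexbyte.go shows how a single byte becomes a pair of hexadecimal digits.
// hexByte is a hypothetical helper for illustration, not part of shexdump.
package main

import "fmt"

// hexByte splits b into its high and low nibbles (4 bits each)
// and looks each one up in a table of hexadecimal digits.
func hexByte(b byte) string {
	const digits = "0123456789abcdef"
	hi := b >> 4   // top 4 bits: 0x6e becomes 0x6
	lo := b & 0x0f // bottom 4 bits: 0x6e becomes 0xe
	return string([]byte{digits[hi], digits[lo]})
}

func main() {
	for _, b := range []byte("now") {
		fmt.Printf("%q -> %s\n", b, hexByte(b))
	}
	// prints:
	// 'n' -> 6e
	// 'o' -> 6f
	// 'w' -> 77
}
```

The lookup table avoids any arithmetic on character codes: the nibble's value (0 through 15) is used directly as an index into the string of digits.
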
It would be nice if we could output the hexadecimal bytes in a more readable format - say, 16 bytes per line, with a space between each byte and a newline after every 16 bytes. What will we need to do?

1. choose the input source: stdin or a file
2. read a chunk of bytes (up to 16) from the input
3. convert each byte in the chunk to a pair of hexadecimal digits
4. print the hexadecimal digits to standard output, space-separated, terminating with a newline

[`hexdump.py`: click here](https://gitlab.com/efronlicht/blog/-/blob/073f60b6e7c057961fd9344c766cca6b63ff9900/articles/startingsystems/cmd/pythonports/simplehexdump.py)

```go
// shexdump.go dumps the input as pairs of space-separated hexadecimal bytes,
// with an extra space after the 8th byte and a newline after every 16 bytes.
//
// # example
//
//	#!/usr/bin/env bash
//	echo "now is the time of all good men to come to the aid of their country " | shexdump
//	6e 6f 77 20 69 73 20 74  68 65 20 74 69 6d 65 20
//	6f 66 20 61 6c 6c 20 67  6f 6f 64 20 6d 65 6e 20
//	74 6f 20 63 6f 6d 65 20  74 6f 20 74 68 65 20 61
//	69 64 20 6f 66 20 74 68  65 69 72 20 63 6f 75 6e
//	74 72 79 20 0a
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

func main() {
	// 1. choose the input source: stdin or a file
	var src io.Reader
	switch len(os.Args) {
	case 1:
		src = os.Stdin
	case 2:
		f, err := os.Open(os.Args[1])
		if err != nil {
			fmt.Fprintf(os.Stderr, "open %s: %v\n", os.Args[1], err)
			os.Exit(1)
		}
		defer f.Close()
		src = f
	default:
		fmt.Fprintf(os.Stderr, "Usage: %s [filename]\n", os.Args[0])
		os.Exit(1)
	}
	if err := hexdump(os.Stdout, src); err != nil {
		fmt.Fprintf(os.Stderr, "hexdump: %v\n", err)
		os.Exit(1)
	}
}

// dump the contents of src to dst in a hexdump format.
func hexdump(dst io.Writer, src io.Reader) error {
	// performance: small reads and writes are very inefficient. while we could
	// write a byte at a time, it's much faster to read and write in chunks.
	r := bufio.NewReader(src)
	w := bufio.NewWriter(dst)
	for {
		// 2. read a chunk of bytes (up to 16) from the input
		var raw [16]byte
		// 16 bytes, 3 characters per byte, plus the gap between the two columns.
		encoded := make([]byte, 0, 16*3+1+1)
		n, err := io.ReadFull(r, raw[:])

		// 3. convert each byte in the chunk to a pair of hexadecimal digits
		const hex = "0123456789abcdef"
		if n != 0 {
			for i := range min(n, 8) {
				encoded = append(encoded, hex[raw[i]>>4], hex[raw[i]&0x0f], ' ')
			}
			if n > 8 { // second column: only add the gap if there is something after it
				encoded = append(encoded, ' ')
				for i := 8; i < n; i++ {
					encoded = append(encoded, hex[raw[i]>>4], hex[raw[i]&0x0f], ' ')
				}
			}
			encoded[len(encoded)-1] = '\n' // replace the trailing space
			// 4. print the hexadecimal digits to standard output, space-separated, terminating with a newline
			if _, err := w.Write(encoded); err != nil {
				return err
			}
		}
		switch err {
		case nil:
			// full chunk: keep going
		case io.EOF, io.ErrUnexpectedEOF:
			// io.EOF: nothing left to read. io.ErrUnexpectedEOF: a short final chunk.
			// both mean we're done.
			return w.Flush()
		default:
			return err
		}
	}
}
```

---

#### Exercises

- Add an `-offset` flag to prefix each line with the offset in the file in hexadecimal, terminating with the ending offset of the line (default false)
- Add a `-columns` flag to specify the number of columns to print per line (default 2)
- Add a `-column-width` flag to specify the number of bytes to print per column. (default 8)
- Add a `-squeeze` flag to compress multiple consecutive identical lines into a single line with a `*` and a count (default false)
- Add an `-ascii` flag to suffix each line with the ASCII representation of the bytes (default false). Replace non-printable or non-ASCII bytes with a `.`.

  ```bash
  #!/usr/bin/env bash
  # IN
  echo -n "now is the time " | shexdump -ascii
  # OUT
  6e 6f 77 20 69 73 20 74  68 65 20 74 69 6d 65 20 |now is the time |
  ```

- Add a `-canon` flag to print the output in canonical hex+ASCII format, like `hexdump -C` (default false).
  **This combines the `-ascii` and `-offset` flags and should be exclusive with them.**

  ```bash
  #!/usr/bin/env bash
  # IN
  echo "now is the time of all good men to come to the aid of their country " | shexdump -canon
  ```

  ```
  # OUT
  00000000  6e 6f 77 20 69 73 20 74  68 65 20 74 69 6d 65 20  |now is the time |
  00000010  6f 66 20 61 6c 6c 20 67  6f 6f 64 20 6d 65 6e 20  |of all good men |
  00000020  74 6f 20 63 6f 6d 65 20  74 6f 20 74 68 65 20 61  |to come to the a|
  00000030  69 64 20 6f 66 20 74 68  65 69 72 20 63 6f 75 6e  |id of their coun|
  00000040  74 72 79 20 0a                                    |try .|
  00000045
  ```

---

How do we test if we've written this correctly? We should be able to losslessly convert the output of `shexdump` back into the original file. Let's **write a function** to do that - and **write a program**, `unhexdump`, to use it.

### 5.2. deserializing hexdumps with `unhexdump.go`

#### overview

Our last program, `shexdump`, _reads_ binary and _writes_ whitespace-separated pairs of hexadecimal digits. This program, `unhexdump`, should _read_ whitespace-separated pairs of hexadecimal digits and _write_ the original binary. In other words:

1. choose the input source: stdin or a file
2. read pairs of whitespace-separated hexadecimal bytes from a file
3. convert each pair back to a byte, `unhex`-ing it
4. write that `unhex`-ed byte to standard output

#### program: `unhexdump.go`

[`unhexdump.py`: click here](https://gitlab.com/efronlicht/blog/-/blob/073f60b6e7c057961fd9344c766cca6b63ff9900/articles/startingsystems/cmd/pythonports/unhexdump.py)

```go
// unhexdump.go reverses the process of hexdump, converting a hexdump back into a file.
// it expects pairs of whitespace-separated hexadecimal bytes.
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

func main() {
	// 1. choose the input source: stdin or a file
	var src io.Reader
	switch len(os.Args) {
	case 1:
		src = os.Stdin
	case 2:
		f, err := os.Open(os.Args[1])
		if err != nil {
			fmt.Fprintf(os.Stderr, "open %s: %v\n", os.Args[1], err)
			os.Exit(1)
		}
		defer f.Close()
		src = f
	default:
		fmt.Fprintf(os.Stderr, "Usage: %s [filename]\n", os.Args[0])
		os.Exit(1)
	}
	if err := unhexdump(os.Stdout, src); err != nil {
		fmt.Fprintf(os.Stderr, "unhexdump: %v\n", err)
		os.Exit(1)
	}
}

// unhexdump reads pairs of whitespace-separated hexadecimal bytes from r and writes the corresponding bytes to w.
func unhexdump(w io.Writer, r io.Reader) error {
	// 2. read pairs of whitespace-separated hexadecimal bytes from the input.
	// we're going to start caring a little more about performance here. we'll use a
	// buffered reader and writer to reduce the number of system calls & allocations
	// (more about these topics in later articles).
	scanner := bufio.NewScanner(r)
	scanner.Split(bufio.ScanWords)
	bw := bufio.NewWriter(w)
	for i := 0; scanner.Scan(); i++ { // i counts whitespace-separated words
		b := scanner.Bytes()
		if len(b)&1 == 1 { // odd number of hex digits
			return fmt.Errorf("odd number of hex digits in word %d (%q)", i, b)
		}
		// 3. convert each pair back to a byte, unhex-ing it
		for j := 0; j < len(b); j += 2 { // j is the position inside the word
			high, ok := unhex(b[j])
			if !ok {
				return fmt.Errorf("bad hex %x '%c' in word %d at position %d", b[j], b[j], i, j)
			}
			low, ok := unhex(b[j+1])
			if !ok {
				return fmt.Errorf("bad hex %x '%c' in word %d at position %d", b[j+1], b[j+1], i, j+1)
			}
			// 4. write that unhex-ed byte to standard output
			if err := bw.WriteByte(high<<4 | low); err != nil {
				return err
			}
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	return bw.Flush() // report write errors instead of silently dropping them
}

// unhex converts a hexadecimal character to its value (0x0-0xf),
// or 0, false if the character is not a valid hexadecimal digit.
func unhex(b byte) (byte, bool) {
	switch {
	case '0' <= b && b <= '9':
		return b - '0', true
	case 'a' <= b && b <= 'f':
		return b - 'a' + 10, true
	case 'A' <= b && b <= 'F':
		return b - 'A' + 10, true
	default:
		return 0, false
	}
}
```

Let's test these programs by hexdumping a file, then unhexdumping it.

### 5.3. hexdump_test.bash

#### overview

1. write a file, `moby.txt`
2. hexdump it to `moby.hex`
3. unhexdump it to `moby2.txt`
4. compare the files (we'll use `diff`, a classic Unix tool, to do this)

```bash
#!/usr/bin/env bash
# IN
echo "to the last I grapple with thee; from hell's heart I stab at thee; for hate's sake I spit my last breath at thee" > moby.txt # 1. write a file
shexdump moby.txt > moby.hex # 2. hexdump it
unhexdump moby.hex > moby2.txt # 3. unhexdump it
diff -s moby.txt moby2.txt # 4. compare the files
```

```bash
# OUT:
Files moby.txt and moby2.txt are identical
```

Looks good. As a final exercise before we close this article out, let's use `torso` and `shexdump` to investigate the first bytes of our `hello` program and see what we can learn.

### 5.4. Investigating the ELF header of `hello`

Programs in Linux are stored in `ELF` (Executable and Linkable Format), which starts with a header that describes the rest of the file. This is a common way to organize file formats - a brief header explaining the rest. Let's use ELF as an example header.

```bash
#!/usr/bin/env bash
# IN
cat hello | torso -offset 0 -after 32 | shexdump
# OUT
7f 45 4c 46 02 01 01 00  00 00 00 00 00 00 00 00
02 00 3e 00 01 00 00 00  c0 cf 46 00 00 00 00 00
```

`7f` is not a printable character - it's the ASCII control character DEL - but `45 4c 46` is `E L F` in ASCII. Beyond that, it's mostly opaque binary - but we can already figure out why `hello` stopped printing with `cat` given our new knowledge of C-strings. The 8th byte of `hello` is a null, which terminates the string and got our shell to stop printing.
[Let's look at what Wikipedia says about the ELF header to learn more](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format). The **object file type** is identified by a field at offset 0x10 (we only need its first byte):

```bash
# IN
cat hello | torso -offset 0x10 -before 0 -after 1 | shexdump
# OUT
02
```

| Value    | Type        | Meaning              |
| -------- | ----------- | -------------------- |
| 0x00     | ET_NONE     | Unknown.             |
| 0x01     | ET_REL      | Relocatable file.    |
| **0x02** | **ET_EXEC** | **Executable file.** |
| 0x03     | ET_DYN      | Shared object.       |
| 0x04     | ET_CORE     | Core file.           |

Unsurprisingly, our executable is an executable. We can take a peek at our architecture with the **machine type** at offset 0x12.

```bash
# IN
cat hello | torso -offset 0x12 -before 0 -after 1 | shexdump
# OUT
3e
```

This is the **x86-64** architecture. The **endianness** of the file is stored at offset 0x05: in our dump above that byte is `01`, which means little-endian.

That's great, but where does our program _actually start_? The **entrypoint** is the place in memory where the program starts executing - the initial value of the [**instruction pointer**](#lemma-instruction-pointer). This is stored in the ELF header, [**little-endian**](#lemma-endianness), at offset 0x18.

```bash
# IN
cat hello | torso -offset 0x18 -before 0 -after 8 | shexdump
# OUT
c0 cf 46 00 00 00 00 00
```

Looks like we'll start executing instructions at memory address `0x46cfc0`. More about that later.

That's all the digging we'll do for now - this article's already over twenty pages long! If you made it through - and especially if you did the exercises - congratulations and thank you for sticking with me. A couple of final notes before we close out:

> #### Lemma: Instruction Pointer
>
> At the most fundamental level, a program works like this:
>
> - read an instruction at the memory address pointed to by the **instruction pointer**
> - do what the instruction says
> - increment the instruction pointer by the size of the instruction
>
> The **entrypoint** is where the **instruction pointer** starts.
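
The three reads above (type at 0x10, machine at 0x12, entrypoint at 0x18) can also be done in one pass over the first 32 bytes. The following is a rough sketch, not a full ELF parser: it only handles 64-bit little-endian files, and the names `elfSummary`, `parseELF64LE`, and `sampleHeader` are invented for this example. The sample bytes are the ones `shexdump` printed for `hello`.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// elfSummary holds the handful of ELF64 header fields we looked at by hand.
type elfSummary struct {
	Type    uint16 // offset 0x10: 2 means ET_EXEC
	Machine uint16 // offset 0x12: 0x3e means x86-64
	Entry   uint64 // offset 0x18: initial instruction pointer
}

// sampleHeader is the first 32 bytes of `hello`, copied from the dump above.
var sampleHeader = []byte{
	0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x02, 0x00, 0x3e, 0x00, 0x01, 0x00, 0x00, 0x00, 0xc0, 0xcf, 0x46, 0x00, 0x00, 0x00, 0x00, 0x00,
}

// parseELF64LE decodes those fields from the start of a 64-bit,
// little-endian ELF file. Anything else is rejected with an error.
func parseELF64LE(h []byte) (elfSummary, error) {
	if len(h) < 0x20 {
		return elfSummary{}, fmt.Errorf("need at least 32 bytes, got %d", len(h))
	}
	if string(h[:4]) != "\x7fELF" {
		return elfSummary{}, fmt.Errorf("bad magic: % x", h[:4])
	}
	if h[4] != 2 || h[5] != 1 { // byte 4: 2 = 64-bit; byte 5: 1 = little-endian
		return elfSummary{}, fmt.Errorf("not 64-bit little-endian (class=%d, data=%d)", h[4], h[5])
	}
	return elfSummary{
		Type:    binary.LittleEndian.Uint16(h[0x10:]),
		Machine: binary.LittleEndian.Uint16(h[0x12:]),
		Entry:   binary.LittleEndian.Uint64(h[0x18:]),
	}, nil
}

func main() {
	s, err := parseELF64LE(sampleHeader)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("type=%#x machine=%#x entry=%#x\n", s.Type, s.Machine, s.Entry)
}
```

For real work, Go's standard library already ships a `debug/elf` package. Decoding the fields by hand is only meant to show that the header really is just bytes at fixed offsets.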
> #### Lemma: Endianness
>
> When we want to convert a list of bytes to an integer, there are two ways to do it: either the most significant byte comes first (big-endian), or the least significant byte comes first (little-endian). The x86-64 architecture is little-endian, so the bytes `c0 cf 46 00 00 00 00 00` are 'reversed' when converted to an integer; they're the number `0x0046cfc0` (4640704) in little-endian, but `0xc0cf460000000000` (13893400341275213824) in big-endian. Since we know that this represents a **memory address**, we can figure out that this is a little-endian machine just by looking at the bytes - no machine ever made has 12635974 terabytes of memory, so you're not going to see that in a memory address. Knowing a bit about how computers _actually work_ can help you understand what you're looking at.
>
> **Exercise**: **Write a program** to read a 64-bit little-endian integer from a file starting at a given offset and print it "big-endian" (most significant byte first).

## 6. Conclusion: The Spirit of Systems Programming.

Hopefully I've given you a 'taste' of systems programming. A summary of the "mindset" or "spirit" behind this article would be:

- Build your own tools.
- Look at the data with your own eyes.
- Understand the system rather than relying on abstractions.
- It's all just bytes.
- **Programmers write programs**.

That's all for now! In the next article, Starting Systems 2: Your Program and the Outside World, we'll look at how programs _actually do things_ - how they interact with the outside world via files and system calls, how they manage memory, environment variables and command-line arguments, standard input and output, and more.

> #### Sidenote: The origin of endianness
>
> The term "endianness" is a reference to Gulliver's Travels. It comes from "ON HOLY WARS AND A PLEA FOR PEACE" by Danny Cohen on April 1, 1980: the [internet archive has a copy of the original post](https://web.archive.org/web/20220210184239/http://www.networksorcery.com:80/enp/ien/ien137.txt). I reproduce the relevant section of Gulliver's Travels here:
>
> > It began upon the following occasion.
> >
> > It is allowed on all hands, that the primitive way of breaking eggs before we eat them, was upon the larger end: but his present Majesty's grandfather, while he was a boy, going to eat an egg, and breaking it according to the ancient practice, happened to cut one of his fingers. Whereupon the Emperor his father published an edict, commanding all his subjects, upon great penalties, to break the smaller end of their eggs.
> > The people so highly resented this law, that our Histories tell us there have been six rebellions raised on that account, wherein one Emperor lost his life, and another his crown. These civil commotions were constantly fomented by the monarchs of Blefuscu, and when they were quelled, the exiles always fled for refuge to that Empire.
> >
> > It is computed, that eleven thousand persons have, at several times, suffered death, rather than submit to break their eggs at the smaller end. Many hundred large volumes have been published upon this controversy: but the books of the Big-Endians have been long forbidden, and the whole party rendered incapable by law of holding employments.
> >
> > During the course of these troubles, the emperors of Blefuscu did frequently expostulate by their ambassadors, accusing us of making a schism in religion, by offending against a fundamental doctrine of our great prophet Lustrog, in the fifty-fourth chapter of the Brundecral (which is their Alcoran). This, however, is thought to be a mere strain upon the text: for their words are these; That all true believers shall break their eggs at the convenient end: and which is the convenient end, seems, in my humble opinion, to be left to every man's conscience, or at least in the power of the chief magistrate to determine.