A software article by Efron Licht
MAR 2025
starting systems programming: practical systems programming for the contemporary developer (2025-)
This is the first of four articles on the fundamentals of systems programming. It will cover many of the essentials, such as bit manipulation, parsing, filesystems, input/output, syscalls, memory management, and signals. Like many of my article series, this is more of a grab bag than a comprehensive guide - but I hope it will be useful to you.
When’s the last time you wrote a program from scratch? For a shocking number of programmers, the answer is ‘in school’. This is a pervasive problem in the industry, and it’s only getting worse. I interview a lot of candidates, and I’ve run into people with titles like ‘Technical Lead @ Tesla’ (or worse yet, Principal Engineer) who can’t program their way out of a paper bag. My ordinary interview question is “write me grep” - a problem which should be appropriate for first or second-year computer science students - and the overwhelming majority of candidates fail it.
I don’t think they’re dumb - they usually aren’t - but they don’t have the grounding in programming fundamentals - systems programming fundamentals - that they need to “really” program. This is bad for the candidates, bad for the industry, and bad for our increasingly computerized world. Why? Making reliable, intuitive, and efficient software is about minimizing complexity. If all you can do is add on to existing programs, you’re glued to pre-existing complexity. If you can’t write a program from scratch, you’re stuck in a world of other people’s code and other people’s mistakes.
The way to get good at something is by doing it. Pitchers pitch, painters paint, and programmers program. So this article will be about writing programs - dozens of them. As such, while reading the text of this article might teach you a few things, to really get the most out of it, you’ll need to understand the programs. I’ve provided exercises to help you practice.
Where possible, the code in this series will use as few libraries as possible. This is not because you shouldn’t use libraries - but because you shouldn’t need them. I want to show you that you can make practical tools out of simple primitives.
You’ll see a number of code blocks throughout this article. These are either go programs or bash shell scripts. If you’ve had experience with a mainstream programming language like Python, Javascript, or C, you should be able to follow along, but you might want to review pointers a bit.
Go uses // for comments, and bash and python use #. I’ll start each code block with a comment indicating the language: // filename.go or #!/usr/bin/env bash.
I have provided python implementations of many programs in this series. See my gitlab in the articles/startingsystems/cmd/pythonports directory.
I’ll try to link the specific files at the head of each go program. The go program should always be treated as the ‘canonical’ version. I may or may not keep doing this in the next article - it’s a lot of work.
Go programs will start with // filename.go <description>.
// minimal.go is an example go program.
// see https://gitlab.com/efronlicht/blog/-/blob/58fb4c13f870a73514284617c71027bbe0a76e2a/articles/startingsystems/cmd/pythonports/minimal.py for the python version.
package main

import "fmt"

func main() { fmt.Println("this is a go program") }
Bash scripts will start with #!/usr/bin/env bash. They usually contain a series of commands that you might run in a terminal after the # IN comment. The # OUT comment shows the expected output of the commands.
#!/usr/bin/env bash
# example.bash demonstrates a simple bash script with an # IN and # OUT section.

# IN:
echo "this is a bash script"

# OUT:
this is a bash script
lemma: sidenotes
Sidenotes will show up in indented boxes, like this. A ‘lemma’ is a small digression important to clarify a point.
lemma: shebang (#!)
The shebang (#!) at the beginning of a file tells the operating system what program to use to run it. For example, #!/usr/bin/bash will run the file with the bash shell located at /usr/bin/bash. #!/usr/bin/env bash tells the OS to use whatever bash is in your PATH environment variable to run the script. We’ll talk more about all of these things in a later article.
OK, enough ceremony. Let’s get started.
In this article, we’ll talk about what systems programming is, what a program is, and how to interact with the data inside a program. We’ll build a program and dig into the data inside it, hack it to change its behavior, and build a series of software tools to help us understand it that we’ll use throughout the series.
gotwo - the fundamentals of programming, virtual machines, assembly, debugging, and ABIs. (COMING SOON)
goto? When it comes down to it, programming is some memory, an instruction pointer, and a series of conditional jumps. We’ll use our new systems programming skills to build a virtual machine & assembly language that’s a valid subset of go. We’ll use that to illustrate how debuggers and ABIs work.
There’s no clear line between “systems programming” and other kinds. A problem might be ‘systems programming’ if it
A systems programmer sees a computer as a physical machine that can be completely understood, rather than a mathematical or formal abstraction. They understand the hardware and software of a computer system, and they can write programs that interact with both. A systems programmer is unafraid to tear something apart, confident that they can put it back together again.
A program is an executable file that your operating system can interpret as a series of machine instructions. That is, it’s a combination of code and data that the operating system can load into memory and execute.
Programs come in two main types:
When we say ‘input’ and ‘output’, we mean bytes. To warm up, let’s write a program that takes no input but produces output: the nearly 50-year-old classic, “hello, world!”.
1// hello.go
2package main
3import "fmt"
4func main() {
5 fmt.Println("hello, world!")
6}
Use go build to compile the program, then run it:
#!/usr/bin/env bash
# buildhello.bash builds and runs the hello program.
# IN
go build -o hello hello.go # 1. call 'go build' to compile the program
./hello                    # 2. run the program
# OUT
hello, world!
Great. What’s actually in the hello program? We expect
We’ll take a poke around and see what we can find. Let’s start with the data - that is, any of the bytes in the file that aren’t executable instructions.
Regardless of your operating system or architecture, there’s one thing we can be sure will exist in the file: the string “hello, world!”. Let’s look for it. Better yet, let’s write a program to look for it - we’ll call it findoffset.
findoffset.go
We want to look for a specific string in a file and print the offset of the first occurrence.
When it comes down to it, a string is just a sequence of bytes in some character encoding. We’ll do that by comparing the bytes in the file to the bytes in the string, one-by-one.
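That is, we’ll:
- parse the command line arguments to get the filename and the string
- read the file into memory
- compare the bytes in the file to the bytes in the string, one-by-one, starting at each offset in turn
- print the offset of the first match and exit 0, or exit 1 if there is no match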
Lemma: standard output streams
All programs are connected to three files by default. This is often called “standard i/o”, sometimes just stdio. They are:
| FILE | Name | R/W? | NOTE | PYTHON | JS | GO | Note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| STDIN | standard input | R | what you type in the terminal goes here. | sys.stdin | process.stdin | os.Stdin | Input |
| STDOUT | standard output | W | what the program writes goes here. intended for other programs. | sys.stdout | process.stdout | os.Stdout | |
| STDERR | standard error | W | where the program writes errors. intended for humans. | sys.stderr | process.stderr | os.Stderr | Error |

| function or variable | type | description | notes |
| --- | --- | --- | --- |
| os.Args | []string | command line arguments | os.Args[0] is the name of the program |
| os.Exit(int) | | exit the program immediately with the given status code | 0 means ok, anything else means error |
| fmt.Fprintf(io.Writer, string, ...interface{}) | (int, error) | write formatted output to a stream (files, memory buffers, etc) | |
// findoffset.go is a command line tool that finds the offset of the first occurrence of a string in a file and prints it to stdout.
package main

import (
	"fmt"
	"os"
)

func main() {
	// 1. parse the command line arguments

	// the operating system provides command line arguments to your program.
	// os.Args[0] is the name of the program, and the rest are the 'real' arguments.
	if len(os.Args) != 3 {
		fmt.Fprintf(os.Stderr, "Usage: findoffset <filename> <string>\n")
		os.Exit(1)
	}

	filepath, pattern := os.Args[1], os.Args[2]

	// 2. read the file into memory

	// it's inefficient to read the entire file into memory, but it's simple and works well for small files
	b, err := os.ReadFile(filepath) // we'll talk about how reading files works more later, too!
	if err != nil {
		fmt.Fprintf(os.Stderr, "read %s: %v\n", filepath, err) // HUMAN-READABLE DEBUG INFO should go to STDERR
		os.Exit(1)
	}

	// 3. compare the bytes in the file to the bytes in the string, one-by-one.
	// the last offset where the pattern still fits is len(b)-len(pattern), so the bound is inclusive.
	for i := 0; i <= len(b)-len(pattern); i++ {
		for j := range pattern { // byte-by-byte comparison
			// 3.1. no match: continue at next offset
			if b[i+j] != pattern[j] {
				break
			}

			// 3.2. match: print and exit 0 (ok)
			if j == len(pattern)-1 { // found it! print the offset & newline & exit
				fmt.Fprintf(os.Stdout, "%d\n", i) // MACHINE-READABLE OUTPUT should go to STDOUT
				os.Exit(0)
			}
		}
	}
	// 4. exit 1 (error)
	os.Exit(1)
}
Looks good. But how do we test it? It would be easier to test findoffset if we had a way to create files with specific contents. Let’s write a program to do that - following the unix tradition, we’ll call it echo.
echo.go
// echo prints its arguments to standard output, separated by spaces and terminated by a newline.
// usage: echo <args...>
// see the python port at https://gitlab.com/efronlicht/blog/-/blob/0d2327696c01d6a46551fac21521937ee9f6fbe3/articles/startingsystems/cmd/pythonports/echo.py
package main

import (
	"fmt"
	"os"
)

func main() {
	// 1. iterate over the command line arguments
	for i, arg := range os.Args[1:] {
		if i > 0 {
			fmt.Print(" ")
		}
		// 2. print each argument to standard output, separated by spaces
		fmt.Print(arg)
	}
	// 3. terminate with a newline
	fmt.Println()
}
Now we can write a simple file, but how do we know what’s in it? We can write a program to read it - following the unix tradition, we’ll call it cat.
cat.go
cat - short for concatenate - combines files and prints them to standard output. But it’s more often used to just read a single file and send it to the terminal or another program.
We want to:
- read each file specified on the command line
- read it into memory
- write its contents to standard output
// cat reads each file specified on the command line and writes its contents to standard output.
// usage: cat <file1> [<file2> ...]
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	for _, file := range os.Args[1:] { // 1. read each file specified on the command line
		f, err := os.Open(file)
		if err != nil {
			fmt.Fprintf(os.Stderr, "open %s: %v\n", file, err)
			os.Exit(1)
		}
		// performance note: it's better to use `io.Copy`, but I want to illustrate the process.
		b, err := io.ReadAll(f) // 2. read it into memory
		// close the file as soon as we're done with it: a deferred Close would pile up
		// until main returns, and would never run at all if we call os.Exit.
		f.Close()
		if err != nil {
			fmt.Fprintf(os.Stderr, "read %s: %v\n", file, err)
			os.Exit(1)
		}
		os.Stdout.Write(b) // 3. write its contents to standard output
	}
}
Let’s write a pair of files with echo and read them with cat.
#!/usr/bin/env bash
# IN
echo "the quick brown fox" > fox.txt
echo "jumps over the lazy dog" > dog.txt
cat fox.txt dog.txt
# OUT
the quick brown fox
jumps over the lazy dog
numberlines.
escapetext.
hello program with findoffset, echo, and cat
Let’s write a simple file, fox.txt, and read it with cat.
catfox.bash
#!/usr/bin/env bash
# IN
echo "the quick brown fox jumps over the lazy dog" > fox.txt
cat fox.txt
# OUT
the quick brown fox jumps over the lazy dog
Looks good. Let’s use findoffset to find the offset of “brown” in fox.txt.
#!/usr/bin/env bash
# findbrown.bash looks for the string "brown" in the file "fox.txt" and prints the offset.
# IN:
echo "the quick brown fox jumps over the lazy dog" > fox.txt
findoffset fox.txt "brown"
# OUT
10
Seems like it works. Let’s find the offset of “hello, world!” in our hello program.
#!/usr/bin/env bash
# IN
findoffset hello "hello, world!"
# OUT
721335
exercises
- modify findoffset to take a third argument which specifies the occurrence of the string to find. For example, findoffset hello "hello, world!" 2 should find the second occurrence of “hello, world!” in the hello program.
- allow negative occurrence numbers in findoffset to count from the end of the file. For example, findoffset hello "hello, world!" -1 should find the last occurrence of “hello, world!” in the hello program.
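Before building on findoffset, it can help to pull the search loop into a function that can be checked without touching the filesystem. The following is a minimal sketch, not the canonical findoffset: the name indexOf is made up for illustration, and it only covers the first occurrence. It compares against bytes.Index from the standard library, and includes a match that ends on the very last byte of the input, which is an easy case to get wrong.

```go
// indexof.go is a sketch of findoffset's search loop as a testable function.
// indexOf is a hypothetical helper name, not part of the article's tools.
package main

import (
	"bytes"
	"fmt"
)

// indexOf returns the offset of the first occurrence of pattern in b, or -1 if there is none.
// like findoffset, it treats an empty pattern as "not found".
func indexOf(b, pattern []byte) int {
	if len(pattern) == 0 {
		return -1
	}
outer:
	for i := 0; i+len(pattern) <= len(b); i++ { // every offset where the pattern still fits
		for j := range pattern { // byte-by-byte comparison
			if b[i+j] != pattern[j] {
				continue outer // no match: try the next offset
			}
		}
		return i // every byte matched
	}
	return -1
}

func main() {
	text := []byte("the quick brown fox jumps over the lazy dog\n")
	fmt.Println(indexOf(text, []byte("brown"))) // 10, as in findbrown.bash
	fmt.Println(indexOf(text, []byte("dog\n"))) // 40: a match that ends on the last byte
	// cross-check against the standard library
	fmt.Println(indexOf(text, []byte("lazy")) == bytes.Index(text, []byte("lazy"))) // true
}
```

Checking the hand-written loop against a trusted implementation is a cheap way to catch off-by-one mistakes in the loop bound.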
binpatch.go
Suppose we want to change the behavior of a compiled program and don’t have access to the source code.
We know the following facts:
This is all we need to know to change the behavior.
Let’s change our program to write “hello, efron!” instead of “hello, world!” without recompiling it. We can do this by patching the binary. Let’s write a program, binpatch, to do so.
We want to copy everything over except a specific chunk of bytes. We need to:
- parse the arguments
- copy everything before the replacement from file to standard output (that is, offset bytes of the file)
- write replacement to standard output
- skip over the bytes we’re replacing
- copy the rest of the file to standard output
// binpatch replaces a sequence of bytes in file starting at offset with a replacement string,
// and writes the result to standard output.
// Usage: binpatch <file> <offset> <replacement>
package main

import (
	"fmt"
	"io"
	"os"
	"strconv"
)

func main() {
	// 1. Parse the arguments

	// the first argument is the name of the program, so we need to check for 4 arguments.
	// we'll talk more about arguments later.
	if len(os.Args) != 4 {
		// having the name of the program is useful for error messages, like this one.
		// error messages are written to stderr, so they don't interfere with the output.
		fmt.Fprintf(os.Stderr, "Usage: %s <file> <offset> <replacement>\n", os.Args[0])
		os.Exit(1)
	}
	var (
		file        = os.Args[1]
		offset, err = strconv.ParseInt(os.Args[2], 0, 64)
		replacement = os.Args[3]
	)
	if err != nil {
		fatalf("invalid offset %q: %v\nUsage: %s <file> <offset> <replacement>\n", os.Args[2], err, os.Args[0])
	}
	if offset < 0 {
		fatalf("invalid offset %d: must not be negative\n", offset)
	}
	// open the file for reading: the patched copy goes to standard output, so the original is never modified.
	f, err := os.Open(file)
	if err != nil {
		fatalf("open %s: %v\n", file, err)
	}
	defer f.Close()

	// 2. Copy everything before the replacement from `file` to standard output (that is, `offset` bytes of the file)
	_, err = io.CopyN(os.Stdout, f, offset)
	if err != nil {
		fatalf("copy: %v\n", err)
	}

	// we're now at the offset where we want to write the replacement chunk.
	// 3. Write `replacement` to standard output
	_, err = os.Stdout.Write([]byte(replacement))
	if err != nil {
		fatalf("write: %v\n", err)
	}

	// 4. Skip over the bytes we're replacing by throwing them away.
	if _, err := io.CopyN(io.Discard, f, int64(len(replacement))); err != nil {
		fatalf("copy: %v\n", err)
	}

	// 5. copy the rest of the file to standard output
	_, err = io.Copy(os.Stdout, f)
	if err != nil {
		fatalf("copy: %v\n", err)
	}
}

// fatalf prints an error message to stderr with fmt.Fprintf, then exits with status 1.
// note that os.Exit skips deferred calls; that's fine here, since the OS closes our files on exit.
func fatalf(format string, args ...interface{}) {
	fmt.Fprintf(os.Stderr, format, args...)
	os.Exit(1)
}
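The same splice can be expressed on an in-memory byte slice, which makes the behavior easy to see without streams. This is only a sketch under that assumption: patch is a hypothetical helper, not part of binpatch, and it trades binpatch's streaming copy for a full copy in memory.

```go
// patch.go sketches binpatch's splice on a byte slice instead of a stream.
// patch is a hypothetical helper introduced for illustration.
package main

import (
	"errors"
	"fmt"
)

// patch returns a copy of data with the bytes starting at offset overwritten by replacement.
// it overwrites rather than inserts, so the result always has the same length as the input.
func patch(data []byte, offset int, replacement []byte) ([]byte, error) {
	if offset < 0 || offset+len(replacement) > len(data) {
		return nil, errors.New("patch: replacement does not fit inside data")
	}
	out := make([]byte, len(data))
	copy(out, data)                 // everything before, inside, and after the chunk...
	copy(out[offset:], replacement) // ...then overwrite just the chunk
	return out, nil
}

func main() {
	out, err := patch([]byte("the quick brown fox"), 10, []byte("green"))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%s\n", out) // the quick green fox
}
```

Because the chunk is overwritten in place, the length of the output never changes. That matters for a compiled program: a replacement longer than the original string would spill over whatever bytes come next.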
Let’s try changing “brown” to “green” in our fox.txt file.
# IN
findoffset fox.txt "brown"
# OUT
10
And use that offset to patch the file…
# IN
binpatch fox.txt 10 "green"
# OUT
the quick green fox jumps over the lazy dog
It works! Let’s try it on our hello program.
bash note: the $(...) syntax in the bash shell is called “command substitution” - it runs the command inside the parentheses and replaces the expression with the output of the command. We can use this to feed the output of findoffset into binpatch.
# IN
binpatch hello $(findoffset hello "hello, world!") "hello, efron!" > hackedhello
./hackedhello
# OUT
bash: ./hackedhello: Permission denied
Whoops, forgot to make hackedhello executable. Let’s fix that.
We’ll cover file permissions later. For now, know that files can be READABLE (r), WRITABLE (w), and EXECUTABLE (x). The chmod command changes these permissions. +x makes a file executable, -x makes it non-executable, and +r makes it readable.
# IN
chmod +x hackedhello
./hackedhello
# OUT
hello, efron!
It works! We’ve successfully changed the behavior of our program without recompiling it. This is more hacking than most ‘programmers’ ever do in their entire careers - and we’re just warming up.
BTW, anyone can do this. You probably should verify the integrity of your executables against some kind of checksum after you download them… but you probably don’t.
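To make the checksum idea concrete, here is a minimal sketch using SHA-256 from the go standard library. The function name checksum and the choice of SHA-256 are assumptions for illustration; real downloads publish their digests in different ways. The point is only that changing a few bytes changes the digest.

```go
// checksum.go sketches how a checksum exposes a patched file.
// checksum is a hypothetical helper; SHA-256 is one reasonable choice of hash.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// checksum returns the SHA-256 digest of data as a lowercase hexadecimal string.
func checksum(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	// stand-ins for the contents of hello and hackedhello
	original := []byte("hello, world!")
	patched := []byte("hello, efron!")
	fmt.Println(checksum(original))
	fmt.Println(checksum(patched))
	fmt.Println("match:", checksum(original) == checksum(patched)) // match: false
}
```

In practice you would read the whole executable (for example with os.ReadFile) and compare the result against a digest obtained from a source you trust.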
You have the power to look inside programs and see what’s there - or even to change them! If you take nothing else away from this, remember that - it’s not a black box, it’s not magic. They’re just bytes.
When it comes down to it, programming is about transforming data. It’s not about theories, paradigms, patterns - those things are handy, but they miss the point. Programming is about taking data, changing it, and producing new data. You can do this with a text editor, a hex editor, or a program - you can do it with the help of an IDE, a library, github copilot, whatever - but there’s no substitute for that fundamental understanding.
Let’s keep looking at the data inside our program. What’s in the binary around what used to be “hello, world!”?
You know the drill - let’s write a program, torso, to find out.
The unix coreutils head and tail programs read the beginning and end of a file, respectively. Usually you’d use those - torso is a play on those words, reading the ‘middle’ of a file.
We’ll cover arguments and flags in more detail later. For now, know that -flag is a common way to specify flags in unix programs.
torso.go
We need to:
- parse the flags: the file to read -from, the -offset to read from, and the number of bytes to read -before and -after the offset
- skip to the first byte we want to read (offset - before)
- copy (before + after) bytes to standard output

// torso reads the 'middle' of a file - the bytes around a given offset.
// it's not the head of the file, and it's not the tail - it's the torso.
// usage:
//
//	torso -offset n -before [b=128] -after [a=128] -from file [-newline]
//
// -offset and -from are required.
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

func main() {
	var offset, before, after int
	var from string
	var newline bool
	{ // 1. define and parse command line flags
		flag.IntVar(&offset, "offset", -1, "offset to read from: must be specified")             // -offset is required
		flag.IntVar(&before, "before", 128, "bytes to read before offset: will be clamped to 0") // defaults to -before 128
		flag.IntVar(&after, "after", 128, "bytes to read after offset: will be clamped to 0")    // defaults to -after 128
		flag.StringVar(&from, "from", "", "file to read from: must be specified")                // -from is required
		flag.BoolVar(&newline, "newline", false, "append a newline to the output")               // -newline is optional
		flag.Parse()
	}

	// bounds checking and normalization
	{
		if offset < 0 {
			fmt.Fprintf(os.Stderr, "missing or invalid -offset\n")
			os.Exit(1)
		}
		before = max(before, 0)      // can't be negative
		before = min(before, offset) // can't go past the beginning
		after = max(after, 0)
	}

	start := offset - before // where to start?
	n := before + after      // total number of bytes to read
	if n == 0 {
		return // nothing to do
	}
	buf := make([]byte, n)

	// read from a file
	f, err := os.Open(from)
	if err != nil {
		fmt.Fprintf(os.Stderr, "open: %s: %v\n", from, err)
		os.Exit(1)
	}
	// 2. skip to the first byte we want to read (offset - before)
	_, err = f.Seek(int64(start), io.SeekStart)
	if err != nil {
		fmt.Fprintf(os.Stderr, "seek: %s: %v\n", from, err)
		// make sure to close the file before exiting!
		// experienced go programmers will use 'defer', but I want this to be accessible to non-go programmers.
		f.Close()
		os.Exit(1)
	}

	// 3. copy (before + after) bytes to standard output

	// first read them into memory...
	// hitting the end of the file early is fine: we just get fewer bytes.
	n, err = io.ReadFull(f, buf)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		fmt.Fprintf(os.Stderr, "read: %s: %v\n", from, err)
		f.Close()
		os.Exit(1)
	}
	buf = buf[:n]

	// then write them to standard output
	_, err = os.Stdout.Write(buf)
	if err != nil {
		fmt.Fprintf(os.Stderr, "write: %v\n", err)
		f.Close()
		os.Exit(1)
	}

	// 4. add a newline if requested
	if newline {
		fmt.Println()
	}
	f.Close()
}
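The bounds normalization is the fiddly part of torso, so here is a sketch of the same arithmetic as a pure function. The name window and the size parameter are assumptions for illustration: torso itself never asks for the file size and simply stops reading at the end of the file. The sketch also assumes offset is not negative, since torso rejects that case before it gets this far.

```go
// window.go sketches torso's clamping arithmetic as a pure function.
// window and its size parameter are hypothetical, added for illustration.
package main

import "fmt"

// window returns the half-open byte range [start, end) that a torso-style read
// would cover in a file of the given size. offset must be >= 0.
func window(offset, before, after, size int) (start, end int) {
	before = max(before, 0)      // can't be negative
	before = min(before, offset) // can't go past the beginning
	after = max(after, 0)
	start = offset - before
	end = min(offset+after, size) // can't go past the end
	return start, max(end, start) // an offset past the end gives an empty range
}

func main() {
	fmt.Println(window(721335, 128, 128, 2000000)) // 721207 721463
	fmt.Println(window(0, 128, 256, 1000))         // 0 256: nothing before the start
	fmt.Println(window(10, 128, 128, 50))          // 0 50: clamped on both sides
}
```

The builtin min and max functions need go 1.21 or later, the same requirement as torso itself.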
#!/usr/bin/env bash
# IN
torso -offset $(findoffset hello "hello, world!") -before 128 -after 128 -from hello
# OUT
mismatchwrong timersillegal seekinvalid slothost is downnot pollablegotypesaliashttpmuxgo121multipathtcprandautoseedtlsunsafeekmhello, world!3814697265625wakeableSleepprofMemActiveprofMemFuturetraceStackTabexecRInternaltestRInternalGC sweep waitSIGQUIT: qu
Looks like a mixture of error messages (“illegal seek”, “host is down”, “not pollable”), internal go runtime messages (“profMemActive”, “profMemFuture”), and some kind of mysterious number, 3814697265625. In the next section, we’ll investigate where that number comes from.
exercises
- modify torso to work on lines instead of bytes with a -lines flag.
- modify torso to work on words instead of bytes with a -words flag.
- modify torso to work on utf-8 codepoints instead of bytes with a -runes flag.
- modify torso to read from standard input if no -from file is given.
3814697265625?
If the string appeared in our source code, it’s got to be in the binary. If it’s in the binary… it’s probably in the source code. Maybe in one of our imported packages? Let’s look for it in the go language source code. To do that, we need a program that finds files that contain a string.
The classic unix tool grep is perfect for this. We’ll write our own version… but that’s for next time, when we’ve talked a bit more about files.
Let’s use it to find the first appearance of “3814697265625” in the go source code for our version of go:
# IN
git clone https://github.com/golang/go
cd go
git checkout go1.23.0 # or whatever version you're using
cd src
grep -r "3814697265625" .
# OUT
math/big/floatconv.go: 3814697265625,
strconv/decimal.go: {6, "3814697265625"}, // * 262144
Which one shows up in our program, though? Let’s use findoffset to find out.
# IN
# is it math/big?
findoffset hello "math/big" || echo "math/big not found"
# or is it strconv?
findoffset hello "strconv" || echo "strconv not found"
# OUT
math/big not found
745064
It’s strconv. Let’s use cat again to see what’s around it.
#!/usr/bin/env bash
cat strconv/decimal.go
// Cheat sheet for left shift: table indexed by shift count giving
// number of new digits that will be introduced by that shift.
//
// For example, leftcheats[4] = {2, "625"}. That means that
// if we are shifting by 4 (multiplying by 16), it will add 2 digits
// when the string prefix is "625" through "999", and one fewer digit
// if the string prefix is "000" through "624".
//
// Credit for this trick goes to Ken.
type leftCheat struct {
	delta  int    // number of new digits
	cutoff string // minus one digit if original < a.
}

var leftcheats = []leftCheat{
	// Leading digits of 1/2^i = 5^i.
	// 5^23 is not an exact 64-bit floating point number,
	// so have to use bc for the math.
	// Go up to 60 to be large enough for 32bit and 64bit platforms.
	/*
		seq 60 | sed 's/^/5^/' | bc |
		awk 'BEGIN{ print "\t{ 0, \"\" }," }
		{
			log2 = log(2)/log(10)
			printf("\t{ %d, \"%s\" },\t// * %d\n",
				int(log2*NR+1), $0, 2**NR)
		}'
	*/
	{0, ""},
	/* many entries omitted for space ...*/
	{6, "3814697265625"}, // * 262144
	/* many entries omitted for space ...*/
	{19, "867361737988403547205962240695953369140625"}, // * 1152921504606846976
}
This turns out to be a sophisticated bit of systems programming by the master Ken Thompson himself! We’ll take another look at this later… consider it a sneak peek ;).
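We can at least check where the mystery number comes from. The awk comment in the table says each row holds int(log10(2)*i + 1) and the decimal digits of 5^i. The sketch below recomputes a row under that reading; entry is a hypothetical helper name, and math/big stands in for bc because large powers of 5 don’t fit exactly in a 64-bit float.

```go
// leftcheat.go recomputes rows of strconv's leftcheats table, following the awk comment.
// entry is a hypothetical helper written for this article, not part of strconv.
package main

import (
	"fmt"
	"math"
	"math/big"
)

// entry returns the row for a left shift by i bits:
// delta is int(log10(2)*i + 1) and cutoff is the decimal digits of 5^i.
func entry(i int) (delta int, cutoff string) {
	pow := new(big.Int).Exp(big.NewInt(5), big.NewInt(int64(i)), nil) // 5^i, exactly
	delta = int(math.Log10(2)*float64(i) + 1)
	return delta, pow.String()
}

func main() {
	for _, i := range []int{4, 18} {
		delta, cutoff := entry(i)
		fmt.Printf("{%d, %q}, // * %d\n", delta, cutoff, 1<<i)
	}
}
```

For i = 18 this prints {6, "3814697265625"}, // * 262144, the row we found with grep: our mystery number is 5^18, and 262144 is 2^18.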
Ken Thompson
Kenneth Lane Thompson (b. 1943) created Unix, grep, Go, and a zillion other tools that form the foundation of nearly every computer system in the world. He didn’t do it by himself, but he’s legendarily productive - he famously implemented unix pipes over a lunch break. If you’ve ever used a regular expression, a mac or linux computer, or a smart phone, you’ve used his work.
That’s enough fiddling with the data segment (for now). Let’s see if we can get some insight into the rest of the program.
Let’s use torso to take a peek at the first 256 bytes of our program.
# IN
torso -offset 0 -after 256 -from hello
# OUT
ELF
That seems like a lot fewer bytes than we expected. What’s going on? The terminal isn’t sure how to display binary data. ASCII - the character encoding at the root of nearly every terminal since the 1960s - covers 128 characters. Some of these are printable, like ‘a’ or ‘|’ - but many are not. The first 32 characters (and the very last one, DEL) are control characters - they control the terminal, rather than printing anything. The first 256 bytes of our program must contain a lot of these - and they’re confusing the terminal.
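To make ‘printable’ concrete, here is a minimal sketch that replaces every byte outside the printable ASCII range with a dot before showing it. The names printable and sanitize are made up for illustration; the sample bytes are the first eight bytes of our hello program.

```go
// printable.go sketches how to make binary data safe to show in a terminal.
// printable and sanitize are hypothetical helpers for illustration.
package main

import "fmt"

// printable reports whether b is a printable ASCII character: space (0x20) through '~' (0x7e).
func printable(b byte) bool { return b >= 0x20 && b <= 0x7e }

// sanitize returns data as a string with every non-printable byte replaced by '.'.
func sanitize(data []byte) string {
	out := make([]byte, len(data))
	for i, b := range data {
		if printable(b) {
			out[i] = b
		} else {
			out[i] = '.'
		}
	}
	return string(out)
}

func main() {
	header := []byte{0x7f, 'E', 'L', 'F', 0x02, 0x01, 0x01, 0x00}
	fmt.Println(sanitize(header)) // .ELF....
}
```

This keeps the terminal from interpreting control characters, but it throws information away: every unprintable byte looks the same.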
To look deeper, we need to serialize the bytes into a human-readable format. The most common way to do this is to print the bytes as hexadecimal. Let’s write a program to do this - a take on the classic unix tool hexdump.
shexdump.go
This tool will be our first foray into manipulating binary data, a core systems programming skill. We’ll read the file in chunks, convert each byte to a pair of hexadecimal digits, and print them to standard output. We’ll call it shexdump (simple hexadecimal dump) to distinguish it from the classic hexdump.
It would be nice if we could output the hexadecimal bytes in a more readable format - say, 16 bytes per line, with a space between each byte and a newline after every 16 bytes.
What will we need to do?
- choose the input source: standard input or a file
- read a chunk of bytes (up to 16) from the input
- convert each byte in the chunk to a pair of hexadecimal digits
- print the hexadecimal digits to standard output, space-separated, terminating with a newline
// shexdump.go dumps the input as pairs of space-separated hexadecimal bytes, with a newline after every 16 bytes.
//
// # example
//
//	#!/usr/bin/env bash
//	echo "now is the time of all good men to come to the aid of their country " | shexdump
//	6e 6f 77 20 69 73 20 74 68 65 20 74 69 6d 65 20
//	6f 66 20 61 6c 6c 20 67 6f 6f 64 20 6d 65 6e 20
//	74 6f 20 63 6f 6d 65 20 74 6f 20 74 68 65 20 61
//	69 64 20 6f 66 20 74 68 65 69 72 20 63 6f 75 6e
//	74 72 79 20 0a
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

func main() {
	// 1. choose the input source: stdin or a file
	var src io.Reader
	switch len(os.Args) {
	case 1:
		src = os.Stdin
	case 2:
		f, err := os.Open(os.Args[1])
		if err != nil {
			fmt.Fprintf(os.Stderr, "open %s: %v\n", os.Args[1], err)
			os.Exit(1)
		}
		defer f.Close()
		src = f
	default:
		fmt.Fprintf(os.Stderr, "Usage: %s [filename]\n", os.Args[0])
		os.Exit(1)
	}
	if err := hexdump(os.Stdout, src); err != nil {
		fmt.Fprintf(os.Stderr, "hexdump: %v\n", err)
		os.Exit(1)
	}
}

// hexdump writes the contents of src to dst in a hexdump format.
func hexdump(dst io.Writer, src io.Reader) error {
	// performance: small reads and writes are very inefficient. while we could write a byte at a time, it's much faster to read and write in chunks.
	r := bufio.NewReader(src)
	w := bufio.NewWriter(dst)
	const hex = "0123456789abcdef"
	var raw [16]byte // read 16 bytes at a time
	for { // 2. read a chunk of bytes (up to 16) from the input
		n, err := io.ReadFull(r, raw[:])

		// 3. convert each byte in the chunk to a pair of hexadecimal digits
		encoded := make([]byte, 0, 16*3) // 16 bytes, 3 characters per byte: two hex digits, then a space or the final newline.
		for i := 0; i < n; i++ {
			encoded = append(encoded, hex[raw[i]>>4], hex[raw[i]&0x0f], ' ')
		}
		if n != 0 {
			encoded[len(encoded)-1] = '\n' // the last separator on the line becomes the newline

			// 4. print the hexadecimal digits to standard output, space-separated, terminating with a newline
			if _, err := w.Write(encoded); err != nil {
				return err
			}
		}
		// io.ReadFull reports io.EOF when nothing was left to read and
		// io.ErrUnexpectedEOF for a short final chunk: both just mean we're done.
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return w.Flush()
		} else if err != nil {
			return err
		}
	}
}
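The heart of step 3 is the pair of bit operations on each byte, and it is easier to see in isolation. A byte is 8 bits; shifting right by 4 keeps the high 4 bits (the first hex digit) and masking with 0x0f keeps the low 4 bits (the second). This is a minimal sketch; hexByte and hexDigits are names chosen for illustration.

```go
// hexbyte.go shows the shift-and-mask trick that turns one byte into two hex digits.
// hexByte and hexDigits are hypothetical names for illustration.
package main

import "fmt"

const hexDigits = "0123456789abcdef"

// hexByte splits b into its high and low nibbles (4 bits each)
// and looks each one up in hexDigits.
func hexByte(b byte) string {
	high := b >> 4   // e.g. 0x6e >> 4   == 0x6
	low := b & 0x0f  // e.g. 0x6e & 0x0f == 0xe
	return string([]byte{hexDigits[high], hexDigits[low]})
}

func main() {
	for _, b := range []byte("now") {
		fmt.Printf("%q -> %s\n", b, hexByte(b))
	}
}
```

Running it shows 'n' -> 6e, 'o' -> 6f, and 'w' -> 77, the first three pairs in the example dump.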
exercises
- Add an -offset flag to prefix each line with the offset in the file in hexadecimal, terminating with the ending offset of the line (default false)
- Add a -columns flag to specify the number of columns to print per line (default 2)
- Add a -column-width flag to specify the number of bytes to print per column. (default 8)
- Add a -squeeze flag to compress multiple consecutive identical lines into a single line with a * and a count (default false)
- Add an -ascii flag to suffix each line with the ASCII representation of the bytes (default false). Replace non-printable or non-ASCII bytes with a dot (.).
#!/usr/bin/env bash
# IN
echo -n "now is the time " | shexdump -ascii
# OUT
6e 6f 77 20 69 73 20 74 68 65 20 74 69 6d 65 20 |now is the time |
- Add a -canon flag to print the output in canonical hex+ASCII format, like hexdump -C (default false). This combines the -ascii and -offset flags and should be exclusive with them.
#!/usr/bin/env bash
# IN
echo "now is the time of all good men to come to the aid of their country " | shexdump -canon
# OUT
00000000 6e 6f 77 20 69 73 20 74 68 65 20 74 69 6d 65 20 |now is the time |
00000010 6f 66 20 61 6c 6c 20 67 6f 6f 64 20 6d 65 6e 20 |of all good men |
00000020 74 6f 20 63 6f 6d 65 20 74 6f 20 74 68 65 20 61 |to come to the a|
00000030 69 64 20 6f 66 20 74 68 65 69 72 20 63 6f 75 6e |id of their coun|
00000040 74 72 79 20 0a |try .|
00000045
How do we test if we’ve written this correctly? We should be able to losslessly convert the output of shexdump back into the original file. Let’s write a function to do that - and write a program, unhexdump, to use it.
unhexdump.go
Our last program, shexdump, read binary and wrote whitespace-separated pairs of hexadecimal digits.
This program, unhexdump, should read whitespace-separated pairs of hexadecimal digits and write the original binary. In other words:
- choose the input source: standard input or a file
- read pairs of whitespace-separated hexadecimal digits from the input
- convert each pair back to a byte, unhex-ing it
- write that unhex-ed byte to standard output
unhexdump.go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// unhexdump.go reverses the process of hexdump, converting a hexdump back into a file.
// it expects whitespace-separated pairs of hexadecimal digits.
func main() {
	// 1. choose the input source: stdin or a file
	var src io.Reader
	switch len(os.Args) {
	case 1:
		src = os.Stdin
	case 2:
		f, err := os.Open(os.Args[1])
		if err != nil {
			fmt.Fprintf(os.Stderr, "open %s: %v\n", os.Args[1], err)
			os.Exit(1)
		}
		defer f.Close()
		src = f
	default:
		fmt.Fprintf(os.Stderr, "Usage: %s [filename]\n", os.Args[0])
		os.Exit(1)
	}

	if err := unhexdump(os.Stdout, src); err != nil {
		fmt.Fprintf(os.Stderr, "unhexdump: %v\n", err)
		os.Exit(1)
	}
}

// unhexdump reads whitespace-separated pairs of hexadecimal digits from r and writes the corresponding bytes to w.
func unhexdump(w io.Writer, r io.Reader) error {
	// 2. read whitespace-separated pairs of hexadecimal digits from the input.

	// we're going to start caring a little more about performance here. we'll use a buffered reader and writer to reduce the number of system calls & allocations (more about these topics in later articles).
	scanner := bufio.NewScanner(r)
	scanner.Split(bufio.ScanWords)
	bw := bufio.NewWriter(w)
	defer bw.Flush() // on an early return, still write out whatever we decoded so far
	for i := 0; scanner.Scan(); i++ {
		b := scanner.Bytes() // the i-th whitespace-separated word
		if len(b)&1 == 1 { // odd number of hex digits
			return fmt.Errorf("odd number of hex digits at position %d (%q)", i, b)
		}
		// 3. convert each pair back to a byte, unhex-ing it
		for j := 0; j < len(b); j += 2 {
			high, ok := unhex(b[j])
			if !ok {
				return fmt.Errorf("bad hex %x '%c' at position %d, offset %d", b[j], b[j], i, j)
			}
			low, ok := unhex(b[j+1])
			if !ok {
				return fmt.Errorf("bad hex %x '%c' at position %d, offset %d", b[j+1], b[j+1], i, j+1)
			}

			// 4. write that unhex-ed byte to standard output
			if err := bw.WriteByte(high<<4 | low); err != nil {
				return err
			}
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	return bw.Flush() // report a failed final write instead of silently dropping it
}

// unhex converts a hexadecimal character to its value (0x0-0xf),
// or 0, false if the character is not a valid hexadecimal digit.
func unhex(b byte) (byte, bool) {
	switch {
	case '0' <= b && b <= '9':
		return b - '0', true
	case 'a' <= b && b <= 'f':
		return b - 'a' + 10, true
	case 'A' <= b && b <= 'F':
		return b - 'A' + 10, true
	default:
		return 0, false
	}
}
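The unhex helper only has 256 possible inputs, so it is small enough to check exhaustively. Here is a minimal sketch that compares it against the standard library’s strconv.ParseUint for every byte value. The mismatches helper is a name made up for this example, and unhex is copied from the listing above so the snippet compiles on its own.

```go
package main

import (
	"fmt"
	"strconv"
)

// unhex is copied from unhexdump.go so this sketch is self-contained.
func unhex(b byte) (byte, bool) {
	switch {
	case '0' <= b && b <= '9':
		return b - '0', true
	case 'a' <= b && b <= 'f':
		return b - 'a' + 10, true
	case 'A' <= b && b <= 'F':
		return b - 'A' + 10, true
	default:
		return 0, false
	}
}

// mismatches (a hypothetical helper) counts the byte values for which unhex
// disagrees with strconv.ParseUint on the same single-character input,
// either about validity or about the decoded value.
func mismatches() int {
	n := 0
	for i := 0; i < 256; i++ {
		b := byte(i)
		want, err := strconv.ParseUint(string([]byte{b}), 16, 8)
		got, ok := unhex(b)
		if ok != (err == nil) || (ok && uint64(got) != want) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println("mismatches:", mismatches())
}
```

If the two ever disagree, the count is non-zero and tells you how many inputs to go hunting for.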
Let’s test these programs by hexdumping a file, then unhexdumping it.
1. write a file, moby.txt
2. hexdump it to moby.hex
3. unhexdump it to moby2.txt
4. compare moby.txt and moby2.txt (we’ll use diff, a classic unix tool, to do this)
#!/usr/bin/env bash
# IN
echo "to the last I grapple with thee; from hell's heart I stab at thee; for hate's sake I spit my last breath at thee" > moby.txt # 1. write a file

shexdump moby.txt > moby.hex # 2. hexdump it
unhexdump moby.hex > moby2.txt # 3. unhexdump it
diff -s moby.txt moby2.txt # 4. compare the files
# OUT:
Files moby.txt and moby2.txt are identical
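The same round trip can also be checked in-process, without temporary files. The sketch below is not the article’s hexdump or unhexdump: dump is an assumed stand-in that writes two hex digits and a space per byte, and undump decodes with the standard library’s encoding/hex so that it works as an independent cross-check. All three function names are made up for this example.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"strings"
)

// dump writes each byte of p as two hex digits followed by a space.
// (Assumed output format; a stand-in for the real hexdump.)
func dump(p []byte) string {
	var sb strings.Builder
	for _, b := range p {
		fmt.Fprintf(&sb, "%02x ", b)
	}
	return sb.String()
}

// undump strips all whitespace from s and decodes the remaining hex digits.
func undump(s string) ([]byte, error) {
	return hex.DecodeString(strings.Join(strings.Fields(s), ""))
}

// roundTrips reports whether p survives dump followed by undump unchanged.
func roundTrips(p []byte) bool {
	got, err := undump(dump(p))
	return err == nil && bytes.Equal(got, p)
}

func main() {
	// Text is the easy case; also try every possible byte value.
	all := make([]byte, 256)
	for i := range all {
		all[i] = byte(i)
	}
	fmt.Println(roundTrips([]byte("from hell's heart I stab at thee\n")))
	fmt.Println(roundTrips(all))
}
```

Testing every byte value matters because a text file like moby.txt never exercises nulls or bytes above 0x7f.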
Looks good. As a final exercise before we close this article out, let’s use torso and hexdump to investigate the first bytes of our ‘hello’ program and see what we can learn.
hello
Programs in Linux are stored in the ELF (Executable and Linkable Format), which starts with a header that describes the rest of the file. This is a common way to organize file formats - a brief header explaining the rest. Let’s use ELF as an example header.
#!/usr/bin/env bash
# IN
cat hello | torso -offset 0 -after 16 | shexdump
# OUT
7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
02 00 3e 00 01 00 00 00 c0 cf 46 00 00 00 00 00
7f is meaningless to us - it’s not a printable ASCII character - but 45 4c 46 is E L F in ASCII. Beyond that, it’s mostly opaque binary - but we can already figure out why hello stopped printing with cat given our new knowledge of C-strings. The 8th byte of hello is a null, which terminates the string and got our shell to stop printing. Let’s look at what Wikipedia says about the ELF header to learn more.
The object file type is identified by a two-byte field at offset 0x10 (for the values below, only the first byte matters in a little-endian file like ours):
# IN
cat hello | torso -offset 0x10 -before 0 -after 1 | shexdump
# OUT
02
Value | Type | Meaning |
---|---|---|
0x00 | ET_NONE | Unknown. |
0x01 | ET_REL | Relocatable file. |
0x02 | ET_EXEC | Executable file. |
0x03 | ET_DYN | Shared object. |
0x04 | ET_CORE | Core file. |
Unsurprisingly, our executable is an executable. We can take a peek at our architecture with the machine type at offset 0x12.
# IN
cat hello | torso -offset 0x12 -before 0 -after 1 | shexdump
# OUT
3e
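This is the x86-64 architecture. The endianness of the file is stored at offset 0x05: 1 means little-endian and 2 means big-endian. The header we dumped earlier has 01 there, so this file is little-endian.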
That’s great, but where does our program actually start? The entrypoint is the place in memory where the program starts executing - the initial value of the instruction pointer. This is stored in the ELF header, little-endian, at offset 0x18.
# IN
cat hello | torso -offset 0x18 -before 0 -after 8 | shexdump
# OUT
c0 cf 46 00 00 00 00 00
Looks like we’ll start executing instructions at memory address 0x46cfc0. More about that later.
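The three lookups above (type at 0x10, machine at 0x12, entrypoint at 0x18) can be pulled out of the header in one pass. This is a minimal sketch under stated assumptions: the 32 header bytes are hard-coded from the hexdump output shown earlier rather than read from a file, it only handles 64-bit little-endian headers, and the elfSummary type and parse function are names made up for this example. Go’s standard library also ships a debug/elf package for real work.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// header is the first 32 bytes of hello, copied from the hexdump output.
var header = []byte{
	0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x02, 0x00, 0x3e, 0x00, 0x01, 0x00, 0x00, 0x00, 0xc0, 0xcf, 0x46, 0x00, 0x00, 0x00, 0x00, 0x00,
}

// elfSummary (a hypothetical type) holds the three fields we looked at by hand.
type elfSummary struct {
	Type    uint16 // offset 0x10: object file type
	Machine uint16 // offset 0x12: target architecture
	Entry   uint64 // offset 0x18: entrypoint address
}

// parse checks the magic bytes, then decodes the fields as little-endian.
func parse(h []byte) (elfSummary, error) {
	if len(h) < 0x20 {
		return elfSummary{}, fmt.Errorf("need at least 32 bytes, got %d", len(h))
	}
	if string(h[:4]) != "\x7fELF" {
		return elfSummary{}, fmt.Errorf("bad magic % x", h[:4])
	}
	// Byte 4 is the class (2 = 64-bit); byte 5 is the endianness (1 = little).
	if h[4] != 2 || h[5] != 1 {
		return elfSummary{}, fmt.Errorf("this sketch only handles 64-bit little-endian files")
	}
	le := binary.LittleEndian
	return elfSummary{
		Type:    le.Uint16(h[0x10:]),
		Machine: le.Uint16(h[0x12:]),
		Entry:   le.Uint64(h[0x18:]),
	}, nil
}

func main() {
	s, err := parse(header)
	if err != nil {
		fmt.Println("parse:", err)
		return
	}
	fmt.Printf("type=%#x machine=%#x entry=%#x\n", s.Type, s.Machine, s.Entry)
}
```

For the bytes above this prints type=0x2 machine=0x3e entry=0x46cfc0, matching what we read off the dump by eye.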
That’s all the digging we’ll do for now - this article’s already over twenty pages long! If you made it through - and especially if you did the exercises - congratulations and thank you for sticking with me.
A couple of final notes before we close out:
Lemma: Instruction Pointer
At the most fundamental level, a program works like this:
- read an instruction at the memory address pointed to by the instruction pointer
- do what the instruction says
- increment the instruction pointer by the size of the instruction
The entrypoint is where the instruction pointer starts.
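The loop described in this lemma is short enough to sketch directly. The three-instruction machine below is invented purely for illustration and does not correspond to any real CPU; opHalt, opPush, opAdd, and run are all made-up names. What it shows is the shape of the loop: fetch the instruction at ip, execute it, then advance ip by that instruction’s size.

```go
package main

import (
	"errors"
	"fmt"
)

// A made-up instruction set for illustration only; not any real CPU's.
const (
	opHalt byte = iota // 1 byte:  stop
	opPush             // 2 bytes: push the operand byte that follows
	opAdd              // 1 byte:  pop two values, push their sum
)

// run starts the instruction pointer at entry and repeats
// fetch, execute, advance until it reaches opHalt.
func run(program []byte, entry int) ([]int, error) {
	var stack []int
	ip := entry // the entrypoint is where the instruction pointer starts
	for ip >= 0 && ip < len(program) {
		switch program[ip] { // fetch
		case opHalt:
			return stack, nil
		case opPush:
			if ip+1 >= len(program) {
				return nil, errors.New("push: missing operand")
			}
			stack = append(stack, int(program[ip+1]))
			ip += 2 // opcode plus one operand byte
		case opAdd:
			if len(stack) < 2 {
				return nil, errors.New("add: stack underflow")
			}
			n := len(stack)
			stack = append(stack[:n-2], stack[n-2]+stack[n-1])
			ip++ // opcode only
		default:
			return nil, fmt.Errorf("unknown instruction %#x at %d", program[ip], ip)
		}
	}
	return nil, errors.New("instruction pointer ran off the end of the program")
}

func main() {
	program := []byte{opPush, 2, opPush, 3, opAdd, opHalt}
	stack, err := run(program, 0)
	fmt.Println(stack, err)
}
```

Note that starting at a different entry changes the meaning of the same bytes: beginning at index 1 would treat the operand 2 as an opcode (opAdd here) and fail with a stack underflow. Real machines add jumps and calls, which set the instruction pointer directly instead of incrementing it.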
Lemma: Endianness
When we want to convert a list of bytes to an integer, there are two ways to do it: either the most significant byte comes first (big-endian), or the least significant byte comes first (little-endian). The x86-64 architecture is little-endian, so the bytes c0 cf 46 00 00 00 00 00 are ‘reversed’ when converted to an integer; they’re the number 0x0046cfc0 (4640704) in little-endian, but 0xc0cf460000000000 (13893400341275213824) in big-endian. Since we know that this represents a memory address, we can figure out that this is a little-endian machine just by looking at the bytes - no machine ever made has 12635974 terabytes of memory, so you’re not going to see that in a memory address. Knowing a bit about how computers actually work can help you understand what you’re looking at.
Exercise: Write a program to read a 64-bit little-endian integer from a file starting at a given offset and print it “big-endian” (most significant byte first).
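To check the arithmetic in the lemma, here is a minimal sketch that decodes the same eight bytes both ways using nothing but shifts. The function names littleEndian and bigEndian are made up for this example; the standard library’s encoding/binary package offers the same conversions. It works on an in-memory slice only, so the file-reading part of the exercise is still yours to do.

```go
package main

import "fmt"

// littleEndian treats b[0] as the least significant byte,
// so it folds the bytes in from the last one to the first.
func littleEndian(b []byte) uint64 {
	var v uint64
	for i := len(b) - 1; i >= 0; i-- {
		v = v<<8 | uint64(b[i])
	}
	return v
}

// bigEndian treats b[0] as the most significant byte,
// so it folds the bytes in from the first one to the last.
func bigEndian(b []byte) uint64 {
	var v uint64
	for _, x := range b {
		v = v<<8 | uint64(x)
	}
	return v
}

func main() {
	b := []byte{0xc0, 0xcf, 0x46, 0x00, 0x00, 0x00, 0x00, 0x00}
	fmt.Printf("little-endian: %#x (%d)\n", littleEndian(b), littleEndian(b))
	fmt.Printf("big-endian:    %#x (%d)\n", bigEndian(b), bigEndian(b))
}
```

The little-endian reading comes out as 0x46cfc0 (4640704) and the big-endian reading as 0xc0cf460000000000 (13893400341275213824): the only difference between the two functions is the direction of the loop.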
―――――
Hopefully I’ve given you a ‘taste’ of systems programming. A summary of the “mindset” or “spirit” behind this article would be:
That’s all for now! In the next article, Starting Systems 2: Your Program and the Outside World, we’ll look at how programs actually do things - how they interact with the outside world via files and system calls, how they manage memory, environment variables and command-line arguments, standard input and output, and more.
Efron Licht (that’s me!) writes programs and articles about writing programs. My most recent paid title was Staff Software Engineer at Runpod. In my spare time, I play pickleball, cook, and read a bunch of books, especially about 19th-century American history.
Check out my LinkedIn or send me an email at efron.dev@gmail.com. I’m available for consulting, short-and-long-term projects, and full-time work. I have a solid history of saving my employers hundreds of thousands of dollars - or more - in yearly server costs.
The term “endianness” is a reference to Gulliver’s Travels. It comes from “ON HOLY WARS AND A PLEA FOR PEACE” by Danny Cohen, dated April 1, 1980; the Internet Archive has a copy of the original post. I reproduce the relevant section of Gulliver’s Travels here:
It began upon the following occasion.
It is allowed on all hands, that the primitive way of breaking eggs before we eat them, was upon the larger end: but his present Majesty’s grandfather, while he was a boy, going to eat an egg, and breaking it according to the ancient practice, happened to cut one of his fingers. Whereupon the Emperor his father published an edict, commanding all his subjects, upon great penalties, to break the smaller end of their eggs.
The people so highly resented this law, that our Histories tell us there have been six rebellions raised on that account, wherein one Emperor lost his life, and another his crown. These civil commotions were constantly fomented by the monarchs of Blefuscu, and when they were quelled, the exiles always fled for refuge to that Empire. It is computed, that eleven thousand persons have, at several times, suffered death, rather than submit to break their eggs at the smaller end. Many hundred large volumes have been published upon this controversy: but the books of the Big-Endians have been long forbidden, and the whole party rendered incapable by law of holding employments.
During the course of these troubles, the emperors of Blefuscu did frequently expostulate by their ambassadors, accusing us of making a schism in religion, by offending against a fundamental doctrine of our great prophet Lustrog, in the fifty-fourth chapter of the Blundecral (which is their Alcoran). This, however, is thought to be a mere strain upon the text: for their words are these; That all true believers shall break their eggs at the convenient end: and which is the convenient end, seems, in my humble opinion, to be left to every man’s conscience, or at least in the power of the chief magistrate to determine.