Cache and VM Lab
What this is
- This is a companion note for Chapter 2, not a source-faithful section of the book notes.
- The goal is to connect the cache, TLB, and virtual-memory ideas to commands you can run on a real machine.
Host commands
macOS
```sh
sysctl hw.pagesize hw.memsize hw.l1icachesize hw.l1dcachesize hw.l2cachesize hw.l3cachesize
vm_stat
memory_pressure
sysctl vm.swapusage
```

- `hw.pagesize` gives the page size.
- `hw.l1*`, `hw.l2*`, and `hw.l3*` give cache-capacity facts reported by the OS.
- `vm_stat` shows free, active, inactive, speculative, and compressed-memory state.
- `memory_pressure` gives a quick read on how hard the VM system is working.
- `vm.swapusage` shows whether the machine is actively leaning on swap.
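If you would rather collect the same facts from a script, the standard library exposes the page size directly, and the cache capacities can be pulled by shelling out to `sysctl`. A sketch; the `hw.l1dcachesize`/`hw.l2cachesize` keys are macOS-specific, so the fallback branch fires on other systems:

```python
import os
import subprocess

# Page size straight from POSIX sysconf; matches `sysctl hw.pagesize`.
page_size = os.sysconf("SC_PAGE_SIZE")
print("page size:", page_size, "bytes")

# macOS-specific: ask sysctl for cache capacities; skip quietly elsewhere.
try:
    out = subprocess.run(
        ["sysctl", "hw.l1dcachesize", "hw.l2cachesize"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(out)
except (FileNotFoundError, subprocess.CalledProcessError):
    print("sysctl cache keys not available on this host")
```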
Linux
```sh
lscpu
getconf PAGE_SIZE
cat /sys/devices/system/cpu/cpu0/cache/index*/size
cat /sys/devices/system/cpu/cpu0/cache/index*/type
perf stat -d ./your_program
```

- `lscpu` gives cache and topology summaries.
- `/sys/devices/system/cpu/.../cache` exposes cache metadata directly.
- `perf stat -d` is the easiest way to get real counter-based cache behavior on Linux.
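The sysfs files above are plain text, so they are easy to read programmatically as well. A Linux-only sketch that prints one line per cache index and falls back quietly when sysfs is absent:

```python
import glob
import os

# Collect (level, type, size) for each cache index reported for cpu0.
entries = []
for d in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cache/index*")):
    def read(name, d=d):
        try:
            with open(os.path.join(d, name)) as f:
                return f.read().strip()
        except OSError:
            return "?"
    entries.append((read("level"), read("type"), read("size")))

if entries:
    for level, kind, size in entries:
        print(f"L{level} {kind}: {size}")
else:
    print("no sysfs cache info on this host")
```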
Page-fault experiment
```sh
/usr/bin/time -l python3 - <<'PY'
page = 4096
size = 256 * 1024 * 1024
buf = bytearray(size)
for i in range(0, size, page):
    buf[i] = 1
for i in range(0, size, page):
    buf[i] += 1
print("touched", size // page, "pages")
PY
```

- On macOS, `/usr/bin/time -l` reports page reclaims and page faults.
- The first touch forces the OS to back the pages.
- The second pass should be much cheaper because the pages are already resident.
- Run `vm_stat 1` in another terminal while you do this if you want to watch the VM counters move.
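You can also watch the fault counters from inside Python with the standard `resource` module (`ru_minflt` counts minor faults on Linux and macOS). A sketch; note that CPython may already touch the pages while zero-filling the bytearray, so the allocation and the first pass are measured together here:

```python
import resource

def minor_faults():
    # Minor faults accumulated by this process so far.
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

page = 4096
size = 64 * 1024 * 1024

before = minor_faults()
buf = bytearray(size)          # allocation + zero-fill may touch every page
for i in range(0, size, page):
    buf[i] = 1                 # first touch: OS must back each page
mid = minor_faults()
for i in range(0, size, page):
    buf[i] += 1                # second pass: pages already resident
after = minor_faults()

print("alloc + first pass faults:", mid - before)
print("second pass faults:       ", after - mid)
```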
TLB-style page walk
```sh
python3 - <<'PY'
import random
import time

page = 4096
pages = 32768
buf = bytearray(page * pages)
order = list(range(pages))
random.Random(0).shuffle(order)
for p in order:
    buf[p * page] = 1
start = time.perf_counter()
for _ in range(64):
    for p in order:
        buf[p * page] += 1
end = time.perf_counter()
accesses = 64 * pages
print("pages", pages)
print("ns_per_access", (end - start) * 1e9 / accesses)
PY
```

- This touches one byte per page in randomized page order.
- Increase or decrease `pages` and watch how the time per access changes.
- The point is to make address-translation pressure visible by stepping page by page instead of exploiting normal cache-line locality.
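A variation worth trying is to walk the same pages in sequential order and in shuffled order and compare. Interpreter overhead can swamp the hardware effect in pure Python, so treat the gap as a hint rather than a measurement:

```python
import random
import time

page = 4096
pages = 8192
buf = bytearray(page * pages)

def walk(order, reps=32):
    # Time one byte per page, visiting pages in the given order.
    start = time.perf_counter()
    for _ in range(reps):
        for p in order:
            buf[p * page] += 1
    return (time.perf_counter() - start) * 1e9 / (reps * len(order))

seq = list(range(pages))
rnd = seq[:]
random.Random(0).shuffle(rnd)

walk(seq, reps=1)  # warm once so first-touch faults don't skew the timing
ns_seq = walk(seq)
ns_rnd = walk(rnd)
print(f"sequential ns/access: {ns_seq:.1f}")
print(f"shuffled   ns/access: {ns_rnd:.1f}")
```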
Cache-line and stride experiment
```sh
python3 - <<'PY'
import time

size = 64 * 1024 * 1024
buf = bytearray(size)
for stride in (64, 256, 4096, 16384):
    start = time.perf_counter()
    steps = 0
    for _ in range(32):
        for i in range(0, size, stride):
            buf[i] += 1
            steps += 1
    end = time.perf_counter()
    print(f"stride={stride:5d} bytes ns_per_access={(end-start)*1e9/steps:8.3f}")
PY
```

- `64` bytes is a good approximation of one cache line on many machines.
- `4096` bytes is one page on many machines.
- As the stride grows, spatial locality gets worse and each fetched line does less useful work.
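Some quick bookkeeping explains why stride matters before any timing: at a 64-byte stride, 64 consecutive accesses share one 4096-byte page (and its TLB entry), while at a 4096-byte stride every access needs a fresh translation. A sketch, assuming 64-byte lines and 4-KiB pages (common but not universal):

```python
# Per-stride accounting for the experiment above (assumed geometry).
LINE, PAGE = 64, 4096
size = 64 * 1024 * 1024

rows = []
for stride in (64, 256, 4096, 16384):
    accesses = size // stride
    # Once stride >= LINE, every access lands on a distinct cache line,
    # so each fetched 64-byte line yields exactly one useful byte.
    accesses_per_page = max(1, PAGE // stride)
    rows.append((stride, accesses, accesses_per_page))
    print(f"stride={stride:5d} accesses={accesses:7d} accesses_per_page={accesses_per_page:2d}")
```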
Stronger cache experiment with a temporary binary
```sh
cat >/tmp/cache_stride.c <<'EOF'
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

int main(void) {
    const size_t strides[] = {64, 256, 4096};
    const size_t min_size = 4 * 1024, max_size = 64 * 1024 * 1024, target = 64 * 1024 * 1024;
    unsigned char *buf = NULL;
    volatile uint64_t sink = 0;
    if (posix_memalign((void **)&buf, 64, max_size) != 0) return 1;
    for (size_t i = 0; i < max_size; ++i) buf[i] = (unsigned char)(i & 0xffu);
    puts("size_kb,stride_b,ns_per_access");
    for (size_t size = min_size; size <= max_size; size *= 2) {
        for (size_t j = 0; j < sizeof(strides) / sizeof(strides[0]); ++j) {
            size_t stride = strides[j], per = size / stride, reps = target / (per ? per : 1);
            if (stride > size || per == 0) continue;
            if (reps < 8) reps = 8;
            memset(buf, (int)(size & 0xffu), size);
            double start = now_ns();
            for (size_t r = 0; r < reps; ++r)
                for (size_t i = 0; i < size; i += stride) sink += buf[i];
            double end = now_ns();
            printf("%zu,%zu,%.3f\n", size / 1024, stride, (end - start) / (double)(reps * per));
        }
    }
    if (sink == 0xdeadbeefULL) puts("impossible");
    free(buf);
    return 0;
}
EOF
cc -O2 -std=c11 -Wall -Wextra -pedantic /tmp/cache_stride.c -o /tmp/cache_stride
/tmp/cache_stride | column -t -s,
rm -f /tmp/cache_stride /tmp/cache_stride.c
```

- This gives a much cleaner view of working-set and stride effects than pure Python loops.
- Watch for jumps as the footprint stops fitting in L1, then L2, then L3.
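Once you have the CSV, the features to look for are the jumps between successive sizes. A small helper sketch (the `find_knees` name and the 1.5x threshold are invented here for illustration, not part of the experiment) that flags them for one stride's series:

```python
def find_knees(points, factor=1.5):
    """Return the sizes where ns_per_access jumps by more than `factor`.

    points: list of (size_kb, ns_per_access) pairs for a single stride,
    in increasing size order.
    """
    knees = []
    for (_, t0), (s1, t1) in zip(points, points[1:]):
        if t1 > factor * t0:
            knees.append(s1)
    return knees

# Made-up sample series: latency jumps at 128 KiB and again at 1 MiB.
sample = [(32, 1.0), (64, 1.1), (128, 2.4), (256, 2.5), (1024, 6.0)]
print(find_knees(sample))  # → [128, 1024]
```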
Suggested workflow
- Run the host commands first so you know your page size and the cache sizes the OS reports.
- Run the page-fault experiment and compare first touch with second touch.
- Run the TLB-style page walk and vary the number of pages.
- Run the stride experiment and compare `64`-byte versus `4096`-byte stepping.
- If you want cleaner cache curves, run the temporary compiled cache experiment.
- If you later use Linux, repeat the same experiments with `perf stat -d` around them to get hardware counter context.
Notes
- These experiments are deliberately small and noisy.
- They are for intuition, not publication-grade measurement.
- CPU frequency changes, background work, and prefetchers will affect the numbers.
- The shape of the curve matters more than any single timing value.
Written by GPT-5.4, Codex, 2026-04-14.