Cache and VM Lab

What this is

  • This is a companion note for Chapter 2, not a source-faithful section of the book notes.
  • The goal is to connect the cache, TLB, and virtual-memory ideas to commands you can run on a real machine.

Host commands

macOS

sysctl hw.pagesize hw.memsize hw.l1icachesize hw.l1dcachesize hw.l2cachesize hw.l3cachesize
vm_stat
memory_pressure
sysctl vm.swapusage
  • hw.pagesize gives the page size.
  • hw.l1*, hw.l2*, and hw.l3* give cache-capacity facts reported by the OS.
  • vm_stat shows free, active, inactive, speculative, and compressed-memory state.
  • memory_pressure gives a quick read on how hard the VM system is working.
  • vm.swapusage shows whether the machine is actively leaning on swap.
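The same basic numbers can be pulled from Python via os.sysconf, which is a handy cross-check before running the experiments below (a minimal sketch; SC_PAGE_SIZE and SC_PHYS_PAGES are supported on both macOS and Linux):

```python
import os

# Page size as the C library reports it (matches sysctl hw.pagesize on macOS).
page = os.sysconf("SC_PAGE_SIZE")
# Physical page count; page * phys_pages reproduces hw.memsize.
phys_pages = os.sysconf("SC_PHYS_PAGES")
mem_bytes = page * phys_pages

print(f"page size       : {page} bytes")
print(f"physical memory : {mem_bytes / 2**30:.1f} GiB ({phys_pages} pages)")
```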

Linux

lscpu
getconf PAGE_SIZE
cat /sys/devices/system/cpu/cpu0/cache/index*/size
cat /sys/devices/system/cpu/cpu0/cache/index*/type
perf stat -d ./your_program
  • lscpu gives cache and topology summaries.
  • /sys/devices/system/cpu/.../cache exposes cache metadata directly.
  • perf stat -d is the easiest way to get hardware-counter numbers (L1 and last-level cache loads and misses) for a run on Linux.
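The sysfs files above are plain text, so a few lines of Python can collect them into a summary; this sketch returns an empty list on machines without that sysfs layout (e.g. macOS):

```python
import glob
import os

def cache_summary():
    """Read level/type/size for each cache exposed under cpu0; empty list off Linux."""
    out = []
    for d in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cache/index*")):
        entry = {}
        for key in ("level", "type", "size"):
            path = os.path.join(d, key)
            if os.path.exists(path):
                with open(path) as f:
                    entry[key] = f.read().strip()
        if entry:
            out.append(entry)
    return out

for c in cache_summary():
    print(f"L{c.get('level', '?')} {c.get('type', '?'):12s} {c.get('size', '?')}")
```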

Page-fault experiment

/usr/bin/time -l python3 - <<'PY'
import mmap

page = mmap.PAGESIZE  # real page size (16 KiB on Apple Silicon, 4 KiB on most x86-64)
size = 256 * 1024 * 1024
buf = bytearray(size)
for i in range(0, size, page):
    buf[i] = 1
for i in range(0, size, page):
    buf[i] += 1
print("touched", size // page, "pages")
PY
  • On macOS, /usr/bin/time -l reports page reclaims and page faults.
  • Allocating the buffer and touching it the first time forces the OS to back every page with physical memory (demand paging).
  • The second pass should be much cheaper because the pages are already resident.
  • Run vm_stat 1 in another terminal while you do this if you want to watch the VM counters move.
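The first-touch-versus-second-touch claim can also be verified in-process by timing the two passes separately. This sketch uses an anonymous mmap (which is definitely lazily backed) and a smaller 64 MiB buffer so it runs quickly; the first pass is usually several times slower, though transparent huge pages on Linux can shrink the gap:

```python
import mmap
import time

page = mmap.PAGESIZE
size = 64 * 1024 * 1024
buf = mmap.mmap(-1, size)   # anonymous mapping: nothing resident until first write

def touch(b):
    t0 = time.perf_counter()
    for i in range(0, size, page):
        b[i] = 1
    return time.perf_counter() - t0

first = touch(buf)    # demand-faults every page in
second = touch(buf)   # pages already resident
print(f"first touch : {first * 1e3:.2f} ms")
print(f"second touch: {second * 1e3:.2f} ms")
```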

TLB-style page walk

python3 - <<'PY'
import mmap
import random
import time

page = mmap.PAGESIZE  # one touch per real page, not an assumed 4096
pages = 32768
buf = bytearray(page * pages)
order = list(range(pages))
random.Random(0).shuffle(order)
 
for p in order:
    buf[p * page] = 1
 
start = time.perf_counter()
for _ in range(64):
    for p in order:
        buf[p * page] += 1
end = time.perf_counter()
 
accesses = 64 * pages
print("pages", pages)
print("ns_per_access", (end - start) * 1e9 / accesses)
PY
  • This touches one byte per page in randomized page order.
  • Increase or decrease pages and watch how the time per access changes.
  • The point is to make address-translation pressure visible by stepping page by page instead of exploiting normal cache-line locality.
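Varying the page count is easy to automate; a sweep along the lines of the script above (fewer repetitions so it finishes quickly) makes the trend visible in one run:

```python
import mmap
import random
import time

page = mmap.PAGESIZE

def ns_per_access(pages, reps=16):
    """Time randomized one-byte-per-page accesses over `pages` pages."""
    buf = bytearray(page * pages)
    order = list(range(pages))
    random.Random(0).shuffle(order)
    for p in order:                      # fault every page in before timing
        buf[p * page] = 1
    t0 = time.perf_counter()
    for _ in range(reps):
        for p in order:
            buf[p * page] += 1
    t1 = time.perf_counter()
    return (t1 - t0) * 1e9 / (reps * pages)

for n in (64, 512, 4096, 32768):
    print(f"pages={n:6d}  ns_per_access={ns_per_access(n):7.2f}")
```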

Cache-line and stride experiment

python3 - <<'PY'
import time
 
size = 64 * 1024 * 1024
buf = bytearray(size)
 
for stride in (64, 256, 4096, 16384):
    start = time.perf_counter()
    steps = 0
    for _ in range(32):
        for i in range(0, size, stride):
            buf[i] += 1
            steps += 1
    end = time.perf_counter()
    print(f"stride={stride:5d} bytes  ns_per_access={(end-start)*1e9/steps:8.3f}")
PY
  • 64 bytes is the cache-line size on most x86-64 machines; Apple Silicon uses 128-byte lines.
  • 4096 bytes is one page on most x86-64 machines; Apple Silicon uses 16 KiB pages, so check hw.pagesize first.
  • As the stride grows, spatial locality gets worse and each fetched line does less useful work.
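Access order matters as well as stride, because hardware prefetchers follow sequential streams. This sketch compares a sequential and a shuffled walk at a fixed 64-byte stride; on most machines the shuffled walk is slower, though Python's interpreter overhead mutes the gap:

```python
import random
import time

size = 16 * 1024 * 1024
stride = 64
buf = bytearray(size)

sequential = list(range(0, size, stride))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)

def walk(order, reps=4):
    """Time one pass per rep over the given access order, in ns per access."""
    t0 = time.perf_counter()
    for _ in range(reps):
        for i in order:
            buf[i] += 1
    t1 = time.perf_counter()
    return (t1 - t0) * 1e9 / (reps * len(order))

print(f"sequential ns/access: {walk(sequential):.3f}")
print(f"shuffled   ns/access: {walk(shuffled):.3f}")
```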

Stronger cache experiment with a temporary binary

cat >/tmp/cache_stride.c <<'EOF'
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
static double now_ns(void){struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec;}
int main(void){
  const size_t strides[] = {64,256,4096};
  const size_t min_size = 4*1024, max_size = 64*1024*1024, target = 64*1024*1024;
  unsigned char *buf = NULL; volatile uint64_t sink = 0;
  if (posix_memalign((void **)&buf, 64, max_size) != 0) return 1;
  for (size_t i = 0; i < max_size; ++i) buf[i] = (unsigned char)(i & 0xffu);
  puts("size_kb,stride_b,ns_per_access");
  for (size_t size = min_size; size <= max_size; size *= 2) {
    for (size_t j = 0; j < sizeof(strides)/sizeof(strides[0]); ++j) {
      size_t stride = strides[j], per = size / stride, reps = target / (per ? per : 1);
      if (stride > size || per == 0) continue;
      if (reps < 8) reps = 8;
      memset(buf, (int)(size & 0xffu), size);
      double start = now_ns();
      for (size_t r = 0; r < reps; ++r) for (size_t i = 0; i < size; i += stride) sink += buf[i];
      double end = now_ns();
      printf("%zu,%zu,%.3f\n", size/1024, stride, (end-start)/(double)(reps*per));
    }
  }
  if (sink == 0xdeadbeefULL) puts("impossible");
  free(buf);
  return 0;
}
EOF
cc -O2 -std=c11 -Wall -Wextra -pedantic /tmp/cache_stride.c -o /tmp/cache_stride
/tmp/cache_stride | column -t -s,
rm -f /tmp/cache_stride /tmp/cache_stride.c
  • This gives a much cleaner view of working-set and stride effects than pure Python loops.
  • Watch for jumps as the footprint stops fitting in L1, then L2, then L3.
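The CSV output is easy to post-process. This helper (written against the format the C program prints, with a made-up sample for illustration) flags working-set sizes where latency jumps by a chosen factor, which is roughly where a cache level is exceeded:

```python
import csv
import io

def find_jumps(csv_text, stride=64, factor=1.5):
    """Return (size_kb, ns) pairs where latency grows by `factor` over the previous size."""
    rows = [r for r in csv.DictReader(io.StringIO(csv_text))
            if int(r["stride_b"]) == stride]
    rows.sort(key=lambda r: int(r["size_kb"]))
    jumps = []
    prev = None
    for r in rows:
        ns = float(r["ns_per_access"])
        if prev is not None and ns > prev * factor:
            jumps.append((int(r["size_kb"]), ns))
        prev = ns
    return jumps

sample = """size_kb,stride_b,ns_per_access
32,64,1.100
64,64,1.150
128,64,3.400
1024,64,3.600
65536,64,9.800
"""
print(find_jumps(sample))
```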

Suggested workflow

  1. Run the host commands first so you know your page size and the cache sizes the OS reports.
  2. Run the page-fault experiment and compare first touch with second touch.
  3. Run the TLB-style page walk and vary the number of pages.
  4. Run the stride experiment and compare 64-byte versus 4096-byte stepping.
  5. If you want cleaner cache curves, run the temporary compiled cache experiment.
  6. If you later use Linux, repeat the same experiments with perf stat -d around them to get hardware counter context.

Notes

  • These experiments are deliberately small and noisy.
  • They are for intuition, not publication-grade measurement.
  • CPU frequency changes, background work, and prefetchers will affect the numbers.
  • The shape of the curve matters more than any single timing value.
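One cheap way to tame the noise is to repeat each measurement and report the minimum and median instead of a single run; a small helper along these lines can wrap any of the experiments above:

```python
import statistics
import time

def best_of(fn, runs=5):
    """Call fn() `runs` times; return (min, median) wall-clock seconds."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times), statistics.median(times)

lo, mid = best_of(lambda: sum(range(200_000)))
print(f"min {lo * 1e3:.3f} ms, median {mid * 1e3:.3f} ms")
```

The minimum is usually the best estimate of the undisturbed cost; the gap between minimum and median hints at how noisy the machine is.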

Written by GPT-5.4, Codex, 2026-04-14.