The post explains virtual memory specifically for Linux on the x86-64 architecture, mainly as a dialogue between a newly created process and the kernel.

In the common 48-bit x86-64 virtual-address mode, the canonical virtual address range spans 256 TiB. Linux typically splits this into a lower user-space half and an upper kernel-space half. The lower 128 TiB is available to user processes, while the upper 128 TiB is reserved for kernel mappings used when execution enters kernel mode. Physical address capacity is separate from virtual address capacity and depends on the CPU and platform.

Naming convention of Linux vs x86:

Those 48 bits are split into five parts:

  • four groups of 9 bits, one index per page-table level,
  • followed by a single 12-bit offset into the 4 KiB page
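A minimal sketch of that split: four 9-bit table indices plus one 12-bit page offset. The level names PUD and PMD are the usual Linux names for the two middle levels and are my addition, as is the sample address.

```python
PAGE_OFFSET_BITS = 12   # 4 KiB pages
INDEX_BITS = 9          # 512 entries per table
LEVELS = ("PGD", "PUD", "PMD", "PTE")  # assumed Linux names, top level first


def split_virtual_address(vaddr: int) -> dict:
    """Split the low 48 bits of an address into four indices and an offset."""
    fields = {"offset": vaddr & ((1 << PAGE_OFFSET_BITS) - 1)}
    shift = PAGE_OFFSET_BITS
    # Peel off 9 bits at a time, starting with the lowest level (PTE).
    for name in reversed(LEVELS):
        fields[name] = (vaddr >> shift) & ((1 << INDEX_BITS) - 1)
        shift += INDEX_BITS
    return fields


# Hypothetical address near the top of the user half.
example = split_virtual_address(0x7FFF_FFFF_F123)
print(example)
```

For this address the top-level index is 255, the last entry that still belongs to the lower (user) half of a 512-entry top-level table.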

Note:

The top 16 bits must be a sign-extension of bit 47: all zeros for low-half user-space addresses, all ones for high-half kernel-space addresses. Such addresses are called canonical addresses. A non-canonical address faults before any page-table walk takes place. This is what creates the large unused gap between the low and high halves of the 64-bit virtual address space.
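The canonical rule is easy to check with bit arithmetic. A small sketch, with function names of my own choosing:

```python
def is_canonical(vaddr: int) -> bool:
    """True if bits 63..48 of a 64-bit address all equal bit 47."""
    top17 = (vaddr >> 47) & 0x1FFFF      # bits 63..47 together
    return top17 == 0 or top17 == 0x1FFFF


def sign_extend_48(low48: int) -> int:
    """Build the canonical 64-bit address for a 48-bit value."""
    low48 &= (1 << 48) - 1
    if low48 & (1 << 47):                # bit 47 set: fill the top with ones
        return low48 | (0xFFFF << 48)
    return low48                         # bit 47 clear: top bits stay zero


print(is_canonical(0x0000_7FFF_FFFF_FFFF))   # last address of the low half
print(is_canonical(0x0000_8000_0000_0000))   # first address inside the gap
print(hex(sign_extend_48(0x8000_0000_0000))) # first address of the high half
```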

Who Looks At Page Tables?

Answer: the memory management unit (MMU)

The CPU has a register called CR3 that holds the physical address of your current PGD. The kernel updates it on every context switch so the MMU knows which process’s tables to use.

Wait:

Does every memory lookup require a 4-level walk? Welcome to the TLB (translation lookaside buffer), which caches recent translations.

Kernel says: Programs that reuse the same memory regions repeatedly (tight loops, frequently executed functions, reused buffers) tend to stay within a small working set of pages, keeping the TLB warm and page walks rare. But that is not a given. Access patterns matter a great deal.

A process’s working set is the subset of its virtual pages that are actively needed during a given window of execution. The set matters for two hardware structures:

  • TLB: if the working set fits within the TLB’s capacity (typically a few hundred to a few thousand entries), translations stay cached and page walks are rare. If the working set exceeds TLB capacity, TLB misses become frequent and each miss costs a page walk, which can hurt performance.
  • Physical RAM: if the working set fits in RAM, pages stay resident. If it doesn’t, the kernel must evict pages to swap and reload them on demand, which is a far more expensive operation.
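To get a feel for the TLB side, here is a rough back-of-the-envelope sketch. The entry count of 1536 is a hypothetical figure picked from the "few hundred to a few thousand" range, not a measurement of any particular CPU, and the model assumes one entry per 4 KiB page.

```python
PAGE_SIZE = 4 * 1024  # 4 KiB pages


def tlb_reach_bytes(entries: int, page_size: int = PAGE_SIZE) -> int:
    """Memory covered when every TLB entry maps one distinct page."""
    return entries * page_size


def fits_in_tlb(working_set_bytes: int, entries: int,
                page_size: int = PAGE_SIZE) -> bool:
    """True if the working set needs no more pages than there are entries."""
    pages_needed = -(-working_set_bytes // page_size)  # ceiling division
    return pages_needed <= entries


ENTRIES = 1536  # hypothetical TLB size
print(tlb_reach_bytes(ENTRIES) // 2**20, "MiB of reach")
print(fits_in_tlb(4 * 2**20, ENTRIES))    # 4 MiB working set
print(fits_in_tlb(64 * 2**20, ENTRIES))   # 64 MiB working set
```

With these assumed numbers the reach is only 6 MiB, which is why a program that hops around a large buffer can miss in the TLB long before it runs out of RAM.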

Bits in the PTE entry bro

No mention of software walk?

Demand Paging

When a process asks for memory, the kernel allocates the virtual address range but not the physical backing (frames). It provides frames on demand. Why? For efficiency.

When the process first accesses that memory, the MMU raises a trap (a page fault) and control transfers to the kernel.

  • Besides the page table, the kernel maintains a “note” for each mapped range, called a virtual memory area or VMA
  • On a fault, the kernel first checks whether the faulting address falls inside a valid VMA. If yes, the access is legitimate.
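A toy sketch of the VMA check. The sorted list with `bisect` is only a stand-in for whatever structure the kernel really uses, and the `VMA` tuple and the sample ranges are made up for illustration.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class VMA(NamedTuple):
    start: int   # first address covered (inclusive)
    end: int     # first address past the range (exclusive)
    name: str


def find_vma(vmas: list, addr: int) -> Optional[VMA]:
    """Return the VMA covering addr, or None. vmas must be sorted by start."""
    starts = [v.start for v in vmas]
    i = bisect_right(starts, addr) - 1      # last VMA starting at or below addr
    if i >= 0 and vmas[i].start <= addr < vmas[i].end:
        return vmas[i]
    return None


# Hypothetical address space with two mapped ranges.
vmas = [VMA(0x1000, 0x3000, "heap"), VMA(0x8000, 0x9000, "stack")]
print(find_vma(vmas, 0x2FFF))   # inside the heap VMA: legitimate access
print(find_vma(vmas, 0x5000))   # in no VMA: invalid access
```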

Also, the stack grows downwards, and that growth is demand-driven too. The kernel marks the stack VMA as growable, but it does not map every possible stack page upfront. When the stack pointer moves into the next valid page below the current stack, the access faults. Because the faulting address is just below the current stack bottom and the stack VMA is marked as growable, the kernel extends the VMA downward by one page, allocates a frame, and resumes execution.

Two safeguards:

  • the kernel enforces a maximum stack size, 8 MiB by default on typical Linux setups
  • just below the maximum stack limit sits a guard page: a single page that is deliberately left unmapped, with no VMA covering it. If the stack pointer jumps far enough to land in or past the guard page, the fault finds no covering VMA and the access is treated as invalid
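The stack size limit can be read from user space. A small sketch using Python's `resource` module (Unix only); the helper name is mine, and the value printed depends on how the shell and system are configured.

```python
import resource


def stack_limit_bytes():
    """Return the soft stack limit in bytes, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    return None if soft == resource.RLIM_INFINITY else soft


limit = stack_limit_bytes()
print("unlimited" if limit is None else f"{limit / 2**20:.0f} MiB")
```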

Demand paging creates an interesting situation: if the kernel only allocates physical frames at first-access time, then malloc(10GB) on a machine with 4 GB of RAM can succeed (at least initially), depending on the kernel’s overcommit policy. The kernel records the promise in a VMA and returns immediately. No frames are allocated. This is called overcommitting memory: the total size of all VMAs across all running processes can far exceed the amount of physical RAM plus swap.

The kernel’s bet is statistical. In practice, most allocated memory is never fully touched. A process might allocate a large buffer “just in case” and only ever write to a fraction of it. A JVM might reserve a large heap up front but populate it lazily. Across hundreds of processes, the working sets sum to much less than the total committed virtual memory, and the system runs fine.
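The "allocate big, touch little" pattern can be sketched from user space. Here Python's anonymous `mmap` stands in for a large malloc, which is an assumption on my part (the two are not identical), and the 64 MiB size is arbitrary.

```python
import mmap

PAGE = mmap.PAGESIZE
RESERVED = 64 * 1024 * 1024   # hypothetical reservation: 64 MiB of address space


def reserve_and_touch(reserved: int, pages_to_touch: int) -> int:
    """Map anonymous memory, write to a few pages, return bytes touched."""
    region = mmap.mmap(-1, reserved)      # anonymous mapping, no file behind it
    try:
        for i in range(pages_to_touch):
            region[i * PAGE] = 1          # first write to each page can fault
        return pages_to_touch * PAGE
    finally:
        region.close()


touched = reserve_and_touch(RESERVED, 4)
print(f"reserved {RESERVED} bytes, touched {touched} bytes")
```

Only the touched pages should need physical frames; the rest of the reservation is just a promise recorded by the kernel.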

The bet occasionally goes wrong. When too many processes start faulting in pages simultaneously, memory pressure spikes, and the kernel runs out of physical frames. At this point it invokes the OOM killer (Out-Of-Memory killer): a kernel subsystem that scores each process by its memory consumption, age, and other heuristics, then kills the highest-scoring one to reclaim its frames.

Swap

Kernel says: When I run out of free frames, I look for a page that hasn’t been accessed recently. It could be from another process, or even one of your own pages. Once I find the page to evict, I write its contents to a reserved area on disk called swap space. Then I reclaim the frame and give it to you.

Before I give that frame to a process:

  • I update the page table of the process that owned the evicted page
  • I locate the PTE that points to that frame
  • I clear its present bit to 0
  • and I store the swap location in the remaining bits of the entry
  • hardware never looks at those bits when present is 0, but I do when handling the page fault

When the owner touches the swapped-out page later, it faults, and I:

  • find the covering VMA first (the page was only swapped out, so its VMA must still exist)
  • check the PTE next
  • find the swap coordinates in the non-present bits
  • use those to read the data from disk
  • load it into a fresh frame
  • and reinstall the PTE with present=1
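A toy encoding of that idea. Only the position of the present bit (bit 0) follows x86-64; the layout of the swap slot in the remaining bits is invented for illustration and is not the kernel's real format.

```python
PRESENT = 1 << 0   # bit 0 is the present bit on x86-64


def make_present_pte(frame_number: int) -> int:
    """Entry that maps a physical frame (flag bits other than present omitted)."""
    return (frame_number << 12) | PRESENT


def make_swap_pte(swap_slot: int) -> int:
    """Toy swap entry: present bit clear, slot stored in the remaining bits."""
    if swap_slot < 1:
        raise ValueError("slot 0 would look like an empty entry in this toy")
    return swap_slot << 1              # bit 0 stays 0


def decode_pte(pte: int):
    if pte & PRESENT:
        return ("present", pte >> 12)  # hardware would use this translation
    if pte == 0:
        return ("none", None)          # never populated: nothing to swap in
    return ("swapped", pte >> 1)       # only the kernel interprets these bits


print(decode_pte(make_present_pte(0x1234)))
print(decode_pte(make_swap_pte(42)))
print(decode_pte(0))
```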

If a PTE’s present bit is 0, does that always mean the data is in swap? No, the page can be backed by a file too.

You use the mmap system call. It lets you map a file into your address space. When you do that, I create VMAs for the mapped range, but I leave the PTEs absent, just like with malloc.

For file-backed mappings, there is no swap entry. Instead, the VMA itself tells me which file and which block of that file to read. I load that block into a frame, install it in the page table, and resume the process.

Three kinds of page faults so far:

  • first access to memory that was allocated (malloced) but never touched
  • access to a page that was swapped out
  • access to a file-backed page whose block must be fetched from the file
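The decision between these cases can be sketched as a small function. The enum, the argument names, and the "non-zero PTE means swap entry" convention are simplifications of mine, not the kernel's actual data structures; the extra SEGFAULT case covers a fault with no covering VMA.

```python
from enum import Enum


class Fault(Enum):
    SEGFAULT = "no covering VMA"
    FIRST_TOUCH = "anonymous page never touched: allocate a frame"
    SWAP_IN = "page was swapped out: read it back from swap"
    FILE_READ = "file-backed page: read the block from the file"


def classify_fault(in_vma: bool, file_backed: bool, pte: int) -> Fault:
    """Classify a fault on a non-present PTE (pte is the raw entry value)."""
    if not in_vma:
        return Fault.SEGFAULT
    if pte != 0:                 # non-present but non-empty: a swap location
        return Fault.SWAP_IN
    return Fault.FILE_READ if file_backed else Fault.FIRST_TOUCH


print(classify_fault(in_vma=True, file_backed=False, pte=0))
print(classify_fault(in_vma=True, file_backed=False, pte=0x54))
print(classify_fault(in_vma=True, file_backed=True, pte=0))
```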

Copy on Write

A process wants to share the workload:

fork():

  • after you fork, the first thing that happens is a page fault. Why?
  • because the kernel marks the shared pages read-only in both processes, as an optimization
  • the child process must get an independent copy of the parent’s memory
  • the simple approach of copying every page immediately does not work well, since there can be gigabytes of heap and other mappings
  • so the kernel gives the child new page tables that initially point at the same physical frames
  • which means child and parent share the same frames
  • when either of them needs to write to one of those shared pages, a page fault occurs and the kernel gives the writing process a private copy of that frame
  • the kernel tracks the reference count and mapping state of each physical frame; when the parent exits, its mappings are removed
  • the next time the child writes to a page, if the page is no longer shared, the kernel can skip the copy and simply restore the write permission on the PTE

  • the common Unix pattern is to call fork immediately followed by exec to load and execute a new program
  • exec discards the child’s entire address space and builds a fresh one for the new program
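A toy model of the copy-on-write bookkeeping, with one frame and one mapping per process. The classes and the reference counter are illustrative inventions, not kernel structures.

```python
class Frame:
    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refs = 0                  # how many mappings point at this frame


class Mapping:
    """One entry in a toy page table: a frame plus a writable flag."""
    def __init__(self, frame: Frame, writable: bool):
        self.frame, self.writable = frame, writable
        frame.refs += 1


def fork_mapping(parent: Mapping) -> Mapping:
    parent.writable = False            # both sides become read-only
    return Mapping(parent.frame, writable=False)


def write_byte(m: Mapping, offset: int, value: int) -> str:
    action = "direct"
    if not m.writable:                 # this branch plays the page fault
        if m.frame.refs > 1:           # still shared: take a private copy
            m.frame.refs -= 1
            m.frame = Frame(bytes(m.frame.data))
            m.frame.refs = 1
            action = "copied"
        else:                          # sole owner: just restore write access
            action = "reused"
        m.writable = True
    m.frame.data[offset] = value
    return action


parent = Mapping(Frame(b"hello"), writable=True)
child = fork_mapping(parent)
first = write_byte(child, 0, ord("J"))     # child faults and copies
second = write_byte(parent, 0, ord("Y"))   # parent is now the only owner
print(first, second, bytes(parent.frame.data), bytes(child.frame.data))
```

The second write shows the shortcut from the list: once the frame is no longer shared, the fault handler only has to flip the permission back.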

Memory mapped files

Better way to read large files than looping and filling a buffer?

  • mmap and access via a pointer
  • kernel creates a new VMA in the process address space (a memory mapped region)
  • then user can read or write to addresses in that region just like regular memory
  • First thing that happens is:

A major page fault

  • demand paging works for files too
  • the kernel reads the data from disk into the page cache
  • the page cache is the pool of frames used to cache file data
  • once the data is in the page cache, the PTE is updated to point at that frame

Note:

A common misconception is that the page cache is a reserved pool of memory. It’s not. It’s simply the set of physical frames that the kernel is currently using to hold file data. When an application needs more memory and there are no free frames, the kernel can reclaim clean page-cache frames almost instantly, because the file on disk is already the backing copy.

This is why a system that looks nearly full of “used” memory can still allocate freely: much of that “used” memory is reclaimable cache, not locked-in application data.

Now compare that to what happens when you use read() instead. I still bring the file data into the page cache, usually by DMA from the storage device into memory. But then read() copies the data from the page cache frame into your user-space buffer. That page-cache-to-user-buffer copy is the extra step that mmap() avoids.
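Both paths side by side, as a sketch using Python's standard `mmap` module on a temporary file. The payload and helper name are made up. Note that slicing a Python mmap object copies bytes into a new object, so this shows the API shape rather than a true zero-copy read.

```python
import mmap
import os
import tempfile


def read_both_ways(payload: bytes):
    """Write payload to a temp file, then fetch it with read() and via mmap."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        with open(path, "rb") as f:
            via_read = f.read()                # kernel copies into our buffer
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                via_mmap = bytes(m[:16])       # access goes through the mapping
                mapped_len = len(m)
        return via_read, via_mmap, mapped_len
    finally:
        os.remove(path)


via_read, via_mmap, mapped_len = read_both_ways(b"virtual memory notes\n" * 1000)
print(len(via_read), mapped_len, via_mmap)
```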

Then why not always use mmap?

Because mmap removes one cost but introduces others. It trades explicit I/O and copying for page faults, page tables, TLB pressure, and different failure modes. Whether that trade is good depends on the access pattern.

It’s not automatically faster:

  • the first access to a cold mapped page is still a page fault
  • the fault enters the kernel, which locates the VMA, finds or reads the page-cache page, installs a PTE, and resumes the faulting instruction
  • if you scan a huge file once, you may take one fault per 4 KiB page, and those faults can cost more than the page-cache-to-user-buffer copy you avoided
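The "one fault per page" worst case is simple arithmetic. A sketch, assuming 4 KiB pages and treating the result as an upper bound, since the kernel may map several neighbouring cached pages in a single fault:

```python
PAGE_SIZE = 4 * 1024  # 4 KiB pages


def worst_case_faults(file_size: int, page_size: int = PAGE_SIZE) -> int:
    """Upper bound on page faults for one sequential pass over a mapped file."""
    return -(-file_size // page_size)   # ceiling division


faults = worst_case_faults(1 * 1024**3)  # hypothetical 1 GiB file
print(faults)
```

For a 1 GiB file that is 262144 potential trips into the kernel, which is the cost to weigh against the copies that read() would have made.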

(I lost track here and got too confused from this point on. I will come back and read the rest later.)

https://blog.codingconfessions.com/p/virtual-memory