Memory Management

Physical Pages

The kernel treats physical pages as the basic unit of memory management. Although the processor’s smallest addressable unit is a byte or a word, the Memory Management Unit (MMU) deals in pages, so page tables are maintained at page granularity. Page size is architecture-specific, and many architectures support more than one size. Most 32-bit architectures use 4KB pages; 64-bit architectures vary, with 4KB the usual base size on x86-64 and 8KB used on some others, such as Alpha.

The kernel represents every physical page on the system with a struct page. The structure describes the physical page itself, not the data it currently holds, which is transient. The most important fields are:

flags: Bit flags storing the status of the page (e.g., dirty, locked in memory).
_count: The usage count, that is, the number of references to the page.
- A stored value of negative one means the page is free and available for allocation.
- Kernel code should not read the field directly but call page_count(), which returns zero for a free page and a positive integer for a page in use.
virtual: The page’s virtual address. It is NULL for high memory that is not currently mapped.
mapping & index: Used when the page belongs to the page cache: mapping points to the associated address_space object and index is the page’s offset within it. A page may instead carry private data, which is referenced through the separate private field.
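
The offset-by-one convention for the usage count can be sketched in a small userspace model. This is not kernel code: the model_ names are hypothetical, and the sketch only mirrors the rule stated above, where the stored counter is negative one for a free page and the reported count is zero.

```c
#include <assert.h>

/* Userspace model only; all model_ names are hypothetical. */
struct model_page {
    int count; /* stored value: -1 means free, 0 means one reference, ... */
};

/* Reported usage count: 0 for a free page, positive for a page in use. */
static int model_page_count(const struct model_page *page)
{
    return page->count + 1;
}

/* Take a reference on the page. */
static void model_get_page(struct model_page *page)
{
    page->count++;
}

/* Drop a reference; the page is free again once the stored value is -1. */
static void model_put_page(struct model_page *page)
{
    assert(page->count >= 0); /* dropping a reference on a free page is a bug */
    page->count--;
}
```

Callers in this model never compare the stored value against negative one themselves; they ask model_page_count(), which is the same reason kernel code goes through page_count().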

Because of hardware addressing constraints, the kernel cannot treat all physical pages alike, so it groups them into zones.

Memory Zones

Hardware constraints prevent the kernel from treating all pages identically. Some devices can perform Direct Memory Access (DMA) only to certain memory addresses, and some architectures can physically address more memory than they can virtually address, so part of memory is not permanently mapped into the kernel’s address space. The kernel handles both problems by grouping pages with similar properties into zones.

ZONE_DMA: Pages capable of undergoing DMA operations.
ZONE_DMA32: Pages capable of undergoing DMA operations, like ZONE_DMA, but accessible only to 32-bit devices.
ZONE_NORMAL: Standard, regularly mapped pages.
ZONE_HIGHMEM: High memory pages not permanently mapped into the kernel’s virtual address space.

Zones are logical groupings for the kernel’s bookkeeping; they have no physical significance to the hardware. A single allocation cannot span zones: all of its pages come from one zone. The kernel prefers to satisfy normal allocations from ZONE_NORMAL, preserving ZONE_DMA for requests that strictly require it, but when memory is low a normal allocation can fall back to another suitable zone such as ZONE_DMA. The reverse does not hold: a request that requires DMA-capable memory must be satisfied from ZONE_DMA.
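
The fallback rule can be illustrated with a toy userspace model. Everything here is an assumption made for illustration (two zones, a per-zone free counter, the model_ names); the real allocator is far more involved. The sketch shows only two properties from the text: a request is served entirely from one zone, and a normal request may fall back to the DMA zone but a DMA request may not use normal memory.

```c
#include <assert.h>

/* Ordered so that fallback walks toward the more constrained zone. */
enum model_zone { MODEL_ZONE_DMA, MODEL_ZONE_NORMAL, MODEL_NR_ZONES };

struct model_node {
    unsigned long free_pages[MODEL_NR_ZONES];
};

/*
 * Returns the zone that served the request, or -1 if no single zone could.
 * The request is never split across zones.
 */
static int model_alloc_pages(struct model_node *node,
                             enum model_zone preferred,
                             unsigned long nr_pages)
{
    int zone;

    assert(nr_pages > 0);

    /* Try the preferred zone first, then any more constrained zone. */
    for (zone = (int)preferred; zone >= 0; zone--) {
        if (node->free_pages[zone] >= nr_pages) {
            node->free_pages[zone] -= nr_pages;
            return zone;
        }
    }
    return -1;
}
```

With 4 free DMA pages and 8 free normal pages, a normal request for 8 pages is served from the normal zone, a following normal request for 2 pages falls back to the DMA zone, and a request for 3 pages then fails even though pages remain, because no single zone can hold it.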

Each zone is managed via a struct zone.

lock: A spinlock protecting the zone structure from concurrent access. It does not protect the individual pages residing in the zone.
watermark: An array holding the minimum, low, and high watermarks for the zone. The kernel uses them as per-zone benchmarks for memory consumption and varies how aggressively it reclaims memory according to where free memory stands relative to them.

With pages grouped into zones, the kernel offers low-level interfaces for requesting pages from them.

Low-Level Page Allocation

The kernel provides a low-level mechanism for allocating and freeing memory with strict page-sized granularity. Contiguous physical pages are requested using a power-of-two order.
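
Turning a byte count into an order is a small calculation, sketched below as a userspace model. The 4KB page size and the model_ names are assumptions for illustration; in kernel code the get_order() helper serves this purpose.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL /* assumed page size for this sketch */

/* Smallest order such that 2^order pages hold at least `size` bytes. */
static unsigned int model_order_for(size_t size)
{
    size_t pages;
    unsigned int order = 0;

    assert(size > 0); /* a zero-byte request has no meaningful order */

    /* Round up to whole pages without overflowing on very large sizes. */
    pages = size / MODEL_PAGE_SIZE + (size % MODEL_PAGE_SIZE != 0);

    while (((size_t)1 << order) < pages)
        order++;

    return order;
}
```

Because the order is a power of two, a request for three pages (12288 bytes) rounds up to order 2, which is four pages. That rounding is the cost of the power-of-two interface.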

Page Allocation Functions:
- alloc_pages(gfp_mask, order): Allocates 2^order (that is, 1 << order) contiguous physical pages and returns a pointer to the first page’s struct page.
- __get_free_pages(gfp_mask, order): Allocates 2^order contiguous pages and directly returns the logical address of the first page.
- alloc_page(gfp_mask) / __get_free_page(gfp_mask): Wrapper macros for allocating a single page (where order is zero).
- get_zeroed_page(gfp_mask): Allocates a single page and fills it entirely with zeros. This is critical for security when returning memory to user-space to prevent leaking sensitive data.
Page Deallocation Functions:
- __free_pages(struct page *page, unsigned int order)
- free_pages(unsigned long addr, unsigned int order)
- free_page(unsigned long addr)

Allocation can fail: alloc_pages() returns NULL and __get_free_pages() returns zero. Callers must check the result and, on failure, unwind whatever they have already set up.
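
The check-and-unwind pattern is commonly written with goto labels that release resources in reverse order. The sketch below is a userspace model of that pattern, with malloc() and free() standing in for the kernel allocators; struct model_ctx, the buffer sizes, and the allocator hook used to simulate a failure are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical allocator hook so a failure can be simulated. */
typedef void *(*model_alloc_fn)(size_t size);

struct model_ctx {
    void *ring;
    void *table;
};

/* Returns 0 on success or -ENOMEM with nothing left allocated. */
static int model_ctx_init(struct model_ctx *ctx, model_alloc_fn alloc)
{
    assert(ctx != NULL && alloc != NULL);

    ctx->ring = NULL;
    ctx->table = NULL;

    ctx->ring = alloc(4096);
    if (!ctx->ring)
        goto err;

    ctx->table = alloc(8192);
    if (!ctx->table)
        goto err_free_ring;

    return 0;

err_free_ring:
    /* Undo the first allocation before reporting the failure. */
    free(ctx->ring);
    ctx->ring = NULL;
err:
    return -ENOMEM;
}

/* Failure-injection helper: every call succeeds except the second. */
static int model_alloc_calls;

static void *model_alloc_fail_second(size_t size)
{
    return (++model_alloc_calls == 2) ? NULL : malloc(size);
}
```

Each label undoes exactly the steps that succeeded before the jump, so adding a third allocation means adding one label rather than touching every error path.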

Page-level allocation suits requests for whole pages; most kernel code instead needs memory in byte-sized chunks.

Byte-Sized Allocation: kmalloc() and kfree()

The kmalloc() function acts as the primary interface for obtaining byte-sized chunks of kernel memory. Memory returned by kmalloc() is guaranteed to be physically contiguous.

void * kmalloc(size_t size, gfp_t flags): Returns a pointer to a contiguous memory region of at least size bytes.
void kfree(const void *ptr): Frees a block of memory previously allocated with kmalloc(). Do not pass memory obtained from another allocator or memory that has already been freed. Calling kfree(NULL) is explicitly checked for and safe.

Both the page-level and the byte-level allocators take a flags argument that controls how the allocation is performed.

Allocation Flags (gfp_mask)

Allocation flags, represented by the gfp_t type, dictate the specific behavior of the memory allocator during retrieval. Flags are divided into three distinct categories: action modifiers, zone modifiers, and type flags.

Action Modifiers: Specify how the kernel allocates memory.
- __GFP_WAIT: The allocator is permitted to sleep.
- __GFP_HIGH: The allocator can access emergency memory pools.
- __GFP_IO / __GFP_FS: The allocator can initiate disk or filesystem I/O operations.
Zone Modifiers: Specify where the kernel allocates memory from.
- __GFP_DMA / __GFP_DMA32: Restrict the allocation to ZONE_DMA or ZONE_DMA32, respectively.
- __GFP_HIGHMEM: Satisfies the request from ZONE_HIGHMEM or ZONE_NORMAL. Cannot be utilized with kmalloc() or __get_free_pages() because these functions require a logical address, which unmapped high memory lacks.
Type Flags: Combine action and zone modifiers to simplify flag specification for common contexts.
- GFP_KERNEL: A standard process-context allocation that can sleep, block, and initiate I/O. This is the default choice for the vast majority of kernel allocations.
- GFP_ATOMIC: A high-priority allocation that is strictly prohibited from sleeping. Mandatory for interrupt handlers, softirqs, tasklets, and code holding spinlocks.
- GFP_NOIO / GFP_NOFS: Allocations that can block but refrain from initiating disk or filesystem I/O, preventing deadlocks in low-level block/filesystem code.
- GFP_DMA: Specifies an allocation strictly from ZONE_DMA. Generally combined with GFP_ATOMIC or GFP_KERNEL.
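
How type flags are built from action modifiers can be sketched as plain bit masks. This is a userspace model: the bit values and the MODEL_ names are hypothetical, and the compositions follow the descriptions above as they were defined in 2.6-era kernels (other versions compose these masks differently).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; only the way they combine mirrors the text. */
#define MODEL_GFP_WAIT 0x01u /* allocator may sleep */
#define MODEL_GFP_HIGH 0x02u /* allocator may use emergency pools */
#define MODEL_GFP_IO   0x04u /* allocator may start disk I/O */
#define MODEL_GFP_FS   0x08u /* allocator may start filesystem I/O */
#define MODEL_GFP_DMA  0x10u /* zone modifier: allocate from ZONE_DMA */

/* Type flags are just convenient combinations of the modifiers. */
#define MODEL_GFP_ATOMIC (MODEL_GFP_HIGH)
#define MODEL_GFP_NOIO   (MODEL_GFP_WAIT)
#define MODEL_GFP_NOFS   (MODEL_GFP_WAIT | MODEL_GFP_IO)
#define MODEL_GFP_KERNEL (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)

/* True if an allocation with this mask is allowed to sleep. */
static bool model_may_sleep(unsigned int flags)
{
    return (flags & MODEL_GFP_WAIT) != 0;
}

/* Pick the usual type flag for the calling context. */
static unsigned int model_flags_for(bool in_atomic_context)
{
    unsigned int flags = in_atomic_context ? MODEL_GFP_ATOMIC
                                           : MODEL_GFP_KERNEL;

    /* Invariant: a mask that can sleep never reaches atomic context. */
    assert(!in_atomic_context || !model_may_sleep(flags));
    return flags;
}
```

A zone modifier is simply OR-ed in: MODEL_GFP_ATOMIC | MODEL_GFP_DMA still cannot sleep, which models the note that GFP_DMA is generally combined with GFP_ATOMIC or GFP_KERNEL.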

kmalloc() and the page allocator return physically contiguous memory, but memory used only by software often needs to be contiguous only in the virtual address space.

Virtually Contiguous Memory: vmalloc()

The vmalloc() function allocates memory that is contiguous within the virtual address space, but not necessarily contiguous in physical RAM.

void * vmalloc(unsigned long size): Allocates virtually contiguous memory. The function can sleep and is explicitly prohibited from use within interrupt context.
void vfree(const void *addr): Frees memory allocated via vmalloc().

To make noncontiguous physical pages appear contiguous in the virtual address space, vmalloc() must set up page table entries for each page individually. Because the pages are mapped one at a time rather than through the kernel’s direct mapping, this costs performance and causes considerably more Translation Lookaside Buffer (TLB) thrashing than directly mapped memory. Hardware devices generally need physically contiguous memory because they sit on the far side of the MMU and do not understand virtual addresses; buffers used only by software have no such constraint. Even so, most kernel code uses kmalloc() for performance, and vmalloc() is reserved for large regions, such as the memory into which modules are dynamically loaded.
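
The idea of virtual contiguity over scattered physical pages can be shown with a toy translation table. This is a userspace model with made-up frame numbers and hypothetical model_ names; it assumes 4KB pages and a four-page area, and it ignores everything a real page table does beyond the lookup.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096UL /* assumed page size */
#define MODEL_NR_VPAGES 4      /* pages in the virtual area */

/* One entry per virtual page: the physical frame number backing it. */
struct model_vm_area {
    unsigned long frame[MODEL_NR_VPAGES];
};

/* Translate a byte offset within the virtual area to a physical address. */
static unsigned long model_virt_to_phys(const struct model_vm_area *area,
                                        unsigned long offset)
{
    unsigned long vpage = offset / MODEL_PAGE_SIZE;

    assert(vpage < MODEL_NR_VPAGES);
    return area->frame[vpage] * MODEL_PAGE_SIZE + offset % MODEL_PAGE_SIZE;
}
```

With frames { 7, 2, 19, 3 }, the virtually adjacent offsets 4095 and 4096 translate to physical addresses 32767 and 8192. Software walking the area sees one unbroken range, while a device working with physical addresses would not.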

Allocating and freeing the same kind of data structure over and over through the general-purpose interfaces is wasteful, which is why the kernel provides a dedicated object-caching layer.

The Slab Allocator

The slab allocator is a generic data structure-caching layer that reduces allocation overhead and limits memory fragmentation.

Design Principles:
- Caches frequently allocated and freed data structures.
- Stores cached objects contiguously and returns freed objects to a free list, which limits fragmentation.
- Implements per-processor caches to enable lockless allocations.
- Applies object coloring so that objects in different slabs do not all map to the same cache lines.
Structural Hierarchy:
- Caches (kmem_cache): One cache exists per object type (e.g., task_struct, inode).
- Slabs (struct slab): Compose the caches. Consist of one or more physically contiguous pages. Exist in three states: full, partial, or empty.
- Objects: The specific cached data structures housed within the slabs.
Allocation Mechanics:
- The kernel satisfies allocation requests from a partial slab.
- If no partial slab exists, an empty slab is utilized.
- If no empty slabs exist, the slab layer interfaces with the low-level page allocator via __get_free_pages() to generate a new slab.
Slab Layer Interfaces:
- kmem_cache_create(): Creates a new cache. Flags include SLAB_HWCACHE_ALIGN to align each object to a cache line and SLAB_PANIC to panic the system if creating the cache fails.
- kmem_cache_destroy(): Destroys a cache. The caller must ensure that every slab in the cache is empty and that nothing accesses the cache during or after the call.
- kmem_cache_alloc(): Retrieves an object from the specified cache.
- kmem_cache_free(): Marks a specific object as free and returns it to its originating slab.
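
The allocation order described under Allocation Mechanics (partial slab, then empty slab, then a new slab) can be sketched as a toy userspace cache. This is only a model: the model_ names, the four-objects-per-slab limit, and the use of malloc() in place of the page allocator are assumptions, and real slabs track free objects far more efficiently than this linear scan.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_OBJS_PER_SLAB 4

struct model_slab {
    struct model_slab *next;
    unsigned int inuse;                      /* objects handed out */
    unsigned char used[MODEL_OBJS_PER_SLAB]; /* 1 if the slot is allocated */
    char *mem;                               /* backing storage */
};

struct model_cache {
    size_t obj_size;
    struct model_slab *slabs;
    unsigned int nr_slabs;
};

/* Find a partial slab (want_partial != 0) or an empty one (want_partial == 0). */
static struct model_slab *model_find(struct model_cache *cache, int want_partial)
{
    struct model_slab *slab;

    for (slab = cache->slabs; slab; slab = slab->next) {
        if (want_partial ? (slab->inuse > 0 && slab->inuse < MODEL_OBJS_PER_SLAB)
                         : (slab->inuse == 0))
            return slab;
    }
    return NULL;
}

/* Create a new, empty slab; malloc() stands in for the page allocator. */
static struct model_slab *model_grow(struct model_cache *cache)
{
    struct model_slab *slab = calloc(1, sizeof(*slab));

    if (!slab)
        return NULL;
    slab->mem = malloc(cache->obj_size * MODEL_OBJS_PER_SLAB);
    if (!slab->mem) {
        free(slab);
        return NULL;
    }
    slab->next = cache->slabs;
    cache->slabs = slab;
    cache->nr_slabs++;
    return slab;
}

static void *model_cache_alloc(struct model_cache *cache)
{
    struct model_slab *slab = model_find(cache, 1); /* 1. partial slab */
    unsigned int i;

    if (!slab)
        slab = model_find(cache, 0);                /* 2. empty slab */
    if (!slab)
        slab = model_grow(cache);                   /* 3. new slab */
    if (!slab)
        return NULL;

    for (i = 0; i < MODEL_OBJS_PER_SLAB; i++) {
        if (!slab->used[i]) {
            slab->used[i] = 1;
            slab->inuse++;
            return slab->mem + i * cache->obj_size;
        }
    }
    return NULL; /* not reached: the chosen slab always has a free slot */
}

static void model_cache_free(struct model_cache *cache, void *obj)
{
    uintptr_t addr = (uintptr_t)obj;
    struct model_slab *slab;

    for (slab = cache->slabs; slab; slab = slab->next) {
        uintptr_t base = (uintptr_t)slab->mem;

        if (addr >= base && addr < base + cache->obj_size * MODEL_OBJS_PER_SLAB) {
            size_t slot = (addr - base) / cache->obj_size;

            assert(slab->used[slot]); /* catches a double free in the model */
            slab->used[slot] = 0;
            slab->inuse--;
            return;
        }
    }
    assert(0 && "object does not belong to this cache");
}
```

Preferring partial slabs keeps objects packed together and lets empty slabs stay empty, which is what would allow a real implementation to hand their pages back under memory pressure.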

Dynamic allocation is not the only option: small, short-lived data can live on the stack, but the kernel stack is far more limited than its user-space counterpart.

Statically Allocating on the Stack

The kernel stack is exceptionally small and strictly fixed in size. Unlike user-space stacks, it cannot dynamically grow.

Stack Sizing: Kernel stacks are defined at compile-time to be either one or two pages. Total stack limits scale between 4KB and 16KB depending on architecture page sizes.
Single-Page vs. Interrupt Stacks: Historically, interrupt handlers ran on the kernel stack of the process they interrupted. With single-page kernel stacks that sharing leaves too little room, so interrupt handlers instead run on a dedicated per-processor interrupt stack, one page in size.
Usage Constraints:
- Local automatic variables must be kept to an absolute minimum.
- Large automatic allocations, such as big arrays or large structures declared as local variables, are dangerous on the stack and should be allocated dynamically instead.
- Stack overflows occur silently and corrupt whatever lies beyond the stack. The first casualty is typically the thread_info structure at the end of the stack, and the usual result is a system crash.

Data too large for the stack must be allocated dynamically. Pages allocated with __GFP_HIGHMEM bring a further complication: they may have no permanent mapping in the kernel’s address space.

High Memory Mappings

High memory pages (e.g., memory exceeding 896MB on x86-32 architectures) do not hold permanent mappings within the kernel’s logical address space.

Permanent Mappings:
- void *kmap(struct page *page): Maps a high memory page into the kernel’s logical address space. If the page resides in low memory, the existing virtual address is returned. The function can sleep and is valid only in process context.
- void kunmap(struct page *page): Unmaps the permanent mapping, alleviating pressure on the limited pool of permanent address space.
Temporary (Atomic) Mappings:
- void *kmap_atomic(struct page *page, enum km_type type): Atomically maps a high memory page into a reserved temporary mapping slot. Disables kernel preemption and strictly does not sleep, making it mandatory for interrupt context.
- void kunmap_atomic(void *kvaddr, enum km_type type): Unmaps the temporary atomic mapping, enabling kernel preemption.

kmap_atomic() disables preemption because its mapping slots are unique to each processor. Per-CPU data relies on the same protection.

Per-CPU Allocations

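Per-CPU data is data that is unique to a given processor. The classic representation is an array indexed by processor number, with each processor touching only its own element; the percpu interface wraps the same idea in a simpler API.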

Compile-Time Interface:
- DEFINE_PER_CPU(type, name) creates a per-CPU variable instance for every processor.
- get_cpu_var(name) returns an lvalue for the current processor’s data and automatically disables kernel preemption.
- put_cpu_var(name) enables kernel preemption once data manipulation concludes.
Runtime Interface:
- alloc_percpu(type) / __alloc_percpu(size, align) dynamically allocates a memory instance for every processor.
- free_percpu() frees the dynamically allocated per-CPU data.
Benefits:
- Eliminates explicit locking requirements, provided the data is exclusively accessed by the local processor.
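- Reduces cache invalidation. When processors repeatedly modify data held in one another’s caches, the affected lines are invalidated and refetched over and over (cache thrashing). The percpu interface cache-aligns all data so that one processor’s data does not share a cache line with another’s.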
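
The array-indexed-by-processor idea, together with the cache-line separation, can be sketched in userspace as follows. The model_ names, the four-CPU limit, and the 64-byte cache line are assumptions, and the CPU number is passed in explicitly because the sketch has no notion of a current processor. It also leaves out the preemption control that get_cpu_var() and put_cpu_var() provide in the kernel.

```c
#include <assert.h>

#define MODEL_NR_CPUS    4
#define MODEL_CACHE_LINE 64 /* assumed cache-line size in bytes */

/* Each counter is padded to a full cache line so neighbours never share one. */
struct model_percpu_counter {
    unsigned long value;
    char pad[MODEL_CACHE_LINE - sizeof(unsigned long)];
};

static struct model_percpu_counter model_hits[MODEL_NR_CPUS];

/* Called by "cpu" to bump its own counter; no lock is needed in this model. */
static void model_hit(unsigned int cpu)
{
    assert(cpu < MODEL_NR_CPUS);
    model_hits[cpu].value++;
}

/* Readers that want a global figure sum the per-CPU values. */
static unsigned long model_total(void)
{
    unsigned long sum = 0;
    unsigned int cpu;

    for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        sum += model_hits[cpu].value;
    return sum;
}
```

The design trades memory for speed: every update touches only local data, and the comparatively rare global read pays for walking all the counters.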

With this many allocation interfaces available, the remaining question is which one to use in a given situation.

Picking an Allocation Method

The right choice depends on whether the memory must be physically contiguous, whether the caller can sleep, and how often objects are created and destroyed.

Contiguous Physical Pages: Use kmalloc() for byte-sized requests or alloc_pages() for whole pages.
Process Context (Can Sleep): Allocate with the GFP_KERNEL flag.
Interrupt Context (Cannot Sleep): Allocate with the GFP_ATOMIC flag.
High Memory Mapping: Obtain a struct page with alloc_pages() and __GFP_HIGHMEM, then map it to a logical address with kmap() or kmap_atomic().
Virtually Contiguous Only: Use vmalloc() for large regions that are accessed only by software and need not be physically contiguous.
Frequent Creation/Destruction: Create a dedicated kmem_cache through the slab allocator.
Processor-Local Counters/State: Use per-CPU allocations to avoid lock contention and improve cache locality.
