Memory Management

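Allocating memory inside the kernel is not as easy as allocating it in user space. Kernel code often cannot sleep, has only a small fixed-size stack, and must handle allocation failure itself.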

Physical Pages

The kernel treats physical pages as the fundamental unit of memory management. While processors address memory by bytes or words, the Memory Management Unit (MMU) maintains page tables with page-sized granularity.

Architecture-specific page sizes dictate logical memory partitioning: x86-64 Linux normally uses 4096-byte pages.
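To make page granularity concrete, here is a small user-space sketch of how a byte address splits into a page frame number and an offset when pages are 4096 bytes. The names (model_pfn, model_page_offset, model_pages_needed) are hypothetical helpers for these notes, not kernel APIs.

```c
#include <stdint.h>

/* Assumption: 4096-byte pages, i.e. a 12-bit offset within each page. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (UINT64_C(1) << MODEL_PAGE_SHIFT)

/* Index of the page frame that contains a given physical address. */
static inline uint64_t model_pfn(uint64_t phys_addr)
{
	return phys_addr >> MODEL_PAGE_SHIFT;
}

/* Byte offset of the address within its page. */
static inline uint64_t model_page_offset(uint64_t phys_addr)
{
	return phys_addr & (MODEL_PAGE_SIZE - 1);
}

/* Number of whole pages needed to hold `bytes` bytes (rounds up). */
static inline uint64_t model_pages_needed(uint64_t bytes)
{
	return (bytes + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
}
```

For instance, address 0x12345 falls in frame 0x12 at offset 0x345, and a 4097-byte buffer needs two pages.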

The struct page data structure, defined in <linux/mm_types.h>, represents every physical page on the system. Its purpose is to describe the physical page itself, not the transient data contained within the page.

The kernel uses this structure to keep track of all the pages in the system, because the kernel needs to know whether a page is free (that is, if the page is not allocated). If a page is not free, the kernel needs to know who owns the page. Possible owners include user-space processes, dynamically allocated kernel data, static kernel code, the page cache, and so on.

flags: Bit flags storing the status of the page (e.g., dirty, locked in memory). The flag values are defined in <linux/page-flags.h>. Each flag is a single bit, so at least 32 different flags are simultaneously available.
_count: The usage count indicating the number of references to the page.
- An internal value of negative one dictates the page is free and available for allocation.
- Kernel code evaluates this via the page_count() function, which returns zero for free pages and a positive integer for in-use pages.
- A page may be in use as
  - part of the page cache (in which case the mapping field points to the address_space object associated with this page),
  - private data (pointed at by private), or
  - a mapping in a process’s page table.
virtual: The page’s virtual address in the kernel’s address space. It is NULL for high memory that is not currently mapped.

struct page {
	unsigned long flags;
	atomic_t _count;
	atomic_t _mapcount;
	unsigned long private;
	struct address_space *mapping;
	pgoff_t index;
	struct list_head lru;
	void *virtual;
};

An instance of this structure is allocated for each physical page in the system. That adds up to a surprisingly large amount of memory in absolute terms, but it is only a small fraction of a percent of the system’s total memory.
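A rough back-of-the-envelope check of that overhead, as a user-space sketch. The figures used in the example (a 40-byte struct page, 8 KB pages, 4 GB of RAM) are assumptions chosen for illustration; the real structure size depends on kernel version and configuration. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Bytes consumed by all struct page instances on a machine, given
 * total RAM, the page size, and an ASSUMED size for struct page.
 */
static inline uint64_t model_struct_page_overhead(uint64_t ram_bytes,
						  uint64_t page_size,
						  uint64_t struct_page_size)
{
	uint64_t nr_pages = ram_bytes / page_size;  /* one struct per page */

	return nr_pages * struct_page_size;
}
```

With 4 GB of RAM, 8 KB pages, and a 40-byte structure there are 524,288 pages, so the descriptors take 20 MB, which is just under half a percent of memory.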

To accommodate physical hardware addressing constraints, physical pages are partitioned into distinct functional zones.

Memory Zones

Hardware constraints prevent the kernel from treating all pages identically. In particular, Linux has to deal with two shortcomings of hardware with respect to memory addressing:

- Certain hardware devices can perform Direct Memory Access (DMA) only to specific memory addresses.
- Certain architectures can physically address more memory than they can virtually address, so some memory is not permanently mapped into the kernel address space.

The kernel mitigates this by logically grouping pages of similar properties into zones.

ZONE_DMA: Pages capable of undergoing DMA operations.
ZONE_DMA32: Pages capable of undergoing DMA operations, restricted to 32-bit devices.
ZONE_NORMAL: Standard, regularly mapped pages.
ZONE_HIGHMEM: High memory pages not permanently mapped into the kernel’s virtual address space.

These zones, and two other, less notable ones, are defined in <linux/mmzone.h>. The actual use and layout of the memory zones is architecture-dependent.

Zones are logical groupings for the kernel’s internal tracking; they have no physical relevance to the hardware. A single allocation cannot cross zone boundaries: all of its pages come from one zone. The kernel prefers to satisfy normal allocations from ZONE_NORMAL to preserve ZONE_DMA for hardware that strictly requires it, but under memory pressure a normal allocation may be satisfied from another eligible zone instead.

For example, a 64-bit architecture such as Intel’s x86-64 can fully map and handle 64 bits of memory. Thus, x86-64 has no ZONE_HIGHMEM, and all physical memory is contained within ZONE_DMA, ZONE_DMA32, and ZONE_NORMAL.

Each zone is represented by a struct zone, which is defined in <linux/mmzone.h>. The structure is big, but only three zones are in the system and, thus, only three of these structures exist.

Low-Level Page Allocation

The kernel provides a low-level mechanism for allocating and freeing memory with strict page-sized granularity. Contiguous physical pages are requested using a power-of-two order.

These interfaces are declared in <linux/gfp.h>.

Page Allocation Functions:
- alloc_pages(gfp_mask, order): Allocates $2^{order}$ contiguous physical pages and returns a pointer to the first page’s struct page.
- page_address(struct page *page): Converts a given page to its logical address, returning a void * to the logical address where the physical page currently resides.
- __get_free_pages(gfp_mask, order): Allocates $2^{order}$ pages and directly returns the logical address of the first page.
- alloc_page(gfp_mask) / __get_free_page(gfp_mask): Wrapper macros for allocating a single page (where order is zero).
- get_zeroed_page(gfp_mask): Allocates a single page and fills it entirely with zeros. This is critical for security when returning memory to user-space to prevent leaking sensitive data.
Page Deallocation Functions:
- __free_pages(struct page *page, unsigned int order)
- free_pages(unsigned long addr, unsigned int order)
- free_page(unsigned long addr)

You must be careful to free only pages you allocate. Passing the wrong struct page or address, or the incorrect order, can result in corruption.

Memory allocation can fail, returning NULL. Code must execute explicit error checking to handle allocation failures and unwind previous operations.
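The power-of-two order taken by the page allocation functions can be worked out from a byte count. The sketch below is a user-space model with hypothetical names and an assumed 4096-byte page size; the kernel has its own helper, get_order(), for the same purpose.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumed page size in bytes */

/* Number of pages covered by an allocation of the given order: 2^order. */
static inline unsigned long model_pages_in_order(unsigned int order)
{
	return 1UL << order;
}

/* Smallest order whose 2^order pages hold at least `bytes` bytes. */
static inline unsigned int model_order_for_size(size_t bytes)
{
	unsigned long pages = (bytes + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
	unsigned int order = 0;

	while (model_pages_in_order(order) < pages)
		order++;
	return order;
}
```

Because orders are powers of two, a request for three pages must use order 2 and therefore receives four pages. The same order has to be passed back when the pages are freed.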

While page-level allocation serves large contiguous requests, general kernel operations mandate granular byte-sized allocations.

Byte-Sized Allocation: kmalloc() and kfree()

The kmalloc() function’s operation is similar to that of user-space’s familiar malloc() routine, with the exception of the additional flags parameter.

The kmalloc() function acts as the primary interface for obtaining byte-sized chunks of kernel memory. Memory returned by kmalloc() is guaranteed to be physically contiguous.

The function is declared in <linux/slab.h>:

void * kmalloc(size_t size, gfp_t flags): Returns a pointer to a contiguous memory region of at least size bytes.
void kfree(const void *ptr): Frees a block of memory previously allocated by kmalloc(). Do not call it on memory that came from anywhere else or that has already been freed. Calling kfree(NULL) is explicitly checked for and safe.

Both page-level allocators and byte-level allocators rely on a standardized set of behavioral flags to dictate strict memory retrieval policies.

Allocation Flags (gfp_mask)

Allocation flags, represented by the gfp_t type (defined in <linux/types.h>), control how the allocator behaves while obtaining memory. Flags are divided into three categories: action modifiers, zone modifiers, and type flags.

Action Modifiers: Specify how the kernel allocates memory.
- __GFP_WAIT: The allocator is permitted to sleep.
- __GFP_HIGH: The allocator can access emergency memory pools.
- __GFP_IO / __GFP_FS: The allocator can initiate disk or filesystem I/O operations.
- and so on.
Zone Modifiers: Specify where the kernel allocates memory from. There are only three zone modifiers because there are only three zones other than ZONE_NORMAL (which is where, by default, allocations originate).
- __GFP_DMA / __GFP_DMA32: Allocates only from ZONE_DMA or only from ZONE_DMA32, respectively.
- __GFP_HIGHMEM: Satisfies the request from ZONE_HIGHMEM or ZONE_NORMAL. Cannot be used with kmalloc() or __get_free_pages() because these functions return a logical address, which unmapped high memory lacks.
Type Flags: Combine action and zone modifiers to simplify flag specification for common contexts.
- GFP_KERNEL: A standard process-context allocation that can sleep, block, and initiate I/O. This is the default choice for the vast majority of kernel allocations.
- GFP_ATOMIC: A high-priority allocation that is strictly prohibited from sleeping. Mandatory for interrupt handlers, softirqs, tasklets, and code holding spinlocks.
- GFP_NOIO / GFP_NOFS: Allocations that can block but refrain from initiating disk or filesystem I/O, preventing deadlocks in low-level block/filesystem code.
- GFP_DMA: Specifies an allocation strictly from ZONE_DMA. Generally combined with GFP_ATOMIC or GFP_KERNEL.
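The relationship between modifiers and type flags is plain bitwise OR, which a small user-space model can show. Everything prefixed M_ or m_ is hypothetical, and the bit values are invented for the example (the real ones are in <linux/gfp.h> and differ). The compositions follow the descriptions above and the 2.6-era definitions; treat them as an assumption for other kernel versions.

```c
#include <stdbool.h>

/* Action modifiers (hypothetical bit values). */
#define M_GFP_WAIT 0x01u  /* allocator may sleep */
#define M_GFP_HIGH 0x02u  /* allocator may use emergency pools */
#define M_GFP_IO   0x04u  /* allocator may start disk I/O */
#define M_GFP_FS   0x08u  /* allocator may start filesystem I/O */

/* Zone modifier (hypothetical bit value). */
#define M_GFP_DMA  0x10u  /* allocate from ZONE_DMA */

/* Type flags: ORed-together modifiers for common contexts. */
#define M_GFP_ATOMIC (M_GFP_HIGH)
#define M_GFP_NOIO   (M_GFP_WAIT)
#define M_GFP_NOFS   (M_GFP_WAIT | M_GFP_IO)
#define M_GFP_KERNEL (M_GFP_WAIT | M_GFP_IO | M_GFP_FS)

static inline bool m_may_sleep(unsigned int mask)
{
	return (mask & M_GFP_WAIT) != 0;
}

static inline bool m_may_start_io(unsigned int mask)
{
	return (mask & M_GFP_IO) != 0;
}

static inline bool m_may_call_fs(unsigned int mask)
{
	return (mask & M_GFP_FS) != 0;
}
```

Combining a zone modifier with a type flag, as in M_GFP_DMA | M_GFP_KERNEL, keeps the sleeping behavior of the type flag and only changes where the memory comes from.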

The allocators described so far return physically contiguous memory, but some software-only allocations need only virtual contiguity.

Virtually Contiguous Memory: vmalloc()

The vmalloc() function allocates memory that is contiguous within the virtual address space, but not necessarily contiguous in physical RAM.

By contrast, the kmalloc() function guarantees that the pages are physically contiguous (and virtually contiguous).

void * vmalloc(unsigned long size): Allocates virtually contiguous memory. The function can sleep and is explicitly prohibited from use within interrupt context.
void vfree(const void *addr): Frees memory allocated via vmalloc().

Although physically contiguous memory is required in only certain cases, most kernel code uses kmalloc() and not vmalloc() to obtain memory, primarily for performance.

To construct a contiguous virtual layout from noncontiguous physical pages, vmalloc() must specifically modify individual page table entries. This mapping process incurs performance penalties and causes significant Translation Lookaside Buffer (TLB) thrashing. Hardware devices mandate physically contiguous memory because they operate below the MMU and do not process virtual addresses. Consequently, vmalloc() is reserved strictly for large, software-only regions, such as dynamically inserted modules.

The vmalloc() function is declared in <linux/vmalloc.h> and defined in mm/vmalloc.c. Usage is identical to user-space’s malloc().
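To illustrate what "virtually contiguous but physically scattered" means, here is a toy user-space model of a per-page mapping table. It is only a sketch: the structure, the function name, and the 4096-byte page size are assumptions for these notes, and real page tables are multi-level and managed by the kernel.

```c
#include <stdint.h>

#define VM_PAGE_SHIFT 12  /* assumed 4096-byte pages */
#define VM_PAGE_SIZE  (UINT64_C(1) << VM_PAGE_SHIFT)

/* Toy mapping: entry i holds the physical frame backing virtual page i. */
struct model_vmap {
	uint64_t virt_base;      /* start of the virtually contiguous range */
	const uint64_t *frames;  /* physical frame numbers, in any order */
	unsigned int nr_pages;   /* number of entries in frames[] */
};

/*
 * Translate a virtual address inside the range to a physical address.
 * Returns UINT64_MAX if the address falls outside the mapping.
 */
static inline uint64_t model_virt_to_phys(const struct model_vmap *m,
					  uint64_t vaddr)
{
	uint64_t off, idx;

	if (vaddr < m->virt_base)
		return UINT64_MAX;
	off = vaddr - m->virt_base;
	idx = off >> VM_PAGE_SHIFT;
	if (idx >= m->nr_pages)
		return UINT64_MAX;
	return (m->frames[idx] << VM_PAGE_SHIFT) | (off & (VM_PAGE_SIZE - 1));
}
```

With frames 7, 2, and 9 backing three consecutive virtual pages, two adjacent virtual bytes on either side of a page boundary land in unrelated physical frames. Each page needs its own table entry, which is where the extra mapping cost described above comes from.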

General memory allocation interfaces incur severe overhead during frequent creation and destruction cycles, necessitating a dedicated object caching mechanism.

The Slab Allocator

Allocating and freeing data structures is one of the most common operations inside any kernel. To facilitate frequent allocations and deallocations of data, programmers often introduce free lists. A free list contains a block of available, already allocated, data structures. When code requires a new instance of a data structure, it can grab one of the structures off the free list rather than allocate the sufficient amount of memory and set it up for the data structure. Later, when the data structure is no longer needed, it is returned to the free list instead of deallocated. In this sense, the free list acts as an object cache, caching a frequently used type of object.

One of the main problems with free lists in the kernel is that there exists no global control. When available memory is low, there is no way for the kernel to communicate to every free list that it should shrink the sizes of its cache to free up memory. The kernel has no understanding of the random free lists at all. To remedy this, and to consolidate code, the Linux kernel provides the slab layer (also called the slab allocator).

The slab allocator acts as a generic data structure-caching layer to minimize allocation overhead and prevent memory fragmentation.

Design Principles:
- Caches frequently allocated and freed data structures.
- Maintains free lists of contiguous memory, eliminating fragmentation.
- Implements per-processor caches to enable lockless allocations.
- Applies object coloring to prevent multiple objects from mapping to the same cache lines.
Structural Hierarchy:

Caches (kmem_cache): One cache exists per object type (e.g., task_struct, inode). The kmalloc() interface is built on top of the slab layer, using a family of general-purpose caches.

Slabs (struct slab): The caches are then divided into slabs (hence the name of this subsystem). The slabs are composed of one or more physically contiguous pages. Typically, slabs are composed of only a single page. Each cache may consist of multiple slabs.

Each slab contains some number of objects, which are the data structures being cached. Each slab is in one of three states: full (no free objects), partial (some allocated and some free objects), or empty (no allocated objects). When some part of the kernel requests a new object, the request is satisfied from a partial slab, if one exists. Otherwise, the request is satisfied from an empty slab; if no empty slab exists, one is created.

Each cache is represented by a kmem_cache structure. This structure contains three lists, slabs_full, slabs_partial, and slabs_empty, stored inside a kmem_list3 structure, which is defined in mm/slab.c. These lists contain all the slabs associated with the cache. A slab descriptor, struct slab, represents each slab:

struct slab {
	struct list_head list;         /* full, partial, or empty list */
	unsigned long colouroff;       /* offset for the slab coloring */
	void *s_mem;                   /* first object in the slab */
	unsigned int inuse;            /* allocated objects in the slab */
	kmem_bufctl_t free;            /* first free object, if any */
};

Slab descriptors are allocated either outside the slab in a general cache or inside the slab itself, at the beginning.

The descriptor is stored inside the slab if the total size of the slab is sufficiently small, or if internal slack space is sufficient to hold the descriptor.

The slab allocator creates new slabs by interfacing with the low-level kernel page allocator via __get_free_pages().

The slab layer is managed on a per-cache basis through a simple interface, which is exported to the entire kernel. The interface enables the creation and destruction of new caches and the allocation and freeing of objects within the caches. The sophisticated management of caches and the slabs within is entirely handled by the internals of the slab layer. After you create a cache, the slab layer works just like a specialized allocator for the specific type of object.

Objects: The specific cached data structures housed within the slabs.

Allocation Mechanics:
- The kernel satisfies allocation requests from a partial slab.
- If no partial slab exists, an empty slab is utilized.
- If no empty slabs exist, the slab layer interfaces with the low-level page allocator via __get_free_pages() to generate a new slab.
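The partial, then empty, then new-slab order can be sketched with a toy user-space cache that tracks only per-slab usage counts. This is a model of the selection policy, not of the real slab code: the structures, the names, the fixed slab capacity, and the slab limit are all assumptions, and creating a slab here merely bumps a counter where the kernel would call the page allocator.

```c
#include <stddef.h>

#define MODEL_OBJS_PER_SLAB 4  /* assumed capacity of each slab */
#define MODEL_MAX_SLABS     8  /* assumed growth limit of this toy cache */

struct model_slab {
	unsigned int inuse;  /* allocated objects in this slab */
};

struct model_cache {
	struct model_slab slabs[MODEL_MAX_SLABS];
	unsigned int nr_slabs;  /* slabs created so far */
};

/* Returns the index of the slab that served the request, or -1 on failure. */
static inline int model_cache_alloc(struct model_cache *c)
{
	int empty = -1;
	unsigned int i;

	for (i = 0; i < c->nr_slabs; i++) {
		unsigned int n = c->slabs[i].inuse;

		if (n > 0 && n < MODEL_OBJS_PER_SLAB) {
			c->slabs[i].inuse++;  /* partial slab: preferred */
			return (int)i;
		}
		if (n == 0 && empty < 0)
			empty = (int)i;       /* remember the first empty slab */
	}
	if (empty < 0) {
		if (c->nr_slabs == MODEL_MAX_SLABS)
			return -1;            /* stands in for a failed page allocation */
		empty = (int)c->nr_slabs++;   /* grow the cache by one new slab */
		c->slabs[empty].inuse = 0;
	}
	c->slabs[empty].inuse++;
	return empty;
}

/* Return one object to the slab it came from. */
static inline void model_cache_free(struct model_cache *c, int slab)
{
	if (slab >= 0 && (unsigned int)slab < c->nr_slabs &&
	    c->slabs[slab].inuse > 0)
		c->slabs[slab].inuse--;
}
```

Preferring partial slabs keeps allocated objects packed together, so empty slabs stay empty and can be handed back to the page allocator when memory is tight.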

Slab Layer Interface

The interface consists of four functions:
- kmem_cache_create(): Creates a new cache. Flags include SLAB_HWCACHE_ALIGN to enforce cache-line alignment and SLAB_PANIC to issue a system panic upon allocation failure.
- kmem_cache_destroy(): Destroys a cache. The caller must ensure that all slabs in the cache are empty and that nothing accesses the cache during or after the call.
- kmem_cache_alloc(): Retrieves an object from the specified cache.
- kmem_cache_free(): Marks a specific object as free and returns it to its originating slab.
  
The slab allocator handles dynamically allocated objects efficiently. Static allocation on the stack, by contrast, is constrained by the small, fixed size of the kernel stack.

Statically Allocating on the Stack

The kernel stack is exceptionally small and strictly fixed in size. Unlike user-space stacks, it cannot dynamically grow.

Stack Sizing: Kernel stacks are defined at compile-time to be either one or two pages. Total stack limits scale between 4KB and 16KB depending on architecture page sizes.
Single-Page vs. Interrupt Stacks: Historically, interrupt handlers shared the stack of the interrupted process. Enabling single-page kernel stacks removes this burden by establishing dedicated, single-page per-processor interrupt stacks.
Usage Constraints:
- Local automatic variables must be kept to an absolute minimum.
- Large static allocations, such as substantial arrays or complete structures, are strictly prohibited on the stack.
- Stack overflows occur silently and corrupt whatever sits beyond the end of the stack. The first casualty is typically the thread_info structure stored at the end of each process’s kernel stack, after which the system usually crashes.

Data too large for the stack must be allocated dynamically, and dynamically allocated pages may come from high memory, which has no permanent mapping.

High Memory Mappings

High memory pages (e.g., memory exceeding 896MB on x86-32 architectures) do not hold permanent mappings within the kernel’s logical address space.

Permanent Mappings:
- void *kmap(struct page *page): Maps a high memory page into the kernel’s logical address space. If the page resides in low memory, the existing virtual address is returned. The function can sleep and is valid only in process context.
- void kunmap(struct page *page): Unmaps the permanent mapping, alleviating pressure on the limited pool of permanent address space.
Temporary (Atomic) Mappings:
- void *kmap_atomic(struct page *page, enum km_type type): Atomically maps a high memory page into a reserved temporary mapping slot. Disables kernel preemption and strictly does not sleep, making it mandatory for interrupt context.
- void kunmap_atomic(void *kvaddr, enum km_type type): Unmaps the temporary atomic mapping and re-enables kernel preemption.

Temporary atomic mappings disable preemption because each mapping slot is unique to a processor. The same technique protects data that belongs exclusively to one processor.

Per-CPU Allocations

Per-CPU data ensures variables are entirely unique to specific processors, storing items in an array indexed by processor number.

Compile-Time Interface:
- DEFINE_PER_CPU(type, name) creates a per-CPU variable instance for every processor.
- get_cpu_var(name) returns an lvalue for the current processor’s data and automatically disables kernel preemption.
- put_cpu_var(name) re-enables kernel preemption once the data manipulation is done.
Runtime Interface:
- alloc_percpu(type) / __alloc_percpu(size, align) dynamically allocates a memory instance for every processor.
- free_percpu() frees the dynamically allocated per-CPU data.
Benefits:
- Eliminates explicit locking requirements, provided the data is exclusively accessed by the local processor.
- Greatly reduces cache invalidation (cache thrashing). The percpu interface cache-aligns all data, so one processor’s updates do not land on a cache line holding another processor’s data.
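The array-indexed-by-processor idea can be sketched in user space as a counter with one padded slot per CPU. This is a model only: the names, the CPU count, and the 64-byte cache line are assumptions, and the processor number is passed explicitly, whereas the kernel interfaces find the current processor themselves and disable preemption around the access.

```c
#include <stddef.h>

#define MODEL_NR_CPUS    4   /* assumed processor count */
#define MODEL_CACHE_LINE 64  /* assumed cache-line size in bytes */

/* One slot per processor, padded so no two slots share a cache line. */
struct model_percpu_counter {
	struct {
		unsigned long value;
		char pad[MODEL_CACHE_LINE - sizeof(unsigned long)];
	} slot[MODEL_NR_CPUS];
};

static inline void model_counter_init(struct model_percpu_counter *c)
{
	unsigned int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		c->slot[cpu].value = 0;
}

/* Code running on `cpu` touches only its own slot, so no lock is needed. */
static inline void model_counter_inc(struct model_percpu_counter *c,
				     unsigned int cpu)
{
	c->slot[cpu].value++;
}

/* A reader adds up every slot to obtain the global total. */
static inline unsigned long
model_counter_sum(const struct model_percpu_counter *c)
{
	unsigned long total = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		total += c->slot[cpu].value;
	return total;
}
```

Updates stay cheap because each processor writes only its own cache line. The cost moves to readers, which must visit every slot to compute a total.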

With so many allocation mechanisms available, choosing one depends on the execution context and on what the memory is needed for.

Picking an Allocation Method

Selecting the correct allocation method means considering whether the code can sleep, whether the memory must be physically contiguous, and how the data will be used.

Contiguous Physical Pages: Use kmalloc() for byte-sized requests or alloc_pages() for page-sized requests.
Process Context (Can Sleep): Allocate with the GFP_KERNEL flag.
Interrupt Context (Cannot Sleep): Allocate with the GFP_ATOMIC flag.
High Memory Mapping: Obtain a struct page using alloc_pages() with __GFP_HIGHMEM, then map it to a logical address via kmap() or kmap_atomic().
Virtually Contiguous Only: Use vmalloc() for large regions of memory that only software touches.
Frequent Creation/Destruction: Create a dedicated kmem_cache via the slab allocator.
Processor-Local Counters/State: Use per-CPU allocations to avoid lock contention and improve cache locality.
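The checklist above can be read as a small decision procedure. The sketch below encodes one possible ordering of the questions in user-space C. The request fields, the enum names, and the priority order are assumptions made for these notes; real code weighs these factors case by case.

```c
#include <stdbool.h>

enum model_alloc_method {
	MODEL_USE_KMALLOC,
	MODEL_USE_ALLOC_PAGES,
	MODEL_USE_ALLOC_PAGES_HIGHMEM,
	MODEL_USE_VMALLOC,
	MODEL_USE_SLAB_CACHE,
	MODEL_USE_PERCPU
};

enum model_gfp { MODEL_GFP_KERNEL, MODEL_GFP_ATOMIC };

/* Hypothetical description of what the caller needs. */
struct model_alloc_request {
	bool per_processor;        /* data is private to each processor */
	bool frequent_alloc_free;  /* many same-type objects, short-lived */
	bool from_high_memory;     /* pages may come from ZONE_HIGHMEM */
	bool large;                /* a big region rather than a small object */
	bool needs_phys_contig;    /* hardware needs physical contiguity */
	bool whole_pages;          /* caller wants page-sized units */
};

static inline enum model_alloc_method
model_pick_method(const struct model_alloc_request *r)
{
	if (r->per_processor)
		return MODEL_USE_PERCPU;
	if (r->frequent_alloc_free)
		return MODEL_USE_SLAB_CACHE;
	if (r->from_high_memory)
		return MODEL_USE_ALLOC_PAGES_HIGHMEM;
	if (r->large && !r->needs_phys_contig)
		return MODEL_USE_VMALLOC;
	if (r->whole_pages)
		return MODEL_USE_ALLOC_PAGES;
	return MODEL_USE_KMALLOC;  /* the common default */
}

/* The flag choice depends only on whether the caller may sleep. */
static inline enum model_gfp model_pick_gfp(bool can_sleep)
{
	return can_sleep ? MODEL_GFP_KERNEL : MODEL_GFP_ATOMIC;
}
```

The method and the flags are independent choices: a kmalloc() call in an interrupt handler and one in process context differ only in the flag passed.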
