ARM Cortex-A53 Memory Hierarchy
Architecture Overview
- Target Domain: Personal Mobile Devices (PMDs) such as tablets and smartphones, prioritizing high energy efficiency.
- Core Delivery: Distributed as a configurable Intellectual Property (IP) core rather than a fixed hardware chip.
- Hard Cores: Optimized for specific semiconductor vendors, providing higher performance and smaller die area, but limited to external parameterization (e.g., L2 cache size).
- Soft Cores: Built using standard logic libraries, allowing extensive modification and retargeting across different semiconductor vendors.
- Processor Base: Dual-issue, statically scheduled superscalar core supporting the ARMv8-A ISA (32-bit and 64-bit modes) with clock rates up to 1.3 GHz.
Memory Hierarchy Organization
The memory system relies on a two-level Translation Lookaside Buffer (TLB) and a two-level cache structure.
Translation Lookaside Buffers (TLBs)
- Instruction MicroTLB: 10 entries, fully associative, 2-clock-cycle miss penalty.
- Data MicroTLB: 10 entries, fully associative, 2-clock-cycle miss penalty.
- L2 Unified TLB: 512 entries, 4-way set associative, 20-clock-cycle miss penalty.
- Optimization: A dedicated page map cache tracks physical page locations for a set of virtual pages, directly reducing the L2 TLB miss penalty.
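The two TLB levels combine into an average translation cost per access. A minimal sketch of that arithmetic, where the miss penalties come from the figures above but the miss rates are purely hypothetical illustration values (not measured A53 numbers):

```python
# Average address-translation penalty for the two-level TLB.
# Penalties are from the text; miss rates below are hypothetical.

MICRO_TLB_MISS_PENALTY = 2    # cycles: microTLB miss that hits the L2 TLB
L2_TLB_MISS_PENALTY = 20      # cycles: L2 TLB miss (page table walk)

def avg_translation_penalty(micro_miss_rate, l2_miss_rate):
    """Extra cycles per access spent on translation, on average.

    micro_miss_rate: fraction of accesses missing the 10-entry microTLB
    l2_miss_rate:    fraction of those that also miss the 512-entry L2 TLB
    """
    return micro_miss_rate * (MICRO_TLB_MISS_PENALTY
                              + l2_miss_rate * L2_TLB_MISS_PENALTY)

# Example: 5% microTLB misses, of which 10% also miss the L2 TLB.
print(avg_translation_penalty(0.05, 0.10))  # 0.05 * (2 + 0.1*20) = 0.2 cycles
```

The multiplicative structure shows why the L2 TLB matters: its 20-cycle penalty only applies to the small fraction of accesses that miss both levels.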
L1 Caches (Instruction and Data)
- Capacity: Configurable from 8 KiB to 64 KiB.
- Organization:
- Instruction Cache: 2-way set associative, 64-byte block size.
- Data Cache: 2-way or 4-way set associative, 64-byte block size.
- Indexing & Tagging: Virtually indexed, physically tagged.
- Miss Penalty: 13 clock cycles (latency to retrieve from L2).
- Miss Handling:
- The critical word is returned first to immediately resume processor execution.
- Nonblocking architecture allows the processor to continue operating while the miss completes.
- Write Policy (Data): Write-back policy, defaulting to allocate-on-write.
- Replacement Policy: Least Recently Used (LRU) approximation.
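The replacement policy is documented only as an LRU approximation. A common hardware-friendly scheme at low associativity is tree-based pseudo-LRU; the sketch below, for a single 4-way set, is illustrative of that technique and not the A53's actual circuit:

```python
# Tree pseudo-LRU for one 4-way set: 3 state bits arranged as a binary
# tree, each pointing toward the pseudo-least-recently-used half.
# Illustrative only; the A53's exact approximation is not public.

class PLRU4:
    def __init__(self):
        self.root = 0   # 0: LRU side is ways {0,1}; 1: ways {2,3}
        self.left = 0   # 0: way 0 is LRU of the left pair; 1: way 1
        self.right = 0  # 0: way 2 is LRU of the right pair; 1: way 3

    def touch(self, way):
        """On a hit or fill of `way`, point every bit away from it."""
        if way < 2:
            self.root = 1         # LRU side is now the right pair
            self.left = 1 - way   # the other way of the left pair
        else:
            self.root = 0
            self.right = 3 - way  # the other way of the right pair

    def victim(self):
        """Way to evict on a miss: follow the bits down the tree."""
        return self.left if self.root == 0 else 2 + self.right
```

With only 3 bits per set (versus the bookkeeping true LRU needs), the tree tracks recency approximately, which is why such schemes are favored in hardware.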
L2 Unified Cache
- Capacity: Configurable from 128 KiB to 2 MiB.
- Organization: 16-way set associative, 64-byte block size.
- Miss Penalty: 124 clock cycles (latency to retrieve from main memory).
- Write Policy: Write-back policy, defaulting to allocate-on-write.
- Replacement Policy: LRU approximation.
- Main Memory Interface: Connects to main memory via a 64-bit to 128-bit wide bus, supporting up to 4 memory banks.
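The bus-width choice determines how many transfer beats a 64-byte block refill needs. A quick check of both endpoints of the configurable range:

```python
# Bus transfer beats needed to move one 64-byte cache block over the
# configurable (64- to 128-bit) main memory bus.

def beats_per_block(bus_bits, block_bytes=64):
    """Number of bus-width transfers to move one cache block."""
    return block_bytes // (bus_bits // 8)

print(beats_per_block(64))   # 8 beats on a 64-bit bus
print(beats_per_block(128))  # 4 beats on a 128-bit bus
```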
Memory Addressing and Aliasing
The virtual address is split into fields (virtual page number, page offset, cache index, block offset) that index the TLBs and caches in parallel with address translation.
- Cache Indexing Calculation: index bits = log2(cache capacity / (block size × associativity)).
- Aliasing Challenge: When the number of cache index bits plus block offset bits exceeds the number of page offset bits, a single physical block can map to multiple cache locations (aliases).
- Example: With a 32 KiB 2-way set associative L1 cache and 64-byte blocks, there are 256 sets, so the index is 8 bits.
- Given a 4 KiB page size (12-bit page offset) and a 6-bit block offset, the index and block offset together occupy the low 14 bits of the address (8 + 6 = 14 > 12).
- This forces 2 bits of the virtual page number to overlap with the cache index, creating potential aliases.
- Resolution: Hardware dynamically detects and prevents aliases during cache miss processing to maintain data consistency.
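The example's arithmetic can be reproduced directly. The sketch below (`alias_bits` is a hypothetical helper name, not an ARM API) computes how many virtual-page-number bits leak into the cache index for a given configuration:

```python
import math

# How many bits of the virtual page number overlap the cache index in a
# virtually indexed, physically tagged cache. Zero means no aliasing.

def alias_bits(cache_bytes, ways, block_bytes, page_bytes):
    sets = cache_bytes // (ways * block_bytes)
    index_bits = int(math.log2(sets))            # bits selecting the set
    offset_bits = int(math.log2(block_bytes))    # bits within a block
    page_offset_bits = int(math.log2(page_bytes))
    # Index + offset bits beyond the page offset come from the VPN:
    return max(0, index_bits + offset_bits - page_offset_bits)

# 32 KiB, 2-way, 64-byte blocks, 4 KiB pages: 8 + 6 - 12 = 2 alias bits.
print(alias_bits(32 * 1024, 2, 64, 4 * 1024))  # 2
```

Shrinking the cache to 8 KiB (or raising associativity to 8-way) drives the result to zero, which is why smaller or more associative L1 designs avoid the alias-detection hardware entirely.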
Performance Characteristics
System performance heavily depends on application memory footprints and the extreme disparity between L1 and L2 miss penalties.
- L1 Instruction Cache: Consistently achieves very low miss rates (under 1%) across standard benchmarks such as SPECint2006.
- Data Caches: Miss rates are highly application-dependent. Benchmarks with large memory footprints (e.g., mcf) generate significantly higher miss rates in both the L1 data cache and the L2 cache.
- Average Memory Access Penalty: Even when L1 miss rates are high, the overall average memory access penalty is dominated by L2 misses: the 124-cycle penalty for reaching main memory overwhelms the 13-cycle L1 miss penalty. High L2 hit rates are therefore essential for both PMD battery life and performance.
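Plugging the two miss penalties into the standard average-penalty formula makes the dominance of L2 misses concrete. The miss rates below are hypothetical illustration values, not SPEC measurements:

```python
# Average memory access penalty (extra cycles beyond an L1 hit).
# Penalties are from the text; miss rates are hypothetical.

L1_MISS_PENALTY = 13    # cycles to fetch a block from L2
L2_MISS_PENALTY = 124   # cycles to fetch a block from main memory

def avg_miss_penalty(l1_miss_rate, l2_local_miss_rate):
    """Extra cycles per access. l2_local_miss_rate is the fraction of
    L1 misses that also miss in L2 (the local L2 miss rate)."""
    return l1_miss_rate * (L1_MISS_PENALTY
                           + l2_local_miss_rate * L2_MISS_PENALTY)

# A 10% L1 miss rate with 30% of those also missing L2:
print(avg_miss_penalty(0.10, 0.30))  # 0.1 * (13 + 0.3*124) = 5.02 cycles
```

In that example the main-memory term contributes 3.72 of the 5.02 cycles, even though only 3% of all accesses go to memory: the 124-cycle penalty swamps the 13-cycle one exactly as the text argues.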