Fundamentals of Quantitative Design and Analysis

Source: 01 Fundamentals of Quantitative Design and Analysis.pdf

1. Technological and Architectural Evolution

  • Historical Growth: Uniprocessor performance improved at roughly 50% per year from 1986 to 2003, driven by scaling and architectural optimizations (pipelining, multiple issue, caches).
  • End of Dennard Scaling: Power density is no longer constant as transistors shrink. Voltage and current cannot drop further without compromising integrated circuit dependability.
  • Slowing of Moore’s Law: Transistor counts no longer double every 1.5 to 2 years, decelerating the growth of devices per chip.
  • Architectural Shift: General-purpose uniprocessor performance growth has slowed significantly. The industry shifted to
    • multicore processors (Task-Level Parallelism), and
    • Domain-Specific Architectures (DSAs) to improve energy-performance-cost under fixed power budgets.

2. Classes of Computers

  • Internet of Things (IoT) / Embedded: Focus on minimizing price and energy. Performance is dictated by application-specific real-time constraints rather than peak speed.
  • Personal Mobile Devices (PMDs): Driven by energy efficiency, responsiveness, and media performance. Packaging constraints and battery life strictly limit power consumption.
  • Desktop Computing: Optimized for price-performance. Characterized by balanced performance for compute and graphics.
  • Servers: Prioritize availability, scalability, and throughput. Cost targets focus on Total Cost of Ownership (TCO), integrating lifetime power and maintenance expenses.
  • Clusters / Warehouse-Scale Computers (WSCs): Massive collections of commodity servers acting as a single entity. Designed for extreme price-performance and energy proportionality. Redundancy is managed via software to mask component failures.

3. Classes of Parallelism

  • Application Parallelism:
    • Data-Level Parallelism (DLP): Simultaneous operations applied to multiple data items.
    • Task-Level Parallelism (TLP): Independent tasks created to execute simultaneously.
  • Flynn’s Taxonomy (Hardware Parallelism):
    • SISD: Traditional uniprocessors.
    • SIMD: Exploits DLP. Includes vector architectures and GPUs.
    • MISD: No commercial implementations.
    • MIMD: Exploits TLP. Includes multicores and clusters.

4. Defining Computer Architecture

Architecture encompasses three distinct components to meet functional requirements within power, cost, and availability constraints.

  • Instruction Set Architecture (ISA): The programmer-visible interface.
  • Organization (Microarchitecture): The high-level aspects of the design, including the memory system, the memory interconnect, and the internal processor design.
  • Hardware: Detailed logic design and packaging technology.

5. Trends in Technology

  • Bandwidth vs. Latency: Across microprocessors, memory, networks, and disks, bandwidth (throughput) scales substantially faster than latency (response time).
  • Transistor Scaling: Transistor density scales quadratically with a linear reduction in feature size. Transistor performance scales linearly.
  • Wire Scaling: Wire signal delay scales poorly. Signal propagation delay consumes increasing fractions of the clock cycle.

6. Trends in Power and Energy

  • Metrics: Energy (Joules) is the correct metric for completing a fixed workload. Power (Watts) acts as a constraint (Thermal Design Power, TDP) dictating cooling and packaging limits.
  • Dynamic Energy and Power: Driven by switching transistors. Energy per 0→1 transition ∝ ½ × Capacitive load × Voltage²; Power_dynamic ∝ ½ × Capacitive load × Voltage² × Switching frequency.
  • Static Power: Leakage current flowing even when transistors are inactive. Scales with the total number of devices. Leakage limits necessitate power gating.
  • Dark Silicon: Transistor budgets exceed thermal dissipation limits; not all areas of a chip can be powered simultaneously.
  • Energy Efficiency Techniques:
    • Dynamic Voltage-Frequency Scaling (DVFS).
    • Clock gating for idle modules.
    • Temporary overclocking (Turbo mode) utilizing thermal margins.
    • Heterogeneous cores (combining high-performance and high-efficiency cores).
    • Race-to-halt: computing quickly to enter deep sleep modes.
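
The payoff of DVFS can be sketched numerically from the standard dynamic-power relation P ≈ ½ C V² f. This is a minimal sketch: the capacitance, voltage, and frequency values are illustrative assumptions, not measurements of any real processor.

```python
# Sketch: dynamic power under DVFS, assuming the standard relation
# P_dynamic = 1/2 * C * V^2 * f. All numbers are illustrative.

def dynamic_power(c_load, voltage, freq):
    """Dynamic switching power in watts (C in farads, V in volts, f in Hz)."""
    return 0.5 * c_load * voltage ** 2 * freq

# Nominal operating point (hypothetical values).
C = 1e-9                     # effective switched capacitance, 1 nF
V_nom, f_nom = 1.0, 2.0e9    # 1.0 V at 2 GHz
p_nom = dynamic_power(C, V_nom, f_nom)

# DVFS: scale voltage and frequency together down to 85%.
scale = 0.85
p_scaled = dynamic_power(C, V_nom * scale, f_nom * scale)

# Because V enters squared and f linearly, power falls roughly with
# the cube of the scaling factor (0.85^3 ~ 0.61).
print(f"nominal power: {p_nom:.3f} W")
print(f"scaled power:  {p_scaled:.3f} W")
print(f"power ratio:   {p_scaled / p_nom:.3f}")
```

Scaling both voltage and frequency by 15% cuts dynamic power by nearly 40%, which is why DVFS is such an effective lever under a fixed thermal budget.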

7. Trends in Cost

  • Learning Curve: Manufacturing costs decrease over time as yield improves.

  • Volume and Commoditization: High manufacturing volume amortizes development costs and increases efficiency.

  • Integrated Circuit Cost Factors:

    Cost of die = Cost of wafer / (Dies per wafer × Die yield)

    Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N

    where N is the process-complexity factor.

  • Chiplets: Breaking a large monolithic die into smaller, interconnected dies to increase yield and reduce manufacturing costs.
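
The yield advantage of chiplets can be exercised numerically with the standard die-yield model (with process-complexity factor N). The defect density and N below are illustrative assumptions.

```python
# Sketch: why smaller dies (chiplets) improve yield, using the model
# Die yield = Wafer yield * 1 / (1 + defects_per_area * die_area)^N.
# Defect density and process-complexity factor N are assumed values.

def die_yield(defects_per_cm2, die_area_cm2, n=10.0, wafer_yield=1.0):
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

DEFECTS = 0.05   # defects per cm^2 (assumed)

big = die_yield(DEFECTS, 4.0)    # one monolithic 4 cm^2 die
small = die_yield(DEFECTS, 1.0)  # one 1 cm^2 chiplet

print(f"4 cm^2 monolithic die yield: {big:.3f}")
print(f"1 cm^2 chiplet die yield:    {small:.3f}")

# Because defective chiplets are discarded *before* assembly, the
# relevant comparison is good silicon per wafer: here the small die
# yields nearly 4x more good area for the same defect density.
print(f"yield advantage: {small / big:.2f}x")
```

The exponential dependence on die area means a single defect scraps far more silicon in a monolithic design, which is the economic argument for chiplets.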

8. Dependability and Security

  • States of Service: Systems alternate between service accomplishment and service interruption.
  • Metrics:
    • Mean Time To Failure (MTTF): A reliability measure of continuous service accomplishment.
    • Mean Time To Repair (MTTR): Time spent in service interruption.
    • Mean Time Between Failures (MTBF): MTBF = MTTF + MTTR.
    • Failures in Time (FIT): Failures per billion hours of operation (FIT = 10⁹ / MTTF, with MTTF in hours).
    • Availability: Availability = MTTF / (MTTF + MTTR).
  • Redundancy: Essential for improving MTTF. Systems implement spatial or temporal redundancy to tolerate independent faults.
  • Silent Data Errors: Faults in functional logic causing incorrect execution without halting the system, requiring hardware/software verification checks.
  • Security Vulnerabilities: Microarchitectural state changes during speculative execution (e.g., Spectre) create timing side channels that leak protected information.

9. Measuring and Summarizing Performance

  • Execution Time: The single most reliable measure of computer performance. CPU time isolates execution from I/O wait times.

  • Benchmarks: Standardized suites (e.g., SPEC, TPC) prevent overfitting to trivial kernels and establish baseline configurations.

  • Summarizing Performance (SPECRatio): Execution times are normalized to a reference machine (SPECRatio = Execution time_reference / Execution time_measured) and summarized with the geometric mean:

    Geometric mean = (SPECRatio_1 × SPECRatio_2 × … × SPECRatio_n)^(1/n)

    The ratio of geometric means equals the geometric mean of the performance ratios, so comparisons are independent of the choice of reference machine.
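
The reference-machine independence of the geometric mean can be checked numerically. The benchmark execution times below are made-up values chosen only for illustration.

```python
# Sketch: the ratio of geometric means of SPECRatios is the same no
# matter which reference machine is used. All times are made-up.
import math

def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def spec_ratios(ref_times, times):
    return [r / t for r, t in zip(ref_times, times)]

# Execution times (seconds) for three benchmarks on machines A and B,
# plus two different candidate reference machines.
times_a = [10.0, 40.0, 5.0]
times_b = [20.0, 10.0, 8.0]
ref1 = [100.0, 100.0, 100.0]
ref2 = [50.0, 200.0, 25.0]

r1 = geomean(spec_ratios(ref1, times_a)) / geomean(spec_ratios(ref1, times_b))
r2 = geomean(spec_ratios(ref2, times_a)) / geomean(spec_ratios(ref2, times_b))

# The reference times cancel, so r1 == r2 == the geometric mean of the
# per-benchmark performance ratios times_b[i] / times_a[i].
print(f"A vs B under ref1: {r1:.4f}")
print(f"A vs B under ref2: {r2:.4f}")
```

An arithmetic mean of the normalized ratios would not have this property, which is why SPEC reports geometric means.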

10. Quantitative Principles of Computer Design

  • Take Advantage of Parallelism: Exploit DLP, TLP, and Instruction-Level Parallelism (pipelining, multiple issue) at all levels of system design.

  • Principle of Locality:

    • Temporal Locality: Recently accessed items will likely be accessed again soon.
    • Spatial Locality: Items near recently accessed items will likely be accessed soon.
  • Focus on the Common Case: Optimize frequent operations over infrequent ones for highest impact.

  • Amdahl’s Law: The speedup from an enhancement is limited by the fraction of time the enhancement can be used: Speedup_overall = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced).
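
Amdahl's Law can be made concrete with a short calculation; the fraction and speedup inputs are illustrative.

```python
# Sketch of Amdahl's Law: overall speedup is limited by the fraction
# of execution time an enhancement applies to. Inputs are illustrative.

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Enhancing 80% of execution time by 10x gives well under 10x overall:
print(f"{amdahl_speedup(0.8, 10):.3f}x")

# Even an effectively infinite enhancement of that 80% is capped by
# the untouched 20%: the limit is 1 / (1 - 0.8) = 5x.
print(f"{amdahl_speedup(0.8, 1e12):.3f}x")
```

The unenhanced fraction always dominates eventually, which is why "focus on the common case" and Amdahl's Law go hand in hand.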

  • Processor Performance Equation:

    CPU time = Instruction count × Cycles per instruction (CPI) × Clock cycle time

    Requires evaluating hardware implementation (clock cycle time), organization (CPI), and compiler/ISA technology (instruction count).
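
The processor performance equation can be illustrated with a small trade-off calculation; the instruction counts, CPI values, and clock rate below are assumed for the example.

```python
# Sketch of the processor performance equation:
# CPU time = instruction count * CPI * clock cycle time.
# All parameter values are illustrative assumptions.

def cpu_time(instr_count, cpi, clock_ghz):
    cycle_time_s = 1.0 / (clock_ghz * 1e9)
    return instr_count * cpi * cycle_time_s

# A 2 GHz machine running 1e9 instructions at CPI 1.5:
base = cpu_time(1e9, 1.5, 2.0)

# A compiler change cuts instruction count by 20% but raises CPI to 1.6;
# the equation tells us whether the trade is worth it:
opt = cpu_time(0.8e9, 1.6, 2.0)

print(f"base: {base:.3f} s, optimized: {opt:.3f} s, "
      f"speedup {base / opt:.3f}x")
```

Because the three factors multiply, an improvement in one term can be worthwhile even when it slightly worsens another, and no single factor (e.g., clock rate alone) determines performance.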