Arithmetic for Computers
Addition, Subtraction, and the ALU
Computer arithmetic relies on hardware to execute mathematical operations using binary representations.
- Addition: Operates bit-by-bit from right to left, passing carry bits to the next higher-order digit.
- Subtraction: Executed via addition by first negating the second operand using its two’s complement inverse.
- Overflow: Occurs when an arithmetic result requires more bits than the fixed hardware word size (e.g., 64 bits) allows.
- Adding operands with different signs, or subtracting operands with the same sign, cannot cause overflow.
- Overflow occurs if the sum of two positive numbers yields a negative result, or the sum of two negative numbers yields a positive result.
- Hardware detects two’s complement overflow when the carry into the sign bit disagrees with the carry out of the sign bit.
- Unsigned integer overflow is detected in software: the compiler emits branch instructions that check whether a sum is less than either addend.
- Arithmetic Logic Unit (ALU): The physical hardware component responsible for performing addition, subtraction, and logical operations.
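The two detection rules above can be sketched in Python by simulating a 64-bit adder with masks (the function and constant names here are illustrative, not from any library):

```python
# Simulate 64-bit two's complement addition and detect signed overflow
# by comparing the carry into the sign bit with the carry out of it.
BITS = 64
MASK = (1 << BITS) - 1
SIGN = 1 << (BITS - 1)

def to_signed(x):
    """Interpret a 64-bit pattern as a signed two's complement value."""
    return x - (1 << BITS) if x & SIGN else x

def add_detect_overflow(a, b):
    """Add two 64-bit patterns; report signed overflow via the carry rule."""
    low_sum = (a & ~SIGN & MASK) + (b & ~SIGN & MASK)  # sum of the lower 63 bits
    carry_in = (low_sum >> (BITS - 1)) & 1             # carry into the sign bit
    full = a + b
    carry_out = (full >> BITS) & 1                     # carry out of the sign bit
    return full & MASK, carry_in != carry_out

# Largest positive 64-bit value plus one overflows to the most negative value.
big = (1 << 63) - 1
result, overflowed = add_detect_overflow(big, 1)
print(overflowed, to_signed(result))   # True -9223372036854775808
```

Note that adding operands of different signs (the second call below) produces equal carries and no overflow flag, matching the rule stated above.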
Managing fixed word sizes in basic integer arithmetic dictates how more complex operations, like multiplication, must scale their hardware and algorithms to handle expanding bit widths.
Multiplication
Multiplication involves combining a multiplicand and a multiplier to yield a product. The product of an n-bit multiplicand and an m-bit multiplier requires n + m bits to guarantee no overflow.
- Sequential Multiplication Algorithm:
- Hardware utilizes a 128-bit Product register alongside a 64-bit ALU.
- Step 1: If the least significant bit of the multiplier is 1, the multiplicand is added to the upper half of the Product register.
- Step 2: The product register is shifted right by 1 bit.
- This logic loop is repeated exactly 64 times.
- Fast Multiplication:
- “Unrolls” the sequential loop by employing 63 adders organized in a parallel tree structure.
- This hardware optimization reduces the calculation delay to log2(64) = 6, i.e., six sequential 64-bit add times.
- Performance scales further using carry-save adders and pipelined execution.
- RISC-V Multiplication Instructions:
- mul: Calculates the lower 64 bits of the 128-bit product.
- mulh: Calculates the upper 64 bits (signed × signed).
- mulhu: Calculates the upper 64 bits (unsigned × unsigned).
- mulhsu: Calculates the upper 64 bits (signed × unsigned).
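The sequential shift-and-add loop above can be sketched in Python, with the multiplier initially occupying the low half of a 128-bit Product register (function name and structure are illustrative):

```python
# Unsigned shift-and-add multiplication mirroring the hardware loop:
# the Product register is 2*bits wide; its low half starts as the multiplier.
def sequential_multiply(multiplicand, multiplier, bits=64):
    product = multiplier & ((1 << bits) - 1)      # multiplier in the low half
    for _ in range(bits):                         # repeated exactly `bits` times
        if product & 1:                           # Step 1: test the LSB
            product += multiplicand << bits       # add multiplicand to upper half
        product >>= 1                             # Step 2: shift Product right
    return product                                # full 2*bits-wide result

print(sequential_multiply(123456789, 987654321))  # 121932631112635269
```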
Similar to multiplication, division requires iterative bit shifts and ALU operations, but introduces unique sequential dependencies that challenge parallelization.
Division
Division extracts a quotient and a remainder from a dividend and a divisor. The operation satisfies the equation: Dividend = Quotient × Divisor + Remainder. The remainder must possess the identical sign as the dividend to prevent mathematically anomalous behavior.
- Restoring Division Algorithm:
- Step 1: Subtract the divisor from the Remainder register.
- Step 2a (Remainder ≥ 0): Shift the Quotient register left, setting the new rightmost bit to 1.
- Step 2b (Remainder < 0): Restore the original value by adding the divisor back to the Remainder register; shift the Quotient register left, setting the new rightmost bit to 0.
- Step 3: Shift the Divisor register right by 1 bit. Repeat these steps 65 times.
- Fast Division:
- Unlike multiplication, division cannot easily utilize a parallel tree of adders because the sign of the difference must be evaluated before the algorithm can proceed to the next step.
- SRT Division: Accelerates division by utilizing lookup tables based on the upper bits of the dividend and remainder to predict multiple quotient bits per step, correcting mispredictions in subsequent passes.
- RISC-V Division Instructions:
- div, divu: Divide (signed/unsigned).
- rem, remu: Remainder (signed/unsigned).
- RISC-V hardware ignores division by zero and overflow; software routines must verify divisor validity and quotient boundaries manually.
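The restoring algorithm above can be sketched for unsigned operands as follows (the divide-by-zero check reflects the point that RISC-V leaves that test to software; names are illustrative):

```python
# Unsigned restoring division: the Divisor starts in the upper half of a
# double-width register and shifts right once per iteration.
def restoring_divide(dividend, divisor, bits=64):
    assert divisor != 0, "software must check for divide-by-zero"
    remainder = dividend                       # Remainder register (2*bits wide)
    div = divisor << bits                      # Divisor starts in the upper half
    quotient = 0
    for _ in range(bits + 1):                  # repeated bits + 1 times
        remainder -= div                       # Step 1: subtract the divisor
        if remainder >= 0:                     # Step 2a: shift in a 1
            quotient = (quotient << 1) | 1
        else:                                  # Step 2b: restore, shift in a 0
            remainder += div
            quotient = quotient << 1
        div >>= 1                              # Step 3: shift the Divisor right
    return quotient & ((1 << bits) - 1), remainder

print(restoring_divide(7, 2, bits=4))   # (3, 1)
```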
Integer arithmetic relies on a fixed radix point, but representing fractional or macroscopically large numbers necessitates a dynamic binary point and an exponent-based system.
Floating-Point Representation
Floating-point arithmetic represents numbers in which the binary point is dynamic rather than fixed. Values are encoded using normalized scientific notation, enforcing a single nonzero digit to the left of the binary point.
- IEEE 754 Standard Formats:
- Single Precision (32-bit): 1 sign bit (S), 8-bit exponent, 23-bit fraction.
- Double Precision (64-bit): 1 sign bit (S), 11-bit exponent, 52-bit fraction.
- Equation: Floating-point values equate to: (−1)^S × (1 + Fraction) × 2^(Exponent − Bias).
- Implicit Leading 1: The leading 1 in the normalized significand is implied, effectively expanding the single- and double-precision significands to 24 and 53 bits respectively.
- Biased Notation: Exponents are biased by a constant to strictly output positive binary representations, streamlining the sorting of floating-point numbers using standard integer comparison ALU operations.
- Single precision bias: 127.
- Double precision bias: 1023.
- Special Values:
- An exponent and fraction of all 0s signifies the exact value 0 (bypassing the implied leading 1).
- An exponent of all 1s signifies Infinity (fraction of 0) or NaN (Not a Number, produced by invalid operations like 0/0).
- Denormalized Numbers: Values with an exponent of zero but a nonzero fraction. These allow for “gradual underflow” to bridge the gap between 0 and the smallest normalized representable number.
- Exceptions:
- Overflow: Occurs when a positive exponent exceeds the maximum value of the exponent field.
- Underflow: Occurs when a negative exponent falls below the minimum value representable in the exponent field.
- RISC-V software checks for these exceptions using the floating-point control and status register (fcsr) instead of relying on hardware interrupts.
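The double-precision layout above can be verified by decoding a float's raw bits, illustrating the 1023 bias and the implicit leading 1 (the function name is illustrative; the bit extraction uses Python's standard struct module):

```python
# Decode a double's bit fields: sign, biased exponent, and fraction.
import struct

def decode_double(x):
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF            # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)          # 52-bit fraction
    significand = 1 + fraction / 2**52         # restore the implicit leading 1
    return sign, exponent - 1023, significand  # unbias the exponent

print(decode_double(-6.25))   # (1, 2, 1.5625): -6.25 = (-1)^1 * 1.5625 * 2^2
```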
Standardized representation formats explicitly dictate the specialized algorithms required to execute arithmetic correctly on these floating-point data structures.
Floating-Point Arithmetic Algorithms
- Floating-Point Addition:
- Align the binary point: Shift the significand of the operand with the smaller exponent to the right until its exponent matches the larger exponent.
- Add the significands.
- Normalize the sum: Shift the sum right or left and increment or decrement the exponent. Check for exponent overflow/underflow.
- Round the significand to fit the hardware field size. If rounding causes the sum to become unnormalized, repeat step 3.
- Floating-Point Multiplication:
- Calculate the new exponent by adding the biased exponents of both operands, subtracting the bias once to correct the sum.
- Multiply the significands.
- Normalize the product, shifting it right and incrementing the exponent. Check for overflow/underflow.
- Round the significand. Re-normalize and check exponent bounds if necessary.
- Set the sign: Produce a positive sign bit if both operands have identical signs; produce a negative sign bit if the signs differ.
- RISC-V Hardware Implementation:
- RISC-V maps floating-point operations onto a separate set of 32 floating-point registers (f0–f31), doubling the total register bandwidth of the architecture.
- Floating-point ALU instructions append .s for single precision or .d for double precision (e.g., fadd.s, fmul.d).
- Comparison operations (feq, flt, fle) resolve conditions into integer registers, which are then passed back to standard integer branch instructions (beq, bne) previously mapped in the datapath.
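The addition steps above can be sketched on toy operands, each held as an (exponent, integer significand) pair; the 24-bit significand width and function name are assumptions for illustration, and the extra guard/round bits real hardware keeps are omitted:

```python
# Toy floating-point addition for normalized positive operands:
# align the binary points, add the significands, then normalize.
SIG_BITS = 24   # assumed single-precision-style significand width

def fp_add(e1, s1, e2, s2):
    if e1 < e2:                                  # make operand 1 the larger
        e1, s1, e2, s2 = e2, s2, e1, s1
    s2 >>= (e1 - e2)                             # 1. align: shift smaller right
    total = s1 + s2                              # 2. add the significands
    while total >= 1 << SIG_BITS:                # 3. normalize: shift right and
        total >>= 1                              #    increment the exponent
        e1 += 1                                  # (rounding bits omitted here)
    return e1, total

# 1.5 + 0.75: significand 1.5 is 3 << 22 with the binary point after bit 23.
print(fp_add(0, 3 << 22, -1, 3 << 22))   # (1, 9437184), i.e. 1.125 * 2^1 = 2.25
```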
Mathematical operations on hardware approximations inevitably introduce calculation errors, necessitating rigid rounding protocols to ensure precision.
Precision, Rounding, and Accuracy
Because computer arithmetic utilizes limited precision to represent infinite real numbers, floating-point addition is not associative.
- Extra Precision Bits: Maintained automatically during intermediate calculations to improve accuracy before truncation.
- Guard bit: The first extra bit kept to the right of the significand.
- Round bit: The second extra bit kept to the right.
- Sticky bit: Set to 1 if any bits to the right of the round bit are nonzero, distinguishing halfway tie-breaking cases during rounding.
- Units in the Last Place (ulp): The standard metric for floating-point accuracy, representing the number of bits in error in the least significant bits of the significand. IEEE 754 arithmetic guarantees accuracy within 0.5 ulp.
- Rounding Modes: Hardware supports rounding up, rounding down, truncation, and rounding to nearest even (the default, which guarantees a 0 in the least significant bit for halfway cases).
- Fused Multiply Add (FMA): Executes a multiplication followed by an addition (a × b + c) with only a single rounding step performed at the very end of the sequence, yielding higher precision than independent multiply and add instructions.
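Both the non-associativity of limited-precision addition and round-to-nearest-even can be observed directly in Python, whose floats are IEEE 754 doubles:

```python
# Non-associativity: reordering changes the result because 1.0 is lost
# when its significand is aligned against 1e16.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0

# Round-to-nearest-even: 2^53 + 1 is exactly halfway between two
# representable doubles, so the tie rounds to the even neighbor, 2^53.
print(2.0**53 + 1.0)   # 9007199254740992.0
```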
While dedicated floating-point units provide precise single-thread calculations, the abundance of independent arithmetic operations required by graphics and dense matrices drives the need for data-level execution models.
Subword Parallelism and SIMD
- Subword Parallelism: Datapaths achieve simultaneous operations on short vectors of data by partitioning the carry chains within a wide ALU (e.g., a 128-bit adder split to compute four 32-bit or eight 16-bit operands concurrently).
- Originating to accelerate graphics and audio computations (processing independent pixel colors or audio samples), this acts as a form of Single Instruction, Multiple Data (SIMD) processing.
- x86 Extensions:
- SSE (Streaming SIMD Extensions): Expands architecture with 128-bit XMM registers for packed single-precision and double-precision floating-point execution.
- AVX (Advanced Vector Extensions): Doubles vector execution capacity utilizing 256-bit YMM registers, allowing a single instruction to specify eight 32-bit floating-point operations or four 64-bit floating-point operations in parallel.
- Matrix Multiply Efficiency: By leveraging subword parallel load intrinsic functions and broadcast memory instructions, operations like Double Precision General Matrix Multiply (DGEMM) drastically cut instruction counts and execute calculations concurrently across aligned register arrays.
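The carry-chain partitioning idea can be sketched in software (a technique often called SWAR, "SIMD within a register"): four 16-bit lanes are packed into one 64-bit word and added with the carries blocked at lane boundaries. The names here are illustrative:

```python
# SWAR addition: add four 16-bit lanes in one 64-bit word, keeping carries
# from crossing lane boundaries (the software analogue of a partitioned ALU).
LANES, LANE_BITS = 4, 16
HIGH = 0x8000_8000_8000_8000          # top bit of every 16-bit lane
LOW = 0x7FFF_7FFF_7FFF_7FFF           # the remaining 15 bits of every lane

def packed_add16(x, y):
    partial = (x & LOW) + (y & LOW)   # low 15 bits per lane: no cross-lane carry
    return partial ^ ((x ^ y) & HIGH) # top bit = XOR of inputs and carry-in

def pack(vals):
    out = 0
    for i, v in enumerate(vals):
        out |= (v & 0xFFFF) << (LANE_BITS * i)
    return out

def unpack(word):
    return [(word >> (LANE_BITS * i)) & 0xFFFF for i in range(LANES)]

print(unpack(packed_add16(pack([0xFFFF, 1, 2, 3]), pack([1, 2, 3, 4]))))
# [0, 3, 5, 7] -- the 0xFFFF + 1 wrap-around stays inside its own lane
```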