4 Arithmetic

Addition: Operates bit-by-bit right to left, passing carry bits to the next digit.
Subtraction: Executed as addition: the second operand is negated (two's complement: invert every bit and add 1) and then added to the first.
Overflow: Occurs when a result requires more bits than the hardware word size allows.
- Adding operands with different signs, or subtracting operands with the same sign, cannot overflow.
- Overflow occurs when two positives sum to a negative, or two negatives sum to a positive.
- Hardware detects it when the carry into the sign bit disagrees with the carry out of the sign bit.
- Unsigned overflow is checked in software by testing whether the sum is less than either addend.
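The overflow rules above can be checked on a small word size. The following is a minimal sketch, assuming an 8-bit word by default; the helper names (`add_signed`, `sub_signed`, `unsigned_add_overflows`) are hypothetical and chosen only for illustration. Subtraction is modeled the way an adder would do it: invert the second operand and feed a carry-in of 1.

```python
def _add_bits(ua, ub, carry_in, bits):
    """Add two bit patterns; return (result pattern, signed overflow flag)."""
    mask = (1 << bits) - 1
    low = mask >> 1                      # mask for the bits below the sign bit
    total = ua + ub + carry_in
    into_sign = ((ua & low) + (ub & low) + carry_in) >> (bits - 1)
    out_of_sign = total >> bits
    # Signed overflow: carry into the sign bit differs from carry out of it.
    return total & mask, into_sign != out_of_sign


def _to_signed(pattern, bits):
    """Interpret a bit pattern as a two's complement value."""
    return pattern - (1 << bits) if pattern >> (bits - 1) else pattern


def add_signed(a, b, bits=8):
    mask = (1 << bits) - 1
    result, overflow = _add_bits(a & mask, b & mask, 0, bits)
    return _to_signed(result, bits), overflow


def sub_signed(a, b, bits=8):
    # a - b computed as a + (~b) + 1, i.e. adding the two's complement of b.
    mask = (1 << bits) - 1
    result, overflow = _add_bits(a & mask, ~b & mask, 1, bits)
    return _to_signed(result, bits), overflow


def unsigned_add_overflows(a, b, bits=8):
    # Software check: the wrapped sum is smaller than an addend.
    wrapped = (a + b) & ((1 << bits) - 1)
    return wrapped < a


print(add_signed(100, 50))    # two positives wrap to a negative: (-106, True)
print(add_signed(100, -50))   # different signs cannot overflow: (50, False)
print(sub_signed(5, 7))       # (-2, False)
```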

The product of an $n$-bit and an $m$-bit operand requires $n + m$ bits to avoid overflow.

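Sequential algorithm: Uses a 128-bit Product register and a 64-bit ALU. The multiplier initially occupies the lower half of the Product register. Each iteration: if the LSB of the Product register (the current multiplier bit) is 1, add the multiplicand to the upper half of the product; then shift the Product register right by 1. Repeat 64 times.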
Fast multiplication: Unrolls the loop into 63 adders arranged as a parallel tree, reducing delay to $\log_2(64) = 6$ sequential add times. Further improved with carry-save adders and pipelining.
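Both multiplication schemes above can be sketched in a few lines. This is an illustrative model for unsigned operands only, not a description of any particular hardware; the function names and the `bits` parameter are assumptions. The tree version also counts adder levels and adders so the "63 adders, 6 add times" figures can be checked.

```python
def multiply_sequential(multiplicand, multiplier, bits=64):
    """Shift-and-add multiply using one 2*bits-wide product register."""
    mask = (1 << bits) - 1
    multiplicand &= mask
    product = multiplier & mask          # multiplier starts in the lower half
    for _ in range(bits):
        if product & 1:                  # LSB is the current multiplier bit
            product += multiplicand << bits   # add into the upper half
        product >>= 1                    # shift the whole register right
    return product


def multiply_tree(multiplicand, multiplier, bits=64):
    """Sum all partial products pairwise; return (product, levels, adders)."""
    partials = [
        (multiplicand << i) if (multiplier >> i) & 1 else 0
        for i in range(bits)
    ]
    levels = adders = 0
    while len(partials) > 1:
        paired = []
        for i in range(0, len(partials) - 1, 2):
            paired.append(partials[i] + partials[i + 1])
            adders += 1
        if len(partials) % 2:            # odd element passes through unchanged
            paired.append(partials[-1])
        partials = paired
        levels += 1
    return partials[0], levels, adders


print(multiply_sequential(3, 5, bits=4))   # 15
print(multiply_tree(3, 5)[1:])             # (6, 63)
```

In the tree, every level halves the number of values left to add, which is where the $\log_2(64)$ delay comes from.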

$\text{Dividend} = \text{Quotient} \times \text{Divisor} + \text{Remainder}$. The remainder carries the same sign as the dividend.

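Restoring division algorithm: The dividend starts in the Remainder register and the divisor in the upper half of a 128-bit Divisor register. Each iteration: subtract the Divisor from the Remainder. If the result is ≥ 0, shift the Quotient left and set the new LSB to 1. If it is < 0, restore the Remainder by adding the Divisor back, shift the Quotient left, and set the new LSB to 0. Then shift the Divisor right by 1. Repeat 65 times.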
Fast division (SRT): Uses lookup tables on the upper bits of the dividend and remainder to predict multiple quotient bits per step, correcting mispredictions in subsequent passes. Unlike multiplication, a parallel adder tree cannot be used — the sign of each difference must be known before the next step.
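The restoring division loop described above can be modeled as follows. This is a minimal sketch for unsigned operands, assuming the divisor starts in the upper half of a double-width register and is shifted right once per iteration; `divide_restoring` and `divide_signed` are hypothetical names. The signed wrapper applies the sign rule stated earlier: the remainder takes the sign of the dividend.

```python
def divide_restoring(dividend, divisor, bits=64):
    """Unsigned restoring division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    remainder = dividend
    div = divisor << bits                # divisor starts in the upper half
    quotient = 0
    for _ in range(bits + 1):            # 65 iterations for 64-bit operands
        remainder -= div
        if remainder >= 0:
            quotient = (quotient << 1) | 1
        else:
            remainder += div             # restore the previous remainder
            quotient = quotient << 1
        div >>= 1
    return quotient, remainder


def divide_signed(dividend, divisor, bits=64):
    """Divide magnitudes, then fix signs so the remainder matches the dividend."""
    q, r = divide_restoring(abs(dividend), abs(divisor), bits)
    if (dividend < 0) != (divisor < 0):
        q = -q
    if dividend < 0:
        r = -r
    return q, r


print(divide_restoring(7, 2, bits=4))   # (3, 1)
print(divide_signed(-7, 2))             # (-3, -1), since -7 = (-3)(2) + (-1)
```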

Values are encoded in normalized scientific notation: one nonzero digit to the left of the binary point.

$(- 1)^{s} \times (1 + Fraction) \times 2^{(Exponent - Bias)}$

Format	Sign	Exponent	Fraction	Bias
Single (32-bit)	1 bit	8 bits	23 bits	127
Double (64-bit)	1 bit	11 bits	52 bits	1023

Implicit leading 1: The normalized form always has a leading 1, so it is omitted, giving 24 and 53 bits of effective precision respectively.
Biased exponent: Stored as an unsigned value offset by the bias, so floating-point numbers sort correctly using integer comparators.
Special values:
- Exponent all 0s, fraction all 0s → exact zero.
- Exponent all 0s, nonzero fraction → denormalized: gradual underflow between zero and the smallest normalized number.
- Exponent all 1s → $\pm \infty$ (zero fraction) or NaN (nonzero fraction, from invalid operations like $0/0$).
Overflow / Underflow: Overflow when the exponent exceeds the maximum; underflow when it goes below the minimum.
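To make the field layout and special values concrete, here is a small sketch that splits a single-precision value into its three fields and classifies it. It relies on Python's standard `struct` module to obtain the 32-bit pattern; `decode_single` and `value_of_normal` are hypothetical helper names.

```python
import struct


def decode_single(x):
    """Return (sign, biased exponent, fraction, kind) for x as a 32-bit float."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31
    exponent = (word >> 23) & 0xFF       # 8-bit biased exponent
    fraction = word & 0x7FFFFF           # 23-bit fraction
    if exponent == 0xFF:
        kind = "infinity" if fraction == 0 else "nan"
    elif exponent == 0:
        kind = "zero" if fraction == 0 else "denormal"
    else:
        kind = "normal"
    return sign, exponent, fraction, kind


def value_of_normal(sign, exponent, fraction):
    """Apply (-1)^s * (1 + Fraction) * 2^(Exponent - Bias) with Bias = 127."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)


print(decode_single(-0.75))                 # (1, 126, 4194304, 'normal')
print(value_of_normal(1, 126, 0x400000))    # -0.75
print(decode_single(2.0 ** -149))           # (0, 0, 1, 'denormal')
```

Here $-0.75 = -1.1_2 \times 2^{-1}$, so the stored exponent is $-1 + 127 = 126$ and only the fraction bit for $2^{-1}$ is set.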

Addition:

1. Align: shift the significand with the smaller exponent right until exponents match.
2. Add the significands.
3. Normalize: shift and adjust the exponent; check overflow/underflow.
4. Round to fit the field width; if rounding denormalizes, repeat step 3.
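The addition steps above can be traced with a toy format. The sketch below is an assumption-laden model, not IEEE 754 itself: it handles only positive, normalized operands, uses a 4-bit significand (`P`), keeps three extra bits (`EXTRA`) during the computation, rounds to nearest-even, and leaves the exponent unbounded so overflow/underflow checks are omitted. A value is the pair `(sig, exp)` meaning `sig * 2**(exp - (P - 1))`.

```python
P = 4        # significand bits including the leading 1 (toy width)
EXTRA = 3    # extra low-order bits kept during the computation


def fp_add(sig_a, exp_a, sig_b, exp_b):
    """Add two positive normalized values; return (significand, exponent)."""
    if exp_a < exp_b:                    # make a the operand with the larger exponent
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    a = sig_a << EXTRA
    b = sig_b << EXTRA

    # Align: shift the smaller operand right, folding lost bits into the last bit.
    shift = exp_a - exp_b
    lost = b & ((1 << shift) - 1)
    b = (b >> shift) | (1 if lost else 0)

    # Add the significands.
    total = a + b
    exp = exp_a

    # Normalize: a carry out means one right shift and an exponent increment.
    if total >> (P + EXTRA):
        total = (total >> 1) | (total & 1)
        exp += 1

    # Round to nearest, ties to even, using the extra bits.
    kept = total >> EXTRA
    rest = total & ((1 << EXTRA) - 1)
    half = 1 << (EXTRA - 1)
    if rest > half or (rest == half and kept & 1):
        kept += 1
    if kept >> P:                        # rounding overflowed: renormalize
        kept >>= 1
        exp += 1
    return kept, exp


def to_float(sig, exp):
    return sig * 2.0 ** (exp - (P - 1))


# 1.000 x 2^-1 (0.5) + 1.110 x 2^-2 (0.4375) = 1.111 x 2^-1 (0.9375)
print(fp_add(0b1000, -1, 0b1110, -2))    # (15, -1)
# 1.111 x 2^0 + 1.000 x 2^-4 is a halfway case; it rounds up to 1.000 x 2^1
print(fp_add(0b1111, 0, 0b1000, -4))     # (8, 1)
```

The second call shows why the normalize step may need to run again: rounding carried out of the 4-bit significand, so the result was shifted once more.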

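Multiplication:

1. Add the biased exponents and subtract the bias once, so the result is biased only once.
2. Multiply the significands.
3. Normalize: shift and adjust the exponent; check overflow/underflow.
4. Round to fit the field width; if rounding denormalizes, repeat step 3.
5. Set the sign: positive if the operands' signs match, negative otherwise.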

Floating-point addition is not associative due to limited precision.
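A short demonstration of the non-associativity, using Python's built-in floats (which are IEEE 754 doubles on common platforms; that is an assumption about the runtime). The values are chosen so that the small operand is absorbed when it is added to a very large one first.

```python
x, y, z = -1.5e38, 1.5e38, 1.0

left = (x + y) + z     # x + y is exactly 0.0, so z survives
right = x + (y + z)    # y + z rounds back to y, so z is lost

print(left)    # 1.0
print(right)   # 0.0
```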

Guard, Round, Sticky bits: Three extra bits kept during intermediate computation. Guard is the first bit past the significand, Round is the second, Sticky is set if any further bits are nonzero. Together they enable correct rounding at the final step.
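ULP (Units in the Last Place): Standard accuracy metric: the number of least-significant significand bits by which a result differs from the exact value. With round-to-nearest, IEEE 754 guarantees that the basic operations are accurate to within 0.5 ulp.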
Rounding modes: Round toward $+\infty$ (up), round toward $-\infty$ (down), round toward zero (truncate), and round-to-nearest-even (the default, which breaks halfway ties toward an even LSB).
FMA (Fused Multiply-Add): Computes $a + (b \times c)$ with a single rounding step at the end, giving higher precision than a separate multiply followed by add.
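The effect of the single rounding step can be seen with an emulation. The sketch below does not call a hardware FMA; it assumes exact rational arithmetic from the standard `fractions` module as a stand-in, rounding to a double only once at the end. The names `fma_emulated` and `multiply_then_add` are hypothetical.

```python
from fractions import Fraction


def fma_emulated(a, b, c):
    """a + b*c computed exactly, then rounded once to the nearest double."""
    return float(Fraction(a) + Fraction(b) * Fraction(c))


def multiply_then_add(a, b, c):
    """Ordinary evaluation: the product is rounded before the add."""
    return a + b * c


b = c = 1.0 + 2.0 ** -30
a = -(1.0 + 2.0 ** -29)

# Exact product is 1 + 2^-29 + 2^-60; the 2^-60 term is lost if the
# product is rounded to a double before a is added.
print(multiply_then_add(a, b, c))   # 0.0
print(fma_emulated(a, b, c))        # 8.673617379884035e-19, i.e. 2**-60
```

Recent Python versions also provide `math.fma` (added in 3.13, as far as I know), which would be the natural choice where it is available.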
