TURING MACHINE > VON NEUMANN MACHINE
>Computer as a switching machine
>Logic gates as electronic switches: calculators, decoders, multiplexers, registers
>Binary switches representing numbers in decimal and binary systems

SOME QUESTIONS YOU ASK? >What are the functions of a computer? What is it supposed to do? >Modular decomposition of the computer system? >Interconnection between the modules? >How to proceed with the design of computer systems? \par Instruction set architecture, organization or microarchitecture, hardware: same architecture but different organization are the AMD and Intel x86 processors; same architecture and organization but different hardware are the server, desktop, mobile and embedded versions \par #By Morris Mano

Q: How is the internal hardware architecture of a digital computer specified? REGISTER TRANSFER LANGUAGE: >The internal hardware organization of a digital computer is best defined by specifying: -the set of registers it contains and their function -the sequence of microoperations performed on the binary information stored in the registers -the control that initiates the sequence of microoperations

>P: R2 <- R1; P is a control function generated in control section
	-P is activated by the rising edge of a clock pulse at time t
	-The next transition, at time t+1, finds the load input active, and the data inputs of R2 are then loaded into the register in parallel
	-T: R2 <- R1 (likewise with a timing signal T as the control function)
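>A minimal C sketch of this clocked transfer (the register names and the 16-bit width are illustrative, not Mano's notation):
\begin{lstlisting}
#include <stdint.h>
#include <stdio.h>

/* The control function P gates a parallel load of R2 from R1
   on the next rising clock edge. */
typedef struct { uint16_t R1, R2; } Regs;

void rising_edge(Regs *r, int P) {
    if (P)             /* load input found active during this edge */
        r->R2 = r->R1; /* parallel transfer R2 <- R1 */
}

int main(void) {
    Regs r = { .R1 = 0xABCD, .R2 = 0 };
    rising_edge(&r, 0);          /* P inactive: no transfer    */
    rising_edge(&r, 1);          /* P active: R2 loaded at t+1 */
    printf("R2 = %04X\n", r.R2); /* prints ABCD */
    return 0;
}
\end{lstlisting}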

>Common bus system for multi-register computer

>One way is multiplexers. The zeroth bit of each register is fed into the inputs of one multiplexer, whose output is the zeroth bit of the bus (and so on for every bit position). The select lines of the multiplexers choose the source register whose binary information is then placed on the bus. The number of multiplexers equals the register width; the number of inputs per multiplexer equals the number of registers

>A bus system can be constructed with three-state gates instead of multiplexers. The zeroth bit of each register is fed into its own buffer, and the buffer outputs are tied together to form the zeroth bit of the bus (again, one group per bit position). A decoder enables the buffers of one source register, whose binary information is then placed on the bus. There is one group of buffers per bit of the register width, one buffer per register in each group, and a single decoder with one output per register
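>A C sketch of the multiplexer version (assumes 4 registers of 16 bits; the function names are mine):
\begin{lstlisting}
#include <stdint.h>

/* Sketch of a common bus built from 16 4x1 multiplexers: one mux per
   bit position, all sharing the same select lines. Selecting each
   register bit-by-bit is equivalent to selecting the whole register. */
enum { NREGS = 4, WIDTH = 16 };

uint16_t bus_read(const uint16_t reg[NREGS], unsigned select) {
    uint16_t bus = 0;
    for (int k = 0; k < WIDTH; k++) {
        /* mux k: route bit k of the selected source register */
        uint16_t bit = (reg[select] >> k) & 1u;
        bus |= bit << k;
    }
    return bus; /* same as reg[select], built one mux at a time */
}
\end{lstlisting}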

>ARITHMETIC CIRCUIT: FIG (0)
>THE ALU: FIG (0)

BUILD A REAL CPU NOW:

CPU ORGANIZATION:

Questions:
	>Okay, so microoperations, registers and controls are what make a computer?

INSTRUCTION CODES:
>Microoperations are defined by the instruction codes. An instruction code specifies the operation and also the registers or memory words where the operands are to be found, as well as the register or memory word where the result is to be stored.
	-...Operation part...Address part...
	-The second part can be: no address, direct, or indirect
	
	-we choose a word size of 16 bits (hence the instruction size as well)
			15 (direct or indirect) 14..12 (opcode) :: 11..0 (address)
			
	-So 2^12 memory addresses are possible and 2^3 instructions, while the non-memory formats allow many more instructions
	-Non-memory instructions are identified by 111 in the opcode part (and then, if bit 15 is 1, it is an I/O instruction, else a register-reference one)
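	-A C sketch of this field split (struct and names are mine; only the bit positions come from the format above):
\begin{lstlisting}
#include <stdint.h>

/* Field extraction for the 16-bit instruction word:
   bit 15 = I (direct/indirect), bits 14..12 = opcode, bits 11..0 = address. */
typedef struct { unsigned I, opcode; uint16_t address; } Decoded;

Decoded decode(uint16_t inst) {
    Decoded d;
    d.I       = (inst >> 15) & 0x1;    /* addressing-mode bit      */
    d.opcode  = (inst >> 12) & 0x7;    /* 2^3 = 8 opcode patterns  */
    d.address =  inst        & 0x0FFF; /* 2^12 = 4096 memory words */
    /* opcode 111: no memory address; bit 15 then selects I/O (1)
       versus register-reference (0) instructions instead */
    return d;
}
\end{lstlisting}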
	
TIMING AND CONTROL:
>The clock pulses are applied to all registers and to the control unit, but they do not change the state of a register unless the register is enabled by a control signal. Two types of control organization:
>(HARDWIRED) The control unit consists of two decoders, a sequence counter, and a number of control logic gates. 
	>The IR, divided into three parts, feeds the control logic gates: bit 15 through a flip-flop (I), bits 14-12 through a 3x8 decoder (D0-D7), and bits 11-0 directly
	>The 4-bit sequence counter can count in binary from 0 to 15; its output is decoded into 16 timing signals T0 through T15
	>SC responds to the positive transition of the clock. The first pulse clears SC to 0, which activates timing signal T0 out of the decoder; T0 triggers only those registers whose control inputs are connected to it. SC is incremented with every clock transition, producing the sequence of timing signals T0, T1, ...
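	>A C sketch of the SC-plus-decoder behaviour (one-hot timing signals; names are mine):
\begin{lstlisting}
#include <stdint.h>

/* The 4-bit sequence counter SC feeds a 4x16 decoder, so exactly one
   timing signal T0..T15 is active per clock cycle. */
uint16_t timing_signals(unsigned SC) {
    return (uint16_t)(1u << (SC & 0xF)); /* one-hot: bit i is Ti */
}

/* Each positive clock transition: SC <- SC+1, or SC <- 0 when cleared. */
unsigned tick(unsigned SC, int clear) {
    return clear ? 0 : (SC + 1) & 0xF;
}
\end{lstlisting}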
	
COMPUTER REGISTERS:
>The instruction code dictates a memory unit of capacity 4096 words each containing 16-bits. 
	PC(12bits), AR(12), IR(16), AC(16), DR(16), TR(16), INPR(8), OUTR(8)
>Connected using the bus as before


INSTRUCTION CYCLE:
>Each instruction cycle consists of the phases: fetch, decode, (indirect address read,) execute

1. FETCH: The PC is loaded with the address of the first instruction in the program. SC is cleared to 0, providing the decoded timing signal T0. With each clock pulse, SC's timing signal steps through T0, T1, ... and so on. 
	T0: AR <- PC
	T1: IR <- M[AR], PC <- PC+1
	T2: I <- IR(15), D0...D7 (CONTROL) <- decode of IR(14-12), AR <- IR(0-11)

2. DECODE: During time T3, the control unit determines the type of instruction that was just read from memory. 
	Checks D7 (decoded from opcode):
		-1: 
			Checks I:
				-0: Executes register-reference instruction, SC <- 0 
				-1: Executes input-output instruction, SC <- 0
		-0:
			Checks I:
				-0: Nothing
				-1: AR <- M[AR]
			>Execute memory-reference instruction, SC <- 0
			
	>Register-Reference Instructions:
		-D7I'T3 IR(i): (bit i of IR selects one of the 12 register instructions)
		-Eg:
			D7I'T3 IR(11) (clear AC): AC <- 0, SC <- 0
		-Ie executed at T3 itself, no more cycles
	
	>Memory-Reference Instructions:
		-Allows for 7 memory operations. 
		-The effective address of the operand is already in the address register AR after T3 (or the indirect read); further cycles then actually execute the instruction
		-Eg: 
			D0T4: DR <- M[AR]
			D0T5: AC <- AC+DR, E <- Cout, SC <- 0
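>A C sketch of the T3 decision tree above (enum and names are mine; the D7/I tests follow the flow just described):
\begin{lstlisting}
/* Sketch of the T3 decision: D7 (decoded opcode 111) and the
   indirect bit I select which branch of the instruction cycle runs. */
typedef enum { REGISTER_REF, IO_REF, MEMORY_REF } Kind;

Kind decode_T3(int D7, int I, int *needs_indirect_fetch) {
    *needs_indirect_fetch = 0;
    if (D7)                      /* opcode == 111: no memory operand  */
        return I ? IO_REF : REGISTER_REF; /* both finish at T3, SC <- 0 */
    if (I)                       /* memory-reference, indirect        */
        *needs_indirect_fetch = 1;        /* AR <- M[AR] at T3         */
    return MEMORY_REF;           /* executes from T4 on               */
}
\end{lstlisting}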
	
>INTERRUPT:
	-The interrupt enable flip-flop IEN can be set and cleared with instructions; when IEN is 1, the computer can be interrupted.
	-An interrupt flip-flop R is included in the computer: when R = 0 the computer goes through an instruction cycle, when R = 1 it goes through the interrupt cycle:
		1. The return address in PC is stored where it can be found later when the program returns to the instruction at which it was interrupted: here it is first moved to TR and then into memory location 0
		2. Control then inserts address 1 into PC and clears IEN so that no more interrupts can occur until the interrupt request from the flag has been serviced.
		3. The branch instruction at address 1 transfers the program to the service routine. Once this is done, ION is executed to set IEN back to 1.
	
	>If IEN is set and an interrupt is received, then hardware sets R to 1
	RT0: AR <- 0, TR <- PC
	RT1: M[AR] <- TR, PC <- 0
	RT2: PC <- PC+1, IEN <- 0, R <- 0, SC <- 0
	
CIRCUITRY:
	>Now devise circuitry for each register looking at the instructions wherever it is mentioned
	
FLOW CHART:
	>FIG (1)		
STATE DIAGRAM:
	>FIG (2)

MICROPROGRAMMED CONTROL: >The function of a control unit in a digital computer is to initiate sequences of microoperations. 1. When the control signals are generated by hardware using conventional logic design techniques, the control unit is said to be hardwired. In a hardwired organization, control logic is implemented with gates, registers, and other circuits. Its advantage is that it can be optimized to produce a fast mode of operation, but it requires changes in the wiring among the various components if the design has to be modified. 2. A control unit whose control values are stored in memory is called a microprogrammed control unit. In a microprogrammed organization, the control information is stored in a control memory which is programmed to initiate the required sequence of microoperations. Modifications can be done by updating the microprogram in control memory. Each word in control memory, which is a ROM, contains a microinstruction. The microinstruction specifies one or more microoperations for the system. A sequence of microinstructions constitutes a microprogram.

BLOCK DIAGRAM:
>Next-address generator -> Control address register -> Control memory -> Control data register
>The control memory address register specifies the address of the microinstruction, and the control data register holds the microinstruction read from memory. The microinstruction contains a control word that specifies one or more microoperations for the data processor.
>Once these operations are executed, the control must determine the next address. The location of the next microinstruction may be the one next in sequence, or it may be located somewhere else in the control memory. For this reason it is necessary to use some bits of the present microinstruction to control the generation of the address of the next microinstruction. The next address may also be a function of external input conditions. 
>This configuration requires a two-phase clock, with one clock applied to the address register and the other to the data register
>The system can operate without the control data register by applying a single-phase clock to the address register. The control word and next-address information are then taken directly from the control memory.

ADDRESS SEQUENCING
>Microinstructions are stored in control memory in groups, with each group specifying a routine. Each computer instruction has its own microprogram routine in control memory to generate the microoperations that execute it
>The first address loaded is usually that of the first microinstruction of the instruction fetch routine. The fetch routine may be sequenced by incrementing the control address register through the rest of its microinstructions. At the end of the fetch routine, the instruction is in the instruction register of the computer. (INCREMENTAL)
>Next, the control must go through the routine that determines the effective address of the operand. A machine instruction may have bits that specify various addressing modes. The effective address computation routine in control memory can be reached through a branch microinstruction, which is conditioned on the status of the mode bits of the instruction. (BRANCH)
>The next step is to generate the microoperations that execute the instruction fetched from memory. The microoperation steps to be generated in the processor registers depend on the operation code part of the instruction. The transformation from the instruction code bits to an address in control memory where the routine is located is referred to as a mapping process. A mapping procedure is a rule that transforms the instruction code into a control memory address (MAPPING)
>Microprograms that employ subroutines require an external register for storing the return address. When the execution of the instruction is completed, control must return to the fetch routine. (CALL AND RETURN)
>The address sequencing capabilities required in control memory are:
	-Incrementing of the control address register
	-Unconditional and conditional branch depending on status bit conditions
	-A mapping process from the bits of the instruction to an address for control memory
	-A facility for subroutine call and return
>The organizational hardware needed for selecting next microinstruction address:
	
	1) Conditional branching:
		>The status bits, together with the field in the microinstruction that specifies a branch address, control the conditional branch decisions generated in the branch logic. The way is to test the specified condition and branch to the indicated address if the condition is met; otherwise the address register is incremented. Three bits in the microinstruction are used to specify any one of eight status bit conditions; these three bits provide the selection variables for the multiplexer. In this configuration, the microprogram follows one of two possible paths, depending on the value of the selected status bit.
		
	2) Unconditional branching:
		>Can be achieved by fixing the value of one status bit at the input of the multiplexer so that it is always 1.
		
	3) Mapping of instruction:
		>A special type of branch exists when a microinstruction specifies a branch to the first word in control memory where the microprogram routine for an instruction is located. The status bits for this type of branch are the bits in the operation code part of the instruction. One simple mapping process converts the 4-bit operation code to a 7-bit control memory address by placing a 0 in the most significant bit of the address, transferring the four operation code bits, and clearing the two least significant bits of the control address register. This provides for each computer instruction a microprogram routine with a capacity of four microinstructions. One can extend this concept to a more general mapping rule by using a ROM or PLD to specify the mapping function.
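		>A one-line C sketch of this particular mapping rule (the function name is mine):
\begin{lstlisting}
/* The simple mapping rule: 7-bit control address = 0 | opcode(4) | 00,
   giving each of the 16 instructions a 4-word routine slot. */
unsigned map_opcode(unsigned opcode) {
    return (opcode & 0xF) << 2; /* MSB is 0, two LSBs cleared */
}
/* e.g. opcode 0101 -> address 0010100 (decimal 20) */
\end{lstlisting}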
		
	4) Subroutines:
		>Many microprograms contain identical sections of code. Microinstructions can be saved by employing subroutines for the common sections of microcode. For eg: the sequence of microoperations needed to generate the effective address of the operand is common to all memory-reference instructions. Microprograms that use subroutines must have a provision for storing the return address during a subroutine call. This is accomplished by placing the incremented output of the control address register into a subroutine register and branching to the beginning of the subroutine. The subroutine register then becomes the source for transferring the return address back to the main routine. The best way to structure a register file that stores addresses for subroutines is to organize the registers as a last-in, first-out (LIFO) stack.
		
	5) FIG (3)
		
>NOW THE ACTUAL STUFF:
The computer will have 16 memory-reference instructions only (4-bit opcode).
	I(1), OPCODE(4), ADDR(11)

MICROINSTRUCTION FORMAT:
	>The 20 bits of the microinstruction are divided into functional parts. The three fields F1, F2 and F3 specify microoperations for the computer. The CD field selects status bit conditions. The BR field specifies the type of branch to be used. The AD field contains a branch address. The address field is seven bits wide, since the control memory has 2^7 words.
		F1(3), F2(3), F3(3), CD(2), BR(2), AD(7)
		
	6. The microoperations are subdivided into three fields of three bits each. The three bits in each field are encoded to specify seven distinct microoperations, which gives a total of 21 microoperations. No more than three microoperations can be chosen for one microinstruction, one from each field. If fewer than three are used, one or more fields take the binary code 000 for no operation. It is important to realize that two or more conflicting microoperations cannot (should not) be specified simultaneously.
	
	7. The CD field consists of two bits which are encoded to specify four status bit conditions. The first condition is always a 1, so that a reference to CD = 00 will always find the condition to be true. When this condition is used with the BR field, it provides an unconditional branch operation. 
	
	8. The BR field consists of two bits which, in conjunction with the AD field, choose the address of the next microinstruction. JMP (00) and CALL (01) are identical except that a call microinstruction also stores the return address in the subroutine register SBR. Both depend on the value of CD: if the specified status bit is equal to 1, the address in the AD field is transferred to the control address register CAR; else CAR is incremented by 1. The return from subroutine (BR = 10) transfers the return address from SBR to CAR. The mapping from the operation code to an address for CAR happens when BR = 11. Return and mapping are independent of the values in the CD and AD fields.
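	>A C sketch of the field unpacking and the BR next-address choice (struct and names are mine; the bit layout and BR/CD semantics follow the description above, with the mapped address computed outside as in the mapping rule):
\begin{lstlisting}
#include <stdint.h>

/* Field layout of the 20-bit microinstruction, MSB first:
   F1(3) F2(3) F3(3) CD(2) BR(2) AD(7). */
typedef struct { unsigned F1, F2, F3, CD, BR, AD; } MicroInst;

MicroInst unpack(uint32_t w) {
    MicroInst m;
    m.F1 = (w >> 17) & 0x7;  m.F2 = (w >> 14) & 0x7;  m.F3 = (w >> 11) & 0x7;
    m.CD = (w >>  9) & 0x3;  m.BR = (w >>  7) & 0x3;  m.AD =  w        & 0x7F;
    return m;
}

/* Next-address logic: cond is the status bit selected by CD
   (CD = 00 selects the constant 1, giving unconditional branches);
   mapped is 0|opcode|00, computed from the operation code. */
unsigned next_CAR(MicroInst m, unsigned CAR, unsigned *SBR,
                  int cond, unsigned mapped) {
    switch (m.BR) {
    case 0: return cond ? m.AD : CAR + 1;              /* JMP  */
    case 1: if (cond) { *SBR = CAR + 1; return m.AD; } /* CALL */
            return CAR + 1;
    case 2: return *SBR;     /* RET: ignores CD and AD */
    default: return mapped;  /* MAP: routine address from opcode */
    }
}
\end{lstlisting}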
	
	9. FIG (4)
	
THE FETCH ROUTINE:
	>The control memory has 2^7 = 128 words, each of 20 bits, and it is necessary to determine the bit values of each of them. The first 64 words are occupied by the routines for the 16 instructions (four words each). A convenient starting location for the fetch routine is address 64.
	10. AR <- PC
	11. DR <- M[AR], PC <- PC+1
	12. AR <- DR(0-10), CAR(2-5) <- DR(11-14), CAR(0,1,6) <- 0   (the status bit I is DR(15))
	
	>The fetch routine needs three microinstructions, which are placed at control memory addresses 64, 65 and 66.
	>As:	

		[F1]		  [F2]           [F3]    [CD]  [BR]    [ADDR]
		
		ORG 64
	FETCH:	[AR <- PC]        [NOP]          [NOP]   [U]   [JMP]   [NEXT]
		[DR <- M[AR]]     [PC <- PC+1]   [NOP]   [U]   [JMP]   [NEXT]
		[AR <- DR(0-10)]  [NOP]          [NOP]   [x]   [MAP]   [xxxx]
		
	>Others:
	
		ORG 67
	INDRCT:	[DR <- M[AR]]	  [NOP]          [NOP]   [U]   [JMP]    [NEXT]
		[AR <- DR(0-10)]  [NOP]          [NOP]   [x]   [RET]    [xxxx]		
	
		ORG 0
	ADD:	[NOP]		  [NOP] 	 [NOP]   [I]   [CALL]   [INDRCT]
		[DR <- M[AR]]  	  [NOP] 	 [NOP] 	 [U]   [JMP] 	[NEXT]
		[AC <- AC+DR] 	  [NOP] 	 [NOP] 	 [U]   [JMP]	[FETCH]
	
>The control memory output of each subfield must be decoded to provide the distinct microoperations. The outputs of the decoders are connected to the appropriate inputs in the processor unit.

COMPUTER ARITHMETIC: >The algorithms for floating-point operations are slightly more complicated, and their implementation requires circuits to add and subtract, and to compare the signs and magnitudes of the numbers.

1. Booth Multiplication Algorithm:
	>Booth algorithm gives a procedure for multiplying binary integers in signed-2's complement representation. It operates on the fact that a string of 1's in the multiplier from bit weight 2^k to weight 2^m can be treated as 2^(k+1)-2^m.
	>For example:
		-Binary number 001110 (+14) has a string of 1's from 2^3 to 2^1 (k=3, m=1). The number can be represented as:
			+14 = 2^4-2^1 = 16-2
		-Therefore, the multiplication M*14, where M is the multiplicand and 14 the multiplier, can be done as:
			M*14 = M*(2^4)-M*(2^1)
		-Thus the product can be obtained by shifting the binary multiplicand M four times to the left and subtracting M shifted left once.
	>Booth algorithm requires examination of the multiplier bits and shifting of the partial product. Prior to the shifting, the multiplicand may be added to the partial product, subtracted from the partial product, or left unchanged according to the rules:
		1. The multiplicand is subtracted from the partial product upon encountering the first least significant 1 in a string of 1's in the multiplier.
		2. The multiplicand is added to the partial product upon encountering the first 0 (provided that there was a previous 1) in a string of 0's in the multiplier.
		3. The partial product does not change when the multiplier bit is identical to the previous multiplier bit.

	>The algorithm works for positive or negative multipliers in 2's complement representation. This is because a negative multiplier ends with a string of 1's and the last operation will be a subtraction of the appropriate weight. For eg: -14 is represented in 2's complement as 110010 and is treated as -2^4+2^2-2^1	
	HARDWARE:
		>(FIG 12)
		>Initially the multiplicand is in register BR and the multiplier in QR. Registers AC and QR together hold the partial product and the multiplier, and both are shifted to the right together.
		>Q0 designates the least significant bit of the multiplier in register QR. An extra flip-flop Q(-1) is appended to QR to facilitate a double-bit inspection of the multiplier.
		>The sequence counter is initially set to the number of bits in the multiplier and is decremented by 1 after forming each partial product.
		>The two bits of the multiplier in Q0 and Q(-1) are inspected:
			1. If they are 10, the first 1 in a string of 1's has been encountered. This requires subtraction of the multiplicand from the partial product in AC
			2. If they are 01, the first 0 in a string of 0's has been encountered. This requires addition of the multiplicand to the partial product in AC
			3. When the two bits are equal, the partial product does not change
		>The next step is to shift the partial product and the multiplier to the right. This is an arithmetic shift right, which shifts AC and QR to the right and leaves the sign bit in AC unchanged.
		>The sequence counter is decremented and the computational loop is repeated n times.
		>(FIG 13)
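		>A register-level C sketch of the loop just described, for n = 16 (names are mine; it assumes >> on a negative signed int is an arithmetic shift, which mainstream compilers guarantee):
\begin{lstlisting}
#include <stdint.h>
#include <stdio.h>

/* Booth: AC:QR:Q(-1) shift right arithmetically each cycle; BR holds
   the multiplicand, QR initially holds the multiplier. */
int32_t booth_multiply(int16_t multiplicand, int16_t multiplier) {
    int16_t  AC = 0, BR = multiplicand;
    uint16_t QR = (uint16_t)multiplier;
    int Qm1 = 0;                       /* the appended flip-flop Q(-1) */
    for (int sc = 16; sc > 0; sc--) {  /* SC = bits in the multiplier  */
        int Q0 = QR & 1;
        if (Q0 == 1 && Qm1 == 0) AC -= BR; /* first 1 of a string: sub */
        if (Q0 == 0 && Qm1 == 1) AC += BR; /* first 0 after 1's: add   */
        /* arithmetic shift right of AC:QR:Q(-1), AC sign preserved */
        Qm1 = Q0;
        QR  = (QR >> 1) | ((uint16_t)(AC & 1) << 15);
        AC >>= 1;  /* assumed arithmetic shift on signed */
    }
    return ((int32_t)AC << 16) | QR;   /* product sits in AC:QR */
}

int main(void) {
    printf("%d\n", booth_multiply(23, -14)); /* prints -322 */
    return 0;
}
\end{lstlisting}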
		
2. Array Multiplier:
	>Checking the bits of the multiplier one at a time and forming partial products is a sequential operation that requires a sequence of add and shift microoperations. The multiplication of two binary numbers can be done with one microoperation by means of a combinational circuit that forms the product bits all at once. This is a fast way of multiplying two numbers since all it takes is the time for the signals to propagate through the gates that form the multiplication array. However, an array multiplier requires a large number of gates, and for this reason it was not economical until the development of integrated circuits.
	
3. Division: (works for negatives but i dont know how)
	>Division of two fixed-point binary numbers in signed-magnitude representation is done by a process of successive compare, shift and subtract operations. 
	>Example:
		-The divisor B consists of five bits (say 10001) and the dividend A (say 0111000000), of ten bits. 
		-The five most significant bits of the dividend are compared with the divisor. Since the 5-bit number is smaller than B, we try again by taking the six most significant bits of A and comparing this number with B. The 6-bit number is greater than B, so we place a 1 for the quotient bit in the sixth position above the dividend. The divisor is then shifted once to the right and subtracted from the dividend. The difference is called a partial remainder because the division could have stopped here to obtain a quotient of 1 and a remainder equal to the partial remainder. The process continues by comparing a partial remainder with the divisor. If the partial remainder is greater than or equal to the divisor, the quotient bit is 1; the divisor is then shifted right and subtracted from the partial remainder. If the partial remainder is smaller than the divisor, the quotient bit is 0 and no subtraction is needed. The divisor is shifted once to the right in any case.
	>Hardware: 
		-The hardware for implementing the division operation is identical to that required for multiplication and consists of the same components. The dividend is in A and Q and the divisor in B. A constant is set into the sequence counter SC to specify the number of bits in the quotient. 
		-The dividend, or partial remainder, is shifted to the left, thus leaving the two numbers in the required relative position. 
		-Subtraction is achieved by adding A to the 2's complement of B. The information about the relative magnitudes is then available from the end-carry.
		-If E = 1, it signifies that A >= B, a quotient bit 1 is inserted into Q0 and the partial remainder is shifted to the left to repeat the process. 
		-If E = 0, it signifies that A < B so that quotient in Q0 remains 0 (inserted during the shift). 
			Restoring: The value of B is then added to restore the partial remainder in A to its previous value.
			Non-restoring: B is not added but instead, the negative difference is shifted left and then B is added.
	>A divide-overflow condition occurs if the high-order half bits of the dividend constitute a number greater than or equal to the divisor.		
	>FIG 14, FIG 15
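	>A C sketch of the restoring version for unsigned 16-bit magnitudes (names are mine; using a wide A makes the restore step implicit, a deliberate simplification of the E end-carry logic):
\begin{lstlisting}
#include <stdint.h>

/* Restoring division: the dividend sits in A:Q, the divisor in B;
   each step shifts A:Q left and subtracts B when it fits.
   Divisor must be nonzero; overflow cannot occur since A starts at 0. */
typedef struct { uint16_t quotient, remainder; } DivResult;

DivResult restoring_divide(uint16_t dividend, uint16_t divisor) {
    uint32_t A = 0, Q = dividend;
    for (int sc = 16; sc > 0; sc--) {   /* SC = bits in the quotient */
        A = (A << 1) | ((Q >> 15) & 1); /* shift A:Q left            */
        Q = (Q << 1) & 0xFFFF;
        if (A >= divisor) {             /* end carry E = 1: A >= B   */
            A -= divisor;
            Q |= 1;                     /* quotient bit 1 into Q0    */
        }                               /* else E = 0: bit stays 0;
                                           restore is implicit here  */
    }
    return (DivResult){ .quotient = (uint16_t)Q, .remainder = (uint16_t)A };
}
\end{lstlisting}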

4. Floating-point Addition:
		>Arithmetic operations with floating-point numbers are more complicated than with fixed-point numbers; their execution takes longer and requires more complex hardware. 
		>Adding or subtracting two numbers first requires an alignment of the radix point, since the exponent parts must be made equal before adding or subtracting the mantissas. The alignment is done by shifting one mantissa while its exponent is adjusted until it equals the other exponent. 
		>Example:
			 .5372400 * 10^2
			+.1580000 * 10^(-1)
		>The usual alignment procedure is to shift the mantissa that has the smaller exponent to the right by a number of places equal to the difference between the exponents. After this is done, the mantissas can be added:
			 .5372400 * 10^2
			+.0001580 * 10^2
			=.5373980 * 10^2
		>After the mantissas are added or subtracted, the result may be unnormalized. The normalization procedure ensures that the result is normalized prior to its transfer to memory.
		>POST NORMALIZATION: 
		1. Overflow check: If an overflow occurs when the magnitudes are added, it is caught in a flip-flop:
			If E = Emax, then OverflowError;
			Else:
				Right shift AC, E = E+1, then END
		2. Normalization check: If AC does not require normalization, then END
		3. Check for zero: If AC = 0, set E = 0, then END
		4. Underflow check: If E > Emin, then left shift AC, E = E-1, go to 2; else UnderflowError
			
		>FIG 16		
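		>A decimal toy model in C of the alignment and post-normalization, mirroring the example above (7-digit mantissas stored as integers; the representation and names are mine):
\begin{lstlisting}
#include <stdio.h>

/* value = .mantissa * 10^exp, mantissa held as a 7-digit integer */
typedef struct { long mantissa; int exp; } Dec;

Dec fadd(Dec x, Dec y) {
    while (x.exp > y.exp) { y.mantissa /= 10; y.exp++; } /* align smaller */
    while (y.exp > x.exp) { x.mantissa /= 10; x.exp++; }
    Dec s = { x.mantissa + y.mantissa, x.exp };
    if (s.mantissa >= 10000000) {         /* overflow: right shift, E+1 */
        s.mantissa /= 10; s.exp++;
    }
    while (s.mantissa && s.mantissa < 1000000) { /* post-normalize      */
        s.mantissa *= 10; s.exp--;        /* Emin underflow check omitted */
    }
    return s;
}

int main(void) {
    Dec a = { 5372400, 2 }, b = { 1580000, -1 };
    Dec s = fadd(a, b);
    printf(".%07ld * 10^%d\n", s.mantissa, s.exp); /* .5373980 * 10^2 */
    return 0;
}
\end{lstlisting}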

\par NEED A BRIEF INTEGRATION OF EMBEDDED SYSTEM, there is a text file for it somewhere \chapter{Instances of Instruction Sets} \section{RISC-V} \begin{table}[H] \centering \begin{tabular}{|c|c|} \hline Base or extension & Features \\ \hline RV32I & Base 32-bit integer set with 32 registers\\ RV32E & Base 32-bit integer set with 16 registers for embedded systems\\ RV64I & Base 64-bit integer set; all registers are 64-bit\\ M & Adds multiply and divide instructions\\ F & Adds single-precision floating point, includes 32 new 32-bit registers\\ D & Extends FP to double precision, making those registers 64-bit\\ A & Adds atomic instructions for concurrent processing\\ Futures & \\ \hline \end{tabular} \end{table} These extensions alone cannot run an operating system: a supporting environment exists, but modern OSes will not fit on it; older OSes probably would \subsection{Registers} \subsection{Addressing modes} \subsection{Instructions} \begin{enumerate} \item Data Transfer: \begin{table}[H] \centering \begin{tabular}{|c|c|c|} \hline Destination & Source & Examples\\ \hline GPRs & Memory & LB, LBU, LH, LHU, LW, LWU, LD; FLW, FLD\\ Memory & GPRs & SB, SH, SW, SD; FSW, FSD\\ Integer Rs & Floating Rs & FMV.X.S, FMV.X.D\\ Floating Rs & Integer Rs & FMV.S.X, FMV.D.X\\ \hline \end{tabular} \end{table} \item ALU: ADD, ADDI, ADDW, ADDIW, SUB, SUBW, XOR, XORI, AUIPC \item Control: BEQ, BNE, BLT, BGE, JAL, JALR \item Others: FADD, FSUB, FMUL, FDIV, FSQRT [followed by .S or .D] \end{enumerate} \subsection{Assemblers}

\section{8086} \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|c|c|} \hline 1936 & \multicolumn{5}{|c|}{[Universal Turing machine followed by early computers: UNIVAC, MARK I, IBM's]}\\ \hline Date & Architecture & Register-size & Data-bus & Address-bus & Primaries \\ \hline 1978 & 8086 & 16- & 16- & 20- & general-purpose\\ 1978 & 8088 & 16- & 8- & 20- & sold 250,000 by IBM\\ 1980 & 8087 & - & - & - & 8 new 80- registers (60 new)\\ 1982 & 286 & 16- & 16- & 24- & protected mode\\ 1985 & 386 & 32- & 32- & 32- & new addressing modes\\ 1989 & 486 & 32- & 32- & 32- & built-in 8K cache \\ 1993 & Pentium & 32- & 64- & 32- & dual pipeline (4 new)\\ 1997 & Pentium II & 32- & 64- & 36- & extended MMX (57 new)\\ 1997 & Pentium III & 32- & 64- & 36- & extended SSE (70 new)\\ 2000 & Pentium IV & 32- & 64- & 36- & extended SSE2 (144 new)\\ 2003 & AMD & 64- & 64- & 40- & 16 new 128- registers (XMM)\\ 2004 & Pentium D & 32- & 64- & 36- & dual core (SSE3)\\ 2006 & Core & 32- & 64- & 36- & multi core\\ 2008 & i7 & 64- & 64- x3 & 36- & 6 cores\\ 2009 & i5 & 64- & 64- x2 & 36- & 4 cores\\ 2010 & i3 & 64- & 64- x2 & 36- & 2 cores\\ \hline \end{tabular} \end{table} \subsection{Registers} \subsection{Addressing modes} \begin{table}[H] \centering \begin{tabular}{|c|c|} \hline Mode & Allowed registers\\ \hline Immediate & 8, 16 or 32 bits; one of 14 major registers (not IP and FLAGS)\\ Register indirect & BX, SI, DI; EAX, ECX, EDX, EBX, ESI, EDI\\ Memory direct & 16 or 32 bits\\ Base w/ disp & BP, BX, SI, DI; EAX, ECX, EDX, EBX, ESI, EDI\\ Base w/ index & BX+SI, BX+DI, BP+SI, and BP+DI\\ Base w/ index and disp & ''\\ Base w/ scaled index & Scale is 0, 1, 2 or 3; base is any 32- except ESP; index is any 32-\\ B w/ scaled index + disp & ''\\ \hline \end{tabular} References to IP use the CS, references to BP or SP use the SS, and the default for others is the DS. 
\end{table} \subsection{Assemblers} \subsection{Instructions} \begin{enumerate} \item Data transfer: \begin{table}[H] \centering \begin{tabular}{|c|c|c|} \hline Source/destination & Second source & Examples\\ \hline R8; R32 & Immediate & MOV AL, 0; MOV EAX, 0\\ R8; R32 & R8; R32 & MOV AL, BL; MOV EAX, EBX\\ R8; R32 & Memory & MOV AL, mem; MOV EAX, mem\\ Memory & Immediate & MOV mem, 0\\ Memory & R8; R32 & MOV mem, AL; MOV mem, EAX\\ \hline \end{tabular} \end{table} \item ALU: ADD, SUB, INC, DEC, OR, XOR, CMP \item Control: JMP, JMPF, CALL, CALLF, RET, RETF, LOOP \item Others: Strings, Floatings, System calls \end{enumerate} \section{Instruction Set Principles} \begin{enumerate} \item Architecture classification: General-purpose with a load-store architecture \item Registers: At least 16, and preferably 32, general-purpose registers in orthogonality \item Addressing modes: Immediate (8-16 bits), Register indirect, Displacement (12-16 bits) \item Data sizes: 8-, 16-, 32-, and 64-bit integers and 64-bit floats \item Operations: Load, store, add, subtract, move, shift \item Controls: Jump, call and return with a PC-relative address of at least 8 bits \item Trade-off: Fixed instruction encoding for performance; variable instruction encoding for code size \item Compiler technology: \end{enumerate} \section{Comparisons} \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|} \hline Aspects & x86 & ARMv8 & RISC-V \\ \hline Architecture & gen-purpose register-memory & gen-purpose load-store & gen-purpose load-store \\ Instruction size & 8, 16, 24, 32, 48 & 32 & 32\\ Integer registers & 8 x 32- (or 16-) & 31 x 64- + SP & 31 x 64-\\ FP registers & 8 x 80- & 32 x 64- (or 32-) & 32 x 64- (or 32-)\\ Addressing & Segmented & Flat & Flat\\ Alignment & Aligned preferred & Must be aligned & Must be aligned\\ Addressing modes & Many & Few & Few\\ Sizes of operands & 8, 32 (or 16) & 8, 16, 32, 64 & 8, 16, 32, 64\\ Operations & Rich & Not so & Not so\\ Control flow & Checks flags & Checks flags & Checks registers\\ Encoding & Variable & Fixed & Fixed \\ \hline \end{tabular} \end{table} \section{ARMv8} \subsection{Registers} \subsection{Addressing modes} [Base w/ Disp (Pre, Post), Base w/ Index (Post), Base w/ Index and Disp, Base w/ Scaled Index, PC-relative] \subsection{Instructions} \begin{enumerate} \item Data Transfer: \begin{table}[H] \centering \begin{tabular}{|c|c|c|} \hline Destination & Source & Examples\\ \hline GPRs & Memory & LDRSB, LDRB, LDRSH, LDRH, LDRSW, LDR, LDRX; LD…\\ Memory & GPRs & STB, STW, STL, STX; ST…\\ Integer Rs & Floating Rs & -\\ Floating Rs & Integer Rs & -\\ \hline \end{tabular} \end{table} \item ALU: ADDX, -, ADD, ADDI, SUBX, -, SUB, SUBI, XOR, XORI \end{enumerate} \chapter{Architectural Viewpoint} \begin{itemize} \item Will include general principles and all that \item Two kinds of parallelism in applications: \begin{enumerate} \item Data parallelism: many data items can be operated on at the same time \end{enumerate} \end{itemize} \section{History} \begin{table}[H] \centering \begin{tabular}{|c|c|c|} \hline Date & Growth & Embarks \\ \hline 1978 to 1986 & 25\%/year & CISC\\ 1986 to 2003 & 52\%/year & RISC\\ 2003 to 2011 & 23\%/year & End of Dennard scaling; Amdahl's parallelism\\ 2011 onwards & 12\%/year & End of Moore's law\\ \hline \end{tabular}\par Parallelism: Instruction-level, Vector architectures, Thread-level, Request-level \end{table} \section{Quantitative Principles} \begin{enumerate} \item Principle of temporal and spatial locality \item Take advantage of parallelism \item While doing so, focus on the common case (Amdahl's law) \item Processor performance: \end{enumerate} \chapter{Memory 
System} When the address leaves the processor, the cache is encountered first. When the data is found in the cache it's called a hit, else a miss; if not found, it is retrieved from memory in blocks (localities). The time required to retrieve those blocks depends on both latency and bandwidth. Not all data is present in memory; some is on disk, so it comes from there. \begin{itemize} \item Latency: How long it takes to retrieve the first word. Say a request from the processor misses in the cache and is found in main memory: latency covers everything from the time the request is issued until the data returns from memory. It is mostly a measure of the distance from the processor to the memory. \item Bandwidth: How fast the rest of the block comes out. Since the whole block must be fetched, the rate at which it streams out is the bandwidth. It mostly depends on the memory technology and the number of ports, ports meaning how many places data can come out of at once. \end{itemize} \section{Hierarchy} \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|c|} \hline Name & Registers & Cache & Main memory & Disk storage \\ \hline Bandwidth (MiB/s) & 1,000,000-10,000,000 & 20,000-50,000 & 10,000-30,000 & 100-1,000\\ Size & 4KiB & 32KiB to 8MiB & 1TB & 1TB\\ Access time (ns) & 0.1-0.2 & 0.5-10 & 30-150 & 5,000,000\\ Managed by & Compiler & Hardware & OS & OS\\ Implementation & Multiport CMOS & On-chip CMOS SRAM & CMOS DRAM & Magnetic\\ \hline \end{tabular} \end{table} From the CPU performance equation, the CPI can be expanded with a memory-stall term: CPU time = IC x (CPI execution + memory stall cycles per instruction) x clock cycle time. We will try to decrease the memory stall cycles without decreasing the clock frequency \section{Caches} For each memory system in the hierarchy we need to understand four questions. Once we move data to the cache in blocks, block size becomes important. How many blocks does the cache hold is the first question; how many blocks does memory hold is the second. So the cache consists of block frames, as does memory. We ask the following questions; others may not be as important, but cache placement does affect our cache performance, friends \begin{enumerate} \item Placement: Suppose (don't go into details) a block must be brought in from memory: into which frame of the cache does it go? How it got there doesn't matter\par On the other hand, when an address is generated, it is a memory address, so where in the cache is the corresponding block? Each block is attached a tag, and you SEARCH for it. So it's like an array search \begin{table}[H] \centering \begin{tabular}{|c|c|c|} \hline Placement & Memory block goes to? & Identification\\ \hline Fully-associative & Anywhere in the cache & Associative search over whole cache\\ Direct-mapped & One designated place & Index (search)\\ Set-associative & A set of restricted places & Index into a set, associative within the set\\ \hline \end{tabular} \begin{itemize} \item Index search: Just like in an array, it's not even a search: you take the base address plus the index and you get there. Once there you check the tag; if it's a match we are done, else we are also done \item Associative search: More like a map, so each element is a pair. To do an associative search you check the first pair; if it matches then okay, else check another. In set-associative it's a combo of the two \end{itemize} \end{table} \begin{table}[H] \centering \begin{tabular}{|c|c|c|} \hline \multicolumn{2}{|c|}{Block address} & Block offset\\ \cline{1-2} Tag & Index () & (Block size)\\ \hline \end{tabular} : \begin{tabular}{|c|c|c|c|} \hline Tag & Use bit & Dirty bit & Valid bit\\ \hline \end{tabular} \end{table} \item Replacement: If, when bringing in a block, we find the place full, we replace one on the basis of some replacement algorithm: for direct-mapped there is none, while for associative we prefer LRU but approximate it, or do FIFO or random. \par To know which elements were recently used we can keep a use bit that says this one was recently used, try another. \item Write strategy: Of all cache traffic only about 28 percent is writes:\par What do you do when the block you want to put data into is found in the cache? You definitely write into the cache, but what about memory? In write-through you write to the memory at the same time. It stays consistent with peripherals but stalls while the write goes to memory (write stalls). Instead of writing directly to memory we opt for write buffers: as soon as the write lands in the buffer the processor resumes while the buffer drains to memory. Unfortunately, on a read miss we then have to wait until the buffer is empty, because the latest values are in the write buffer\par In write-back you set a dirty bit, meaning this data is changed in the cache but not in memory, and memory is updated only when the block is replaced, not on every rewrite. Saves traffic. What if, when we have to discard a block, multiple blocks are dirty? For this we have victim buffers: when blocks get dirty they are put into those buffers. With victim buffers, when an element gets dirty it is updated in the buffer; if it is rewritten there is no need to update the memory yet, only the victim buffer. \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|} \hline Strategy & What happens on write? & Features & On write-miss?\\ \hline Write-through & At once & Consistency & Write no-allocate\\ \hline Write-back & Through dirty bits and write buffers & Low traffic & Write allocate\\ \hline \end{tabular} \end{table} What about when the write is a miss, meaning the block you want to write into is not present in the cache? Our not-so-popular write-through uses write no-allocate, which means we write directly into memory; the cache is not involved at all. On the other hand, the popular way is write-back's friend write allocate, which treats a write miss like a read miss: a block is fetched from main memory, then written in the cache, and then we do what we already did when the block was found in the cache \item These assumptions relied on always finding a valid element in the cache, which can be untrue; initially caches can be empty. So before comparing the tag we check a valid bit, which sits along the tag in the same array. \item Cache optimization:\par The miss rate and miss penalty were already seen in the equation, but one more thing is the clock time. You have to reach the cache in one cycle (why? because what's the point if not), and if you have a hard time doing so, you have to stretch the clock cycle time. 
\begin{table}[H] \centering \begin{tabular}{|c|c|c|c|c|} \hline Techniques & Rate & Penalty & Time & Complexity\\ \hline Larger block size & + & - & & \\ Larger cache size & + & & - & -\\ Higher associativity & + & & - & -\\ \hline Multi-level caches & & + & & - -\\ Reads priority over writes & & + & & -\\ \hline Avoid address translation & & & + & -\\ \hline \end{tabular} \end{table} If you increase the block size, a larger block has to be brought in from memory.\par If you increase the cache size, searching it gets harder\par Increasing associativity also makes the search harder, since more associative searching has to be done\par With multiple cache levels the penalty drops, since we no longer have to fetch straight from memory; complexity goes up because the cache questions must be considered all over again per level, and some choices that were ill-chosen in L1 come alive in L2 and beyond. One further concern is whether to have the inclusion or exclusion property: if L1's data is also in L2, that is nicely consistent, just like memory and the L1 cache; otherwise, to save space, make the levels exclusive and exchange blocks on a miss as an alternative\par In write-through we used to wait until the write buffer was empty; instead we now check the write buffer, and if the data is there we pull it from there\par Address translation: Internally, processors use a larger address size than the physical one, which requires a translation before heading towards a physically addressed cache. \end{enumerate} \section{Virtual memory} \begin{enumerate} \item Comparison: \begin{table}[H] \centering \begin{tabular}{|c|c|c|} \hline Parameter & First-level cache & Virtual memory \\ \hline Features & Efficiency & Management with inherent protection\\ Block (page) size & 16-128 bytes & 4096 to 65,536 bytes \\ Hit time & 1-3 clock cycles & 100-200 clock cycles\\ Miss penalty & 8-200 clock cycles & 1,000,000-10,000,000 cycles\\ Miss rate & 0.1-10\% & 0.00001\%-0.001\%\\ Address mapping & 25-45 physical to 14-20 cache & 32-64 virtual to 25-45 physical\\ Controller & Hardware & OS, support from hardware\\ \hline \end{tabular} \end{table} \item Paging vs segmentation: \begin{table}[H] \centering \begin{tabular}{|c|c|c|} \hline Parameter & Paging & Segmentation \\ \hline Words per address & 1 & 2 \\ Programmer visible? & Not & May be\\ Replacement & Trivial since same sizes & Difficult to find contiguous memory\\ Memory efficiency & Internal fragmentation & External fragmentation\\ Disk traffic efficiency & Yes & Not always\\ \hline \end{tabular}\par [Hybrid approach of paged segments in which a segment is an integral multiple of pages] \end{table} \item Placement: Fully associative, with TLB \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|c|c|} \hline Page number & Valid bit & Use bit & Dirty bit & Physical address & Protection fields\\ \hline \end{tabular} \end{table} \item Replacement: LRU \item Write strategy: Write-back \item Optimizations (Selecting a page size): \begin{table}[H] \centering \begin{tabular}{|c|c|} \hline Larger page sizes & Efficient TLB\\ Smaller page sizes & Conserving storage, as well as I/O bandwidth\\ Multiple page sizes & Fruits from both worlds\\ \hline \end{tabular} \end{table} \item Working with cache: \begin{table}[H] \centering \begin{tabular}{|c|c|c|} \hline Aspects & Virtual cache & Physical cache \\ \hline Efficiency & Less hit time & More hit time \\ Flushing & Cleared when switching processes [soln: PIDs] & No need of clearance\\ OS mapping & Multiple virtual addresses for the same physical address & No aliases\\ Peripherals & Virtual conversion to communicate with cache & Directly accessible\\ \hline \end{tabular}\par [The soln allows the cache read to begin immediately, and yet tag comparison is still with physical addresses.] \end{table} \item Hypothetical: \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|c|c|} \hline Phys bits & L1 Cache & L2 Cache & Virt bits & Page size & TLB Entries\\ \hline 40 & Direct 16KiB (64) & 4-way 4MiB (64) & 64 & 16KiB & 2-way 256\\ \hline \end{tabular} \end{table} \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|} \hline \multicolumn{4}{|c|}{Virtual address } \\ \hline \multicolumn{2}{|c|}{Virtual page number } & \multicolumn{2}{|c|}{Page offset }\\ \hline TLB compare addr & TLB index & L1 index & Block offset \\ \hline \multicolumn{4}{|c|}{}\\ \hline TLB tag & TLB data & L1 tag & L1 data\\ \hline \end{tabular} \end{table} \item Architecture protection minimums \begin{itemize} \item Enable several processes to have their state in memory at the same time \item Provide at least two modes, user and system, with exclusive privileges \item Mechanism to switch modes via system calls and returns \end{itemize} \begin{table}[H] \centering \begin{tabular}{|c|c|} \hline Architecture & Protection features \\ \hline 8086 & First 640KB for RAM and remaining 384KB reserved for the BIOS\\ 80286 & Protected mode with shortcomings of interchangeability\\ 80386 & Backwards-compatible virtual mode solves the 286's problems\\ \hline \end{tabular} \end{table} \begin{itemize} \item Segmented memory (opt. paging) with global and local descriptor tables \item Four levels of protection allowing user to pass parameters to OS \item AMD64: Page sizes 4KiB, 2MiB and 1GiB

\end{itemize} \end{enumerate} \section{Advanced cache optimizations} \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|c|c|c|} \hline Techniques & Rate & Penalty & Time & Complexity & Bandwidth & Power \\ \hline Small and simple caches & - & & + & 0 & & +\\ Way-prediction caches & & & + & 1 & & +\\ Pipelining banked caches & & & - & 1 & + & \\ Critical word first & & + & & 2 & & \\ Nonblocking caches & & + & & 3 & + & \\ Merging write buffer & & + & & 1 & & \\ Hardware prefetching & + & + & & 2 inst. 3 data. & & -\\ HBM as additional level of cache & + & - & & 3 & +,- & +\\ Compiler-controlled prefetching & + & + & & 3 & & \\ Compiler techniques & + & & & 0 & & \\ \hline \end{tabular} \end{table} \lstset{language=C} \begin{lstlisting}
/* Before interchange: inner loop strides down columns, poor locality */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 100; i = i+1)
        x[i][j] += x[i][j];

/* After interchange: row-major traversal gets spatial locality */
for (i = 0; i < 100; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] += x[i][j];
\end{lstlisting} \begin{lstlisting}
/* Before blocking [2N^3 + N^2 words accessed] */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + x[i][k]*y[k][j];
        z[i][j] = r;
    };

/* After blocking with factor B [2N^3/B + N^2 words accessed] */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + x[i][k]*y[k][j];
                z[i][j] = z[i][j] + r;
            };
\end{lstlisting} \section{Advanced virtualization} \begin{itemize} \item Page tables for the VMM and also for each guest OS \item Trap guest VMs to the VMM on executing privileged instructions \item Allow the VMM and guest OSes to operate at different privileges \item Virtualization of virtual memory through indirecting real memory or shadow page tables \item Other virtualizations \end{itemize} \section{Single core evaluations} \begin{enumerate} \item Cortex-A53 (IP core): \begin{itemize} \item Instructions per cycle: 2 (in order) \item Clock rate: Up to 1.3 GHz \item Memory hierarchy: 32-bit virtual, 32-bit physical \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|} \hline Structure & Size & Organization & Miss penalty\\ \hline MicroTLBs & 10 entries & Fully associative & 2 \\ L2 Unified TLB & 512 & 4-way & 20\\ L1 caches (vir index) & 8 to 64KiB (64) & 2-way & 13\\ L2 Unified cache & 128KiB to 2MiB (64) & 16-way; LRU; Write-back & 124\\ \hline \end{tabular} \end{table} \item L2 to main memory: 64-128 bits \end{itemize} \item Intel Core i7: \begin{itemize} \item Instructions per cycle: 4 (out of order) \item Clock rate: Up to 4.0 GHz \item Memory hierarchy: 48-bit virtual, 36-bit physical \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|} \hline Structure & Size & Organization & Miss latency (penalty)\\ \hline Instruction TLB & 128 entries & 8-way & 1 (9) \\ Data MicroTLB & 64 entries & 4-way & 1 (9) \\ L2 Unified TLB & 1536 & 12-way & 8 (hundreds)\\ L1 caches (vir index) & 32KiB (64) & 8-way & 4\\ L2 Unified cache & 256KiB (64) & 4-way & 12\\ Shared L3 cache & 2MiB (64) & 16-way; Pseudo-LRU; Write-back & 44 \\ \hline \end{tabular} \end{table} \item L3 to memory: 3 x 64 bits \item Multibanking: Each and every level [Pipelined: L1] \item Nonblocking: L1, L2, L3 \item Merging write buffer: L1 \item Supports prefetching: L1, L2 \item Additional L4 with HBM support \end{itemize} \end{enumerate} \section{Final countdown} \begin{enumerate} \item The page frame of the instruction's address (36 bits) is sent to the instruction TLB. \item At the same time, the 12-bit page offset from the virtual address is sent to the instruction cache \item The instruction TLB is accessed to find a match between the address and a valid page table entry (PTE) (steps 3 and 4). In addition to translating the address, the TLB checks to see if the PTE demands that this access result in an exception because of an access violation. An instruction TLB miss first goes to the L2 TLB, which contains 1536 PTEs of 4 KiB page sizes and is 12-way set associative. It takes 8 clock cycles to load the L1 TLB from the L2 TLB, which leads to the 9-cycle miss penalty including the initial clock cycle to access the L1 TLB. If the L2 TLB misses, a hardware algorithm is used to walk the page table and update the TLB entry. Sections L.5 and L.6 of online Appendix L describe page table walkers and page structure caches. In the worst case, the page is not in memory, and the operating system gets the page from secondary storage. Because millions of instructions could execute during a page fault, the operating system will swap in another process if one is waiting to run. 
Otherwise, if there is no TLB exception, the instruction cache access continues \item The index field of the address is sent to all eight banks of the instruction cache (step 5). The instruction cache tag is 36 bits - 6 bits (index) - 6 bits (block offset), or 24 bits \item The eight tags and valid bits are compared to the physical page frame from the instruction TLB (step 6) \item Because the i7 expects 16 bytes each instruction fetch, an additional 2 bits are used from the 6-bit block offset to select the appropriate 16 bytes. Therefore 6 + 2 or 8 bits are used to send 16 bytes of instructions to the processor. The L1 cache is pipelined, and the latency of a hit is 4 clock cycles (step 7). A miss goes to the second-level cache \item The 30-bit block address (36-bit physical address - 6-bit block offset) is divided into a 20-bit tag and a 10-bit index (step 8) \item Once again, the index and tag are sent to the four banks of the unified L2 cache (step 9), which are compared in parallel. If one matches and is valid (step 10), it returns the block in sequential order after the initial 12-cycle latency at a rate of 8 bytes per clock cycle \item The 13-bit index (step 11) is sent to all 16 banks of the L3 (step 12). The L3 tag, which is 36 - (13 + 6) = 17 bits, is compared against the physical address from the TLB (step 13). If a hit occurs, the block is returned after an initial latency of 42 clock cycles, at a rate of 16 bytes per clock, and placed into both L1 and L3. If L3 misses, a memory access is initiated \item If the instruction is not found in the L3 cache, the on-chip memory controller must get the block from main memory. The i7 has three 64-bit memory channels that can act as one 192-bit channel, because there is only one memory controller and the same address is sent on both channels (step 14) \item Each channel supports up to four DDR DIMMs (step 15). When the data return they are placed into L3 and L1 (step 16) because L3 is inclusive \item The total latency of the instruction miss that is serviced by main memory is approximately 42 processor cycles to determine that an L3 miss has occurred, plus the DRAM latency for the critical instructions. \end{enumerate} \begin{figure} \centering \includegraphics[width=\textwidth]{inteli7.png} \includegraphics[width=\textwidth]{inteli72.png} \end{figure} \chapter{Instruction and thread levels} \section{Pipelining} One way to exploit ILP is pipelining, i.e., overlapping the execution of instructions. The actual stage-by-stage micro-operations: \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|c|c|} \hline Stages & \multicolumn{5}{|c|}{Micro-operations}\\ \hline IF & \multicolumn{5}{|c|}{IR <- Mem[PC]}\\ & \multicolumn{5}{|c|}{NPC <- PC + 4}\\ \hline ID & \multicolumn{5}{|c|}{DECODE}\\ & \multicolumn{5}{|c|}{A <- Regs[rs1], B <- Regs[rs2]}\\ & \multicolumn{5}{|c|}{Imm <- sign-extend(imm field of IR)}\\ \hline EX & ALUOut <- A op Imm & ALUOut <- A op B & \multicolumn{2}{|c|}{ALUOut <- A + Imm} & cond <- (A == B)?\\ & & & \multicolumn{2}{|c|}{} & Target <- NPC + (Imm << 2)\\ \hline MEM & - & - & LMD <- Mem[ALUOut] & Mem[ALUOut] <- B & PC <- cond ? Target : NPC \\ \hline WB & Regs[rd] <- ALUOut & Regs[rd] <- ALUOut & Regs[rd] <- LMD & - & -\\ \hline \end{tabular}\par\par [A typical branch frequency of 12\% and a store frequency of 10\% leads to an overall CPI of 4.66]

    [Caches try to minimize the memory-stall part of CPI, while pipelining tries to bring CPI down toward 1]
\end{table}
        \begin{itemize}
        \item  The use of separate caches eliminates a conflict for a single memory that would arise between instruction fetch and data memory access
        \item The register file is used in two stages: read in ID and write in WB. We perform the register write in the first half of the clock cycle and the read in the second half
        \item Separation is achieved by introducing pipeline registers between successive stages of the pipeline, so that at the end of a clock cycle all the results from a given stage are stored into a register that is used as the input to the next stage on the next clock cycle
    \end{itemize}
[Note for future: Processor states are changed at 4th and 5th stage only, which is nice]
\begin{table}[H]
    \centering
    \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|}
        \hline
        Instruction & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & In cycle 5\\\hline
        i & IF & ID & EX & MEM & WB &&&&& Register write  \\\cline{11-11}
        i+1 && IF & ID & EX & MEM & WB &&&& Data memory \\\cline{11-11}
        i+2 &&& IF & ID & EX & MEM & WB &&& ALU \\\cline{11-11}
        i+3 &&&& IF & ID & EX & MEM & WB && Register read \\\cline{11-11}
        i+4 &&&&& IF & ID & EX & MEM & WB & Instruction memory \\\hline
    \end{tabular}
\end{table}

[Essentially two ways to pipeline: in-order scheduling, maintaining instruction order, and out-of-order scheduling] \subsection{Branch issues} You cannot just let branch instructions pass. A branch can force a pipeline flush and a restart, which is a performance loss of 20 to 30 percent for modern applications even at a branch frequency of only 10 percent \begin{itemize} \item Predictors: One of the best ways is to actually predict the branch based purely on the history of the branch at that address. Then we know when to fetch from the target and when from the fall-through successor. This saves only a few percent in our five-stage pipeline but is important in deeper pipelines.

Maintain a small memory indexed by the lower portion of the address of the branch instruction: for those low-order address bits keep a single-bit buffer that stores whether the branch at that address was last taken or not. It can be wrong when different branches alias to the same low-order address bits\par

An improvement is the 2-bit predictor shown in the following state diagram (a C sketch follows the diagram)
\begin{center}
[FIG: 2-bit predictor state diagram: four states, two predicting taken and two predicting not taken; a taken branch moves one state toward "taken", a not-taken branch one state toward "not taken", so the prediction flips only after two consecutive mispredictions]
\end{center}
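A minimal C sketch of the 2-bit saturating scheme (table size and names are mine):
\begin{lstlisting}
#include <stdint.h>

/* 2-bit saturating counters: 0,1 predict not-taken, 2,3 predict taken;
   two wrong guesses in a row are needed to flip the prediction. */
enum { BHT_BITS = 10, BHT_SIZE = 1 << BHT_BITS };
static uint8_t bht[BHT_SIZE]; /* indexed by low branch-address bits */

int predict(uint32_t pc) {
    return bht[(pc >> 2) & (BHT_SIZE - 1)] >= 2; /* taken? */
}

void update(uint32_t pc, int taken) {
    uint8_t *c = &bht[(pc >> 2) & (BHT_SIZE - 1)];
    if (taken) { if (*c < 3) (*c)++; } /* saturate at 3 */
    else       { if (*c > 0) (*c)--; } /* saturate at 0 */
}
\end{lstlisting}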

\item Advanced predictors: The past history of branches (regardless of the branch's address) can add information on how to predict the branch. An (m,n) predictor uses the behaviour of the last m branches to choose from $2^m$ branch predictors, each of which is an n-bit predictor for a single branch. \par
Store the history in a shift register; the predictor buffer can then be indexed using those m bits concatenated with lower-order address bits. For eg, a (2,2) predictor with 64 entries would use 2 history bits and 4 lower address bits (see the sketch just below). \par
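A sketch of that (2,2) indexing in C (names are mine; it reuses 2-bit counters as above):
\begin{lstlisting}
#include <stdint.h>

/* (m,n) = (2,2) correlating predictor with 64 entries: the global
   history shift register (last 2 outcomes) is concatenated with
   4 low-order address bits to index the 2-bit counters. */
static uint8_t table[64];  /* 64 two-bit counters          */
static unsigned history;   /* m = 2 bits of global history */

static unsigned index_of(uint32_t pc) {
    return ((history & 0x3) << 4) | ((pc >> 2) & 0xF);
}

int predict_corr(uint32_t pc) { return table[index_of(pc)] >= 2; }

void update_corr(uint32_t pc, int taken) {
    uint8_t *c = &table[index_of(pc)];
    if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
    history = ((history << 1) | (taken & 1)) & 0x3; /* shift in outcome */
}
\end{lstlisting}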

\end{itemize} Note: All the hazards are detected in the ID stage, so once an instruction crosses it, it is said to be issued \par \subsection{Data dependencies} If we implement the pipeline just like that, we see how dependencies can turn into hazards. Three crucial classes of problems:

They arise from data dependencies, inherent properties of a program that must be respected to maintain correctness and which, when the dependent instructions are near each other, cause hazards in the pipeline \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|} \hline Types & Order of i then j on x & Damage & Dependencies\\ \hline Read After Write & i writes, j reads & j gets wrong x & Data\\ Write After Read & i reads, j writes & i gets wrong x & Anti\\ Write After Write & i writes, j writes & x gets wronged & Output\\ \hline \end{tabular}\par [WAR and WAW are not possible in RISC-V in-order execution] \end{table} \begin{lstlisting}
ld  x1, 0(x1)   ; the following adds source x1 from this load
add x2, x1, x2  ; stall: RAW on x1
add x3, x1, x3  ; forward: RAW on x1
add x4, x1, x4  ; RAW but far enough: no hazard
add x5, x5, x5  ; completely not RAW
\end{lstlisting} \begin{itemize} \item Interlocks: Hardware interlocks are implemented in the ID stage, but how do they detect hazards? By keeping track of source and destination registers. Like: okay, you have these sources and this destination, go on. Hey you, you have this source? Oh no, its value will come only after a few cycles, so wait here until I tell you to go. When the first add comes into ID and says its source registers are x1 and x2, it gets stalled because register x1 is still in flight from the earlier load instruction. Bro, that's too much stalling though. Note: Memory interlocking is not an issue here; see it by taking a store and a load in succession \item Forwarding: A little insight: data in registers is actually written in the WB phase but is produced earlier, so we can forward the early-produced data from the pipeline registers to the stages that require it. Here the load instruction actually produces its data at the MEM stage, one clock cycle before the second add needs it; so instead of waiting for the load's WB stage to complete, the data from MEM is forwarded to the ALU so the add can use it. The extra mux logic will do just about right \item WAR and WAW: Just stall them, as they are pretty rare in occurrence even in other pipelines \end{itemize} \subsection{Exceptions handling} You think hazards are the only ones? Haha, there is a beast lurking in here. Exceptions are the events that disrupt the normal flow of the processor

\subsection{Exceptions handling} You think hazards are the only problem? There is a beast lurking here: exceptions, the events that disrupt the normal flow of the processor. The bad ones are those which are not requested by the user and are synchronous:

What else could be there? External interrupts, timer interrupts, and environment calls are some examples \begin{table}[H] \centering \begin{tabular}{|c|c|} \hline
Stages & Possible exceptions \\ \hline
IF & Page faults on IM; misaligned access; protection violations \\
ID & Illegal opcode \\
EX & Arithmetic exceptions \\
MEM & Page faults on DM; misaligned access; protection violations \\
WB & None \\ \hline
\end{tabular} \end{table} [Three facts about the RISC-V pipeline as prerequisite information] \begin{itemize} \item Hazards are discovered at the ID stage only \item Exceptions are known at or after the MEM stage \item The state of the processor changes only at or after the MEM stage \end{itemize} Several problems arise while dealing with exceptions: \begin{enumerate} \item Precise exceptions:\par When an exception occurs, we need to force a trap instruction into the instruction stream. If the pipeline can be stopped so that preceding instructions complete while the faulting and following ones can be restarted from scratch, the pipeline is said to have precise exceptions\par Things that get in the way: \begin{itemize} \item The faulting instruction changes the state of the processor before causing the exception [not in RISC-V] \item Some earlier instruction has already changed the state of the processor [not in RISC-V] \item Two exceptions occur at the same time [a slight issue in RISC-V] \item Exceptions can occur out of order: a later instruction may raise its exception before an earlier, not-yet-completed one raises its own \end{itemize} [Solution for RISC-V: Maintain an exception status vector associated with each instruction, which travels down the pipeline with it. When an exception is registered in the vector, all writes are disabled (so no change of state), including those of the instructions ahead of it; no more instructions are fetched and the pipeline is stalled for now. Even though an instruction can no longer change state, it can still register its exceptions in the vector. At the end of MEM (or WB) for each instruction, the exceptions registered in the vector are checked and handled in the order they would have occurred in an unpipelined processor, after which the pipeline is restarted at the very next instruction]
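A rough C sketch of this status-vector mechanism; the struct, field names, and stage bits are illustrative assumptions, not RISC-V specifics:
\begin{lstlisting}
enum { EXC_IF = 1, EXC_ID = 2, EXC_EX = 4, EXC_MEM = 8 }; /* bit per stage */

typedef struct {
    unsigned pc;
    unsigned exc_vector;   /* exceptions registered so far */
    int writes_enabled;    /* cleared once any exception is recorded */
} InFlight;

void note_exception(InFlight *insn, unsigned stage_bit) {
    insn->exc_vector |= stage_bit; /* record, tagged by stage */
    insn->writes_enabled = 0;      /* no state changes from here on */
}

/* At the end of MEM (or WB): handle the earliest-stage exception first,
   i.e., the order an unpipelined machine would have seen. */
int first_exception(const InFlight *insn) {
    for (unsigned bit = EXC_IF; bit <= EXC_MEM; bit <<= 1)
        if (insn->exc_vector & bit) return (int)bit;
    return 0; /* nothing registered */
}
\end{lstlisting}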

[For other processors, who knows; it could be rollbacks or other weird and wild mechanisms]
\item Imprecise exceptions:\par
Let's ignore everything and allow imprecise exceptions:
\begin{itemize}
    \item Solution 1: Buffer the results of an operation until all the operations issued earlier are complete. They say it is complicated, and I believe them. There are two variations:
    \begin{itemize}
    \item History file: The history file keeps track of the original values of registers. When an exception occurs and the state must be rolled back earlier than some instruction that completed out of order, the original value of the register can be restored from the history file
    \item Future file: Another approach, the future file, keeps the newer value of a register; when all earlier instructions have completed, the main register file is updated from the future file
     \end{itemize}

     \item Solution 2: Keep things only somewhat imprecise by recording which operations were in the pipeline and their PCs, so they can be simulated later. After handling the exception, the software finishes any instructions that precede the latest instruction completed, and the sequence can restart

\end{itemize}

\end{enumerate} \section{Out of order execution} These solutions insert stalls but do not change the order of the instructions, i.e., program order is maintained; however, order only needs to be maintained where it affects the outcome. To minimize stalls we reorder instructions through scheduling techniques, at the cost of other problems \begin{itemize} \item An instruction that is control-dependent on a branch cannot be moved before the branch, which would make its execution no longer controlled by the branch. For example, we cannot take an instruction from the then portion of an if statement and move it before the if statement \item An instruction that is not control-dependent on a branch cannot be moved after the branch, which would make its execution controlled by the branch. For example, we cannot take a statement before the if statement and move it into the then portion \end{itemize} Taking the same code as before:
\begin{lstlisting}
ld  x1, 0(x1)   ;x1 produced by the load
add x2, x1, x2  ;stall: RAW on x1
add x3, x1, x3  ;forward: RAW on x1
add x4, x1, x4  ;RAW on x1 but x1 is ready by now
add x5, x5, x5  ;no RAW at all
\end{lstlisting}
The last instruction can be moved into the stall slot before the first add \subsection{Compiler techniques} The compiler can reorder instructions to exploit parallelism and avoid stalls. Such scheduling is static and probably belongs in a compiler design course \subsection{Dynamic scheduling} We try scheduling at the hardware level; dynamic scheduling offers several advantages \begin{enumerate} \item It allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline, eliminating the need to keep multiple binaries and recompile for a different microarchitecture (same ISA, different hardware) \item It handles cases where dependencies are unknown at compile time, as may result from a modern programming environment that uses dynamic linking or dispatching \item Most importantly, it allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve \item Hardware speculation, a technique with additional performance advantages, builds on dynamic scheduling \end{enumerate} Bad stuff: more data hazards (the WAR and WAW kinds), which dynamic scheduling solves in its own way, and exceptions become imprecise

A natural way of reorganizing the pipeline to suit scheduling: previously an instruction was issued from ID only when it could execute without hazards, i.e., all data hazards had been resolved. Now the issue process in ID is separated into two parts: checking for structural hazards and waiting for the absence of data hazards. Thus we still use in-order instruction issue (instructions issued in program order), but we want an instruction to begin execution as soon as its data operands are available (an out-of-order pipeline). \begin{enumerate} \item Fetch: fetch instructions in order into a queue \item Issue: check for hazards and wait until the operands can be read \item Execute: once hazard-free, do the memory or arithmetic work \item Write back: update the processor state by writing into the registers (or not) \end{enumerate} An instruction fetch stage precedes the issue stage and may fetch either into an instruction register or into a queue of pending instructions; instructions are then issued from the register or queue. In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or can bypass each other in the second stage and thus enter execution out of order\par [Easier said than done] \subsection{Tomasulo approach} Tomasulo uses reservation stations, which buffer instructions waiting to issue and are associated with the functional units; an instruction waits in an RS corresponding to its functional unit. The basic idea is that a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register. In addition, pending instructions designate the reservation station that will provide their input, i.e., the register specifiers for pending operands are renamed to the names of reservation stations\par [Convince yourself this solves all the data hazards possible] \begin{enumerate} \item Issue: as an instruction enters an RS, it tags its destination register with the RS name, telling other instructions that this register value is stale and a new value is on the way. It also checks its operand registers: if they are tagged with some RSes, it waits for those to produce the operands. A common data bus broadcasts each produced operand together with its RS tag, so pending instructions do not refer to the register file, they refer to the respective RS. Extra caution is needed for memory operations \item Execute: monitor the bus until the operands are available; execute as many instructions as there are functional units \item Write back: when the result is available, put it on the common bus, from which it goes into the registers and into any RS waiting for it
\end{enumerate} [Exceptions are imprecise]
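A C sketch of a reservation-station entry and the common-data-bus broadcast; the field names follow the usual convention (Vj/Vk for values, Qj/Qk for producing tags), and tag 0 meaning ready is an assumption of this sketch:
\begin{lstlisting}
typedef struct {
    int busy;
    int op;
    long Vj, Vk;   /* operand values, valid once Qj/Qk are 0 */
    int  Qj, Qk;   /* tags of the producing RSes; 0 = value ready */
    int  dest_tag; /* tag broadcast on the common data bus */
} RS;

/* Common-data-bus broadcast: every waiting station captures the result. */
void cdb_broadcast(RS *rs, int n, int tag, long value) {
    for (int i = 0; i < n; i++) {
        if (rs[i].Qj == tag) { rs[i].Vj = value; rs[i].Qj = 0; }
        if (rs[i].Qk == tag) { rs[i].Vk = value; rs[i].Qk = 0; }
    }
}
\end{lstlisting}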

Speculation: We need 100\% branch prediction accuracy, or speculation, to get around branches. Whenever an instruction issues, it is also allocated an entry in a structure much like an RS, the reorder buffer (ROB), which records the order in which instructions should update the state of the processor. In the write-back stage, writes are performed in the order of their ROB entries. When a branch with an incorrect prediction reaches the head of the ROB, the speculation was wrong: the ROB is flushed and execution restarts at the correct successor of the branch
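A sketch of in-order commit from the ROB head; the entry layout and the convention that the caller flushes and refetches at the returned PC are illustrative:
\begin{lstlisting}
typedef struct {
    int ready, is_branch, mispredicted;
    int dest;         /* architectural destination register */
    long value;
    unsigned next_pc; /* correct successor, for mispredicted branches */
} ROBEntry;

/* Returns a refetch PC on a mispredict, 0 when commit proceeds or waits. */
unsigned commit_head(ROBEntry *rob, int *head, int size, long *regs) {
    ROBEntry *e = &rob[*head];
    if (!e->ready) return 0;          /* head not finished: commit stalls */
    if (e->is_branch && e->mispredicted)
        return e->next_pc;            /* caller flushes ROB, restarts here */
    regs[e->dest] = e->value;         /* in-order architectural update */
    *head = (*head + 1) % size;
    return 0;
}
\end{lstlisting}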

Hence, state updates happen in program order again, which also buys back precise exceptions.

\section{Multiple-issue processor} The preceding techniques eliminate stalls to approach the ideal CPI of one; to push it below one we need multiple-issue processors. Architectures along those lines: \begin{table}[H] \centering \begin{tabular}{|c|c|c|c|c|} \hline
Name & Issue structure & Scheduling & Hazard detection & Examples \\ \hline
Static superscalar & Dynamic & Static & Hardware & Embedded space, ARM \\
Speculative superscalar & Dynamic & Dynamic & Hardware & Intel cores \\
VLIW & Static & Static & Primarily software & Signal processing \\
EPIC & Primarily static & Mostly static & Primarily software & Itanium (NASA) \\ \hline
\end{tabular} \end{table} Given the overheads and hardware requirements incurred, current technology makes about four issues per cycle practical for superscalar architectures. A VLIW issues a fixed number of instructions, usually five, including one integer, two floating-point, and two memory references, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among the instructions explicitly indicated by the instruction. EPIC is along the same lines, composing 6 instructions into a packet

\section{Multithreading} ILP is reasonably transparent to programmers, but during cache misses the utilization of the functional units drops dramatically: the CPU just stalls waiting for data from the lower levels of memory, understandable for data but a heavy loss for instruction throughput. Another form of parallelism, multithreading, allows multiple threads to share the functional units of a single processor in an overlapping fashion. Multithreading does not duplicate the entire processor as a multiprocessor does; instead, it shares most of the processor core among a set of threads, duplicating only the private state, such as the registers and program counter

Requirements: \begin{itemize} \item Software part: a program must contain explicit multiple threads that can execute concurrently. These threads are identified by the compiler or by the programmer \item Hardware part: a separate instruction queue per thread, fetching in parallel; the functional units are shared but the private state is not, so there is a separate register file and PC (along with RSes and the like) for each thread. Memory itself can be shared through the virtual memory mechanisms, which already support multiprogramming \end{itemize} The hardware must support the ability to switch to a different thread relatively quickly; in particular, a thread switch should be much more efficient than a process switch, which typically takes hundreds to thousands of processor cycles. Simultaneous multithreading dispatches instructions from more than one hardware thread on each clock cycle, interleaving the execution of instructions from multiple threads, so that even if one thread stalls the others can make use of the functional units

Current processors opt for a four-issue superscalar with SMT supporting two or four threads. Why not push harder? Because multi-core is the better deal in terms of energy, performance, and price. \section{Multicore} Computers consisting of tightly coupled processors whose coordination and usage are typically controlled by a single operating system and that share memory through a shared address space. To take advantage of an MIMD multiprocessor with n processors, we must usually have at least n threads or processes to execute; with multithreading, the need is 2--4 times higher. Shared-memory multiprocessors share the address space: \begin{enumerate} \item Symmetric or centralized multiprocessors: some levels of private cache and usually a shared one, while memory is centralized among the cores. Some multicores have nonuniform access to the outermost cache even though it is centralized, so they are not really SMPs \item To support larger processor counts and their bandwidth requirements, memory must be distributed among the processors while the address space stays shared; these are the nonuniform memory access machines, NUMAs \end{enumerate} Hardware requirements: \begin{itemize} \item Atomic primitives that can be used to implement synchronization; the key property is that they read and update a memory value in such a manner that we can tell whether the pair of operations executed atomically \item Coherence: when shared data are cached, the shared value may be replicated in multiple caches, which can result in two processors seeing different values for the same memory location. The key to implementing a cache coherence protocol is tracking the state of any sharing of a block, similar to the valid and dirty bits of uniprocessor caches. One method is to ensure that a processor has exclusive access to a data item before writing it: any copy held by a reading processor is invalidated, so its next read misses in the cache \item Consistency: the simplest way to implement sequential consistency is to require a processor to delay the completion of any memory access until all invalidations caused by that access are complete. Such sequential consistency sacrifices potential performance, especially in systems with a large number of processors.

With synchronized, data-race-free programs, a relaxed model behaves as though the processor were sequentially consistent. Release consistency is based on the observation that in synchronized programs an acquire operation must precede any use of shared data, and a release operation must follow any update to shared data and also precede the next acquire. This lets us relax the ordering slightly: a read or write that precedes an acquire need not complete before the acquire, and a read or write that follows a release need not wait for the release.
    
Although synchronization operations always ensure that previous writes have completed, we may want to guarantee that writes are complete without an identified synchronization operation. In such cases an explicit instruction, called FENCE in RISC-V, is used to ensure that all previous instructions in that thread have completed, including all memory writes and their associated invalidates (sketched below) \end{itemize}
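A sketch of the acquire/release idea using C11 atomics; the explicit fences play the role FENCE plays in RISC-V, and the variable names are made up:
\begin{lstlisting}
#include <stdatomic.h>

long shared_data;
atomic_int flag; /* 0 = not yet published */

/* Producer: make the write to shared_data visible before the flag. */
void publish(long v) {
    shared_data = v;
    atomic_thread_fence(memory_order_release); /* release fence */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

/* Consumer: acquire before reading the data the flag guards. */
long consume(void) {
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                      /* spin until published */
    atomic_thread_fence(memory_order_acquire); /* acquire fence */
    return shared_data;
}
\end{lstlisting}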

\section{Vectors} SIMD provides significant data-level parallelism and is easier on programmers than MIMD. Three variations of SIMD: \subsection{Vector architectures} Vector architectures grab sets of data elements scattered about memory, place them into large sequential register files, operate on data in those register files, and then disperse the results back into memory. A single instruction works on vectors of data, which results in dozens of register-register operations on independent data elements.

Hardware requirements: \begin{itemize} \item Configure the register bank into vector registers: say each vector register stores eight 64-bit FPs; a 256-byte bank then yields four such registers \item No need for dynamic scheduling, but large, deeply pipelined functional units to execute vector operations; the load on memory bandwidth is heavy \item New instructions to say whether an operation is vector-scalar, scalar-vector, or vector-vector; masking of parts of a vector; and, since loop elements may not sit in adjacent memory locations, non-unit-stride loads and stores \end{itemize} \subsection{Multimedia extensions} Multimedia workloads operate on narrower data types than the processors were optimized for and are parallelizable. Exporting the principles of vector architectures into the functional units gives a new set of SIMD instructions that perform the same operation on vectors of data: with minimal modifications and overhead, a 64-bit adder can perform eight 8-bit additions. In contrast to vector architectures, these SIMD extensions mostly lack strided memory access \begin{table}[H] \centering \begin{tabular}{|c|c|c|} \hline
Instruction set & Register file size & Floating point equivalents \\ \hline
MMX & 128 bytes, FPs & 1 FP; 8 word or 4 double-word operations \\
SSE & 256 bytes, XMMs & 4 SPFP \\
SSE2,3,4 & 256 bytes, XMMs & 4 SPFP or 2 DPFP \\
AVX256 & 512 bytes, YMMs & 8 SPFP or 4 DPFP \\
AVX512 & 2 KB, ZMMs & 32 SPFP or 16 DPFP \\ \hline
\end{tabular} \end{table} \subsection{GPUs} GPUs come from graphics accelerators and have been moving into mainstream computing; the design of a GPU starts from the question: given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications? GPUs make use of all types of parallelism: instruction-level, multithreading, MIMD, SIMD. Since the GPU works alongside a system processor, it is not just about performance: it is about coordinating the scheduling of computation on the system and the GPU and the transfer of data between system and GPU memory.\par

Computation resources:

The GPU hardware is a multiprocessor composed of multithreaded SIMD processors (each called a streaming multiprocessor), each consisting of parallel functional units called SIMD lanes, whose vertical slice of work is a CUDA thread. A GPU can have one to several dozen multithreaded SIMD processors; for example, the Pascal P100 has 56 of them, each containing 64 CUDA cores, meaning functional units. A SIMD instruction waits in a per-processor queue and is issued by a warp scheduler when it is ready to execute. On the Pascal GPU each 32-element-wide SIMD instruction is mapped onto 16 physical SIMD lanes, so each SIMD instruction in a SIMD thread takes 2 cycles to complete. So there are SIMD threads consisting of SIMD instructions, which are in turn spread across lanes, with the per-lane threads as the smallest units

Memory resources:

There is a global section of memory, the off-chip DRAM, shared by the whole GPU. Local memory is a scratchpad, typically 48 KB, shared by the SIMD lanes within a multithreaded SIMD processor but not shared between SIMD processors. To hold vectors, each lane has about 1 KB of registers, and each also gets a private section of off-chip DRAM for its stack frame and for spilling registers and local variables that do not fit in the registers. There is an L1 cache per streaming multiprocessor, which caches local and private memory, and a shared L2 cache for caching global memory.

Programming:

NVIDIA decided to develop a C-like language and programming environment, CUDA (Compute Unified Device Architecture): it produces C/C++ for the system processor (host) and a C/C++ dialect for the GPU (device, thus the D in CUDA). OpenCL is a vendor-independent alternative. Each vertical slice of a lane is a CUDA thread. Threads are grouped hierarchically and executed as a thread block, which consists of SIMD instructions and is assigned to one SIMD multiprocessor, hence its threads can communicate through local memory. Say you want to do matrix addition. First, of course, you transfer the data to the GPU. The per-element addition function is called a kernel, and each CUDA thread executes it. So how do the threads figure out which (i,j)th element to add? CUDA threads are organized into three-dimensional blocks, and blocks themselves into a two-dimensional structure called a grid; think of the vectorizable loop as the grid. A kernel call specifies the parameters, the number of blocks (gridDim), and the number of threads per block (blockDim). Using the identifier of each block (blockIdx: the x,y coordinates of the block in the grid), its dimensions (blockDim), and the identifier of each thread (threadIdx: the x,y,z coordinates of the thread within the block), each thread can work out which data to access and which code to execute, as in the sketch below
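A minimal CUDA sketch of that matrix addition; the kernel name, the 16x16 block shape, and the assumption that device buffers were already allocated and filled (via cudaMalloc and cudaMemcpy) are all illustrative choices:
\begin{lstlisting}
#include <cuda_runtime.h>

/* Kernel: each CUDA thread adds one (i,j) element. */
__global__ void matAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; /* column index */
    int j = blockIdx.y * blockDim.y + threadIdx.y; /* row index */
    if (i < n && j < n)
        c[j * n + i] = a[j * n + i] + b[j * n + i];
}

/* Host side: pick blockDim, derive gridDim, launch. */
void launch_matadd(const float *dA, const float *dB, float *dC, int n) {
    dim3 block(16, 16);                      /* threads per block */
    dim3 grid((n + block.x - 1) / block.x,
              (n + block.y - 1) / block.y);  /* blocks in the grid */
    matAdd<<<grid, block>>>(dA, dB, dC, n);  /* the kernel call */
    cudaDeviceSynchronize();                 /* wait for completion */
}
\end{lstlisting}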

\par \begin{table}[H] \centering \begin{tabular}{|c|c|c|} \hline
Features & Intel i7 6700K & Tesla P100 \\ \hline
Cores & Two 512-bit FP units, 4 cores & 64 FPUs per SM, 56 SMs \\
DP FLOPS & 256 G & ~5000 G \\
Memory bandwidth & ~70 GB/s & ~720 GB/s \\
Power & 65 W & 250 W \\ \hline
\end{tabular} \end{table} Nvidia comes in three flavours: GeForce for gaming, Quadro for visualization, and Tesla for computation. Each of these brands comes in different microarchitecture generations, such as Pascal, Ampere, and so on \section{Domain Specific Architectures} Moore's law is dead and there is nothing left up our sleeves to continue major improvements in cost-performance and energy efficiency for general-purpose architectures. Note that mechanisms such as caches are overkill for applications with easily predictable memory access patterns.

The heart of the TPU is a Matrix Multiply Unit built from 65,536 8-bit ALUs, plus a large software-managed on-chip memory. The TPU's single-threaded, deterministic execution model is a good match to the 99th-percentile response-time requirement of the typical DNN inference application. Thus the TPU is closer in spirit to an FPU (floating-point unit) coprocessor than to a GPU, which fetches instructions from its own memory. The host CPU sends TPU instructions over the PCIe bus into an instruction buffer. The internal blocks are typically connected by 256-byte-wide (2048-bit) paths \section{Deals} Memory organization:\par CPU organization:

Programming stuffs will be in text files

\chapter{Interfaces IO} INPUT/OUTPUT ORGANIZATION: >A computer serves no useful purpose without the ability to receive information from an outside source and to transmit results in a meaningful form >Input or output devices attached to the computer are also called peripherals. Among the most common peripherals are keyboards, display units, and printers. Peripherals that provide auxiliary storage for the system are magnetic disks and tapes.

INTERFACE:
	>Input-output interface provides a method for transferring information between internal storage and external I/O devices. Peripherals connected to a computer need special communication links for interfacing, to resolve the differences that exist between the central computer and each peripheral
		-Peripherals are electromechanical and electromagnetic devices, and their manner of operation differs from that of the CPU and memory, which are electronic devices
		-The data transfer rate of a peripheral is usually slower than the transfer rate of the CPU, so a synchronization mechanism may be needed
		-Data codes and formats in peripherals differ from the word format in the CPU and memory
		-The operating modes of peripherals differ from one another, and each must be controlled so as not to disturb the operation of the other peripherals connected to the CPU
		
	>To resolve these differences, computer systems include special hardware components between the CPU and peripherals to supervise and synchronize all input and output transfers. These components are called interface units because they interface between the processor bus and the peripheral device. In addition, each device may have its own controller that supervises the operations of the particular mechanism in the peripheral
		1. To communicate with a particular device, the processor places a device address on the address lines. Each interface attached to the I/O bus contains an address decoder that monitors the address lines
		2. When an interface detects its own address, it activates the path between the bus lines and the device it controls. All peripherals whose address does not correspond to the address on the bus are disabled by their interface
		3. At the same time that the address is made available on the address lines, the processor provides a function code on the control lines. The selected interface responds to the function code and proceeds to execute it. The function code is referred to as an IO command and is in essence an instruction that is executed in the interface and its peripheral unit. Four types of commands that an interface may receive are control, status, data output, and data input.
	>(FIG 18)
	
IO BUS:
	>The computer buses are used in following ways to communicate with memory and I/O:
		-Use two separate buses, one for memory and the other for IO (requires an IOP)
		-Use one common bus for both memory and IO but have separate control lines for each
		-Use one common bus for memory and IO with common control lines
	
	
	ISOLATED v/s MEMORYMAPPED:
	>Many computers use one common bus to transfer information between memory or IO and the CPU. The distinction between a memory transfer and IO transfer is made through separate read and write lines. The IO read and IO write control lines are enabled during an IO transfer. The memory read and memory write control lines are enabled during a memory transfer
		4. Isolated:
			>In isolated IO configuration, the CPU has distinct input and output instructions, and each of these instructions is associated with the address of an interface register. When the CPU fetches and decodes the operation code of an input or output instruction, it places the address associated with the instruction into the common address lines. At the same time, it enables the IO read (for input) or IO write (for output) control line. This informs the external components that are attached to the common bus that the address in the address lines is for an interface register and not for a memory word
			>The isolated IO method isolates memory and IO addresses so that memory address values are not affected by interface address assignment since each has its own address space. 
			
		5. Memory-mapped:
			>The other alternative is to use the same address space for both memory and IO. This is the case in computers that employ only one set of read and write signals and do not distinguish between memory and IO addresses. In a memory-mapped IO organization there are no specific input or output instructions; the CPU can manipulate IO data residing in interface registers with the same instructions that are used to manipulate memory words. Each interface is organized as a set of registers that respond to read and write requests in the normal address space. The advantage is that the load and store instructions used for reading and writing memory can also be used to input and output data from interface registers.
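			What this looks like to software: ordinary loads and stores through pointers to the interface registers. A minimal C sketch; the addresses and status bit are hypothetical:
\begin{lstlisting}
#include <stdint.h>

#define UART_DATA   ((volatile uint8_t *)0x10000000u) /* hypothetical */
#define UART_STATUS ((volatile uint8_t *)0x10000005u) /* hypothetical */
#define TX_READY    0x20u                             /* hypothetical bit */

void mmio_putchar(uint8_t ch) {
    while ((*UART_STATUS & TX_READY) == 0)
        ;             /* ordinary load polls the interface register */
    *UART_DATA = ch;  /* ordinary store performs the output */
}
\end{lstlisting}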
	
	
	SYNCHRONIZATION:
	>Two units, such as a CPU and an IO interface, are designed independently of each other. Asynchronous data transfer between two independent units requires that control signals be transmitted between the communicating units to indicate the time at which data is being transmitted. 
		6. One way of achieving this is by means of a strobe pulse supplied by one of the units to indicate to the other unit when the transfer has to occur. 
		7. Another method commonly used is to accompany each data item being transferred with a control signal that indicates the presence of data in the bus. The unit receiving the data item responds with another control signal to acknowledge receipt of the data. This type of agreement between two independent units is referred to as handshaking
		
	>The transfer of data between two units may be done in parallel or serial. In long-distance serial transmission, each unit is driven by a separate clock. In synchronous transmission, synchronization signals are transmitted periodically between the two units to keep their clocks in step; in asynchronous transmission, binary information is sent only when it is available and the line remains idle when there is no information to be transmitted.
	>Integrated circuits are available that are specifically designed to provide the interface between a computer and its attached serial terminals. Such a circuit is called a UART (universal asynchronous receiver-transmitter)
		
	>Example of 8255 and USB here
		
	
	MODES OF TRANSFER:
	>Data transfer between the central computer and IO devices may be handled in a variety of modes. Some modes use the CPU as an intermediate path; others transfer the data directly to and from the memory unit. Data transfer to and from peripherals may be handled in one of three possible modes
		
	1. Programmed IO
		>Programmed IO operations are the result of IO instructions written in the computer program. Each data item transfer is initiated by an instruction in the program. Programmed IO is so called because a special IO program is in full control of all data transfers. Using a software technique called polling, the microprocessor is synchronized to the speed of the peripherals.
		>Advantages: The control program is always in control of the peripheral and the hardware required is not too complex. Particularly useful in small low-speed computers or in systems dedicated to monitoring a device continuously.
		>Disadvantages: In the programmed IO method, the CPU stays in a program loop until the IO unit indicates that it is ready for data transfer. This is a time-consuming process since it keeps the processor busy needlessly. Particularly inefficient in a multitasking environment
		>Example: In the programmed IO method, the IO device does not have direct access to memory. A transfer from an IO device to memory requires the execution of several instructions by the CPU, including an input instruction to transfer the data from the device to the CPU and a store instruction to transfer the data from the CPU to memory. A program is written for the computer to check the flag in the status register to determine if a byte has been placed in the data register by the IO device (sketched in code after the steps below).
			-Read the status register
			-Check the flag bit: branch to step 1 if it is not set, to step 3 if it is set
			-Read the data register
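			The same three steps as a C polling loop, against hypothetical register addresses and a hypothetical flag bit:
\begin{lstlisting}
#include <stdint.h>

#define STATUS_REG ((volatile uint8_t *)0x20000004u) /* hypothetical */
#define DATA_REG   ((volatile uint8_t *)0x20000000u) /* hypothetical */
#define READY_FLAG 0x01u

uint8_t poll_read_byte(void) {
    while ((*STATUS_REG & READY_FLAG) == 0)
        ;              /* steps 1-2: read status, loop back if not set */
    return *DATA_REG;  /* step 3: read the data register */
}
\end{lstlisting}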
		>Priority resolution: The polling concept can be extended to more than one peripheral. A routine can be written to poll each device (bit 0, then bit 1, etc.) and branch to the appropriate peripheral service routine when one is ready. If multiple devices are ready, the first one polled is serviced first. In fact, this suggests that the polling routine can be written in such a way that priorities are assigned to the peripherals
		
	2. Interrupt-initiated IO
		>An alternative to the CPU constantly monitoring the flag is to let the interface inform the computer when it is ready to transfer data. This mode of transfer uses the interrupt facility. While the CPU is running a program, it does not check the flag. However, when the flag is set, the computer is momentarily interrupted from proceeding with the current program and is informed of the fact that the flag has been set. The CPU deviates from what it is doing to take care of the input or output transfer. After the transfer is completed, the computer returns to the previous program to continue what it was doing before the interrupt.
		>Example: The input on the interrupt pin is sampled by the processor during every instruction, and if the input is found active, control is transferred to a special interrupt service routine. In principle, there are two methods for accomplishing this: vectored interrupts and nonvectored interrupts. In a nonvectored interrupt, the branch address is assigned to a fixed location in memory. In a vectored interrupt, the interrupting source supplies the branch information to the computer.
			-When an interrupt occurs at INTR pin during normal processing, the processor finishes its current instruction and saves the program counter on the stack. 
			-Control is transferred to appropriate ISR and the routine is executed. The ISR ends with return instruction causing program counter to be popped from the stack
		>Advantages: The processor attends to a peripheral only when that peripheral is ready
		>Disadvantages: Too many interrupts can cause the microprocessor to become asynchronous
		
		>Priorities: 
			-A priority interrupt system establishes a priority over the various sources to determine which condition is to be serviced first when two or more requests arrive simultaneously. The system may also determine which conditions are permitted to interrupt the computer while another interrupt is being serviced.
			-Establishing the priority of simultaneous interrupts can be done by software or hardware
				1. A polling procedure can identify the highest-priority source in software. In this method there is one common branch address for all interrupts. The program that takes care of interrupts begins at the branch address and polls the interrupt sources in sequence. The order in which they are tested determines the priority of each interrupt. The highest-priority source is tested first, and if its interrupt signal is on, control branches to a service routine for this source; otherwise, the next-lower-priority source is tested, and so on.
				2. A hardware priority-interrupt unit functions as an overall manager in an interrupt system environment. It accepts interrupt requests from many sources, determines which of the incoming requests has the highest priority, and issues an interrupt request to the computer based on this determination. 
		
			DAISY-CHAINING:
				>The daisy-chaining method of establishing priority consists of a serial connection of all devices that request an interrupt. The device with the highest priority is placed in the first position, followed by lower-priority devices, down to the device with the lowest priority, which is placed last in the chain.
				>The interrupt line is common to all devices and forms a wired-logic connection. If any device has its interrupt signal in the low-level state, the interrupt line goes to the low-level state and enables the interrupt input of the CPU. When no interrupts are pending, the interrupt line stays in the high-level state and no interrupts are recognized by the CPU.
				>The CPU responds to an interrupt by enabling the interrupt acknowledge line. The signal is received by device 1 at its PI (priority in) input. The acknowledge signal passes on to the next device through the PO (priority out) output only if device 1 is not requesting an interrupt. If device 1 has a pending interrupt, it blocks the acknowledge signal from the next device by placing a 0 on its PO output. It then proceeds to insert its own interrupt vector address onto the data bus for the CPU to use during the interrupt cycle.
				>A device with a 0 at its PI input generates a 0 at its PO output to inform the lower-priority devices that the acknowledge signal has been blocked. A device that is requesting an interrupt and has a 1 at its PI input intercepts the acknowledge signal by placing a 0 on its PO output. If the device has no pending interrupt, it transmits the acknowledge signal to the next device by placing a 1 on its PO output. Thus the device with PI = 1 and PO = 0 is the highest-priority device that is requesting an interrupt, and this device places its VAD on the data bus
				>FIG 19	
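				The PI/PO rule of one chain stage, written out in C purely for clarity; it is just the combinational logic described above:
\begin{lstlisting}
typedef struct { int requesting; } Device;

/* pi = priority-in (1 means the acknowledge reaches this device).
   Returns PO: the grant passes on only if this device is idle. */
int daisy_stage(const Device *d, int pi, int *selected) {
    *selected = pi && d->requesting;  /* PI=1 and requesting: wins */
    return pi && !d->requesting;      /* PO = PI AND (not requesting) */
}

/* Walk the chain from highest to lowest priority. */
int grant_winner(const Device *chain, int n) {
    int pi = 1, sel;
    for (int k = 0; k < n; k++) {
        pi = daisy_stage(&chain[k], pi, &sel);
        if (sel) return k;            /* this device places its VAD */
    }
    return -1;                        /* no pending interrupt */
}
\end{lstlisting}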
				
				
			PARALLEL PRIORITY:
				>The parallel priority interrupt method uses a register whose bits are set separately by the interrupt signal from each device. Priority is established according to the position of the bits in the register. In addition to the interrupt register, the circuit may include a mask register whose purpose is to control the status of each interrupt request. The mask register can be programmed to disable lower-priority interrupts while a higher-priority device is being serviced. It can also provide a facility that allows a high-priority device to interrupt the CPU while a lower-priority device is being serviced.
				>FIG 20
				>By means of program instructions, it is possible to set or reset any bit in the mask register. Each interrupt bit and its corresponding mask bit are applied to an AND gate to produce the four inputs to a priority encoder. In this way an interrupt is recognized only if its corresponding mask bit is set to 1 by the program
				>The priority encoder generates two bits of the vector address, which is transferred to the CPU.
				>Another output from the encoder sets an interrupt status flip-flop IST when an interrupt that is not masked occurs. The interrupt enable flip-flop IEN can be set or cleared by the program to provide overall control over the interrupt system.
				>The output of IST ANDed with IEN provides a common interrupt signal for the CPU
				>The interrupt acknowledge INTACK signal from the CPU enables the bus buffers in the output register, and a vector address VAD is placed onto the data bus (the encoder is sketched in code below).
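				A C sketch of the mask-and-encode step; whether bit 0 or bit 7 counts as highest priority is a design choice, here bit 0 is taken as highest:
\begin{lstlisting}
#include <stdint.h>

/* Returns the encoded line number feeding the vector address,
   or -1 if nothing unmasked is pending (IST stays 0). */
int priority_encode(uint8_t intr_reg, uint8_t mask_reg) {
    uint8_t pending = intr_reg & mask_reg; /* per-line AND gates */
    for (int i = 0; i < 8; i++)            /* bit 0 = highest priority */
        if (pending & (1u << i))
            return i;
    return -1;
}
\end{lstlisting}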

		
	SOFTWARE CONSIDERATIONS:
		>A computer must also have software routines for controlling peripherals and for transferring data between the processor and peripherals. IO routines must issue control commands to activate the peripheral and check the device status to determine when it is ready for data transfer. Once it is ready, information is transferred item by item until all the data are transferred. Error checking and other useful steps often accompany the transfers. In interrupt-controlled transfers, the IO software must issue commands to the peripheral to interrupt when ready and must service the interrupt when it occurs.
		>For this reason IO routines for standard peripherals are provided by the manufacturer as part of the computer system. They are usually included in the operating system. Most operating systems are supplied with a variety of IO programs to support the particular line of peripherals offered for the computer. IO routines are usually available as operating system procedures, and the user calls the established routines to specify the type of transfer required without writing detailed machine language programs
				
		Interrupt routines:
			>The computer must also have software routines for servicing the interrupt requests and for controlling the interrupt hardware registers. Each device has its own service program that can be reached through a JMP instruction stored at the assigned vector address. Each interrupt service routine must have an initial and final set of operations for controlling the registers in the hardware interrupt system.
			Initial sequence:
				-Clear lower-level mask register bits
				-Clear interrupt status bit IST
				-Save contents of processor registers
				-Set interrupt enable bit IEN
				-Proceed with service routine.
				
			Final sequence:
				-Clear the interrupt enable bit IEN
				-Restore contents of processor registers
				-Clear the bit in the interrupt register belonging to the source that has been serviced
				-Set lower level priority bits in the mask register
				-Restore return address into PC and set IEN	
		
	3. Direct memory access
		>Using DMA, the peripheral is synchronized directly to main memory, not to the microprocessor. When a text file is to be output to a disk drive, we are concerned with transferring data from memory to that drive; the microprocessor acts as a bottleneck in this process. During a DMA transfer, the CPU is idle and has no control of the memory buses. A DMA controller takes over the buses to manage the transfer directly between the IO device and memory
		>The bus request (BR) input is used by the DMA controller to request the CPU to relinquish control of the buses. When this input is active, the CPU terminates the execution of the current instruction and places the address bus, the data bus, and the read and write lines into a high-impedance state. The high-impedance state behaves like an open circuit, which means that the output is disconnected and does not have a logic significance.
		
		>The CPU activates the bus grant (BG) output to inform the external DMA that the buses are in the high-impedance state. The DMA that originated the bus request can now take control of the buses to conduct memory transfers without processor intervention. When the DMA terminates the transfer, it disables the bus request line. The CPU disables the bus grant, takes control of the buses, and returns to its normal operation.
	
		>When the DMA takes control of the bus system, it communicates directly with the memory. The transfer can be made in several ways. 
			1. In DMA burst transfer, a block sequence consisting of a number of memory words is transferred in a continuous burst while the DMA controller is master of the memory buses. This mode of transfer is needed for fast devices such as magnetic disks, where data transmission cannot be stopped or slowed down until an entire block is transferred.
			2. An alternative technique called cycle stealing allows the DMA controller to transfer one data word at a time, after which it must return control of the buses to the CPU. The CPU merely delays its operation for one memory cycle to allow the direct memory I/O transfer to “steal” one memory cycle
	
	
		DMA CONTROLLER:
			>The DMA controller needs the usual circuits of an interface to communicate with the CPU and IO device. In addition, it needs an address register, a word count register, and a set of address lines. The address register and address lines are used for direct communication with the memory. The word count register specifies the number of words that must be transferred.
			>FIG 21
			1. The DMA is first initialized by the CPU. The CPU communicates with the DMAC via the data bus and control lines. The registers in DMA are selected by the CPU through the address bus by enabling the DS (DMA select) and RS (register select) inputs. The RD (read) and WR (write) inputs are bidirectional. The initialization process is essentially a program consisting of IO instructions that include the address for selecting particular DMA registers. The CPU initializes the DMA by sending the following information through the data bus. 
				-The starting address of the memory block where data are available (for read) or where data are to be stored (for write)
				-The word count, which is the number of words in the memory block
				-Control to specify the mode of transfer such as read or write
				-A control to start the DMA transfer
				
			Once the DMA receives the start control command, it can start the transfer between the peripheral device and the memory.
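			A C sketch of that CPU-side initialization; the register layout and control bits are made up for illustration:
\begin{lstlisting}
#include <stdint.h>

typedef struct {
    volatile uint32_t address;    /* starting memory address */
    volatile uint32_t word_count; /* words left to transfer */
    volatile uint32_t control;    /* mode bits: direction, start */
} DmaRegs;

#define DMA_WRITE 0x1u /* hypothetical direction bit */
#define DMA_START 0x2u /* hypothetical start bit */

void dma_start(DmaRegs *dma, uint32_t addr, uint32_t count, int to_memory) {
    dma->address    = addr;                        /* where the block lives */
    dma->word_count = count;                       /* how many words */
    dma->control    = (to_memory ? DMA_WRITE : 0u) /* transfer direction */
                    | DMA_START;                   /* kick off the transfer */
}
\end{lstlisting}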
				
			2. Once the DMA is initialized, the CPU stops communicating with the DMA unless it receives an interrupt signal or wants to check how many words have been transferred. As long as the BG input is 0, the CPU can communicate with the DMA registers through the data bus to read from or write to them.
			
			3. When the peripheral device sends a DMA request, the DMA controller activates the BR line, informing the CPU to relinquish the buses. The CPU responds with the BG line, informing the DMA that its buses are disabled. The DMA puts the current value of its address register onto the address bus, initiates the RD or WR signal, and sends a DMA acknowledge to the peripheral device. The direction of transfer depends on the status of the BG line. When BG = 0, RD and WR are input lines allowing the CPU to communicate with the internal DMA registers. When BG = 1, RD and WR are output lines from the DMA controller to the random-access memory to specify the read or write operation on the data
			
			4. When the peripheral receives a DMA acknowledge, it puts a word in the data bus (for write) or receives a word from the data bus (for read). Thus the DMA controls the read or write operations and supplies the address for the memory. The peripheral unit can then communicate with memory through the data bus for direct transfer between the two units while the CPU is momentarily disabled
			
			5. For each word that is transferred, the DMA increments its address register and decrements its word count register. If the word count has not reached zero, the DMA checks the request line coming from the peripheral. For a high-speed device, the line will be active as soon as the previous transfer is completed; a second transfer is then initiated, and the process continues until the entire block is transferred. If the peripheral is slower, the DMA request may come somewhat later; in this case the DMA disables the bus request line so that the CPU can continue executing its program, and when the peripheral requests a transfer, the DMA requests the buses again

			6. If the word count register reaches zero, the DMA stops any further transfers and removes its bus request. It also informs the CPU of the termination by means of an interrupt. When the CPU responds to the interrupt, it reads the content of the word count register; the zero value of this register indicates that all the words were transferred successfully. The CPU can read this register at any time to check the number of words already transferred.
				
			7. A DMA controller may have more than one channel. In this case, each channel has a request and acknowledge pair of control signals which are connected to separate peripheral devices. Each channel also has its own address register and word count register within the DMA controller. A priority among the channels may be established so that channels with high priority are serviced before channels with lower priority.		
				
					
INPUT-OUTPUT PROCESSOR: (What is this? How is it different from DMA? Where is it located? How does it communicate with peripherals? With the CPU?)
	>Instead of having each interface communicate with the CPU, a computer may incorporate one or more external processors and assign them the task of communicating directly with all IO devices. An input–output processor (IOP) may be classified as a processor with direct memory access capability that communicates with I/O devices.
	>In this configuration, the computer system can be divided into a memory unit and a number of processors comprising the CPU and one or more IOPs. Each IOP takes care of input and output tasks, relieving the CPU from the housekeeping chores involved in IO transfers. A processor that communicates with remote terminals over telephone and other communication media in a serial fashion is called a data communication processor (DCP)
	>The IOP is similar to a CPU except that it is designed to handle the details of IO processing. Unlike the DMA controller, which must be set up entirely by the CPU, the IOP can fetch and execute its own instructions. In addition, the IOP can perform other processing tasks, such as arithmetic, logic, branching, and code translation.
	
	>FIG 22
	>The memory unit occupies a central position and can communicate with each processor by means of direct memory access. The CPU is responsible for processing the data needed in the solution of computational tasks. The IOP provides a path for the transfer of data between the various peripheral devices and the memory unit. The CPU is usually assigned the task of initiating the IO program; from then on the IOP operates independently of the CPU and continues to transfer data between external devices and memory
	
	IOP-Peripheral Communication:
	
	>The data formats of peripheral devices differ from memory and CPU data formats. The IOP must structure data words from many different sources. For example, it may be necessary to take 8 bytes from an input device and pack them into one 64-bit word before the transfer to memory. Data are gathered in the IOP at the device rate and bit capacity while the CPU is executing its own program. After the input data are assembled into a memory word, they are transferred from the IOP directly into memory by stealing one memory cycle from the CPU. Similarly, an output word transferred from memory to the IOP is directed from the IOP to the output device at the device rate and bit capacity.
	>The communication between the IOP and the devices attached to it is similar to the program control method of transfer. Communication with the memory is similar to the direct memory access method.
	
	IOP-CPU Communication:
		
	>In very-large-scale computers, each processor is independent of all others and any one processor can initiate an operation. In most computer systems, the CPU is the master while the IOP is a slave processor. CPU is assigned the task of initiating all operations, but I/O instructions are executed in the IOP. 
	>CPU instructions provide operations to start an I/O transfer and also to test I/O status conditions needed for making decisions on various I/O activities. The IOP, in turn, typically asks for CPU attention by means of an interrupt. It also responds to CPU requests by placing a status word in a prescribed location in memory to be examined later by a CPU program. When an I/O operation is desired, the CPU informs the IOP where to find the I/O program and then leaves the transfer details to the IOP
	
	4. The CPU sends an instruction to test the IOP path. The IOP responds by inserting a status word in memory for the CPU to check. The bits of the status word indicate the condition of the IOP and IO device, such as IOP overload, device busy with another transfer, or device ready for IO transfer. If all is in order, the CPU sends the instruction to start the IO transfer. The memory address received with this instruction tells the IOP where to find its program.
	
	5. The CPU can now continue with another program while the IOP is busy with the IO program. Both programs refer to memory by means of DMA transfers. When the IOP terminates the execution of its program, it sends an interrupt request to the CPU. The CPU responds by issuing an instruction to read the status from the IOP; the IOP responds by placing the contents of its status report into a specified memory location. The status word indicates whether the transfer was completed or whether any errors occurred during the transfer. From inspection of the bits in the status word, the CPU determines whether the IO operation was completed satisfactorily
	
	
	IBM 370 IO CHANNEL:
		>The IO processor in IBM 370 computer is called a channel. A typical computer system configuration includes a number of channels with each channel attached to one or more IO devices. There are three types of channels: multiplexer, selector, and block-multiplexer.
			-The multiplexer channel can be connected to a number of slow- and medium speed devices and is capable of operating with a number of IO devices simultaneously
			-The selector channel is designed to handle one IO operation at a time and is normally used to connect one high speed device. 
			-The block-multiplexer channel combines the features of both the multiplexer and selector channels. It provides a connection to a number of high-speed devices, but all IO transfers are conducted with an entire block of data, as compared to a multiplexer channel, which can transfer only one byte at a time
	
		>The CPU communicates directly with the channel through dedicated control lines and indirectly through reserved storage areas in the memory.
		IO instruction: Operation code| Channel address| Device address
	
	

	To learn: IO processor, channel, controller, bridges, bus


MULTIPROCESSORS: >A multiprocessor system is an interconnection of two or more CPUs with memory and input-output equipment. Multiprocessors are classified as MIMD systems.

CHARACTERISTICS:

1. VLSI:
>Very-large-scale integrated circuit technology has reduced the cost of computer components to such a low level that the concept of applying multiple processors to meet system performance requirements has become an attractive design possibility.

2. Single OS:
>There are some similarities between multiprocessor and multicomputer systems since both support concurrent operations. However, there exists an important distinction between a system with multiple computers and a system with multiple processors. Computers are interconnected with each other by means of communication lines to form a computer network. The network consists of several autonomous computers that may or may not communicate with each other. A multiprocessor system is controlled by one operating system that provides interaction between processors and all the components of the system cooperate in the solution of a problem

3. Reliability:
>Multiprocessing improves the reliability of the system so that a failure or error in one part has a limited effect on the rest of the system. If a fault causes one processor to fail, a second processor can be assigned to perform the functions of the disabled processor. The system as a whole can continue to function correctly with perhaps some loss in efficiency

4. Performance:
>The system derives its high performance from the fact that computations can proceed in parallel in one of two ways
	-Multiple independent jobs can be made to operate in parallel
	-A single job can be partitioned into multiple parallel tasks
>An overall function can be partitioned into a number of tasks that each processor can handle individually

CLASSIFICATION:
	>Multiprocessors are classified by the way their memory is organized. 
	1. A multiprocessor system with common shared memory is classified as a shared-memory or tightly coupled multiprocessor. This does not preclude each processor from having its own local memory. In fact, most commercial tightly coupled multiprocessors provide a cache memory with each CPU. In addition, there is a global common memory that all CPUs can access. Information can therefore be shared among the CPUs by placing it in the common global memory
	
	2. An alternative model of multiprocessor is the distributed-memory or loosely coupled system. Each processor element in a loosely coupled system has its own private local memory. The processors are tied together by a switching scheme designed to route information from one processor to another through a message-passing scheme. The processors relay programs and data to other processors in packets. A packet consists of an address, the data content, and some error detection code. The packets are addressed to a specific processor or taken by the first available processor, depending on the communication system used.
	
INTERCONNECTION STRUCTURE:
	>The components that form a multiprocessor system are CPUs, IOPs connected to input-output devices, and a memory unit that may be partitioned into a number of separate modules. The interconnection between the components can have different physical configurations, depending on the number of transfer paths that are available between the processors and memory in a shared-memory system, or among the processing elements in a loosely coupled system.

	1. Time-shared common bus
	2. Multiport memory
	3. Crossbar switch
	4. Multistage switching network
	5. Hypercube system

INTERPROCESSOR COMMUNICATION:
	>The various processors in a multiprocessor system must be provided with a facility for communicating with each other. A communication path can be established through common input-output channels.
	>In a shared-memory multiprocessor system, the most common procedure is to set aside a portion of memory that is accessible to all processors. The primary use of the common memory is to act as a message center similar to a mailbox, where each processor can leave messages for other processors and pick up messages intended for it (a sketch follows the list below).
		-The sending processor structures a request, a message, or a common procedure, and places it in the memory mailbox.
		-Status bits residing in common memory are generally used to indicate the condition of the mailbox, whether it has meaningful information, and for which processor it is intended. 
		-The receiving processor can check the mailbox periodically to determine whether there are valid messages for it. This procedure can be slow, since a processor recognizes a request only when it polls for messages.

		-A more efficient procedure is for the sending processor to alert the receiving processor directly by means of an interrupt signal. This can be accomplished through a software-initiated interprocessor interrupt by means of an instruction in the program of one processor which when executed produces an external interrupt condition in a second processor.
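	A C11 sketch of such a mailbox; the layout and the status-equals-receiver-id convention are illustrative assumptions:
\begin{lstlisting}
#include <stdatomic.h>

/* Status word: 0 = empty, otherwise the id of the intended receiver. */
typedef struct {
    _Atomic int status;
    long message;
} Mailbox;

void send(Mailbox *mb, int receiver, long msg) {
    while (atomic_load(&mb->status) != 0)
        ;                                /* wait for an empty slot */
    mb->message = msg;
    atomic_store(&mb->status, receiver); /* publish: slot now valid */
}

int try_receive(Mailbox *mb, int self, long *out) {
    if (atomic_load(&mb->status) != self) return 0; /* not for us */
    *out = mb->message;
    atomic_store(&mb->status, 0);                   /* mark empty again */
    return 1;
}
\end{lstlisting}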

	>In addition to shared memory, a multiprocessor system may have other shared resources. For example, a magnetic disk storage unit connected to an IOP may be available to all CPUs.

	>Synchronization refers to the special case where the data used to communicate between processors is control information. Synchronization is needed to enforce the correct sequence of processes and to ensure mutually exclusive access to shared writable data
	>The instruction set of a multiprocessor contains basic instructions that are used to implement communication and synchronization between cooperating processes
	>Multiprocessor systems usually include various mechanisms to deal with the synchronization of resources. Low-level primitives are implemented directly by the hardware.
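	A sketch of the classic low-level primitive, a test-and-set lock, using the C11 atomic flag type:
\begin{lstlisting}
#include <stdatomic.h>

/* Initialize with ATOMIC_FLAG_INIT. */
void lock(atomic_flag *l) {
    while (atomic_flag_test_and_set(l)) /* atomic read-and-set */
        ;                               /* already held: spin */
}

void unlock(atomic_flag *l) {
    atomic_flag_clear(l);
}
\end{lstlisting}
	The hardware supplies the atomic read-modify-write; semaphores and other higher-level mechanisms can be built on top of it.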









\end{document}