### ארכיטקטורת יח' עיבוד מרכזית 361-1-4201 ז מס' ז Microprogrammed Microarchitectures and Pipelining I ד"ר גיא תל-צור ### **Multi-Cycle Control** ### הבקר ב- Multi-cycle FIGURE D.3.1 The finite-state diagram for multicycle control. FIGURE D.3.2 The control unit for MIPS will consist of some control logic and a register to hold the state. The state register is written at the active clock edge and is stable during the clock cycle ### Multi-Cycle Microprogram Control # Implementing the **Next-State**Function with a Sequencer Appendix D Part 2 PPTs FIGURE D.4.1 The control unit using an explicit counter to compute the next state. In this control unit, the next state is computed using a counter (at least in some states). By comparison, Figure D.3.2 encodes the next state in the control logic for every state. In this control unit, the signals labeled *AddrCtl* control how the next state is determined. FIGURE D.4.2 This is the address select logic for the control unit of Figure D.4.1. FIGURE D.4.6 The control unit as a microcode. The use of the word "micro" serves to distinguish between the program counter in the datapath and the microprogram counter, and between the microcode memory and the instruction memory. # Translating a Microprogram to Hardware (Control Signals) אפשר לממש מנגנון של מיקרופרוגרם באחד משני האופנים: כמכונת מצבים סופים או SEQUENCER-כ- עכשיו בנקודת הסתכלות מעט שונה: יש לנו יכולת ליצור מעין שפת תכנות (חדשה) של מיקרו הוראות אשר שולטת על 18 אותות הבקרה ועל 4 מצבי התזמון – ראו השקף הבא | Field name | Value | Signals active | Comment | | |---------------------|--------------|------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|--| | | Add | ALUOp = 00 | Cause the ALU to add. | | | ALU control | Subt | ALUOp = 01 | Cause the ALU to subtract; this implements the compare for branches. | | | | Func code | ALUOp = 10 | Use the instruction's function code to determine ALU control. | | | CD04 | PC | ALUSrcA = 0 | Use the PC as the first ALU input. | | | SRC1 | Α | ALUSrcA = 1 | Register A is the first ALU input. | | | | В | ALUSrcB = 00 | Register B is the second ALU input. | | | SRC2 | 4 | ALUSrcB = 01 | Use 4 as the second ALU input. | | | | Extend | ALUSrcB = 10 | Use output of the sign extension unit as the second ALU input. | | | | Extshft | ALUSrcB = 11 | Use the output of the shift-by-two unit as the second ALU input. | | | Register<br>control | Read | | Read two registers using the rs and rt fields of the IR as the register number and putting the data into registers A and B. | | | | Write ALU | RegWrite,<br>RegDst = 1,<br>MemtoReg = 0 | Write a register using the rd field of the IR as the register number and the contents of ALUOut as the data. | | | | Write MDR | RegWrite,<br>RegDst = 0,<br>MemtoReg = 1 | Write a register using the rt field of the IR as the register number and the contents of the MDR as the data. | | | | Read PC | MemRead,<br>lorD = 0, IRWrite | Read memory using the PC as address; write result into IR (and the MDR). | | | Memory | Read ALU | MemRead,<br>lorD = 1 | Read memory using ALUOut as address; write result into MDR. | | | | Write ALU | MemWrite,<br>IorD = 1 | Write memory using the ALUOut as address, contents of B as the data. | | | | ALU | PCSource = 00,<br>PCWrite | Write the output of the ALU into the PC. | | | PC write control | ALUOut-cond | PCSource = 01,<br>PCWriteCond | If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut. | | | | Jump address | PCSource = 10,<br>PCWrite | Write the PC with the jump address from the instruction. | | | | Seq | AddrCtl = 11 | Choose the next microinstruction sequentially. | | | Convension | Fetch | AddrCtl = 00 | Go to the first microinstruction to begin a new instruction. | | | Sequencing | Dispatch 1 | AddrCtl = 01 | Dispatch using the ROM 1. | | | | Dispatch 2 | AddrCtl = 10 | Dispatch using the ROM 2. | | 22 הוראות = 18 קוי בקרה > 4 מצבי sequencer FIGURE D.5.1 Each microcode field translates to a set of control signals to be set. These 22 different values of the fields specify all the required combinations of the 18 control lines. Control lines that are not set, which correspond to actions, are 0 by default. Multiplexor control lines are set to 0 if the output matters. If a multiplexor control line is not explicitly set, its output is a don't care and is not used. The sequencing field can have 4 values: Fetch, Dispatch1, Dispach2 and Sequential. | dispatch table 1 | | | | | | |------------------|-------------|----------|--|--|--| | Opcode field | Opcode name | Value | | | | | 000000 | R-format | Rformat1 | | | | | 000010 | jmp | JUMP1 | | | | | 000100 | beq | BEQ1 | | | | | 100011 | 1 w | Mem1 | | | | | 101011 | SW | Mem1 | | | | | Opcode field | Opcode name | Value | |--------------|-------------|-------| | 100011 | 1 w | LW2 | | 101011 | SW | SW2 | 2 השורות האלה מופיעות פעמיים, לכן נדרש<u>ים</u> 2 רומים FIGURE D.5.2 The two microcode dispatch ROMs showing the contents in symbolic form and using the labels in the microprogram. #### **Summary** Independent of whether the control is represented as a finite-state diagram or as a microprogram, translation to a hardware control implementation is similar. Each state or microinstruction asserts a set of control outputs and specifies how to choose the next state. The <u>next-state function</u> may be implemented by either <u>encoding it in a finite-state machine</u> or using an explicit <u>sequencer</u>. The <u>explicit sequencer is more efficient</u> if the number of states is large and there are many sequences of consecutive states without branching. ### End of Appendix D slides ### Thoughts on Control & Microprogramming ### The Control Store: Some Questions - What control signals can be stored in the control store? Those independent on data אלה שאינם תלויים בנתונים vs. - What control signals have to be generated in hardwired logic? - i.e., what signal cannot be available without processing in the datapath? Those dependent on data אלה שתלויים בנתונים - Remember the MIPS datapath - One PCSrc signal depends on processing that happens in the datapath (bcond logic) ### Variable-Latency Memory - The ready signal (R) enables memory read/write to execute correctly - Example: transition from one state to another state is controlled by the R bit asserted by memory when memory data is available - Could we have done this in a single-cycle microarchitecture? להראות דוגמה מ- LC3b – בשקף הבא ונשים לב למעבר משלב 33 לשלב 35 ### Aside: Memory Mapped I/O - Address control logic determines whether the specified address of LDx and STx are to memory or I/O devices - Correspondingly enables memory or I/O devices and sets up muxes - This is another instance where the final control signals (e.g., MEM.EN or INMUX/2) cannot be stored in the control store - □ These signals are dependent on address ### The Microsequencer: Advanced Questions - What happens if the machine is interrupted? - What if an instruction generates an exception? - How can you implement a complex instruction using this control structure? - Think REP MOVS ### The Power of Abstraction - The concept of a control store of microinstructions enables the hardware designer with a new abstraction: microprogramming - The designer can translate any desired operation to a sequence of microinstructions - All the designer needs to provide is - □ The sequence of microinstructions needed to implement the desired operation - □ The ability for the control logic to correctly sequence through the microinstructions - Any additional datapath control signals needed (no need if the operation can be "translated" into existing control signals) ### Some good examples for Microprogramming - Implement REP MOVS in a microarchitecture using microprogramming - Guidelines: What changes, if any, do you make to the - state machine? - datapath? - □ control store? - microsequencer? - Another good example: Implement unaligned word memory access using microprogramming ## Software Optimization Guide for AMD Family 16h Processors **Table 1. Typical Instruction Mappings** | Instruction | Macro-ops | Micro-ops | Comments | |----------------------|-----------|------------------------------|--------------------------| | MOV reg,[mem] | 1 | 1: load | Fastpath single | | MOV [mem],reg | 1 | 1: store | Fastpath single | | MOV [mem],imm | 1 | 2: move-imm, store | Fastpath single | | REP MOVS [mem],[mem] | Many | Many | Microcode | | ADD reg,reg | 1 | 1: add | Fastpath single | | ADD reg,[mem] | 1 | 2: load, add | Fastpath single | | ADD [mem],reg | 1 | 2: load/store, add | Fastpath single | | MOVAPD [mem],xmm | 1 | 2: store, FP-store-data | Fastpath single | | VMOVAPD [mem],ymm | 2 | 4:2 · {store, FP-store-data} | 256b AVX Fastpath double | | ADDPD xmm,xmm | 1 | 1: addpd | Fastpath single | | ADDPD xmm, [mem] | 1 | 2: load, addpd | Fastpath single | | VADDPD ymm,ymm | 2 | 2:2 · {addpd} | 256b AVX Fastpath double | | VADDPD ymm,[mem] | 2 | 4: 2 · {load, addpd} | 256b AVX Fastpath double | ### **Advantages of Microprogrammed Control** - Allows a very simple design to do powerful computation by controlling the datapath (using a sequencer) - High-level ISA translated into microcode (sequence of microinstructions) - □ Microcode enables a minimal datapath to emulate an ISA - Microinstructions can be thought of as a user-invisible ISA (micro ISA) - Enables easy extensibility of the ISA - Can support a new instruction by changing the microcode - □ Can support complex instructions as a sequence of simple microinstructions - If I can sequence an arbitrary instruction then I can sequence an arbitrary "program" as a microprogram sequence - will need some new state (e.g. loop counters) in the microcode for sequencing more elaborate programs ### **Update of Machine Behavior** - The ability to update/patch microcode in the field (after a processor is shipped) enables - Ability to add new instructions without changing the processor! - □ Ability to "fix" buggy hardware implementations - Examples - □ **IBM 370** Model 145: microcode stored in main memory, can be updated after a reboot - □ **IBM System z**: Similar to 370/145. - Heller and Farrell, "Millicode in an IBM zSeries processor," IBM JR&D, May/Jul 2004. - □ **B1700** microcode can be updated while the processor is running - User-microprogrammable machine! ### **IBM 370/145** Figure 1. General microprogram flow ### מחשבים בע"מ COMPUTERS LTD. מחשבים בע"מ IS PROUD TO ANNOUNCE THE RELEASE IN ISRAEL OF THE BURROUGHS FOURTH GENERATION B-1700 COMPUTER, THE YOUNGEST MEMBER OF THE YOU SERIES. #### THE B - 1700 COMPUTER sets a new record in Burroughs advanced technology, for the first time the cost [from E1,600 a month) justifies the operation of a fully-fledged computer by mediam six biolineira. #### THE B - 1700 COMPUTER brings you micrologic multiprogramming, virtual memory sequential and random access, data-com facilities, all adapted and controlled - without any operator intervention by the "Mader Cooked Program", Burroughs' special gift to the uses. #### THE B - 1700 COMPUTER is ariseled to non programs in R.P.G., Cobol, Fortran, Basic, including progums written for other computers. #### THE B - 1700 COMPUTER is fully compatible with the larger 700 saries computers. #### THE B - 1700 COMPUTER provides you with the unique apportunity of developing earlier and advancing factor in tolving your Dataprocessing requirements. Please call us at No. 218174. #### המחשב 1700 **–** 8 פרווה נפודת שיא חדשה בעכנולוגיה המתקדמת של signorus זון הפעם מראשונה מעריפת העלות והחל מ-2000 לחודשו הפעלת מחשב עדוואי גם עד מפעלים come cross. #### 8-1700 annin מביא לך מייפוולוגיקה, יובוד concré sera gen, sran פריאלית האקראית ואפעורות תפישורת, כל זה מאואם וסכוכול - ללא עודך בהחערבות מפעיל -עלייף יתובנית הפתוח DRIEDE, GRUDO wrozensk forenegls for #### 8-1700 awmma סותאם לקבלה הכניות בשווח Basis, Fortree, Cobal, R.P.C. ואחרות, אבילו נכחבו למחשבים אחרים. #### 8-1700 awmon מפיים שומפעיביליות מלאה עם מחשבי עדרת ה'ספק הגדולים ממנו #### 8-1700 ammin נותו לד את ההודמנות הגדולה להפדים את התפתחותך העצבואית בשעוח עבוד הנחונים. עלצל שלינו לני חסי הטלפת בקופנב, ## המחשב המהפכני של Burroughs המחשב הראשון מדור רביעי המשווק בארץ הספרייה הלאומית 1975 במחשב הזה, בשונה מהמקובל, ניתן היה u-ISA לראות את ה ואף ניתן היה לשנות את המיקרוקוד בצורה דינמית בעת שהמערכת רצה. > אין שקיפות כזו כיום. זו גם אפשרות טובה אבל יש גם צד שלילי והוא סיכוני אבטחת מידע. ### Microcode patch, an example: CVE-2020-8696 Improper removal of sensitive information before storage or transfer in some Intel(R) Processors may allow an authenticated user to potentially enable information disclosure via local access. #### References: https://lists.debian.org/debian-lts-announce/2021/02/msg00007.html https://nvd.nist.gov/vuln/detail/CVE-2020-8698 https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-8698 ### **Review: The Power of Abstraction** - The concept of a control store of microinstructions enables the hardware designer with a new abstraction: microprogramming הקונספט של microprograming משכלל עוד יותר את מושג האבסטרקציה - The designer can translate any desired operation to a sequence of microinstructions - All the designer needs to provide is - The sequence of microinstructions needed to implement the desired operation - The ability for the control logic to correctly sequence through the microinstructions (micro-sequencer) - Any additional datapath elements and control signals needed (no need if the operation can be "translated" into existing control signals) ניתן להמיר כל הוראה לסדרה של מיקרו-הוראות. כל מה שנותר הוא לקבוע את הסדר של המיקרו-הוראות. אם ההוראה ניתנת לתרגום של סיגנלים קיימים אין 64 צורך בשינוי בדאטה-פס ### Microcoded Multi-Cycle MIPS Design ### More Micro-Programming terminology ### Horizontal Microcode A single control store provides the control signals ### Vertical Microcode **Two-level control store**: the first specifies abstract operations $(2^n \times m + 2^m \times k \text{ bit})$ should be smaller than horizontal microcode ROM $(2^n \times k \text{ bit})$ #### Nanocode and Millicode - Nanocode: a level below traditional microcode - microprogrammed control for sub-systems (e.g., a complicated floating-point module) that acts as a slave in a microcontrolled datapath - Millicode: a level above traditional microcode - ISA-level subroutines that can be called by the microcontroller to handle complicated operations and system functions - E.g., Heller and Farrell, "Millicode in an IBM zSeries processor," IBM JR&D, May/Jul 2004. ראו גם השקף הבא. זהו מאמר רשות. מוטיבציה: - In both cases, we avoid complicating the main u-controller - You can think of these as "microcode" at different levels of abstraction ## Millicode in an IBM zSeries processor # Millicode in an IBM zSeries processor Because of the complex architecture of the zSeries® processors, an internal code, called millicode, is used to implement many of the functions provided by these systems. While the hardware can execute many of the logically less complex and high-performance instructions, millicode is required to implement the more complex instructions, as well as to provide additional support functions related primarily to the central processor. This paper is a review of millicode on previous zSeries CMOS systems and also describes enhancements made to the z990 system for processing of the millicode. It specifically discusses the flexibility millicode provides to the z990 system. L. C. Heller M. S. Farrell ## Nanocode Concept Illustrated #### Microcode #### credit: https://everything2.com/title/microcode Microcode is often used to refer to assembly language or even raw machine language, but it isn't either of these things. **It is actually lower-level code**, hard-coded on the silicon itself, that determines how the processor responds to a given opcode and operand. . . . What is done is that a miniature ROM is on the chip. Every clock cycle, based on an internal counter, a given set of bits is read from the ROM. This set of bits covers all of the outputs needed to manipulate the internal logic as well as a few of the external signals. This set of bits is what asserts load and store lines, chip select lines, activates adders, and so forth. The idea behind RISC was to make each instruction just one line of these digital outputs. #### Nanocode #### credit: https://everything2.com/title/nanocode As the name suggests, nanocode is at a lower-level than microcode. Machine code is the raw instructions fed into a processor, this is first decoded by the microcode. This is a translation from the hierarchical organization of instructions that makes sense to humans, to the actual steps that need to be taken in various parts of the processor. Microcode is written in a lookup table to be referenced as each instruction is executed. Nanocode is more finely grained than microcode. It is responsible for converting the logic of the microcode to the low-level electrical signals that will cause the desired result. Nanocode is hardwired to do such things as enabling logic gates to fire their output onto interconnects, enabling the proper gates to input same, and setting the proper flags on the ALU. Very low level stuff that each individual block of a processor needs to carry out its little part of the instruction. ## **Summary of findings** ## Microcode מומלץ לעיין בערך Wikipedia-ב https://en.wikipedia.org/wiki/Microcode ## Multi-Cycle vs. Single-Cycle uArch Advantages Disadvantages You should be very familiar with this right now ## Microprogrammed vs. Hardwired Control Advantages Disadvantages השקף הבא מתייחס לכך #### Comparison | Attributes | Hardwired Control | Microprogramming<br>Control | | | | |---------------------------------------|---------------------|-----------------------------|--|--|--| | Speed | Fast | Slow | | | | | Cost of<br>Implementation | More expensive | Cheaper | | | | | Flexibility | Difficult to modify | Flexible | | | | | Ability to handle complex instruction | Difficult | Easier | | | | | Decoding | Complex | Easy | | | | | Application | RISC | CISC | | | | | Instruction Set Size | Small | Large | | | | Credit: https://ictbyte.com/microprocessor/difference-between-hardwired-and-micro-programmed-control-unit/ ## Multi-Cycle עד כאן נושא ## 361.1.4201 Computer Architecture Pipelining I ## Dr. Guy Tel-Zur Based on slides by Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 1/30/2015 With Dr. Guy Tel-Zur & Danny's modifications #### Agenda for Today & Next Few Lectures - Single-cycle Microarchitectures - Multi-cycle and Microprogrammed Microarchitectures - Pipelining - Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, ... - Out-of-Order Execution ■ Issues in OoO Execution: Load-Store Handling, ... ## Recap of Last & this Lecture - Multi-cycle and Microprogrammed Microarchitectures - Benefits vs. Design Principles - When to Generate Control Signals - MIPS State Machine, Datapath, Control Structure - Microprogrammed Control: uInstruction, uSequencer, Control Store - Microprogramming benefits - Power of abstraction (for the HW designer) - Advantages of uProgrammed Control - Update of Machine Behavior ## Why Pipelining? #### Can We Do Better? What limitations do you see with the multi-cycle design? #### Limited concurrency - Some hardware resources are idle during different phases of instruction processing cycle - "Fetch" logic is idle when an instruction is being "decoded" or "executed" - Most of the datapath is idle when a memory access is happening #### Can We Use the Idle Hardware to Improve Concurrency? - Goal: More concurrency → Higher instruction throughput (i.e., more "work" completed in one cycle) - Idea: When an instruction is using some resources in its processing phase, process other instructions on idle resources not needed by that instruction - □ E.g., when an instruction is being decoded, fetch the next instruction - E.g., when an instruction is being executed, decode another instruction - □ E.g., when an instruction is accessing data memory (ld/st), execute the next instruction - E.g., when an instruction is writing its result into the register file, access data memory for the next instruction # Pipelining ## Pipelining: Basic Idea - More systematically: - Pipeline the execution of multiple instructions Analogy: "Assembly line processing" of instructions – דוגמה אנלוגית: קו ייצור - Idea: - Divide the instruction processing cycle into distinct "stages" of processing - Ensure there are enough hardware resources to process one instruction in each stage - Process a different instruction in each stage - Instructions consecutive in program order are processed in consecutive stages Benefit: Increases instruction processing throughput =IPC (=1/CPI) Downside: Start thinking about this... רעיונות???? #### **Example: Execution of Four Independent ADDs** אנימציה Multi-cycle: 4 cycles per instruction Pipelined: 4 cycles per 4 instructions (steady state) ## The Laundry Analogy - "place one dirty load of clothes in the washer" - "when the washer is finished, place the wet load in the dryer" - "when the dryer is finished, take out the dry load and fold" - "when folding is finished, ask your roommate (??) to put the clothes away" - steps to do a load are sequentially dependent - no dependence between different loads - different steps do not share resources ### **Pipelining Multiple Loads of Laundry** #### Pipelining Multiple Loads of Laundry: In Practice the slowest step decides throughput #### Pipelining Multiple Loads of Laundry: In Practice throughput restored (2 loads per hour) using 2 dryers #### An Ideal Pipeline ?מה נדרש למיקבול אידאלי בצנרת - Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing) - Repetition of identical operations - ☐ The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps) - Repetition of independent operations - No dependencies between repeated operations - Uniformly partitionable suboperations חלוקה אחידה לחבילות זמן - Processing can be evenly divided into uniform-latency suboperations (that do not share resources) - Fitting examples: automobile assembly line, doing laundry - What about the instruction processing "cycle"? ## **Ideal Pipelining** ## More Realistic Pipeline: Throughput Nonpipelined version with delay T $$BW = 1/(T+S)$$ where $S =$ latch delay T=combinational logic delay S=latch delay k-stage pipelined version $$BW_{k-stage} = 1 / (T/k + S)$$ Latch delay reduces throughput (switching overhead b/w stages) $$BW_{max} = 1 / (1 \text{ gate delay} + S)$$ שואף לערך הכי גדול שאפשר k בגבול כאשר ## More Realistic Pipeline: Cost Nonpipelined version with combinational cost G → G gates G gates k-stage pipelined version $$Cost_{k-stage} = G + L*k$$ Latches increase hardware cost מה המספר האידאלי של דרגות מיקבול? חסרונות: # **Pipelining Instruction Processing** #### **Remember: The Instruction Processing Cycle** ## Remember the Single-Cycle Uarch Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.] ## **Dividing Into Stages** Is this the correct partitioning? Why not 4 or 6 stages? Why not different boundaries? #### השוואה בין multicycle Pipelined ## **Instruction Pipeline Throughput** 5-stage speedup is 4, not 5 as predicted by the ideal model. Why? תשובה: בתרשים העליון: BW=1/800, בתרשים התחתון: BW=1/200. ההאצה היא היחס ביו המצב הישו למצב החדש והמשופר: 4 #### **Enabling Pipelined Processing: Pipeline Registers** All instruction classes must follow the same path and timing through the pipeline stages. #### **Illustrating Pipeline Operation: Operation View** אנימציה #### Illustrating Pipeline Operation: Resource View אנימציה | | t <sub>o</sub> | t <sub>1</sub> | t <sub>2</sub> | t <sub>3</sub> | t <sub>4</sub> | t <sub>5</sub> | t <sub>6</sub> | t <sub>7</sub> | t <sub>8</sub> | t, | t <sub>10</sub> | |-----|----------------|----------------|----------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------| | IF | Io | I <sub>1</sub> | | <b>I</b> <sub>3</sub> | <b>I</b> <sub>4</sub> | <b>I</b> <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> | I <sub>9</sub> | I <sub>10</sub> | | ID | | I <sub>o</sub> | | | <b>I</b> <sub>3</sub> | <b>I</b> <sub>4</sub> | <b>I</b> <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> | <b>I</b> <sub>9</sub> | | EX | | | I <sub>o</sub> | I <sub>1</sub> | | <b>I</b> <sub>3</sub> | <b>I</b> <sub>4</sub> | <b>I</b> <sub>5</sub> | I <sub>6</sub> | <b>I</b> <sub>7</sub> | <b>I</b> <sub>8</sub> | | MEM | | | | I <sub>o</sub> | I <sub>1</sub> | | I <sub>3</sub> | I <sub>4</sub> | <b>I</b> <sub>5</sub> | I <sub>6</sub> | <b>I</b> <sub>7</sub> | | WB | | | | | I <sub>o</sub> | | | <b>I</b> <sub>3</sub> | <b>I</b> <sub>4</sub> | <b>I</b> <sub>5</sub> | I <sub>6</sub> | ## **Control Points in a Pipeline** ## Control Signals in a Pipeline - For a given instruction - same control signals as single-cycle, but - control signals required at different cycles, depending on stage ⇒ Option 1: decode once using the same logic as single-cycle and buffer ⇒ Option 2: carry relevant "instruction word/field" down the pipeline and decode locally within each or in a previous stage Which one is better? RegWrite בדוגמה הנוכחית נתמקד באות הבקרה Pipelined Control Signals RESERVED.1 ## Remember: An Ideal Pipeline החזון - Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing) הגדלת כמות החישובים ללא - Repetition of identical operations חזרתיות על צעדים קבועים - □ The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps) - Repetition of independent operations יכולת ביצוע פעולות שונות - □ No dependencies between repeated operations אי תלות בין הפעולות - Uniformly partitionable suboperations השהיות אחידות - □ Processing an be evenly divided into uniform-latency suboperations (that do not share resources) - Fitting examples: automobile assembly line, doing laundry - □ What about the instruction processing "cycle"? ### Instruction Pipeline: Not An Ideal Pipeline - Identical operations ... NOT! - ⇒ different instructions → not all need the same stages Forcing different instructions to go through the same pipe stages - → external fragmentation (some pipe stages idle for some instructions) - Uniform suboperations ... NOT! - $\Rightarrow$ different pipeline stages $\Rightarrow$ not the same latency Need to force each stage to be controlled by the same clock - → internal fragmentation (some pipe stages are too fast but all take the same clock cycle time) - Independent operations ... NOT! - ⇒ instructions are not independent of each other Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results → pipeline stalls (pipeline is not always moving) # Pipelining continues in next lecture slides