Pipeline with Multicycle Operations

It is impractical to require that all DLX floating-point operations complete in one clock cycle, or even two. Doing so would mean accepting a slow clock, using enormous amounts of logic in the floating-point units, or both. Instead, the floating-point pipeline allows a longer latency for these operations.

This is easier to grasp if we imagine the floating-point instructions as having the same pipeline as the integer instructions, with two important differences:
The EX cycle may be repeated as many times as needed to complete the operation;
There may be multiple floating-point functional units.

Let's assume that there are four separate functional units:
The main integer unit
FP and integer multiplier
FP adder (handles FP add, subtract, and conversion)
FP and integer divider

If we also assume that the execution stages of these functional units are not pipelined, then the resulting pipeline looks like:


Because only one instruction issues on every clock cycle, all instructions go through the standard pipeline for integer operations. The floating-point operations simply loop when they reach the EX stage.

In reality, the intermediate results are probably not cycled around the EX unit; instead, the EX pipeline stage takes some number of clock cycles larger than 1. We can generalize the structure of the FP pipeline to allow pipelining of some stages and multiple ongoing operations.

To describe such a pipeline we must define the latency of the functional units and the initiation interval.

Latency is defined as the number of intervening cycles between an instruction that produces a result and an instruction that uses the result.

The initiation interval is the number of cycles that must elapse between issuing two operations of a given type.

For example, we will use the latencies and initiation intervals as shown:
Functional unit                               Latency   Initiation interval
Integer ALU                                   0         1
Data memory (integer and FP loads)            1         1
FP add                                        3         1
FP multiply (also integer multiply)           6         1
FP divide (also integer divide and FP sqrt)   24        24
Integer ALU operations have a latency of 0, since the results can be used on the next clock cycle (right after EX). Loads have a latency of 1, since their results can be used after one intervening cycle (MEM).
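The two definitions above can be turned into simple arithmetic. The following is a minimal sketch, not part of the original notes: the unit names and the helper functions data_stalls and issue_stalls are hypothetical, and the sketch assumes that results are forwarded and that the consumer needs its operand at the start of its own execute stage.

```python
# Latency and initiation interval per functional unit, copied from the table above.
# The dictionary keys are illustrative names, not DLX terminology.
UNITS = {
    "int_alu": {"latency": 0, "initiation_interval": 1},
    "load": {"latency": 1, "initiation_interval": 1},
    "fp_add": {"latency": 3, "initiation_interval": 1},
    "fp_mul": {"latency": 6, "initiation_interval": 1},
    "fp_div": {"latency": 24, "initiation_interval": 24},
}


def data_stalls(producer_unit, intervening=0):
    """Stall cycles seen by a consumer that follows the producer with
    `intervening` independent instructions in between (assumes forwarding)."""
    return max(0, UNITS[producer_unit]["latency"] - intervening)


def issue_stalls(unit, cycles_since_last_issue=1):
    """Stall cycles before another operation of the same type may issue,
    given how many cycles have passed since the previous one issued."""
    return max(0, UNITS[unit]["initiation_interval"] - cycles_since_last_issue)


if __name__ == "__main__":
    print(data_stalls("fp_add"))      # consumer directly after an FP add
    print(data_stalls("fp_mul", 2))   # two independent instructions in between
    print(issue_stalls("fp_div"))     # back-to-back divides
```

Under these assumptions, a consumer placed immediately after an FP add waits for the full latency, while each independent instruction scheduled in between hides one cycle of it.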

Pipeline latency is essentially equal to one cycle less than the depth of the execution pipeline, which is the number of stages from the EX stage to the stage that produces the result. Thus, for the example pipeline above, the number of stages in an FP add is 4, while the number of stages in FP multiply is 7.

The extended pipeline is shown below:


The example pipeline structure allows up to 4 outstanding FP adds, 7 outstanding FP/integer multiplies, and 1 FP divide.

The FP multiplier and adder are fully pipelined and have a depth of 7 and 4 stages, respectively. The FP divider is not pipelined.
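The relationship between latency, pipeline depth, and the number of operations in flight can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical function names; it assumes, as the text states, that a fully pipelined unit holds one operation per stage and that a non-pipelined unit holds only one operation at a time.

```python
# Latencies of the fully pipelined units, taken from the example table.
PIPELINED_LATENCY = {"fp_add": 3, "fp_mul": 6}


def pipeline_depth(latency):
    """Stages from EX through the stage that produces the result."""
    return latency + 1


def max_outstanding(latency, fully_pipelined):
    """Operations that can be in flight in one unit at the same time."""
    return pipeline_depth(latency) if fully_pipelined else 1


if __name__ == "__main__":
    for unit, latency in PIPELINED_LATENCY.items():
        print(unit, pipeline_depth(latency), max_outstanding(latency, True))
    # The divider is not pipelined, so it holds a single operation.
    print("fp_div", max_outstanding(24, False))
```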
 
 
The pipeline timing of a set of independent FP operations:
Clock cycle  1    2    3    4    5    6    7    8    9    10   11
MULTD        IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
ADDD              IF   ID   A1   A2   A3   A4   MEM  WB
LD                     IF   ID   EX   MEM  WB
SD                          IF   ID   EX   MEM  WB
The stages in italic show where data is needed, while the stages in bold show where a result is available.
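A timing chart like the one above can be generated mechanically. The sketch below is illustrative only: the function names are hypothetical, and it assumes that one instruction is fetched per clock cycle and that the instructions are independent, so it inserts no stalls and does not check for structural hazards.

```python
# Execute-stage names per instruction; the depths follow the example pipeline.
EXEC_STAGES = {
    "MULTD": [f"M{i}" for i in range(1, 8)],  # 7 multiply stages
    "ADDD": [f"A{i}" for i in range(1, 5)],   # 4 add stages
    "LD": ["EX"],
    "SD": ["EX"],
}


def stage_sequence(op):
    """All stages an instruction passes through, in order."""
    return ["IF", "ID"] + EXEC_STAGES[op] + ["MEM", "WB"]


def timing_rows(ops):
    """One row per instruction; row i is shifted right by i clock cycles."""
    return [[""] * i + stage_sequence(op) for i, op in enumerate(ops)]


def render(ops):
    """Format the rows as fixed-width text, one clock cycle per column."""
    lines = []
    for op, row in zip(ops, timing_rows(ops)):
        cells = "".join(f"{cell:<5}" for cell in row)
        lines.append((f"{op:<7}" + cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(["MULTD", "ADDD", "LD", "SD"]))
```

Each row is just the stage sequence of the instruction, offset by its position in the issue order, which is why the longer MULTD finishes after the instructions that were fetched later.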