To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline.
To avoid stalls, a dependent instruction must be separated from the
source instruction by a distance in clock cycles equal to the pipeline
latency of that source instruction.
| Instruction producing result | Instruction using result | Latency in clock cycles |
| FP ALU op | Another FP ALU op | 3 |
| FP ALU op | Store double | 2 |
| Load double | FP ALU op | 1 |
| Load double | Store double | 0 |
Our example uses a simple loop that adds a scalar value to an array in memory:
for (i=1; i<=1000; i++)
x[i] = x[i] + s;
1) The first step is to translate the above segment to DLX assembly language:
| Loop: | LD | F0, 0(R1) | ;F0 - array element |
| ADDD | F4, F0, F2 | ;add scalar in F2 | |
| SD | 0(R1), F4 | ;store result | |
| SUBI | R1, R1, #8 | ;decrement pointer
;8 bytes (per double) |
|
| BENZ | R1, Loop | ;branch R1 != zero |
2) Show how the loop would look on DLX, both scheduled and unscheduled,
including any stalls or idle clock cycles. Schedule for both delays
- from floating-point operations and
- from the delayed branches.
| Without any scheduling | Scheduled | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 clock cycles per element | 6 clock cycles per element |
3) Show the loop unrolled (scheduled and unscheduled) so that there
are 4 copies of the loop body, assuming R1 is initially a multiple of 32,
which means that the number of loop iterations is a multiple of 4. Eliminate
any obviously redundant computations, and do not reuse any of the registers.
| Without any scheduling | Scheduled | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 27 clock cycles per iteration;
27/4 = 6.8 clock cycles per element |
14 clock cycles per iteration;
14/4 = 3.5 clock cycles per element |
Determine that it was legal to move the SD instruction after the SUBI and BNEZ, and find the amount to adjust the SD offset.
Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for loop maintenance code.
Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.
Eliminate the extra tests and branches and adjust loop maintenance code.
Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing memory addresses and finding that they do not refer to the same address!
Schedule the code, preserving any dependencies needed to yield the same result as the original code.