Loop Unrolling

To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline.

To avoid stalls, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
 
Latencies of FP operations used in the example
Instruction producing result Instruction using result Latency in clock cycles
FP ALU op Another FP ALU op 3
FP ALU op Store double 2
Load double FP ALU op 1
Load double Store double 0

Our example uses a simple loop that adds a scalar value to an array in memory:

for (i=1; i<=1000; i++)
   x[i] = x[i] + s;

1) The first step is to translate the above segment to DLX assembly language:
Loop: LD F0, 0(R1) ;F0 - array element
ADDD F4, F0, F2 ;add scalar in F2
SD 0(R1), F4 ;store result
SUBI R1, R1, #8 ;decrement pointer 
;8 bytes (per double)
BENZ R1, Loop ;branch R1 != zero

2) Show how the loop would look on DLX, both scheduled and unscheduled, including any stalls or idle clock cycles. Schedule for both delays
 - from floating-point operations and
 - from the delayed branches.
 
Without any scheduling  Scheduled
      Cycles
Loop: LD F0, 0(R1) 1
  stall 2
  ADDD F4, F0,F2 3
  stall 4
  stall 5
  SD 0(R1), F4 6
  SUBI R1, R1,#8 7
  BENZ R1, Loop 8
  stall 9
 
       Cycles
Loop: LD F0, 0(R1) 1
  stall 2
  ADDD F4, F0, F2 3
  SUBI R1, R1, #8 4
  BENZ R1,Loop 5 ;delayed branch
  SD 8(R1), F4 6 ;altered and interchanged with SUBI
 
9 clock cycles per element 6 clock cycles per element

3) Show the loop unrolled (scheduled and unscheduled) so that there are 4 copies of the loop body, assuming R1 is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. Eliminate any obviously redundant computations, and do not reuse any of the registers.
 
Without any scheduling  Scheduled
 
Loop: LD F0, 0(R1) 1
  stall 2
ADDD F4, F0, F2 3
stall 4
stall 5
SD 0(R1), F4 6 ;drop SUBI &BNEZ 
LD F6, -8(R1) 7
stall 8
ADDD F8, F6, F2 9
stall 10
stall 11
SD -8(R1), F8 12 ;drop SUBI &BNEZ
LD F10,-16(R1) 13
stall 14
ADDD F12,F10,F2 15
stall 16
stall 17
SD -16(R1), F12 18 ;drop SUBI &BNEZ
LD F14,-24(R1) 19
stall 20
ADDD F16,F14,F2 21
stall 22
stall 23
SD -24(R1),F16 24
SUBI R1, R1, #32 25
BENZ R1, Loop 26
stall 27
 
Loop: LD F0, 0(R1) 1
LD F6, -8(R1) 2
LD F10,-16(R1) 3
LD F14,-24(R1) 4
ADDD F4, F0, F2 5
ADDD F8, F6, F2 6
ADDD F8, F6, F2 7
ADDD F16, F14, F2 8
SD 0(R1), F4 9
SD -8(R1), F8 10
SD -16(R1), F12 11
SUBI R1, R1, #32 12
BENZ R1, Loop 13
SD 8(R1), F16 14 ;8-32=-24
 
27 clock cycles per iteration;
27/4 = 6.8 clock cycles per element
14 clock cycles per iteration;
14/4 = 3.5 clock cycles per element
 

Summary

To obtain the final unrolled code we had to make the following decisions and transformations:
Determine that it was legal to move the SD instruction after the SUBI and BNEZ, and find the amount to adjust the SD offset.
Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for loop maintenance code.
Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.
Eliminate the extra tests and branches and adjust loop maintenance code.
Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing memory addresses and finding that they do not refer to the same address!
Schedule the code, preserving any dependencies needed to yield the same result as the original code.