Control hazards can cause a greater performance loss for DLX pipeline than data hazards. When a branch is executed, it may or may not change the PC (program counter) to something other than its current value plus 4. If a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken.
If instruction i is a taken branch, then the PC is normally not changed until the end of MEM stage, after the completion of the address calculation and comparison (see diagram).
The simplest method
of dealing with branches is to stall
the pipeline as soon as the branch is detected until we reach the MEM
stage, which determines the new PC. The pipeline behavior looks
like :
| Branch | IF | ID | EX | MEM | WB | |||||
| Branch successor | IF(stall) | stall | stall | IF | ID | EX | MEM | WB | ||
| Branch successor+1 | IF | ID | EX | MEM | WB |
This control hazards stall must be implemented differently from a data hazard, since the IF cycle of the instruction following the branch must be repeated as soon as we know the branch outcome. Thus, the first IF cycle is essentially a stall (because it never performs useful work), which comes to total 3 stalls.
Three clock cycles wasted for every branch is a significant loss. With a 30% branch frequency and an ideal CPI of 1, the machine with branch stalls achieves only half the ideal speedup from pipelining!
The number of clock cycles can be reduced by two steps:
Both steps should be taken as early in the pipeline as possible.Find out whether the branch is taken or not taken earlier in the pipeline;
Compute the taken PC (i.e., the address of the branch target) earlier.
By moving the zero test into the ID stage, it is possible to know if the branch is taken at the end of the ID cycle. Computing the branch target address during ID requires an additional adder, because the main ALU, which has been used for this function so far, is not usable until EX.
The revised datapath :
With this datapath we will need only one-clock-cycle stall on branches.
| Branch | IF | ID | EX | MEM | WB | ||
| Branch successor | IF(stall) | IF | ID | EX | MEM | WB |
In some machines, branch hazards are even more expensive in clock cycles. For example, a machine with separate decode and register fetch stages will probably have a branch delay - the length of the control hazard - that is at least one clock cycle longer. The branch delay, unless it is dealt with, turns into a branch penalty. Many older machines that implement more complex instruction sets have branch delays of four clock cycles or more.
In general, the deeper the pipeline, the worse the branch penalty in clock cycles.
There are many methods for dealing with the pipeline stalls caused by
branch delay. Further we will discuss four simple branch
prediction schemes.Then we will look at more powerful compile-time
scheme such as loop unrolling that reduces
the frequency of loop branches.