Control Hazards

Control hazards can cause a greater performance loss for DLX pipeline than data hazards. When a branch is executed, it may or may not change the PC (program counter) to something other than its current value plus 4. If a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken.

If instruction i is a taken branch, then the PC is normally not changed until the end of MEM stage, after the completion  of the address calculation and comparison (see diagram).

The simplest method of dealing with branches is to stall the pipeline as soon as the branch is detected until we reach the MEM stage, which determines the new PC. The pipeline behavior looks like :
 
Branch IF ID EX MEM WB          
Branch successor   IF(stall) stall stall IF ID EX MEM WB  
Branch successor+1           IF ID EX MEM WB
The stall does not occur until after ID stage (where we know that the instruction is a branch).

This control hazards stall must be implemented differently from a data hazard, since the IF cycle of the instruction following the branch must be repeated as soon as we know the branch outcome. Thus, the first IF cycle is essentially a stall (because it never performs useful work), which comes to total 3 stalls.

Three clock cycles wasted for every branch is a significant loss. With a 30% branch frequency and an ideal CPI of 1, the machine with branch stalls achieves only half the ideal speedup from pipelining!

The number of clock cycles can be reduced by two steps:

Find out whether the branch is taken or not taken earlier in the pipeline;
Compute the taken PC (i.e., the address of the branch target) earlier.
Both steps should be taken as early in the pipeline as possible.

By moving the zero test into the ID stage, it is possible to know if the branch is taken at the end of the ID cycle. Computing the branch target address during ID requires an additional adder, because the main ALU, which has been used for this function so far, is not usable until EX.

The revised datapath :

With this datapath we will need only one-clock-cycle stall on branches.
 
Branch IF ID EX MEM WB    
Branch successor   IF(stall) IF ID EX MEM WB

In some machines, branch hazards are even more expensive in clock cycles. For example, a machine with separate decode and register fetch stages will probably have a branch delay - the length of the control hazard - that is at least one clock cycle longer. The branch delay, unless it is dealt with, turns into a branch penalty. Many older machines that implement more complex instruction sets have branch delays of four clock cycles or more.

In general, the deeper the pipeline, the worse the branch penalty in clock cycles.

There are many methods for dealing with the pipeline stalls caused by branch delay. Further we will discuss four simple branch prediction schemes.Then we will look at more powerful compile-time scheme such as loop unrolling that reduces the frequency of loop branches.