Stall pipeline
Predict taken
Predict not taken
Delayed branch
The simplest scheme to handle branches is to freeze
or flush the pipeline, holding
or deleting any instructions after the branch until the branch destination
is known.
Advantage: simple both to software and hardware (solution
described earlier)
A higher performance, and only slightly more complex, scheme is to predict the branch as not taken, simply allowing the hardware to continue as if the branch were not executed. Care must be taken not to change the machine state until the branch outcome is definitely known.
The complexity arises from:
we
have to know when the state might be changed by an instruction;
we
have to know how to "back out" a change.
The pipeline with this scheme implemented behaves as shown below:
| Untaken Branch Instr | IF | ID | EX | MEM | WB | ||
| Instr i+1 | IF | ID | EX | MEM | WB | ||
| Instr i+2 | IF | ID | EX | MEM | WB |
| Taken Branch Instr | IF | ID | EX | MEM | WB | |||
| Instr i+1 | IF | idle | idle | idle | idle | |||
| Branch target | IF | ID | EX | MEM | WB | |||
| Branch target+1 | IF | ID | EX | MEM | WB |
An alternative scheme is to predict the branch as taken. As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target address.
Because in DLX pipeline the target address is not known any
earlier than the branch outcome, there is no advantage in this approach.
In some machines where the target address is known before the branch
outcome a predict-taken scheme might make sense.
In a delayed branch, the execution cycle with a branch delay of length n is
Branch instrSequential successors are in the branch-delay slots. These instructions are executed whether or not the branch is taken.
sequential successor 1
sequential successor 2
. . . . .
sequential successor n
Branch target if taken
The pipeline behavior of the DLX pipeline, which has one branch delay
slot is shown below:
| Untaken branch instr | IF | ID | EX | MEM | WB | ||||
| Branch delay instr(i+1) | IF | ID | EX | MEM | WB | ||||
| Instr i+2 | IF | ID | EX | MEM | WB | ||||
| Instr i+3 | IF | ID | EX | MEM | WB | ||||
| Instr i+4 | IF | ID | EX | MEM | WB |
| Taken branch instr | IF | ID | EX | MEM | WB | ||||
| Branch delay instr(i+1) | IF | ID | EX | MEM | WB | ||||
| Branch target | IF | ID | EX | MEM | WB | ||||
| Branch target+1 | IF | ID | EX | MEM | WB | ||||
| Branch target+2 | IF | ID | EX | MEM | WB |
The job of the compiler is to make the successor instructions valid
and useful.
We will show three branch-scheduling schemes:
From before branch
From target
From fall through
|
Scheduling the branch-delay slot.
The left box in each pair shows the code before scheduling, the right box shows the scheduled code In (a) the delay slot is scheduled with an independent instruction from before the branch. This is the best choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences for (b) and (c), the use of R1 in the branch condition prevents the ADD instruction (whose destination is R1) from being moved after the branch. In (b) the branch-delay slot is scheduled from the target of the branch; usually the target instruction will need to be copied because it can be reached by another path. In (c) the branch-delay slot is scheduled from the not-taken fall through To make this optimazation legal for (b) and (c), it must be OK to execute the SUB instruction when the branch goes in the unexpected direction. OK means that the work might be wasted but the program will still execute correctly. |
| Scheduling strategy | Requirements | When Improves Performance |
| From before branch | Branch must not depend on the rescheduled instructions | Always |
| From target | Must be OK to execute rescheduled instructions if branch is not taken | When branch is taken. May enlarge program if instructions are duplicated |
| From fall though | Must be OK to execute instructions if branch is taken | When branch is not taken |
The limitations on delayed-branch scheduling arise from
the
restrictions on the instructions
that are scheduled into the delay slots and
our
ability to predict
at compile time whether a branch is likely to be taken or
not.
To improve the ability of the compiler to fill branch delay slots, most
machines with conditional branches have introduced a cancelling
branch. In a cancelling branch the instruction includes
the direction that the branch was predicted.
- if the branch behaves as predicted, the instruction in the branch
delay slot is fully executed;
- if the branch is incorrectly predicted, the instruction in the delay
slot is turned into no-op(idle).
The behavior of a predicted-taken cancelling
branch depends on whether the branch is taken or not:
| Untaken branch instr | IF | ID | EX | MEM | WB | ||||
| Branch delay instr(i+1) | IF | ID | idle | idle | idle | ||||
| Instr i+2 | IF | ID | EX | MEM | WB | ||||
| Instr i+3 | IF | ID | EX | MEM | WB | ||||
| Instr i+4 | IF | ID | EX | MEM | WB |
| Taken branch instr | IF | ID | EX | MEM | WB | ||||
| Branch delay instr(i+1) | IF | ID | EX | MEM | WB | ||||
| Branch target | IF | ID | EX | MEM | WB | ||||
| Branch target+1 | IF | ID | EX | MEM | WB | ||||
| Branch target+2 | IF | ID | EX | MEM | WB |
The advantage of cancelling branches is that they eliminate the requirements on the instruction placed in the delay slot.
Delayed branches are an architecturally
visible feature of the pipeline. This is the source both of their advantage
- allowing the use of simple compiler scheduling to reduce branch penalties;
and
their disadvantage - exposing
an aspect of the implementation that is likely to change.