About

This post is a small writeup addressed to programmers who are interested in learning more about how GPUs handle branching; it is intended as an introduction to the topic. I recommend skimming through [1], [2], [17] and [8] to get an idea of what the GPU execution model looks like in general, because here we're going to take a closer look at one particular detail. A curious reader will find all references at the end of the post. If you find any mistakes please reach out.

Vocabulary

  • GPU - Graphics processing unit
  • Flynn’s taxonomy
    • SIMD - Single instruction multiple data
    • SIMT - Single instruction multiple threads
  • SIMD wave - Thread executing in SIMD mode
  • Lane - designated data stream in SIMD model
  • SMT - Simultaneous multi-threading (Intel Hyper-threading)[2]
    • Multiple threads share computational resources of a core
  • IMT - Interleaved multi-threading[2]
    • Multiple threads share computational resources of a core but only one executes per clock
  • BB - Basic Block - linear sequence of instructions with only one jump at the end
  • ILP - Instruction Level Parallelism[3]
  • ISA - Instruction Set Architecture
  • SLM - Shared Local Memory (group shared memory)

Throughout this post I refer to the following fictional taxonomy, which approximates how a modern GPU is organized.

Hardware:
GPU  -+
      |- core 0 -+
      |          |- wave 0 +
      |          |         |- lane 0
      |          |         |- lane 1
      |          |         |- ...
      |          |         +- lane Q-1
      |          |
      |          |- ...
      |          +- wave M-1
      |            
      |- ...
      +- core N-1

* ALU - SIMD ALU unit for short

Software:
group +
      |- thread 0
      |- ...
      +- thread N-1

Other names:

  • core could be CU, SM, EU
  • wave could be wavefront, HW thread, warp, context
  • lane could be SW thread

What is so special about a GPU core compared to a CPU core?

Any current-generation GPU core is less beefy than what you may encounter in the CPU world: simple ILP/multi-issue[6] and prefetch[5], no speculation or branch/return prediction. All of this, coupled with tiny caches, frees up quite a lot of die area, which gets filled with more cores. Memory load/store machinery is able to handle bandwidths an order of magnitude larger than those of a typical CPU (not true for integrated/mobile GPUs), at the cost of higher latency. The GPU employs SMT[2] to hide this latency: while one wave is stalled, another utilizes the free compute resources of the core. Because of the high memory access latency, the GPU keeps all contexts in registers; spills are supported but should be avoided at all cost. Typically the number of waves handled by one core depends on the number of registers used and is determined dynamically by allocating out of a fixed register file[8]. Instruction scheduling is a hybrid of dynamic and static[6] [11 4.4]. SMT cores execute in SIMD mode, yielding a high number of FLOPS.
From the software point of view, the programmer usually writes a kernel in some high-level language which does not expose HW details like SIMD width or instruction scheduling. However, kernels typically get compiled into SIMD instructions where each lane corresponds to one software thread, so the actual hardware thread (wave) executes N software threads in lockstep. Different ISAs expose their SIMD nature differently: some use explicit SIMD instructions (Intel SSE/AVX and Intel Gen), some use implicit ones (AMD GCN, Nvidia PTX).
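
To make the lockstep picture concrete, here is a minimal C++ sketch of how one wave executes a wave-wide instruction stream; the 8-lane width and the register names are arbitrary choices for illustration, not any real HW:

#include <array>
#include <cstdint>

constexpr int WAVE_WIDTH = 8; // arbitrary width for illustration

int main() {
    // One slot per lane: conceptually a vector register of the wave
    std::array<uint32_t, WAVE_WIDTH> r0{}, r1{};

    // Each instruction is issued once by the wave and applies to every lane.
    // mov r0, lane_id
    for (int lane = 0; lane < WAVE_WIDTH; ++lane)
        r0[lane] = static_cast<uint32_t>(lane);
    // and r1, r0, 1
    for (int lane = 0; lane < WAVE_WIDTH; ++lane)
        r1[lane] = r0[lane] & 1u;
    return 0;
}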

Illustrations of GPU core

Diagram color-state coding

Figure 1. Execution history 4:2

Made with my toy GPU simulator [9] (not user friendly).
The image shows the history of the execution mask: the x axis is time, left to right, at one clock per pixel; the y axis is one lane state per pixel, with SIMD32 waves stacked top to bottom. If it does not make sense to you, please return to it after reading the next sections.
This is an illustration of how the execution history of a GPU core might look for a fictional configuration: four waves share one sampler and two ALU units. The wave scheduler dispatches two instructions from two waves each cycle. When a wave stalls on a memory access or a long ALU operation, the scheduler switches to another pair of waves, keeping the ALU units almost 100% busy.

Figure 2. Execution history 4:1

This is the same workload but this time only one wave issues instructions each cycle. Note how the second ALU is starving.

Figure 3. Execution history 4:4

This time four instructions are issued each cycle. Note that the ALUs are oversubscribed in this case, so two waves idle almost all the time (actually it's a pitfall of the scheduling algorithm).
Update Read more about scheduling challenges[12].

Real-world GPUs have different configurations per core: some may have up to 40 waves per core and 4 ALUs, some have a fixed 7 waves and 2 ALUs. It all depends on a variety of factors and is determined through a thorough architecture simulation process. Also, real SIMD ALUs may be narrower than the waves they serve; it then takes multiple cycles to process one issued instruction, and the multiplier is called the 'chime' length[3]. For example, a SIMD32 wave executed on an 8-wide ALU takes 4 cycles per instruction, a chime of 4.

What is coherence/divergence?

Let's look at the following kernel:

Example 1
uint lane_id = get_lane_id();
if (lane_id & 1) {
    // Do smth
}
// Do some more

Here we see an instruction stream where the execution path depends on the id of the lane being executed. Clearly, different lanes have different values, so what should happen? There are different approaches to tackling this problem [4], but eventually they all do approximately the same thing. One such approach is the execution mask, which I will focus on. It is employed by pre-Volta Nvidia and AMD GCN GPUs. The core of the execution mask approach is that we keep a bit for each lane within the wave. If a lane has its execution bit set to 0, no registers will be touched for that lane by the next issued instruction; effectively, the lane should not feel the impact of any executed instruction as long as its execution bit is 0. The way it works is that a wave traverses the control flow graph in depth-first order, keeping a history of branches taken until no bits are set. I think it's better to follow an example.
So let's say we have waves of width 8. This is how the execution mask history will look for the kernel:

Example 1. Execution mask history
                                  // execution mask
uint lane_id = get_lane_id();     // 11111111
if (lane_id & 1) {                // 11111111
    // Do smth                    // 01010101
}
// Do some more                   // 11111111
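
To make the "no registers are touched for inactive lanes" part concrete, a single masked instruction can be modeled roughly like this (a C++ sketch; the mask layout and the masked_add helper are made up for illustration):

#include <array>
#include <cstdint>

constexpr int WAVE_WIDTH = 8;
using VReg = std::array<uint32_t, WAVE_WIDTH>;

// One issued "add dst, a, b": only lanes whose execution bit is 1
// get their destination slot written; the rest keep their old value.
void masked_add(VReg &dst, const VReg &a, const VReg &b, uint8_t exec_mask) {
    for (int lane = 0; lane < WAVE_WIDTH; ++lane)
        if ((exec_mask >> lane) & 1u)
            dst[lane] = a[lane] + b[lane];
}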

Now, take a look at more complicated examples:

Example 2
uint lane_id = get_lane_id();
for (uint i = lane_id; i < 16; i++) {
    // Do smth
}
Example 3
uint lane_id = get_lane_id();
if (lane_id < 16) {
    // Do smth
} else {
    // Do smth else
}

You’ll notice that some history is needed. With the execution mask approach, the HW usually employs some kind of stack. A naive approach is to keep a stack of (exec_mask, address) tuples and add reconvergence instructions that pop a mask from the stack and change the instruction pointer of the wave. That way a wave has enough information to traverse the whole CFG for each lane.
From a performance point of view, it takes a couple of cycles just to process a control flow instruction because of all the bookkeeping. And don’t forget that the stack has a limited depth.
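
Here is a minimal sketch of that naive scheme: a stack of (exec_mask, reconvergence address) tuples driven by push_mask/pop_mask/mask_nz style operations (the names follow my toy ISA below, not any real HW):

#include <cstdint>
#include <utility>
#include <vector>

struct WaveState {
    uint64_t exec_mask = ~0ull; // one bit per lane, all lanes active initially
    uint32_t pc = 0;            // wave-wide instruction pointer
    std::vector<std::pair<uint64_t, uint32_t>> mask_stack; // (saved mask, reconvergence address)

    // push_mask RECONVERGE: remember the current mask and where to reconverge
    void push_mask(uint32_t reconverge_pc) {
        mask_stack.push_back({exec_mask, reconverge_pc});
    }
    // pop_mask: restore the saved mask and jump to the reconvergence point
    void pop_mask() {
        exec_mask = mask_stack.back().first;
        pc = mask_stack.back().second;
        mask_stack.pop_back();
    }
    // mask_nz cond: keep only lanes whose condition is non-zero;
    // if no lane is left, reconverge immediately
    void mask_nz(uint64_t cond_bits) {
        exec_mask &= cond_bits;
        if (exec_mask == 0) pop_mask();
    }
};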
Update By courtesy of @craigkolb I’ve read [13], in which it is noted that AMD GCN fork/join instructions select the path with the fewer number of threads first [11 4.6], which guarantees that a mask stack of log2 depth is enough.
Update Apparently it’s almost always possible to inline everything/structurize CFGs in a shader and therefore keep all execution mask history in registers and schedule CFG traversal/reconvergence statically[15]. Skimming through the LLVM backend for AMDGPU I didn’t find any evidence of stack handling ever being emitted by the compiler.

HW support for execution mask

Now take a look at these control flow graphs (image from Wikipedia):

Figure 4. Some types of control flow graphs

So what is the minimal set of mask control instructions we need to handle all cases? Here is how it looks in my toy ISA with implicit parallelization, explicit mask control and fully dynamic data hazard synchronization:

push_mask BRANCH_END         ; Push current mask and reconvergence pointer
pop_mask                     ; Pop mask and jump to reconvergence instruction
mask_nz r0.x                 ; Set execution bit, pop mask if all bits are zero

; Branch instruction is more complicated
; Push current mask for reconvergence
; Push mask for (r0.x == 0) for else block, if any lane takes the path
; Set mask with (r0.x != 0), fallback to else in case no bit is 1
br_push r0.x, ELSE, CONVERGE 

Let's take a look at how case d) might look.

A:
    br_push r0.x, C, D
B:
C:
    mask_nz r0.y
    jmp B
D:
    ret

I’m not an expert in control flow analysis or ISA design, so I’m sure there is a case that could not be tamed with my toy ISA, although it does not matter, as a structured CFG should be enough for everyone.
Update Read more on GCN support for control flow instructions in [11] ch. 4 and the LLVM implementation [15].

Bottom line:

  • Divergence - emerging difference in execution paths taken by different lanes of the same wave
  • Coherence - lack of divergence
  • HW needs extra instructions/registers/stack to handle execution mask
    • More branches - more register pressure
  • If one lane enters a branch, the whole wave executes the branch
    • It could get as bad as 1/N efficiency, e.g. a single active lane in a SIMD32 wave

Execution mask handling examples

Fictional ISA

I compiled the previous code snippets into my toy ISA and ran them on the simulator at SIMD32. Take a look at how it handles the execution mask.
Update Note that the toy simulator always selects the true path first, which is not the best method.

Example 1
; uint lane_id = get_lane_id();
    mov r0.x, lane_id
; if (lane_id & 1) {
    push_mask BRANCH_END
    and r0.y, r0.x, u(1)
    mask_nz r0.y
THEN:
    ; // Do smth
    pop_mask                ; pop mask and reconverge
BRANCH_END:
    ; // Do some more
    ret


Figure 5. Example 1 execution history

Did you notice the black area? It is wasted time: lanes that are masked off sit idle while the others execute the branch.

Example 2
; uint lane_id = get_lane_id();
    mov r0.x, lane_id
; for (uint i = lane_id; i < 16; i++) {
    push_mask LOOP_END        ; Push the current mask and the pointer to reconvergence instruction
LOOP_PROLOG:
    lt.u32 r0.y, r0.x, u(16)  ; r0.y <- r0.x < 16
    add.u32 r0.x, r0.x, u(1)  ; r0.x <- r0.x + 1
    mask_nz r0.y              ; exec bit <- r0.y != 0 - when all bits are zero next mask is popped
LOOP_BEGIN:
    ; // Do smth
    jmp LOOP_PROLOG
LOOP_END:
    ; // }
    ret


Figure 6. Example 2 execution history
Example 3
    mov r0.x, lane_id
    lt.u32 r0.y, r0.x, u(16)
    ; if (lane_id < 16) {
        ; Push (current mask, CONVERGE) and (else mask, ELSE)
        ; Also set current execution bit to r0.y != 0
    br_push r0.y, ELSE, CONVERGE
THEN:
    ; // Do smth
    pop_mask
    ; } else {
ELSE:
    ; // Do smth else
    pop_mask
    ; }
CONVERGE:
    ret


Figure 7. Example 3 execution history

AMD GCN ISA

Update GCN also uses explicit mask handling; you can read more about it here[11 4.x]. I decided it’s worth putting in some examples with their ISA, and thanks to shader-playground it is easy. Maybe some day I’ll come across a simulator and pull out some cool diagrams.
Note that the compiler is smart and you may get a different result. I tried to fool the compiler into not optimizing my branches by putting pointer chase loops in there, then cleaned up the assembly.
Also note that the S_CBRANCH_I/G_FORK and S_CBRANCH_JOIN instructions are not used in these snippets due to their simplicity/lack of compiler support, so unfortunately the mask stack is not covered. If you know how to make the compiler emit stack handling, please reach out.
Update Watch this talk by @SiNGUL4RiTY about the implementation of vectorized control flow in the LLVM backend employed by AMD.

Example 1
; uint lane_id = get_lane_id();
; GCN uses 64 wave width, so lane_id = thread_id & 63
; There are scalar s* and vector v* registers
; Execution mask does not affect scalar or branch instructions
    v_mov_b32     v1, 0x00000400      ; 1024 - group size
    v_mad_u32_u24  v0, s12, v1, v0    ; thread_id calculation
    v_and_b32     v1, 63, v0
; if (lane_id & 1) {
    v_and_b32     v2, 1, v0
    s_mov_b64     s[0:1], exec        ; Save the execution mask
    v_cmpx_ne_u32  exec, v2, 0        ; Set the execution bit
    s_cbranch_execz  ELSE             ; Jmp if all exec bits are zero
; // Do smth
ELSE:
; }
; // Do some more
    s_mov_b64     exec, s[0:1]        ; Restore the execution mask
    s_endpgm
Example 2
; uint lane_id = get_lane_id();
    v_mov_b32     v1, 0x00000400
    v_mad_u32_u24  v0, s8, v1, v0     ; Not sure why s8 this time and not s12
    v_and_b32     v1, 63, v0
; LOOP PROLOG
    s_mov_b64     s[0:1], exec        ; Save the execution mask
    v_mov_b32     v2, v1
    v_cmp_le_u32  vcc, 16, v1
    s_andn2_b64   exec, exec, vcc     ; Set the execution bit
    s_cbranch_execz  LOOP_END         ; Jmp if all exec bits are zero
; for (uint i = lane_id; i < 16; i++) {
LOOP_BEGIN:
    ; // Do smth
    v_add_u32     v2, 1, v2
    v_cmp_le_u32  vcc, 16, v2
    s_andn2_b64   exec, exec, vcc     ; Mask out lanes which are beyond loop limit
    s_cbranch_execnz  LOOP_BEGIN      ; Jmp if non zero exec mask
LOOP_END:
    ; // }
    s_mov_b64     exec, s[0:1]        ; Restore the execution mask
    s_endpgm
Example 3
; uint lane_id = get_lane_id();
    v_mov_b32     v1, 0x00000400
    v_mad_u32_u24  v0, s12, v1, v0
    v_and_b32     v1, 63, v0
    v_and_b32     v2, 1, v0
    s_mov_b64     s[0:1], exec        ; Save the execution mask
; if (lane_id < 16) {
    v_cmpx_lt_u32  exec, v1, 16       ; Set the execution bit
    s_cbranch_execz  ELSE             ; Jmp if all exec bits are zero
; // Do smth
; } else {
ELSE:
    s_andn2_b64   exec, s[0:1], exec  ; Inverse the mask and & with previous
    s_cbranch_execz  CONVERGE         ; Jmp if all exec bits are zero
; // Do smth else
; }
CONVERGE:
    s_mov_b64     exec, s[0:1]        ; Restore the execution mask
; // Do some more
    s_endpgm

AVX512

Update @tom_forsyth pointed out that the AVX512 extension comes with explicit mask handling too, so here are some examples. You can read more about it in [14], sections 15.x and 15.6.1. It’s not precisely a GPU, but still a legit SIMD16 at 32 bits. Snippets were made using godbolt’s ISPC (–target=avx512knl-i32x16) and tampered with heavily.

Example 1
    ; Imagine zmm0 contains 16 lane_ids
    ; AVX512 comes with k0-k7 mask registers
    ; Usage:
    ; op reg1 {k[7:0]}, reg2, reg3
    ; k0 can not be used as a predicate operand, only k1-k7
; if (lane_id & 1) {
    vpslld       zmm0 {k1}, zmm0, 31  ; zmm0[i] = zmm0[i] << 31
    kmovw        eax, k1              ; Save the execution mask
    vptestmd     k1 {k1}, zmm0, zmm0  ; k1[i] = zmm0[i] != 0
    kortestw     k1, k1
    je           ELSE                 ; Jmp if all exec bits are zero
; // Do smth
    ; Now k1 contains the execution mask
    ; We can use it like this:
    ; vmovdqa32 zmm1 {k1}, zmm0
ELSE:
; }
    kmovw        k1, eax              ; Restore the execution mask
; // Do some more
    ret
Example 2
 ; Imagine zmm0 contains 16 lane_ids
    kmovw         eax, k1               ; Save the execution mask
    vpcmpltud     k1 {k1}, zmm0, 16     ; k1[i] = zmm0[i] < 16
    kortestw      k1, k1
    je            LOOP_END              ; Jmp if all exec bits are zero
    vpternlogd    zmm1 {k1}, zmm1, zmm1, 255   ; zmm1[i] = -1
; for (uint i = lane_id; i < 16; i++) {
LOOP_BEGIN:
; // Do smth
    vpsubd        zmm0 {k1}, zmm0, zmm1 ; zmm0[i] = zmm0[i] + 1
    vpcmpltud     k1 {k1}, zmm0, 16     ; masked k1[i] = zmm0[i] < 16
    kortestw      k1, k1
    jne           LOOP_BEGIN            ; Jmp if exec mask is non-zero
LOOP_END:
; // }
    kmovw        k1, eax                ; Restore the execution mask
; // Do some more
    ret
Example 3
 ; Imagine zmm0 contains 16 lane_ids
; if (lane_id & 1) {
    vpslld       zmm0 {k1}, zmm0, 31  ; zmm0[i] = zmm0[i] << 31
    kmovw        eax, k1              ; Save the execution mask
    vptestmd     k1 {k1}, zmm0, zmm0  ; k1[i] = zmm0[i] != 0
    kortestw     k1, k1
    je           ELSE                 ; Jmp if all exec bits are zero
THEN:
; // Do smth
; } else {
ELSE:
    kmovw        ebx, k1
    andn         ebx, ebx, eax
    kmovw        k1, ebx              ; mask = ~mask & old_mask
    kortestw     k1, k1
    je           CONVERGE             ; Jmp if all exec bits are zero
; // Do smth else
; }
CONVERGE:
    kmovw        k1, eax              ; Restore the execution mask
; // Do some more
    ret

How to fight divergence?

I tried to come up with a simple yet complete illustration of the inefficiency introduced by combining divergent lanes.
Imagine a simple kernel like this:

uint thread_id = get_thread_id();
uint iter_count = memory[thread_id];
for (uint i = 0; i < iter_count; i++) {
    // Do smth
}

It may resemble a ray marcher, where rays that narrowly miss a surface take many iterations and rays that hit or miss outright take few.
Let’s spawn 256 threads and measure the duration:

Figure 8. Divergent threads execution time

The x axis is the SW thread id, the y axis is clock cycles; the different bars show how much time is wasted by grouping threads into waves of different widths, compared to single-threaded execution.
The execution time of a wave is equal to the maximum execution time among its lanes. You can see that the performance is already ruined at SIMD8; further widening just makes it slightly worse.
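
The model behind these numbers is simple: a wave is busy for as long as its slowest lane. Here is a rough C++ sketch of it (the iteration counts are random stand-ins, not the data used for the figures):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Total cycles when threads are packed into waves of width `simd_width`:
// each wave runs as long as its slowest lane.
uint64_t wave_cycles(const std::vector<uint32_t> &iters, size_t simd_width) {
    uint64_t total = 0;
    for (size_t i = 0; i < iters.size(); i += simd_width) {
        size_t end = std::min(i + simd_width, iters.size());
        total += *std::max_element(iters.begin() + i, iters.begin() + end);
    }
    return total;
}

int main() {
    std::mt19937 rng(42);
    std::vector<uint32_t> iters(256);
    for (auto &x : iters) x = rng() % 100; // stand-in for memory[thread_id]

    uint64_t ideal = std::accumulate(iters.begin(), iters.end(), 0ull);
    printf("SIMD32 divergent: %llu cycles (ideal %llu)\n",
           (unsigned long long)wave_cycles(iters, 32), (unsigned long long)ideal);

    std::sort(iters.begin(), iters.end()); // 'coherent' version: sort by iteration count
    printf("SIMD32 sorted:    %llu cycles\n",
           (unsigned long long)wave_cycles(iters, 32));
    return 0;
}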

Figure 9. Coherent threads execution time

This figure shows the same bars, but this time the iteration counts are sorted by thread id, so that threads with similar iteration counts get dispatched to the same wave.
For this example the potential speedup is around 2x.

Bottom line:

  • Sort input data

For example, if you are writing a ray tracer, grouping rays with similar direction and position could be beneficial because they are likely to traverse the same nodes in the BVH. For more details please follow [10] and related articles.

  • Keep CFG simple

If your CFG is complex, it typically makes sense to split the kernel and classify your data. For example, in a deferred shading pass you could detect tiles of the offscreen buffer that contain some expensive material and spawn pixel shaders separately for those tiles instead of running a full-screen uber shader.

It’s worth mentioning that there are techniques to grapple with divergence at the HW level; some of them are Dynamic Warp Formation[7] and predicated execution for small branches.

Divergent memory access

Update Jun 26, 2019 Added a few thoughts on memory access in divergent control flow. There’s more speculation going on here than usual, so if you spot any errors please reach out.
On a SIMD architecture every load is a gather and every store is a scatter. A memory operation is generated for each active lane when a store/load instruction is issued; typically, if inactive lanes have invalid addresses in the corresponding address slots, no exception is generated.
Memory coalescing machinery takes care of optimizing apparent access patterns to global memory, which is one of the benefits of gather loads. If you access SLM, you may hit the bank collision issue instead.
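As an illustration of the bank collision issue, here is a rough sketch that, given one SLM byte address per lane, counts how many lanes land on the same bank (assuming 32 banks of 4-byte words, a common but not universal configuration):

#include <algorithm>
#include <array>
#include <cstdint>
#include <cstdio>

constexpr int WAVE_WIDTH = 32;
constexpr int NUM_BANKS  = 32; // assumption: 32 banks, each 4 bytes wide

// Returns the worst-case number of lanes mapped to a single bank;
// the access gets serialized roughly that many times.
int max_bank_conflicts(const std::array<uint32_t, WAVE_WIDTH> &byte_addr) {
    std::array<int, NUM_BANKS> hits{};
    for (uint32_t addr : byte_addr)
        hits[(addr / 4) % NUM_BANKS]++;
    int worst = 0;
    for (int h : hits) worst = std::max(worst, h);
    return worst;
}

int main() {
    std::array<uint32_t, WAVE_WIDTH> linear, strided;
    for (int lane = 0; lane < WAVE_WIDTH; ++lane) {
        linear[lane]  = lane * 4;   // lane i -> bank i: conflict free
        strided[lane] = lane * 128; // stride of 32 words: every lane hits bank 0
    }
    printf("linear : %d-way\n", max_bank_conflicts(linear));
    printf("strided: %d-way\n", max_bank_conflicts(strided));
    return 0;
}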

Sample in a branch

Now, what about this 10000 pound elephant in the room? Things indeed might get hairy if you try to sample a texture inside a branch. In particular, if you are sampling in a pixel shader and use anisotropic/trilinear filtering: those kinds of features depend on HW gradients, which require that all lanes participating in a 2x2 pixel group have valid arguments. The way it works is that HW packs adjacent 2x2 pixel groups into the same wave. The consequence is that a 1-pixel triangle is going to spawn 4 pixels and at least 1 wave; the same happens with lone pixels on triangle boundaries. The invisible pixels are called helpers. Some helper pixels are going to have their barycentric coordinates outside the triangle (not sure what happens there, reach out if you do).
A good read on the subject is DirectX-Specs 16.8.2 Restrictions on Derivative Calculations. In short, the spec allows such samples if the texture address is a shader input or a statically indexed constant. This makes sense because even if the HW does not know how to handle divergent samples, the compiler can just hoist that sample out of the branch and eliminate the issue completely. I expect other APIs to enforce similar behaviour.
It’s worth mentioning that the number of in-flight memory requests being served is limited by the HW in some way. So if you have a kernel with tons of samples or memory loads, it might stall just trying to issue those requests, which means that some amount of interleaving of ALU instructions and memory requests is needed to avoid such stalls. But fear not: it’s usually taken care of by the compiler, and fortunately shader memory models are quite permissive when it comes to reordering.

When to branch?

I had one use case in mind. It's purely theoretical, yet it may be interesting.
Sometimes you might be hitting the limit of in-flight requests on your HW. In that case it makes sense to reduce the number of issued samples. Imagine this example:

a = mask.Sample(sampler, uv);
b = diffuse_1.Sample(sampler, uv);
c = diffuse_2.Sample(sampler, uv);
pixel = b * a.x + c * (1.0 - a.x);

Basically we sample a mask and then use it to blend two textures. It could be terrain rendering, where we mix grass and ground textures. The mask could look like this:

Figure 10. Example of a texture mask containing interleaved spots of constant value

Notice that the mask consists of big islands of constant values with a little bit of blending on the boundaries.
If our HW is going to stall on the issue limit, we are going to pay more than one sample's worth of latency.
Now take a look at this transformed example:

a = mask.Sample(sampler, uv);
b = diffuse_1.Sample(sampler, uv);
if (a.x < 1.0) {
    c = diffuse_2.Sample(sampler, uv);
} else {
    c = float4(0.0, 0.0, 0.0, 0.0);
}
pixel = b * a.x + c * (1.0 - a.x);

If the wave already stalls on sampler oversubscription, introducing a data dependency between a and c may not make things worse (the wave has to wait for a to arrive before issuing the sample for c), and when a.x is 1.0 we are not paying for that redundant sample. For the mask depicted in Figure 10, waves skip the second sample most of the time.
This is highly HW/compiler dependent: the compiler may eliminate this branch completely. The AMD compiler, however, keeps it.
Note that this example is artificially bad; you shouldn't have so many samples that you stall on the issue limit, and if you do, something like a virtual texture cache could probably be employed [16].

Links

[1]A trip through the Graphics Pipeline

[2]Kayvon Fatahalian: PARALLEL COMPUTING

[3]Computer Architecture A Quantitative Approach

[4]Stack-less SIMT reconvergence at low cost

[5]Dissecting GPU memory hierarchy through microbenchmarking

[6]Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

[7]Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow

[8]Maurizio Cerrato: GPU Architectures

[9]Toy GPU simulator

[10]Reducing Branch Divergence in GPU Programs

[11]“Vega” Instruction Set Architecture

[12]Joshua Barczak: Simulating Shader Execution for GCN

[13]Tangent Vector: A Digression on Divergence

[14]Intel® 64 and IA-32 Architectures Software Developer’s Manual

[15]Vectorizing Divergent Control-Flow for SIMD Applications

[16]Jason Booth: Terrain Shader Generation Systems

[17]Matthäus G. Chajdas: Introduction to compute shaders

Comments

I don’t have comments so here’s a link to the wrapper tweet