Saturday 6 June 2015

Week 2 - Milestone 1

This week I managed to implement the OpenCL image attribute getter builtins for 2D images (and partially for 3D). These builtins allow the query of image metadata - width, height, depth (for 3D), channel data type and channel order - within an OpenCL kernel. In the remainder of this post I will describe how the implementation works and how did I got there.

Back in April when I wrote my GSoC proposal, my mentor Tom Stellard suggested that examining the ISA emitted by the Catalyst driver could help in the design process, so I did just that. I used the simple OpenCL kernel below:

__kernel void imgtest_basic(read_only image2d_t img, __global int * out)
{
*out = get_image_width(img);
}

and obtained the following ISA using AMD's CodeXL on Windows with a Mobility Radeon HD 5850 (Juniper, Evergreen) GPU:

; --------  Disassembly --------------------

00 ALU: ADDR(32) CNT(3) KCACHE0(CB1:0-15) 

      0  x: LSHR        R1.x,  KC0[2].x,  2      

         t: MOV         R0.x,  KC0[0].x      

01 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].x___, R0, ARRAY_SIZE(4)  VPM 

END_OF_PROGRAM

What I think is important (or what I managed to understand) here is that the width of the image is stored in KC0[0].x which references a constant cache. The location of the data is known at compile time, no pointer dereferencing is needed. Furthermore the value of the out pointer is stored in a similar manner - in KC0[2].x, so one may assume that kernel arguments are passed this way.
Note: global addresses are 4-byte aligned on this architecture, so the least significant 2 bits should be omitted from the load addresses. This is why a right shift (LSHR) by 2 is required to transform the address stored in the argument out to a valid load input.

Clover, a software component of mesa which manages OpenCL state (a state tracker in gallium parlance), compiles OpenCL C sources to ISA using clang and LLVM. The OpenCL builtins are implemented as an LLVM bitcode library (libclc), which gets linked to the kernel in IR form. The fact that LLVM IR represents the image[23]d_t type as an opaque pointer imposes some restrictions on the implementation of the getter builtins, because this pointer is the sole input which these builtins may use to do their thing. And you can't do much with a pointer without dereferencing it.

What all this rambling lead us is that a code transformation is required to achieve something similar to what Catalyst does. So I've implemented an LLVM pass which replaces calls to the getter builtins with accesses to hidden kernel arguments containing the image attributes. The builtins are implemented in libclc as nonexistent intrinsics to mark the places where the pass needs to perform a substitution, and also to avoid problems with mangled symbol names created because of overloading. For each explicit image argument (only 2d for now) 5 new ones are created for the 5 attributes in a specific order. The LLVM pass assumes the hidden argumens are in this order when performing the substitutions. The hidden arguments will receive their values when the image argument is bound to the execution context.

A patch is on the way, I'll link it here if it gets accepted.

UPDATE:
I got some feedback for the patches. This feature needs some work, because I've added the new LLVM pass, which implements a driver-specific behavior (i.e. passes image attributes as hidden arguments), to clover, which should manage OpenCL state in a driver-independent way. So I'll have to move this to a driver-specific place. The plan is to generate intrinsics in a similar fashion to the workitem id and size getter builtins, and lower those intrinsics in the R600 LLVM backend to reads from implicit arguments.

UPDATE #2: I've got the address-shifting-by-2 thing backwards: a _right_ shift is needed before loads. See the updated note above.