zogi GSoC 2015: June 2015

Monday 29 June 2015

Week 5 - Image writing

During the last week I've been working on image writing, i.e. on the write_image* builtins. The current implementation is in an experimental state (which is a euphemism for hacky in this case), and only deals with the special case of a one dimensional single channel image of 32 bit pixels. Furthermore only write_imageui is supported and only for the first write-only image argument, but these are relatively easy to fix.

To be a bit more specific: the write-only image arguments are bound to RATs (Random Access Targets) aka UAVs (Unordered Access Views) by r600g. The RAT ID is the 1-based (!) index of the write-only image argument among all the write-only image arguments as is present in the kernel signature. RAT 0 is reserved for global buffers.

The libclc implementation of write_imageui simply stores the input value (only the x component for now) to the location defined by the coordinate argument (again, only the x component is used). The RAT ID gets encoded into the address space of the pointer. Currently the address space of RAT 1 is hard-coded, but it could be fixed using an intrinsic which returns a null pointer with the proper address space given the image argument. The intrinsic could be substituted with the constant pointer value by an LLVM pass.

During instruction lowering, the MEM_RAT_CACHELESS_STORE_RAW instruction will be selected in place of the abstract store instruction, in spite of the fact that the Catalyst driver uses MEM_RAT_STORE_TYPED (see the disasm below). The rationale behind this choice is that STORE_TYPED is not yet implemented in the AMDGPU LLVM backend, and I couldn't make it work in the time I was willing to spend experimenting with it. Sadly AMD's Evergreen ISA docs are pretty vague, and the exact behaviour of STORE_TYPED including its interaction with the hardware configuration of the RATs is not documented AFAIK. RAT 1 is hard-coded here too.

Example kernel:

__kernel void imgtest_basic(write_only image2d_t img, __global int *out)
{
    write_imageui(img, (int2)(1, 2), (uint4)(3, 4, 5, 6));
    *out = 7;
}

And it's ASM produced by Catalyst (disassembled using CodeXL):

; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(13) KCACHE0(CB1:0-15) 
      0  x: LSHR        R3.x,  KC0[2].x,  2      
         y: MOV         R0.y,  (0x00000004, 5.605193857e-45f).y      
         z: MOV         R0.z,  (0x00000005, 7.006492322e-45f).z      
         t: MOV         R0.x,  (0x00000003, 4.203895393e-45f).w      
      1  x: MOV         R1.x,  (0x00000001, 1.401298464e-45f).x      
         y: MOV         R1.y,  (0x00000002, 2.802596929e-45f).y      
         z: MOV         R1.z,  0.0f      
         w: MOV         R0.w,  (0x00000006, 8.407790786e-45f).z      
         t: MOV         R2.x,  (0x00000007, 9.809089250e-45f).w      
01 MEM_RAT_STORE_TYPED: RAT(0)[R1], R0,  VPM 
02 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R3].x___, R2, ARRAY_SIZE(4)  VPM 
END_OF_PROGRAM

Note that Catalyst reserves RAT 11 for global buffers.

Thursday 18 June 2015

Week 3.5 - Running late

During the previous ~1.5 weeks I've been working on finishing the requirements of milestone 1 (i.e. get_image_* builtins), which means I've greatly underestimated the time required to complete it. Fortunately I've allocated some buffer time on the last weeks of GSoC to implement some optional requirements, so I'll probably have to sacrifice some of that. Now that that's out of the way, I can talk about the current state of the get_image_* builtins along with the choices I've made while implementing these features.

The general idea is essentially the same as the one I described in my previous post: the getters have to be replaced by an LLVM pass with implicit kernel parameters loads. However the code had to be restructured because several parts were in wrong places, e.g. one should avoid patching the IR in the driver if possible. To this end the pass has been moved to LLVM.

The current implementation can be broken down to 3 parts:

builtin definitions using dummy intrinsics (libclc)
translation of dummy intrinsics to meaningful code (LLVM)
placement of image attribute data to specific locations before kernel launch (mesa)

The libclc definitions are simple functions which contain calls to the llvm.AMDGPU.get.image.[23]d dummy intrinsics. A new pass (R600ImageAttributeIntrinsicsReplacer) has been added to the AMDGPU backend of LLVM to replace calls to these dummy intrinsics with the newly added llvm.AMDGPU.read.image.attribute intrinsic. This intrinsic accepts two compile-time constant operands (each 4 bytes wide): an image index (i) and an attribute index (j). Upon instruction lowering, the intrinsic will be translated to a load from an implicit kernel parameter using the 4-byte offset 4 + 5 * i + j added to the starting location of the implicit parameters (the first four 4-bytes are used by grid dim and grid offset, and there are five 4-byte image attributes for each image). The attribute index is 0 for width, 1 for height, 2 for depth, 3 for channel data type and 4 for channel order. The image index is the index of the image argument among all image arguments. This decision has some consequences though. Namely it affects how convenient it is to prepare implicit arguments for the software component which is responsible for that.

This raises the question which component of the driver should prepare these implicit arguments? Currently clover serializes grid dim and grid offset taking care of byte extension and endian conversion, but ideally the pipe driver should make the choice how it implements mechanisms like grid dim info or image attributes. On the other hand image attributes like channel order and data type contain OpenCL specific constants, which the pipe driver stores in a different format. Adding conversion code to the driver would mean (1) adding state tracker specific code to the driver and (2) duplicating some functionality, so one has to maintain both the OpenCL to pipe format converter in clover and the pipe format to OpenCL converter in the driver.

So what implications do the choice above has on image indexing? The driver has no immediate knowledge of the order of read-only and write-only kernel arguments, since those are handled differently at driver level. If one chooses to prepare image attributes in the driver, another indexing scheme has to be introduced, e.g. by copying the attributes of write-only images first and the read-only ones after that.

Considering the complications arising from preparing image attributes in the driver, I decided to do it in clover instead in a similar fashion to how grid dim and grid offset are handled. Because of that, image attributes can be added to the kernel parameters simply in the order of the image arguments as is present in the kernel signature. This also simplifies the intrinsic replacer pass.

Patches have been sent to the relevant mailing list under the following subjects:

[Mesa-dev] [PATCH] clover: Implement image attribute getters (commit on GitHub)
[llvm] Image attribute access for the AMDGPU backend (commit on GitHub)
[Libclc-dev] [PATCH] Add image attribute getter builtins (commit on GitHub)

UPDATE: added links to commits on GitHub.

Saturday 6 June 2015

Week 2 - Milestone 1

This week I managed to implement the OpenCL image attribute getter builtins for 2D images (and partially for 3D). These builtins allow the query of image metadata - width, height, depth (for 3D), channel data type and channel order - within an OpenCL kernel. In the remainder of this post I will describe how the implementation works and how did I got there.

Back in April when I wrote my GSoC proposal, my mentor Tom Stellard suggested that examining the ISA emitted by the Catalyst driver could help in the design process, so I did just that. I used the simple OpenCL kernel below:

__kernel void imgtest_basic(read_only image2d_t img, __global int * out)
{
*out = get_image_width(img);
}

and obtained the following ISA using AMD's CodeXL on Windows with a Mobility Radeon HD 5850 (Juniper, Evergreen) GPU:

; -------- Disassembly --------------------

00 ALU: ADDR(32) CNT(3) KCACHE0(CB1:0-15)

0 x: LSHR R1.x, KC0[2].x, 2

t: MOV R0.x, KC0[0].x

01 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].x___, R0, ARRAY_SIZE(4) VPM

END_OF_PROGRAM

What I think is important (or what I managed to understand) here is that the width of the image is stored in KC0[0].x which references a constant cache. The location of the data is known at compile time, no pointer dereferencing is needed. Furthermore the value of the out pointer is stored in a similar manner - in KC0[2].x, so one may assume that kernel arguments are passed this way.
Note: global addresses are 4-byte aligned on this architecture, so the least significant 2 bits should be omitted from the load addresses. This is why a right shift (LSHR) by 2 is required to transform the address stored in the argument out to a valid load input.

Clover, a software component of mesa which manages OpenCL state (a state tracker in gallium parlance), compiles OpenCL C sources to ISA using clang and LLVM. The OpenCL builtins are implemented as an LLVM bitcode library (libclc), which gets linked to the kernel in IR form. The fact that LLVM IR represents the image[23]d_t type as an opaque pointer imposes some restrictions on the implementation of the getter builtins, because this pointer is the sole input which these builtins may use to do their thing. And you can't do much with a pointer without dereferencing it.

What all this rambling lead us is that a code transformation is required to achieve something similar to what Catalyst does. So I've implemented an LLVM pass which replaces calls to the getter builtins with accesses to hidden kernel arguments containing the image attributes. The builtins are implemented in libclc as nonexistent intrinsics to mark the places where the pass needs to perform a substitution, and also to avoid problems with mangled symbol names created because of overloading. For each explicit image argument (only 2d for now) 5 new ones are created for the 5 attributes in a specific order. The LLVM pass assumes the hidden argumens are in this order when performing the substitutions. The hidden arguments will receive their values when the image argument is bound to the execution context.

A patch is on the way, I'll link it here if it gets accepted.

UPDATE:
I got some feedback for the patches. This feature needs some work, because I've added the new LLVM pass, which implements a driver-specific behavior (i.e. passes image attributes as hidden arguments), to clover, which should manage OpenCL state in a driver-independent way. So I'll have to move this to a driver-specific place. The plan is to generate intrinsics in a similar fashion to the workitem id and size getter builtins, and lower those intrinsics in the R600 LLVM backend to reads from implicit arguments.

UPDATE #2: I've got the address-shifting-by-2 thing backwards: a _right_ shift is needed before loads. See the updated note above.