Monday, 29 June 2015

Week 5 - Image writing

During the last week I've been working on image writing, i.e. on the write_image* builtins. The current implementation is in an experimental state (which is a euphemism for hacky in this case), and only deals with the special case of a one dimensional single channel image of 32 bit pixels. Furthermore only write_imageui is supported and only for the first write-only image argument, but these are relatively easy to fix.

To be a bit more specific: the write-only image arguments are bound to RATs (Random Access Targets) aka UAVs (Unordered Access Views) by r600g. The RAT ID is the 1-based (!) index of the write-only image argument among all the write-only image arguments as is present in the kernel signature. RAT 0 is reserved for global buffers.

The libclc implementation of write_imageui simply stores the input value (only the x component for now) to the location defined by the coordinate argument (again, only the x component is used). The RAT ID gets encoded into the address space of the pointer. Currently the address space of RAT 1 is hard-coded, but it could be fixed using an intrinsic which returns a null pointer with the proper address space given the image argument. The intrinsic could be substituted with the constant pointer value by an LLVM pass.

During instruction lowering, the MEM_RAT_CACHELESS_STORE_RAW instruction will be selected in place of the abstract store instruction, in spite of the fact that the Catalyst driver uses MEM_RAT_STORE_TYPED (see the disasm below). The rationale behind this choice is that STORE_TYPED is not yet implemented in the AMDGPU LLVM backend, and I couldn't make it work in the time I was willing to spend experimenting with it. Sadly AMD's Evergreen ISA docs are pretty vague, and the exact behaviour of STORE_TYPED including its interaction with the hardware configuration of the RATs is not documented AFAIK. RAT 1 is hard-coded here too.

Example kernel:

__kernel void imgtest_basic(write_only image2d_t img, __global int *out)
{
    write_imageui(img, (int2)(1, 2), (uint4)(3, 4, 5, 6));
    *out = 7;
}
And it's ASM produced by Catalyst (disassembled using CodeXL):
; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(13) KCACHE0(CB1:0-15) 
      0  x: LSHR        R3.x,  KC0[2].x,  2      
         y: MOV         R0.y,  (0x00000004, 5.605193857e-45f).y      
         z: MOV         R0.z,  (0x00000005, 7.006492322e-45f).z      
         t: MOV         R0.x,  (0x00000003, 4.203895393e-45f).w      
      1  x: MOV         R1.x,  (0x00000001, 1.401298464e-45f).x      
         y: MOV         R1.y,  (0x00000002, 2.802596929e-45f).y      
         z: MOV         R1.z,  0.0f      
         w: MOV         R0.w,  (0x00000006, 8.407790786e-45f).z      
         t: MOV         R2.x,  (0x00000007, 9.809089250e-45f).w      
01 MEM_RAT_STORE_TYPED: RAT(0)[R1], R0,  VPM 
02 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R3].x___, R2, ARRAY_SIZE(4)  VPM 
END_OF_PROGRAM
Note that Catalyst reserves RAT 11 for global buffers.