During the last week I've been working on image writing, i.e. on the write_image* builtins. The current implementation is in an experimental state (which is a euphemism for hacky in this case), and only deals with the special case of a one dimensional single channel image of 32 bit pixels. Furthermore only write_imageui is supported and only for the first write-only image argument, but these are relatively easy to fix.
To be a bit more specific: the write-only image arguments are bound to RATs (Random Access Targets) aka UAVs (Unordered Access Views) by r600g. The RAT ID is the 1-based (!) index of the write-only image argument among all the write-only image arguments as is present in the kernel signature. RAT 0 is reserved for global buffers.
The libclc implementation of write_imageui simply stores the input value (only the x component for now) to the location defined by the coordinate argument (again, only the x component is used). The RAT ID gets encoded into the address space of the pointer. Currently the address space of RAT 1 is hard-coded, but it could be fixed using an intrinsic which returns a null pointer with the proper address space given the image argument. The intrinsic could be substituted with the constant pointer value by an LLVM pass.
During instruction lowering, the MEM_RAT_CACHELESS_STORE_RAW instruction will be selected in place of the abstract store instruction, in spite of the fact that the Catalyst driver uses MEM_RAT_STORE_TYPED (see the disasm below). The rationale behind this choice is that STORE_TYPED is not yet implemented in the AMDGPU LLVM backend, and I couldn't make it work in the time I was willing to spend experimenting with it. Sadly AMD's Evergreen ISA docs are pretty vague, and the exact behaviour of STORE_TYPED including its interaction with the hardware configuration of the RATs is not documented AFAIK. RAT 1 is hard-coded here too.
Example kernel:
__kernel void imgtest_basic(write_only image2d_t img, __global int *out)
{
write_imageui(img, (int2)(1, 2), (uint4)(3, 4, 5, 6));
*out = 7;
}
And it's ASM produced by Catalyst (disassembled using CodeXL):
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(13) KCACHE0(CB1:0-15)
0 x: LSHR R3.x, KC0[2].x, 2
y: MOV R0.y, (0x00000004, 5.605193857e-45f).y
z: MOV R0.z, (0x00000005, 7.006492322e-45f).z
t: MOV R0.x, (0x00000003, 4.203895393e-45f).w
1 x: MOV R1.x, (0x00000001, 1.401298464e-45f).x
y: MOV R1.y, (0x00000002, 2.802596929e-45f).y
z: MOV R1.z, 0.0f
w: MOV R0.w, (0x00000006, 8.407790786e-45f).z
t: MOV R2.x, (0x00000007, 9.809089250e-45f).w
01 MEM_RAT_STORE_TYPED: RAT(0)[R1], R0, VPM
02 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R3].x___, R2, ARRAY_SIZE(4) VPM
END_OF_PROGRAM
Note that Catalyst reserves RAT 11 for global buffers.