Thursday 18 June 2015

Week 3.5 - Running late

For the past ~1.5 weeks I've been working on finishing the requirements of milestone 1 (i.e. the get_image_* builtins), which means I greatly underestimated the time required to complete it. Fortunately I had allocated some buffer time in the last weeks of GSoC for optional requirements, so I'll probably have to sacrifice some of that. Now that that's out of the way, I can talk about the current state of the get_image_* builtins and the choices I've made while implementing them.

The general idea is essentially the same as the one I described in my previous post: an LLVM pass has to replace the getters with loads of implicit kernel parameters. However, the code had to be restructured because several parts were in the wrong places, e.g. one should avoid patching the IR in the driver if possible. To this end the pass has been moved to LLVM.

The current implementation can be broken down into three parts:

  • builtin definitions using dummy intrinsics (libclc)
  • translation of dummy intrinsics to meaningful code (LLVM)
  • placement of image attribute data to specific locations before kernel launch (mesa)

The libclc definitions are simple functions that call the llvm.AMDGPU.get.image.[23]d dummy intrinsics. A new pass (R600ImageAttributeIntrinsicsReplacer) has been added to the AMDGPU backend of LLVM to replace calls to these dummy intrinsics with the newly added llvm.AMDGPU.read.image.attribute intrinsic. This intrinsic accepts two compile-time constant operands (each 4 bytes wide): an image index (i) and an attribute index (j).

Upon instruction lowering, the intrinsic is translated to a load from an implicit kernel parameter at offset 4 + 5 * i + j, counted in 4-byte units from the starting location of the implicit parameters (the first four 4-byte slots are used by grid dim and grid offset, and there are five 4-byte image attributes for each image). The attribute index is 0 for width, 1 for height, 2 for depth, 3 for channel data type and 4 for channel order. The image index is the index of the image argument among all image arguments. This decision has consequences, though: it affects how convenient it is for the responsible software component to prepare the implicit arguments.
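
To make the offset arithmetic concrete, here is a minimal sketch of the 4 + 5 * i + j computation. The function and constant names are hypothetical and exist only for illustration. The byte offset assumes that the formula counts in 4-byte units, which is how I read the description above.

```python
# Layout constants taken from the description above; all names are hypothetical.
DWORD_SIZE = 4        # every implicit parameter slot is 4 bytes wide
RESERVED_DWORDS = 4   # slots used by grid dim and grid offset
ATTRS_PER_IMAGE = 5   # width, height, depth, channel data type, channel order

# Attribute indices (j) as listed in the post.
ATTR_INDEX = {
    "width": 0,
    "height": 1,
    "depth": 2,
    "channel_data_type": 3,
    "channel_order": 4,
}


def image_attr_dword_offset(image_index, attr_index):
    """Offset of one image attribute in 4-byte units: 4 + 5 * i + j."""
    if image_index < 0:
        raise ValueError("image index must be non-negative")
    if not 0 <= attr_index < ATTRS_PER_IMAGE:
        raise ValueError("attribute index must be in 0..4")
    return RESERVED_DWORDS + ATTRS_PER_IMAGE * image_index + attr_index


def image_attr_byte_offset(image_index, attr_index):
    """Same offset in bytes, assuming the formula counts 4-byte slots."""
    return DWORD_SIZE * image_attr_dword_offset(image_index, attr_index)
```

For example, the height (j = 1) of the second image (i = 1) would sit at slot 4 + 5 + 1 = 10, i.e. 40 bytes past the start of the implicit parameters.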

This raises the question: which component of the driver should prepare these implicit arguments? Currently clover serializes grid dim and grid offset, taking care of byte extension and endian conversion, but ideally the pipe driver should choose how it implements mechanisms like grid dim info or image attributes. On the other hand, image attributes like channel order and data type contain OpenCL-specific constants, which the pipe driver stores in a different format. Adding conversion code to the driver would mean (1) adding state-tracker-specific code to the driver and (2) duplicating some functionality, since one would have to maintain both the OpenCL-to-pipe format converter in clover and the pipe-format-to-OpenCL converter in the driver.

So what implications does the choice above have for image indexing? The driver has no immediate knowledge of the order of read-only and write-only kernel arguments, since those are handled differently at driver level. If one chooses to prepare image attributes in the driver, another indexing scheme has to be introduced, e.g. by copying the attributes of write-only images first and the read-only ones after that.
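
The difference between the two indexing schemes can be sketched as follows. This is only an illustration: the argument kind strings and function names are made up, and neither function corresponds to actual clover or driver code.

```python
IMAGE_KINDS = ("read_only_image", "write_only_image")


def signature_order_indices(arg_kinds):
    """Map argument position -> image index, counting images in signature order."""
    indices = {}
    for pos, kind in enumerate(arg_kinds):
        if kind in IMAGE_KINDS:
            indices[pos] = len(indices)
    return indices


def write_first_indices(arg_kinds):
    """Map argument position -> image index, write-only images first.

    This mimics the alternative scheme a driver might need when it only
    sees two separate lists (write-only and read-only images).
    """
    writes = [p for p, k in enumerate(arg_kinds) if k == "write_only_image"]
    reads = [p for p, k in enumerate(arg_kinds) if k == "read_only_image"]
    return {pos: idx for idx, pos in enumerate(writes + reads)}


# Hypothetical kernel: (read-only image, buffer, write-only image, read-only image)
args = ["read_only_image", "buffer", "write_only_image", "read_only_image"]
print(signature_order_indices(args))
print(write_first_indices(args))
```

With the second scheme the replacer pass would have to know the access qualifier of every image argument to compute the right index, whereas signature order only needs a running count.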

Considering the complications arising from preparing image attributes in the driver, I decided to do it in clover instead, in a similar fashion to how grid dim and grid offset are handled. Because of that, image attributes can simply be added to the kernel parameters in the order in which the image arguments appear in the kernel signature. This also simplifies the intrinsic replacer pass.
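
As a rough sketch of what this serialization amounts to, the following packs the implicit parameter block with the image attributes appended in signature order. It is not clover code: the function name is hypothetical, the four leading slots are treated as opaque values, and little-endian 32-bit unsigned packing is assumed. The channel constants are the standard values from the OpenCL headers.

```python
import struct

# OpenCL image format constants (values as defined in CL/cl.h).
CL_RGBA = 0x10B5
CL_UNORM_INT8 = 0x10D2


def pack_implicit_args(reserved_dwords, images):
    """Serialize the implicit parameter block as little-endian 32-bit words.

    reserved_dwords: the four slots used by grid dim and grid offset
                     (treated as opaque values in this sketch).
    images: image descriptions in kernel signature order.
    """
    if len(reserved_dwords) != 4:
        raise ValueError("expected exactly four reserved slots")
    words = list(reserved_dwords)
    for img in images:
        # Attribute order matches the attribute indices 0..4.
        words += [
            img["width"],
            img["height"],
            img["depth"],
            img["channel_data_type"],
            img["channel_order"],
        ]
    return struct.pack("<%dI" % len(words), *words)


images = [
    {"width": 64, "height": 32, "depth": 1,
     "channel_data_type": CL_UNORM_INT8, "channel_order": CL_RGBA},
    {"width": 128, "height": 16, "depth": 1,
     "channel_data_type": CL_UNORM_INT8, "channel_order": CL_RGBA},
]
blob = pack_implicit_args([0, 0, 0, 0], images)

# Read back the height of the second image (i = 1, j = 1) at slot 4 + 5 * 1 + 1.
height = struct.unpack_from("<I", blob, 4 * (4 + 5 * 1 + 1))[0]
```

Because the images are appended in signature order, the replacer pass only has to count image arguments to find the index it passes to the intrinsic.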

Patches have been sent to the relevant mailing list under the following subjects:

UPDATE: added links to commits on GitHub.