Thursday, 20 August 2015

Summary

Since the firm pencils down date is approaching, I'll summarize the work I've done during the summer in this post.

All of the required features mentioned in the proposal are implemented and tested: image attribute queries, 2D image reading (using a basic sampler setup), and 2D image writing. Image reading with different sampling configurations, which was an optional feature, is supported too. The following sections give an overview of the state of the project.

Image attributes

Image attributes are implemented as implicit kernel arguments. The implicit arguments are added to the kernel signature by a new LLVM pass in the AMDGPU backend. The arguments are inserted immediately after the image argument they belong to. The new LLVM pass (named AMDGPUOpenCLImageTypeLoweringPass) iterates over the kernel functions found in the opencl.kernels metadata node, and performs the modifications for each function. Furthermore, the pass substitutes calls to llvm.OpenCL.image.get.size* and llvm.OpenCL.image.get.format* (the suffix is 2d or 3d) with implicit parameter loads. These are not real intrinsics; none of them is listed in any tablegen file. These pseudo-intrinsics are used in the implementation of the OpenCL get_image* builtins (see this patch).

The presence of the implicit arguments is indicated by special type strings in the kernel argument metadata. The type strings __llvm_image_size and __llvm_image_format signal clover to add the appropriate image attribute values to the kernel input vector. For more information, see this blog post and this patch.
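The placement rule described above can be sketched in plain C. This is only an illustration, not code from the LLVM pass: the enum, the function name, and the choice to add the size argument before the format argument for every image are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical argument kinds; these are not names used by the real pass. */
enum arg_kind { ARG_OTHER, ARG_IMAGE, ARG_IMAGE_SIZE, ARG_IMAGE_FORMAT };

/* Copies the explicit signature `in` to `out`, placing an implicit size and
 * an implicit format argument right after every image argument.
 * The size-then-format order is an assumption of this sketch.
 * `out` must have room for up to 3 * n entries. Returns the new count. */
static size_t add_implicit_image_args(const enum arg_kind *in, size_t n,
                                      enum arg_kind *out)
{
    size_t count = 0;

    assert(out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[count++] = in[i];
        if (in[i] == ARG_IMAGE) {
            out[count++] = ARG_IMAGE_SIZE;
            out[count++] = ARG_IMAGE_FORMAT;
        }
    }
    return count;
}
```

With a signature of (buffer, image, buffer), the sketch produces (buffer, image, size, format, buffer), so the runtime can find the attribute slots by walking the argument list in order.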

Image reading

Clover already supported image objects and image-related API calls when I started working on image support in May; there were only a few minor problems. The first one was that clover linearized the transfer region before mapping, e.g. a 3x4x2 box was flattened into a linear transfer of size 2*slice_pitch + 4*row_pitch + 3*element_size. This is a problem for tiled GPU resources: the driver needs to know the exact region to be able to transfer the data correctly. See this patch.
Another problem is that the driver may override the transfer pitch, which was ignored by clover. This patch fixes the issue.
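Why the pitch matters can be shown with a small row-wise copy. This is a minimal sketch and not clover code: copy_rows is a hypothetical helper, and the source pitch stands in for whatever stride the driver reports for the mapped region.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies `height` rows of `row_bytes` bytes each. Source and destination may
 * use different pitches (the distance in bytes between the starts of two
 * consecutive rows), so each row is addressed with its own pitch. */
static void copy_rows(unsigned char *dst, size_t dst_pitch,
                      const unsigned char *src, size_t src_pitch,
                      size_t row_bytes, size_t height)
{
    assert(row_bytes <= dst_pitch && row_bytes <= src_pitch);
    for (size_t y = 0; y < height; y++)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, row_bytes);
}
```

If the copy assumed that the source pitch equals the row size while the driver had padded the rows, every row after the first would be read from the wrong offset.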
The final problem with clover was that it didn't upload any value for samplers. The hardware I was working with (Mobility Radeon HD 5850: a Juniper chip) uses texture fetch instructions that need the coordinate type, i.e. normalized or unnormalized, as an operand, rather than looking this information up from a 3D register as is the case with addressing and filter modes. This implies that the kernel code itself (rather than the pre-launch register state setup) may need to know whether a sampler uses normalized or unnormalized coordinates. To this end, a bitfield containing the sampler configuration is uploaded as the value of the sampler argument. See this patch for the implementation. For compile-time constant global and kernel local samplers, which are not supported yet, this wouldn't be a problem, of course.
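A sampler bitfield of this kind could look like the following sketch. The bit layout, macro names, and function names are hypothetical and chosen for illustration; the layout used by the actual patch may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout, for illustration only:
 * bit 0     normalized coordinates
 * bits 1-3  addressing mode
 * bits 4-5  filter mode */
#define SAMPLER_NORM_BIT     0x1u
#define SAMPLER_ADDR_SHIFT   1
#define SAMPLER_ADDR_MASK    0x7u
#define SAMPLER_FILTER_SHIFT 4
#define SAMPLER_FILTER_MASK  0x3u

/* Packs a sampler configuration into the value uploaded for the argument. */
static uint32_t pack_sampler(bool normalized, uint32_t addr_mode,
                             uint32_t filter)
{
    assert(addr_mode <= SAMPLER_ADDR_MASK && filter <= SAMPLER_FILTER_MASK);
    return (normalized ? SAMPLER_NORM_BIT : 0u)
         | (addr_mode << SAMPLER_ADDR_SHIFT)
         | (filter << SAMPLER_FILTER_SHIFT);
}

/* What kernel-side code would test before choosing a texture fetch variant. */
static bool sampler_is_normalized(uint32_t bits)
{
    return (bits & SAMPLER_NORM_BIT) != 0;
}
```

Keeping the coordinate type in a fixed bit lets the kernel-side code branch on a single mask, while the addressing and filter modes are still available to it if needed.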

Read-only images are implemented as sampler views in the r600g driver. Only minor modifications were needed, since texture sampler state and view setup for graphics were already present, and the compute and graphics setup of these objects is very similar. Resource IDs appropriate for compute had to be used, and the compute flag (0x2) had to be bitwise OR'd into the PKT3 headers. Unsetting the sampler views had to be handled too. See this patch.
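The header change can be sketched as follows. The macro and function names are made up for this example, and the field positions (packet type in bits 31:30, count in bits 29:16, opcode in bits 15:8) are stated here as an assumption about the type-3 packet layout; only the 0x2 compute flag comes from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed type-3 packet header layout (names are hypothetical). */
#define PKT_TYPE_3        (3u << 30)
#define PKT_COUNT(x)      (((uint32_t)(x) & 0x3FFFu) << 16)
#define PKT3_OPCODE(x)    (((uint32_t)(x) & 0xFFu) << 8)
#define PKT3_COMPUTE_FLAG 0x2u /* the compute flag mentioned above */

/* Builds a packet header; sets the compute flag when `compute` is nonzero. */
static uint32_t pkt3_header(uint32_t opcode, uint32_t count, int compute)
{
    uint32_t header;

    assert(opcode <= 0xFFu && count <= 0x3FFFu);
    header = PKT_TYPE_3 | PKT_COUNT(count) | PKT3_OPCODE(opcode);
    if (compute)
        header |= PKT3_COMPUTE_FLAG;
    return header;
}
```

The same state-setup code can then serve both graphics and compute, differing only in whether the flag is set in each header.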

The OpenCL image reading builtins were implemented in libclc using the llvm.R600.tex intrinsic. The instruction to which the intrinsic gets compiled requires the texture and sampler IDs to be immediate operands. The AMDGPUOpenCLImageTypeLoweringPass introduced to handle image attributes takes care of this problem by substituting calls to llvm.OpenCL.image.get.resource.id* and llvm.OpenCL.sampler.get.resource.id* (the suffix is 2d or 3d) with compile-time constant IDs. The libclc implementation uses these pseudo-intrinsics. Since TEX0=VTX0 is reserved for kernel arguments and TEX1=VTX1 is reserved for reading global buffers, the builtins add 2 to the image ID to obtain a TEX ID.

Image writing

Write-only images are implemented as color surfaces in the driver, and bound to RAT slots on the GPU. Similarly to texture setup, code for color surface setup already existed in mesa for graphics; however, some modifications were needed to use it in compute mode. The RAT flag and RESOURCE field of the CB_COLOR*_INFO register had to be set, the correct value had to be assigned to the CB_COLOR*_DIM register, and resource unbinding had to be handled (see this patch). There were problems accessing 2D texture RATs with linear array mode set: see this blog post for details, and this patch for a solution.

The OpenCL image writing builtins have to perform pixel format conversion according to the image format. On Evergreen hardware the MEM_RAT instruction with the STORE_TYPED flag performs just that. To make this instruction available to libclc, this patch introduces a new intrinsic called llvm.r600.rat.store.typed to LLVM. The write_image* builtins, which use the new intrinsic, are added in this patch. Since RAT0 is reserved for writing global buffers, the builtins add 1 to the image ID to obtain the RAT ID.
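The slot arithmetic used by the reading and writing builtins is small enough to write down. The sketch below is plain C with hypothetical names, not the libclc source; only the offsets (2 for TEX IDs, 1 for RAT IDs) come from this post.

```c
#include <assert.h>

/* TEX0 (kernel arguments) and TEX1 (global buffer reads) are reserved,
 * so the first read-only image lands in TEX2. */
#define FIRST_IMAGE_TEX_ID 2u
/* RAT0 (global buffer writes) is reserved, so the first write-only image
 * lands in RAT1. */
#define FIRST_IMAGE_RAT_ID 1u

/* Maps a read-only image ID to the texture resource ID used for fetches. */
static unsigned image_to_tex_id(unsigned image_id)
{
    return image_id + FIRST_IMAGE_TEX_ID;
}

/* Maps a write-only image ID to the RAT ID used for typed stores. */
static unsigned image_to_rat_id(unsigned image_id)
{
    return image_id + FIRST_IMAGE_RAT_ID;
}
```

Because the image IDs are compile-time constants after the lowering pass, both results can end up as immediate operands.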

Piglit tests

Image and sampler argument support is added to piglit's OpenCL program tester along with a few tests. See my previous blog post for details.

Missing features

There are a few missing features; some are required and some are optional according to the OpenCL 1.1 standard.

Missing required features are the following:

  • Currently the only way samplers may be specified is via kernel arguments; global and kernel local samplers are not supported.
  • Correct usage of image access qualifiers (read_only and write_only) is not enforced by LLVM. Using write_image* on read-only images or vice versa results in undefined behavior instead of a compilation error.

Missing optional features listed in the proposal:

  • 3D images are not supported.
  • Half precision float formats are not supported.

Code

My patches can be found on GitHub: