Thursday 20 August 2015

Summary

Since the firm pencils down date is approaching, I'll summarize the work I've done during the summer in this post.

All of the required features mentioned in the proposal are implemented and tested: image attribute queries, 2D image reading (using a basic sampler setup) and 2D image writing. Image reading using different sampling configurations, which was an optional feature, is supported too. The following sections give an overview of the state of the project.

Image attributes

Image attributes are implemented as implicit kernel arguments. The implicit arguments are added to the kernel signature by a new LLVM pass in the AMDGPU backend. The arguments are inserted immediately after the image argument they belong to. The new LLVM pass (named AMDGPUOpenCLImageTypeLoweringPass) iterates over the kernel functions found in the opencl.kernels metadata node, and performs the modifications for each function. Furthermore, the pass substitutes calls to llvm.OpenCL.image.get.size* and llvm.OpenCL.image.get.format* (the suffix is 2d or 3d) with implicit parameter loads. These are not real intrinsics; none of them is listed in any tablegen file. These pseudo-intrinsics are used in the implementation of the OpenCL get_image* builtins (see this patch).

The presence of the implicit arguments is indicated by special type strings in the kernel argument metadata. The type strings __llvm_image_size and __llvm_image_format signal clover to add the appropriate image attribute values to the kernel input vector. For more information, see this blog post and this patch.

Image reading

Clover already supported image objects and image-related API calls when I started working on image support in May. There were only a few minor problems. The first one is that clover linearized the transfer region before mapping, e.g. mapping a 3x4x2 box got flattened into a linear transfer of size 2*slice_pitch + 4*row_pitch + 3*element_size. This is a problem in the case of tiled GPU resources: the driver needs to know the exact region to be able to transfer the data correctly. See this patch.
Another problem is that the driver may override the transfer pitch, which was ignored by clover. This patch fixes the issue.
The final problem with clover was that it didn't upload any value for samplers. The hardware I was working with (Mobility Radeon HD 5850: a Juniper chip) uses texture fetch instructions that need the coordinate type, i.e. normalized or unnormalized, as an operand, rather than looking this information up from a 3D register as is the case with addressing and filter modes. This implies that the kernel code itself (rather than the pre-launch register state setup) may need to know whether a sampler uses normalized or unnormalized coordinates. To this end, a bitfield containing the sampler configuration is uploaded as the value of the sampler argument. See this patch for the implementation. For compile-time constant global and kernel local samplers, which are not supported yet, this wouldn't be a problem, of course.

Read-only images are implemented as sampler views in the r600g driver. Only minor modifications were needed, since texture sampler state and view setup for graphics was already present, and the compute and graphics setup of these objects is very similar. Resource IDs appropriate for compute had to be used, and the compute flag (0x2) had to be bitwise OR'd into the PKT3 headers. Unsetting the sampler views had to be handled too. See this patch.

The OpenCL image reading builtins were implemented in libclc using the llvm.R600.tex intrinsic. The instruction to which the intrinsic gets compiled requires the texture and sampler IDs to be immediate operands. The AMDGPUOpenCLImageTypeLoweringPass introduced to handle image attributes takes care of this problem by substituting calls to llvm.OpenCL.image.get.resource.id* (the suffix is 2d or 3d) and llvm.OpenCL.sampler.get.resource.id with compile-time constant IDs. The libclc implementation uses these pseudo-intrinsics. Since TEX0=VTX0 is reserved for kernel arguments and TEX1=VTX1 is reserved for reading global buffers, the builtins add 2 to the image ID to obtain a TEX ID.

Image writing

Write-only images are implemented as color surfaces in the driver, and bound to RAT slots on the GPU. Similarly to texture setup, code for color surface setup already existed in mesa for graphics, but some modifications were needed to use it in compute mode. The RAT flag and RESOURCE field of the CB_COLOR*_INFO register had to be set, a correct value had to be assigned to the CB_COLOR*_DIM register, and resource unbinding had to be handled (see this patch). There were problems accessing 2D texture RATs with linear array mode set: see this blog post for details, and this patch for a solution.

The OpenCL image writing builtins have to perform pixel format conversion according to the image format. On Evergreen hardware the MEM_RAT instruction with the STORE_TYPED flag performs just that. To make this instruction available to libclc, this patch introduces a new intrinsic called llvm.r600.rat.store.typed to LLVM. The write_image* builtins, which use the new intrinsic, are added in this patch. Since RAT0 is reserved for writing global buffers, the builtins add 1 to the image ID to obtain the RAT ID.
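
To make the slot arithmetic from the last two sections concrete, here is a minimal host-side C sketch of the mapping from image IDs to TEX and RAT IDs. The enum and function names are made up for illustration; only the reserved slots and the +2 / +1 offsets come from the text above.

```c
/* Slot layout as described above. All names here are hypothetical. */
enum {
    TEX_SLOT_KERNEL_ARGS    = 0, /* TEX0 = VTX0: kernel arguments */
    TEX_SLOT_GLOBAL_BUFFERS = 1, /* TEX1 = VTX1: global buffer reads */
    TEX_SLOT_FIRST_IMAGE    = 2, /* first read-only image */
    RAT_SLOT_GLOBAL_BUFFERS = 0, /* RAT0: global buffer writes */
    RAT_SLOT_FIRST_IMAGE    = 1  /* first write-only image */
};

/* image_id: 0-based compile-time constant ID of a read-only image. */
static inline unsigned sketch_tex_id_for_image(unsigned image_id)
{
    return image_id + TEX_SLOT_FIRST_IMAGE;
}

/* image_id: 0-based compile-time constant ID of a write-only image. */
static inline unsigned sketch_rat_id_for_image(unsigned image_id)
{
    return image_id + RAT_SLOT_FIRST_IMAGE;
}
```

With these offsets the first read-only image ends up at TEX2 and the first write-only image at RAT1.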

Piglit tests

Image and sampler argument support is added to piglit's OpenCL program tester along with a few tests. See my previous blog post for details.

Missing features

There are a few missing features; some of them are required and some are optional according to the OpenCL 1.1 standard.

Missing required features are the following:

  • Currently the only way samplers may be specified is via kernel arguments; global and kernel local samplers are not supported.
  • Correct usage of image access qualifiers (read_only and write_only) is not enforced by llvm. Using write_image* on read-only images or vice versa results in undefined behavior instead of a compilation error.

Missing optional features listed in the proposal:

  • 3D images are not supported.
  • Half precision float formats are not supported.

Code

My patches can be found on GitHub:

Wednesday 19 August 2015

Week 12 - Piglit tests

I spent the last week before soft pencils down on implementing OpenCL image and sampler type support for piglit's program tester, and adding tests to check image builtins. The configuration parser had to be modified to accept image and sampler arguments. The syntax of the new argument types is as follows:

  • Image argument:
    (arg_in|arg_out) argument_index image pixel_type
        (values|random|repeat values)
        type (2d|3d)
        image_width image_width
        image_height image_height
        image_channel_order image_channel_order
        image_channel_data_type image_channel_data_type
        [tolerance (tolerance|ulp ulp)]
    
  • Sampler argument:
    (arg_in|arg_out) argument_index sampler
        normalized_coords (true|false)
        addressing_mode (none|clamp_to_edge|repeat|mirrored_repeat)
        filter_mode (nearest|linear)
    
Currently only 2d arguments are supported. The channel order and data type may take any of the appropriate OpenCL constant names without the CL_ prefix (e.g. image_channel_order RGBA). For examples see the following tests:

I've submitted the changes to piglit, but also uploaded them to a GitHub repo.

In the next post I'm going to summarize the new mesa, llvm and libclc features implemented during the summer.

Monday 10 August 2015

Week 11 - 2D Image reading and resource management

Last week I modified the existing r600 sampler state setup code so that compute shaders can use it, and fixed some resource management issues. Image reading now works using CL_INTENSITY with CL_FLOAT and CL_RGBA with CL_UNSIGNED_INT8 formats (these are the ones I've tried). The former was tested using both nearest and linear filtering modes.

About sampler state setup: clover now uploads the sampler bitfield as the sampler argument (commit) to allow the libclc implementation to branch on sampler fields, particularly whether it uses normalized coordinates (read_image* builtins). OpenCL C constants have been added to the clc headers (image_defines.h).
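
To illustrate the idea, here is a minimal C sketch of packing a sampler configuration into a bitfield on the host side and branching on it on the kernel side. The bit layout and all names are hypothetical and chosen only for this example; the real constants are the ones in image_defines.h and may well differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define SKETCH_NORMALIZED_COORDS (1u << 0)
#define SKETCH_ADDRESS_SHIFT     1
#define SKETCH_ADDRESS_MASK      (0x7u << SKETCH_ADDRESS_SHIFT)
#define SKETCH_FILTER_LINEAR     (1u << 4)

enum sketch_coord_type { COORD_UNNORMALIZED, COORD_NORMALIZED };

/* Host side: build the value uploaded as the sampler argument. */
static inline uint32_t sketch_pack_sampler(bool normalized,
                                           unsigned addressing,
                                           bool linear)
{
    uint32_t bits = 0;
    if (normalized)
        bits |= SKETCH_NORMALIZED_COORDS;
    bits |= ((uint32_t)addressing << SKETCH_ADDRESS_SHIFT) & SKETCH_ADDRESS_MASK;
    if (linear)
        bits |= SKETCH_FILTER_LINEAR;
    return bits;
}

/* Kernel side: the texture fetch needs the coordinate type as an
 * operand, so the builtin has to branch on the uploaded bitfield. */
static inline enum sketch_coord_type sketch_coord_type_for(uint32_t sampler)
{
    return (sampler & SKETCH_NORMALIZED_COORDS) ? COORD_NORMALIZED
                                                : COORD_UNNORMALIZED;
}
```

Addressing and filter modes are carried in the bitfield too, but as noted in the summary the hardware takes those from register state, so only the coordinate type has to be decoded in the kernel.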

About the resource management issues: I've noticed valgrind errors when running my test kernel [1]. Valgrind detected reads from already freed memory, like this one:

==13999== Invalid read of size 2
==13999==    at 0x4C2ED06: memcpy@@GLIBC_2.14 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13999==    by 0xB3A74B0: radeon_emit_array (radeon_winsys.h:680)
==13999==    by 0xB3AD368: evergreen_emit_sampler_views (evergreen_state.c:2047)
==13999==    by 0xB3AD4F8: evergreen_emit_cs_sampler_views (evergreen_state.c:2085)
[...]
==13999==  Address 0xa8f382c is 76 bytes inside a block of size 112 free'd
==13999==    at 0x4C2B200: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13999==    by 0xB3DF2A1: r600_sampler_view_destroy (r600_state_common.c:368)
==13999==    by 0x504A8DA: clover::resource::unbind_sampler_view(clover::command_queue&, pipe_sampler_view*) (resource.cpp:96)
==13999==    by 0x5029A7D: clover::kernel::image_rd_argument::unbind(clover::kernel::exec_context&) (kernel.cpp:529)
==13999==    by 0x5028808: clover::kernel::exec_context::unbind() (kernel.cpp:235)
[...]

Furthermore, there were other valgrind errors regarding RAT setup of write-only images. After asking around on mesa-dev and messing with the code using gdb, it became clear that the memory used by compute resources is managed by clover, and shouldn't be freed inside the driver. The lifetime of textures is bound to the lifetime of the mem object; the driver-side data describing surfaces and sampler views is created before and destroyed after kernel launch. These actions are all initiated by clover.

To avoid code duplication, it is beneficial to use already existing graphics code when possible. However, graphics code contains a reference counting resource management scheme, which interferes with clover if used during compute setup: this was the problem causing the valgrind errors.

The following changes contain the read-only and write-only image resource setup which avoids the errors mentioned above:

Piglit tests are still on my TODO list, now both for image reading and writing. The OpenCL test runner itself has to be modified to be able to accept image input.

Furthermore, currently the only way to supply samplers to the kernels is to pass them as an argument. Implementing global and kernel local constant samplers is another TODO.

UPDATE: Actually one may still use the reference counting mechanism of the driver as long as care is taken to prevent the refcount ever reaching 0 inside the driver.
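
The point of the update can be shown with a toy reference-counting model in C. This is only a sketch under my own assumptions: the struct and function names are invented, and it sets a flag instead of calling free() so the outcome can be inspected. It is not the actual mesa code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a driver-side object such as a sampler view. */
struct sketch_view {
    int  refcount;
    bool destroyed; /* set where the real code would free the object */
};

/* The owner (clover in this model) holds the initial reference. */
static inline void sketch_view_init(struct sketch_view *v)
{
    v->refcount = 1;
    v->destroyed = false;
}

static inline void sketch_view_ref(struct sketch_view *v)
{
    assert(!v->destroyed);
    v->refcount++;
}

static inline void sketch_view_unref(struct sketch_view *v)
{
    assert(!v->destroyed && v->refcount > 0);
    if (--v->refcount == 0)
        v->destroyed = true;
}

/* Driver-side bind/unbind reusing the refcounted graphics path. */
static inline void sketch_driver_bind(struct sketch_view *v)   { sketch_view_ref(v); }
static inline void sketch_driver_unbind(struct sketch_view *v) { sketch_view_unref(v); }
```

As long as the owner keeps its own reference across bind and unbind, the count inside the driver never drops to 0, and the object is only destroyed when the owner releases it.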


[1] Test kernel. The kernel is supplied with image arguments such that a[4] evaluates to 0x3f.


// img1: CLK_INTENSITY, CLK_FLOAT
// img2: CLK_RGBA, CLK_UNSIGNED_INT8
// img3: CLK_RGBA, CLK_UNSIGNED_INT8
__kernel void imgtest(read_only image2d_t img1,
                      read_only image2d_t img2,
                      write_only image2d_t img3,
                      sampler_t s1, sampler_t s2,
                      __global int * a, __global float * b)
{
    int i = get_global_id(0);
    int j = get_global_id(1);

    // Test read_imagef
    if (i == j && i < 10) {
        float x = (0.5f + i*0.1f) / get_image_width(img1);
        b[i] = read_imagef(img1, s1, (float2)(x, 0.f)).x;
    }

    // Test read_imageui
    if (i == 0 && j == 0) {
        uint4 c = read_imageui(img2, s2, (int2)(1, 2));
        a[5] = c.x;
        a[6] = c.y;
        a[7] = c.z;
        a[8] = c.w;
    }

    // Test write_imageui
    int k = 100 * i + j;
    write_imageui(img3, (int2)(i, j), (uint4)(k & 0xff, k >> 8, 0, 0));

    // Test attribute getters
    a[0] = get_image_width(img1);
    a[1] = get_image_height(img1);
    a[2] = get_image_width(img2);
    a[3] = get_image_height(img2);
    // Should evaluate to 63
    a[4] = (get_image_channel_order(img1) == CLK_INTENSITY) |
           (get_image_channel_data_type(img1) == CLK_FLOAT) << 1 |
           (get_image_channel_order(img2) == CLK_RGBA) << 2 |
           (get_image_channel_data_type(img2) == CLK_UNSIGNED_INT8) << 3 |
           (get_image_channel_order(img3) == CLK_RGBA) << 4 |
           (get_image_channel_data_type(img3) == CLK_UNSIGNED_INT8) << 5;
}

Monday 3 August 2015

Week 10 - 2D image writing

Last week I implemented 2D image writing. As I've mentioned earlier, the Catalyst driver compiles MEM_RAT STORE_TYPED from the write_image* functions, since this instruction performs format conversion if the RAT is configured correctly. Previously I couldn't configure the RATs correctly, but last week I managed to make it work.

On the llvm side, the STORE_TYPED instruction has been added to the AMDGPU backend along with the new llvm.r600.rat.write.typed intrinsic (commit). The write_image* functions in libclc can simply use the new intrinsic (commit).

The RAT configuration in r600g consists of setting up the RAT and RESOURCE fields of the CB_COLOR*_INFO registers. For some reason, the CB_COLOR*_DIM registers weren't set correctly, so this had to be added too. See this commit.

There was one unexpected problem though: on my hardware the LINEAR_ALIGNED array mode doesn't work well with the TEXTURE_2D resource type in the case of RATs, again for an unknown reason. More precisely, the written data appeared at the wrong locations. My previous attempt to use STORE_TYPED did not work because the driver always chose the LINEAR_ALIGNED array mode, even for images. My solution/workaround for this is to force a tiled array mode on texture compute resources for r600g hardware (r600, r700, evergreen, northern islands). See this commit.

Along with the RAT configuration in the r600g driver, a few minor changes had to be introduced to clover too. One such change is about mapping GPU resources to a CPU-accessible location. The transfer region is a potentially multi-dimensional (2 or 3) box that was previously flattened to a linear offset and size. This information is insufficient for tiled textures: the driver has to know the region dimensions. See this commit for details.
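
A small C sketch shows why the flattened size alone is not enough. It uses the linear size formula quoted in the summary post (depth*slice_pitch + height*row_pitch + width*element_size); the struct names and the pitch values used in the example are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper types; extents in pixels/rows/slices, pitches in bytes. */
struct sketch_box    { size_t width, height, depth; };
struct sketch_layout { size_t element_size, row_pitch, slice_pitch; };

/* Linear transfer size as quoted in the summary post for a 3x4x2 box:
 * 2*slice_pitch + 4*row_pitch + 3*element_size. */
static inline size_t sketch_flattened_size(struct sketch_box b,
                                           struct sketch_layout l)
{
    return b.depth * l.slice_pitch
         + b.height * l.row_pitch
         + b.width * l.element_size;
}
```

For example, with 4-byte elements, a 64-byte row pitch and a 1024-byte slice pitch, both a 3x4x2 box and a 19x3x2 box flatten to 2316 bytes, so the driver cannot recover the box shape from the size.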

Another problem was that upon transfer the driver may force a specific row and slice pitch for the mapped data, and this information was ignored by clover. This behaviour was correct for linear buffers, but caused problems for tiled ones. See this commit.

One particular TODO is to add piglit tests to check image writing functionality.

Monday 27 July 2015

Week 9 - A different approach for image attributes

Previously image attributes were passed as implicit parameters located at the end of the kernel input vector: clover appended the image metadata regardless of the target on which the kernel would run. It would be beneficial to allow clover more flexibility, since some targets may already have a means of querying image dimensions or format, and uploading the attributes as implicit parameters is unnecessary for them. To this end, the implicit arguments holding the image attributes are now added to the function signature at IR level immediately after the image argument to which they belong. Calls to attribute getter functions are replaced by reads from these new arguments. This transformation is implemented as an IR pass in the AMDGPU target. Since llvm function signatures can't be changed directly, the kernel functions are recreated with the new signature. The pass also handles image and (partially) sampler resource IDs.

Let's look at the implementation of the pass in a bit more detail. The pass is based on the opencl.kernels named metadata node; no changes are made to modules without proper metadata, and only the kernels listed under the opencl.kernels node are transformed.

For each function found in the node, the kernel_arg_type metadata is scanned for image2d_t and image3d_t argument types. For each of those kernel arguments, two new arguments are added immediately following the image argument: one of type <3 x i32> for the image size (width, height and depth), and one of type <2 x i32> for the image format (channel data type and channel order). The kernel function is recreated with the new signature and the body is copied over. Metadata is added to the new arguments; the implicit args are marked by the types __llvm_image_size and __llvm_image_format.
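
The effect on the argument list can be modelled in a few lines of plain C operating on the kernel_arg_type strings. This is only a sketch of the signature rewrite, not the LLVM pass itself; the function names are hypothetical, and I assume the size argument precedes the format argument, following the order in which they are listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Type strings marking the implicit arguments, as named in the text. */
#define IMAGE_SIZE_TYPE   "__llvm_image_size"   /* <3 x i32>: width, height, depth */
#define IMAGE_FORMAT_TYPE "__llvm_image_format" /* <2 x i32>: data type, order */

static inline int sketch_is_image_type(const char *type)
{
    return strcmp(type, "image2d_t") == 0 || strcmp(type, "image3d_t") == 0;
}

/* Copy the argument types from in[0..n) to out, inserting the two
 * implicit arguments right after every image argument.
 * out must have room for out_cap entries (3 * n is always enough).
 * Returns the number of entries written. */
static inline size_t sketch_add_implicit_args(const char *const *in, size_t n,
                                              const char **out, size_t out_cap)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        assert(k < out_cap);
        out[k++] = in[i];
        if (sketch_is_image_type(in[i])) {
            assert(k + 2 <= out_cap);
            out[k++] = IMAGE_SIZE_TYPE;
            out[k++] = IMAGE_FORMAT_TYPE;
        }
    }
    return k;
}
```

For a kernel taking (image2d_t, sampler_t, int*), the rewritten list becomes (image2d_t, __llvm_image_size, __llvm_image_format, sampler_t, int*).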

After the function with implicit arguments is in place, the uses of image and sampler arguments are scanned for the following function calls:

  • llvm.OpenCL.image.get.size*
  • llvm.OpenCL.image.get.format*
  • llvm.OpenCL.image.get.resource.id*
  • llvm.OpenCL.sampler.get.resource.id

The stars in the list above indicate that a different getter function has to be used for 2d and 3d images because of the type difference.

The resource IDs are determined by the index of the argument value within the kernel signature. Image arguments are grouped by access qualifier to read_only and write_only groups; the resource ID is the argument index within the group. E.g. in case of
__kernel void foo(read_only image2d_t a, read_only image3d_t b, write_only image2d_t c, write_only image3d_t d)
the resource ID of a and c is 0, and for b and d is 1. The resource IDs of sampler_t arguments are calculated similarly. Samplers declared as module globals or kernel locals are not handled yet.
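
The numbering scheme can be sketched in C as one counter per group. The enum and function names are hypothetical, and giving samplers their own counter is my reading of "calculated similarly" rather than something spelled out above.

```c
#include <stddef.h>

enum sketch_arg_kind {
    ARG_OTHER,
    ARG_IMAGE_READ_ONLY,
    ARG_IMAGE_WRITE_ONLY,
    ARG_SAMPLER
};

/* For each kernel argument, write its resource ID to ids[i]: the index
 * of the argument within its own group. Arguments that are neither
 * images nor samplers get -1. */
static inline void sketch_assign_resource_ids(const enum sketch_arg_kind *kinds,
                                              size_t n, int *ids)
{
    int next_ro = 0, next_wo = 0, next_sampler = 0;
    for (size_t i = 0; i < n; i++) {
        switch (kinds[i]) {
        case ARG_IMAGE_READ_ONLY:  ids[i] = next_ro++;      break;
        case ARG_IMAGE_WRITE_ONLY: ids[i] = next_wo++;      break;
        case ARG_SAMPLER:          ids[i] = next_sampler++; break;
        default:                   ids[i] = -1;             break;
        }
    }
}
```

Applied to the signature of foo above, this yields the IDs 0, 1, 0, 1 for a, b, c and d.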

Now that the implicit arguments are present at IR level with proper metadata, clover is able to look for them, and upload the image attributes only if necessary. See this commit.

Note: because kernels are recreated during the pass described above, this change is necessary in clover.

The implementation of the OpenCL builtins in libclc looks like this.

Sunday 12 July 2015

Week 7 - Image reading

This week the main focus was on image reading, while also reworking previously sent patches about image attributes. Currently I have a working prototype for reading CL_FLOAT images, although I'm not ready to push it upstream yet. The current status is summarized in the following list.

  • Mesa is now able to set up compute texture and sampler resources (code). Most of the code was already present, only initialization and emission of the atoms responsible for sampler and texture resource (sampler view) state setup had to be added. I also had my mentor's code as a starting point.
  • The prototype libclc implementation uses the llvm.R600.tex intrinsic with hardcoded texture id, sampler and coord types (code).
  • No modification of llvm is required.

The image attribute getters have undergone a few modifications. The mechanism which assigns a compile-time constant ID to the kernel image arguments, which was part of the image attribute intrinsics replacer pass, has been factored out into a separate pass (patch). Simultaneously, the image attribute replacer pass has been deleted, since the only reason for the pass was the compile-time constant ID generation. This way the attribute getters can be implemented in libclc roughly as follows (see this patch for additional details):

get_image_attribute(get_image_id(image), attribute);

An additional benefit of this approach is that the image ID pass can be extended with different kinds of IDs, e.g. resource IDs (like RAT ID) and sampler IDs.

Monday 29 June 2015

Week 5 - Image writing

During the last week I've been working on image writing, i.e. on the write_image* builtins. The current implementation is in an experimental state (which is a euphemism for hacky in this case), and only deals with the special case of a one dimensional single channel image of 32 bit pixels. Furthermore only write_imageui is supported and only for the first write-only image argument, but these are relatively easy to fix.

To be a bit more specific: the write-only image arguments are bound to RATs (Random Access Targets) aka UAVs (Unordered Access Views) by r600g. The RAT ID is the 1-based (!) index of the write-only image argument among all the write-only image arguments, in the order they appear in the kernel signature. RAT 0 is reserved for global buffers.

The libclc implementation of write_imageui simply stores the input value (only the x component for now) to the location defined by the coordinate argument (again, only the x component is used). The RAT ID gets encoded into the address space of the pointer. Currently the address space of RAT 1 is hard-coded, but it could be fixed using an intrinsic which returns a null pointer with the proper address space given the image argument. The intrinsic could be substituted with the constant pointer value by an LLVM pass.

During instruction lowering, the MEM_RAT_CACHELESS_STORE_RAW instruction will be selected in place of the abstract store instruction, in spite of the fact that the Catalyst driver uses MEM_RAT_STORE_TYPED (see the disasm below). The rationale behind this choice is that STORE_TYPED is not yet implemented in the AMDGPU LLVM backend, and I couldn't make it work in the time I was willing to spend experimenting with it. Sadly AMD's Evergreen ISA docs are pretty vague, and the exact behaviour of STORE_TYPED including its interaction with the hardware configuration of the RATs is not documented AFAIK. RAT 1 is hard-coded here too.

Example kernel:

__kernel void imgtest_basic(write_only image2d_t img, __global int *out)
{
    write_imageui(img, (int2)(1, 2), (uint4)(3, 4, 5, 6));
    *out = 7;
}
And its ASM produced by Catalyst (disassembled using CodeXL):
; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(13) KCACHE0(CB1:0-15) 
      0  x: LSHR        R3.x,  KC0[2].x,  2      
         y: MOV         R0.y,  (0x00000004, 5.605193857e-45f).y      
         z: MOV         R0.z,  (0x00000005, 7.006492322e-45f).z      
         t: MOV         R0.x,  (0x00000003, 4.203895393e-45f).w      
      1  x: MOV         R1.x,  (0x00000001, 1.401298464e-45f).x      
         y: MOV         R1.y,  (0x00000002, 2.802596929e-45f).y      
         z: MOV         R1.z,  0.0f      
         w: MOV         R0.w,  (0x00000006, 8.407790786e-45f).z      
         t: MOV         R2.x,  (0x00000007, 9.809089250e-45f).w      
01 MEM_RAT_STORE_TYPED: RAT(0)[R1], R0,  VPM 
02 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R3].x___, R2, ARRAY_SIZE(4)  VPM 
END_OF_PROGRAM
Note that Catalyst reserves RAT 11 for global buffers.