Thursday 20 August 2015

Summary

Since the firm pencils down date is approaching, I'll summarize the work I've done during the summer in this post.

All of the required features mentioned in the proposal are implemented and tested: image attribute queries, 2D image reading (using a basic sampler setup) and 2D image writing. Image reading using different sampling configurations, which was an optional feature, is supported too. The following sections give an overview of the state of the project.

Image attributes

Image attributes are implemented as implicit kernel arguments. The implicit arguments are added to the kernel signature by a new LLVM pass in the AMDGPU backend, and are inserted immediately after the image argument they belong to. The new pass (named AMDGPUOpenCLImageTypeLoweringPass) iterates over the kernel functions found in the opencl.kernels metadata node and performs the modifications for each function. Furthermore, the pass substitutes calls to llvm.OpenCL.image.get.size* and llvm.OpenCL.image.get.format* (the suffix is 2d or 3d) with implicit parameter loads. These are not real intrinsics; neither of them is listed in any tablegen file. These pseudo-intrinsics are used in the implementation of the OpenCL get_image* builtins (see this patch).

The presence of the implicit arguments is indicated by special type strings in the kernel argument metadata. The type strings __llvm_image_size and __llvm_image_format signal clover to add the appropriate image attribute values to the kernel input vector. For more information, see this blog post and this patch.
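As a rough illustration of the signature rewrite described above, the sketch below expands a list of kernel argument kinds by inserting two implicit arguments after each image argument. The enum, the function name and the size-before-format ordering are assumptions made for this example; the real pass operates on LLVM IR and kernel metadata, not on a C array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical argument kinds; not the types used by LLVM or clover. */
enum arg_kind {
    ARG_BUFFER,
    ARG_SAMPLER,
    ARG_IMAGE,
    ARG_IMAGE_SIZE,   /* implicit, stands for "__llvm_image_size" */
    ARG_IMAGE_FORMAT  /* implicit, stands for "__llvm_image_format" */
};

/*
 * Copy `in` to `out`, inserting the implicit size and format arguments
 * immediately after every image argument. Returns the new argument count.
 * `out` must have room for 3 * n entries in the worst case.
 */
size_t expand_image_args(const enum arg_kind *in, size_t n,
                         enum arg_kind *out)
{
    size_t m = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[m++] = in[i];
        if (in[i] == ARG_IMAGE) {
            /* The order of the two implicit arguments is an assumption. */
            out[m++] = ARG_IMAGE_SIZE;
            out[m++] = ARG_IMAGE_FORMAT;
        }
    }
    return m;
}
```

With this layout, a consumer that walks the expanded list can tell from the kind alone which entries it has to fill in with image attribute values.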

Image reading

Clover already supported image objects and image-related API calls when I started working on image support in May. There were only a few minor problems. The first one was that clover linearized the transfer region before mapping, e.g. mapping a 3x4x2 box got flattened into a linear transfer of size 2*slice_pitch + 4*row_pitch + 3*element_size. This is a problem in case of tiled GPU resources: the driver needs to know the exact region to be able to transfer the data correctly. See this patch.
Another problem is that the driver may override the transfer pitch, which was ignored by clover. This patch fixes the issue.
The final problem with clover was that it didn't upload any value for samplers. The hardware I was working with (Mobility Radeon HD 5850: a Juniper chip) uses texture fetch instructions that need the coordinate type, i.e. normalized or unnormalized, as an operand, rather than looking this information up from a 3D register as is the case with addressing and filter modes. This implies that the kernel code itself (rather than the pre-launch register state setup) may require the information that a sampler uses normalized or unnormalized coordinates. To this end a bitfield containing sampler configuration is uploaded as the value of the sampler argument. See this patch for the implementation. For compile-time constant global and kernel local samplers, which are not supported yet, this wouldn't be a problem of course.
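To make the idea of a sampler bitfield more concrete, here is a minimal sketch of packing a sampler configuration on the host and testing the coordinate type on the kernel side. The bit layout, the enum values and the function names are all hypothetical; they are not the layout used by clover or libclc.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen for this example only. */
#define SAMPLER_NORMALIZED_COORDS 0x1u /* bit 0 */
#define SAMPLER_ADDRESS_SHIFT     1    /* bits 1..3 */
#define SAMPLER_ADDRESS_MASK      0x7u
#define SAMPLER_FILTER_SHIFT      4    /* bit 4 */

enum address_mode {
    ADDR_NONE, ADDR_CLAMP_TO_EDGE, ADDR_CLAMP,
    ADDR_REPEAT, ADDR_MIRRORED_REPEAT
};
enum filter_mode { FILTER_NEAREST, FILTER_LINEAR };

/* Host side: pack the sampler configuration into the argument value. */
uint32_t pack_sampler(int normalized, enum address_mode addr,
                      enum filter_mode filter)
{
    assert((unsigned)addr <= SAMPLER_ADDRESS_MASK);
    return (normalized ? SAMPLER_NORMALIZED_COORDS : 0u) |
           ((uint32_t)addr << SAMPLER_ADDRESS_SHIFT) |
           ((uint32_t)filter << SAMPLER_FILTER_SHIFT);
}

/* Kernel side: a read builtin only needs to test a single bit. */
int sampler_uses_normalized_coords(uint32_t sampler)
{
    return (sampler & SAMPLER_NORMALIZED_COORDS) != 0;
}
```

The addressing and filter fields are packed as well, but on the hardware described above only the coordinate type has to be visible to the kernel code.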

Read-only images are implemented as sampler views in the r600g driver. Only minor modifications were needed, since texture sampler state and view setup for graphics was already present, and the compute and graphics setup of these objects is very similar. Resource IDs appropriate for compute had to be used, and the compute flag (0x2) had to be bitwise OR'd into the PKT3 headers. Unsetting the sampler views had to be handled too. See this patch.
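The compute flag mentioned above can be pictured with a small sketch that builds a type 3 packet header and ORs in 0x2. The field layout used here (packet type in bits 31:30, dword count in bits 29:16, opcode in bits 15:8) and the opcode value in the usage note are assumptions made for illustration; the macro and function names are hypothetical and do not come from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed header layout; see the note above. */
#define PKT_TYPE_S(x)       (((uint32_t)(x) & 0x3u) << 30)
#define PKT_COUNT_S(x)      (((uint32_t)(x) & 0x3FFFu) << 16)
#define PKT3_IT_OPCODE_S(x) (((uint32_t)(x) & 0xFFu) << 8)
#define PKT3_COMPUTE_FLAG   0x2u /* the compute flag from the text */

/* Header for a graphics packet. */
uint32_t pkt3_header(unsigned opcode, unsigned count)
{
    assert(opcode <= 0xFFu && count <= 0x3FFFu);
    return PKT_TYPE_S(3) | PKT_COUNT_S(count) | PKT3_IT_OPCODE_S(opcode);
}

/* Same header, marked as belonging to a compute dispatch. */
uint32_t pkt3_compute_header(unsigned opcode, unsigned count)
{
    return pkt3_header(opcode, count) | PKT3_COMPUTE_FLAG;
}
```

For example, with an illustrative opcode of 0x6D and a count of 7, the two functions differ only in bit 1 of the resulting header.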

The OpenCL image reading builtins were implemented in libclc using the llvm.R600.tex intrinsic. The instruction to which the intrinsic gets compiled requires the texture and sampler IDs to be immediate operands. The AMDGPUOpenCLImageTypeLoweringPass introduced to handle image attributes takes care of this problem by substituting calls to llvm.OpenCL.image.get.resource.id* and llvm.OpenCL.sampler.get.resource.id* (the suffix is 2d or 3d) with compile-time constant IDs. The libclc implementation uses these pseudo-intrinsics. Since TEX0=VTX0 is reserved for kernel arguments and TEX1=VTX1 is reserved for reading global buffers, the builtins add 2 to the image ID to obtain a TEX ID.

Image writing

Write-only images are implemented as color surfaces in the driver, and bound to RAT slots on the GPU. Similarly to texture setup, code for color surface setup already existed in mesa for graphics; however, some modifications were needed to use it in compute mode. The RAT flag and RESOURCE field of the CB_COLOR*_INFO register had to be set, a correct value had to be assigned to the CB_COLOR*_DIM register, and resource unbinding had to be handled (see this patch). There were problems accessing 2D texture RATs with linear array mode set: see this blog post for details, and this patch for a solution.

The OpenCL image writing builtins have to perform pixel format conversion according to the image format. On Evergreen hardware the MEM_RAT instruction with the STORE_TYPED flag performs just that. To make this instruction available to libclc, this patch introduces a new intrinsic called llvm.r600.rat.store.typed to LLVM. The write_image* builtins, which use the new intrinsic, are added in this patch. Since RAT0 is reserved for writing global buffers, the builtins add 1 to the image ID to obtain the RAT ID.
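The slot reservations described in the reading and writing sections boil down to two fixed offsets. The sketch below spells them out; the enum and function names are hypothetical, and how image IDs themselves are numbered is not covered here.

```c
#include <assert.h>
#include <limits.h>

/* Reserved slots as described in the text. */
enum {
    TEX_SLOT_KERNEL_ARGS   = 0, /* TEX0 = VTX0: kernel arguments */
    TEX_SLOT_GLOBAL_BUFFER = 1, /* TEX1 = VTX1: global buffer reads */
    TEX_SLOT_FIRST_IMAGE   = 2,
    RAT_SLOT_GLOBAL_BUFFER = 0, /* RAT0: global buffer writes */
    RAT_SLOT_FIRST_IMAGE   = 1
};

/* TEX ID used by the read builtins for a given image ID. */
unsigned image_id_to_tex_id(unsigned image_id)
{
    assert(image_id <= UINT_MAX - TEX_SLOT_FIRST_IMAGE);
    return image_id + TEX_SLOT_FIRST_IMAGE;
}

/* RAT ID used by the write builtins for a given image ID. */
unsigned image_id_to_rat_id(unsigned image_id)
{
    assert(image_id <= UINT_MAX - RAT_SLOT_FIRST_IMAGE);
    return image_id + RAT_SLOT_FIRST_IMAGE;
}
```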

Piglit tests

Image and sampler argument support is added to piglit's OpenCL program tester along with a few tests. See my previous blog post for details.

Missing features

There are a few missing features; some are required and some are optional according to the OpenCL 1.1 standard.

Missing required features are the following:

  • Currently the only way samplers may be specified is via kernel arguments; global and kernel local samplers are not supported.
  • Correct usage of image access qualifiers (read_only and write_only) is not enforced by llvm. Using write_image* on read-only images or vice versa results in undefined behavior instead of a compilation error.

Missing optional features listed in the proposal:

  • 3D images are not supported.
  • Half precision float formats are not supported.

Code

My patches can be found on GitHub:

Wednesday 19 August 2015

Week 12 - Piglit tests

I spent the last week before soft pencils down implementing OpenCL image and sampler type support for piglit's program tester, and adding tests to check image builtins. The configuration parser had to be modified to accept image and sampler arguments. The syntax of the new argument types is as follows:

  • Image argument:
    (arg_in|arg_out) argument_index image pixel_type
        (values|random|repeat values)
        type (2d|3d)
        image_width image_width
        image_height image_height
        image_channel_order image_channel_order
        image_channel_data_type image_channel_data_type
        [tolerance (tolerance|ulp ulp)]
    
  • Sampler argument:
    (arg_in|arg_out) argument_index sampler
        normalized_coords (true|false)
        addressing_mode (none|clamp_to_edge|repeat|mirrored_repeat)
        filter_mode (nearest|linear)
    
Currently only 2d arguments are supported. The channel order and data type may take any of the appropriate OpenCL constant names without the CL_ prefix (e.g. image_channel_order RGBA). For examples see the following tests:

I've submitted the changes to piglit, and also uploaded them to a GitHub repo.

In the next post I'm going to summarize the new mesa, llvm and libclc features implemented during the summer.

Monday 10 August 2015

Week 11 - 2D Image reading and resource management

Last week I modified the existing r600 sampler state setup code so that compute shaders can use it, and fixed some resource management issues. Image reading now works using CL_INTENSITY with CL_FLOAT and CL_RGBA with CL_UNSIGNED_INT8 formats (these are the ones I've tried). The former was tested using both nearest and linear filtering modes.

About sampler state setup: clover now uploads the sampler bitfield as the sampler argument (commit) to allow the libclc implementation to branch on sampler fields, particularly whether it uses normalized coordinates (read_image* builtins). OpenCL C constants have been added to the clc headers (image_defines.h).

About the resource management issues: I've noticed valgrind errors when running my test kernel [1]. Valgrind detected reads from already freed memory, like this one:

==13999== Invalid read of size 2
==13999==    at 0x4C2ED06: memcpy@@GLIBC_2.14 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13999==    by 0xB3A74B0: radeon_emit_array (radeon_winsys.h:680)
==13999==    by 0xB3AD368: evergreen_emit_sampler_views (evergreen_state.c:2047)
==13999==    by 0xB3AD4F8: evergreen_emit_cs_sampler_views (evergreen_state.c:2085)
[...]
==13999==  Address 0xa8f382c is 76 bytes inside a block of size 112 free'd
==13999==    at 0x4C2B200: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13999==    by 0xB3DF2A1: r600_sampler_view_destroy (r600_state_common.c:368)
==13999==    by 0x504A8DA: clover::resource::unbind_sampler_view(clover::command_queue&, pipe_sampler_view*) (resource.cpp:96)
==13999==    by 0x5029A7D: clover::kernel::image_rd_argument::unbind(clover::kernel::exec_context&) (kernel.cpp:529)
==13999==    by 0x5028808: clover::kernel::exec_context::unbind() (kernel.cpp:235)
[...]

Furthermore, there were other valgrind errors regarding RAT setup of write-only images. After asking around on mesa-dev and messing with the code using gdb, it became clear that the memory used by compute resources is managed by clover and shouldn't be freed inside the driver. The lifetime of textures is bound to the lifetime of the mem object; the driver-side data describing surfaces and sampler views is created before and destroyed after kernel launch. These actions are all initiated by clover.

To avoid code duplication, it is beneficial to use already existing graphics code when possible. However, graphics code contains a reference counting resource management scheme, which interferes with clover if used during compute setup: this was the problem causing the valgrind errors.

The following changes contain the read-only and write-only image resource setup which avoids the errors mentioned above:

Piglit tests are still on my TODO list, now both for image reading and writing. The OpenCL test runner itself has to be modified to be able to accept image input.

Furthermore, currently the only way to supply samplers to the kernels is to pass them as an argument. Implementing global and kernel local constant samplers is another TODO.

UPDATE: Actually one may still use the reference counting mechanism of the driver as long as care is taken to prevent the refcount ever reaching 0 inside the driver.
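The idea in the update can be sketched as follows: the owner keeps one reference of its own for the whole lifetime of the object, so balanced bind and unbind calls inside the driver can never bring the count down to 0. The struct and all function names are hypothetical stand-ins, not the actual clover or r600g code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a driver object such as a sampler view. */
struct view {
    int  refcount;
    bool destroyed;
};

void view_ref(struct view *v)
{
    v->refcount++;
}

/* Driver-style unreference: destroys the object when the count hits 0. */
void view_unref(struct view *v)
{
    assert(v->refcount > 0);
    if (--v->refcount == 0)
        v->destroyed = true; /* a real driver would free the memory here */
}

/* The owner (clover in the text) creates the view and keeps one reference. */
void owner_create(struct view *v)
{
    v->refcount = 1;
    v->destroyed = false;
}

/* Binding and unbinding inside the driver are balanced. */
void driver_bind(struct view *v)   { view_ref(v); }
void driver_unbind(struct view *v) { view_unref(v); }

/* Only the owner drops the last reference, after the kernel launch. */
void owner_destroy(struct view *v)
{
    view_unref(v);
}
```

In this arrangement the destruction path is still the driver's own, but it is only ever triggered by the owner.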


[1] Test kernel. The kernel is supplied with image arguments such that a[4] evaluates to 0x3f.


// img1: CLK_INTENSITY, CLK_FLOAT
// img2: CLK_RGBA, CLK_UNSIGNED_INT8
// img3: CLK_RGBA, CLK_UNSIGNED_INT8
__kernel void imgtest(read_only image2d_t img1,
                      read_only image2d_t img2,
                      write_only image2d_t img3,
                      sampler_t s1, sampler_t s2,
                      __global int * a, __global float * b)
{
    int i = get_global_id(0);
    int j = get_global_id(1);

    // Test read_imagef
    if (i == j && i < 10) {
        float x = (0.5f + i*0.1f) / get_image_width(img1);
        b[i] = read_imagef(img1, s1, (float2)(x, 0.f)).x;
    }

    // Test read_imageui
    if (i == 0 && j == 0) {
        uint4 c = read_imageui(img2, s2, (int2)(1, 2));
        a[5] = c.x;
        a[6] = c.y;
        a[7] = c.z;
        a[8] = c.w;
    }

    // Test write_imageui
    int k = 100 * i + j;
    write_imageui(img3, (int2)(i, j), (uint4)(k & 0xff, k >> 8, 0, 0));

    // Test attribute getters
    a[0] = get_image_width(img1);
    a[1] = get_image_height(img1);
    a[2] = get_image_width(img2);
    a[3] = get_image_height(img2);
    // Should evaluate to 63
    a[4] = (get_image_channel_order(img1) == CLK_INTENSITY) |
           (get_image_channel_data_type(img1) == CLK_FLOAT) << 1 |
           (get_image_channel_order(img2) == CLK_RGBA) << 2 |
           (get_image_channel_data_type(img2) == CLK_UNSIGNED_INT8) << 3 |
           (get_image_channel_order(img3) == CLK_RGBA) << 4 |
           (get_image_channel_data_type(img3) == CLK_UNSIGNED_INT8) << 5;
}

Monday 3 August 2015

Week 10 - 2D image writing

Last week I implemented 2D image writing. As I've mentioned earlier, the Catalyst driver compiles the write_image* functions to MEM_RAT STORE_TYPED, since this instruction performs format conversion if the RAT is configured correctly. Previously I couldn't configure the RATs correctly, but last week I managed to make it work.

On the LLVM side, the STORE_TYPED instruction has been added to the AMDGPU backend along with the new llvm.r600.rat.write.typed intrinsic (commit). The write_image* functions in libclc can simply use the new intrinsic (commit).

The RAT configuration in r600g consists of setting up the RAT and RESOURCE fields of the CB_COLOR*_INFO registers. For some reason, the CB_COLOR*_DIM registers weren't set correctly, so this had to be added too. See this commit.

There was one unexpected problem though: the LINEAR_ALIGNED array mode doesn't work well with the TEXTURE_2D resource type in case of RATs on my hardware, again for an unknown reason. More precisely, the written data appeared at the wrong locations. My previous attempt to use STORE_TYPED did not work because the driver always chose the LINEAR_ALIGNED array mode, even for images. My solution/workaround for this is to force a tiled array mode on texture compute resources for r600g hardware (r600, r700, evergreen, northern islands). See this commit.

Along with the RAT configuration in the r600g driver, a few minor changes had to be introduced to clover too. One such change is about mapping GPU resources to a CPU-accessible location. The transfer region is a potentially multi-dimensional (2D or 3D) box that was previously flattened to a linear offset and size. This information is insufficient for tiled textures: the driver has to know the region dimensions. See this commit for details.
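To see why an offset and a size cannot stand in for a box, consider a plain linear layout, which is the friendliest case. The sketch below compares the bytes that belong to a box with the length of the contiguous byte range running from its first to its last texel. The struct and function names are hypothetical, and this is a generic illustration rather than the computation clover performed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical box description, with origin and extent in texels. */
struct box {
    size_t x, y, z;
    size_t width, height, depth;
};

/* Bytes that actually belong to the box. */
size_t box_bytes(const struct box *b, size_t elem_size)
{
    return b->width * b->height * b->depth * elem_size;
}

/*
 * Length of the contiguous byte range from the first to the last texel
 * of the box, assuming a linear layout with the given pitches.
 */
size_t linear_span(const struct box *b, size_t elem_size,
                   size_t row_pitch, size_t slice_pitch)
{
    assert(b->width > 0 && b->height > 0 && b->depth > 0);
    return (b->depth - 1) * slice_pitch +
           (b->height - 1) * row_pitch +
           b->width * elem_size;
}
```

For a 4x3x1 box of 4-byte texels in an image with a 64-byte row pitch, the box holds 48 bytes while the span is 144 bytes, so the range also covers texels outside the box. In a tiled layout the texels of the box do not form a single range at all, which is why the driver needs the dimensions.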

Another problem was that upon transfer the driver may force a specific row and slice pitch for the mapped data, and this information was ignored by clover. This behaviour was correct for linear buffers, but caused problems for tiled ones. See this commit.
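Honouring a driver-chosen pitch usually means copying row by row instead of in one block. A minimal sketch, assuming a 2D region and hypothetical parameter names, could look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `height` rows of `row_bytes` bytes each between two buffers
 * whose rows start `src_pitch` and `dst_pitch` bytes apart.
 */
void copy_rows(unsigned char *dst, size_t dst_pitch,
               const unsigned char *src, size_t src_pitch,
               size_t row_bytes, size_t height)
{
    assert(row_bytes <= dst_pitch && row_bytes <= src_pitch);
    for (size_t y = 0; y < height; y++)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, row_bytes);
}
```

When both pitches equal the row length this degenerates to a single linear copy, which matches the observation that ignoring the pitch was harmless for linear buffers.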

One particular TODO is to add piglit tests to check image writing functionality.