zogi GSoC 2015: 2015

Thursday 20 August 2015

Summary

Since the firm pencils down date is approaching, I'll summarize the work I've done during the summer in this post.

All of the required features mentioned in the proposal are implemented and tested, that is image attribute query, 2D image reading (using basic sampler setup) and 2D image writing. Image reading using different sampling configurations is supported too, which was an optional feature. The following sections give an overview to the state of the project.

Image attributes

Image attributes are implemented as implicit kernel arguments. The implicit arguments are added to the kernel signature by a new LLVM pass in the AMDGPU backend. The arguments are inserted immediately after the image argument they belong to. The new LLVM pass (named AMDGPUOpenCLImageTypeLoweringPass) iterates over the kernel functions found in the opencl.kernels metadata node, and performs the modifications for each function. Furthermore the pass substitutes calls to llvm.OpenCL.image.get.size* and llvm.OpenCL.image.get.format* (the suffix is 2d or 3d) with implicit parameter loads. These are not real intrinsics; neither of them are listed in any tablegen file. These pseudo-intrinsics are used in the implementation of the OpenCL get_image* builtins (see this patch).

The presence of the implicit arguments are indicated by special type strings in the kernel argument metadata. The type strings __llvm_image_size and __llvm_image_format signal clover to add the appropriate image attribute values to the kernel input vector. For more information, see this blog post and this patch.

Image reading

Clover has already supported image objects and image-related API calls when I started working on image support in May. There were only a few minor problems. The first one is that clover linearized the transfer region before mapping, e.g. mapping a 3x4x2 box got flattened into a linear transfer of size 2*slice_pitch + 4*row_pitch + 3*element_size. This is a problem in case of tiled GPU resources: the driver needs to know the exact region to be able to transfer the data correctly. See this patch.
Another problem is that the driver may override the transfer pitch, which was ignored by clover. This patch fixes the issue.
The final problem with clover was that it didn't upload any value for samplers. The hardware I was working with (Mobility Radeon HD 5850: a Juniper chip) uses texture fetch instructions that need the coordinate type, i.e. normalized or unnormalized, as an operand, rather than looking this information up from a 3D register as is the case with addressing and filter modes. This implies that the kernel code itself (rather than the pre-launch register state setup) may require the information that a sampler uses normalized or unnormalized coordinates. To this end a bitfield containing sampler configuration is uploaded as the value of the sampler argument. See this patch for the implementation. For compile-time constant global and kernel local samplers, which are not supported yet, this wouldn't be a problem of course.

Read-only images are implemented as sampler views in the r600g driver. Only minor modifications were needed, since texture sampler state and view setup for graphics was already present, furthermore compute and graphics setup of these objects are very similar. Resource IDs appropriate for compute had to be used, and the compute flag (0x2) had to be bitwise or'd to the PKT3 headers. Unsetting the sampler views had to be handled too. See this patch.

The OpenCL image reading builtins were implemented in libclc using the llvm.R600.tex intrinsic. The instruction to which the intrinsic gets compiled requires the texture and sampler IDs to be immediate operands. The AMDGPUOpenCLImageTypeLoweringPass introduced to handle image attributes takes care of this problem by substituting calls to llvm.OpenCL.image.get.resource.id* and llvm.OpenCL.sampler.get.resource.id* (the suffix is 2d or 3d) with compile-time constant IDs. The libclc implementation uses these pseudo-intrinsics. Since TEX0=VTX0 is reserved for kernel arguments and TEX1=VTX1 is reserved for reading global buffers, the builtins add 2 to the image ID to obtain a TEX ID.

Image writing

Write-only images are implemented as color surfaces in the driver, and bound to RAT slots on the GPU. Similarly to texture setup, code for color surface setup already existed in mesa for graphics, however some modifications were needed to use it in compute mode. The RAT flag and RESOURCE field of the CB_COLOR*_INFO register had to be set, correct value had to be assigned for the CB_COLOR*_DIM register, and resource unbinding had to be handled (see this patch). There were problems accessing 2D texture RATs with linear array mode set: see this blog post for details, and this patch for a solution.

The OpenCL image writing builtins have to perform pixel format conversion according to the image format. On Evergreen hardware the MEM_RAT instruction with the STORE_TYPED flag performs just that. To make this instruction available to libclc, this patch introduces a new intrinsic called llvm.r600.rat.store.typed to LLVM. The write_image* builtins, which use the new intrinsic, are added in this patch. Since RAT0 is reserved for writing global buffers, the builtins add 1 to the image ID to obtain the RAT ID.

Piglit tests

Image and sampler argument support is added to piglit's OpenCL program tester along with a few test. See my previous blog post for details.

Missing features

There are a few missing features, some of which are required, some are optional according to the OpenCL 1.1 standard.

Missing required features are the following:

Currently the only way samplers may be specified is via kernel arguments; global and kernel local samplers are not supported.
Correct usage of image access qualifiers (read_only and write_only) are not enforced by llvm. Using write_image* on read-only images or vice versa results in undefined behavior instead of a compilation error.

Missing optional features listed in the proposal:

3D images are not supported.
Half precision float formats are not supported.

Code

My patches can be found on GitHub:

Wednesday 19 August 2015

Week 12 - Piglit tests

I spent the last week before soft pencils down on implementing OpenCL image and sampler type support for piglit's program tester, and adding tests to check image builtins. The configuration parser had to be modified to accept image and sampler arguments. The syntax of the new argument types are as follows:

Image argument:

(arg_in|arg_out) argument_index image pixel_type
    (values|random|repeat values)
    type (2d|3d)
    image_width image_width
    image_height image_width
    image_channel_order image_channel_order
    image_channel_data_type image_channel_data_type
    [tolerance (tolerance|ulp ulp)]

Sampler argument:

(arg_in|arg_out) argument_index sampler
    normalized_coords (true|false)
    addressing_mode (none|clamp_to_edge|repeat|mirrored_repeat)
    filter_mode (nearest|linear)

Currently only 2d arguments are supported. The channel order and data type may take any of the appropriate OpenCL constant names without the CL_ prefix (e.g. image_channel_order RGBA). For examples see the following tests:

I've sumbmitted the changes to piglit, but also uploaded it to a GitHub repo.

In the next post I'm going to summarize the new mesa, llvm and libclc features implemented during the summer.

Monday 10 August 2015

Week 11 - 2D Image reading and resource management

Previous week I've modified the existing r600 sampler state setup code so that compute shaders can use it, and fixed some resource management issues. Image reading now works using CL_INTENSITY with CL_FLOAT and CL_RGBA with CL_UNSIGNED_INT8 formats (these are ones I've tried). The former was tested using both nearest and linear filtering mode.

About sampler state setup: clover now uploads the sampler bitfield as the sampler argument (commit) to allow the libclc implementation to branch on sampler fields, particularly whether it uses normalized coordinates (read_image* builtins). OpenCL C constants have been added to the clc headers (image_defines.h).

About the resource management issues: I've noticed valgrind errors when running my test kernel [1]. Valgrind detected reads from already freed memory, like this one:

==13999== Invalid read of size 2
==13999==    at 0x4C2ED06: memcpy@@GLIBC_2.14 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13999==    by 0xB3A74B0: radeon_emit_array (radeon_winsys.h:680)
==13999==    by 0xB3AD368: evergreen_emit_sampler_views (evergreen_state.c:2047)
==13999==    by 0xB3AD4F8: evergreen_emit_cs_sampler_views (evergreen_state.c:2085)
[...]
==13999==  Address 0xa8f382c is 76 bytes inside a block of size 112 free'd
==13999==    at 0x4C2B200: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13999==    by 0xB3DF2A1: r600_sampler_view_destroy (r600_state_common.c:368)
==13999==    by 0x504A8DA: clover::resource::unbind_sampler_view(clover::command_queue&, pipe_sampler_view*) (resource.cpp:96)
==13999==    by 0x5029A7D: clover::kernel::image_rd_argument::unbind(clover::kernel::exec_context&) (kernel.cpp:529)
==13999==    by 0x5028808: clover::kernel::exec_context::unbind() (kernel.cpp:235)
[...]

Furthermore, there were other valgrind errors regarding RAT setup of write-only images. After asking around on mesa-dev and messing with the code using gdb it became clear that the memory used by compute resources are managed by clover, and shouldn't be freed inside the driver. The lifetime of textures are bound to the lifetime of the mem object; the driver-side data describing surfaces and sampler views are created before and destroyed after kernel launch. These actions are all initiated by clover.

To avoid code duplication, it is beneficial to use already existing graphics code when possible. However, graphics code contains a reference counting resource management scheme, which interferes with clover if used during compute setup: this was the problem causing the valgrind errors.

The following changes contain the read-only and write-only image resource setup which avoids the errors mentioned above:

Piglit tests are still on my TODO list, now both for image reading and writing. The OpenCL test runner itself has to be modified to be able to accept image input.

Furthermore, currently the only way to supply samplers to the kernels is to pass them as an argument. Implementing global and kernel local constant samplers is another TODO.

UPDATE: Actually one may still use the reference counting mechanism of the driver as long as care is taken to prevent the refcount ever reaching 0 inside the driver.

[1] Test kernel. The kernel is supplied with image arguments such that a[4] evaluates to 0x3f.


// img1: CLK_INTENSITY, CLK_FLOAT
// img2: CLK_RGBA, CLK_UNSIGNED_INT8
// img3: CLK_RGBA, CLK_UNSIGNED_INT8
__kernel void imgtest(read_only image2d_t img1,
                      read_only image2d_t img2,
                      write_only image2d_t img3,
                      sampler_t s1, sampler_t s2,
                      __global int * a, __global float * b)
{
    int i = get_global_id(0);
    int j = get_global_id(1);

    // Test read_imagef
    if (i == j && i < 10) {
        float x = (0.5f + i*0.1f) / get_image_width(img1);
        b[i] = read_imagef(img1, s1, (float2)(x, 0.f)).x;
    }

    // Test read_imageui
    if (i == 0 && j == 0) {
        uint4 c = read_imageui(img2, s2, (int2)(1, 2));
        a[5] = c.x;
        a[6] = c.y;
        a[7] = c.z;
        a[8] = c.w;
    }

    // Test write_imageui
    int k = 100 * i + j;
    write_imageui(img3, (int2)(i, j), (uint4)(k & 0xff, k >> 8, 0, 0));

    // Test attribute getters
    a[0] = get_image_width(img1);
    a[1] = get_image_height(img1);
    a[2] = get_image_width(img2);
    a[3] = get_image_height(img2);
    // Should evaluate to 63
    a[4] = (get_image_channel_order(img1) == CLK_INTENSITY) |
           (get_image_channel_data_type(img1) == CLK_FLOAT) << 1 |
           (get_image_channel_order(img2) == CLK_RGBA) << 2 |
           (get_image_channel_data_type(img2) == CLK_UNSIGNED_INT8) << 3 |
           (get_image_channel_order(img3) == CLK_RGBA) << 4 |
           (get_image_channel_data_type(img3) == CLK_UNSIGNED_INT8) << 5;
}
}

Monday 3 August 2015

Week 10 - 2D image writing

Last week I've implemented 2D image writing. As I've mentioned earlier, the Catalyst driver compiles MEM_RAT STORE_TYPED from the write_image* functions, since this instuction performs format conversion if the RAT is configured correctly. Previously I couldn't configure the RATs correctly, but last week I've managed to make it work.

On llvm side the STORE_TYPED instruction has been added along with the new llvm.r600.rat.write.typed intrinsic to the AMDGPU backend (commit). The write_image* functions in libclc can simply use the new intrinsic (commit).

The RAT configuration in r600g consists of setting up the RAT and RESOURCE fields of the CB_COLOR*_INFO registers. For some reason, the CB_COLOR*_DIM registers weren't set correctly, so this had to be added too. See this commit.

There was one unexpected problem though: the LINEAR_ALIGNED array mode doesn't work well with TEXTURE_2D resource type in case of RATs on my hardware, again, for an unknown reason. More precisely the location of the writes is not correct: the data written appeared at wrong locations. My previous attempt to use STORE_TYPED did not work because the driver always chose LINEAR_ALIGNED array mode even for images. My solution/workaround for this is to force a tiled array mode on texture compute resources for r600g hardware (r600, r700, evergreen, northern islands). See this commit.

Along with the RAT configuration in the r600g driver, a few minor changes had to be introduced to clover too. One such change is about mapping GPU resources to a CPU-accessible location. The transfer region is a potentially multi-dimensional (2 or 3) box, that was previously flattened to a linear offset and size. This information is insufficient for tiled textures: the driver has to know region dimensions. See this commit for details.

Another problem was that upon transfer the driver may force a specific row and slice pitch for the mapped data, and this information was ignored by clover. This behaviour was correct for linear buffers, but caused problems for tiled ones. See this commit.

One particular TODO is to add piglit tests to check image writing functionality.

Monday 27 July 2015

Week 9 - A different approach for image attributes

Previously image attributes were passed as implicit parameters located at the end of the kernel input vector: clover appended the image metadata regardless of the target on which the kernel would run. It would be beneficial to allow more flexibility to clover, since some kind of targets may already have means of querying image dimensions or format, and uploading the attributes as implicit parameters is unnecessary for them. To this end the implicit arguments holding the image attributes are now added to the function signature at IR level immediately after the image argument to which they belong. Calls to attribute getter functions are replaced by reads from these new arguments. This transformation is implemented as an IR pass in the AMDGPU target. Since llvm function signatures can't be changed directly, the kernel functions are recreated with the new signature. The pass also handles image and (partially) sampler resource IDs.

Let's look at the implementation of the pass in a bit more detail. The pass is based on the opencl.kernels named metadata node; no changes are made to modules without proper metadata, and only the kernels listed under the opencl.kernels node are transformed.

For each function found in the node, the kernel_arg_type metadata is scanned for image2d_t and image3d_t argument types. For each of those kernel arguments, two new arguments are added immediately following the image argument: one of type <3 x i32> for the image size (width, height and depth), and one of type <2 x i32> for the image format (channel data type and channel order). The kernel function is recreated with the new signature and the body is copied over. Metadata is added to the new arguments; the implicit args are marked by the types __llvm_image_size and __llvm_image_format.

After the function with implicit arguments is in place, the uses of image and sampler arguments are scanned for the following function calls:

llvm.OpenCL.image.get.size*
llvm.OpenCL.image.get.format*
llvm.OpenCL.image.get.resource.id*
llvm.OpenCL.sampler.get.resource.id

The stars in the list above indicate that different image getter function has to be used for 2d and 3d images because of the type difference.

The resource IDs are determined by the index of the argument value within the kernel signature. Image arguments are grouped by access qualifier to read_only and write_only groups; the resource ID is the argument index within the group. E.g. in case of
__kernel void foo(read_only image2d_t a, read_only image3d_t b, write_only image2d_t c, write_only image3d_t d)
the resource ID of a and c is 0, and for b and d is 1. The resource IDs of sampler_t arguments are calculated similarly. Samplers declared as module globals or kernel locals are not handled yet.

Now, that the implicit arguments are present at IR level with proper metadata, clover is able to look for them, and upload the image attributes only if necessary. See this commit.

Note: because of the fact that kernels are recreated during the pass described above, this change is necessary in clover.

The implementation of the OpenCL builtins in libclc looks like this.

Sunday 12 July 2015

Week 7 - Image reading

This week the main focus was on image reading, while also reworking previously sent patches about image attributes. Currently I have a working prototype for writing CL_FLOAT images, although I'm not ready to push it upstream yet. The current status is summarized in the following list.

Mesa is now able to set up compute texture and sampler resources (code). Most of the code was already present, only initialization and emission of the atoms responsible for sampler and texture resource (sampler view) state setup had to be added. I also had my mentor's code as a starting point.
The prototype libclc implementation uses the llvm.R600.tex intrinsic with hardcoded texture id, sampler and coord types (code).
No modification of llvm is required.

The image attribute getters have undergone a few modifications. The mechanism which assigns a compile-time constant ID to the kernel image arguments, which was part of the image attribute intrinsics replacer pass, has been factored out into a separate pass (patch). Simultaneously, the image attribute replacer pass has been deleted, since the reason of the pass was the compile-time constant ID generation. This way the attribute getters can be implement in libclc roughly as follows (see this patch for additional details):

get_image_attribute(get_image_id(image), attribute);

An additional benefit of this approach is that the image ID pass can be extended with different kind of IDs, e.g. resource IDs (like RAT ID) and sampler IDs.

Monday 29 June 2015

Week 5 - Image writing

During the last week I've been working on image writing, i.e. on the write_image* builtins. The current implementation is in an experimental state (which is a euphemism for hacky in this case), and only deals with the special case of a one dimensional single channel image of 32 bit pixels. Furthermore only write_imageui is supported and only for the first write-only image argument, but these are relatively easy to fix.

To be a bit more specific: the write-only image arguments are bound to RATs (Random Access Targets) aka UAVs (Unordered Access Views) by r600g. The RAT ID is the 1-based (!) index of the write-only image argument among all the write-only image arguments as is present in the kernel signature. RAT 0 is reserved for global buffers.

The libclc implementation of write_imageui simply stores the input value (only the x component for now) to the location defined by the coordinate argument (again, only the x component is used). The RAT ID gets encoded into the address space of the pointer. Currently the address space of RAT 1 is hard-coded, but it could be fixed using an intrinsic which returns a null pointer with the proper address space given the image argument. The intrinsic could be substituted with the constant pointer value by an LLVM pass.

During instruction lowering, the MEM_RAT_CACHELESS_STORE_RAW instruction will be selected in place of the abstract store instruction, in spite of the fact that the Catalyst driver uses MEM_RAT_STORE_TYPED (see the disasm below). The rationale behind this choice is that STORE_TYPED is not yet implemented in the AMDGPU LLVM backend, and I couldn't make it work in the time I was willing to spend experimenting with it. Sadly AMD's Evergreen ISA docs are pretty vague, and the exact behaviour of STORE_TYPED including its interaction with the hardware configuration of the RATs is not documented AFAIK. RAT 1 is hard-coded here too.

Example kernel:

__kernel void imgtest_basic(write_only image2d_t img, __global int *out)
{
    write_imageui(img, (int2)(1, 2), (uint4)(3, 4, 5, 6));
    *out = 7;
}

And it's ASM produced by Catalyst (disassembled using CodeXL):

; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(13) KCACHE0(CB1:0-15) 
      0  x: LSHR        R3.x,  KC0[2].x,  2      
         y: MOV         R0.y,  (0x00000004, 5.605193857e-45f).y      
         z: MOV         R0.z,  (0x00000005, 7.006492322e-45f).z      
         t: MOV         R0.x,  (0x00000003, 4.203895393e-45f).w      
      1  x: MOV         R1.x,  (0x00000001, 1.401298464e-45f).x      
         y: MOV         R1.y,  (0x00000002, 2.802596929e-45f).y      
         z: MOV         R1.z,  0.0f      
         w: MOV         R0.w,  (0x00000006, 8.407790786e-45f).z      
         t: MOV         R2.x,  (0x00000007, 9.809089250e-45f).w      
01 MEM_RAT_STORE_TYPED: RAT(0)[R1], R0,  VPM 
02 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R3].x___, R2, ARRAY_SIZE(4)  VPM 
END_OF_PROGRAM

Note that Catalyst reserves RAT 11 for global buffers.

Thursday 18 June 2015

Week 3.5 - Running late

During the previous ~1.5 weeks I've been working on finishing the requirements of milestone 1 (i.e. get_image_* builtins), which means I've greatly underestimated the time required to complete it. Fortunately I've allocated some buffer time on the last weeks of GSoC to implement some optional requirements, so I'll probably have to sacrifice some of that. Now that that's out of the way, I can talk about the current state of the get_image_* builtins along with the choices I've made while implementing these features.

The general idea is essentially the same as the one I described in my previous post: the getters have to be replaced by an LLVM pass with implicit kernel parameters loads. However the code had to be restructured because several parts were in wrong places, e.g. one should avoid patching the IR in the driver if possible. To this end the pass has been moved to LLVM.

The current implementation can be broken down to 3 parts:

builtin definitions using dummy intrinsics (libclc)
translation of dummy intrinsics to meaningful code (LLVM)
placement of image attribute data to specific locations before kernel launch (mesa)

The libclc definitions are simple functions which contain calls to the llvm.AMDGPU.get.image.[23]d dummy intrinsics. A new pass (R600ImageAttributeIntrinsicsReplacer) has been added to the AMDGPU backend of LLVM to replace calls to these dummy intrinsics with the newly added llvm.AMDGPU.read.image.attribute intrinsic. This intrinsic accepts two compile-time constant operands (each 4 bytes wide): an image index (i) and an attribute index (j). Upon instruction lowering, the intrinsic will be translated to a load from an implicit kernel parameter using the 4-byte offset 4 + 5 * i + j added to the starting location of the implicit parameters (the first four 4-bytes are used by grid dim and grid offset, and there are five 4-byte image attributes for each image). The attribute index is 0 for width, 1 for height, 2 for depth, 3 for channel data type and 4 for channel order. The image index is the index of the image argument among all image arguments. This decision has some consequences though. Namely it affects how convenient it is to prepare implicit arguments for the software component which is responsible for that.

This raises the question which component of the driver should prepare these implicit arguments? Currently clover serializes grid dim and grid offset taking care of byte extension and endian conversion, but ideally the pipe driver should make the choice how it implements mechanisms like grid dim info or image attributes. On the other hand image attributes like channel order and data type contain OpenCL specific constants, which the pipe driver stores in a different format. Adding conversion code to the driver would mean (1) adding state tracker specific code to the driver and (2) duplicating some functionality, so one has to maintain both the OpenCL to pipe format converter in clover and the pipe format to OpenCL converter in the driver.

So what implications do the choice above has on image indexing? The driver has no immediate knowledge of the order of read-only and write-only kernel arguments, since those are handled differently at driver level. If one chooses to prepare image attributes in the driver, another indexing scheme has to be introduced, e.g. by copying the attributes of write-only images first and the read-only ones after that.

Considering the complications arising from preparing image attributes in the driver, I decided to do it in clover instead in a similar fashion to how grid dim and grid offset are handled. Because of that, image attributes can be added to the kernel parameters simply in the order of the image arguments as is present in the kernel signature. This also simplifies the intrinsic replacer pass.

Patches have been sent to the relevant mailing list under the following subjects:

[Mesa-dev] [PATCH] clover: Implement image attribute getters (commit on GitHub)
[llvm] Image attribute access for the AMDGPU backend (commit on GitHub)
[Libclc-dev] [PATCH] Add image attribute getter builtins (commit on GitHub)

UPDATE: added links to commits on GitHub.

Saturday 6 June 2015

Week 2 - Milestone 1

This week I managed to implement the OpenCL image attribute getter builtins for 2D images (and partially for 3D). These builtins allow the query of image metadata - width, height, depth (for 3D), channel data type and channel order - within an OpenCL kernel. In the remainder of this post I will describe how the implementation works and how did I got there.

Back in April when I wrote my GSoC proposal, my mentor Tom Stellard suggested that examining the ISA emitted by the Catalyst driver could help in the design process, so I did just that. I used the simple OpenCL kernel below:

__kernel void imgtest_basic(read_only image2d_t img, __global int * out)
{
*out = get_image_width(img);
}

and obtained the following ISA using AMD's CodeXL on Windows with a Mobility Radeon HD 5850 (Juniper, Evergreen) GPU:

; -------- Disassembly --------------------

00 ALU: ADDR(32) CNT(3) KCACHE0(CB1:0-15)

0 x: LSHR R1.x, KC0[2].x, 2

t: MOV R0.x, KC0[0].x

01 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].x___, R0, ARRAY_SIZE(4) VPM

END_OF_PROGRAM

What I think is important (or what I managed to understand) here is that the width of the image is stored in KC0[0].x which references a constant cache. The location of the data is known at compile time, no pointer dereferencing is needed. Furthermore the value of the out pointer is stored in a similar manner - in KC0[2].x, so one may assume that kernel arguments are passed this way.
Note: global addresses are 4-byte aligned on this architecture, so the least significant 2 bits should be omitted from the load addresses. This is why a right shift (LSHR) by 2 is required to transform the address stored in the argument out to a valid load input.

Clover, a software component of mesa which manages OpenCL state (a state tracker in gallium parlance), compiles OpenCL C sources to ISA using clang and LLVM. The OpenCL builtins are implemented as an LLVM bitcode library (libclc), which gets linked to the kernel in IR form. The fact that LLVM IR represents the image[23]d_t type as an opaque pointer imposes some restrictions on the implementation of the getter builtins, because this pointer is the sole input which these builtins may use to do their thing. And you can't do much with a pointer without dereferencing it.

What all this rambling lead us is that a code transformation is required to achieve something similar to what Catalyst does. So I've implemented an LLVM pass which replaces calls to the getter builtins with accesses to hidden kernel arguments containing the image attributes. The builtins are implemented in libclc as nonexistent intrinsics to mark the places where the pass needs to perform a substitution, and also to avoid problems with mangled symbol names created because of overloading. For each explicit image argument (only 2d for now) 5 new ones are created for the 5 attributes in a specific order. The LLVM pass assumes the hidden argumens are in this order when performing the substitutions. The hidden arguments will receive their values when the image argument is bound to the execution context.

A patch is on the way, I'll link it here if it gets accepted.

UPDATE:
I got some feedback for the patches. This feature needs some work, because I've added the new LLVM pass, which implements a driver-specific behavior (i.e. passes image attributes as hidden arguments), to clover, which should manage OpenCL state in a driver-independent way. So I'll have to move this to a driver-specific place. The plan is to generate intrinsics in a similar fashion to the workitem id and size getter builtins, and lower those intrinsics in the R600 LLVM backend to reads from implicit arguments.

UPDATE #2: I've got the address-shifting-by-2 thing backwards: a _right_ shift is needed before loads. See the updated note above.

Friday 29 May 2015

Week 1 - onwards to Milestone 1

I'm halfway through the deadline of Milestone 1, which requires me to be able to create an image, pass it to a kernel and query its attributes by June 8. These are the steps I've done towards these requirements this week:

Image support capability queries now return true. This makes it possible to call clCreateImage*.
Added dummy get_image_width to libclc for 2d images, which now always returns 42.
Enabled PIPE_BIND_COMPUTE_RESOURCE usage in the evergreen_is_format_supported function. Currently it returns true to any input (TODO: find out if this is correct).
Depth value is now 1 instead of 0, because libdrm requires it (see first comment).

This is the kernel I'm testing with right now, my first goal is to be able to run this, and get back 42 in the output buffer.

__kernel void imgtest(__read_only image2d_t in, __global int * out)
{
if (get_global_id(0) == 0) {
*out = get_image_width(in);
}
}

What remains to be done next week:

Allow kernel launch to set dummy compute sampler views (currently an assert prevents that).
Pass the image attributes to the kernel as hidden parameters, and actually use them in libclc.

Thursday 30 April 2015

Google Summer of Code project accepted

My Google Summer of Code project proposal titled OpenCL image support for the r600 gallium driver has been accepted. See the proposal below:

Contact: Zoltan Gilian <zoltan.gilian@gmail.com>, zogi on IRC
Title: OpenCL image support for the r600 gallium driver
Synopsis:
Current open source radeon drivers do not support OpenCL images. The aim of this project is to implement support for image-related OpenCL C 1.1 capabilities, both 2d and 3d, in the r600 gallium driver.
Benefits to the Community:
OpenCL image handling functions could utilize appropriate GPU resources (e.g. texture sampling units), therefore increasing the performance of related algorithms (e.g. image processing). Such software relying on OpenCL image capabilities cannot run using the open source drivers despite having a capable GPU in the system. Implementing OpenCL image support would solve this problem.
Deliverables:
- Expose image support to clover. Required.
- Implement 2d image attribute query and basic data access OpenCL C builtins. Required.
  - querying image attributes (get_image_* builtins)
    reading image data with basic sampling parameters (using unnormalized coordinates, ‘none’ addressing mode and nearest filtering)
  - writing image data
- Implement image sampling and access qualifiers. Optional.
- Support 3d images. Optional.
- Support half floating-point image access. Optional.
- Documentation. Required.
Milestones:
- Be able create an image, pass it to a kernel and query its attributes by June 8.
  Functions to impement:
  - get_image_width
  - get_image_height
  - get_image_channel_data_type
  - get_image_channel_order
  - get_image_dim
- Be able to write and read 1 dimensional image data using basic sampling parameters by June 26.
  Functions to implement:
  - read_image{f,i,ui} (with the restrictions above)
  - write_image{f,i,ui} (with the restrictions above)
- Implement 2d image access using basic sampling parameters by July 13.
  Functions to implement:
  - read_image{f,i,ui} (with the restrictions above)
  - write_image{f,i,ui} (with the restrictions above)
- Document the previous steps to future developers by July 20.
- Implement and document (a subset of) the optional deliverables in the remaining time.
  - Image sampling
    - extend the functionality of read_image
  - 3d images
    - extend the functionality of the builtins above
    - new function: get_image_depth
  - Access qualifiers (read_only and write_only)
  - Half floating-point image access
    - read_imageh, write_imageh
Architecture overview:
- LLVM frontend (libclc): extend the libclc llvm bitcode library by supporting image related OpenCL C constructs (see: Deliverables).
  - Image attributes can be implemented as hidden kernel parameters. These attributes should be transferred to constant memory on the device.
- LLVM backend (llvm): support intrinsics emitted by libclc.
- r600 pipe driver (mesa): expose image support to clover.
Related Work:
- http://portablecl.org: OpenCL C frontend using clang.
Biographical Information:
- Education: MSc degree in Computer Science, Eotvos Lorand University, Budapest (2012). Currently pursuing PhD in signal processing there.
- Work experience:
  - SRE intern at Google, Zurich (2013 Aug-Dec)
  - Intern and coop student at Morgan Stanley, Budapest (2010 summer, 2011 spring)
  - System administrator at Bolyai College, Budapest (since 2014)
- Related technical skills: C, C++, CUDA (udacity parallel computing class) and OpenCL, Linux, compilers (BSc thesis about a compiler extension to the feldspar language, 2010, sadly in hungarian).
- Open source work:
  - Implemented hardware page accounting into hplip cups backend for college printer accounting (hplip fork)
  - Small open source spectrum analyzer in javascript: https://github.com/zogi/spectrum-analyzer-js (demo)