Cycles OpenCL kernel-splitting work
ClosedPublic

Authored by Martijn Berger (juicyfruit) on Mar 24 2015, 4:20 PM.
Tags
Tokens
"Love" token, awarded by DBrown."Like" token, awarded by numarul7."Love" token, awarded by yuzukyo."The World Burns" token, awarded by lockal."Love" token, awarded by zanqdo."Like" token, awarded by campbellbarton."Like" token, awarded by bliblubli."Like" token, awarded by jesterking."Like" token, awarded by capnm."Love" token, awarded by lordodin."Like" token, awarded by GiantCowFIlms."Like" token, awarded by catlover2."Love" token, awarded by gandalf3."Like" token, awarded by brachi."Like" token, awarded by sterlingroth."Love" token, awarded by dingto."Love" token, awarded by mano-wii."Orange Medal" token, awarded by thelasthope."Yellow Medal" token, awarded by italic_."Love" token, awarded by tuqueque.

Details

Summary

This patch contains work to split the OpenCL mega-kernel into separate kernels to obtain better GPU utilization and therefore performance.

A description of the optimizations included in this patch is located at https://docs.google.com/document/d/1LuXW-CV-sVJkQaEGZlMJ86jZ8FmoPfecaMdR-oiWbUY/edit?usp=sharing

Diff Detail

Repository
rB Blender
There are a very large number of changes, so older changes are hidden. Show Older Changes
intern/cycles/device/device.h
100

Can't it be replaced with closure_nodes.size()?

102

Not sure why it's needed?

intern/cycles/device/device_opencl.cpp
46

Doesn't this belong to kernel_types.h?

74

It's also a bit dangerous because of possible padding added to ShaderData. Need some clearer way of calculating the size, but needs some deeper thoughts.

299

It should at least be an enum, not a string.

Also, why exactly it is so specific? Can't cache just cache whatever program is requested?

338

Same as above.

425

Not needed.

765

Prefer to have output parameters at the end of the list.

999

Guess it is something to be fixed in the compiler as well. It's also important ot mention what exact version/vendor of compiler is broken.

Also, comments is a full sentence, starting with capital and having fullstop at the end.

1025

We really need to simplify this with some macro, like:

#define KERNEL_APPEND_ARG(kernel, arg, narg) \
  opencl_assert(clSetKernelArg(kernel, narg++, sizeof(arg), (void*)&arg));

KERNEL_APPEND_ARG(ckFilmConvertKernel, d_data, narg);
KERNEL_APPEND_ARG(ckFilmConvertKernel, d_rgba, narg);
/* ...and so on... */

Could also be wrapped into some smarter macro which will do CUDA-like kernel launching. Or just an inlined c++ function with some template magic:

template <typename T1, typename T2, typename T3, typename T4 ...>
inline kernel_launch(kernel, T1 a, T2 b ...) {
  int narg = 0;
  KERNEL_APPEND_ARG(kernel, a, narg);
 /* ... */
}

/* Launch film kernel. */
kernel_launch(ckFilmConvertKernel, d_data, d_rgba, d_buffer);
1042

KERNEL_TEX is to be undefined to this point.

1091

Undefine KERNEL_TEX.

1134

If kernel itself needs to know if it's split or not, then better to negotiate this flag and call it SPLIT_KERNEL. This way no changes will be needed to keep CPU/CUDA working.

1224

Undefine KERNEL_TEX.

1306

Remove _SPLIT_KERNEL suffix.

1754

Undefine KERNEL_TEX. And actually do it in all similar cases.

1777

break is meaningless in this case, by removing this redundant statements you'll clean up this function quite a lot.

Also return could be at the same line, making the function even smaller.

Would be nice to avoid this all together, but currently i can't think of more clear way. Finding better way of dealing with this could be mentioned in the TODO tho.

1992

assert(!"Uknown node type passed to FOO")

2009

If it can't be 0 assert for that and don't check of it being zero.

2013

Same as above.

2015

Not really sure, will it survive if one closure is replaced with another one in shader tree, keeping overall number of closures the same?

2047

Would prefer if all defines passed via cflags were wrapped with underscores.

2166

Seems __SPLIT_KERNEL, max closure options and work stealing seem could be deduplicated in this function.

This revision now requires changes to proceed.May 2 2015, 4:42 PM

Sorry for the spam. One more thing: it'll help if stuff is being committed to the branch more often. Then we'll have more instant feedback/communication and me/@Martijn Berger (juicyfruit) will be more efficient with help that way (knowing we don't do duplicated work and so on).

@George Kyriazis (kyriazis), @Lenny Wang (lennyhpc) : Please correct me, If I am wrong in making the following statements.
@Sergey Sharybin (sergey) : sd_fetch() is added in order to realize the structure-of-arrays implementation for ShaderData structure. The structure of arrays implementation facilitates a coalesced memory access to ShaderData structure, which is a good memory access pattern for GPUs.
Also considering the huge size of ShaderData structure, sometimes the compiler generates code with spilled registers (spilled registers reside in high-latency global memory). This usage of spilled registers actually increases the global memory fetches in some cases.
We observed performance improvement after doing the structure-of-arrays implementation of ShaderData structure.

hereafter, I will try and push code more often.

intern/cycles/device/device.h
82

yeah . it definitely makes sense to put it in DeviceInfo. I will make the change

100

closure_nodes lists all the svm nodes that will ever be traversed in "svm_eval_nodes()". clos_max denotes how big the "MAX_CLOSURE" macro should be, for the current scene. In megakernel if "MULTI_CLOSURE" is enabled, the "MAX_CLOSURE" value is default to 64. Not all of the 64 locations in ShaderData's "closure" array is necessary to render a scene.

102

use_split_kernel flag is mainly used in load_kernels logic. If the device is going to use split kernel, then we delay load_kernels() execution to after "device_update()". (Note : load_kernels() depends on data from device_update() to realize "selective svm node compilation" optimization)

intern/cycles/device/device_opencl.cpp
46

It was present in kernel_types.h. But I moved it to device_opencl.cpp because, this macro is also used in device_opencl.cpp.
My rationale is that, one would expect cycles to work if he/she disabled/enabled (that is, after a build) a feature-macro in kernel_types.h (build/bin/Release/2.74/scripts/addons/cycles/kernel/kernel_types.h). is it not so ? work_stealing could be thought of as an optimization-feature. However, if one disabled work_stealing in kernel_types.h (in the bin folder), then cycles would crash since "WORK_STEALING" related code will still be enabled in the binary executable.

74

@Martijn Berger (juicyfruit) : But this is the best-fitting place i could find for these declarations :( any ideas ?
@Sergey Sharybin (sergey) : Yeah, It could be made better. Needs some thought.

299

I will make the change into an enum.

We have the string to denote which program it should return. Previously i.e. in blender-master branch, we had only one program and get_program, set_program will implicitly refer to that program only. Since we have 2 programs here, we need to explicitly tell get_program and set_program about what program we are refering to.

425

noted. will remove it.

765

yeah. will make necessary changes.

999

This comment is not introduced in cycles_kernel_split commits.
Will make note of the comment format and make changes accordingly.
Question on comment format : If the sentence is too big, it could be broken into 2 or more lines, right ?

1025

I think the first approach, set of macros to set kernel arguments followed by a enqueue_kernel, is clean, clear and straight-forward.
However, its your call. do you want it done using templates ?

1042

noted. will make necessary changes.

1091

noted. will make necessary changes.

1134

The flow now is that, the load_kernels of OpenCLDeviceBase will compile filmconvert, shader and bake kernels.
If the device is to work on megakernel, then the load_kernels of OpenCLDeviceMegakernel will compile only kernel_ocl_path_trace kernel.
If the device is to work on splitkerenl, then the load_kernels of OpenCLDeviceSplitkernel will compile all the split kernels of path trace.

The flag "COMPILE_ONLY_MEGAKERNEL" is used in OpenCLDeviceMegakernel to prevent recompilation of other kernels. This flag also prevents un-necessary compilation of megakernel in OpenCLDeviceBase's load_kernels().

Also this flag is only available in kernel.cl and would not harm CPU/CUDA's working.

1224

noted. will make necessary changes.

1306

noted.will make necessary changes.

1754

noted. will make necessary changes.

1777

yeah ok. i will clean it up and write a TODO.

1992

noted. will make necessary changes.

2013

My mistake, It should have been "shouldn't" instead of "can't" in both cases. I will correct the comments.

2015

Let me consult with another developer and get back to you on this.

2047

noted. will make necessary changes.

2166

yeah. will de-duplicate them.

2612

noted. will make necessary changes.

2863

ok.

3203

noted. will use macros for kernel enqueue.

intern/cycles/kernel/closure/bsdf.h
133

@Sergey Sharybin (sergey), @Martijn Berger (juicyfruit) : will change it to ccl_addr_space.

intern/cycles/kernel/geom/geom_attribute.h
60

Not sure how this "XXX" got introduced. I will remove it.

intern/cycles/kernel/geom/geom_object.h
154

I believe its possible to remove duplication. Let me check it out further and get back to you.

intern/cycles/kernel/kernel_emission.h
22

the split kernel version has one extra argument. let me check out the de-duplication feasibility and get back to you on this .

intern/cycles/kernel/kernel_passes.h
22

noted. will make necessary changes.

intern/cycles/kernel/kernel_queues.h
76

I searched for a ccl_local definition and could't find one. I will add a definition and replace all "__local" with "ccl_local".

intern/cycles/kernel/kernel_shader.h
1048

Let me check it out in detail and get back to you.

intern/cycles/kernel/kernel_shadow.h
197

The split kernel version has extra arguments (sd and isect_global). This function can be de-duplicated by adding the two extra arguments to the original version and subsequently changing all the function calls.

intern/cycles/kernel/kernel_types.h
20

ok. will make necessary changes.

457

I will check it out and get back to you.

552

yes.

1055

yeah ok. will make necessary changes.

1071

ok . will make necessary changes.

intern/cycles/kernel/kernel_work_stealing.h
65

will make necessary changes.

intern/cycles/kernel/svm/svm.h
184

yeah . got it :)

intern/cycles/render/buffers.h
143 ↗(On Diff #4126)

This is the tile size set by the user in the gui. I was not aware that it could be calculated already. I will check it out.

161 ↗(On Diff #4126)

This value is calculated in thread_run and used inside path_trace. Thinking about it, I believe,This probably belongs somewhere in device_opencl.cpp. will check it out.

intern/cycles/render/graph.h
211 ↗(On Diff #4126)

sorry, I dont understand. Where is the variable "closure" that you have mentioned ? This value is used in the calculation of clos_max in Device class.

intern/cycles/render/session.cpp
663

Will need a little help here. We will need a better understanding of svm_nodes compilation to do it ourselves without relying on device_update().
The idea is that we should find all the svm_nodes that would be traversed in "svm_eval_nodes" function here and enable appropriate macros.
Is there some doc that you could share regarding that ?

intern/cycles/render/svm.cpp
97 ↗(On Diff #4126)

you mean, a shader node could be compiled differently for each render of a same scene ?

This comment was removed by Tito Novelianto (mbakkokom).
Martijn Berger (juicyfruit) updated this object.
  • Cycles: OpenCL kernel-splitting work
  • Merge branch 'master' into cycles_kernel_split
  • Cycles: Make code style closer to Cycles, avoid some unneeded changes
  • Cycles: Compilation error fixes
  • Cycles: Don't limit OpenCL devices by the env variable
  • Cycles: Don't use std::set directly, use util_set instead
  • Cycles: Fix compilation error in release builds with strict flags
  • Cycles: Add regular license header
  • Cycles: Make sure there's no double spaces in the build options
  • Cycles kernel split: Fix for progressive refine
  • Merge branch 'master' into cycles_kernel_split
  • Add comment for PATH_ITER_INC_FACTOR
  • [CYCLES] Resolve data-flow violation
  • Add utility function to check if given tile size can be rendered by split kernel
  • Add utility function to get render-feasible tile size
  • Add utility function to split big-tile into small-tiles
  • Record buffer and rng_state offsets in RenderTile
  • Add big-tile-handle logic in thread_run -> device_opencl.cpp
  • Support rendering of big tile sizes
  • Make parallel-samples logic work
  • Refactor : Check render feasibility only once
  • Refactor : Remove unused variables
  • Refactor : Group all split kernel utility functions
  • Initialize per_thread_output_buffer_size based in render type
  • Refactor : Uncomment shader, filmconvert and bake kernels
  • Bug fix : Fix viewport render issue
  • Bug fix : Always make max_render_feasible_tile_size a multiple of local work size
  • Print message to console incase we need to split tiles further
  • [BCYCLES-209] make direct lighting and shadow blocked kernels share shaderData global buffers
  • [BCYCLES-209] make direct lighting and shadow kernels share members of
  • [BCYCLES-209] Refactor : rename variables sd_shadow -> sd_DL_shadow
  • Bug fix : Dont use ASSIGN_RAY_STATE macro for first assignment
  • [BCYCLES-212 Fix] Make split kernel work on multiple OpenCL devices
  • Force kernel re-compilation check for viewport render
  • Print additional info incase we need to split tile further
  • Fix OpenCL on Apple again, double spaces in compile arguments.
  • Merge blender/master d6b28b
  • [BCYCLES-213] Improve tile splitting logic
  • Refactor : Add split/mega kernel selection logic
  • Add OpenCLDeviceSplitKernel class
  • Remove un-necessary code from OpenCLDevice class
  • Refactor : Remove un-necessary SPLIT_KERNEL macros
  • Refactor : Move some build option setting into load_kernels for split_kernel
  • Bug fix : Remove SPLIT_KERNEL build option for kernel.cl in OpenCLDeviceSplitKernel
  • Prevent un-necessary megakernel compilation inside splitkernel
  • Restore original state of kernel_types.h
  • Remove KERNEL_OPENCL definition in device_opencl.cpp
  • Refactor : Remove un-necessary code/comment
  • Remove un-used subsurface kernel
  • Bypass selective-compilation-of-svm-nodes for viewport
  • Refactor(cosmetics) : Modify ShaderData declaration
  • Bug fix : Don't display sample count in progress bar when using split kernel
  • Refactor : Add/Remove comments/spaces
  • Refactor : Change variable name
  • Remove SPLIT_KERNEL_CLOSURE_COUNT macro
  • Dont use OpenCLCache with split kernel
  • SupportKERNEL_DEBUG flag
  • Change class hierarchy in device_opencl.cpp and refactor
  • Use OpenCLCache for megakernel
  • Refactor : Remove tabs in tile-split message
  • Move viewport render to the beginning of load_kernels
  • Move work_stealing macro to device_opencl.cpp
  • Bug fix : Account for ShaderClosure ptr in ShaderData SOA
  • Complete all kernel exec before releasing tile
  • Fix mem leak : Release per_sample_output_buffers mem
  • Refactor : "if (" -> "if(", "for (" -> "for("
  • Merge branch 'master' into cycles_kernel_split
  • Rename ADDR_SPACE to ccl_addr_space
  • Move definition of ccl_addr_space to kernel_compat_*
  • make "const ccl_addr_space" consistent throughout the code
  • Make sure we undef KERNEL_TEX once we are done with it
  • Cycles kernel split: Fix compilation error after recent changes
  • Get rid of bogus XXX comments
  • Enums should not be in CAPS.
  • Cycles kernel split: Make it possible to force usage of split kernel
  • Remove '#ifndef SPLIT_KERNEL' around shader_eval_displacement
  • De-duplicate object_normal_transform function
  • De-duplicate object_dir_transform function
  • De-duplicate shader_setup_from_background function
  • Make the use of Ray in shadow_blocked_SPLIT_KERNEL similar to shadow_blocked
  • Make use of sd in shadow_blocked_SPLIT_KERNEL similar to shadow_blocked
  • Make use of isect in shadow_blocked_SPLIT_KERNEL similar to shadow_blocked
  • De-duplicate shadow_blocked function
  • De-duplicate direct_emissive_eval function
  • Add ShaderData variable header file
  • Use kernel_shaderdata_vars.h for ShaderData decl
  • Reshuffle ShaderData variables and add padding
  • Don't use variable count macros to calc ShaderData size
  • Don't use variable count macros to calc ShaderData_SOA size
  • Remove ShaderData variable count macros
  • Rename program_names with enums in OpenCLCache
  • Remove un-necessary back-slash in macro definition
  • Refactor : Place output arguments at end of list and remove unused arguments
  • Account for d_data memory in 'get_scene_specific_mem_allocated'
  • Refactor : Add assert and comment for clos_max
  • De-duplicate build-option-setting in load_kernels
  • Use max_render_feasible_tile_size for tile-split determination
  • Modify split-kernel's path-trace function signature
  • Move max_render_feasible_tile_size to device_opencl.cpp
  • Cycles kernel split: Compilation error fix after recent changes
  • Cycles kernel split: Another attempt to fix error in camera motion
  • Merge branch 'master' into cycles_kernel_split
  • Cycles kernel split: Disable advanced shading for all split kernels
  • Cycles Kernel Split: Cleanup get_node_type_as_string() switch and add todo comment.
  • Add mem_alloc() function for Split kernel buffer allocation
  • Cycles kernel split : Use ENQUEUE_SPLIT_KERNEL macro to enqueue split kernels
  • Cycles kernel split : Move ENQUEUE_SPLIT_KERNEL macro inside path_trace
  • Cycles kenrel split : move use_split_kernel to DeviceInfo class
  • Cycles kernel split : avoid explicit macro definition in svm.h
  • Cycles kernel split : Refactor: replace address specifiers with ccl_ macros
  • Merge branch 'master' into cycles_kernel_split
  • Cyccles kernel split : Add message to assert statement
  • Use macros to set args for datainit kernel
  • Cycles kernel split : Use macros to set args for sceneintersect kernel
  • Cycles kernel split : Use macros to set args for LampEmission kernel
  • Cycles kernel split : Use macros to set arg for QueueEnqueue kernel
  • Cycles kernel split : Use macros to set args for bg_buffupdate kernel
  • Cycles kernel split : Use macros to set args for Shader Lighting kernel
  • Cycles kernel split : Use macros to set args for holdout_emission kernel
  • Cyclse kernel split : Use macros to set args for DirectLighting kernel
  • Cycles kernel split : Use macros to set args for shadow_blocked kernel
  • Cycles kernel split : Use macros to set args for Setupnextiter kernel
  • Cycles kernel split : Use macros to add args to SumAllRadiance kernel
  • Cycles kernel split : Undef KENREL_APPEND_ARG macro
  • Cycles kernel split : Use macros to set args for film_convert kernel
  • Cycles kernel split : Use macro to set args for shader/bake kernels
  • Cycles kernel split : Use macro to set args for megakernel
  • Cycles kernel split : move atomic_add_float function to kernel_split.h
  • Cycles kernel split : Refactor: move atomic_add_float utility function to appropriate place in kernel_split.h
  • Merge branch 'master' into cycles_kernel_split
  • Cycles kernel split: Simplify closure counting
  • Cycles kernel split: Remove unused and unwanted clos field from ShaderNode
  • Cycles kernel split: Fix order of headers inclusion
  • Cycles kernel split: Remove global static variable
  • Cycles kernel split: Begin getting rid of device->get_background()
  • Cycles kernel split: De-duplicate image updating
  • Cycles kernel split: Clarify some comments in session.h
  • Cycles kernel split : Remove misleading load_kernels progress bar status
  • Cycles kernel split : Remove SPLIT_KERNEL suffix in device_opencl.cpp
  • Cycles kernel split : Disable TRANSPARENT_SHADOWS for AMD platform
  • Cycles kernel split: Move address space specification to KernelGlobals typedef
  • Merge branch 'master' into cycles_kernel_split
  • Cycles kernel split: Move address space specification to ShaderData typedef
  • Cycles kernel split: Move address space specification to ShaderClosure typedef
  • Cycles kernel split: Move address space specification to Intersection typedef
  • Cycles kernel split: Move address space specification to PathRadiance typedef
  • Cycles kernel split: Move address space specification to DebugData typedef
  • Merge branch 'master' into cycles_kernel_split
  • Cleanup: Remove some unneeded white space changes, to make diff clearer
  • Cycles kernel split: Begin reworking selective node compilation
  • Cycles kernel split: Get rid of legacy selective nodes compilation checks
  • Cycles kernel split: Get rid of loading kernels from run_cpu/run_gpu
  • Cycles kernel split: get_*() vs. *_get() for class methods
  • Cycles kernel split: Remove ugly hack from Session::load_kernels()
  • Cycles kernel split: Always gather requested feature set for the device
  • Cycles kernel split: Remove unneeded changes in device multi
  • Cycles kernel split: Move device requested features to own struct
  • Cycles kernel split: Fix broken logic around multi-closure in device_opencl
  • Cycles kernel split: No need to get tile size from manager, it's known in params
  • Cycles kernel split: Remove unwanted whitespace changes
  • Merge branch 'master' into cycles_kernel_split
  • Cycles kernel split: Get rid of redundant _KERNEL_SPLIT suffix of kernels
  • Cycles kernel split: Don't add extra argument to emissive eval in non-split kernel
  • Cycles kernel split: Indentation fixes in new kernel files
  • make "const ccl_addr_space" consistent throughout the code
  • Make sure we undef KERNEL_TEX once we are done with it
  • Cycles kernel split: Fix compilation error after recent changes
  • Get rid of bogus XXX comments
  • Enums should not be in CAPS.
  • Cycles kernel split: Make it possible to force usage of split kernel
  • Remove '#ifndef SPLIT_KERNEL' around shader_eval_displacement
  • De-duplicate object_normal_transform function
  • De-duplicate object_dir_transform function
  • De-duplicate shader_setup_from_background function
  • Make the use of Ray in shadow_blocked_SPLIT_KERNEL similar to shadow_blocked
  • Make use of sd in shadow_blocked_SPLIT_KERNEL similar to shadow_blocked
  • Make use of isect in shadow_blocked_SPLIT_KERNEL similar to shadow_blocked
  • De-duplicate shadow_blocked function
  • De-duplicate direct_emissive_eval function
  • Add ShaderData variable header file
  • Use kernel_shaderdata_vars.h for ShaderData decl
  • Reshuffle ShaderData variables and add padding
  • Don't use variable count macros to calc ShaderData size
  • Don't use variable count macros to calc ShaderData_SOA size
  • Remove ShaderData variable count macros
  • Rename program_names with enums in OpenCLCache
  • Remove un-necessary back-slash in macro definition
  • Refactor : Place output arguments at end of list and remove unused arguments
  • Account for d_data memory in 'get_scene_specific_mem_allocated'
  • Refactor : Add assert and comment for clos_max
  • De-duplicate build-option-setting in load_kernels
  • Use max_render_feasible_tile_size for tile-split determination
  • Modify split-kernel's path-trace function signature
  • Move max_render_feasible_tile_size to device_opencl.cpp
  • Cycles kernel split: Compilation error fix after recent changes
  • Cycles kernel split: Another attempt to fix error in camera motion
  • Merge branch 'master' into cycles_kernel_split
  • Cycles kernel split: Disable advanced shading for all split kernels

Some inlined comments for myself. Will be addressed asap and start merging into master.

intern/cycles/device/device.h
26

Not needed anymore.

intern/cycles/device/device_opencl.cpp
268

TODO: We can split kernel.cl into kernel_path_trace.cl and kernel_common.cl and then we can get rid of this enum. But could happen after the merge.

787

Should be if(

1281

This could also be gone after the kerbel.cl split.

1624

This could be replaced with inlined sizeof(), same as it is done for stuff like ShaderData.

This revision was automatically updated to reflect the committed changes.