Page MenuHome

Cycles: Support multithreaded compilation of kernels

Authored by Jeroen Bakker (jbakker) on Sep 28 2016, 11:55 PM.

Diff Detail

rB Blender

Event Timeline


A bit annoying to have another compiler step which might possible fail..

Why it might fail?

Just to say it's a huge time saver for programmers. It allows to test optimisations and new features much faster. So it would be really helpful to add it at least as an experimental feature to master.

Together with D2254, it allows to build the simplest kernels in 2sec and the most advanced one in 11seconds compared to 20 to 38sec on master.

LazyDodo (LazyDodo) added inline comments.

Could we just do the math and alloc the memory we need? just allocating 10k on the stack and hoping it's enough feels sloppy and a buffer overflow exploit waiting to happen.

Patch needs update to work with current master

Something went wrong. Arcanist created a new object D4313 where I merged latest master in this code. I will continue working there.

Jeroen Bakker (jbakker) commandeered this revision.Feb 7 2019, 7:46 AM

Hi Lukas,

I am working to update this patch to latest cycles, but will need to update processing. So if you don't mind I will take ownership of this Patch.

Jeroen Bakker (jbakker) planned changes to this revision.Feb 7 2019, 8:44 AM
  • Most 'huge' kernels are not effected by current implementation as they are compiled before this logic. (for example split kernel).
  • Should we compile every program this way. I will make this an option in the program so small kernels do not start new processes, only large kernels/programs.
Jeroen Bakker (jbakker) updated this revision to Diff 13545.EditedFeb 7 2019, 12:34 PM
  • refactored system_call_self to use a reserved capacity string. The previous implementation had a popential security risk, when passed parameters exceeded 10k.
  • tweaked load_kernels. The split kernel were compiled when the SplitKernelFunctions were created. This made the compilation of the split kernels to be done in the main blender process before the multi-process compilation processes were started.
  • renamed load_kernels to add_kernel_programs as load kernels had to construct the list of programs to load cq compile.
  • construct the right source when compiling. Due to changes in Cycles, the relative paths and setting names were incorrect.
  • ordered the split kernels so the largest kernels are compiled first.

    When running default scenes the speed up is not that much as the kernels compiled in 3 programs. When you disable use_single_program (--debug-value=256) you all split_kernels will be compiled in a separate program. This will speed up the compilation a lot.

    BMW; normally 55 seconds single program 40 seconds !single program 20 seconds

Currenty unclear why use_single_program is default.

Jeroen Bakker (jbakker) marked an inline comment as done.Feb 7 2019, 12:37 PM

Update this documentation

Ruslan (Loner) added a subscriber: Ruslan (Loner).EditedFeb 7 2019, 1:35 PM
BMW; normally 55 seconds
     single program 40 seconds
    !single program 20 seconds

Currenty unclear why use_single_program is default.

Can using !single program is default then, if he give high result?
Thanks for your work.

Jeroen Bakker (jbakker) planned changes to this revision.Feb 12 2019, 11:56 AM
  • Reduce overhead by bundling small (static) kernels in a single program.
Brecht Van Lommel (brecht) requested changes to this revision.Feb 12 2019, 12:23 PM
Brecht Van Lommel (brecht) added inline comments.

We shouldn't have to include this in the blender/ module, it's supposed to be abstracted from that.

I suggest to add a Device::compile_opencl_kernel() function that takes a string array.


I would prefer to keep this logic localized as much as possible in the OpenCL code. This could just pass a string array, and then let the OpenCL device code unpack the arguments.


This could use OIIO::Sysutil::this_program_path(), which we already use elsewhere.


This will not work for file paths with spaces. Since escaping in general is not so simple, I suggest to use execv (and _execv on Windows) instead.


// TODO(name): Comment.


bool system_call_self(const vector<string>& args);

Jeroen Bakker (jbakker) marked 3 inline comments as done.
  • Merge branch 'master' into arcpatch-D2264
  • D2264 Cycles OpenCL Multi process compilation

use execv/_execv to support path with spaces

Jeroen Bakker (jbakker) marked 2 inline comments as done.
  • D2264: Cycles OpenCL Multi Process Compilation

@Brecht Van Lommel (brecht)

In your review you added a comment to use execv and _execv due to possible incorrect argument passing.
The execv functions rewrite the current process and fails to receive the correct arguments.
Googling did not give me any useful answer (process forking).

As std::system is also used in other locations (register blendfile, linking ptx, compiling cuda kernels) and we have a safeguard when the compilation failed I used system.
If you have any idea how to solve the argument parsing, please let me know.

I found this MIT licensed library that might help and seems to be a professional implementation except that it doesn't seem to support the windows platform.

Please advice.

  • D2264: Multi Process OpenCL Kernel Compilation

Not doing: execv replaces current process.

  • Fix mixed space/tabs indentation.
  • Fix Python memory leak.
  • Show cleaner terminal output.
  • Refactoring string parameter passing a little.
This revision is now accepted and ready to land.Feb 14 2019, 9:40 PM
This revision was automatically updated to reflect the committed changes.