Brecht Van Lommel (brecht) Thomas Dinges (dingto) Sergey Sharybin (sergey) Martijn Berger (juicyfruit) Lukas Stockner (lukasstockner97)
- rCff7950ae6456: Support multithreaded compilation of kernels
rBS9800837b9879: Cycles: Support multithreaded compilation of kernels
rBS4ce9785e0158: Cycles: Support multithreaded compilation of kernels
rB9800837b9879: Cycles: Support multithreaded compilation of kernels
rB4ce9785e0158: Cycles: Support multithreaded compilation of kernels
Just to say it's a huge time saver for programmers. It allows testing optimisations and new features much faster, so it would be really helpful to add it at least as an experimental feature to master.
Together with D2254, it allows building the simplest kernels in 2 seconds and the most advanced one in 11 seconds, compared to 20 to 38 seconds on master.
Could we just do the math and allocate the memory we need? Just allocating 10k on the stack and hoping it's enough feels sloppy, and it is a buffer overflow exploit waiting to happen.
- Most 'huge' kernels are not affected by the current implementation, as they are compiled before this logic runs (for example, the split kernel).
- Should we compile every program this way? I will make this an option in the program so that small kernels do not start new processes; only large kernels/programs will.
- refactored system_call_self to use a string with reserved capacity. The previous implementation had a potential security risk when the passed parameters exceeded 10k.
- tweaked load_kernels. The split kernels were compiled when the SplitKernelFunctions were created, which caused them to be compiled in the main Blender process before the multi-process compilation was started.
- renamed load_kernels to add_kernel_programs, as load_kernels now has to construct the list of programs to load or compile.
- construct the right source when compiling. Due to changes in Cycles, the relative paths and setting names were incorrect.
- ordered the split kernels so that the largest kernels are compiled first.
When rendering default scenes the speed-up is limited, as the kernels are compiled in only 3 programs. When you disable use_single_program (--debug-value=256), all split kernels are compiled in separate programs, which speeds up compilation a lot.
BMW:
- normally: 55 seconds
- single program: 40 seconds
- !single program: 20 seconds
It is currently unclear why use_single_program is the default.
We shouldn't have to include this in the blender/ module, it's supposed to be abstracted from that.
I suggest adding a Device::compile_opencl_kernel() function that takes a string array.
I would prefer to keep this logic as localized as possible in the OpenCL code. This could just pass a string array, and then let the OpenCL device code unpack the arguments.
This could use OIIO::Sysutil::this_program_path(), which we already use elsewhere.
This will not work for file paths with spaces. Since escaping is not simple in general, I suggest using execv (and _execv on Windows) instead.
In your review you added a comment suggesting execv and _execv due to possibly incorrect argument passing.
The execv functions replace the current process image, and in my tests the new process failed to receive the correct arguments.
Googling did not give me any useful answer (process forking).
As std::system is also used in other locations (registering the blend file, linking PTX, compiling CUDA kernels), and we have a safeguard for when the compilation fails, I used system.
If you have any idea how to solve the argument passing, please let me know.
I found this MIT-licensed library that might help and seems to be a professional implementation: https://github.com/arun11299/cpp-subprocess. Unfortunately, it doesn't seem to support the Windows platform.