
Cycles split kernel optimizations
Closed, Archived | Public | Design

Assigned To
None
Authored By
Brecht Van Lommel (brecht)
Nov 10 2020, 2:45 PM

Description

This task intends to gather ideas to optimize the Cycles split kernel.

There are a few major things to work on:

  • Reorganize the split kernels so that shader evaluation and ray tracing are not duplicated across multiple kernels, or at least are duplicated fewer times. This could reduce kernel compile times and register pressure, and improve coherence. Alternatively, shader evaluation for background, shadows, and volumes could be specialized so that it excludes nodes not used in any such shader (for example, volumes don't need BSDFs).
    • This would likely involve some rethinking of how we handle transparent shadows, and maybe light and background shaders.
  • Replace the use of queues and atomics for scheduling work with sorting between split kernels, as done by some other renderers.
  • Reduce the size of the state that needs to remain in memory between split kernels. This state limits the number of rays that can be active at once, which can result in low occupancy or not enough memory left for scene data.
    • Render passes could be written directly to memory to make PathRadiance much smaller (T72293)
    • ShaderData: there are ways to reduce the size of shader globals, such as computing some members on demand rather than storing them, compression (lossless or lossy), and simpler differentials.
    • It may be possible to structure the kernels so that only one closure needs to be stored at a time, or for a shorter time. However, this may come with some noise trade-offs.
  • If shader evaluation is isolated to one kernel, or at least to fewer kernels, sorting rays by material ID can improve coherence. Similar sorting may help other split kernels.
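
As a rough illustration of the sorting idea in the list above, here is a minimal host-side sketch of ordering active paths by shader ID with a stable counting sort. It is not Cycles code: the function name `sort_paths_by_shader`, the flat `shader_ids` array, and the `num_shaders` parameter are all assumptions made for the example. A GPU version would presumably run the histogram, prefix sum, and scatter as parallel passes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Cycles): given the shader ID of each
// active path, return the path indices reordered so that paths sharing a
// shader are contiguous. Uses a stable counting sort.
std::vector<uint32_t> sort_paths_by_shader(const std::vector<uint32_t> &shader_ids,
                                           uint32_t num_shaders)
{
  // Pass 1: histogram. offsets[s + 1] counts the paths that use shader s.
  std::vector<uint32_t> offsets(num_shaders + 1, 0);
  for (uint32_t id : shader_ids) {
    assert(id < num_shaders);
    offsets[id + 1]++;
  }

  // Pass 2: prefix sum. Afterwards offsets[s] is the first output slot
  // reserved for shader s.
  for (uint32_t s = 0; s < num_shaders; s++) {
    offsets[s + 1] += offsets[s];
  }

  // Pass 3: scatter each path index into the next free slot of its shader.
  std::vector<uint32_t> order(shader_ids.size());
  for (uint32_t path = 0; path < shader_ids.size(); path++) {
    order[offsets[shader_ids[path]]++] = path;
  }
  return order;
}
```

A shading kernel would then process paths in the returned order, so that neighbouring threads tend to evaluate the same shader. With shader IDs `{2, 0, 1, 0, 2}` and three shaders, the returned order is `{1, 3, 2, 0, 4}`.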

I would consider removing branched path tracing for GPU rendering entirely (T52725), since this is not particularly suitable for GPUs and complicates the code. It would be easier to refactor without this.

Event Timeline

Brecht Van Lommel (brecht) changed the task status from Needs Triage to Confirmed. Nov 10 2020, 2:45 PM
Brecht Van Lommel (brecht) created this task.

OpenCL 2.0 introduced pipes, a means of communication between concurrently running kernels.
The benefit is that a pipe has a fixed size in global memory and is more likely to be cached by hardware caches. Pipe sizes could also be changed to fine-tune performance.

It would be worth researching this as an alternative to queueing and sorting.

Shader globals could also be split into multiple smaller variants: one optimized for intersection, another for shading, another for integration, and so on.
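
To make the idea of smaller per-stage variants concrete, here is a minimal sketch. Every struct and field below is an illustrative assumption and does not reflect the real `ShaderData` layout. Each stage gets its own array, so a kernel for one stage only has to load and store its own small struct rather than one monolithic state.

```cpp
#include <cstddef>
#include <vector>

// All types and fields here are hypothetical, chosen only to illustrate
// the split; they are not the actual Cycles data structures.
struct Float3 {
  float x, y, z;
};

// What a ray traversal kernel might hand over to later stages.
struct IntersectionState {
  float t, u, v;
  int prim;
  int object;
};

// What a shader evaluation kernel might need.
struct ShadingState {
  Float3 P, N, I;
  int shader;
};

// What the integrator might need to carry between bounces.
struct IntegrationState {
  Float3 throughput;
  Float3 radiance;
  int bounce;
};

// The single combined layout that the split is meant to avoid.
struct MonolithicState {
  IntersectionState isect;
  ShadingState shading;
  IntegrationState integration;
};

// One array per stage: each kernel touches only the array it needs.
struct PathStatePool {
  std::vector<IntersectionState> isect;
  std::vector<ShadingState> shading;
  std::vector<IntegrationState> integration;

  explicit PathStatePool(std::size_t num_paths)
      : isect(num_paths), shading(num_paths), integration(num_paths)
  {
  }
};
```

Under these assumed fields, each per-stage struct is well under half the size of `MonolithicState`, which hints at the potential savings in memory traffic and register pressure per kernel. The total memory held between kernels only shrinks if some variants can be discarded or recomputed rather than kept alive.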

Brecht Van Lommel (brecht) changed the subtype of this task from "Report" to "Design". Nov 11 2020, 6:03 PM