Cycles: CUDA faster rendering of small tiles, using multiple samples like OpenCL.
Closed, Public

Authored by Brecht Van Lommel (brecht) on Sep 27 2017, 1:44 PM.

Details

Summary

The work size is still very conservative, and this doesn't help with progressive refine; for that we will need to render multiple tiles at the same time. But this should already help denoising renders that require too much memory with big tiles, and generally soften the performance drop-off with small tiles.

For the benchmark scenes on a GTX 1080, going down from tile size 256x256 to 32x32 doesn't seem to lose any performance. Rendering with bigger tiles that cover the entire image is still a bit faster though.

Note the GTX 1080 has 2560 cores, and the heuristic results in a minimum work size of 25600 on that card (assuming there are sufficient AA samples). That corresponds to a tile size of 160x160 with 1 sample, which we know is not the fastest. We can get close to the performance of bigger tiles by multiplying step_samples by 8, for example, but I'd like to have a better solution for driver timeouts before we automatically increase the work size that much.
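To make the arithmetic above concrete, here is a minimal sketch of such a work size heuristic. The factor of 10 work items per core is only inferred from the quoted numbers (25600 / 2560); the function names and the clamping to the remaining AA samples are assumptions for illustration, not the code in this diff.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical constant: 25600 / 2560 = 10 work items per core, inferred
// from the numbers in the summary, not taken from the Cycles source.
constexpr int kWorkItemsPerCore = 10;

// Minimum number of work items (pixel samples) to launch at once.
int min_work_size(int num_cores)
{
    return num_cores * kWorkItemsPerCore;
}

// Samples to render per pixel in one launch so that
// tile_w * tile_h * step_samples reaches min_work, capped by the
// number of AA samples that still remain to be rendered.
int step_samples(int tile_w, int tile_h, int min_work, int remaining_samples)
{
    assert(tile_w > 0 && tile_h > 0 && remaining_samples > 0);
    const int pixels = tile_w * tile_h;
    const int needed = (min_work + pixels - 1) / pixels;  // ceiling division
    return std::max(1, std::min(needed, remaining_samples));
}
```

Under these assumptions a 160x160 tile needs 1 sample per launch to reach 25600 work items, while a 32x32 tile would need 25, provided that many AA samples remain.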

This required a bunch of refactoring, see the History tab to inspect individual commits. Overall code ended up simpler than before.

Diff Detail

Repository
rB Blender

Code refactor: remove rng_state buffer and compute hash on the fly.

A little faster on some benchmark scenes, a little slower on others, seems
about performance neutral on average and saves a little memory. Helps avoid
the need for an equivalent of the data_init kernel for megakernels.
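
As a rough illustration of computing a per-pixel hash on the fly instead of reading it from a buffer, the sketch below derives a 32-bit value from the pixel coordinates and a seed. The mixer is the MurmurHash3 32-bit finalizer, chosen here only as a well-known example; the function names and the way the inputs are combined are assumptions, and the hash used in Cycles may differ.

```cpp
#include <cstdint>

// MurmurHash3 32-bit finalizer, used here purely as an example mixer.
inline uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

// Hypothetical per-pixel hash: a pure function of (x, y, seed), so no
// per-pixel state buffer has to be allocated or initialized beforehand.
inline uint32_t pixel_rng_hash(uint32_t x, uint32_t y, uint32_t seed)
{
    return mix32(x ^ mix32(y ^ mix32(seed)));
}
```

Because the result depends only on its inputs, any thread can recompute it for any pixel, which is what removes the need for an initialization pass.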

Code refactor: use split variance calculation for mega kernels too.

There is no significant difference in denoised benchmark scenes and
denoising ctests, so might as well make it all consistent.

Code refactor: zero render buffers outside of kernel.

This was originally done with the first sample in the kernel for better
performance, but it doesn't work anymore with atomics. Any benefit was
very minor anyway, too small to measure it seems.

CUDA faster rendering of small tiles, using multiple samples like OpenCL.
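
One way to picture rendering several samples of a small tile in a single launch is a mapping from a flat work index to a pixel and a sample number. The sketch below is an assumption for illustration: the struct, its field names, and the sample-major ordering are hypothetical and do not reproduce the WorkTile struct from this revision.

```cpp
#include <cassert>

// Hypothetical description of one launch; field names are illustrative.
struct TileWork {
    int x, y;          // top-left pixel of the tile
    int w, h;          // tile size in pixels
    int start_sample;  // first sample rendered in this launch
    int num_samples;   // samples per pixel in this launch
};

struct PixelSample {
    int x, y, sample;
};

// Total number of work items (threads) for the launch.
int total_work(const TileWork &t)
{
    return t.w * t.h * t.num_samples;
}

// Decode a flat work index into pixel coordinates and a sample number.
// All pixels of one sample come first, then all pixels of the next.
PixelSample decode_work(const TileWork &t, int work_index)
{
    assert(work_index >= 0 && work_index < total_work(t));
    const int pixels = t.w * t.h;
    const int sample_offset = work_index / pixels;
    const int pixel_offset = work_index % pixels;
    return {t.x + pixel_offset % t.w,
            t.y + pixel_offset / t.w,
            t.start_sample + sample_offset};
}
```

With this layout a 32x32 tile rendered with 25 samples per launch yields 25600 work items, so a small tile can still keep the GPU occupied.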

Brecht Van Lommel (brecht) retitled this revision from "Code refactor: add WorkTile struct for passing work to kernel." to "Cycles: CUDA faster rendering of small tiles, using multiple samples like OpenCL." (Sep 27 2017, 1:52 PM)
Brecht Van Lommel (brecht) edited the summary of this revision.

Code refactor parts already committed, just multiple samples rendering left.

This revision was automatically updated to reflect the committed changes.

Just gave this a whirl. Render time between 256x256 and 32x32 is nearly identical on sm_30 (GTX 670), so that's good. However, and I'm not sure if it's this exact commit, the time remaining calculation seems *way* off: on a render of about 1 minute, the estimated time remaining never went below 5 minutes, even when working on the last tile.