Cycles: schedule more work and prevent CUDA driver timeouts.
Needs Revision · Public

Authored by Brecht Van Lommel (brecht) on Sep 29 2017, 4:33 AM.

Details

Reviewers
Sergey Sharybin (sergey)
Group Reviewers
Cycles
Summary

This should further improve CUDA performance with small tiles if there
are sufficient AA samples, while explicitly preventing driver timeouts.

I tried to use the existing work stealing code from get_next_work(),
but always found it 5-10% slower than hardware scheduling even when
trying to optimize it to avoid atomics.

Only tested with GTX 1080 on Linux so far. With this change benchmark
scene render times with tile sizes between the full render and 32x32
are within a few % of each other.

It would not surprise me if there are problems on other cards and
platforms though; that needs more testing.

Depends on D2856

Diff Detail

Repository
rB Blender
Branch
bits
Build Status
Buildable 860
Build 860: arc lint + arc unit

If the same behavior could be obtained on OpenCL, the next benchmark pack could ship a single version with 32x32 tiles for CPU, CUDA and OpenCL. That would also spare end users from playing with the tile size. On OpenCL currently, only tile sizes between 256 and 64 are about the same speed; 32 is much slower, and a single tile is as well (by a two-digit percentage). It might also later allow a tile server that renders one frame with the CPU code path for CPUs, the CUDA code path for NVidia and the OpenCL code path for AMD, all using one tile size.

That's certainly where I would like to end up eventually, a fixed small tile size that automatically works well for all devices. I'm not sure if communicating with running kernels works with OpenCL though, and we will need support for rendering multiple tiles with a single GPU device as well.

Sergey Sharybin (sergey) requested changes to this revision. Oct 10 2017, 9:56 AM

Seems the patch needs an update now, so it can be applied cleanly using arc?

intern/cycles/device/device_cuda.cpp
1926

Eugh. Don't like auto at all here; it's quite unreadable as to what the type is. Is it even a scalar, or some more comprehensive struct?

Also, the function itself belongs more in util_time.h.

intern/cycles/kernel/kernel_compat_cuda.h
177

Fullstop.

180

Two spaces after # before define; it is an indented preprocessor directive.

intern/cycles/kernel/kernels/cuda/kernel.cu
38

I know it's .cu file, but shouldn't we still use ccl_local to keep code more consistent between device-specific and device-non-specific cases?

This revision now requires changes to proceed. Oct 10 2017, 9:56 AM

I'll revisit this patch when I've done some other changes. Initially I was testing with a GTX 1080 on Linux, but I found out this combination supports compute preemption and doesn't suffer from display freezing anyway. Windows does not support compute preemption and it seems for a responsive display you need to set the timeout quite low there, which makes performance worse than before. That may be useful anyway for viewport rendering, but then probably should also have some UI control or only apply to viewport render.

intern/cycles/device/device_cuda.cpp
1926

Right, I'll replace this code, was just something quick to get it working.

intern/cycles/kernel/kernels/cuda/kernel.cu
38

The whole function makes some specific assumptions about the way NVidia GPUs work, so I'm not sure the abstraction makes things more clear. But I can make it use the more generic macros anyway.