Cycles: Allow rendering with GPUs and CPUs at once
ClosedPublic

Authored by Brecht Van Lommel (brecht) on Oct 8 2017, 1:37 PM.

Details

Summary

CPU rendering will be restricted to a BVH2, which is not ideal for raytracing
performance but can be shared with the GPU. Decoupled volume shading will be
disabled to match GPU sampling.

Thread priority or GPU sync tweaks are likely needed to improve performance, but I might as well post the patch for testing already. Perfect scaling is not going to happen due to BVH2 usage, though.

Go to User Preferences > System to enable the CPU to render alongside the GPU.
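
A minimal sketch of toggling the same setting from the Python console, assuming the 2.79-era preferences API (user_preferences, compute_device_type, the per-device "use" flag) and that the patch exposes the CPU as an extra entry in the device list; illustrative only, not the exact code from the patch:

import bpy

prefs = bpy.context.user_preferences.addons['cycles'].preferences
prefs.compute_device_type = 'CUDA'   # or 'OPENCL'
prefs.get_devices()                  # refresh the detected device list

# Enable every detected device, including the CPU entry added by this patch.
for device in prefs.devices:
    print(device.name, device.type, device.use)
    device.use = True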

Diff Detail

Repository
rB Blender

Some results comparing GTX 1080 and Core i7-4790K + GTX 1080, on Linux with tile size 32x32.

Some scenes scale OK, others not so much. It would help to support rendering smaller tiles faster on the GPU or to let multiple CPU cores work on the same tile. These are relatively short 10-25s renders though; I expect it would be less of an issue with more samples.

Reserving a thread for CUDA may be needed, or perhaps raising its priority, though disabling the spin loop seems not too bad here. On Windows the OS thread scheduler may not work as well, however.

Replacing line 1374 with:

if entry.type == 'OPENCL' or entry.type == 'CPU':
    # Also add CPU (and OpenCL) entries to the OpenCL device group,
    # skipping any entry that is already in the CUDA group.
    if entry not in cuda_devices:
        opencl_devices.append(entry)

makes OpenCL, CUDA and CPU work together; however, since OpenCL has a bottleneck with tiles below 64x64, rendering is slower with the CPU added.

Supporting CUDA + OpenCL rendering would be nice indeed. The main challenge there is changing the UI so we get a single device list instead of separate CUDA and OpenCL lists, along with the associated changes to preserve backwards compatibility.

Regarding performance, there's a bunch of factors:

  • If lowering the number of CPU threads in the Performance panel to "number of CPU cores - number of GPUs" helps, that means the GPU thread will need higher priority, or we will need to apply such a heuristic automatically (see the sketch after this list).
  • The addition of CU_CTX_SCHED_BLOCKING_SYNC may slow down GPU rendering (with or without adding the CPU for rendering); this could be left out, but that requires lowering the number of CPU threads.
  • If the tile is too big it can happen that the GPU finishes early and then has to wait on CPU threads to finish their tiles.
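
As a rough illustration of the first point in the list above, the thread-count heuristic could look like the sketch below (the helper name is hypothetical; the actual patch later reduces CPU threads by the number of GPU devices on the C++ side):

import multiprocessing

def cpu_render_threads(num_gpu_devices):
    # Keep one CPU core free per GPU so each GPU's driver/submission
    # thread is not starved by CPU render threads.
    total_cores = multiprocessing.cpu_count()
    return max(total_cores - num_gpu_devices, 1)

# Example: 8 logical cores and 2 GPUs -> 6 CPU render threads.
print(cpu_render_threads(2))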

This could be a good time to implement different tile sizes for GPU/CPU, or, if it's possible, an adaptive one that subdivides the remaining tiles according to the number of devices and the work left.

Just an idea; I'm not sure if this can be done or if it's possible at the current stage of development.

The plan is to let each GPU render many small tiles at once. Different tile sizes or subdivision would still constrain the tile shape to be rectangular, which doesn't always fit well.

From tests made in the UI, the CPU is indeed always the last one to finish. The more threads a CPU has, the higher the probability that the GPUs will idle, because those 16 or 32 tiles are already being rendered very slowly by the CPU. So if it's possible without too much work, I would say it would be more effective in real scenarios to let all the CPU threads render one tile together, just like all the threads of a GPU render one tile. It may also improve cache behaviour and increase render speed. Of course, letting the GPU render several tiles would still be needed to ensure better occupancy.

Sergey Sharybin (sergey) added inline comments.
intern/cycles/blender/addon/properties.py
1379

That would mix up the split kernel for OpenCL and the megakernel for the CPU. Are you sure it's not going to cause any issues (since the memory buffers are kind of different for those kernels)?

It's also going to be confusing if one has a CPU OpenCL SDK installed. But how can we indicate that one CPU is the bare-metal CPU and the other one is the OpenCL CPU?

intern/cycles/blender/blender_sync.cpp
731

I don't like this 0, it's going to backfire sooner or later. Can we have a proper enum defined somewhere so we are safe against possibly changing IDs here?

intern/cycles/device/device_cuda.cpp
247 ↗(On Diff #9365)

Did you run benchmarks?

I can do some on the weekend when back in Amsterdam. I only have a GT 620 as a display card here, which isn't really good for benchmarks.

All in all, I don't think we have to wait for this patch to become ideal before committing. Better to get it into git sooner rather than later, and code-wise the feedback is minimal.

This revision is now accepted and ready to land. Oct 10 2017, 9:43 AM
intern/cycles/blender/addon/properties.py
1379

Split/mega kernel should not be a problem; it's handled per device as far as I can tell, and I've found no issues in testing.

Regarding devices being listed multiple times, I think the first goal should be to avoid that. And then if it does happen (due to debug flags being used) we can add (CUDA) and (OpenCL) to the end of the name.

intern/cycles/device/device_cuda.cpp
247 ↗(On Diff #9365)

There was a render time reduction of a few percent in my tests with a GTX 1080 on Linux, which is not too bad. But I'm not at all confident this extends to other platforms or different/multiple GPUs, viewport render vs. final render, etc.

Basically I need to do much more testing still, and this also interacts with other changes like D2862. The safe thing would be to reduce the number of CPU threads so each GPU keeps a dedicated CPU core, but ideally we can avoid it. It would be nice to avoid CPU cores running at 100% when rendering on the GPU only.

Here's a more drastic change, exposing a single list of CUDA and OpenCL
devices. If both types of devices are available, we disambiguate them with
(CUDA) and (OpenCL) in the name.

The downside is that this breaks API compatibility in that the enum items
change, which I guess is ok for Blender 2.8.
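
For illustration, building such a unified list with disambiguation suffixes could look roughly like the following sketch (function and attribute names are illustrative, not the code as committed):

def build_device_items(devices):
    # devices: objects with .id, .name and .type ('CUDA' or 'OPENCL').
    need_suffix = len({d.type for d in devices}) > 1
    items = []
    for d in devices:
        label = d.name
        if need_suffix:
            # Disambiguate only when both backend types are present.
            label = "%s (%s)" % (d.name, "CUDA" if d.type == 'CUDA' else "OpenCL")
        # (identifier, label, description) triples as used for EnumProperty items.
        items.append((d.id, label, ""))
    return items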

Viewport rendering works quite poorly with this, since it basically splits the viewport in half between the CPU and GPU. If the GPU is e.g. 4x faster than the CPU, this slows things down overall. Perhaps the CPU should remain disabled for viewport rendering, or only kick in after a few seconds, if we can figure out a way to do better work distribution at that point with bigger numbers of samples at a time.
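
For concreteness, the arithmetic behind that 4x example (a tiny illustrative calculation, not code from the patch):

# Normalized render times for one full viewport frame.
gpu_time = 1.0
cpu_time = 4.0   # CPU assumed 4x slower than the GPU

# With a half/half split, the frame is only done when the slower half finishes.
split_time = max(gpu_time / 2.0, cpu_time / 2.0)
print(split_time)  # 2.0 -> twice as slow as rendering on the GPU alone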

Viewport rendering of the BMW scene from the official benchmark pack takes 12 seconds on a 1080 Ti, 20 seconds on a Vega 64 and 16 seconds using both. With an F12 render it's the opposite: Vega is faster with 82 seconds (at 128x128, best time), the 1080 Ti takes 93 seconds (at 16x16, best time), and both together take 44 seconds, using latest master with initial_num_samples at 5000.
To sum up:

  • The viewport seems really slow in latest master with OpenCL. 2.78c with selective node compilation for the viewport renders nearly 2x faster on Vega 64. It's not due to SSS or volume, as those are not compiled into the viewport kernel either. I can investigate that.
  • Multi-device rendering with viewport/progressive rendering is slower than the fastest device alone. Logically it should only have to wait for the slower device to finish its half, which would be around 10 seconds for Vega?

  • The viewport seems really slow in latest master with OpenCL. 2.78c with selective node compilation for the viewport renders nearly 2x faster on Vega 64. It's not due to SSS or volume, as those are not compiled into the viewport kernel either. I can investigate that.

I guess the bicubic stuff is part of that, but we should indeed figure out what the other reasons are.

  • Multi-device rendering with viewport/progressive rendering is slower than the fastest device alone. Logically it should only have to wait for the slower device to finish its half, which would be around 10 seconds for Vega?

It's probably not that simple. For example, a few paths with many bounces may take some minimum amount of time to finish (especially with a split kernel), and the only way to hide that may be to render more samples at a time, the way we do for batch renders.

In general, scheduling interactive rendering across different GPUs is a very difficult problem. We could do a little better than splitting in half, for example by using the number of cores multiplied by the clock speed to find a more appropriate split, but between NVidia and AMD that's also not very meaningful. And it probably would not solve this particular latency problem.
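
As an illustration of that "cores multiplied by clock speed" weighting (the device numbers below are approximate and only for the example; as said, comparing them across vendors is of limited use):

def split_fractions(devices):
    # devices: (name, compute_units, clock_mhz) tuples.
    weights = {name: units * clock for name, units, clock in devices}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Approximate numbers for a GTX 1080 Ti (28 SMs @ ~1480 MHz) and a
# Vega 64 (64 CUs @ ~1546 MHz); note this would give Vega the larger
# share even though the 1080 Ti wins the viewport benchmark above.
print(split_fractions([("GTX 1080 Ti", 28, 1480), ("Vega 64", 64, 1546)]))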

Reduce CPU threads by number of GPU devices, re-enable CUDA spinning.

I think this is ready to commit now.

Use only GPU for viewport render, CPU slows it down too much.

This revision was automatically updated to reflect the committed changes.