
Cycles: multi-device rendering performance
Needs Triage · Normal · Public · To Do

Assigned To
None
Authored By
Sergey Sharybin (sergey)
Jul 13 2021, 12:22 PM

Description

Initial work on multi-device rendering has been done, putting all the required building blocks in place. However, the actual balancing of the amount of work across devices is still to be improved.

There are two main issues with the current logic:

  • The per-device performance estimate is based on the time the device spent path tracing. This works fine for a frame of uniform complexity, but fails when the frame contains an easy area such as the sky.
  • Headless rendering needs to schedule rebalancing a bit more often than it currently does (it renders 1 sample, and then keeps the devices occupied for ~30 sec). This can leave a device idle for a long time if the balance was not calculated accurately enough.

Tweaking the headless scheduling is simple, but from my own experiments it is really important to schedule pixels of roughly equal complexity to the devices, otherwise the balance will not converge fast enough (or will never converge).

There are a few ideas to try:

  • Change the objective function from "amount of work based on an approximate performance of a device on uniform work" to "amount of work based on equalizing the time the devices spend path tracing". From experiments, the tricky part here is choosing the weight step (avoiding over-compensation, while still converging to a good balance quickly).
  • Do interleaved scanline scheduling. The issue with this approach is that adaptive sampling becomes very tricky.
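The "equalize time spent" objective from the first idea can be sketched as a damped weight update. This is only an illustrative model, not the actual Cycles implementation: `rebalance_weights`, the weight/time vectors, and the `step` parameter are all hypothetical names chosen for the sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: nudge each device's share of the frame toward the point where all
// devices take equal time. `step` in (0, 1] damps the update: small values
// avoid over-compensation, large values converge faster but may oscillate.
std::vector<double> rebalance_weights(const std::vector<double> &weights,
                                      const std::vector<double> &times,
                                      const double step)
{
  double total_time = 0.0;
  for (const double t : times) {
    total_time += t;
  }
  const double target = total_time / times.size();

  std::vector<double> result(weights.size());
  double sum = 0.0;
  for (size_t i = 0; i < weights.size(); i++) {
    // A device that took longer than average gets proportionally less work.
    const double correction = target / times[i];
    result[i] = weights[i] * (1.0 - step + step * correction);
    sum += result[i];
  }
  // Re-normalize so the weights still partition the full frame.
  for (double &w : result) {
    w /= sum;
  }
  return result;
}
```

With two equally weighted devices where one took twice as long, a full step (`step = 1.0`) shifts the weights to roughly 1/3 and 2/3; a smaller step would move only part of the way there, trading convergence speed for stability.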

The scene which best demonstrates the shortcomings of the current approach is pabellon.blend in an F12 render (it behaves OK in headless rendering, because a half-decent balance after the first sample is good enough, and subsequent re-balancing does not happen often and hence does not hurt performance).

Various open related topics:

  • Command-line rendering (i.e. the benchmark tool) seems to have a higher deviation in render times than one would expect. Is thermal throttling/boost interfering with the balancing strategy/schedule?
  • Rebalancing close to the end of adaptive sampling: what is the best thing to do? Fall back to a single device to avoid overheads?

Event Timeline

Some of the logic has been tweaked in rB289a173d938f, which helped a lot to avoid a huge speed regression in the Pabellon scene when a fast GPU is used together with a slow CPU.

There is still some performance penalty for such an unbalanced configuration, and it is not yet clear what the proper solution would be. In a way, tile-based rendering dealt with such configurations better (smaller scheduling units combined with work stealing), but it also did not keep the GPUs really busy. So there is a trade-off between reacting quickly enough to an unbalanced configuration and keeping the devices always occupied.

It would be interesting to test whether the observed slowdown is a constant time cost (a penalty to balance things out during the first samples of the render, meaning the percentage of the penalty goes down as more samples are added) or a constant percentage of the overall render time (meaning the percentage does not go down as more samples are added).
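The experiment above boils down to comparing the penalty at two sample counts. A minimal sketch of the arithmetic, with purely illustrative timing numbers (the `Measurement` struct and the values are assumptions, not real profiles):

```cpp
#include <cassert>

// For a given sample count, compare the actual multi-device render time
// against an ideal (perfectly balanced) time. If the absolute penalty stays
// roughly constant as samples grow, the cost is a one-off balancing/startup
// overhead; if the percentage stays constant, the overhead scales with the
// whole render.
struct Measurement {
  int samples;
  double ideal_time;   // seconds, perfectly balanced baseline
  double actual_time;  // seconds, measured multi-device render
};

double penalty_seconds(const Measurement &m)
{
  return m.actual_time - m.ideal_time;
}

double penalty_percent(const Measurement &m)
{
  return 100.0 * (m.actual_time / m.ideal_time - 1.0);
}
```

For example, a 2-second penalty at both 128 and 1024 samples would mean the percentage drops from 20% to 2.5% of a 10 s vs. 80 s ideal render, pointing at a constant-time cost rather than a proportional one.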

Thanks for all the hard work everyone!
Now that we are doing 'Multi-device rendering' again...
Can we please include an option for animations to do one frame per device?
I think this will improve performance as a faster device can do multiple frames while the slower devices plod along with their single frames.
Thanks for taking time to listen. Take care and stay safe, E:)

That's an interesting idea, but it is outside the scope of the current Cycles X development. It is also something that is usually implemented as part of render farm software.

Okay Sergey thanks for the reply.
I have raised the issue with the folks at CrowdRender.
But they are only able to do tile sharing, which, as was previously mentioned, is precarious due to the varying complexity from one frame to the next.
A lot of wasted potential.
The reason for my initial post was because previously Blender had its own Network Render ability built in and I was hoping that might be the case soon again.
Thanks again for all the talent and inspiration, E:)

From the latest meeting notes:

Multi GPU rendering in Cycles X does not appear to scale well beyond 2 GPUs right now, whereas previously Cycles scaled better up to 8 GPUs. The cause is unclear, Feng and Patrick will investigate it. Brecht and Sergey currently do not have access to such a setup, but may be able to help when there are profiles, backtraces or debug logs to look at. Brecht will list it as a bug to be fixed for the 3.0 release.

https://devtalk.blender.org/t/2021-09-14-blender-rendering-meeting/20469

The cause of this is likely different from the issues mentioned in this task, but we have to find out what it is exactly.

A quick way to check whether it is a different problem or not is to disable rebalancing (should be as easy as an early return in PathTrace::rebalance) and render a uniform image (so that the complexity of the slices is roughly the same).
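For such experiments it can be handy to gate the behavior behind a debug flag rather than editing the function body each time. The sketch below uses a stand-in class (`PathTraceStub` and its members are illustrative, not the actual Cycles `PathTrace` API) to show the shape of the early-return experiment:

```cpp
#include <cassert>

// Stand-in for the real PathTrace class, showing how an early return at the
// top of rebalance() freezes the initial work distribution for the whole
// render, which isolates rebalancing from other multi-GPU scaling effects.
class PathTraceStub {
 public:
  bool debug_disable_rebalance = false;
  int rebalance_count = 0;

  void rebalance()
  {
    if (debug_disable_rebalance) {
      /* Early return: keep the initial per-device weights unchanged. */
      return;
    }
    /* The real balancing logic would run here. */
    rebalance_count++;
  }
};
```

If render times stay poor with rebalancing disabled and a uniform image, the scaling problem lies elsewhere (e.g. tile sizing or driver-level contention).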

It could also be a non-ideal tile size calculated for a narrow slice in tile_calculate_best_size. The easiest check, I think, would be to render a higher-resolution image to force the slices to be taller.

It could also be something outside of our control, like a lock in the driver which we hit much more often with all the micro-kernel enqueuing.

This comment was removed by Milan Jaros (jar091).
Brecht Van Lommel (brecht) renamed this task from Cycles X - Multi-device rendering performance to Cycles: multi-device rendering performance.Oct 28 2021, 3:04 PM

Any updates on this?

Some of us spent 20K buying 4 3090s and would love to use it in a single Blender instance 😅

There is some work-in-progress development in D14014 and D14083 (with the corresponding task for the latter, T95687).
Both have some shortcomings, and finding the best solution is not trivial and takes time.

Thanks for sharing the WIPs. I'll track the progress there.