Allow CUDA GPU rendering to use host memory #48651


Stefan Werner · 2016-06-14T10:44:49+02:00

Stefan Werner commented

2016-06-14 10:44:49 +02:00

This patch will allow CUDA devices to use system memory in addition to VRAM. While this is obviously slower than VRAM, I think it is still better than not being able to render at all.

One related change rolled into this patch is that devices with compute >= 3.0 will now fetch kernel data through textures instead of global arrays again. This improves performance on Kepler cards, which don't use L1 caching on global loads, and this is even more apparent when the global data is in host memory instead of VRAM. Going through texture objects allows it to use L1 caching without running into the 4GB memory limit Cycles had when it was still using texture references on Kepler.

At this point, the patch is set to use no more than 1/2 of the system memory as rendering memory. Since system memory used for CUDA must be pinned, using too much of it can be bad for overall system performance. An obvious limitation here is that the 1/2 heuristic only works well with a single device; with multiple CUDA devices each trying to allocate that much memory, it could run into trouble. That still needs to be addressed, either through a better heuristic or a user parameter. I would also like to eventually extend it to share the pinned memory between GPUs where possible.
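To make the one-half cap and the multi-device concern concrete, here is a minimal sketch. It is not taken from the patch: the function name `host_memory_budget`, the even split between devices, and the example sizes are all assumptions for illustration. Dividing the cap evenly is only one possible "better heuristic".

```python
def host_memory_budget(total_system_bytes: int, num_devices: int = 1,
                       fraction: float = 0.5) -> int:
    """Return the pinned host memory each device may use, in bytes.

    Hypothetical helper: caps pinned memory at `fraction` of system RAM
    and splits that cap evenly, so several devices cannot each claim half.
    """
    if total_system_bytes < 0:
        raise ValueError("total_system_bytes must be non-negative")
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    total_cap = int(total_system_bytes * fraction)
    return total_cap // num_devices


GIB = 1024 ** 3
single = host_memory_budget(16 * GIB)               # one GPU on a 16 GiB machine
dual = host_memory_budget(16 * GIB, num_devices=2)  # two GPUs share the same cap
print(single // GIB, dual // GIB)
```

With a per-device cap of this kind, two GPUs together still stay within half of system RAM instead of each trying to pin half of it.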


Stefan Werner commented

2016-06-14 10:44:49 +02:00

Changed status to: 'Open'

Stefan Werner commented

2016-06-14 10:44:49 +02:00

Added subscriber: @Stefan_Werner

Stefan Werner commented

2016-06-14 10:45:27 +02:00


[patch.diff](https://archive.blender.org/developer/F317105/patch.diff)

Sergey Sharybin commented

2016-06-14 11:35:50 +02:00

Added subscriber: @Sergey

Sergey Sharybin commented

2016-06-14 11:35:50 +02:00

That's an interesting patch, but the claim that being able to render slowly is better than not being able to render at all is not fully convincing without real numbers. For example, why use such GPUs instead of pure CPU rendering?

So can you quantify the penalty and give some comparison of CUDA + host memory vs. CPU rendering?


Sergey Sharybin commented

2016-06-14 11:36:09 +02:00

Added subscriber: @brecht

Stefan Werner commented

2016-06-14 14:53:05 +02:00

Well, the answer is a clear "it depends". My desktop machine has a single i5 CPU and a K5000 GPU, so the GPU can take quite a performance penalty before the CPU overtakes it. My laptop has a quad-core i7 and a 750M; there, the GPU/CPU difference is not as strong. I know, K5000 sounds like a monster card, but for this purpose it's more or less a GTX 670 with twice the memory.

Then it depends on the scene of course, and how important the data that ends up in host memory is. When frequently used data, such as the BVH tree, gets pushed to host memory, things get much slower than if only a couple of small (in screen space) textures get pushed out. Luckily, Cycles allocates the BVH data structures first, so those should end up in VRAM whenever possible.
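The effect of allocation order can be illustrated with a simple first-come placement loop. This is only a sketch under assumptions, not how Cycles or the patch is implemented: `place_buffers` is a hypothetical helper, the buffer sizes are made up (in MiB), and `textures` is an invented name; the other buffer names are borrowed from the benchmark list in this comment.

```python
from typing import Dict, List, Tuple


def place_buffers(buffers: List[Tuple[str, int]], vram_free: int,
                  host_budget: int) -> Dict[str, str]:
    """Assign each (name, size) buffer to 'vram' or 'host' in allocation order."""
    placement: Dict[str, str] = {}
    for name, size in buffers:
        if size <= vram_free:
            # Still fits on the device: keep it in fast memory.
            vram_free -= size
            placement[name] = "vram"
        elif size <= host_budget:
            # Device is full: fall back to (pinned) host memory.
            host_budget -= size
            placement[name] = "host"
        else:
            raise MemoryError(f"no room for {name} ({size} MiB)")
    return placement


# Made-up sizes in MiB; BVH data is listed first, as it is allocated first.
buffers = [("bvh_nodes", 600), ("bvh_leaf_nodes", 200),
           ("tri_storage", 900), ("textures", 2500)]
result = place_buffers(buffers, vram_free=2000, host_budget=8000)
print(result)
```

Because the BVH buffers come first in the list, they claim VRAM before anything else, and only the large texture buffer spills to host memory in this example.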

As one data point, here's the BMW scene from the benchmarks:
i5-3335P CPU @ 3.10 GHz: 18m:53s
K5000, all data in VRAM: 8m:15s
K5000, bvh_nodes, bvh_leaf_nodes, prim_type, prim_object, object_node, tri_storage, prim_visibility (that is, all data used for ray traversal) in VRAM, rest (= all shading) on host memory: 9m:31s
K5000, only bvh_nodes and bvh_leaf_nodes in VRAM: 10m:58s
K5000, all data in host memory: 14m:49s

So for my machine, with that particular scene, the GPU wins over the CPU in all scenarios. Surely this will be different for other scenes or other machines.
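For readers who want the relative penalties, the timings above can be turned into ratios with a few lines of arithmetic. The helper `to_seconds` and the shortened labels are illustrative names, not part of the patch; the times are the ones listed in this comment.

```python
def to_seconds(stamp: str) -> int:
    """Convert a render time such as '18m:53s' to seconds."""
    minutes, seconds = stamp.replace(":", "").rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


timings = {
    "CPU (i5-3335P)": "18m:53s",
    "GPU, all data in VRAM": "8m:15s",
    "GPU, traversal data in VRAM": "9m:31s",
    "GPU, only BVH nodes in VRAM": "10m:58s",
    "GPU, all data in host memory": "14m:49s",
}

baseline = to_seconds(timings["GPU, all data in VRAM"])
cpu = to_seconds(timings["CPU (i5-3335P)"])

for label, stamp in timings.items():
    secs = to_seconds(stamp)
    # Ratio to the all-VRAM render, and how the CPU time compares.
    print(f"{label}: {secs} s, {secs / baseline:.2f}x the all-VRAM time, "
          f"CPU takes {cpu / secs:.2f}x as long")
```

By this arithmetic, keeping everything in host memory takes roughly 1.8 times as long as the all-VRAM render, while the CPU still takes about 1.27 times as long as that slowest GPU case.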

(BTW, this is the only scene I have benchmarked so far; it is not cherry-picked.)


Thomas Dinges commented

2016-06-14 15:04:49 +02:00

Added subscriber: @ThomasDinges

YAFU commented

2016-06-14 16:12:11 +02:00

Added subscriber: @YAFU

YAFU commented

2016-06-14 16:12:11 +02:00

Here is my test, in case it is useful. I'm not sure whether this also applies to Maxwell cards, for example.

System:
Kubuntu Linux 14.04 64-bit
GTX 960 4GB
CPU: i7-3770
RAM: 16 GB (running at 1600 MHz, if I remember correctly)
Scene: [scene_CUDA_vRAM.7z](https://archive.blender.org/developer/F317139/scene_CUDA_vRAM.7z)

Only layers 1 2 3 enabled in the scene (just to get an idea of the CPU/GPU times)
Buildbot 2.77.1, hash 049f715
GPU: 00:31.49 (480x270)
CPU: 00:24.75 (32x32)

Layers 1 2 3 4 5 enabled in the scene
Buildbot 2.77.1, hash 049f715
GPU: CUDA error: Out of memory in cuMemAlloc(&device_pointer, size)

CPU (32x32)
RAM total System: 8.4 GiB
Mem/Peak: 3785.27M
Time: 00:42.52

Patched Blender, hash 424f41a
GPU (480x270)
RAM total System: 9.7 GiB
Blender vRAM: 2830 MiB
Total System vRAM: 3237 MiB
Time: 00:48.82

Oscar commented

2016-06-14 16:14:14 +02:00

Added subscriber: @KINjO

Sergey Sharybin commented

2016-06-14 17:02:19 +02:00

Those are actually interesting timings; I thought the penalty would be much higher.

Stefan Werner commented

2016-06-14 18:08:43 +02:00

PCIe bandwidth seems to be the main factor here. When I move the K5000 to a slower slot (x4 instead of x16), performance drops dramatically. I haven't had the patience to let it run all the way through, but the BMW scene with all data in host memory is still at the first tile after 4 minutes, showing a remaining estimate of over one hour.

Wolfgang Faehnle commented

2016-06-14 20:12:22 +02:00

Added subscriber: @mib2berlin

Wolfgang Faehnle commented

2016-06-14 20:12:22 +02:00

Hi, I checked only the patched Blender (eaf894d) with YAFU's scene_CUDA_vRAM.7z.
I think my mainboard switches to x8 for dual GPU.

CPU 00:48.76

Dual GPU 00:35.86

Opensuse Leap 42.1 x86_64
Intel i5 3570K
RAM 16 GB
GTX 760 4 GB /Display card
GTX 670 2 GB
Driver 361.42

The GTX 670 2 GB uses 1.4 GB during rendering.
I use another CUDA engine, which uses out-of-core only for textures.
With it enabled, I get about 10% lower performance compared to pure VRAM.


Brecht Van Lommel commented

2016-06-14 21:52:51 +02:00

I haven't looked at the patch yet, but I think it makes sense to support this in some way.

How does this work exactly? Does the GPU access host memory on every L1/L2 cache miss, or does this pinned memory get cached in VRAM too, and data only gets loaded once when there is enough space in VRAM? In any case, at some point this should be tested with a scene that is actually bigger than VRAM.


Stefan Werner commented

2016-06-14 22:54:58 +02:00

I can get the Victor scene to start rendering on the GPU, but it fails with a kernel timeout after about 20 tiles (using the default small tile size). Maybe someone with TDR turned off can try benchmarking it?

I have plenty of scenes in Poser to test with (100+ textures of 4k resolution), and those seem to perform fine, though I haven't had the time to run fully timed benchmarks vs. the CPU yet.

Brecht: CUDA doesn't do any paging for us. What is in system RAM remains in system RAM; only the L2 and L1 caches help speed it up. Automatic paging will hopefully come with CUDA 8 and Pascal.


YAFU commented

2016-06-15 00:28:20 +02:00

With the default Cosmos Laundromat scene (for CPU benchmarking) I still get a CUDA out-of-memory error. I guess I need more than 16 GB of RAM for the patch to work in this scene?

Anyway, if I remove almost all of the grass particles on the ground plane, I can start rendering, but then I get "CUDA error: Launch exceeded timeout in cuCtxSynchronize()". If I reduce the tile size, the error takes longer to appear, but I always get it. The thing is that I'm on Linux. TDR is only for Windows, right?


Stefan Werner commented

2016-06-15 08:42:54 +02:00

The code for determining the amount of physical system memory is not implemented yet for Linux - I am about to create a Linux VM to implement and test that part. So it is possible that the current patch is not working as intended on Linux systems.

Stefan Werner commented

2016-06-15 12:25:08 +02:00

I have attached a patch that should hopefully take care of the Linux memory query. Apply this on top of the first patch.
[linuxmem.diff](https://archive.blender.org/developer/F317222/linuxmem.diff)
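For context, one common way to query physical memory on Linux is through POSIX `sysconf`. The sketch below shows that approach in Python; it is not the content of linuxmem.diff (which is not reproduced in this thread), and the function name `physical_memory_bytes` is an assumption for illustration.

```python
import os


def physical_memory_bytes() -> int:
    """Return total physical RAM in bytes via POSIX sysconf (works on Linux)."""
    page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
    page_count = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
    return page_size * page_count


total = physical_memory_bytes()
gib = 1024 ** 3
# Half of physical memory is the cap described in the original post.
print(f"physical memory: {total / gib:.1f} GiB, half: {total / 2 / gib:.1f} GiB")
```

A value obtained this way could feed the one-half cap described in the original post, assuming the platform exposes both `sysconf` names.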

Thomas Dinges commented

2016-06-15 12:35:22 +02:00

I think it would make sense to move this to the differential system. Easier to review and for testing. :)

Stefan Werner commented

2016-06-15 16:17:59 +02:00

OK, here you are: https://developer.blender.org/D2056

Aaron Carlisle commented

2016-06-20 03:09:31 +02:00

Added subscriber: @Blendify

Aaron Carlisle commented

2016-06-20 03:09:31 +02:00

Changed status from 'Open' to: 'Resolved'

Aaron Carlisle closed this issue

2016-06-20 03:09:31 +02:00

Aaron Carlisle self-assigned this 2016-06-20 03:09:31 +02:00

Aaron Carlisle commented

2016-06-20 03:09:31 +02:00


Discussion can continue at [D2056](https://archive.blender.org/developer/D2056)

Steffen Dünner commented

2017-05-26 12:17:57 +02:00

Added subscriber: @SteffenD

Leso_KN commented

2017-10-21 17:54:01 +02:00

Added subscriber: @Leso_KN

Leso_KN commented

2017-10-21 17:54:01 +02:00

This comment was removed by @Leso_KN


dean commented

2018-03-07 03:21:40 +01:00

Added subscriber: @dean-4

dean commented

2018-03-07 03:21:40 +01:00

I'm interested in downloading this patch because it will allow some overhead when rendering large scenes, but I'm not sure how to download it because I have never used GitHub before. If anyone could show me how to download it, it would help a lot. Thanks.

Brecht Van Lommel commented

2018-03-07 03:33:54 +01:00

You can download a daily build from here, it will have the functionality included:
https://builder.blender.org/download


dean commented

2018-03-09 03:12:01 +01:00

Thank you, the build worked and even allowed me to render the Gooseberry benchmark on my 3 GB 780 Ti. Unfortunately, the tiles, which I set to 256 x 256, seemed to render slower than I expected, but that could be because I don't have a lot of system RAM (only 8 GB), and I think it's possible Blender may be paging my system, although I'm not sure.

Brecht Van Lommel commented

2018-03-09 04:01:28 +01:00

Slower rendering is quite possible when using this. From the [release notes](https://wiki.blender.org/index.php/Dev:Ref/Release_Notes/2.80/Cycles):

> CUDA rendering now supports rendering scenes that don't fit in GPU memory, but can be kept in CPU memory. This feature is automatic but comes at a performance cost that depends on the scene and hardware. When image textures do not fit in GPU memory, we have measured slowdowns of 20-30% in our benchmark scenes. When other scene data does not fit on the GPU either, rendering can be a lot slower, to the point that it is better to render on the CPU.

Lincoln Deen commented

2019-04-16 02:24:09 +02:00

Added subscriber: @Lincoln

Jay Versluis commented

2021-03-26 15:12:04 +01:00

Added subscriber: @Versluis

Jay Versluis commented

2021-03-26 15:12:04 +01:00

Hi Team, this patch works great on single GPU systems, but seems to throw an error on dual GPU systems. If I disable one of my GPUs, the scene renders fine, but with two GPUs I get the error "Invalid value in cuMemcpy2DUnaligned(&param)". Happy to submit this as a bug report with the test scene if you point me in the right direction.

Here's a StackExchange thread from a different user with the same problem: https://blender.stackexchange.com/questions/211696/cuda-problem-error-invalid-value-in-cumemcpy2dunalignedparam/215948


Brecht Van Lommel commented

2021-03-26 15:49:24 +01:00

For bug reports, please use:
https://developer.blender.org/maniphest/task/edit/form/1/

