Allow CUDA GPU rendering to use host memory #48651

Closed
opened 2016-06-14 10:44:49 +02:00 by Stefan Werner · 37 comments
Member

This patch will allow CUDA devices to use system memory in addition to VRAM. While this is obviously slower than VRAM, I think it is still better than not rendering at all.

One related change rolled into this patch is that devices with compute >= 3.0 will now fetch kernel data through textures instead of global arrays again. This improves performance on Kepler cards, which don't use L1 caching on global loads, and this is even more apparent when the global data is in host memory instead of VRAM. Going through texture objects allows it to use L1 caching without running into the 4GB memory limit Cycles had when it was still using texture references on Kepler.
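
For illustration, here is a minimal sketch of what fetching through a texture object can look like with the CUDA driver API (the function name and structure are mine, not the actual patch code): a linear device buffer is wrapped in a `CUtexObject`, so kernel fetches go through the texture path and get L1 caching on Kepler, without the size limit of legacy texture references. The same works when the buffer is a device pointer aliasing mapped pinned host memory.

```cpp
/* Minimal sketch (not the actual patch code): wrapping a linear device
 * buffer in a CUtexObject so kernel fetches go through the texture path
 * and are L1-cached on Kepler. */
#include <cuda.h>
#include <cstring>

CUtexObject make_linear_float_tex(CUdeviceptr buffer, size_t num_floats)
{
  CUDA_RESOURCE_DESC res_desc;
  memset(&res_desc, 0, sizeof(res_desc));
  res_desc.resType = CU_RESOURCE_TYPE_LINEAR;
  res_desc.res.linear.devPtr = buffer;
  res_desc.res.linear.format = CU_AD_FORMAT_FLOAT;
  res_desc.res.linear.numChannels = 1;
  res_desc.res.linear.sizeInBytes = num_floats * sizeof(float);

  CUDA_TEXTURE_DESC tex_desc;
  memset(&tex_desc, 0, sizeof(tex_desc));
  tex_desc.filterMode = CU_TR_FILTER_MODE_POINT;  /* raw data, no filtering */

  CUtexObject tex = 0;
  cuTexObjectCreate(&tex, &res_desc, &tex_desc, NULL);  /* check CUresult in real code */
  return tex;
}
```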

At this point, the patch is set to use not more than 1/2 of the system memory as rendering memory. Since system memory used for CUDA must be pinned, using too much of it can be bad for the overall system performance. An obvious limitation here is that the 1/2 heuristic only works well with a single device, with multiple CUDA devices trying to allocate that much memory, it could run into trouble. That still needs to be addressed, either through a better heuristic or a user parameter. I would also like to eventually extend it to share the pinned memory between GPUs where possible.
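
To make the mechanism concrete, here is a rough sketch of the assumed fallback order, using the driver API's pinned, device-mapped allocations (the flags and structure here are my guess at the shape, not the patch itself):

```cpp
/* Rough sketch of the assumed fallback order (not the actual patch):
 * try VRAM first, then fall back to pinned host memory that is mapped
 * into the device address space, so kernels read it over PCIe. */
#include <cuda.h>

CUresult alloc_vram_or_host(CUdeviceptr *device_pointer,
                            void **host_pointer, size_t size)
{
  *host_pointer = NULL;

  /* First choice: a regular VRAM allocation. */
  CUresult result = cuMemAlloc(device_pointer, size);
  if (result != CUDA_ERROR_OUT_OF_MEMORY)
    return result;

  /* Fallback: pinned host memory, mapped for device access. */
  result = cuMemHostAlloc(host_pointer, size, CU_MEMHOSTALLOC_DEVICEMAP);
  if (result != CUDA_SUCCESS)
    return result;

  /* Device-side alias of the pinned allocation; every kernel access
   * that misses the GPU caches goes over the PCIe bus. */
  return cuMemHostGetDevicePointer(device_pointer, *host_pointer, 0);
}
```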

Author
Member

Changed status to: 'Open'

Author
Member

Added subscriber: @Stefan_Werner

Author
Member
[patch.diff](https://archive.blender.org/developer/F317105/patch.diff)

Added subscriber: @Sergey


That's an interesting patch, but the claim that rendering slowly is better than not rendering at all is not fully convincing without real numbers. For example, why use such GPUs instead of pure CPU rendering?

So can you quantify the penalty and give some comparison of CUDA+host memory vs. CPU rendering?


Added subscriber: @brecht

Author
Member

Well, the answer is a clear "it depends". My desktop machine has a single i5 CPU and a K5000 GPU, so the GPU can take quite a performance penalty before the CPU overtakes it; my laptop has an i7 quad core and a 750M, where the GPU/CPU difference is not as strong. I know K5000 sounds like a monster card, but for this purpose it's more or less a GTX 670 with twice the memory.

Then it depends on the scene, of course, and how important the data that ends up in host memory is. When frequently used data, such as the BVH tree, gets pushed to host memory, things get much slower than if only a couple of small (in screen space) textures get pushed out. Luckily, Cycles allocates the BVH data structures first, so those should end up in VRAM whenever possible.

As one data point, here's the BMW scene from the benchmarks:

- i5-3335P CPU @ 3.10 GHz: **18m:53s**
- K5000, all data in VRAM: **8m:15s**
- K5000, bvh_nodes, bvh_leaf_nodes, prim_type, prim_object, object_node, tri_storage, prim_visibility (that is, all data used for ray traversal) in VRAM, rest (= all shading) in host memory: **9m:31s**
- K5000, only bvh_nodes and bvh_leaf_nodes in VRAM: **10m:58s**
- K5000, all data in host memory: **14m:49s**

So for my machine, with that particular scene, the GPU wins over the CPU in all scenarios. Surely this will be different for other scenes or other machines.

(BTW, this is the only scene I have benchmarked so far - it was not cherry-picked.)
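
To illustrate the point about allocation order above: under a simple first-fit policy, whatever is allocated first claims VRAM, so allocating traversal data before shading data keeps the hot path resident on the device. A hypothetical sketch (invented names and structure, not Cycles code):

```cpp
/* Hypothetical illustration (not Cycles code): with a first-fit spill
 * policy, buffers allocated first claim VRAM, so allocating traversal
 * data (BVH) before shading data keeps the hot path on the device. */
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Placement {
  std::string name;
  bool in_vram;
};

std::vector<Placement> place_first_fit(
    const std::vector<std::pair<std::string, size_t>> &buffers_in_alloc_order,
    size_t vram_budget)
{
  std::vector<Placement> placements;
  size_t vram_used = 0;
  for (const auto &buf : buffers_in_alloc_order) {
    const bool fits = vram_used + buf.second <= vram_budget;
    if (fits)
      vram_used += buf.second;
    /* Anything that does not fit spills to pinned host memory. */
    placements.push_back({buf.first, fits});
  }
  return placements;
}
```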


Added subscriber: @ThomasDinges


Added subscriber: @YAFU


My test, in case it's useful. I'm not sure whether this applies to Maxwell cards too, for example. Anyway, here is my test:

System:
Kubuntu Linux 14.04 64bits
GTX 960 4GB
CPU: i7-3770
RAM: 16 GB (Running at 1600 MHz if I remember correctly)
Scene:
[scene_CUDA_vRAM.7z](https://archive.blender.org/developer/F317139/scene_CUDA_vRAM.7z)

**Only layers 1, 2, 3 enabled in the scene** (just to get an idea of the CPU/GPU times)
Buildbot 2.77.1, hash 049f715:
GPU: 00:31.49 (480x270)
CPU: 00:24.75 (32x32)

**Layers 1, 2, 3, 4, 5 enabled in the scene**
Buildbot 2.77.1, hash 049f715:
GPU: CUDA error: Out of memory in cuMemAlloc(&device_pointer, size)

CPU (32x32)
RAM total System: 8.4 GiB
Mem/Peak: 3785.27M

Time: 00:42.52

Patched Blender, hash 424f41a:
GPU (480x270)
RAM total System: 9.7 GiB
Blender vRAM: 2830 MiB
Total System vRAM: 3237 MiB

Time: 00:48.82


Added subscriber: @KINjO


That is actually interesting timing; I thought the penalty would be much higher.

Author
Member

PCIe bandwidth seems to be the main factor here. When I move the K5000 to a slower slot (x4 instead of x16), performance drops dramatically. I haven't had the patience to let it run all the way through, but the BMW scene with all data in host memory is still at the first tile after 4 minutes, showing a remaining estimate of over one hour.


Added subscriber: @mib2berlin


Hi, I checked only the patched Blender eaf894d with YAFU's scene_CUDA_vRAM.7z.
I think my mainboard switches to x8 for dual GPU.

CPU 00:48.76

Dual GPU 00:35.86

Opensuse Leap 42.1 x86_64
Intel i5 3570K
RAM 16 GB
GTX 760 4 GB / display card
GTX 670 2 GB
Driver 361.42

The GTX 670 2 GB uses 1.4 GB during rendering.
I also use another CUDA engine that goes out of core only for textures.
With that enabled, I get about -10% performance compared to pure VRAM.


I haven't looked at the patch yet, but I think it makes sense to support this in some way.

How does this work exactly: does the GPU access host memory on every L1/L2 cache miss, or does this pinned memory get cached in VRAM too, with data only loaded once when there is enough space in VRAM? In any case, at some point this should be tested with a scene that is actually bigger than VRAM.

Author
Member

I can get the Victor scene to start rendering on the GPU, but it fails with a kernel timeout after about 20 tiles (using the default small tile size). Maybe someone with TDR turned off can try benchmarking it?

I have plenty of scenes in Poser to test with (100+ textures at 4K resolution), and those seem to perform fine, though I haven't had time to run fully timed benchmarks against the CPU yet.

@brecht: CUDA doesn't do any paging for us; what is in system RAM remains in system RAM, and only the L2 and L1 caches help speed it up. Automatic paging will hopefully come with CUDA 8 and Pascal.
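
In other words, with zero-copy mapped memory the kernel simply dereferences a device pointer whose backing storage is pinned host RAM; nothing is staged into VRAM. A toy sketch (illustration only, not a Cycles kernel):

```cpp
/* Toy sketch of zero-copy access (illustration only): the kernel
 * dereferences a pointer backed by pinned host RAM, so each read that
 * misses the GPU caches crosses the PCIe bus. */
__global__ void sum_host_resident(const float *mapped_host_data,
                                  float *out, int n)
{
  float total = 0.0f;
  for (int i = 0; i < n; i++)
    total += mapped_host_data[i];  /* served from host RAM via PCIe */
  *out = total;
}
```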


With the Cosmos Laundromat default scene (for the CPU benchmark) I still get a CUDA out-of-memory error. I guess I need more than 16 GB of RAM for the patch to work with this scene?

Anyway, if I remove almost all the grass particles from the ground plane, rendering starts, but then I get "CUDA error: Launch exceeded timeout in cuCtxSynchronize()". If I reduce the tile size, the error takes longer to appear, but it always shows up eventually. The thing is that I'm on Linux; TDR is only for Windows, right?

Author
Member

The code for determining the amount of physical system memory is not implemented yet for Linux - I am about to create a Linux VM to implement and test that part. So it is possible that the current patch is not working as intended on Linux systems.

Author
Member

I have attached a patch that should hopefully take care of the Linux memory query. Apply it on top of the first patch.
[linuxmem.diff](https://archive.blender.org/developer/F317222/linuxmem.diff)
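
For reference, the straightforward way to query physical memory on Linux is via `sysconf`; a minimal sketch (the attached diff may well do it differently, e.g. via `sysinfo()`):

```cpp
/* Minimal sketch of a Linux physical-memory query via sysconf
 * (the attached diff may use a different mechanism). */
#include <cstddef>
#include <unistd.h>

size_t system_physical_ram()
{
  long pages = sysconf(_SC_PHYS_PAGES);
  long page_size = sysconf(_SC_PAGE_SIZE);
  if (pages < 0 || page_size < 0)
    return 0; /* Query failed; caller should fall back to a safe default. */
  return size_t(pages) * size_t(page_size);
}
```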


I think it would make sense to move this to the differential system. Easier to review and test. :)

Author
Member
OK, here you are: https://developer.blender.org/D2056
Member

Added subscriber: @Blendify

Member

Changed status from 'Open' to: 'Resolved'

Aaron Carlisle self-assigned this 2016-06-20 03:09:31 +02:00
Member

Discussion can continue at [D2056](https://archive.blender.org/developer/D2056)


Added subscriber: @SteffenD


Added subscriber: @Leso_KN


This comment was removed by @Leso_KN


Added subscriber: @dean-4


I'm interested in downloading this patch because it will allow some overhead when rendering large scenes, but I'm not sure how to download it because I have never used GitHub before. If anyone could show me how, it would help a lot. Thanks.


You can download a daily build from here; it will have the functionality included:
https://builder.blender.org/download


Thank you, the build worked and even allowed me to render the Gooseberry benchmark on my 3 GB 780 Ti. Unfortunately the tiles, which I set to 256 x 256, seemed to render slower than I expected, but that could be because I don't have a lot of system RAM (only 8 GB), and I think it's possible Blender may be paging my system, although I'm not sure.


Slower rendering is quite possible when using this, from the [release notes](https://wiki.blender.org/index.php/Dev:Ref/Release_Notes/2.80/Cycles):

> CUDA rendering now supports rendering scenes that don't fit in GPU memory, but can be kept in CPU memory. This feature is automatic but comes at a performance cost that depends on the scene and hardware. When image textures do not fit in GPU memory, we have measured slowdowns of 20-30% in our benchmark scenes. When other scene data does not fit on the GPU either, rendering can be a lot slower, to the point that it is better to render on the CPU.


Added subscriber: @Lincoln


Added subscriber: @Versluis


Hi Team, this patch works great on single GPU systems, but seems to throw an error on dual GPU systems. If I disable one of my GPUs, the scene renders fine, but with two GPUs I get the error "Invalid value in cuMemcpy2DUnaligned(&param)". Happy to submit this as a bug report with the test scene if you point me in the right direction.

Here's a StackExchange thread from a different user with the same problem: https://blender.stackexchange.com/questions/211696/cuda-problem-error-invalid-value-in-cumemcpy2dunalignedparam/215948
