Allow CUDA GPU rendering to use host memory
Closed, Resolved, Public


This patch will allow CUDA devices to use system memory in addition to VRAM. While this is obviously slower than VRAM, I think it is still better than not being able to render at all.

One related change rolled into this patch is that devices with compute capability >= 3.0 will now fetch kernel data through textures again instead of global arrays. This improves performance on Kepler cards, which don't use L1 caching on global loads, and the effect is even more apparent when the global data is in host memory instead of VRAM. Going through texture objects lets the kernel use L1 caching without running into the 4GB memory limit Cycles had when it was still using texture references on Kepler.

At this point, the patch is set to use no more than 1/2 of the system memory as rendering memory. Since system memory used for CUDA must be pinned, using too much of it can hurt overall system performance. An obvious limitation is that the 1/2 heuristic only works well with a single device; with multiple CUDA devices each trying to allocate that much memory, it could run into trouble. That still needs to be addressed, either through a better heuristic or a user parameter. I would also like to eventually extend it to share the pinned memory between GPUs where possible.
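One conceivable way to keep the 1/2 cap safe with several devices is to divide it by the device count. The sketch below only illustrates that idea and is not what the patch does; `host_memory_budget` and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not taken from the patch.
// system_ram_bytes: total physical RAM in bytes.
// num_devices:      number of CUDA devices rendering at the same time.
// Returns how many bytes of pinned host memory a single device may use.
inline std::uint64_t host_memory_budget(std::uint64_t system_ram_bytes, int num_devices)
{
    assert(num_devices > 0);
    // From the patch description: pin at most half of system RAM.
    const std::uint64_t total_cap = system_ram_bytes / 2;
    // Assumption: split the cap evenly, so that all devices together
    // never pin more than half of system RAM.
    return total_cap / static_cast<std::uint64_t>(num_devices);
}
```

With 16 GiB of RAM this gives 8 GiB for one device and 4 GiB each for two. An even split is the simplest choice; it would waste budget when the devices could share the same pinned buffers.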


Sergey Sharybin (sergey) triaged this task as "Normal" priority. Jun 14 2016, 11:35 AM

That's an interesting patch, but the claim that being able to render slowly is better than not being able to render at all is not fully convincing without real numbers. For example, why use such GPUs instead of pure CPU rendering?

So can you quantify the penalty and give some comparison of CUDA + host memory vs. CPU rendering?

Well, the answer is a clear "it depends". My desktop machine has a single i5 CPU and a K5000 GPU, so the GPU can take quite a performance penalty before the CPU overtakes it. My laptop has an i7 quad core and a 750M, and there the GPU/CPU difference is not as strong. I know, K5000 sounds like a monster card, but for this purpose, it's more or less a GTX 670 with twice the memory.

Then it depends on the scene of course, and on how important the data that ends up in host memory is. When frequently used data, such as the BVH tree, gets pushed to host memory, things get much slower than if only a couple of small (in screen space) textures get pushed out. Luckily, Cycles allocates the BVH data structures first, so those should end up in VRAM whenever possible.
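The effect of allocation order can be sketched with a toy placement routine: buffers are placed in request order, going to VRAM while they fit and spilling to host memory afterwards. This is an illustrative model under those assumptions, not the patch's code; `Buffer`, `Placement`, `place_buffers` and the sizes used with it are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class Placement { VRAM, Host, Failed };

struct Buffer {
    std::string name;    // e.g. "bvh_nodes"
    std::uint64_t size;  // size in bytes
};

// Places buffers in the order they are requested:
// VRAM while there is room, then host memory, otherwise the allocation fails.
inline std::vector<Placement> place_buffers(const std::vector<Buffer>& buffers,
                                            std::uint64_t vram_free,
                                            std::uint64_t host_budget)
{
    std::vector<Placement> result;
    result.reserve(buffers.size());
    for (const Buffer& b : buffers) {
        if (b.size <= vram_free) {
            vram_free -= b.size;
            result.push_back(Placement::VRAM);
        } else if (b.size <= host_budget) {
            host_budget -= b.size;
            result.push_back(Placement::Host);
        } else {
            result.push_back(Placement::Failed);
        }
    }
    return result;
}
```

Because the loop is first come, first served, whatever is requested early (here the BVH) gets VRAM, and late requests such as textures are the ones that spill.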

As one data point, here's the BMW scene from the benchmarks:
i5-3335P CPU @ 3.10 GHz: 18m:53s
K5000, all data in VRAM: 8m:15s
K5000, bvh_nodes, bvh_leaf_nodes, prim_type, prim_object, object_node, tri_storage, prim_visibility (that is, all data used for ray traversal) in VRAM, rest (= all shading) on host memory: 9m:31s
K5000, only bvh_nodes and bvh_leaf_nodes in VRAM: 10m:58s
K5000, all data in host memory: 14m:49s

So for my machine, with that particular scene, the GPU wins over the CPU in all scenarios. Surely this will be different for other scenes or other machines.
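To put the penalty in relative terms, the timings above can be converted to seconds and divided by the all-in-VRAM time. The helpers below are only a convenience for that arithmetic; `to_seconds` and `slowdown` are hypothetical names.

```cpp
// Converts a "minutes:seconds" render time to seconds.
inline int to_seconds(int minutes, int seconds)
{
    return minutes * 60 + seconds;
}

// Ratio of a render time to a baseline time (1.0 means equal speed).
inline double slowdown(int time_s, int baseline_s)
{
    return static_cast<double>(time_s) / static_cast<double>(baseline_s);
}
```

Relative to the 8m:15s baseline (495 s), this gives roughly 1.15x with all traversal data in VRAM, 1.33x with only the BVH nodes in VRAM, 1.80x with everything in host memory, and 2.29x for the CPU.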

(BTW, this is the only scene I have benchmarked so far, so it is not cherry-picked.)

YAFU (YAFU) added a subscriber: YAFU (YAFU). Edited Jun 14 2016, 4:12 PM

Here is my test, in case it is useful. I'm not sure whether this applies to Maxwell cards too, for example. Anyway, here is my test:

Kubuntu Linux 14.04 64-bit
GTX 960 4GB
CPU: i7-3770
RAM: 16 GB (Running at 1600 MHz if I remember correctly)

Only layers 1, 2 and 3 enabled in the scene (just to get an idea of the CPU/GPU times)
*Buildbot 2.77.1 Hash: 049f715
GPU: 00:31.49 (480x270)
CPU: 00:24.75 (32x32)

Layers 1, 2, 3, 4 and 5 enabled in the scene
*Buildbot 2.77.1 Hash: 049f715
GPU: CUDA error: Out of memory in cuMemAlloc(&device_pointer, size)

CPU (32x32)
RAM total System: 8.4 GiB
Mem/Peak: 3785.27M

Time: 00:42.52

*Blender Patched Hash 424f41a:
GPU (480x270)
RAM total System: 9.7 GiB
Blender vRAM: 2830 MiB
Total System vRAM: 3237 MiB

Time: 00:48.82

That is actually an interesting timing; I thought the penalty would be much higher.

PCIe bandwidth seems to be the main factor here. When I move the K5000 to a slower slot (x4 instead of x16), performance drops dramatically. I haven't had the patience to let it run all the way through, but the BMW scene with all data in host memory is still at the first tile after 4 minutes, showing a remaining estimate of over one hour.

Hi, I only checked patched Blender eaf894d with YAFU's scene_CUDA_vRAM.7z.
I think my mainboard switches to x8 for dual GPU.

CPU 00:48.76

Dual GPU 00:35.86

Opensuse Leap 42.1 x86_64
Intel i5 3570K
GTX 760 4 GB (display card)
GTX 670 2 GB
Driver 361.42

The GTX 670 2 GB uses 1.4 GB during the render.
I use another CUDA engine which uses out-of-core only for textures.
With that enabled I get 10% lower performance compared to pure VRAM.

I haven't looked at the patch yet, but I think it makes sense to support this in some way.

How does this work exactly? Does the GPU access host memory on every L1/L2 cache miss, or does this pinned memory get cached in VRAM too, so that data only gets loaded once when there is enough space in VRAM? In any case, at some point this should be tested with a scene that is actually bigger than VRAM.

I can get the Victor scene to start rendering on the GPU, but it fails with a kernel timeout after about 20 tiles (using the default small tile size). Maybe someone with TDR turned off can try benchmarking it?

I have plenty of scenes in Poser to test with (100+ textures of 4k resolution), and those seem to perform fine, though I haven't had the time to run fully timed benchmarks vs the CPU yet.

Brecht: CUDA doesn't do any paging for us. What is in system RAM remains in system RAM; only the L2 and L1 caches help speed it up. Automatic paging will hopefully come with CUDA 8 and Pascal.

With the default Cosmos Laundromat scene (for CPU benchmark) I still get a CUDA out of memory error. I guess I need more than 16 GB of RAM for the patch to work with this scene?

Anyway, if I remove almost all grass particles from the ground plane, I can start rendering, but then I get "CUDA error: Launch exceeded timeout in cuCtxSynchronize()". If I reduce the tile size the error takes longer to appear, but it always appears eventually. The thing is that I'm on Linux. TDR is only for Windows, right?

The code for determining the amount of physical system memory is not implemented yet for Linux - I am about to create a Linux VM to implement and test that part. So it is possible that the current patch is not working as intended on Linux systems.

I have attached a patch that should hopefully take care of the Linux memory query. Apply this on top of the first patch.
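For reference, one common way to query total physical memory on Linux is `sysconf` with `_SC_PHYS_PAGES` and `_SC_PAGESIZE`. The sketch below shows that approach; whether the attached patch does it this way is an assumption, and `physical_memory_bytes` is a hypothetical name.

```cpp
#include <cstdint>
#include <unistd.h>

// Returns the total physical RAM in bytes, or 0 if the query fails.
// Note: _SC_PHYS_PAGES is a glibc extension rather than strict POSIX.
inline std::uint64_t physical_memory_bytes()
{
    const long pages = sysconf(_SC_PHYS_PAGES);
    const long page_size = sysconf(_SC_PAGESIZE);
    if (pages <= 0 || page_size <= 0) {
        return 0;  // sysconf returns -1 on error
    }
    return static_cast<std::uint64_t>(pages) * static_cast<std::uint64_t>(page_size);
}
```

Returning 0 on failure lets a caller treat "unknown" as "no host memory budget", so the renderer would fall back to VRAM only rather than pin an arbitrary amount.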

I think it would make sense to move this to the differential system. Easier to review and for testing. :)

Aaron Carlisle (Blendify) closed this task as "Resolved". Jun 20 2016, 3:09 AM
Aaron Carlisle (Blendify) claimed this task.

Discussion can continue at D2056