Allow CUDA GPU rendering to use host memory
Needs Review · Public

Authored by Stefan Werner (swerner) on Jun 15 2016, 4:17 PM.

Details

Summary

(This is a duplicate of T48651, moved to the differential system by Thomas' suggestion)

This patch will allow CUDA devices to use system memory in addition to VRAM. While this is obviously slower than VRAM, I think it is still better than not rendering at all.

One related change rolled into this patch is that devices with compute >= 3.0 will now fetch kernel data through textures instead of global arrays again. This improves performance on Kepler cards, which don't use L1 caching on global loads, and this is even more apparent when the global data is in host memory instead of VRAM. Going through texture objects allows it to use L1 caching without running into the 4GB memory limit Cycles had when it was still using texture references on Kepler.
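
To illustrate the texture-object path (a hedged sketch with placeholder names, not the actual Cycles code): a linear allocation, whether in VRAM or in mapped host memory, can be bound to a texture object so that Kepler fetches it through the texture/L1 cache instead of plain global loads.

#include <cuda.h>
#include <string.h>

/* Hedged sketch, not the patch's code: bind a linear device allocation
 * (VRAM or mapped pinned host memory) to a texture object so that
 * Kepler-class cards fetch it through the texture cache. */
static CUtexObject bind_linear_as_texture(CUdeviceptr ptr, size_t size_in_bytes)
{
    CUDA_RESOURCE_DESC res_desc;
    memset(&res_desc, 0, sizeof(res_desc));
    res_desc.resType = CU_RESOURCE_TYPE_LINEAR;
    res_desc.res.linear.devPtr = ptr;
    res_desc.res.linear.format = CU_AD_FORMAT_FLOAT;  /* assuming float4 data */
    res_desc.res.linear.numChannels = 4;
    res_desc.res.linear.sizeInBytes = size_in_bytes;

    CUDA_TEXTURE_DESC tex_desc;
    memset(&tex_desc, 0, sizeof(tex_desc));
    tex_desc.addressMode[0] = CU_TR_ADDRESS_MODE_WRAP;
    tex_desc.filterMode = CU_TR_FILTER_MODE_POINT;

    CUtexObject tex = 0;
    cuTexObjectCreate(&tex, &res_desc, &tex_desc, NULL);
    /* The handle is passed to the kernel, which reads it with
     * tex1Dfetch<float4>() instead of indexing a global array. */
    return tex;
}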

At this point, the patch is set to use no more than half of the system memory as rendering memory. Since system memory used for CUDA must be pinned, using too much of it can be bad for overall system performance. An obvious limitation here is that the one-half heuristic only works well with a single device; with multiple CUDA devices each trying to allocate that much memory, it could run into trouble. That still needs to be addressed, either through a better heuristic or a user parameter. I would also like to eventually extend it to share the pinned memory between GPUs where possible.
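
The core of the idea can be sketched roughly like this (hypothetical names such as mem_alloc_with_host_fallback and map_host_limit, simplified error handling; not the patch's actual implementation). It assumes the context was created with CU_CTX_MAP_HOST:

#include <cuda.h>

/* Hedged sketch of the VRAM-then-host fallback; names and bookkeeping are
 * illustrative only. */
static size_t map_host_used = 0;
static size_t map_host_limit = 0;  /* e.g. half of system_physical_ram() */

static CUresult mem_alloc_with_host_fallback(CUdeviceptr *device_ptr, size_t size)
{
    CUresult result = cuMemAlloc(device_ptr, size);
    if (result != CUDA_ERROR_OUT_OF_MEMORY)
        return result;  /* success, or an unrelated failure */

    /* VRAM is full: fall back to pinned host memory, but stay under the limit. */
    if (map_host_used + size > map_host_limit)
        return CUDA_ERROR_OUT_OF_MEMORY;

    void *host_ptr = NULL;
    result = cuMemAllocHost(&host_ptr, size);
    if (result != CUDA_SUCCESS)
        return result;

    map_host_used += size;
    /* Get a device pointer the kernels can dereference over the PCIe bus. */
    return cuMemHostGetDevicePointer(device_ptr, host_ptr, 0);
}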

Diff Detail

Repository
rB Blender
Stefan Werner (swerner) retitled this revision to Allow CUDA GPU rendering to use host memory. Jun 15 2016, 4:17 PM
Stefan Werner (swerner) updated this object.
Stefan Werner (swerner) set the repository for this revision to rB Blender.

This seems to be solid enough to allow me to launch a render of the Gooseberry benchmark scene on my 4GB GPU with TDR turned off. However, some tiles render extremely slowly (counting seconds per sample instead of samples per second!); I haven't found out yet what crazy things are happening in them.

This seems incomplete? The patch in T48651 contained more changes.

Yes. I'm still trying to figure out how this system and git patches work. It looks like when I try to upload my diff file with multiple commits in it, it takes only the first one.

intern/cycles/device/device_cuda.cpp
717

Actually, this is wrong. This if() block needs to go, as only 3D textures now use the handle variable.

This should now include all changes, squashed into a single commit.

Stefan Werner (swerner) marked an inline comment as done. Jun 15 2016, 7:06 PM

A couple of ideas for improvement:

  • It may be possible to do 3D texture allocation through cuMemAlloc()/cuMemAllocHost() too.
  • I'd love to have a better heuristic for finding out the actual memory requirements of our kernel launch. Memory fragmentation may be important too when estimating the required amount.
  • When multiple CUDA devices are in use, host allocations should be shared between devices.
  • Eventually, one could support Pascal/CUDA 8's virtual memory with small changes to mem_alloc_internal().
  • Some high-end cards (Tesla only?) support peer-to-peer memory access; multi-device renders could then share allocations in VRAM (see the sketch after this list).
  • There could be user preferences for enabling/disabling pinned memory and setting the max amount.
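
For the peer-to-peer idea above, the enabling part could look roughly like this (a hedged sketch, not part of this patch):

#include <cuda.h>

/* Hedged sketch: enable mutual peer access between two devices so a
 * multi-device render could share VRAM allocations. Illustrative only. */
static int enable_peer_access(CUdevice dev_a, CUdevice dev_b,
                              CUcontext ctx_a, CUcontext ctx_b)
{
    int can_access = 0;
    cuDeviceCanAccessPeer(&can_access, dev_a, dev_b);
    if (!can_access)
        return 0;
    /* Make each context's allocations visible to kernels in the other one. */
    cuCtxPushCurrent(ctx_a);
    cuCtxEnablePeerAccess(ctx_b, 0);
    cuCtxPopCurrent(NULL);
    cuCtxPushCurrent(ctx_b);
    cuCtxEnablePeerAccess(ctx_a, 0);
    cuCtxPopCurrent(NULL);
    return 1;
}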

Do we have to pin memory ahead of time and then never change that? Basically, what I mean is: can't we pin more and more memory as our requirements grow (similar to what we do in ccl::array already)?
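
(For what it's worth, pinning an existing allocation after the fact is possible in principle with cuMemHostRegister(); a hedged sketch, not something this patch does:)

#include <cuda.h>

/* Hedged sketch of incremental pinning: page-lock an already-allocated host
 * buffer only once the device actually needs it. Not part of this patch. */
static CUresult pin_existing_buffer(void *host_ptr, size_t size, CUdeviceptr *device_ptr)
{
    CUresult result = cuMemHostRegister(host_ptr, size, CU_MEMHOSTREGISTER_DEVICEMAP);
    if (result != CUDA_SUCCESS)
        return result;
    /* Map the now-pinned range into the device address space. */
    return cuMemHostGetDevicePointer(device_ptr, host_ptr, 0);
}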

Another idea here is to do delayed device allocation. This way we'll know the overall device memory requirement and can calculate how much RAM we'll have to pin.

I'm not a strong fan of user preferences for that. They might be acceptable for local operation on a desktop, but once you have to deal with render farms, needing to tweak each individual node is quite annoying. Not that CUDA render farms are common at the moment, but still. Also, even for local operation, you shouldn't end up in situations where a simple scene uses too much of your RAM, or where you can't render just because you've set a low memory limit. IMO, the way to go here would be to pin RequiredDeviceMemory - VRAM megabytes of RAM.

It only pins the amount of memory required, no more. The user preference would only set an upper bound to that.

Well, it's good that only the required memory is pinned. But then I'm not sure why we need a limit at all. We don't limit CPU memory, and we added code to abort the render as gracefully as we can when an allocation fails with bad_alloc. Why not do the same for host memory with GPU rendering? Basically, allow pinning up to the whole of available RAM, but stop rendering when even that is not enough?

From the CUDA documentation it sounds like pinning all available memory may not be a good idea:

"Allocating excessive amounts of memory with cudaMallocHost() may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data"

Pinning too much will affect overall system performance, not just the rendering process.

For the lowest memory usage and least copying we can make device_vector point to pinned memory directly. This would require some refactoring of device_vector so all memory allocations and frees go through the device, but it's not a huge change I think. If pinned memory is being shared between multiple CUDA devices it seems logical to do it this way.

@Stefan Werner (swerner): I wouldn't worry about that at all. When you have little RAM and try to render a large scene on the CPU, your system memory gets filled too, swapping begins, and the system becomes unresponsive/slow. I don't see why we should treat GPU host allocations any differently. I am not a fan of adding a user pref option either.

It might also make sense to use pinned memory for all device_vector allocations even if we want to copy memory to the GPU. That way we can use async copies which should be faster, particularly for multi GPU.
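
A hedged sketch of what that could look like with the driver API (illustrative names, not existing Cycles code):

#include <cuda.h>

/* Hedged sketch: queue an async host-to-device copy from pinned memory on a
 * per-device stream, so uploads to several GPUs can overlap. */
static CUresult upload_async(CUdeviceptr dst, const void *pinned_src,
                             size_t size, CUstream stream)
{
    /* The copy only overlaps with other work when the source is page-locked;
     * the caller waits with cuStreamSynchronize(stream) before launching
     * kernels that read dst. */
    return cuMemcpyHtoDAsync(dst, pinned_src, size, stream);
}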

For limiting memory, I find that once you start swapping, performance is already unacceptably slow, often to the point that the computer is almost unusable. So it would be interesting to see what happens when you degrade system performance even worse than that :). In theory it would be really bad if all memory except e.g. 1 MB is pinned and everything needs to be swapped into that tiny amount of space, but I wonder if it can really get much worse than typical memory swapping in practice.

If we need to do this then a reasonable heuristic might be to allow pinning total system memory minus 2 GB, to leave enough space for the OS and other applications to run, but it's all quite arbitrary.

Would it be worthwhile to have a user control to specify how much RAM to use on the GPU, because of the GTX 970 3.5GB/4GB issue?

That last 512 MB on the GTX 970 is still faster than using host memory, so I don't think it would be useful.

Different operating systems may handle this differently, but for what it's worth, I ran a simple test on OS X: cuMemAllocHost() in a loop, allocating 200MB chunks, 100 times - which should try to allocate ~20GB in total. My machine has 16GB of physical memory. At just under 10GB allocated, the machine froze completely. That is, not even the mouse pointer would move and the machine no longer responded to network requests. After about one or two minutes of being frozen, the machine rebooted, with the "crash" log pointing to the Watchdog task.

So at least on OS X, allocating excessive pinned memory is different from excessive regular memory and a hard limit on the maximum amount is in my opinion very much preferred over a "one more texture and your render will crash the entire machine" user experience.
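
For reference, the test amounted to roughly the following (a reconstructed sketch, not the exact code used; run it with care, since it froze the machine here):

#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&dev, 0) != CUDA_SUCCESS ||
        cuCtxCreate(&ctx, 0, dev) != CUDA_SUCCESS) {
        fprintf(stderr, "CUDA init failed\n");
        return 1;
    }
    /* Allocate 100 chunks of 200MB pinned memory, ~20GB in total. */
    const size_t chunk = (size_t)200 * 1024 * 1024;
    for (int i = 0; i < 100; i++) {
        void *p = NULL;
        if (cuMemAllocHost(&p, chunk) != CUDA_SUCCESS) {
            printf("cuMemAllocHost failed after %d chunks\n", i);
            break;
        }
        printf("Pinned %d MB so far\n", (i + 1) * 200);
    }
    cuCtxDestroy(ctx);
    return 0;
}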

It might also make sense to use pinned memory for all device_vector allocations even if we want to copy memory to the GPU. That way we can use async copies which should be faster, particularly for multi GPU.

Yes, that would be an improvement that we can split out into a separate follow-up task. One must take care of pitch and alignment though; the CPU side would need to be aware of the memory layout for textures.

@Stefan Werner (swerner), that's a bummer it freezes so completely, but here are some points:

  • As you've mentioned, it might depend on the OS/driver and even their version. I don't think we should commit ourselves to maintaining all those per-platform heuristics; they never work reliably enough anyway. We've got that happening for AMD OpenCL, and it's really rather pathetic. A heuristic created for one driver does not work for another one, which makes me almost think of dropping all of them and demanding that companies deliver proper drivers. We are too small to keep adding workarounds for each new driver.
  • Drivers might just be buggy. They definitely shouldn't crash the entire OS kernel, and if they do, it is totally worth submitting a report to NVidia IMO.
  • For any setting you need a default value. Now, what is that value in this case? Is it the same for all platforms? Is it something the user would need to set up themselves (which kind of defeats the whole idea of avoiding user frustration, because they will first run into the issue before discovering that option)?
  • Is this limit dictated by memory size after all, or by bus throughput and the power of the GPU? As in, imagine you've increased your machine's memory to 32 GB, keeping everything else the same. Will the system freeze at 20 GB now? Or, because of some "heavy" load, will the freeze still happen at 10 GB?

Also, is this a known issue that everyone is "solving" by adding hard limits on host memory?

Octane exposes what they call "out-of-core" textures, which I assume are textures in pinned memory, as a feature that the user has to enable, picking how much memory it is supposed to use: https://docs.otoy.com/Standalone_2_0/?page_id=3216

In Poser we ship our version of Cycles with a hardcoded 1/2 physical limit.

Stefan Werner, do you have your OS's virtual memory located on a 7200 rpm disk?
Sorry if I misunderstood, but your system freeze reminds me of the ones I get on Linux when the system begins to fill swap intensively on a slow disk (such as a 7200 rpm one). That always happens on Linux when swap fills up intensively; swap management is not good on Linux (you can Google how many people try to avoid using swap by configuring swappiness).
What I mean is that if the freeze on your OS was due to intensive use of swap, then maybe the problem would not occur on a system with more RAM.

Edit: sorry, apparently I subscribed the wrong person when I tried to quote Stefan Werner. I have to learn how to do this properly.

There is no swapping with pinned memory. That's the whole point about pinned memory - it always remains fixed in physical memory, at the same address, no matter how severe the OS' memory pressure is. That said, it was an SSD.

My computer just finished rendering the Victor/Gooseberry benchmark scene on two GPUs, coming in at about 1h:36m (conservative tile size of 64x64). So I think my patch is pretty solid.

I did have to enable SHADOW_RECORD_ALL for CUDA with an intersection buffer size of 10 to prevent timeouts, though - the incremental stepping through transparent shadows absolutely kills GPU performance. I may try to tackle that next.

Update: Bumped the tile size to 128x128, now a single GPU (K5000, 4GB) renders this scene in 1h:53m. This beats my CPU which takes 3h:15m. MemPeak is at 6681MB, so even with PCI bus transfers, GPU rendering can be faster than CPU rendering (if you have a CPU as weak as mine, I guess...).

@Stefan Werner (swerner), that's not really comparing apples to apples. Also, we can't that easily enable __SHADOW_RECORD_ALL__ on the GPU because:

  1. With transparent bounces larger than 64 it'll use malloc(), which isn't fast (read as: really slow).
  2. It increases stack usage by 1.5KB per thread (which is 5% of the overall current stack usage, as far as I remember).

That being said, some code feedback.

Also, it seems that dealing with multi-device is not in there yet?

And I'm still not sure that having a memory limit setting is indeed the way to go; there are various ways it'll fail on various levels:

  • It's yet another magic setting which users would have to adjust based on (a) the particular scene and (b) the particular system, and there's no good default value.
  • It is based on physical memory, not free memory. Meaning, in cases where Blender is not the only running application (or there are multiple Blender instances running), the current "safety" approach will lead to machine death yet again.
  • The documentation doesn't mention any issues with using more than N% of host memory, and I would expect everything to be fine as long as there's enough free RAM. IMO it is totally worth checking whether the computer hangs you're experiencing are expected behavior or a bug of some sort; at least from reading the documentation it's unclear why pinning memory while keeping enough memory for the running processes would cause any trouble.

One more thing: the documentation mentions that the context should be created with the CU_CTX_MAP_HOST flag, which I don't see in this patch.
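
For completeness, that amounts to roughly this at context creation time (a minimal sketch):

#include <cuda.h>

/* Hedged sketch: without CU_CTX_MAP_HOST on the context,
 * cuMemHostGetDevicePointer() cannot map pinned host memory for the device. */
static CUresult create_mapping_context(CUcontext *context, CUdevice device)
{
    return cuCtxCreate(context, CU_CTX_MAP_HOST | CU_CTX_SCHED_AUTO, device);
}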

intern/cycles/device/device_cuda.cpp
263

Cycles' code style is to NOT have a space between the keyword and the parenthesis, so it should be if(...). The same applies to the rest of the patch.

Also, I'm not sure it's a robust way of dealing with VRAM limits. What if you have a second Blender instance running in the background?

282

In Cycles, else goes on the next line. Also, please always use braces. The same applies to the rest of the changes.

448

This seems arbitrary and does not belong in the CUDA device. Such a fallback is more the job of system_physical_ram() IMO.

453

Sentences start with a capital letter. Also, a * goes at the beginning of each line of a multi-line comment.

457

I'm wondering about the following things:

  • Do those register numbers differ from what is passed to CUDA_LAUNCH_BOUNDS()?
  • What's the size of the unused memory for a regular path trace? (AFAIR, it uses fewer registers, meaning we'll have quite some unused VRAM.)
482

That's quite weird and hacky, and it reduces the amount of allowed VRAM once again. Why exactly is it needed?

I did a quick test pinning memory using the OS API (namely mlock, since I'm on Linux). I had 3+ GB used by running processes and successfully pinned 10 GB out of 12 GB of RAM without any noticeable regression. Pinning 11 GB of RAM caused some hiccups in a running YouTube video, but that's kind of expected. In any case, the system survived this just fine.

I wouldn't expect the CUDA API to do anything much more special than using the kernel's calls to pin pages anyway.

Here's the test code I've used:

#include <stdlib.h>
#include <stdio.h>

#include <sys/mman.h>

#define TEST_LOCK

int main(void)
{
    /* Allocate 10 GB, optionally pin it with mlock(), then touch every byte. */
    const size_t N = (size_t)10 * 1024 * 1024 * 1024;
    char *mem = malloc(sizeof(char) * N);
    if (mem == NULL) {
        fprintf(stderr, "Error allocating memory, aborting!\n");
        return EXIT_FAILURE;
    }
    printf("Memory allocated\n");
#ifdef TEST_LOCK
    printf("Pinning memory...");
    fflush(stdout);
    if (mlock(mem, sizeof(char) * N) == -1) {
        printf(" fail!\n");
        perror("mlock");
        fprintf(stderr, "Error pinning memory, aborting!\n");
        return EXIT_FAILURE;
    }
    printf(" ok.\n");
#endif
    size_t i;
    printf("Filling in memory..");
    fflush(stdout);
    for (i = 0; i < N; ++i) {
        mem[i] = i % 255;
    }
    printf(" ok.\n");
    printf("Hit enter to continue.\n");
    getchar();
#ifdef TEST_LOCK
    printf("Unpinning memory...");
    fflush(stdout);
    munlock(mem, sizeof(char) * N);
    printf(" ok.\n");
#endif
    printf("Freeing memory...");
    fflush(stdout);
    free(mem);
    printf(" ok.\n");
    printf("All done, exit!\n");
    return EXIT_SUCCESS;
}

NOTE: If you're going to run it, make sure your limits are high enough; check ulimit -l for that.

The change to SHADOW_RECORD_ALL wasn't meant as a permanent solution, but to compare how GPU performance would be in the largest Blender scene available to me (I will happily render any other scenes). When SHADOW_RECORD_ALL is different for CPU and GPU, we're not tracing the same number of rays, and benchmarks with high levels of transparent shadows will always perform better on the CPU.

If you are in favour of not having limits on pinned memory, sure, I can change it to that. Your sample code allocating pinned memory worked without hiccups on my OS X machine (14 GB out of 16 GB just fine), so I'm not sure if the CUDA Runtime does something additional there or if my machine had a bad day when I tested it last time.

I will add the CU_CTX_MAP_HOST flag, thanks for pointing that out.

intern/cycles/device/device_cuda.cpp
457

They should be the same as in CUDA_LAUNCH_BOUNDS(). By querying from the loaded kernel, we won't need to recompile device_cuda.cpp if we change the kernel launch bounds (good for on the fly compilation).
This may leave unused memory on a regular path trace run, but I don't think there is any reasonable way of safely maxing out every last byte of VRAM. Either we'll err on the safe side and potentially leave unused VRAM, or we'll have renders aborted due to running out of VRAM.
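
A hedged sketch of the kind of query meant here (driver API function attributes; the headroom formula is only an illustration, not the patch's exact heuristic):

#include <cuda.h>

/* Hedged sketch: read register/local-memory usage from the loaded kernel and
 * derive a conservative VRAM headroom. The resident-thread count depends on
 * occupancy, so this deliberately errs on the safe side. */
static size_t estimate_launch_headroom(CUfunction kernel, int num_multiprocessors,
                                       int max_threads_per_sm)
{
    int num_registers = 0, local_bytes_per_thread = 0, max_threads_per_block = 0;
    cuFuncGetAttribute(&num_registers, CU_FUNC_ATTRIBUTE_NUM_REGS, kernel);
    cuFuncGetAttribute(&local_bytes_per_thread, CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES, kernel);
    cuFuncGetAttribute(&max_threads_per_block, CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, kernel);

    (void)num_registers;         /* registers live on-chip, queried for logging */
    (void)max_threads_per_block; /* should match CUDA_LAUNCH_BOUNDS() */

    /* Local (per-thread stack) memory is backed by device memory for every
     * thread that can be resident at once. */
    return (size_t)num_multiprocessors * (size_t)max_threads_per_sm *
           (size_t)local_bytes_per_thread;
}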

482

With less memory, a kernel launch would throw out-of-memory errors. I can't find any documentation about what the actual memory is that the CUDA Runtime requires for a kernel launch, but this patch needs some kind of heuristic. cuMemGetInfo() also doesn't tell us whether the available memory is contiguous or fragmented, and I don't know whether a kernel launch requires contiguous memory for reserving the stack/local memory.

Looks like cuMemAllocHost() is not the same as malloc() followed by mlock(). While malloc/mlock easily allows me to get 14 out of 16GB on my machine, cuMemAllocHost() freezes the entire machine at 12GB.


This is an update of the patch against the latest master. It still needs to be changed to share host memory between GPUs where possible instead of creating duplicate allocations.

Another update to stay in sync with master.

Great idea for a patch! What Blender version does it currently apply to?
It doesn't seem to work for me on 2.79, 2.77, or the 2.80 release candidate.

(Host OS: Linux Ubuntu 16.04.3 64bit)

Updated to latest master, some notes:

  • The latest code refactoring in master should make it easier to do a single pinned memory allocation for the host and all GPU devices. I did a quick test with Pascal unified memory (P545) that works like this, but the Windows driver does not support it so it's not a real replacement for pinned memory.
  • For tile buffers and split kernel memory we should not use pinned memory, but since they are allocated last they are currently the most likely to end up there. In general it would be good to have a priority for which memory goes where; for example, image textures should be more likely to be kept in host memory than texture_info. The implementation might be tricky though.
  • I started using structs in kernel_textures.h, and was hoping to do that more and get rid of the quite cryptic texture packing/unpacking. But if we need to keep textures for the Kepler L1 cache then it seems that will have to wait.

Would be great to go over this patch at bconf and figure out the way forward.

Brecht Van Lommel (brecht) updated this revision to Diff 9507. Edited · Sat, Nov 4, 2:58 AM

Rebase patch, part of the refactoring in this patch was committed to master.

Switching to texture objects for all memory is giving me a ~7% slowdown on benchmarks with a Titan Xp. Solution would be to either do this change only for Kepler, or to only support image textures in host memory on Kepler.

Side note: predicting the amount of headroom needed by tiles is quite difficult for denoising, since we keep a bunch of them around, and how many depends on other devices and tile order. Also resizing the viewport render can be problematic.

I also tried a bit to get 3D textures allocated with host memory, but I can't find any way to do it with the CUDA API.

  • Reserve local memory ahead of scene data allocation, as in rB5475314f4955: Cycles: reserve CUDA local memory ahead of time. By launching the kernel with zero work and CU_CTX_LMEM_RESIZE_TO_MAX, we can reserve the exact amount required without over/underestimation (see the sketch after this list).
  • Sharing of host memory between multiple GPU devices and the CPU device. The implementation of this is rather weak; pitch alignment requirements in CUDA mean we can't do it for some small or non-power-of-two textures.
  • Logging now shows which buffers got allocated in device or host memory, for easier debugging. There is also some commented out code to fill up GPU memory, for stress testing.
  • Allow using more pinned memory when the system has > 8 GB of RAM; in that case we always leave 4 GB free for the OS and other software, instead of half.
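
The local memory reservation mentioned in the first item amounts to roughly this (a hedged sketch; the actual launch configuration and arguments in rB5475314f4955 may differ):

#include <cuda.h>

/* Hedged sketch: with CU_CTX_LMEM_RESIZE_TO_MAX set on the context, a single
 * launch with the kernel's normal arguments but zero work forces the driver to
 * size the local memory pool for the worst case, and it stays reserved for
 * later real launches. */
static CUresult reserve_local_memory(CUfunction kernel, CUstream stream, void **kernel_args)
{
    CUresult result = cuLaunchKernel(kernel,
                                     1, 1, 1,  /* grid: one block */
                                     1, 1, 1,  /* block: one thread */
                                     0,        /* no shared memory */
                                     stream, kernel_args, NULL);
    if (result != CUDA_SUCCESS)
        return result;
    return cuStreamSynchronize(stream);
}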

The main remaining issue is that rendering with denoising slows down massively (30x) when we allocate tile and denoising buffers in host memory. The amount of memory required for that is difficult to predict in advance, though we could try. With a multi-device render, in fact, the tile buffer memory is not well bounded at all if a faster device needs to wait a long time for a slower device to finish a tile. Dynamically migrating device memory to host memory as needed seems like a more robust solution, but it is also tricky to implement with multi-device rendering.

Without denoising it seems pretty robust so far, though the amount of tile memory needed is not estimated accurately yet. For image textures on the host, the performance loss is about ~20-30% here. But when allocating e.g. BVH memory on the host, render time goes up 10x or 20x, which isn't really useful either; might as well render everything on the CPU then. Perhaps we should only allocate image textures in host memory? It's also unfortunate that 3D textures can't be in host memory. We could work around it by storing them as 2D textures and doing some manual interpolation, but I would prefer to avoid that complexity.

Optimizing the Kepler case will be moved to another diff. If we only put image textures in host memory we won't need any changes there though.

Thanks for your work, this is looking good! Especially the addition for sharing memory with CPU devices is good.