Cycles: reduced memory usage of subsurface scattering
Abandoned, Public

Authored by Brecht Van Lommel (brecht) on Nov 18 2015, 7:11 PM.

Details

Summary

This patch reduces memory usage of the experimental kernel on my CUDA card by several hundred MB. Instead of keeping an array of ShaderData around for subsurface scattering, it only keeps an array of Intersections and creates the ShaderData one by one as it processes the subsurface ray hits.
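To illustrate the idea in the summary, here is a minimal sketch. It is not the actual Cycles code: the struct layouts, `MAX_HITS`, `shader_setup_from_hit` and `shade` are hypothetical stand-ins, and the padded `ShaderData` only mimics the roughly 5kB size mentioned later in this thread.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; not the real Cycles definitions.
struct Intersection {
    float t, u, v;          // hit distance and barycentric coordinates
    int prim, object, type; // identifiers of what was hit
};

struct ShaderData {
    float P[3];          // shading position
    float Ng[3];         // geometric normal
    float payload[1244]; // padding so sizeof is about 5 kB (assumed size)
};

constexpr int MAX_HITS = 4; // assumed cap on subsurface ray hits

// Hypothetical setup: a real kernel would rebuild the full shading state here.
inline void shader_setup_from_hit(ShaderData &sd, const Intersection &isect)
{
    sd.P[0] = isect.t;
}

// Hypothetical shading step that consumes the state built above.
inline float shade(const ShaderData &sd)
{
    return sd.P[0];
}

// Before: every hit gets its own ShaderData, all alive on the stack at once.
float scatter_with_shader_data_array(const Intersection *hits, int num_hits)
{
    assert(num_hits <= MAX_HITS);
    ShaderData sd[MAX_HITS];
    for (int i = 0; i < num_hits; i++)
        shader_setup_from_hit(sd[i], hits[i]);
    float sum = 0.0f;
    for (int i = 0; i < num_hits; i++)
        sum += shade(sd[i]);
    return sum;
}

// After: only the compact hits are kept; one ShaderData is reused per hit.
float scatter_one_hit_at_a_time(const Intersection *hits, int num_hits)
{
    assert(num_hits <= MAX_HITS);
    ShaderData sd;
    float sum = 0.0f;
    for (int i = 0; i < num_hits; i++) {
        shader_setup_from_hit(sd, hits[i]);
        sum += shade(sd);
    }
    return sum;
}
```

Both functions return the same result for the same hits. With the assumed `MAX_HITS` of 4, the second one keeps three fewer ShaderData instances alive at a time, at the cost of running the setup inside the processing loop.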

Diff Detail

Repository
rB Blender
Stefan Werner (swerner) retitled this revision to "Cycles: reduced memory usage of subsurface scattering". Nov 18 2015, 7:11 PM
Stefan Werner (swerner) updated this object.
Stefan Werner (swerner) set the repository for this revision to rB Blender.

Great, what percentage memory reduction is this? Memory usage depends on the number of CUDA cores, so several hundred MB could mean anywhere from 10% to 50% or so.

We should find a way to avoid the code duplication though, while still avoiding the ShaderData memory usage.

I'm away from my dev machine right now, but I think that sizeof(ShaderData) is somewhere in the range of 20kB-40kB. The patch should save three instances of ShaderData, so it's reducing local memory usage in the ballpark of 80kB per thread. Don't quote me on those exact numbers though, I can look it up in more detail tomorrow.

Percentage-wise, I think it was about 40%.

Never mind, I have to dampen expectations: sizeof(ShaderData) is ~5kB, so the amount of memory saved is about 14kB/thread.

Here's the output of nvcc (sm_30) before the patch:

ptxas info    : Function properties for _Z30kernel_path_subsurface_scatterP13KernelGlobalsP10ShaderDataP12PathRadianceP9PathStatePjP3RayP6float382
    25256 bytes stack frame, 2596 bytes spill stores, 2520 bytes spill loads
[..]
ptxas info    : Function properties for _Z39kernel_branched_path_subsurface_scatterP13KernelGlobalsP10ShaderDataP12PathRadianceP9PathStatePjP3Ray6float381
    25032 bytes stack frame, 1212 bytes spill stores, 2240 bytes spill loads

and after the patch:

ptxas info    : Function properties for _Z30kernel_path_subsurface_scatterP13KernelGlobalsP10ShaderDataP12PathRadianceP9PathStatePjP3RayP6float382
    11288 bytes stack frame, 2184 bytes spill stores, 2208 bytes spill loads
[..]
ptxas info    : Function properties for _Z39kernel_branched_path_subsurface_scatterP13KernelGlobalsP10ShaderDataP12PathRadianceP9PathStatePjP3Ray6float381
    11112 bytes stack frame, 1136 bytes spill stores, 2096 bytes spill loads

Querying CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES (via cuFuncGetAttribute) on a machine with SM 5.2 returns 89024 bytes before the patch and 75056 bytes after it. That is 13968 bytes, or about 15.7%, less local memory per kernel thread.
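For anyone who wants to redo the arithmetic, here is a small sketch that derives the savings from the figures quoted in this thread. The helper names `bytes_saved`, `percent_saved` and `print_savings` are made up for illustration; only the byte counts come from the ptxas and cuFuncGetAttribute output above.

```cpp
#include <cstdio>

// Absolute reduction in bytes between two measurements.
constexpr long bytes_saved(long before, long after)
{
    return before - after;
}

// Relative reduction, as a percentage of the "before" value.
constexpr double percent_saved(long before, long after)
{
    return 100.0 * static_cast<double>(before - after) / static_cast<double>(before);
}

// Prints the savings for the three before/after pairs quoted in this thread.
void print_savings()
{
    // ptxas stack frame sizes (sm_30)
    std::printf("path kernel stack frame:     %ld bytes saved\n",
                bytes_saved(25256, 11288));
    std::printf("branched path stack frame:   %ld bytes saved\n",
                bytes_saved(25032, 11112));
    // CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES (SM 5.2)
    std::printf("local size per thread:       %ld bytes saved (%.1f%%)\n",
                bytes_saved(89024, 75056), percent_saved(89024, 75056));
}
```

The stack frame of the path kernel and the reported local size shrink by the same number of bytes, which is consistent with the estimate of about 14kB per thread.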

The numbers I remembered initially were for our branch of the kernel, which has a few things turned off, so the percentage saved is higher there.

Sergey Sharybin (sergey) requested changes to this revision. Nov 19 2015, 1:47 PM

This is a nice memory saving indeed. Does it have any effect on render time?

And I agree there should be a way to avoid the code duplication, for example by moving some parts into utility functions.

This revision now requires changes to proceed. Nov 19 2015, 1:47 PM

Sorry for not responding earlier, there were a number of other things to work on. Sergey, is this still relevant or do your latest changes to SSS take care of this?

@Stefan Werner (swerner), yes, my approach was very similar. Basically, it stores a simplified version of ShaderData which only has the intersection point, the geometric normal and such. On top of that, some additional tweaks were done to reduce memory usage even further.

There's still some room for improvement, but we don't have an array of ShaderData on the stack now.

I'll close this revision then, thanks for the inspiration.