This patch reduces memory usage of the experimental kernel on my CUDA card by several hundred MB. Instead of keeping an array of ShaderData around for subsurface scattering, it only keeps an array of Intersections and creates the ShaderData one by one as it processes the subsurface ray hits.
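To illustrate the idea, here is a minimal sketch rather than the actual kernel code. The structs, their sizes, `MAX_HITS`, `setup_from_hit` and `shade` are all hypothetical stand-ins invented for this example. It contrasts keeping one heavy record per hit on the stack with keeping only the compact hits and rebuilding a single heavy record per iteration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; the real kernel structs differ and are larger.
struct Intersection {
    float t, u, v;
    int prim;
    int object;
};

struct ShaderData {
    float P[3];
    float Ng[3];
    unsigned char closure_storage[4096];  // arbitrary padding to mimic a heavy struct
};

static_assert(sizeof(ShaderData) > sizeof(Intersection),
              "the sketch only makes sense if ShaderData is the heavy struct");

constexpr std::size_t MAX_HITS = 4;  // assumed upper bound on subsurface hits

// Hypothetical setup step: expands a compact hit into the full shading record.
inline void setup_from_hit(ShaderData &sd, const Intersection &isect) {
    sd.P[0] = isect.u;
    sd.P[1] = isect.v;
    sd.P[2] = isect.t;
    sd.Ng[0] = 0.0f;
    sd.Ng[1] = 0.0f;
    sd.Ng[2] = 1.0f;
}

// Stand-in for the per-hit shading work.
inline float shade(const ShaderData &sd) { return sd.P[2]; }

// Before: MAX_HITS ShaderData instances are live on the stack at once.
inline float scatter_array(const std::array<Intersection, MAX_HITS> &hits,
                           std::size_t num_hits) {
    assert(num_hits <= MAX_HITS);
    ShaderData sd[MAX_HITS];
    for (std::size_t i = 0; i < num_hits; ++i) {
        setup_from_hit(sd[i], hits[i]);
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < num_hits; ++i) {
        sum += shade(sd[i]);
    }
    return sum;
}

// After: only the compact hits are stored; one ShaderData is reused per hit.
inline float scatter_one_by_one(const std::array<Intersection, MAX_HITS> &hits,
                                std::size_t num_hits) {
    assert(num_hits <= MAX_HITS);
    ShaderData sd;
    float sum = 0.0f;
    for (std::size_t i = 0; i < num_hits; ++i) {
        setup_from_hit(sd, hits[i]);
        sum += shade(sd);
    }
    return sum;
}
```

Both functions produce the same result; the second one just needs stack space for a single ShaderData instead of `MAX_HITS` of them, at the cost of doing the setup inside the shading loop.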
Great, what percentage memory reduction is this? Memory usage depends on the number of CUDA cores, so several hundred MB could mean anywhere from 10% to 50% or so.
We should find a way to avoid the code duplication though, while still avoiding the ShaderData memory usage.
I'm away from my dev machine right now, but I think that sizeof(ShaderData) is somewhere in the range of 20kB-40kB. The patch should save three instances of ShaderData, so it's reducing local memory usage in the ballpark of 80kB per thread. Don't quote me on those exact numbers though, I can look it up in more detail tomorrow.
Percentage-wise, I think it was about 40%.
Never mind, I have to dampen expectations: sizeof(ShaderData) is ~5kB, so the amount of memory saved is about 14kB/thread.
Here's the output of nvcc (sm_30) before the patch:
ptxas info : Function properties for _Z30kernel_path_subsurface_scatterP13KernelGlobalsP10ShaderDataP12PathRadianceP9PathStatePjP3RayP6float382 25256 bytes stack frame, 2596 bytes spill stores, 2520 bytes spill loads
[..]
ptxas info : Function properties for _Z39kernel_branched_path_subsurface_scatterP13KernelGlobalsP10ShaderDataP12PathRadianceP9PathStatePjP3Ray6float381 25032 bytes stack frame, 1212 bytes spill stores, 2240 bytes spill loads
and after the patch:
ptxas info : Function properties for _Z30kernel_path_subsurface_scatterP13KernelGlobalsP10ShaderDataP12PathRadianceP9PathStatePjP3RayP6float382 11288 bytes stack frame, 2184 bytes spill stores, 2208 bytes spill loads
[..]
ptxas info : Function properties for _Z39kernel_branched_path_subsurface_scatterP13KernelGlobalsP10ShaderDataP12PathRadianceP9PathStatePjP3Ray6float381 11112 bytes stack frame, 1136 bytes spill stores, 2096 bytes spill loads
Querying CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES (via cuFuncGetAttribute) on a machine with SM 5.2 returns 89024 bytes before the patch and 75056 bytes after the patch. So it's roughly 14kB, or about 15%, less local memory per kernel thread.
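For reference, the arithmetic behind those figures can be written down as a small sketch. The helper names are made up for this example. The device-wide estimate additionally assumes that local memory is reserved for every thread that can be resident at once (per-thread size × threads per SM × number of SMs), which is a common rule of thumb and not something measured here; any SM count and thread limit passed to it are hypothetical inputs.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of per-thread local memory saved between two measurements.
constexpr std::uint64_t bytes_saved(std::uint64_t before, std::uint64_t after) {
    return before - after;
}

// Relative saving in percent of the "before" value.
constexpr double percent_saved(std::uint64_t before, std::uint64_t after) {
    return 100.0 * static_cast<double>(before - after) / static_cast<double>(before);
}

// Rough device-wide estimate. ASSUMPTION: local memory is reserved for every
// thread that can be resident at once; sm_count and max_threads_per_sm are
// device properties supplied by the caller.
inline std::uint64_t device_bytes(std::uint64_t per_thread_bytes,
                                  std::uint64_t sm_count,
                                  std::uint64_t max_threads_per_sm) {
    assert(sm_count > 0 && max_threads_per_sm > 0);
    return per_thread_bytes * sm_count * max_threads_per_sm;
}

// Local size reported by cuFuncGetAttribute before and after the patch.
static_assert(bytes_saved(89024, 75056) == 13968, "per-thread saving");
// Stack frames reported by ptxas for the two subsurface scatter functions.
static_assert(bytes_saved(25256, 11288) == 13968, "path stack frame saving");
static_assert(bytes_saved(25032, 11112) == 13920, "branched path stack frame saving");
```

With the numbers above, `percent_saved(89024, 75056)` comes out at roughly 15.7%, and the stack frame reduction of the non-branched function matches the drop in reported local size exactly.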
The numbers I remembered initially were for our branch of the kernel, which has a few things turned off, so the percentage saved is higher there.
This is a nice memory saving indeed. Does it have any effect on render time?
And I agree there should be a way to avoid the code duplication, e.g. by moving some parts into utility functions.
@Stefan Werner (swerner), my approach was very similar, yes. Basically, storing a simplified version of ShaderData which only has the intersection point, geometric normal and such. Plus there were some additional tweaks to reduce memory usage even further.
There's still some room for improvement, but we don't have an array of ShaderData on the stack now.