- User Since
- Mar 31 2015, 9:29 AM (98 w, 5 d)
Fri, Feb 17
Oct 6 2016
I have experimented in the same direction. My code was using shared memory, which should in theory be faster than global/local memory. As Brecht mentions, moving through shadow in steps of N intersections at a time could make this work for arbitrary shadow depths. One would need to find the N closest intersections then, which I would try by keeping the intersections in a sorted heap (similar to Jensen's photon mapping code). Then sorting and volumes should also be doable.
Jul 21 2016
Here's an updated patch.
Jun 30 2016
Looks like cuMemAllocHost() is not the same as malloc() followed by mlock(). While malloc/mlock easily allows me to get 14 out of 16GB on my machine, cuMemAllocHost() freezes the entire machine at 12GB.
Jun 29 2016
Right now ray traversal is not making any use of shared memory at all. There may be an opportunity here, either for storing a short stack or the bit mask from the Barringer paper in shared memory. If I understood the Barringer paper correctly, they compared a CPU implementation only. The differences between global/local memory access and shared memory access could give different results in a GPU implementation.
The change to SHADOW_RECORD_ALL wasn't meant as a permanent solution, but to compare how GPU performance would be in the largest Blender scene available to me (will happily render any other scenes). When SHADOW_RECORD_ALL is different for CPU and GPU, we're not tracing the same number of rays, and benchmarks with high levels of transparent shadows will always perform better on the CPU.
Jun 22 2016
My computer just finished rendering the Victor/Gooseberry benchmark scene on two GPUs, coming in at about 1h:36m (conservative tile size of 64x64). So I think my patch is pretty solid.
Jun 19 2016
Jun 16 2016
There is no swapping with pinned memory. That's the whole point about pinned memory - it always remains fixed in physical memory, at the same address, no matter how severe the OS' memory pressure is. That said, it was an SSD.
Octane exposes what they call "out-of-core" textures, which I assume is textures in pinned memory, as a feature that the user has to enable and to pick how much memory it is supposed to use: https://docs.otoy.com/Standalone_2_0/?page_id=3216
Different operating systems may handle this differently, but for what it's worth, I ran a simple test on OS X. cuMemAllocHost() in a loop, allocating 200MB chunks, 100 times - which should try to allocate ~20GB in total. My machine has 16GB of physical memory. At just under 10GB allocated, the machine froze completely. That is, not even the mouse pointer will move and the machine does not respond any more to network requests. After about one or two two minutes of being frozen, the machine rebooted, the "crash" log pointing to the Watchdog task.
Jun 15 2016
From the CUDA documentation it sounds like pinning all available memory may not be a good idea:
It only pins the amount of memory required, no more. The user preference would only set an upper bound to that.
A couple of ideas for improvement:
This should now include all changes, squashed into a single commit.
Yes. I'm still trying to figure out how this system and git patches work. It looks like when I try to upload my diff file with multiple commits in it, it takes only the first one.
This seems to be solid enough to allow me to launch a render of the Gooseberry benchmark scene on my 4GB GPU with TDR turned off. However, some tiles render extremely slow (counting seconds per sample instead of samples per second!), I haven't found out yet what crazy things happening in them.
OK, here you are: https://developer.blender.org/D2056
I have attached a patch should hopefully take care of the Linux memory query. Apply this on top of the first patch.
The code for determining the amount of physical system memory is not implemented yet for Linux - I am about to create a Linux VM to implement and test that part. So it is possible that the current patch is not working as intended on Linux systems.
Jun 14 2016
I can get the Victor scene to start rendering on the GPU, but it fails with a kernel time out after about 20 tiles (using the default small tile size). Maybe someone with TDR turned off can try benchmarking it?
PCIe bandwidth seems to be the main factor here. When I move the K5000 to a slower slot (x4 instead of x16), performance drops dramatically. I haven't had the patience to let it run all the way through, but the BMW scene with all data in host memory is still at the first tile after 4 minutes, showing a remaining estimate of over one hour.
Well, the answer is a clear "it depends". My desktop machine has a single i5 CPU and a K5000 GPU, so the GPU can take quite a performance penalty before the CPU overtakes it - my laptop has an i7 quad core and a 750M, there the GPU/CPU difference is not as strong. I know, K5000 sounds like a monster card, but for this purpose, it's more or less a GTX 670 with twice the memory.
Jun 13 2016
May 24 2016
May 19 2016
May 18 2016
May 17 2016
This updated patch removes the now unused ray_quad_intersect() and ray_triangle_intersect().
May 16 2016
Here is a before/after comparison with an area light:
The same shader on a point light:
May 15 2016
Nov 27 2015
Sorry for not responding earlier, there were a number of other things to work on. Sergey, is this still relevant or do your latest changes to SSS take care of this?
Nov 20 2015
Here we go. This would be my proposed fix. I hope my code style isn't too far from your standards.
Nov 19 2015
Querying for CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES (via cuFuncGetAttribute) on a machine with SM 5.2, returns 89024 bytes before the patch and 75056 after the patch. So it's about 15% less local memory per kernel thread.
Nevermind, I have to dampen expectations. sizeof(ShaderData) is ~5KB, the amount of memory saved is about 14kB/thread.
Nov 18 2015
I'm away from my dev machine right now, but I think that sizeof(ShaderData) is somewhere in the range of 20kB-40kB. The patch should save three instances of ShaderData, so it's reducing local memory usage in the ballpark of 80kB per thread. Don't quote me on those exact numbers though, I can look it up in more detail tomorrow.
Sure, no problem:
The ground truth would be the path traced result without MIS, which matches PT with MIS and BPT without MIS.
Nov 17 2015
Brecht, I'm not sure I'm following you. In my opinion, the path tracing integrator gives the correct result, where diffuse is uniform, the branched path tracer is incorrect and darkens diffuse in the presence of specular.
Nov 13 2015
Jul 16 2015
Jul 15 2015
JSON or XML should be irrelevant, both are going to be equally easy to read or write with a decent library. I hope nobody has intentions of reinventing the wheel by writing yet another XML parser!