Stefan Werner (swerner)
User

Projects

User Since
Mar 31 2015, 9:29 AM (98 w, 5 d)

Recent Activity

Fri, Feb 17

Stefan Werner (swerner) awarded D2348: Cycles: Refactor split kernel and implement for CPU a Like token.
Fri, Feb 17, 11:37 AM

Oct 6 2016

Stefan Werner (swerner) added a comment to D2226: Cycles: Speedup transparent shadows on CUDA.

I have experimented in the same direction. My code was using shared memory, which should in theory be faster than global/local memory. As Brecht mentions, moving through shadow in steps of N intersections at a time could make this work for arbitrary shadow depths. One would need to find the N closest intersections then, which I would try by keeping the intersections in a sorted heap (similar to Jensen's photon mapping code). Then sorting and volumes should also be doable.
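
As a rough illustration of the sorted-heap idea, here is a minimal CPU-side sketch that keeps only the N closest intersections seen so far. It is not the code from my experiment or from D2226: the `Hit` struct, the `ClosestHits` class and its method names are all hypothetical, and a GPU version would use a fixed-size array in shared or local memory instead of `std::vector`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical record for one ray hit; not the Cycles Intersection struct.
struct Hit {
    float t;   // distance along the ray
    int prim;  // primitive index
};

// Orders hits by distance, so the std heap functions build a max-heap on t:
// the farthest of the kept hits sits at the front.
inline bool closer(const Hit &a, const Hit &b) { return a.t < b.t; }

// Keeps at most `capacity` hits, always the closest ones recorded so far.
class ClosestHits {
public:
    explicit ClosestHits(std::size_t capacity) : capacity_(capacity) {}

    void record(const Hit &hit) {
        if (capacity_ == 0) return;
        if (hits_.size() < capacity_) {
            hits_.push_back(hit);
            std::push_heap(hits_.begin(), hits_.end(), closer);
        } else if (hit.t < hits_.front().t) {
            // Replace the current farthest hit with the closer one.
            std::pop_heap(hits_.begin(), hits_.end(), closer);
            hits_.back() = hit;
            std::push_heap(hits_.begin(), hits_.end(), closer);
        }
    }

    // Returns the kept hits ordered from nearest to farthest.
    std::vector<Hit> sorted() const {
        std::vector<Hit> out = hits_;
        std::sort_heap(out.begin(), out.end(), closer);
        return out;
    }

private:
    std::size_t capacity_;
    std::vector<Hit> hits_;
};
```

Each recorded hit costs O(log N), and the shadow ray would then be restarted just past the farthest kept hit to collect the next batch of N.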

Oct 6 2016, 10:03 AM

Jul 21 2016

Stefan Werner (swerner) added inline comments to D1995: Parametric Geometry coordinates for point and area lights in Cycles.
Jul 21 2016, 3:20 AM · Cycles
Stefan Werner (swerner) updated the diff for D1995: Parametric Geometry coordinates for point and area lights in Cycles.

Here's an updated patch.

Jul 21 2016, 3:17 AM · Cycles

Jun 30 2016

Stefan Werner (swerner) added a comment to D2056: Allow CUDA GPU rendering to use host memory.

Looks like cuMemAllocHost() is not the same as malloc() followed by mlock(). While malloc/mlock easily allows me to get 14 out of 16GB on my machine, cuMemAllocHost() freezes the entire machine at 12GB.

Jun 30 2016, 7:53 AM · Cycles

Jun 29 2016

Stefan Werner (swerner) added a comment to D2032: Cycles: Implement stackless BVH traversal.

Right now ray traversal is not making any use of shared memory at all. There may be an opportunity here, either for storing a short stack or the bit mask from the Barringer paper in shared memory. If I understood the Barringer paper correctly, they compared a CPU implementation only. The differences between global/local memory access and shared memory access could give different results in a GPU implementation.

Jun 29 2016, 12:25 PM
Stefan Werner (swerner) added a comment to D2056: Allow CUDA GPU rendering to use host memory.

The change to SHADOW_RECORD_ALL wasn't meant as a permanent solution, but to compare how GPU performance would be in the largest Blender scene available to me (will happily render any other scenes). When SHADOW_RECORD_ALL is different for CPU and GPU, we're not tracing the same number of rays, and benchmarks with high levels of transparent shadows will always perform better on the CPU.

Jun 29 2016, 7:36 AM · Cycles

Jun 22 2016

Stefan Werner (swerner) added a comment to D2056: Allow CUDA GPU rendering to use host memory.

My computer just finished rendering the Victor/Gooseberry benchmark scene on two GPUs, coming in at about 1h:36m (conservative tile size of 64x64). So I think my patch is pretty solid.

Jun 22 2016, 4:00 PM · Cycles

Jun 19 2016

Stefan Werner (swerner) updated subscribers of D1985: Light Linking.
Jun 19 2016, 8:07 PM · Cycles

Jun 16 2016

Stefan Werner (swerner) added a comment to D2056: Allow CUDA GPU rendering to use host memory.

There is no swapping with pinned memory. That's the whole point of pinned memory: it always remains fixed in physical memory, at the same address, no matter how severe the OS's memory pressure is. That said, it was an SSD.

Jun 16 2016, 2:50 PM · Cycles
Stefan Werner (swerner) added a comment to D2056: Allow CUDA GPU rendering to use host memory.

Octane exposes what they call "out-of-core" textures, which I assume is textures in pinned memory, as a feature that the user has to enable and to pick how much memory it is supposed to use: https://docs.otoy.com/Standalone_2_0/?page_id=3216

Jun 16 2016, 11:32 AM · Cycles
Stefan Werner (swerner) added a comment to D2056: Allow CUDA GPU rendering to use host memory.

It might also make sense to use pinned memory for all device_vector allocations even if we want to copy memory to the GPU. That way we can use async copies, which should be faster, particularly for multi-GPU setups.

Jun 16 2016, 7:44 AM · Cycles
Stefan Werner (swerner) added a comment to D2056: Allow CUDA GPU rendering to use host memory.

Different operating systems may handle this differently, but for what it's worth, I ran a simple test on OS X: cuMemAllocHost() in a loop, allocating 200MB chunks, 100 times, which should try to allocate ~20GB in total. My machine has 16GB of physical memory. At just under 10GB allocated, the machine froze completely. That is, not even the mouse pointer would move and the machine no longer responded to network requests. After about one or two minutes of being frozen, the machine rebooted, with the "crash" log pointing to the Watchdog task.

Jun 16 2016, 7:38 AM · Cycles

Jun 15 2016

Stefan Werner (swerner) added a comment to D2056: Allow CUDA GPU rendering to use host memory.

From the CUDA documentation it sounds like pinning all available memory may not be a good idea:

Jun 15 2016, 10:00 PM · Cycles
Stefan Werner (swerner) added a comment to D2056: Allow CUDA GPU rendering to use host memory.

It only pins the amount of memory required, no more. The user preference would only set an upper bound to that.

Jun 15 2016, 7:59 PM · Cycles
Stefan Werner (swerner) added a comment to D2056: Allow CUDA GPU rendering to use host memory.

A couple of ideas for improvement:

Jun 15 2016, 7:06 PM · Cycles
Stefan Werner (swerner) updated the diff for D2056: Allow CUDA GPU rendering to use host memory.

This should now include all changes, squashed into a single commit.

Jun 15 2016, 4:38 PM · Cycles
Stefan Werner (swerner) added a comment to D2056: Allow CUDA GPU rendering to use host memory.

Yes. I'm still trying to figure out how this system and git patches work. It looks like when I try to upload my diff file with multiple commits in it, it takes only the first one.

Jun 15 2016, 4:30 PM · Cycles
Stefan Werner (swerner) added a comment to D2056: Allow CUDA GPU rendering to use host memory.

This seems to be solid enough to allow me to launch a render of the Gooseberry benchmark scene on my 4GB GPU with TDR turned off. However, some tiles render extremely slowly (counting seconds per sample instead of samples per second!), and I haven't found out yet what crazy things are happening in them.

Jun 15 2016, 4:20 PM · Cycles
Stefan Werner (swerner) added a comment to T48651: Allow CUDA GPU rendering to use host memory.

OK, here you are: https://developer.blender.org/D2056

Jun 15 2016, 4:17 PM · Cycles
Stefan Werner (swerner) created D2056: Allow CUDA GPU rendering to use host memory.
Jun 15 2016, 4:17 PM · Cycles
Stefan Werner (swerner) added a comment to T48651: Allow CUDA GPU rendering to use host memory.

I have attached a patch that should hopefully take care of the Linux memory query. Apply this on top of the first patch.
linuxmem.diff

Jun 15 2016, 12:25 PM · Cycles
Stefan Werner (swerner) added a comment to T48651: Allow CUDA GPU rendering to use host memory.

The code for determining the amount of physical system memory is not implemented yet for Linux - I am about to create a Linux VM to implement and test that part. So it is possible that the current patch is not working as intended on Linux systems.
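
For reference, one common way to query physical memory on Linux is through sysconf(). The sketch below is an assumption about how that could look, not the contents of the patch; the function name `physical_memory_bytes` is hypothetical.

```cpp
#include <cstdint>
#include <unistd.h>

// Returns total physical memory in bytes, or 0 if the query fails.
// _SC_PHYS_PAGES is a glibc/Linux extension, not part of POSIX.
inline std::uint64_t physical_memory_bytes() {
    const long pages = sysconf(_SC_PHYS_PAGES);
    const long page_size = sysconf(_SC_PAGESIZE);
    if (pages <= 0 || page_size <= 0) return 0;
    return static_cast<std::uint64_t>(pages) *
           static_cast<std::uint64_t>(page_size);
}
```

The result would only serve as an upper bound when deciding how much host memory may be pinned.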

Jun 15 2016, 8:42 AM · Cycles

Jun 14 2016

Stefan Werner (swerner) added a comment to T48651: Allow CUDA GPU rendering to use host memory.

I can get the Victor scene to start rendering on the GPU, but it fails with a kernel timeout after about 20 tiles (using the default small tile size). Maybe someone with TDR turned off can try benchmarking it?

Jun 14 2016, 10:54 PM · Cycles
Stefan Werner (swerner) added a comment to T48651: Allow CUDA GPU rendering to use host memory.

PCIe bandwidth seems to be the main factor here. When I move the K5000 to a slower slot (x4 instead of x16), performance drops dramatically. I haven't had the patience to let it run all the way through, but the BMW scene with all data in host memory is still at the first tile after 4 minutes, showing a remaining estimate of over one hour.

Jun 14 2016, 6:08 PM · Cycles
Stefan Werner (swerner) added a comment to T48651: Allow CUDA GPU rendering to use host memory.

Well, the answer is a clear "it depends". My desktop machine has a single i5 CPU and a K5000 GPU, so the GPU can take quite a performance penalty before the CPU overtakes it - my laptop has an i7 quad core and a 750M, there the GPU/CPU difference is not as strong. I know, K5000 sounds like a monster card, but for this purpose, it's more or less a GTX 670 with twice the memory.

Jun 14 2016, 2:53 PM · Cycles
Stefan Werner (swerner) added a comment to T48651: Allow CUDA GPU rendering to use host memory.

patch.diff

Jun 14 2016, 10:45 AM · Cycles
Stefan Werner (swerner) created T48651: Allow CUDA GPU rendering to use host memory.
Jun 14 2016, 10:44 AM · Cycles

Jun 13 2016

Stefan Werner (swerner) committed rB2566652ae6c0: Cycles: fixed a typo that would crash shaders that use the "Is Diffuse Ray"… (authored by Stefan Werner (swerner)).
Jun 13 2016, 1:34 PM

May 24 2016

Stefan Werner (swerner) awarded D1999: Cycles: Add support for bindless textures. a Like token.
May 24 2016, 10:21 AM

May 19 2016

Stefan Werner (swerner) created D2008: Fixed a rare case of NaN in Cycles.
May 19 2016, 10:51 AM · Cycles

May 18 2016

Stefan Werner (swerner) updated subscribers of D1999: Cycles: Add support for bindless textures..
May 18 2016, 8:21 AM
Stefan Werner (swerner) updated subscribers of D2002: Cycles: Add multi-scattering, energy-conserving GGX as an option to the Glossy, Anisotropic and Glass BSDFs.
May 18 2016, 8:21 AM
Stefan Werner (swerner) updated subscribers of D2003: Cycles: Add a new Metallic BSDF, combining conductive fresnel and multi-scattering GGX.
May 18 2016, 8:21 AM

May 17 2016

Stefan Werner (swerner) updated the diff for D1995: Parametric Geometry coordinates for point and area lights in Cycles.

This updated patch removes the now unused ray_quad_intersect() and ray_triangle_intersect().

May 17 2016, 3:16 PM · Cycles

May 16 2016

Stefan Werner (swerner) added a comment to D1995: Parametric Geometry coordinates for point and area lights in Cycles.

Here is a before/after comparison with an area light:


The same shader on a point light:

May 16 2016, 1:02 PM · Cycles

May 15 2016

Stefan Werner (swerner) created D1995: Parametric Geometry coordinates for point and area lights in Cycles.
May 15 2016, 9:16 PM · Cycles

Nov 27 2015

Stefan Werner (swerner) added a comment to D1621: Cycles: reduced memory usage of subsurface scattering.

Sorry for not responding earlier, there were a number of other things to work on. Sergey, is this still relevant or do your latest changes to SSS take care of this?

Nov 27 2015, 4:24 PM · Cycles
Stefan Werner (swerner) added a comment to T46760: Branched Path Tracing converges to different result than plain Path Tracing.

Thanks!

Nov 27 2015, 4:21 PM · BF Blender, Cycles

Nov 20 2015

Stefan Werner (swerner) added a comment to T46760: Branched Path Tracing converges to different result than plain Path Tracing.

Here we go. This would be my proposed fix. I hope my code style isn't too far from your standards.
0001-fix-for-T46760.patch

Nov 20 2015, 5:45 PM · BF Blender, Cycles

Nov 19 2015

Stefan Werner (swerner) added a comment to D1621: Cycles: reduced memory usage of subsurface scattering.

Querying for CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES (via cuFuncGetAttribute) on a machine with SM 5.2 returns 89024 bytes before the patch and 75056 after the patch. So it's roughly 14 kB, or about 16%, less local memory per kernel thread.
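
The saving follows directly from the two reported figures. A small sketch of the arithmetic (the constant and function names are hypothetical, chosen only for this example):

```cpp
#include <cstdint>

// Figures reported above for CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES.
constexpr std::int64_t kLocalBytesBefore = 89024;
constexpr std::int64_t kLocalBytesAfter = 75056;

// Local memory saved per kernel thread, in bytes.
constexpr std::int64_t saved_bytes() {
    return kLocalBytesBefore - kLocalBytesAfter;
}

// The same saving as a percentage of the pre-patch size.
constexpr double saved_percent() {
    return 100.0 * static_cast<double>(saved_bytes()) /
           static_cast<double>(kLocalBytesBefore);
}
```

That is 13968 bytes per thread, a little under 16% of the original size.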

Nov 19 2015, 12:57 PM · Cycles
Stefan Werner (swerner) added a comment to D1621: Cycles: reduced memory usage of subsurface scattering.

Never mind, I have to dampen expectations. sizeof(ShaderData) is ~5kB, and the amount of memory saved is about 14kB/thread.

Nov 19 2015, 11:33 AM · Cycles

Nov 18 2015

Stefan Werner (swerner) added a comment to D1621: Cycles: reduced memory usage of subsurface scattering.

I'm away from my dev machine right now, but I think that sizeof(ShaderData) is somewhere in the range of 20kB-40kB. The patch should save three instances of ShaderData, so it's reducing local memory usage in the ballpark of 80kB per thread. Don't quote me on those exact numbers though, I can look it up in more detail tomorrow.

Nov 18 2015, 9:12 PM · Cycles
Stefan Werner (swerner) added a comment to T46814: Cycles: reduced memory usage of subsurface scattering.

Sure, no problem:
https://developer.blender.org/D1621

Nov 18 2015, 7:11 PM · Cycles
Stefan Werner (swerner) created D1621: Cycles: reduced memory usage of subsurface scattering.
Nov 18 2015, 7:11 PM · Cycles
Stefan Werner (swerner) created T46814: Cycles: reduced memory usage of subsurface scattering.
Nov 18 2015, 6:05 PM · Cycles
Stefan Werner (swerner) added a comment to T46760: Branched Path Tracing converges to different result than plain Path Tracing.

The ground truth would be the path traced result without MIS, which matches PT with MIS and BPT without MIS.

Nov 18 2015, 10:52 AM · BF Blender, Cycles

Nov 17 2015

Stefan Werner (swerner) added a comment to T46760: Branched Path Tracing converges to different result than plain Path Tracing.

Brecht, I'm not sure I'm following you. In my opinion, the path tracing integrator gives the correct result, where diffuse is uniform; the branched path tracer is incorrect and darkens diffuse in the presence of specular.

Nov 17 2015, 6:28 PM · BF Blender, Cycles

Nov 13 2015

Stefan Werner (swerner) created T46760: Branched Path Tracing converges to different result than plain Path Tracing.
Nov 13 2015, 1:50 PM · BF Blender, Cycles

Jul 16 2015

Stefan Werner (swerner) added a comment to T45447: Area light importance sampling improvement.

Excellent, thanks!

Jul 16 2015, 10:04 AM · BF Blender, Cycles

Jul 15 2015

Stefan Werner (swerner) created T45447: Area light importance sampling improvement.
Jul 15 2015, 5:01 PM · BF Blender, Cycles
Stefan Werner (swerner) added a comment to T38279: Improve Cycles standalone.

JSON or XML should be irrelevant, both are going to be equally easy to read or write with a decent library. I hope nobody has intentions of reinventing the wheel by writing yet another XML parser!

Jul 15 2015, 1:13 PM · BF Blender, Cycles

Mar 31 2015

Stefan Werner (swerner) updated subscribers of D1200: Cycles OpenCL kernel-splitting work.
Mar 31 2015, 9:36 AM · Cycles