Cycles: Speedup transparent shadows on CUDA

Authored by Sergey Sharybin (sergey) on Sep 13 2016, 6:19 PM.



There are a couple of aspects here.

The first idea is to use the fact that for scenes without volumes it's not
really necessary to traverse intersections in z-sorted order. This is
achieved by a special BVH traversal function with a callback which is
called on every intersection. From within this callback we can
modify the throughput or abort traversal if we hit something opaque.
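
A minimal sketch of the callback idea, not the actual kernel code. A plain array of hits stands in for the BVH, and the names Hit, traverse_unordered and shadow_throughput are hypothetical. It only illustrates why the visiting order does not matter when there are no volumes: the throughput is a product, so it can be accumulated in any order and traversal can stop at the first opaque hit.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one surface crossed by the shadow ray.
struct Hit {
    float t;              // distance along the ray (unused here: order is irrelevant)
    float transmittance;  // 0 = opaque, 1 = fully transparent
};

// Visits hits in storage order (NOT distance order) and calls on_hit for each.
// The callback returns false to abort traversal. Returns the number of hits visited.
template <typename Callback>
int traverse_unordered(const std::vector<Hit>& hits, Callback on_hit) {
    int visited = 0;
    for (const Hit& h : hits) {
        ++visited;
        if (!on_hit(h)) {
            break;
        }
    }
    return visited;
}

// Fraction of light that reaches the shading point along the shadow ray.
float shadow_throughput(const std::vector<Hit>& hits, int* visited = nullptr) {
    float throughput = 1.0f;
    const int n = traverse_unordered(hits, [&](const Hit& h) {
        assert(h.transmittance >= 0.0f && h.transmittance <= 1.0f);
        throughput *= h.transmittance;
        return throughput > 0.0f;  // abort as soon as something opaque is hit
    });
    if (visited != nullptr) {
        *visited = n;
    }
    return throughput;
}
```

With volumes this shortcut is not valid, because the volume stack has to be updated in the order the ray enters and leaves objects.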

In tests with koro_blend.blend this gave about a 2x speedup (1:32 with the
RC1 build and 0:38 with the patched master). Be careful when comparing
results, because the GPU version used to give some wrong results because of
missing intersections (as far as I can see, this is caused by moving the ray
too far from the bounce point, which moves it out of Koro's fur).

This speedup only works for scenes without volumes. It also requires a
bit more VRAM: since it's one extra BVH traversal it needs one more
stack, which isn't getting de-duplicated by the CUDA compiler. On my
current GTX1080 it is about 48 megabytes.

The second idea is about optimizing transparent shadows on GPU for scenes
with volumes and few transparent bounces.

For scenes with few transparent shadows, it is reasonably faster to sort
an array of recorded intersections than to do a full BVH intersection
query for each intersection step. Unfortunately, it gives some more memory
usage penalty, and now it's about 20 megabytes (weirdly, it's half
of the previous bump, so maybe some of the static arrays were
de-duplicated this time).

Makes koro.blend render in the same time even after adding a volume
to the scene.

Currently this only works for up to 16 transparent bounces.
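
A rough sketch of the record-and-sort idea, assuming a fixed-size per-thread buffer. HitBuffer, RecordedHit and kMaxRecordedHits are hypothetical names, and the value 16 only mirrors the limit mentioned above. One traversal fills the buffer, and a single sort then gives the distance order that volume handling needs.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kMaxRecordedHits = 16;  // assumption: mirrors the 16-bounce limit

struct RecordedHit {
    float t;   // distance along the ray
    int prim;  // primitive index, only for illustration
};

struct HitBuffer {
    std::array<RecordedHit, kMaxRecordedHits> hits{};
    std::size_t count = 0;

    // Records one intersection. Returns false when the buffer is full, in
    // which case the caller has to fall back to a slower path.
    bool record(float t, int prim) {
        if (count == kMaxRecordedHits) {
            return false;
        }
        hits[count++] = RecordedHit{t, prim};
        return true;
    }

    // Sorts the recorded hits by distance so they can be shaded front to back.
    void sort_by_distance() {
        std::sort(hits.begin(),
                  hits.begin() + static_cast<std::ptrdiff_t>(count),
                  [](const RecordedHit& a, const RecordedHit& b) { return a.t < b.t; });
    }
};
```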

Sergey Sharybin (sergey) retitled this revision from to Cycles: Cycles: Speedup transparent shadows on CUDA. Sep 13 2016, 6:19 PM
Sergey Sharybin (sergey) updated this object.

Ran tests with the barcelona file. With master it rendered in 1min 14sec using 1052MB; with the patch it rendered in 58sec and used 1118MB on a GTX1080.

I didn't have time yet to review the code. But it would be cool if we could avoid duplicating the code between CPU and GPU and volume and non-volume cases, with all these special cases it becomes difficult to maintain. Could all these ideas be combined into a single algorithm that works on CPU and GPU?

Out of order shader evaluation could work well on the CPU too. Depending on the scene it might be faster than the sorting we do now, if a shader evaluation returns an opaque result and BVH traversal can be terminated early. For example texture mapped tree leaves could be 75% opaque and you'd get a 75% probability of being able to terminate on the first intersection.

And perhaps that could be combined with the second idea, doing immediate shader evaluation for surfaces while recording volume intersections to be evaluated afterwards or as soon as complete volume segments are found? That would let you skip volume shading if a surface shader evaluates to opaque first.

It's not always ideal though, the failure case for immediate shader evaluation is if you have e.g. a character with transparent fur next to an opaque wall. In that case it can be better to wait with shading until the opaque intersection has been found.

Even if we do need different algorithms on CPU and GPU, perhaps the bvh_shadow_all kernel can be replaced by this callback kernel + a callback function?

Nice speedup. Is there a technical reason why it's done only for CUDA?

@Brecht Van Lommel (brecht), the duplicated code is indeed annoying, but i don't really see how we can avoid duplication between CPU and GPU.

From the performance point of view, on CPU it might be faster to record all intersections at once (from a single BVH traversal) and then deal with them, rather than doing things from a callback which can't be inlined. Note: we can easily avoid qsort() on CPU if we know there are no volumes in the scene.

Now, the annoying part continues on GPU as well. There the callback-based approach is a noticeable percentage (around 5% AFAIR) slower than recording all intersections (well, a limited number of intersections, the ones we limit to 16 atm). So this part we might re-iterate on GPU as well and always do things like shadow_blocked_transparent_volume_all_static when having few transparent bounces.

OpenCL is even more tricky. I didn't check the specs in detail, but just trying to compile callback-based BVH traversal crashes my video driver on Linux. So here we can't do much at this moment i'm afraid.

So the only way to avoid much duplication i see is to have some #include'ed "template" several times with different define flags. That way we can avoid duplication of shadow_blocked_transparent_volume_all_static() and CPU's shadow_blocked() at least. And also should be possible to avoid duplication of shadow_blocked_transparent() when not having callback BVH traversal and shadow_blocked_transparent_volume(). Sounds a bit tricky, but in practice should be rather straightforward implementation.
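
To illustrate the "template" idea in a single self-contained file, here is a sketch that uses a function-generating macro in place of a separately #include'd file; the effect (one body, several variants selected by a flag) is the same. The names DEFINE_SHADOW_THROUGHPUT, shadow_throughput_surface and shadow_throughput_volume are made up for this example and are not the functions from the patch.

```cpp
#include <cassert>

// One shared body, stamped out once per flag combination. In the kernel this
// would be a header that is included several times with different #defines;
// a macro is used here only to keep the sketch in one file.
#define DEFINE_SHADOW_THROUGHPUT(NAME, WITH_VOLUMES)                              \
    inline float NAME(const float* transmittance, int n, float volume_factor) {  \
        float throughput = 1.0f;                                                  \
        for (int i = 0; i < n; ++i) {                                             \
            throughput *= transmittance[i];                                       \
            if (WITH_VOLUMES) {                                                   \
                /* Extra attenuation between surfaces, volume variant only. */    \
                throughput *= volume_factor;                                      \
            }                                                                     \
        }                                                                         \
        return throughput;                                                        \
    }

DEFINE_SHADOW_THROUGHPUT(shadow_throughput_surface, false)
DEFINE_SHADOW_THROUGHPUT(shadow_throughput_volume, true)
```

Since WITH_VOLUMES is a compile-time constant, the compiler can drop the dead branch from the surface-only variant.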

@john peterson (bliblubli), CPU is already close to ideal performance here, and OpenCL does not support callback functions, so the ideas implemented in this patch do not apply there either.

Perhaps we can record the first N intersections on the GPU, and then do a second/third/.. intersection in the hopefully rare case it's needed? That could also share most code with the CPU, the difference just would be the limit on the max number of intersections to store.
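
A sketch of that batching loop, with a sorted list of hit distances standing in for the scene. closest_hits_after plays the role of one BVH traversal; it and count_traversals are hypothetical helpers, not Cycles functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for one BVH query: the n closest hits strictly beyond t_min,
// sorted by distance. Hits at exactly t_min are skipped, so coincident
// surfaces would need extra care in a real implementation.
std::vector<float> closest_hits_after(const std::vector<float>& scene_hits,
                                      float t_min, std::size_t n) {
    std::vector<float> out;
    for (float t : scene_hits) {
        if (t > t_min) {
            out.push_back(t);
        }
    }
    std::sort(out.begin(), out.end());
    if (out.size() > n) {
        out.resize(n);
    }
    return out;
}

// Walks all hits front to back, n at a time, and returns how many
// traversals were needed. Hits are assumed to lie at t > 0.
int count_traversals(const std::vector<float>& scene_hits, std::size_t n) {
    assert(n > 0);
    int traversals = 0;
    float t_min = 0.0f;
    for (;;) {
        const std::vector<float> batch = closest_hits_after(scene_hits, t_min, n);
        ++traversals;
        // ... shade the hits in `batch` here ...
        if (batch.size() < n) {
            break;  // buffer was not filled, so nothing lies beyond it
        }
        t_min = batch.back();  // continue behind the farthest recorded hit
    }
    return traversals;
}
```

When the ray crosses fewer than N surfaces, a single traversal is enough, which is the common case this relies on.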

I don't think the memory usage increase would be too bad even if you store e.g. 32 intersections (32*6*4 = 768 bytes). If you compare that to other memory usage (D2023#46333), it's not that much, and we can win that back elsewhere with optimizations like D2023.
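
The estimate can be checked at compile time with a small sketch. It assumes an intersection record made of six 4-byte fields, which is what the 6*4 term implies; the struct name and field names below are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record with six 4-byte fields, matching the 6*4 in the estimate.
struct IntersectionRecord {
    float t, u, v;
    std::int32_t prim, object, type;
};

static_assert(sizeof(IntersectionRecord) == 6 * 4,
              "expected six tightly packed 4-byte fields");

// Per-thread storage needed to keep max_hits recorded intersections.
constexpr std::size_t stack_bytes_per_thread(std::size_t max_hits) {
    return max_hits * sizeof(IntersectionRecord);
}

static_assert(stack_bytes_per_thread(32) == 768, "32 * 6 * 4 = 768 bytes");
```

This is the cost per thread; the total on a GPU scales with the number of threads in flight.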

Majorly reworked the patch, now there should be much less duplication.

It has similar speed to the previous approach, but uses more memory.
Now the memory bump is about 100MB on my 1080.

Hopefully some trickery from D2247 and D2249 will let us
compensate for that.

Just ran some quick tests with D2249 applied and had the same render time speedup, but overall had a 100MB improvement in memory usage (improvement compared to the master branch). So it is a win-win actually.

With D2247 applied the improvement can be even more on Maxwell, but for some reason that BVH change brings nothing to Pascal.

Now, as for moving forward.

Surely it is annoying to have memory bump, but we can use lower stack storage on GPU. Having 32 stack intersections will bring us to the same memory bump as adding extra BVH traversal.

And being evil: this patch gives a nice transparent shadows speedup (at least on Pascal, i still didn't hear anything from Maxwell benchmarks, but the 32sec vs 1min33sec i've got here is cool). So we can just accept a few % slowdown in D2247 and one of the closure patches (surely we can't apply both of them; which one we use to move forward i don't care that much, whatever looks less crappy).

Thoughts? :)

Results are mixed here with Linux and CUDA 7.5, excellent speedup in the Koro scene but also some slowdowns in others. I'll try to narrow down where the issue is.

GTX 960 render time
  Fishy Cat            +11.1%
  Pabellon Barcelona   -15.1%

I have experimented in the same direction. My code was using shared memory, which should in theory be faster than global/local memory. As Brecht mentions, moving through shadow in steps of N intersections at a time could make this work for arbitrary shadow depths. One would need to find the N closest intersections then, which I would try by keeping the intersections in a sorted heap (similar to Jensen's photon mapping code). Then sorting and volumes should also be doable.
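
A sketch of the bounded-heap idea on the host side, using std::priority_queue as a max-heap so that the farthest kept hit is always on top and can be evicted cheaply. ClosestHits is a hypothetical name; a GPU version would need a fixed-size array instead of a dynamically allocated container.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Keeps the `capacity` closest hit distances seen so far.
class ClosestHits {
public:
    explicit ClosestHits(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Offers one hit. It is kept if the heap has room or if it is closer
    // than the farthest hit currently stored.
    void offer(float t) {
        if (heap_.size() < capacity_) {
            heap_.push(t);
        } else if (t < heap_.top()) {
            heap_.pop();
            heap_.push(t);
        }
    }

    // Returns the kept hits ordered front to back.
    std::vector<float> sorted() const {
        std::priority_queue<float> copy = heap_;
        std::vector<float> out;
        while (!copy.empty()) {
            out.push_back(copy.top());
            copy.pop();
        }
        std::reverse(out.begin(), out.end());
        return out;
    }

private:
    std::size_t capacity_;
    std::priority_queue<float> heap_;  // max-heap: farthest kept hit on top
};
```

Each offer costs O(log N), and the top of the heap doubles as a distance cutoff that traversal could use to skip nodes that are farther away.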

This is getting interesting. Tracing an additional PATH_RAY_SHADOW_OPAQUE ray and testing whether the intersection is transparent before calling shadow_blocked_all() seems to solve the majority of the slowdown on the classroom scene (there might still be a barely measurable one here).

Here is a quick patch for that:

diff --git a/intern/cycles/kernel/kernel_shadow.h b/intern/cycles/kernel/kernel_shadow.h
index e69eac6ab83..dbaf266736c 100644
--- a/intern/cycles/kernel/kernel_shadow.h
+++ b/intern/cycles/kernel/kernel_shadow.h
@@ -213,7 +213,9 @@ ccl_device_noinline bool shadow_blocked_stepped(KernelGlobals *kg,
 ShaderData *shadow_sd,
 ccl_addr_space PathState *state,
 ccl_addr_space Ray *ray_input,
- float3 *shadow)
+ float3 *shadow,
+ const bool blocked,
+ Intersection *isect)
 {
 *shadow = make_float3(1.0f, 1.0f, 1.0f);
 if(ray_input->t == 0.0f) {
@@ -226,19 +228,23 @@ ccl_device_noinline bool shadow_blocked_stepped(KernelGlobals *kg,
 Ray *ray = ray_input;
 #endif
+ /*
 #ifdef __SPLIT_KERNEL__
 Intersection *isect = &kg->isect_shadow[SD_THREAD];
 #else
 Intersection isect_object;
 Intersection *isect = &isect_object;
 #endif
+ */
 /* Early check for opaque shadows. */
+ /*
 bool blocked = scene_intersect(kg,
 *ray,
 isect,
 NULL,
 0.0f, 0.0f);
+ */
 if(blocked && kernel_data.integrator.transparent_shadows) {
 if(shader_transparent_shadow(kg, isect)) {
@@ -311,16 +317,34 @@ ccl_device_inline bool shadow_blocked(KernelGlobals *kg,
 #ifdef __SHADOW_RECORD_ALL__
 # ifdef __KERNEL_CPU__
 return shadow_blocked_all(kg, shadow_sd, state, ray, shadow);
-# else
- const int transparent_max_bounce = kernel_data.integrator.transparent_max_bounce;
- const uint max_hits = transparent_max_bounce - state->transparent_bounce - 1;
- if(max_hits + 1 < SHADOW_STACK_MAX_HITS) {
- return shadow_blocked_all(kg, shadow_sd, state, ray, shadow);
- }
-# endif
+# else /* __KERNEL_CPU__ */
+#ifdef __SPLIT_KERNEL__
+ Intersection *isect = &kg->isect_shadow[SD_THREAD];
+ Intersection isect_object;
+ Intersection *isect = &isect_object;
 #endif
+ bool blocked = scene_intersect(kg,
+ *ray,
+ isect,
+ NULL,
+ 0.0f, 0.0f);
+ if(blocked && kernel_data.integrator.transparent_shadows) {
+ if(shader_transparent_shadow(kg, isect)) {
+ const int transparent_max_bounce = kernel_data.integrator.transparent_max_bounce;
+ const uint max_hits = transparent_max_bounce - state->transparent_bounce - 1;
+ if(max_hits + 1 < SHADOW_STACK_MAX_HITS) {
+ return shadow_blocked_all(kg, shadow_sd, state, ray, shadow);
+ }
+ }
+ }
+# endif /* __KERNEL_CPU__ */
+#endif /* __SHADOW_RECORD_ALL__ */
 #ifndef __KERNEL_CPU__
- return shadow_blocked_stepped(kg, shadow_sd, state, ray, shadow);
+ return shadow_blocked_stepped(kg, shadow_sd, state, ray, shadow, blocked, isect);
 #endif
 }

Yet the patch still makes koro.blend much faster.
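
The control flow of the snippet on the GPU side can be summarized with the following sketch. It only models the decision, not the ray casts: ShadowPath and choose_shadow_path are hypothetical names, and stack_max_hits stands for SHADOW_STACK_MAX_HITS, whose value is not given here.

```cpp
#include <cassert>

// Which code path handles the shadow ray.
enum class ShadowPath { Unblocked, Opaque, RecordAll, Stepped };

// hit_anything:          result of the cheap opaque-visibility ray cast
// first_hit_transparent: whether the shader at that hit casts a transparent shadow
ShadowPath choose_shadow_path(bool hit_anything,
                              bool first_hit_transparent,
                              bool transparent_shadows_enabled,
                              int transparent_max_bounce,
                              int transparent_bounce,
                              int stack_max_hits) {
    if (!hit_anything) {
        return ShadowPath::Unblocked;  // nothing between the point and the light
    }
    if (!transparent_shadows_enabled || !first_hit_transparent) {
        return ShadowPath::Opaque;  // fully blocked, no further traversal needed
    }
    // Same test as max_hits + 1 < SHADOW_STACK_MAX_HITS in the snippet,
    // with max_hits = transparent_max_bounce - transparent_bounce - 1.
    const int remaining_bounces = transparent_max_bounce - transparent_bounce;
    assert(remaining_bounces >= 0);
    if (remaining_bounces < stack_max_hits) {
        return ShadowPath::RecordAll;  // all possible hits fit in the stack
    }
    return ShadowPath::Stepped;  // fall back to one traversal per hit
}
```

Note that in the snippet the stepped path reuses the intersection found by the first ray cast, so the extra ray is not wasted there.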

Now, why is it so:

  • First guess was that it's the order of switch() and for() in the traversal functions (bvh_traversal does the loop inside of a case, shadow_all does it the other way around). But applying an updated version of D830 (to minimize the amount of duplicated code) and swapping the order of switch() and for() did not change anything. D830 on its own did not seem to give any measurable difference here, in fact.
  • Second guess was that it's the difference in how we invoke traversals. For some reason, scene_intersect does not check kernel_data.bvh.have_instancing on GPU and always uses bvh_intersect_instancing() if __INSTANCING__ is defined. Changing this for scene_intersect_shadow_all() again did not give any measurable speed difference.

I'm running out of ideas as to why an extra ray cast makes things faster here, and wouldn't mind having a second pair of eyes on this.

Updating against latest master

Sergey Sharybin (sergey) retitled this revision from Cycles: Cycles: Speedup transparent shadows on CUDA to Cycles: Speedup transparent shadows on CUDA. Jan 27 2017, 6:25 PM

It seems that there is a difference between CPU and CUDA currently, and this patch fixes it.

[Images: CPU | CUDA before | CUDA after]

I'm still seeing a slowdown in the classroom scene though.

GTX 1080 render time
  Fishy Cat            +4.5%
  Pabellon Barcelona   -15.2%

@Brecht Van Lommel (brecht), indeed there's something with shadows on CUDA. I think it's because the ray is pushed too far away at some point when doing stepped traversal.

What was the GPU you tested performance on? Did you apply the snippet from my comment above?

The above times are without the snippet above. With the snippet I can confirm the slowdown is gone on the other scenes, but Koro renders wrong. This is using GTX 1080 on Linux.

Getting quite interesting. I definitely had no such bug, with a similar setup (1080 on Linux). Will dig deeper next week!

Sergey Sharybin (sergey) updated this revision to Diff 8212. Edited Jan 30 2017, 4:26 PM

This update has various fixes, mainly the following:

Avoid duplicated logic in parent and children scope

This reduces register pressure a bit and seems to give a barely measurable speedup. The code has a few more function calls now, but hopefully they are still easy to follow.

This is always a tradeoff between nice readable code and ultimate performance, by the looks of it.

Fix regressions rendering volumes

There were various artifacts caused by using wrong state to update the volume stack.

Hopefully fix speed regression

Apparently it's still faster to shoot an opaque ray and see whether it hits an opaque surface or not. Not sure why that is: the number of traversed BVH nodes and intersection tests is the same when the ray hits something opaque, but for some reason scene_intersect does it much faster.

My guess here is that it's caused by the amount of spill bytes and the number of spill loads, which kill performance for the shadow_all BVH traversal.

In any case, it should still be faster than the previous implementation, so the proposal is to keep this extra check unless it causes some real speed regressions.

Prepared code for the split kernel

Did various tweaks here and there to make OpenCL happy. Unfortunately, making the split kernel work is not yet easy, mainly because of the annoying address spaces and the lack of an auto address space.

Here is a patch, P439, which adds split kernel support, but it's perhaps better to apply it separately later, with some trick to avoid duplicating the instance push/pop functions.

On NVIDIA OpenCL i've got a 2.6 times speedup on koro.blend. Memory usage was higher, but i think that could be lowered by only using one of the coop arrays for intersections.

Seems to be working correctly, and performance is still great.

GTX 1080 render time
  Fishy Cat            -0.0%
  Pabellon Barcelona   -11.6%

I couldn't spot bugs in the new code, LGTM.

This revision is now accepted and ready to land. Feb 1 2017, 4:31 AM

Committed as a series of patches to simplify bisecting and such.

Will finish D2249 later to bring memory usage down and implement record-all behavior on OpenCL in the coming days.

@Brecht Van Lommel (brecht), thanks for the review and tests!