Cycles: Implement stackless BVH traversal
Changes Planned · Public

Authored by Sergey Sharybin (sergey) on May 27 2016, 4:16 PM.



This is expected to give memory savings on the GPU, hopefully without
a speed penalty.

Comes quite directly from the following paper:

Current code is slower, most likely because it lacks the closest-child
traversal heuristic (other than that, it's expected to do the same
number of ray-to-AABB intersections). I'll look into solving this
regression if someone confirms a measurable GPU memory improvement,
because for now memory usage on GTX 760 and C2075 stays the same
before and after the patch.
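For reference, the core of a Hapala-style stackless traversal can be sketched as below. This is a simplified CPU illustration with hypothetical names, not the actual Cycles kernel code: parent links plus a three-state machine replace the traversal stack, and the "near" child is always the left one, i.e. no closest-child heuristic, matching the current state of the patch.

```cpp
#include <cassert>
#include <vector>

// Minimal stand-in for a BVH node; the real kernel structures differ,
// this only illustrates the traversal logic.
struct Node {
  int parent = -1;            // -1 for the root
  int left = -1, right = -1;  // -1/-1 for a leaf
  bool hit = true;            // stand-in for the actual ray/AABB test
};

static bool is_leaf(const Node &n) { return n.left < 0; }

static int sibling(const std::vector<Node> &bvh, int i) {
  const Node &p = bvh[bvh[i].parent];
  return i == p.left ? p.right : p.left;
}

// Stackless traversal: parent links plus a three-state machine
// (fromParent / fromSibling / fromChild) replace the stack.
std::vector<int> traverse(const std::vector<Node> &bvh, int root) {
  enum { FROM_PARENT, FROM_SIBLING, FROM_CHILD } state = FROM_PARENT;
  std::vector<int> leaves;   // hit leaves, in visit order
  int cur = bvh[root].left;  // assumes the root is an inner node
  for (;;) {
    switch (state) {
      case FROM_CHILD:
        if (cur == root)
          return leaves;  // traversal finished
        if (cur == bvh[bvh[cur].parent].left) {
          // Returned from the near child: visit the far sibling next.
          cur = sibling(bvh, cur);
          state = FROM_SIBLING;
        } else {
          // Returned from the far child: keep going up.
          cur = bvh[cur].parent;
        }
        break;
      case FROM_SIBLING:
        if (!bvh[cur].hit) {
          cur = bvh[cur].parent;
          state = FROM_CHILD;
        } else if (is_leaf(bvh[cur])) {
          leaves.push_back(cur);  // intersect primitives here
          cur = bvh[cur].parent;
          state = FROM_CHILD;
        } else {
          cur = bvh[cur].left;
          state = FROM_PARENT;
        }
        break;
      case FROM_PARENT:
        if (!bvh[cur].hit) {
          cur = sibling(bvh, cur);
          state = FROM_SIBLING;
        } else if (is_leaf(bvh[cur])) {
          leaves.push_back(cur);
          cur = sibling(bvh, cur);
          state = FROM_SIBLING;
        } else {
          cur = bvh[cur].left;
          state = FROM_PARENT;
        }
        break;
    }
  }
}
```

Note that a miss on an inner node skips its whole subtree, exactly like a stack-based traversal would, but backtracking walks parent links instead of popping a stack entry.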

Diff Detail

rB Blender

Ok, that's weird. ptxas shows a reasonable stack memory usage difference:

but even on 980Ti there's no visible VRAM difference during rendering.

Some tests with GTX 960 on Windows.

(Timing table: Path and Branched Path kernels, BMW and Pabellon scenes.)

Is there a paper that shows stackless traversal being faster than or matching stack-based traversal on the GPU? The ones I found all seem to show it's slower, but I also found no papers testing on AMD cards, where it might help more.

Wow, a 40 to 100 percent slowdown, that's bad and goes against the claims in the paper.

There is another interesting approach [1] which claims to be more efficient than Hapala's (which I've implemented). However, that one is still 7% slower.

So one possible way to move on would be to check whether this stackless approach gives improvements for the split kernel, where we've got many more parallel rays being shot. If it gives reasonable memory savings we could perhaps go with the Barringer approach (memory usage of the split kernel is huge currently :()

Another possibility would be to solve the CUDA memory usage by wrapping the intersection calls into a single function and passing the stack down to the actual traversal functions. That would work around multiple BVH stacks being allocated.
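The wrapping idea would look roughly like the sketch below. All names, the stack size, and the dummy traversal bodies are hypothetical, it only shows the API shape: one entry point owns the single stack allocation and every specialized traversal variant receives it as a parameter instead of declaring its own local array.

```cpp
#include <cassert>

// Illustrative stack size, not the actual Cycles constant.
enum { BVH_STACK_SIZE = 192 };

// Hypothetical traversal variants: each uses the caller's stack, so
// the compiler only has to reserve one stack's worth of local memory
// rather than one per traversal variant.
static bool bvh_intersect(int *stack) {
  int ptr = 0;
  stack[ptr++] = 0;       // push the root
  while (ptr > 0) ptr--;  // dummy traversal loop
  return true;            // pretend the ray found a hit
}

static bool bvh_intersect_shadow(int *stack) {
  int ptr = 0;
  stack[ptr++] = 0;
  while (ptr > 0) ptr--;
  return false;           // pretend the shadow ray escaped
}

// Single entry point that owns the one stack allocation and
// dispatches to the specialized traversal.
bool scene_intersect(bool shadow_ray) {
  int stack[BVH_STACK_SIZE];
  return shadow_ray ? bvh_intersect_shadow(stack)
                    : bvh_intersect(stack);
}
```

Whether the CUDA compiler actually merges the allocations this way would of course need to be verified with ptxas output.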


Right now ray traversal is not making any use of shared memory at all. There may be an opportunity here, either for storing a short stack or the bit mask from the Barringer paper in shared memory. If I understood the Barringer paper correctly, they compared a CPU implementation only. The differences between global/local memory access and shared memory access could give different results in a GPU implementation.
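To make the bit-trail idea concrete, here is a simplified CPU sketch in that spirit; the actual algorithm from the Barringer paper differs in details, and the heap indexing, names, and layout here are illustrative assumptions. The point is that a single machine word, which on the GPU could live in a register or in shared memory, replaces the stack of node indices, even with a dynamic near-child order. Depth is limited by the word width.

```cpp
#include <cassert>
#include <vector>

// Complete binary tree in 1-based heap order: node k has children
// 2k and 2k + 1, sibling k ^ 1, parent k >> 1.
struct Tree {
  std::vector<bool> hit;         // per node: did the ray hit its AABB?
  std::vector<bool> near_right;  // per node: is the near child the right one?
  unsigned leaf_begin;           // first leaf index
  bool is_leaf(unsigned k) const { return k >= leaf_begin; }
};

// Bit-trail traversal: bit 0 of 'trail' records whether the current
// node was entered as the near (0) or far (1) child of its parent,
// so backtracking needs no stack of node indices.
std::vector<int> traverse(const Tree &t) {
  std::vector<int> leaves;
  unsigned k = 1;      // current node, 1-based heap index
  unsigned trail = 0;  // bit 0 describes the current level
  for (;;) {
    if (t.hit[k - 1] && !t.is_leaf(k)) {
      // Descend into the near child first; its trail bit stays 0.
      k = 2 * k + (t.near_right[k - 1] ? 1u : 0u);
      trail <<= 1;
    } else {
      if (t.hit[k - 1] && t.is_leaf(k))
        leaves.push_back((int)k);  // intersect primitives here
      // Pop every level on which the far child was already visited.
      while (trail & 1) {
        k >>= 1;
        trail >>= 1;
      }
      if (k == 1)
        return leaves;  // back at the root: traversal finished
      k ^= 1;           // switch to the far sibling...
      trail |= 1;       // ...and remember that on the trail
    }
  }
}
```

A real implementation would of course use parent links or compact node addressing instead of a complete heap-ordered tree, but the trail bookkeeping stays the same.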

Sergey Sharybin (sergey) planned changes to this revision. Jul 14 2016, 3:05 PM

While it is an interesting experiment, with that slowdown it's nowhere near acceptable. I'll keep poking at other ideas here, but for now I'm marking this as changes planned so the patch does not annoy everyone involved in code review.