I'd like to unify path tracing and branched path tracing, first to deduplicate code and improve performance on GPUs, and then to try to combine the best sampling strategies from both. The big difference between the two is that branched path tracing samples all lights and closures, instead of just one.
The branched path code is not particularly efficient on either CUDA or OpenCL, and I think the best way to solve that is to use the same mechanism as we use for SSS in the path tracing megakernel, storing the state for each branch and restarting the path from that. This should also significantly simplify and deduplicate code.
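To make the "store state per branch and restart" idea concrete, here is a minimal sketch of a fixed-capacity branch stack. Everything in it is hypothetical: `BranchState`, its fields, `BranchStack`, `MAX_BRANCHES` and `trace_with_branches` are illustrative names, not existing Cycles code, and the real state would contain whatever the kernel needs to resume a path.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical compact per-branch state; the real kernel state will differ.
struct BranchState {
  float throughput[3];  // path contribution accumulated up to the branch point
  float origin[3];      // position to restart the ray from
  float direction[3];   // sampled outgoing direction for this branch
  unsigned rng_offset;  // where to resume the random number sequence
  int bounce;           // bounce count at the branch point
};

constexpr std::size_t MAX_BRANCHES = 16;  // assumed cap, not an existing setting

// Fixed-capacity stack, so the memory cost is known at compile time.
struct BranchStack {
  BranchState states[MAX_BRANCHES];
  std::size_t size = 0;

  bool push(const BranchState &state)
  {
    if (size == MAX_BRANCHES) {
      return false;  // full: the caller has to drop or subsample this branch
    }
    states[size++] = state;
    return true;
  }

  bool pop(BranchState &state)
  {
    if (size == 0) {
      return false;
    }
    state = states[--size];
    return true;
  }
};

// Toy driver: branch once at the first hit, then restart from each stored state.
// Returns how many branches were resumed.
int trace_with_branches(int num_samples)
{
  BranchStack stack;
  for (int i = 0; i < num_samples; i++) {
    BranchState state = {};
    state.rng_offset = static_cast<unsigned>(i);
    state.bounce = 1;
    if (!stack.push(state)) {
      break;  // cap reached, remaining samples are not branched
    }
  }

  int resumed = 0;
  BranchState state;
  while (stack.pop(state)) {
    // A real kernel would re-enter the regular path loop from `state` here.
    resumed++;
  }
  assert(stack.size == 0);
  return resumed;
}
```

With this layout, `trace_with_branches(4)` resumes 4 branches, while `trace_with_branches(100)` stops at the cap of 16.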
The immediate problem is stack memory usage. Users can specify an arbitrarily high number of diffuse/glossy/... samples, and each sample would require extra state to be stored. We can address that in two ways:
- Reduce the size of the state that must be stored. I believe it can be reduced to about 200 bytes per branch, and possibly 100-150 bytes with trickier optimizations.
- Cap the number of branches. With more branches there are quickly diminishing returns. With e.g. 16 branches, the cost of the camera ray trace + shader evaluation is amortized to 1/16th per branch, and at that point the cost of direct and indirect lighting is likely to be much bigger. The branched path GPU code is also already significantly slower. Extra AA samples to reduce aliasing, DoF and motion blur noise seem more useful at that point. So we could cap the number of branches, and if there are more closures, sample only a subset of them.
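One possible way to sample only a subset of closures once the cap is reached is sketched below. How the subset would actually be chosen is not decided here; this sketch assumes a uniform random choice without replacement, with each picked closure weighted by `num_closures / max_branches` so the expected contribution is unchanged. `BranchPick` and `pick_branches` are hypothetical names, and a real kernel would use its own RNG rather than `std::mt19937`.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical result: which closure to branch on and how to weight it.
struct BranchPick {
  std::size_t closure_index;
  float weight;
};

// Choose at most max_branches closures out of num_closures.
std::vector<BranchPick> pick_branches(std::size_t num_closures,
                                      std::size_t max_branches,
                                      std::mt19937 &rng)
{
  std::vector<BranchPick> picks;
  if (num_closures == 0 || max_branches == 0) {
    return picks;
  }

  // Under the cap: branch on every closure with weight 1.
  if (num_closures <= max_branches) {
    for (std::size_t i = 0; i < num_closures; i++) {
      picks.push_back({i, 1.0f});
    }
    return picks;
  }

  // Over the cap: partial Fisher-Yates shuffle picks a uniform subset.
  std::vector<std::size_t> indices(num_closures);
  std::iota(indices.begin(), indices.end(), std::size_t(0));

  // Each closure is picked with probability max_branches / num_closures,
  // so the inverse of that keeps the estimate unbiased.
  const float weight = float(num_closures) / float(max_branches);

  for (std::size_t i = 0; i < max_branches; i++) {
    std::uniform_int_distribution<std::size_t> dist(i, num_closures - 1);
    std::swap(indices[i], indices[dist(rng)]);
    picks.push_back({indices[i], weight});
  }
  assert(picks.size() == max_branches);
  return picks;
}
```

Uniform selection is the simplest choice; picking closures in proportion to their sample weight would likely give less noise, at the cost of more bookkeeping.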
Assuming we use 16 x 200 bytes = 3.2 KB, that's the same memory usage as the current SSS stack, and not that big compared to the total 21 KB of the CUDA path tracing kernel. In fact, I'm guessing the optimal branch factor for most scenes with GI is much lower than 16, but this needs to be tested with equal-time renders of production scenes, which I plan to do in this task.
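The budget arithmetic can be written down as a small compile-time check. The helper names below are hypothetical; the numbers are the estimates from this task, not measurements.

```cpp
#include <cstddef>

// Stack memory needed for a given branch cap and per-branch state size.
constexpr std::size_t branch_stack_bytes(std::size_t max_branches,
                                         std::size_t bytes_per_branch)
{
  return max_branches * bytes_per_branch;
}

// Largest branch cap that fits in a given stack budget.
constexpr std::size_t max_branches_for_budget(std::size_t budget_bytes,
                                              std::size_t bytes_per_branch)
{
  return bytes_per_branch == 0 ? 0 : budget_bytes / bytes_per_branch;
}

// 16 branches at the estimated 200 bytes each is 3200 bytes (3.2 KB).
static_assert(branch_stack_bytes(16, 200) == 3200, "unexpected stack size");

// If the state shrinks to 100 bytes, the same budget fits 32 branches.
static_assert(max_branches_for_budget(3200, 100) == 32, "unexpected cap");
```

If the state really can shrink below 200 bytes, the same 3.2 KB either allows a higher cap or, with the cap kept at 16, frees stack memory for the rest of the kernel.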