- Reduce size of IntegratorState
- Overlap SSS parameters with other memory
- Compress some floats as half float
- Reduce bits for integers where possible
- Dynamic volume stack size depending on scene contents
- Individual arrays for XYZ components of float3?
- Pack together 8bit/16bit values into 32bit?
- Reduce size of ShaderData
- Compute some differentials on the fly?
- Read some data directly from ray/intersection?
- Don't copy matrix if motion blur is disabled
- Auto detect good integrator state size depending on GPU (hardcoded to 1 million now)
- Automatically increase integrator state size depending on available memory
- Replace megakernel used for last few samples? Only helps about 10% with viewport render and barely with batch render, but makes OptiX runtime compilation slow.
- Enqueue all kernels for a single path iteration at once?
- Device side enqueue? Not likely for OptiX.
- Make svm_eval_nodes a templated function and specialize it for
- Verify if more specialization is possible in svm_eval_nodes
- Deduplicate shader evaluation call in shade_background
- Try pushing (part of) integrator state in queues rather than persistent location, for more coherent memory access
- Check potential performance benefit by coalescing state
- Shadow paths
- Main path
- Make shadow paths fully independent so they do not block main path kernels?
- Accurately count number of available paths for scheduling additional work tiles
- Tweak work tiles or pixel order to improve coherence (many small tiles, Z-order, ..)
- Try other shader sorting techniques (but will be more expensive than bucket sort)
- Take into account object ID
- Take into account random number for BSDF/BSSRDF selection?
- Overlapped kernel execution
- Use multiple independent GPU queues? (so far was found to be 15% slower)
- Use multiple GPU queues to schedule different kernels?
- Optimize active index, sorting and prefix sum kernels
- Parallelize prefix_sum
- Build active/sorted index array for a specific kernel based on another array indicating active paths for all kernels, especially when number of paths is small
- Transparent shadows: tune max hits for performance / memory usage
- Transparent shadows: can we terminate OptiX rays earlier when enough hits are found?
- For final render, let Cycles draw the image instead of copying pixels to Blender
- Render pass support
- Adaptively determine integrator state size based on number of GPU cores and available memory
- Tweak kernel compilation parameters (num threads per block, max registers)
- Different parameters per kernel?
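
For reference when looking at the sorting and prefix sum items above, here is a minimal CPU-side sketch of how a sorted active index array can be built with a bucket sort keyed on shader ID. It is not the Cycles implementation: the function name `build_sorted_active_indices`, the convention that a negative shader ID marks an inactive path, and the use of `std::vector` are all assumptions made for illustration. On the GPU, the counting and scatter steps would typically run in parallel with atomics, and the prefix sum is the serial step that the "Parallelize prefix_sum" item refers to.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not Cycles code): returns the indices of all active
// paths, grouped by shader ID. Assumes shader_of_path[i] < 0 marks an
// inactive path and every active ID is in [0, num_shaders).
std::vector<int> build_sorted_active_indices(const std::vector<int> &shader_of_path,
                                             int num_shaders)
{
  // 1. Count active paths per shader bucket (shifted by one slot).
  std::vector<int> offsets(num_shaders + 1, 0);
  for (int shader : shader_of_path) {
    if (shader >= 0) {
      offsets[shader + 1]++;
    }
  }

  // 2. Prefix sum: offsets[s] becomes the start of bucket s,
  //    offsets[num_shaders] the total number of active paths.
  for (int s = 0; s < num_shaders; s++) {
    offsets[s + 1] += offsets[s];
  }

  // 3. Scatter path indices into their buckets. Iterating in path order
  //    keeps the sort stable within each bucket.
  std::vector<int> sorted(offsets[num_shaders]);
  std::vector<int> cursor(offsets.begin(), offsets.end() - 1);
  for (std::size_t i = 0; i < shader_of_path.size(); i++) {
    const int shader = shader_of_path[i];
    if (shader >= 0) {
      sorted[cursor[shader]++] = static_cast<int>(i);
    }
  }
  return sorted;
}
```

The same structure extends to a finer sort key (for example one that also takes object ID into account) by widening the bucket range, at the cost of a longer prefix sum.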