## Memory Usage
[ ] Reduce size of IntegratorState
[ ] Overlap SSS parameters with other memory
[ ] Compress some floats as half float
[ ] Reduce bits for integers where possible
[ ] Dynamic volume stack size depending on scene contents
[ ] SoA
[ ] Individual arrays for XYZ components of float3?
[ ] Pack together 8bit/16bit values into 32bit?
[ ] Reduce size of ShaderData
[ ] Compute some differentials on the fly?
[ ] Read some data directly from ray/intersection?
[ ] Don't copy matrix if motion blur is disabled
[ ] Auto-detect a good integrator state size depending on the GPU (currently hardcoded to 1 million)
[ ] Automatically increase integrator state size depending on available memory
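The SoA and bit-packing items above can be illustrated with a small CPU-side sketch. It is only an illustration under stated assumptions: `PathCounters`, `PathStateSoA`, the field names and the bit widths are hypothetical and do not reflect the actual `IntegratorState` layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-path counters. Names and bit widths are illustrative only,
// not the real IntegratorState fields.
struct PathCounters {
  uint8_t bounce;
  uint8_t transparent_bounce;
  uint16_t flags;
};

// Pack into one 32-bit word:
// bits 0-7 bounce, bits 8-15 transparent_bounce, bits 16-31 flags.
inline uint32_t pack_counters(const PathCounters &c)
{
  return uint32_t(c.bounce) | (uint32_t(c.transparent_bounce) << 8) |
         (uint32_t(c.flags) << 16);
}

inline PathCounters unpack_counters(uint32_t packed)
{
  PathCounters c;
  c.bounce = uint8_t(packed & 0xFFu);
  c.transparent_bounce = uint8_t((packed >> 8) & 0xFFu);
  c.flags = uint16_t(packed >> 16);
  return c;
}

// Structure-of-arrays storage: one array per component, so reads of the same
// component for consecutive path indices touch consecutive memory.
struct PathStateSoA {
  std::vector<float> P_x, P_y, P_z;  // separate arrays for the XYZ components
  std::vector<uint32_t> counters;    // packed 8/16-bit values, one word per path

  explicit PathStateSoA(size_t num_paths)
      : P_x(num_paths), P_y(num_paths), P_z(num_paths), counters(num_paths)
  {
  }

  void set(size_t i, float x, float y, float z, const PathCounters &c)
  {
    assert(i < counters.size());
    P_x[i] = x;
    P_y[i] = y;
    P_z[i] = z;
    counters[i] = pack_counters(c);
  }
};
```

Packing trades a few shift and mask instructions for fewer memory transactions per path. Whether that pays off on a given GPU would need to be measured.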
## Kernel Size
[ ] Replace megakernel used for last few samples? It only helps about 10% with viewport render and barely with batch render, but makes OptiX runtime compilation slow.
[ ] Enqueue all kernels for a single path iteration at once?
[ ] Device side enqueue? Not likely for OptiX.
[ ] Make `svm_eval_nodes` a templated function and specialize it for:
  [ ] Background
  [ ] Lights
  [ ] Shadows
  [ ] Shader-raytracing
  [ ] Volumes
[ ] Deduplicate shader evaluation call in `shade_background`
[ ] Try pushing (part of) integrator state in queues rather than persistent location, for more coherent memory access
[ ] Check potential performance benefit by coalescing state
  [ ] Shadow paths
  [ ] Main path
[ ] Make shadow paths fully independent so they do not block main path kernels?
[ ] Accurately count number of available paths for scheduling additional work tiles
[ ] Tweak work tiles or pixel order to improve coherence (many small tiles, Z-order, ..)
[ ] Try other shader sorting techniques (but will be more expensive than bucket sort)
[ ] Take into account object ID
[ ] Take into account random number for BSDF/BSSRDF selection?
[ ] Overlapped kernel execution
[ ] Use multiple independent GPU queues? (so far was found to be 15% slower)
[ ] Use multiple GPU queues to schedule different kernels?
[ ] Optimize active index, sorting and prefix sum kernels
[ ] Parallelize prefix_sum
[ ] Build active/sorted index array for a specific kernel based on another array indicating active paths for all kernels, especially when number of paths is small
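As a reference for the active index and prefix sum items above, here is a minimal serial sketch of building a compact index array for one kernel from a per-path array that covers all kernels. The names `exclusive_prefix_sum`, `build_active_indices` and `queued_kernel`, and the encoding of kernel IDs as integers, are assumptions for illustration. A GPU version would replace the serial scan with a parallel one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i - 1].
// Returns the total sum of all input elements.
inline uint32_t exclusive_prefix_sum(const std::vector<uint32_t> &in,
                                     std::vector<uint32_t> &out)
{
  out.resize(in.size());
  uint32_t sum = 0;
  for (size_t i = 0; i < in.size(); i++) {
    out[i] = sum;
    sum += in[i];
  }
  return sum;
}

// Build a compact array of path indices queued for one kernel.
// queued_kernel[i] is the (hypothetical) ID of the kernel that path i waits for.
inline std::vector<uint32_t> build_active_indices(
    const std::vector<uint32_t> &queued_kernel, uint32_t kernel)
{
  // Step 1: flag the paths that are queued for this kernel.
  std::vector<uint32_t> flags(queued_kernel.size());
  for (size_t i = 0; i < queued_kernel.size(); i++) {
    flags[i] = (queued_kernel[i] == kernel) ? 1u : 0u;
  }

  // Step 2: the scan gives each active path its slot in the output array.
  std::vector<uint32_t> offsets;
  const uint32_t num_active = exclusive_prefix_sum(flags, offsets);

  // Step 3: scatter the path indices into their slots.
  std::vector<uint32_t> indices(num_active);
  for (size_t i = 0; i < flags.size(); i++) {
    if (flags[i]) {
      assert(offsets[i] < num_active);
      indices[offsets[i]] = uint32_t(i);
    }
  }
  return indices;
}
```

Steps 1 and 3 are independent per path, so the scan in step 2 is the only part that needs a dedicated parallel algorithm.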
## Render Algorithms
[ ] Transparent shadows: tune max hits for performance / memory usage
[ ] Transparent shadows: can we terminate OptiX rays earlier when enough hits are found?
[ ] For final render, let Cycles draw the image instead of copying pixels to Blender
[ ] Render pass support
[ ] Adaptively determine integrator state size based on number of GPU cores and available memory
[ ] Tweak kernel compilation parameters (num threads per block, max registers)