Moved the kernels back into kernel/.
The intuition back then was to have them as a separate complation unit to allow for faster development iteration, but that's of course also possible when they're inside kernel/, so it was kind of pointless.
Much better, thanks!
I'm not going to have time to review these filter kernels in detail before next bcon date, but I guess that's fine and we can go ahead committing this anyway. The way they are integrated with the rest of the code seems ok to me, so I'll only leave some minor comments here.
I think there's potential here for deduplicating code, switching to float4 once that is unified with ssef. I feel like we can also better abstract away differences between CUDA/OpenCL/CPU still, not really specific to the denoising code though. But that's for another time I think.
This doesn't seem to be used.
This seems to be unused.
Can we rename this to ccl_restrict_ptr? I think this has a specific meaning and "readonly" doesn't quite cover it.
I'm curious if you measured any performance improvements from using restrict? Every time I try it, it doesn't seem to help, but maybe because we already try to write code to avoid reading from the same memory location multiple times.
I don't remember the exact figures, but at least back when I added ccl_readonly_ptr, it improved performance on CUDA significantly, at least by 20%.
The denoiser kernels are much more memory-bound compared to the path tracing kernels (to the point where moving the feature vector to shared memory doubles performance), so it makes sense that adding const restrict (which allows the CUDA compiler to use the L1 cache iirc) helps more for them.