This implements Arvo's "Stratified sampling of spherical triangles". Similar to how we sample rectangular area lights, this is sampling triangles over their solid angle. It does significantly improve sampling close to the triangle, but doesn't do much for more distant triangles. So I added a simple heuristic to switch between the two methods. Unfortunately, I expect this to add render time in any case, even when it does not make any difference whatsoever. It'll take some benchmarking with various scenes and hardware to estimate how severe the impact is and if it is worth the change.
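For readers who don't have the paper at hand, here is a minimal standalone sketch of Arvo's sampling procedure. It is not the code from this patch: `Vec3`, `sample_spherical_triangle` and the helper functions are illustrative names, it uses double precision for clarity, and it assumes a non-degenerate triangle whose vertices have already been projected onto the unit sphere around the shading point.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator*(double s, Vec3 a) { return {s * a.x, s * a.y, s * a.z}; }
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(Vec3 a, Vec3 b) {
  return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
Vec3 normalize(Vec3 a) { return (1.0 / std::sqrt(dot(a, a))) * a; }
double safe_sqrt(double v) { return std::sqrt(std::max(v, 0.0)); }
double safe_acos(double v) { return std::acos(std::min(std::max(v, -1.0), 1.0)); }

const double kPi = 3.14159265358979323846;

// A, B, C: unit vectors from the shading point towards the triangle vertices.
// xi1, xi2: uniform random numbers in [0, 1].
// Returns a unit direction distributed uniformly over the spherical triangle;
// *solid_angle receives the triangle's solid angle (so the pdf is 1 / solid_angle).
Vec3 sample_spherical_triangle(Vec3 A, Vec3 B, Vec3 C,
                               double xi1, double xi2, double *solid_angle)
{
  assert(xi1 >= 0.0 && xi1 <= 1.0 && xi2 >= 0.0 && xi2 <= 1.0);

  // Normals of the planes containing the three great-circle edges.
  Vec3 nAB = normalize(cross(A, B));
  Vec3 nBC = normalize(cross(B, C));
  Vec3 nCA = normalize(cross(C, A));

  // Internal angles of the spherical triangle at A, B and C.
  double alpha = safe_acos(-dot(nCA, nAB));
  double beta = safe_acos(-dot(nAB, nBC));
  double gamma = safe_acos(-dot(nBC, nCA));

  // Girard's theorem: the area is the angle excess.
  double area = alpha + beta + gamma - kPi;
  *solid_angle = area;

  // Pick the sub-triangle (A, B, C_hat) with area xi1 * area.
  double area_hat = xi1 * area;
  double s = std::sin(area_hat - alpha);
  double t = std::cos(area_hat - alpha);
  double cos_alpha = std::cos(alpha);
  double sin_alpha = std::sin(alpha);
  double u = t - cos_alpha;
  double v = s + sin_alpha * dot(A, B);  // dot(A, B) is cos of the arc length AB
  double q = ((v * t - u * s) * cos_alpha - v) / ((v * s + u * t) * sin_alpha);
  q = std::min(std::max(q, -1.0), 1.0);

  // C_hat lies on the arc from A to C.
  Vec3 C_hat = q * A + safe_sqrt(1.0 - q * q) * normalize(C - dot(C, A) * A);

  // Pick the final point on the arc from B to C_hat.
  double z = 1.0 - xi2 * (1.0 - dot(C_hat, B));
  return z * B + safe_sqrt(1.0 - z * z) * normalize(C_hat - dot(C_hat, B) * B);
}
```

The three `normalize()` calls on cross products, the `acos` calls and the `sin`/`cos` pair are where the extra cost over plain area sampling comes from.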
This is great. Just like quad solid angle sampling, the improvement will be especially noticeable inside volumes.
You can pass NULL to avoid computing itfm.
This might not be a great estimate for shading points near the middle of long thin triangles. Using the distance from the point to the plane instead would have some false positives, but I expect it would still avoid most of the cost.
distance_to_plane = abs(dot(N, A)) / len(N)
The triangle area computation could be optimized, since it's 0.5f * len(N) and we already computed len_squared(N).
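To make the two suggestions above concrete, here is a small standalone sketch. It assumes N is the unnormalized triangle normal and A is the vector from the shading point to one of the vertices; `Vec3`, `PlaneInfo` and `triangle_plane_info` are illustrative names, not identifiers from the patch.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(Vec3 a, Vec3 b) {
  return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

struct PlaneInfo {
  double distance_to_plane;  // distance from P to the triangle's plane
  double area;               // triangle area
};

// P: shading point. V0, V1, V2: triangle vertices.
PlaneInfo triangle_plane_info(Vec3 P, Vec3 V0, Vec3 V1, Vec3 V2)
{
  Vec3 N = cross(V1 - V0, V2 - V0);  // unnormalized normal
  double len_squared_N = dot(N, N);
  assert(len_squared_N > 0.0);       // degenerate triangles are not handled here

  // One sqrt serves both the plane distance and the area.
  double len_N = std::sqrt(len_squared_N);
  Vec3 A = V0 - P;

  PlaneInfo info;
  info.distance_to_plane = std::fabs(dot(N, A)) / len_N;
  info.area = 0.5 * len_N;
  return info;
}
```

For a right triangle with legs of length 2 and a shading point 3 units above its plane, this gives a distance of 3 and an area of 2.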
Spaces after =.
I think this formula is wrong. There are two pdfs here that need to be multiplied together. One for sampling a point in the triangle:
pdf_triangle = t*t/(cos_pi * area_post)
And the other for picking a triangle in light_distribution_sample, which is the same as the solid angle case.
pdf_distribution = area_pre * kernel_data.integrator.pdf_triangles
triangle_light_pdf_area assumes area_pre and area_post cancel out, which they don't for motion blur. So I think the code here should just be this:
pdf *= area_pre / area_post;
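Spelled out as a standalone sketch, under the assumption that the existing area pdf amounts to t*t * pdf_triangles / cos_pi: the function names and flattened signatures below are simplified stand-ins for illustration, not the actual kernel functions, and `pdf_triangles` stands for kernel_data.integrator.pdf_triangles.

```cpp
#include <cassert>
#include <cmath>

// Stand-in for the area pdf that assumes area_pre and area_post cancel out.
double pdf_area_no_motion(double t, double cos_pi, double pdf_triangles)
{
  assert(cos_pi > 0.0);
  return t * t * pdf_triangles / cos_pi;
}

// Product of the two pdfs written out explicitly:
//   pdf_triangle     = t*t / (cos_pi * area_post)
//   pdf_distribution = area_pre * pdf_triangles
double pdf_area_explicit(double t, double cos_pi,
                         double area_pre, double area_post, double pdf_triangles)
{
  assert(cos_pi > 0.0 && area_post > 0.0);
  double pdf_triangle = t * t / (cos_pi * area_post);
  double pdf_distribution = area_pre * pdf_triangles;
  return pdf_triangle * pdf_distribution;
}

// The suggested fix: reuse the existing pdf and correct for the area change.
double pdf_area_with_motion(double t, double cos_pi,
                            double area_pre, double area_post, double pdf_triangles)
{
  assert(area_post > 0.0);
  double pdf = pdf_area_no_motion(t, cos_pi, pdf_triangles);
  pdf *= area_pre / area_post;
  return pdf;
}
```

Both forms agree for any inputs, and when area_pre equals area_post the correction factor is 1, so the non-motion case is unchanged.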
We can immediately return here.
I think it would be simpler and faster to share the computation of ls->Ng and ls->shader with the solid angle case, moving that to the start of this function. ls->P can always be computed like the has_motion case.
A new update, taking Brecht's comments into account and a few other improvements.
I'm not sure if there is a perfect heuristic to switch between the two sampling strategies - in certain cases, I could make the line where the switch happens visible as a sudden change in noise - still, at an overall better quality than before the patch.
This one's not quite ready yet. I'm seeing odd artifacts when using this on CUDA hardware that don't show up when rendering on the CPU. I can't say yet what's causing them. This needs some investigation.
One more round of improvements. A few optimisations, and a change to the heuristic for switching between sampling strategies. It now looks at the triangle's edge lengths instead of its area, which should hopefully help with long and thin triangles.
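To illustrate the kind of test this could be, here is a purely hypothetical sketch that compares the longest edge against the distance to the triangle's plane (the estimate suggested earlier in the review). The function name, the `threshold` parameter and the choice of distance measure are all assumptions for illustration and may not match what the patch actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(Vec3 a, Vec3 b) {
  return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Hypothetical switch: use solid angle sampling when the triangle's longest
// edge is large relative to the distance from P to the triangle's plane.
bool use_solid_angle_sampling(Vec3 P, Vec3 V0, Vec3 V1, Vec3 V2, double threshold)
{
  assert(threshold > 0.0);

  Vec3 e0 = V1 - V0;
  Vec3 e1 = V2 - V0;
  Vec3 e2 = V2 - V1;
  double longest_edge_squared = std::max(dot(e0, e0), std::max(dot(e1, e1), dot(e2, e2)));

  Vec3 N = cross(e0, e1);  // unnormalized normal
  double n_dot_a = dot(N, V0 - P);

  // Equivalent to: longest_edge > threshold * abs(dot(N, A)) / len(N),
  // with both sides squared and multiplied by len_squared(N) to avoid sqrt.
  // A degenerate triangle (N == 0) always falls back to area sampling.
  return longest_edge_squared * dot(N, N) > threshold * threshold * n_dot_a * n_dot_a;
}
```

Because a long thin triangle still has a long edge, a test like this keeps choosing solid angle sampling in cases where an area-based test would switch away too early.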
About the cost of the improved sampling:
The BMW benchmark scene, which is lit entirely by a large mesh light, goes on my machine from 11m 6s to 11m 50s. Since those are large mesh lights, pretty much every pixel in the scene is using the more expensive sampling path. The benefits of the better sampling are however not visible, since most of the noise in that scene comes from glossy reflections, not the area light.
Most of the performance penalty appears to come from the trig functions. I did experiment with using SSE to vectorise the calls to normalise(), but that didn't do much other than make the code less readable. If anyone else has suggestions for improvements, I'm all ears.
I don't immediately have a good suggestion to optimize this. It's possible in principle to use SIMD for the cross product, normalize and fast_acos (which is likely the slowest part), but that's not so simple to implement. To me performance seems acceptable.