Cycles: Add multi-scattering, energy-conserving GGX as an option to the Glossy, Anisotropic and Glass BSDFs

Authored by Lukas Stockner (lukasstockner97) on May 18 2016, 1:47 AM.



This patch adds a new distribution to the Glossy, Anisotropic and Glass BSDFs that implements the
multiple-scattering microfacet model described in "Multiple-Scattering Microfacet BSDFs with the Smith Model".

Essentially, the improvement is that unlike classical GGX, which only models single scattering and sets
the contribution of multiple bounces to zero, this new model performs a random walk on the microsurface until
the ray leaves it again, which ensures perfect energy conservation.
In practise, this means that the "darkening problem" - GGX materials becoming darker with increasing
roughness - is solved in a physically correct and efficient way.

The downside of this model is that it has no (known) analytic expression for evalation. However, it can be
evaluated stochastically, and although the correct PDF isn't known either, the properties of MIS and the
balance heuristic guarantee an unbiased result at the cost of slightly higher noise.

OSL is not supported yet and there are a few ToDos, but I guess it's good enough for a first round of review.
Metallic BSDF coming up in another diff.

Diff Detail

rB Blender
Lukas Stockner (lukasstockner97) retitled this revision from to Cycles: Add multi-scattering, energy-conserving GGX as an option to the Glossy, Anisotropic and Glass BSDFs.May 18 2016, 1:47 AM

Nice work. Some minor notes inline.


The Multiscatter is not shown in the UI for Refraction, so why have this at all?


Is it needed to explicitly pass Color?


I don't really remember why I added it at all to be honest, I guess it can be removed.


Yes, since there might be multiple bounces on the microsurface, and the throughput has to be multiplied by the color at each bounce - an interesting side effect of that is that rougher materials get more saturated colors.

Brecht Van Lommel (brecht) requested changes to this revision.May 18 2016, 11:59 PM

Generally seems to fit in quite nicely, haven't tried building the patch yet or looked closely at the (quite daunting :) math.

But I guess this is derived from the reference implementation and if it passes the white furnace test of being perfectly energy conserving, then I'm not too worried about correctness issues.


I'm not sure if lgammaf is available in OpenCL, but we should probably try to compiling this code before committing.

And also see what kind of performance impact it has on CUDA / OpenCL kernels. I can do that too this weekend if you want.


Can be simplified to:

wm.z *= wm.z;
float x = (1.0f - wm.z) + alpha * alpha * wm.z;
return alpha / (M_PI_F * x * x);

if(!isfinite(slope_x)) is simpler.


I guess we could pass the full PathState and get stratified sampling, though not sure how much it will help. The other way to do it would be to pass in an lcg_state computed from the PathState similar to what we do for hair, and then generate random numbers from that.

I guess that means passing path state into bsdf_sample, which isn't too bad I think.


For the GPU it is better to have a single function call to mf_eval_glossy, to avoid divergence and reduce code size when inlining. There's a few cases like this in the code, probably best to change them all.


Function overloading doesn't work in OpenCL, so will need a different name than the float version.

This revision now requires changes to proceed.May 18 2016, 11:59 PM

Yes, the code is generally based on the reference implementation, but with a lot of modifications to optimize calculations, get rid of unused values and generally get all of the phase functions into one sample/eval loop.
I'll also try to match the code to the equations from the paper as good as possible and note it in comments, that might help later (since some of the calculations in here don't really resemble the original equations/code at all anymore).


It seems that OpenCL does have the function, but it's called lgamma there - so, I guess the best option is to define a wrapper called lgammaf for OpenCL.
Regarding performance impact: You mean the general slowdown when adding the patch but not using the BSDFs? Hm, I haven't tested that yet...


Ah, yes, of course - thanks for the reminder :)


I'm actually not sure whether that test is needed at all in release builds, a simple kernel_assert(isfinite(slope_x)) might be enough - I'll run some renders in a debug build to check whether the assert is ever hit.


Yes, makes sense, I'll change it.


Okay, I guess saturate3 would be fine?

Okay, most issues should be fixed, except for the additional comments and the RNG - I'll probably do these tomorrow.
It turns out that the explicit handling of MULTI_GGX in blender_shader.cpp is needed because GCC complains about unhandled
cases otherwise.

Now the RNG handling is cleaned up as well - the ShaderData now contains "uint lcg_state", which
is initialized when a closure needs it (determined via a closure flag) and is used by the sampling
and eval code.

Brecht Van Lommel (brecht) requested changes to this revision.Jun 11 2016, 4:22 PM

Seems generally fine to me, I'll run some GPU performance benchmarks.


The previous code inverted eta, where does that happen now?


We don't need this for the background.


We don't need this for displacement.

This revision now requires changes to proceed.Jun 11 2016, 4:22 PM

Can be simplified to:

float slope_x = wm.x/alpha.x;
float slope_y = wm.y/alpha.y;
float tmp =  wm.z*wm.z + (slope.x*slope.x + slope_y*slope_y);
return 1.0f / (M_PI_F * tmp*tmp * alpha.x*alpha.y);

More optimizatons are possible here and in other parts of the code, but probably best to not get distracted with that now, can always be done later.


Computing float inv_eta = 1.0f/eta; once and then multiplying would be a little faster.


This increases GPU stack memory usage , because float3 is actually a float4 and needs to be stored 16 bytes aligned.

For now maybe best to keep custom1, custom2, custom3 and live with the extra code until we properly refactor this code so closures can store arbitrary amounts of data.


Should this be data_node.w or is the assert wrong?

Lukas Stockner (lukasstockner97) added inline comments.

Oh, right, I totally missed that one. Thanks!


Well, my original intention was to have it everywhere in case we use the LCG elsewhere in the future, but for now I guess having it in shader_eval_surface is indeed enough.

Lukas Stockner (lukasstockner97) marked 2 inline comments as done.

Addressed review, rebased to current master.

Nice work. Here a side by side comparison, rendered with latest patch.

I ran some benchmarks and it seems this patch is slowing down GPU rendering even if this BSDF is not used. For example with my GTX 960 and BMW scene it's about 3% slower and it's similar for other scenes.

The reason I guess is increased register spilling. As I understand it the compiler does register spilling taking into account the BSDF that uses the most registers. So if we add a new more expensive BSDF that affects registers spilling for all bsdf_eval and bsdf_sample calls, most of which will not end up actually calling multi scatter GGX.

I tried uninlining the multi scatter GGX functions but that makes it even slower. Presumably because then the compiler knows less about what happens in the unlinined function and must spill even more registers or assume some register need to be reloaded from local memory because it doesn't know if the data changed.

I don't know of compiler flags or attributes to control this directly. We can try to tweak the code to reduce register usage, and if that isn't enough somehow force it to use local memory over register, potentially slowing it down a bit but speeding up other BSDFs. That's the theory anyway but so far I didn't success trying to tweak this code. Maybe someone else has ideas for how to solve this, or we just have to accept it as we have before with other features.

A split kernel might eventually help with this, but it may well have the same issue as we wouldn't split this BSDF into a separate kernel from the other BSDFs.

I would accept that little slowdown. With GPUs it's always some up and down. We made some significant memory improvements since 2.77, so some little downsides are ok now imho.

3% slowdown is perfectly acceptable imo

Many thanks @Thomas Dinges (dingto)
I'll run some tests in coming days, just played a bit and it seems Refraction BSDF still compute standard GGX instead of Multiscatter (same exact results with the two different models), not sure if it's intentional at the moment.

This won't work with the Refraction BSDF. For multiple scattering (and correct handling of roughness) reflection, refraction and Fresnel must all be handled inside a single BSDF.

(In fact we should probably remove the Refraction BSDF for Blender 2.8.)

Brecht, just wanting to ask.

If the Refraction BSDF is removed in Blender 2.8, then how will we be able to have glass materials with a fully custom fresnal component?

I for one find it useful in varying cases because in some instances, the regular glass material is too reflective at grazing angles and it provides no way to change that. Also, sometimes it's desired for a material to have a simple refractive component without the reflection (which can't be done with the glass node).

My opinion, either leave the refraction node as is or add the ability to enhance or reduce (or even remove) the reflection component in the glass node.

We can discuss that when we actually do that change, I don't want to go off topic here. Mainly the point I wanted to make is that the Refraction BSDF is conceptually broken, and trying to fit multiple scattering into it now is not so useful if bigger changes are needed.

it's a great plus. If one node can not profit from it, it's still a big plus. Commit :)

I think the slowdown is acceptable. I found a few tweaks that reduce it to maybe 2-2.5%:

1diff --git a/intern/cycles/kernel/closure/bsdf_microfacet_multi_impl.h b/intern/cycles/kernel/closure/bsdf_microfacet_multi_impl.h
2index ba03b31..eb15547 100644
3--- a/intern/cycles/kernel/closure/bsdf_microfacet_multi_impl.h
4+++ b/intern/cycles/kernel/closure/bsdf_microfacet_multi_impl.h
5@@ -59,26 +59,20 @@ ccl_device float3 MF_FUNCTION_FULL_NAME(mf_eval)(float3 wi, float3 wo, const boo
7​ const float2 alpha = make_float2(alpha_x, alpha_y);
9- float3 wr = -wi;
10- float lambda_r = mf_lambda(wr, alpha);
11- float hr = 1.0f;
12- float C1_r = 1.0f;
13- float G1_r = 0.0f;
14- bool outside = true;
15+ float lambda_r = mf_lambda(-wi, alpha);
16​ float shadowing_lambda = mf_lambda(wo_outside? wo: -wo, alpha);
18​ /* Analytically compute single scattering for lower noise. */
19- float3 singleScatter;
20​ /* TODO(lukas): Multiply with color here? */
21​ #ifdef MF_MULTI_GLASS
22- singleScatter = mf_eval_phase_glass(wr, lambda_r, wo, wo_outside, alpha, eta);
23+ float3 eval = mf_eval_phase_glass(-wi, lambda_r, wo, wo_outside, alpha, eta);
24​ if(wo_outside)
25- singleScatter *= -lambda_r / (shadowing_lambda - lambda_r);
26+ eval *= -lambda_r / (shadowing_lambda - lambda_r);
27​ else
28- singleScatter *= -lambda_r * beta(-lambda_r, shadowing_lambda+1.0f);
29+ eval *= -lambda_r * beta(-lambda_r, shadowing_lambda+1.0f);
30​ #elif defined(MF_MULTI_DIFFUSE)
31​ /* Diffuse has no special closed form for the single scattering bounce */
32- singleScatter = make_float3(0.0f, 0.0f, 0.0f);
33+ float3 eval = make_float3(0.0f, 0.0f, 0.0f);
34​ #else /* MF_MULTI_GLOSSY */
35​ const float3 wh = normalize(wi+wo);
36​ const float G2 = 1.0f / (1.0f - (lambda_r + 1.0f) + shadowing_lambda);
37@@ -87,16 +81,30 @@ ccl_device float3 MF_FUNCTION_FULL_NAME(mf_eval)(float3 wi, float3 wo, const boo
38​ val *= D_ggx(wh, alpha.x);
39​ else
40​ val *= D_ggx_aniso(wh, alpha);
42+ float3 eval;
43​ if(n && k) {
44- singleScatter = fresnel_conductor(dot(wh, wi), *n, *k) * val;
45+ eval = fresnel_conductor(dot(wh, wi), *n, *k) * val;
46​ }
47​ else {
48- singleScatter = make_float3(val, val, val);
49+ eval = make_float3(val, val, val);
50​ }
51​ #endif
53- float3 multiScatter = make_float3(0.0f, 0.0f, 0.0f);
54​ float3 throughput = make_float3(1.0f, 1.0f, 1.0f);
56+ if(swapped) {
57+ const float fac = fabsf(wi.z / wo.z);
58+ throughput = make_float3(fac, fac, fac);
59+ eval *= fac;
60+ }
62+ float3 wr = -wi;
63+ float hr = 1.0f;
64+ float C1_r = 1.0f;
65+ float G1_r = 0.0f;
66+ bool outside = true;
68​ for(int order = 0; order < 10; order++) {
69​ /* Sample microfacet height and normal */
70​ if(!mf_sample_height(wr, hr, C1_r, G1_r, lambda_r, lcg_step_float(lcg_state)))
71@@ -107,23 +115,20 @@ ccl_device float3 MF_FUNCTION_FULL_NAME(mf_eval)(float3 wi, float3 wo, const boo
72​ if(order == 0) {
73​ /* Compute single-scattering for diffuse. */
74​ const float G2_G1 = -lambda_r / (shadowing_lambda - lambda_r);
75- multiScatter += throughput * G2_G1 * mf_eval_phase_diffuse(wo, wm);
76+ eval += throughput * G2_G1 * mf_eval_phase_diffuse(wo, wm);
77​ }
78​ #endif
79​ if(order > 0) {
80​ /* Evaluate amount of scattering towards wo on this microfacet. */
81​ float3 phase;
82​ #ifdef MF_MULTI_GLASS
83- if(outside)
84- phase = mf_eval_phase_glass(wr, lambda_r, wo, wo_outside, alpha, eta);
85- else
86- phase = mf_eval_phase_glass(wr, lambda_r, -wo, !wo_outside, alpha, 1.0f/eta);
87+ phase = mf_eval_phase_glass(wr, lambda_r, (outside)? wo: -wo, (outside)? wo_outside: !wo_outside, alpha, (outside)? eta: 1.0f/eta);
88​ #elif defined(MF_MULTI_DIFFUSE)
89​ phase = mf_eval_phase_diffuse(wo, wm);
90​ #else /* MF_MULTI_GLOSSY */
91​ phase = mf_eval_phase_glossy(wr, lambda_r, wo, alpha, n, k) * throughput;
92​ #endif
93- multiScatter += throughput * phase * mf_G1(wo_outside? wo: -wo, mf_C1((outside == wo_outside)? hr: -hr), shadowing_lambda);
94+ eval += throughput * phase * mf_G1(wo_outside? wo: -wo, mf_C1((outside == wo_outside)? hr: -hr), shadowing_lambda);
95​ }
96​ if(order+1 < 10) {
97​ /* Bounce from the microfacet. */
98@@ -150,9 +155,6 @@ ccl_device float3 MF_FUNCTION_FULL_NAME(mf_eval)(float3 wi, float3 wo, const boo
99​ }
100​ }
102- float3 eval = singleScatter + multiScatter;
103- if(swapped)
104- eval *= fabsf(wi.z / wo.z);
105​ return eval;
106​ }


float3 &weight won't work with OpenCL C.


float &h won't work with OpenCL C.


float3 &wo won't work with OpenCL C.

Replaced references with pointers, moved variables around to reduce stack usage. Thanks for the hint, especially the singlescatter+multiscatter was totally useless :)

Forgot to remove the Multiscattering GGX option from the refraction node, sorry for the spam.

No more comments from me. There's still some To Do's in the code which you might plan to solve? I can't see bugs in the code those comments refer to, but I'm also not sure what the comments mean exactly.

This revision is now accepted and ready to land.Jun 21 2016, 8:54 PM

I actually forgot that OSL was still missing, so this version now comes with OSL support.
Unfortunately, I needed to implement custom classes for the Closures instead of just using the macros
because of the color parameter - the float3 coming from OSL has to be split up into the three custom variables.
If anyone has a suggestion how to do this with less code, please tell me :)

Also, I fixed a few remaining numerical problems in the code, checked the ToDos (none of them were relevant),
got rid of a reciprocal and noticed that the anisotropic PDF wasn't actually used, so I fixed that as well.

Tested latest patch, works fine for me (also with OSL). LGTM.

This revision was automatically updated to reflect the committed changes.