Eevee: Shader recompilation issue
Open, Normal, Public

Description

The scene can be as simple as the startup scene with the default cube. Enable "Use Nodes" for the cube material and drag any of the Material Output node parameters. The slider dragging is not a smooth experience at all.

Details

Type
Bug

I did some investigation, and I get the following results comparing Eevee and master. Note that the basic Eevee shader has ~5,000 lines of code, while master's has ~3,500. Also, the tests below focus only on the shader; in Blender itself we also spend considerable time in the GWN_shaderinterface_create() function.

Results

In Linux with an AMD Radeon RX480 running Mesa (Gallium 0.4 - 4.10.0-24-generic) we get:

Version: 4.5
Core profile: 1

time start (control):  hello.cpp:115
time end   (control): 0.003271  hello.cpp:117
time start (master):  hello.cpp:101
time end   (master): 0.088870  hello.cpp:103
time start (eevee):  hello.cpp:108
time end   (eevee): 0.189856  hello.cpp:110

That means Eevee shader compilation takes about twice as long as the master shaders. And those ~200 ms mean there is a big lag every time the shader is recompiled (e.g., when the user drags a slider in a node).
Running from within Blender I get a similar result, so this sandbox seems representative of the real production environment.
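
For context on how these numbers are obtained: the measurement essentially amounts to wrapping glCompileShader() and glLinkProgram() with a wall-clock timer. A minimal sketch of that idea (not the sandbox's actual code; it assumes a current OpenGL context and a loader such as GLEW):

```cpp
#include <GL/glew.h>   // assumption: any GL function loader works here
#include <chrono>
#include <cstdio>

// Compile and link one vertex/fragment pair, printing how long each stage took.
// This mirrors the idea behind the sandbox timings, not its exact code.
static GLuint timed_build(const char *vert_src, const char *frag_src)
{
    using clock = std::chrono::steady_clock;

    GLuint vert = glCreateShader(GL_VERTEX_SHADER);
    GLuint frag = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(vert, 1, &vert_src, nullptr);
    glShaderSource(frag, 1, &frag_src, nullptr);

    auto t0 = clock::now();
    glCompileShader(vert);
    glCompileShader(frag);
    GLint c_ok = GL_FALSE;
    glGetShaderiv(frag, GL_COMPILE_STATUS, &c_ok);   // force the compile to finish
    auto t1 = clock::now();

    GLuint program = glCreateProgram();
    glAttachShader(program, vert);
    glAttachShader(program, frag);
    glLinkProgram(program);
    GLint l_ok = GL_FALSE;
    glGetProgramiv(program, GL_LINK_STATUS, &l_ok);  // same for the link
    auto t2 = clock::now();

    std::printf("compile: %.6f s (ok=%d)  link: %.6f s (ok=%d)\n",
                std::chrono::duration<double>(t1 - t0).count(), c_ok,
                std::chrono::duration<double>(t2 - t1).count(), l_ok);
    return program;
}
```

The status queries are there because some drivers may defer part of the work, which would otherwise skew the timings.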

On an NVIDIA Quadro K6000 with the proprietary driver 375.39 I get:

Version: 3.3
Core profile: 1
time start (control):  hello.cpp:115
time end   (control): 0.032880  hello.cpp:117
time start (master):  hello.cpp:101
time end   (master): 0.063129  hello.cpp:103
time start (eevee):  hello.cpp:108
time end   (eevee): 0.320869  hello.cpp:110

Now that's more interesting: while the master shader compiles faster here, the Eevee one compiles considerably slower.

Note: all tests were run with __GL_SHADER_DISK_CACHE=0, to prevent cached shaders from being used.

To run it for yourself, check the code on: https://github.com/dfelinto/opengl-sandbox

You could try removing unused functions from that eevee.fp and see if it's still slow. But since linking is not affected by unused functions, it's not going to help reduce the link time. Maybe some particularly problematic code can be identified by elimination.

Looking at this Unreal tutorial, it takes 3 s (!) to update when they edit one value in the shader graph. They made shader compilation non-blocking so it's not as bad, and ideally we should do the same in Blender. Maybe it's slow because of the little shader preview renders or something else, but it still makes you wonder whether it can actually be made as fast as we would like. If you want instant feedback in Unreal, it seems you need to create material instances where some parameters become uniforms.

So perhaps your idea of compiling all node parameters as uniforms is really the main way to get better performance while editing the shader graph. A second, more optimized shader with constants could be compiled in the background if needed. Ideally we could avoid users having to think about material instances to get faster feedback when tweaking shaders, but if we do need them, they could be node groups with sockets that you can't link to.

I too think making uniforms for all parameters is the way to go when editing or animating a material. Constant value nodes can be compiled for faster preview/playback when a material is not being edited. This is one area where we can do better than the Unreal editor.
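
To make the uniform-vs-constant distinction concrete, here is a small sketch of what the generated declarations could look like (hypothetical names, not the actual Blender/Eevee codegen): while a material is being edited every node input is emitted as a uniform, so a slider change is just a glUniform1f() call; once editing stops the same inputs can be baked as constants for the optimized variant.

```cpp
#include <sstream>
#include <string>

// Hypothetical node input: just a name and its current value.
struct NodeInput { std::string name; float value; };

// Emit the GLSL declaration for one node input.
// While editing: a uniform, so value changes are a cheap glUniform1f().
// When frozen: a compile-time constant the GLSL compiler can fold away.
static std::string emit_input(const NodeInput &in, bool editing)
{
    std::ostringstream ss;
    if (editing)
        ss << "uniform float " << in.name << ";\n";
    else
        ss << "const float " << in.name << " = " << in.value << ";\n";
    return ss.str();
}
```

Switching the editing flag off would then trigger a single background recompile of the constant-folded variant, as suggested above.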

I think the performance issue in GWN_shaderinterface_create() should be fixed though; it's taking up 30% of the compilation time here. It's not clear to me why Gawain caches all this data about uniforms and attributes, or why the lookup by name has O(n²) behavior. If it's done for performance reasons, I don't think Gawain can do name -> location mappings much faster than glGetUniformLocation(); caching of those locations needs to happen at a higher level to avoid that mapping entirely.
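
For illustration, that higher-level caching could be as simple as remembering glGetUniformLocation() results per program, so the name -> location mapping is paid once per name instead of on every lookup (a sketch with names of my own, not Gawain's actual API):

```cpp
#include <GL/glew.h>   // assumption: any GL function loader
#include <string>
#include <unordered_map>

// Per-program cache of uniform locations, filled lazily on first use.
// The point is that the name -> location mapping happens once per name,
// not on every draw call and not via a repeated linear/quadratic scan.
struct UniformCache {
    GLuint program;
    std::unordered_map<std::string, GLint> locations;

    explicit UniformCache(GLuint prog) : program(prog) {}

    GLint get(const std::string &name)
    {
        auto it = locations.find(name);
        if (it != locations.end())
            return it->second;
        GLint loc = glGetUniformLocation(program, name.c_str());
        locations.emplace(name, loc);   // cache -1 too, so misses are also O(1)
        return loc;
    }
};
```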

> But since linking is not affected by unused functions, it's not going to help reduce the link time.

As it turns out, we can actually get some benefit from doing it. I updated the GitHub program with a "lean" version of the Eevee shader, containing only the needed functions.

Of course there is some overhead in including only the required functions. And those numbers are for the simplest shader; we may not have as much to trim in a more production-ready material.
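
For what it's worth, the trimming done by hand for the lean shader could also be sketched as a codegen step, assuming the GLSL library is kept as one snippet per function (hypothetical structure, not the actual GPU codegen code):

```cpp
#include <map>
#include <string>

// Hypothetical library: GLSL function name -> its source snippet.
using GlslLibrary = std::map<std::string, std::string>;

// Build the final fragment source by appending only the library functions
// whose names appear in the generated node code. Very naive matching:
// real code would also have to track dependencies between library functions.
static std::string build_lean_source(const GlslLibrary &lib, const std::string &node_code)
{
    std::string out;
    for (const auto &entry : lib) {
        if (node_code.find(entry.first) != std::string::npos)
            out += entry.second + "\n";
    }
    out += node_code;
    return out;
}
```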

In Linux with an AMD Radeon RX480 running Mesa I get:

                  Eevee     Eevee Lean   Master    Control
glCompileShader   65 ms     20 ms        49 ms     3 ms
glLinkProgram     125 ms    83 ms        43 ms     0 ms
Total             190 ms    103 ms       125 ms    4 ms

On an NVIDIA Quadro K6000 with the proprietary driver 375.39 I get:

                  Eevee     Eevee Lean   Master    Control
glCompileShader   36 ms     10 ms        30 ms     17 ms
glLinkProgram     287 ms    257 ms       38 ms     18 ms
Total             323 ms    267 ms       67 ms     36 ms
Dalai Felinto (dfelinto) closed this task as Resolved. Jul 17 2017, 11:33 AM

"Fixed" on 2a489273d7e2 by making uniforms out of all the nodetree inputs. I will close it for now, but I would still like to see the performance on GWN_shaderinterface_create addressed.

OK, so I tried to see what causes the huge linking time, and the answer is really not pleasing.

It is caused by the amount of branching in the shader. Changes like rB518e7685790f28789bbe795f370ee3b1a5f776c6, which basically double the number of branches in every Eevee shader, roughly double the linking time.

Using the test program provided by @Dalai Felinto (dfelinto) I got these results:
Current Principled shader: 3011 ms
Previous Principled shader (without the branch): 1781 ms

Most of the shader branching complexity is inside the lamp evaluation, and the recent addition of cascaded shadow maps increased this complexity further.
To reduce the branching I can order lamps by type/shadow and iterate over each type one after the other, like I do for the probes.
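
A sketch of that ordering on the CPU side (hypothetical types and fields, not the actual Eevee lamp cache): if the lamp data is sorted by type before being uploaded, the shader can run one tight loop per contiguous range instead of branching on the type of every lamp.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-lamp data as it would be packed for the GPU.
enum LampType { LAMP_SUN = 0, LAMP_POINT = 1, LAMP_SPOT = 2, LAMP_AREA = 3 };

struct LampData {
    LampType type;
    bool has_shadow;
    float data[12];   // position, color, shadow parameters, ...
};

// Sort lamps so lamps of the same type (and shadow setting) are contiguous,
// and record where each type's run starts, plus a final end sentinel.
// The shader can then loop over [offsets[t], offsets[t + 1]) with no
// per-lamp type branch.
static std::vector<int> sort_lamps_by_type(std::vector<LampData> &lamps)
{
    std::sort(lamps.begin(), lamps.end(), [](const LampData &a, const LampData &b) {
        if (a.type != b.type)
            return a.type < b.type;
        return a.has_shadow < b.has_shadow;
    });

    std::vector<int> offsets(5, 0);
    for (int t = 0, i = 0; t < 4; t++) {
        offsets[t] = i;
        while (i < (int)lamps.size() && lamps[i].type == t)
            i++;
        offsets[t + 1] = i;
    }
    return offsets;
}
```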

Reducing the branching in the lamp code could "hypothetically" get us to 1667 ms (I did a really quick test bypassing all the "if"s and using the worst case).
But of course that will never quite be the case, and I expect the end result to be more like 2000 ms.
And removing the branching from the Principled shader gets us to 1020 ms, but that would defeat the purpose of its runtime optimisation.

Anyway, this would still be more than 1 second for each material shader, so we REALLY need lazy shader compilation.
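
As a sketch of what non-blocking compilation could look like at the GL level, assuming a driver that exposes the KHR_parallel_shader_compile extension (everything besides the GL calls is made up here): issue the link, keep drawing with the previous program, and only swap once the driver reports completion.

```cpp
#include <GL/glew.h>   // assumption: loader exposing KHR_parallel_shader_compile enums

// Tracks an in-flight program link so the viewport never blocks on it.
struct PendingProgram {
    GLuint new_program = 0;
    GLuint current_program = 0;   // last successfully linked program, kept in use

    // Call each frame after glLinkProgram(new_program) was issued.
    // Returns the program that should be bound for drawing right now.
    GLuint poll()
    {
        if (new_program == 0)
            return current_program;

        GLint done = GL_FALSE;
        // With KHR_parallel_shader_compile the driver links in the background;
        // COMPLETION_STATUS_KHR reports whether it has finished, without stalling.
        glGetProgramiv(new_program, GL_COMPLETION_STATUS_KHR, &done);
        if (!done)
            return current_program;   // keep using the old shader this frame

        GLint linked = GL_FALSE;
        glGetProgramiv(new_program, GL_LINK_STATUS, &linked);
        if (linked) {
            if (current_program)
                glDeleteProgram(current_program);
            current_program = new_program;
        }
        else {
            glDeleteProgram(new_program);   // keep the old program on failure
        }
        new_program = 0;
        return current_program;
    }
};
```

Without the extension, a similar pattern can be approximated by compiling and linking on a worker thread with a shared GL context.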

Here are the shader files if someone cares to take a shot:

Confirmed the really slow linking times due to branching. I even added a new --debug-gpu-shaders option that, for now, dumps the compiled shaders to the Blender temporary session folder.