
Gawain: Optimize shader uniform access

Authored by Sergey Sharybin (sergey) on May 31 2017, 7:13 PM.



Before this change Gawain was doing the list lookup twice,
doing a string comparison against each and every input, which
is inefficient and unfriendly to CPUs with a small
cache.

Now we store a hash of the input name together with the actual
name and compare hashes first. Additionally, we do
everything in a single pass, which is much better from a
cache coherency point of view.
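
A minimal sketch of the hash-first, single-pass lookup described above. The struct and function names (SketchInput, sketch_find_input, and so on) are hypothetical stand-ins, not Gawain's actual types; only the multiply-by-37 hash is taken from the review comments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a shader input record; not Gawain's real struct. */
typedef struct {
    unsigned int name_hash; /* hash of `name`, computed once at setup time */
    const char *name;
    int location;
} SketchInput;

/* The multiply-by-37 string hash quoted in the review comments. */
static unsigned int sketch_hash_string(const char *str)
{
    unsigned int i = 0, c;
    assert(str != NULL);
    while ((c = (unsigned char)*str++)) {
        i = i * 37 + c;
    }
    return i;
}

/* Store the hash next to the name so lookups never have to rehash stored names. */
static void sketch_set_input_name(SketchInput *input, const char *name, int location)
{
    input->name = name;
    input->name_hash = sketch_hash_string(name);
    input->location = location;
}

/* Single pass over the inputs: compare the cheap integer hash first and
 * only fall back to strcmp() when the hashes match. */
static const SketchInput *sketch_find_input(const SketchInput *inputs,
                                            size_t count,
                                            const char *name)
{
    const unsigned int hash = sketch_hash_string(name);
    for (size_t i = 0; i < count; i++) {
        if (inputs[i].name_hash != hash) {
            continue; /* different hash: the names cannot be equal */
        }
        if (strcmp(inputs[i].name, name) == 0) {
            return &inputs[i];
        }
    }
    return NULL;
}
```

The strcmp() after a hash match is still needed, since two different names can hash to the same value.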

This brings Eevee cache population time from 80ms to
60ms on my desktop and from 800ms to 400ms for Clement
when navigating in a file from T50027.

(patch by @Sergey Sharybin (sergey))

Diff Detail

rB Blender

Event Timeline

Excellent work @Sergey Sharybin (sergey)!

The two old loops were to avoid string matching against builtin names (since those should really be looked up by enum).
The single new loop has the same probabilistic behavior of rejecting builtin uniforms when looking up a non-builtin name.

I'm ready to accept this after inline comments are addressed.


Does inline make a difference? I usually let the compiler do what it wants with static functions.


code formatting:

const char* name
while ((c = *str++))
   i = i * 37 + c;



Body of set_input_name is indented twice.


Keep this TODO comment, it's still a good idea.

Successful lookup by name requires the hash and the strings to match. Lookup by enum is always faster.
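
To illustrate why lookup by enum is cheaper, here is a rough sketch that assumes builtin uniforms are kept in a fixed array indexed by an enum. The enum values and the SketchInterface struct are hypothetical; Gawain's real builtin enum and storage are not shown in this review.

```c
#include <assert.h>

/* Hypothetical builtin identifiers, for illustration only. */
typedef enum {
    SKETCH_UNIFORM_NONE = 0,
    SKETCH_UNIFORM_MODELVIEW,
    SKETCH_UNIFORM_PROJECTION,
    SKETCH_UNIFORM_COLOR,
    SKETCH_NUM_UNIFORMS
} SketchBuiltinUniform;

typedef struct {
    /* Location of each builtin, or -1 when the shader does not use it. */
    int builtin_location[SKETCH_NUM_UNIFORMS];
} SketchInterface;

static void sketch_interface_init(SketchInterface *iface)
{
    for (int i = 0; i < SKETCH_NUM_UNIFORMS; i++) {
        iface->builtin_location[i] = -1;
    }
}

/* Lookup by enum is a single array access: no hashing, no strcmp(). */
static int sketch_builtin_location(const SketchInterface *iface,
                                   SketchBuiltinUniform id)
{
    assert(id > SKETCH_UNIFORM_NONE && id < SKETCH_NUM_UNIFORMS);
    return iface->builtin_location[id];
}
```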

Updates for the review

Code style.

Removed inline just to match style of surrounding code.

Does inline make a difference? I usually let the compiler do
what it wants with static functions.

In this particular case I did not compare the assembly. But in general
it surely makes a difference to give such hints to the compiler, and
sometimes even to define inline as forceinline. The bigger a function
becomes (both the one you marked as inline and the one calling it),
the less likely the compiler is to actually inline anything, which can
ruin your expectations of what the compiler will do and of the
resulting performance. You can see that in all the fine-tuning commits
to Cycles, where tweaking the inline policy of the triangle intersection
routines can bring CPU ticks from 13M to 6M per test case.
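
As a sketch of what "define inline as forceinline" can look like, the macro below maps to the compiler-specific always-inline spelling on MSVC and GCC/Clang and falls back to plain inline elsewhere. The macro and function names are made up for this example and are not the ones used in Gawain or Cycles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro: request inlining more strongly than plain `inline`. */
#if defined(_MSC_VER)
#  define SKETCH_FORCEINLINE static __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define SKETCH_FORCEINLINE static inline __attribute__((always_inline))
#else
#  define SKETCH_FORCEINLINE static inline
#endif

/* Small hot helper: one step of the multiply-by-37 hash from the review. */
SKETCH_FORCEINLINE unsigned int sketch_hash_step(unsigned int hash, unsigned char c)
{
    return hash * 37 + c;
}

/* The caller stays an ordinary static function; the compiler decides
 * what to do with it. */
static unsigned int sketch_hash_inlined(const char *str)
{
    unsigned int hash = 0;
    unsigned char c;
    assert(str != NULL);
    while ((c = (unsigned char)*str++)) {
        hash = sketch_hash_step(hash, c);
    }
    return hash;
}
```

Whether the hint pays off depends on the compiler and the surrounding code, so comparing the generated assembly or timing a test case is the only reliable check.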

This revision was automatically updated to reflect the committed changes.