
Manage GPU_matrix stacks per GPUContext
ClosedPublic

Authored by Julian Eisel (Severin) on Aug 2 2019, 6:05 PM.

Details

Summary

Previously, we had one global GPU_matrix stack, so the API was not thread safe. This patch makes the stack per-GPUContext, effectively making it local to each thread.

A little gripe is that the API now calls into the active GPUContext, so it's not totally self-contained/independent. We could add _ex variations for all public GPU_matrix functions, taking the GPUContext as argument, to make this dependency explicit and controllable. We probably wouldn't use it for the time being though.

Needed for VR session drawing on a separate thread.

Diff Detail

Repository
rB Blender

Event Timeline

Julian Eisel (Severin) edited the summary of this revision. (Show Details) Aug 2 2019, 10:53 PM
Julian Eisel (Severin) edited the summary of this revision. (Show Details) Aug 2 2019, 11:00 PM

I remember thread local storage being quite slow some years ago. Have you profiled this?

As far as I know this was mainly due to poor compiler implementations, which may have been solved by now. But it would be good to check, since as far as I know we don't use TLS this frequently in other code.

I find it a bit hard to reason about this. There are some older posts describing why it is or was slow (e.g. this, or this). Newer posts seem more optimistic (e.g. this). But ultimately, this is of course heavily platform/implementation dependent.

I did some tests with the latest MSVC, comparing access to a variable 1,000,000,000 times (see P1066), and profiled it using Intel VTune. There's no difference in execution time between making the variable global or thread_local (C++11). Sometimes one is slightly faster, sometimes the other; surprisingly, in the vast majority of cases thread_local was even slightly faster. Time is spent in the same places in both cases. I tested without optimizations (takes about 4.5 seconds) and with /O2, i.e. maximum optimizations (~1.3 seconds).
MSVC doesn't seem to add an extra function call (which is the main performance issue according to what I read); at least VTune doesn't show any.


I've also just thrown P1066 at Godbolt to compare assembly output of static vs. static thread_local.

MSVC x64 (19.22)
static:

double getNumber(void) PROC                              ; getNumber
        movsd   xmm0, QWORD PTR double number
        ret     0
double getNumber(void) ENDP

static thread_local:

double getNumber(void) PROC                              ; getNumber
        mov     eax, OFFSET FLAT:double number
        mov     eax, eax
        mov     ecx, DWORD PTR _tls_index
        mov     rdx, QWORD PTR gs:88
        mov     rcx, QWORD PTR [rdx+rcx*8]
        movsd   xmm0, QWORD PTR [rax+rcx]
        ret     0
double getNumber(void) ENDP                              ; getNumber

GCC 7.4

static:

getNumber():
        push    rbp
        mov     rbp, rsp
        movsd   xmm0, QWORD PTR number[rip]
        pop     rbp
        ret

static thread_local:

getNumber():
        push    rbp
        mov     rbp, rsp
        movsd   xmm0, QWORD PTR fs:number@tpoff
        pop     rbp
        ret

Newer versions yield the same result.

Clang 8.0
static:

getNumber():                         # @getNumber()
        push    rbp
        mov     rbp, rsp
        movsd   xmm0, qword ptr [_ZL6number] # xmm0 = mem[0],zero
        pop     rbp
        ret

static thread_local:

getNumber():                         # @getNumber()
        push    rbp
        mov     rbp, rsp
        call    TLS wrapper function for number
        movsd   xmm0, qword ptr [rax]   # xmm0 = mem[0],zero
        pop     rbp
        ret

This confirms newer compilers solve this much better, Clang being the exception. GCC and MSVC seem to use a dedicated memory region accessed through a per-thread offset (the fs/gs segment) applied to the movsd, so no extra function call is needed.

Just checked Clang output with optimizations enabled (O1 and higher), and it also solves it without the extra function call:

getNumber():                         # @getNumber()
        movsd   xmm0, qword ptr fs:[number@TPOFF] # xmm0 = mem[0],zero
        ret

I'm going to run some more tests, but it looks like accessing a global static thread_local has (virtually) no performance impact in optimized release builds.

Did a few more tests, that is, running Blender and letting it execute gpu_context_active_matrix_state_get 1,000,000 times, accessing the matrix state either as a static or a static thread_local. Again compiled with the latest MSVC. I ran three passes.

static GPUMatrixState *:

% CPU Time    CPU Time (gpu_context_active_matrix_state_get)    Total CPU Time
0.2%          0.012s                                            60.8s
0.1%          0.008s                                            19.2s
0.1%          0.017s                                            16.8s

static thread_local GPUMatrixState *:

% CPU Time    CPU Time (gpu_context_active_matrix_state_get)    Total CPU Time
0.2%          0.034s                                            30.6s
0.0%          0.006s                                            24.7s
0.1%          0.021s                                            18.7s

Those numbers are according to VTune.

This once again confirms there's no performance penalty.

While making this thread-safe is a good idea, I'm not sure you will benefit from running the VR renderer in another thread (at least with OpenGL).

[Off topic]
It's not that I expect a big performance gain from a separate drawing thread. But we can avoid the overhead of the main loop and OpenGL context switches that way. Currently this gives me about +11 FPS in the classroom scene. The other reason is that OpenXR runtimes perform blocking calls for frame synchronization, which also block the rest of Blender running on the main thread. I've outlined the reasons here: https://devtalk.blender.org/t/gsoc-2019-core-support-of-virtual-reality-headsets-through-openxr/7614/28.

Thanks for the performance tests, let's commit this then.

This revision is now accepted and ready to land. Aug 12 2019, 10:08 AM
This revision was automatically updated to reflect the committed changes.