Page MenuHome

Fix MSVC: Race condition in the use of thread_local in c++11 code

Authored by LazyDodo (LazyDodo) on Thu, Aug 1, 4:04 AM.



Originally found in the USD branch, but the problem can occur in
any C++ 11 code that uses thread_local.

If the opengl driver spins off any threads in the dllmain function
there is the possibility that they start before the CRT on the main
thread has initialized the locks leading to a crash.

this change delayloads opengl32.dll sidestepping this issue.

This problem has been observed when using the softwaregl mesa/llvmpipe opengl32.dll

Diff Detail

rB Blender
tmp_thead_local_fix (branched from master)
Build Status
Buildable 4244
Build 4244: arc lint + arc unit

Event Timeline

Guess whatever is taking to make Blender to work? But this sounds weird. Any DLL which uses C++11's thread_local will cause issues in any application? This doesn't sound correct. Or maybe this CRT is just broken by design?

if that's the case, shall we load all DLLs in a delayed manner? To preemptively solve possible issues with other DLLs which might not be giving issues yet but might in the future?

This may have a performance impact, it means potentially every OpenGL function call needs to check if the OpenGL library has been loaded already. Doing this for every dll could introduce hidden performance issues.

How did you even find this? Is there more information about this somewhere?

This page seems to say that delay loading may cause issues when using __declspec(thread), which is the opposite problem.

"Also, the floating-point registers are not saved on any platform" also seems like a potential performance problem, though the conditions for that are not entirely clear to me.

LazyDodo (LazyDodo) added a comment.EditedThu, Aug 1, 4:36 PM

On reproduction : originally this was found on the USD branch, I can supply the debug libs for this so you can build/see it for your self.
It's as easy as make a build, drop in the softwaregl opengl32.dll and boom it goes.

I'm not entirely sure what the specs say on this or who is at fault here, but this is what appears to be happening (buckle up!)

thread_local variables needs to be allocated each time a thread gets created, the mechanism that exists for this is
called a TLS callback which will be called every time a thread gets created, a few of them are registered for the
blender process.

E:\builds\build_usd\bin\Debug>dumpbin /TLS blender.exe
Microsoft (R) COFF/PE Dumper Version 14.16.27032.1
Copyright (C) Microsoft Corporation.  All rights reserved.

Dump of file blender.exe


  Section contains the following TLS directory:

    000000014D7AC000 Start of raw data
    000000014D7FDE73 End of raw data
    000000014D1A2DF0 Address of index
    000000014874F398 Address of callbacks
                   0 Size of zero fill
            00500000 Characteristics
                       16 byte align

    TLS Callbacks

          000000014011EB1D  @ILT+1170200(__dyn_tls_init)

The thing about TLS callbacks is any code in them *CAN* and *WILL* be run before
you will even hit the first line in main. Now what is happening in this case is
that mesa's opengl32 spins off a bunch of threads while the CRT is still initializing
on the main thread.

Main thread

[0x0]   blender!__acrt_initialize_multibyte   
[0x1]   blender!__acrt_execute_initializers + 0x6c   
[0x2]   blender!__acrt_initialize + 0x20   
[0x3]   blender!__scrt_initialize_crt + 0x30   
[0x4]   blender!__scrt_common_main_seh + 0xe   
[0x5]   blender!__scrt_common_main + 0xe   
[0x6]   blender!mainCRTStartup + 0x9   
[0x7]   KERNEL32!BaseThreadInitThunk + 0x14   
[0x8]   ntdll!RtlUserThreadStart + 0x21

Meanwhile on one of the Mesa OpenGL threads

[0x0]   ntdll!RtlpWaitOnCriticalSection + 0xa6   
[0x1]   ntdll!RtlpEnterCriticalSectionContended + 0x1a6   
[0x2]   ntdll!RtlEnterCriticalSection + 0x40    
[0x3]   blender!_Init_thread_lock + 0x11 <--- tries to lock blender!g_tss_mutex and fails 
[0x4]   blender!_Init_thread_header + 0xe   
[0x5]   blender!pxrInternal_v0_19__pxrReserved__::`anonymous namespace'::GetRegistry + 0x7a   
[0x6]   blender!pxrInternal_v0_19__pxrReserved__::`anonymous-namespace'::_Stack::{ctor} + 0x1f   
[0x7]   blender!pxrInternal_v0_19__pxrReserved__::`anonymous namespace'::`dynamic initializer for '_tlStack'' + 0x41   
[0x8]   blender!__dyn_tls_init + 0x6c   
[0x9]   ntdll!LdrpCallInitRoutine + 0x65   
[0xa]   ntdll!LdrpCallTlsInitializers + 0x87   
[0xb]   ntdll!LdrpInitializeThread + 0x1d4   
[0xc]   ntdll!_LdrpInitialize + 0x89   
[0xd]   ntdll!LdrpInitialize + 0x3b   
[0xe]   ntdll!LdrInitializeThunk + 0xe

Allright, now who is responsible for initializing g_tss_mutex? Well lets look at a regular build of blender
and see who initializes it

[0x0]   blender!__scrt_initialize_thread_safe_statics_platform_specific   
[0x1]   blender!__scrt_initialize_thread_safe_statics + 0x9   
[0x2]   blender!_initterm_e + 0x59   
[0x3]   blender!__scrt_common_main_seh + 0x68 
[0x4]   blender!__scrt_common_main + 0xe   
[0x5]   blender!mainCRTStartup + 0x9   
[0x6]   KERNEL32!BaseThreadInitThunk + 0x14   
[0x7]   ntdll!RtlUserThreadStart + 0x21

MS sadly ships no source for __scrt_common_main_seh but looking at the callstack it's safe to say we have a race on our hands

edit: source at c:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.22.27905\crt\src\vcruntime\exe_common.inl but it really doesn't clarify anything we didn't know already.

where the main thread of the USD branch is at when we crash :

[0x4]   blender!__scrt_common_main_seh + 0xe

where a properly running blender initalizes this lock

[0x3]   blender!__scrt_common_main_seh + 0x68

So that more or less explains what is happening, threads are trying to use a lock that the
CRT initializing on the main thread has not initialized yet.

now on solving it, I'm not entirely sure who is at fault here, and i fully admit delay loading
opengl32.dll is merely masking the actual issue..

soo...yeah... fun :)

managed to extract a minimal example showcasing the issue without the need for blender/USD.

Does it happen with both MSVC 2017 and 2019?

The internal variable names in 2017 are slightly different _Tss_mutex (2017) vs g_tss_mutex (2019), but the crash is essentially the same, i filed a ticket with ms since it doesn't really feel all that blender related, we'll see what they think about this.

Thanks for the detailed explanation. I can't really think of a way to work around it in another way.

If we need to do this just for OpenGL and maybe USD then I guess it's ok. Probably we don't do that many calls (now that we no longer use immediate mode) that it is so much of an issue. I wouldn't just do this for every library though. Still will be interesting to see what MS says about this.

This revision is now accepted and ready to land.Mon, Aug 12, 11:20 AM