
Cycles GPU Compute baking often fails
Closed, ResolvedPublic

Description

System Information

Operating system: Windows 10 Home 64
Graphics card: NVIDIA GeForce GTX 860M

Blender Version

Broken: 2.80 Beta (all)


Worked: 2.79b

Short description of error

When using CUDA and Cycles GPU Compute to bake textures, it usually fails halfway through the process resulting in half-baked textures with horizontal lines through them, like the one displayed below.

In each subsequent attempt, the bake fails immediately resulting in errors like below. It won't bake again unless I use CPU or restart Blender, but I can't recall it ever baking fully with GPU Compute even after restarting.

Exact steps for others to reproduce the error

  1. Select CUDA from Edit > Preferences > System > General > Cycles Compute Device
  2. Check any or all devices
  3. Use the following settings:
  4. Bake

Event Timeline

I can't reproduce this with 21523b5 under Ubuntu 16.04 with a Titan Black. Can you upload a .blend file that shows the issue?

Sure. Try this one. I also packed the resulting texture. If I try to bake it again, it fails and throws an error.

No error here; the bake is correct.
It reminds me of the problems seen in the first attempts at hybrid rendering.

Does it work if you only use the graphics card as the CUDA device?

Just gave that a try. No luck. Actually, it failed immediately.

This appears in the taskbar:

Win 10 Pro, gtx 1080

The first bake gives me an image like the one the OP showed, with a scan-line-like effect: a bad bake.

The second try gives me nothing, and my Cycles viewport render vanishes outright, leaving only the GL layer where I can see the light objects, cameras, selection outline, etc.

It also throws an error: "CUDA error at cuCtxCreate: Illegal address", but then reports "Baking map saved to internal image, save it..."


Hello, I'm not sure if this is the same issue, but it looks similar: baking doesn't work on a GTX 1070.

Here is the entire error log:

https://pastebin.com/JLepKyff

Reproduced on a GTX 1070 in blender-2.80.0-git.445433a6913f-windows64 using the default cube scene and attempting to do a combined bake onto a 1024x1024 image with all default render settings.

https://pastebin.com/DgAp8hxg

Reproduced this issue in every committed Blender build, most recently blender-2.80.0-git.d3870471edd7-windows64, using my own Shaderstatue file for shader testing. All bakes fail before 10% has actually been baked.

The message "Bake stored internally, please save" appears, and trying to bake again results in the orange CUDA message "CUDA error: illegal address".

GPU: GTX 1080, latest graphics driver
CPU: i5 4460
OS: Windows 10, latest build

From my research, this issue has been present for more than a year now. It's a pretty big issue, I'd say. So far the devs have kept complete radio silence on it, or at least I haven't been able to find anything said about it anywhere. Could the devs say something about it: is it getting fixed at all, and when? Why has it been neglected for so long? Is it that hard to fix, considering it has been around for such a long time?

So far we've not found a way to reproduce this yet, tested on multiple computers with different graphics cards.

It will get fixed, but it's one of 1000 other bugs in the tracker that we are working through.

Are there any ways to log those things in Blender so you can get further details from users?

Hi everyone, this bug has not been fixed, right? Is there a possibility of a fix in the near future? It seems to work OK in Ubuntu. It is the only reason I need to reboot my PC: to bake a texture. (Not trying to offend anyone; I really do appreciate your hard work, guys.) Thanks for all the great effort! :)

Brecht Van Lommel (brecht) triaged this task as Confirmed, High priority. Mar 8 2019, 4:53 PM

Still can't confirm myself, but marking as high priority.

I tried it on 0ba143a1d675: same thing, and it is also impossible to start baking again because of the CUDA error, as I said in T60121. It works only after restarting Blender.
But a new thing appeared: the result looks different with different tile sizes. It looks like it can render only the top half of each tile.

Target resolution: 1024x1024

Tile sizes tested (screenshots attached to the original report): 64, 256, 512, and 1024 and upwards.
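
To make the reported pattern concrete, here is a minimal sketch of what "only the top half of each tile" would mean for a 1024x1024 target. It is purely illustrative arithmetic, not Cycles code: the function names are hypothetical, row 0 is assumed to be the top of the image, and a tile larger than the image is assumed to be clamped to the image size.

```python
def baked_rows(resolution, tile_size):
    """Per image row, True if the row lies in the top half of its tile.

    Assumption (hypothetical model of the symptom): every tile stops
    after exactly half of its rows, and oversized tiles are clamped
    to the image height.
    """
    rows = []
    for y in range(resolution):
        tile_start = (y // tile_size) * tile_size
        tile_height = min(tile_size, resolution - tile_start)
        rows.append((y - tile_start) < tile_height // 2)
    return rows


def count_bands(rows):
    """Count contiguous runs of baked rows (the visible stripes)."""
    bands = 0
    previous = False
    for baked in rows:
        if baked and not previous:
            bands += 1
        previous = baked
    return bands


for tile in (64, 256, 512, 1024, 2048):
    rows = baked_rows(1024, tile)
    print(f"tile {tile:4d}: {count_bands(rows):2d} baked band(s), "
          f"{sum(rows)} of {len(rows)} rows baked")
```

Under these assumptions, half of the rows are baked at every tile size; only the number and thickness of the stripes change, down to a single half-image band once the tile covers the whole image.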

I was able to reproduce this on a GTX 960 now. The exact cause is not clear to me, but it seems that CUDA 10.1 works correctly. Possibly it's the same compiler bug(s) we've reported to NVIDIA before and that they fixed in 10.0 and 10.1.

artem ivanov (ixd) added a comment. Edited Mar 15 2019, 9:24 PM

@Brecht Van Lommel (brecht) A bit too late, but I did some research on this (just about 2 hours before it got fixed, lol), trying to compile the bake kernel with various CUDA versions (9.1, 10.0, 10.1) and optimization settings. Here are the results:

OS: Win 10, GPU: GTX 1070

Legend: OK = bakes well, ERROR = CUDA errors while baking

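| Compiler | Assembler | Optimization | Format | MSVC Version | Result |
| --- | --- | --- | --- | --- | --- |
| NVCC 9.1 | - | O3 | cubin | 14.11 | OK |
| NVCC 9.1 | - | None | cubin | 14.11 | OK |
| NVCC 9.1 | - | O3 | ptx + JIT | 14.11 | OK |
| NVCC 9.1 | - | None | ptx + JIT | 14.11 | OK |
| NVCC 10.0 | - | O3 | cubin | 14.11 | OK |
| NVCC 10.0 | - | None | cubin | 14.11 | OK |
| NVCC 10.0 | - | O3 | ptx + JIT | 14.11 | OK |
| NVCC 10.0 | - | None | ptx + JIT | 14.11 | OK |
| NVCC 10.0 | - | O3 | cubin | 14.16 | OK |
| NVCC 10.0 | - | None | cubin | 14.16 | OK |
| NVCC 10.0 | - | O3 | ptx + JIT | 14.16 | OK |
| NVCC 10.0 | - | None | ptx + JIT | 14.16 | OK |
| NVCC 10.1 | - | O3 | cubin | 14.11 | OK |
| NVCC 10.1 | - | None | cubin | 14.11 | OK |
| NVCC 10.1 | - | O3 | ptx + JIT | 14.11 | OK |
| NVCC 10.1 | - | None | ptx + JIT | 14.11 | OK |
| NVCC 10.1 | - | O3 | cubin | 14.16 | OK |
| NVCC 10.1 | - | None | cubin | 14.16 | OK |
| NVCC 10.1 | - | O3 | ptx + JIT | 14.16 | OK |
| NVCC 10.1 | - | None | ptx + JIT | 14.16 | OK |
| NVCC 9.1 | ptxas 9.1 | O3 | cubin | 14.11 | OK |
| NVCC 9.1 | ptxas 9.1 | None | cubin | 14.11 | OK |
| NVCC 9.1 | ptxas 10.0 | O3 | cubin | 14.11 | OK |
| NVCC 9.1 | ptxas 10.0 | None | cubin | 14.11 | OK |
| NVCC 9.1 | ptxas 10.1 | O3 | cubin | 14.11 | OK |
| NVCC 9.1 | ptxas 10.1 | None | cubin | 14.11 | OK |
| NVCC 10.0 | ptxas 10.0 | O3 | cubin | 14.11 | OK |
| NVCC 10.0 | ptxas 10.0 | None | cubin | 14.11 | OK |
| NVCC 10.0 | ptxas 10.0 | O3 | cubin | 14.16 | OK |
| NVCC 10.0 | ptxas 10.0 | None | cubin | 14.16 | OK |
| NVCC 10.0 | ptxas 10.1 | O3 | cubin | 14.11 | OK |
| NVCC 10.0 | ptxas 10.1 | None | cubin | 14.11 | OK |
| NVCC 10.1 | ptxas 10.1 | O3 | cubin | 14.11 | OK |
| NVCC 10.1 | ptxas 10.1 | None | cubin | 14.11 | OK |
| NVCC 10.1 | ptxas 10.1 | O3 | cubin | 14.16 | OK |
| NVCC 10.1 | ptxas 10.1 | None | cubin | 14.16 | OK |
| NVRTC 9.1 | ptxas 9.1 | O3 | cubin | 14.11 | ERROR |
| NVRTC 9.1 | ptxas 9.1 | None | cubin | 14.11 | OK |
| NVRTC 9.1 | - | O3 | ptx + JIT | 14.11 | OK |
| NVRTC 9.1 | ptxas 10.0 | O3 | cubin | 14.11 | Weird pixels, but not crashed |
| NVRTC 9.1 | ptxas 10.0 | None | cubin | 14.11 | OK |
| NVRTC 10.0 | ptxas 10.0 | O3 | cubin | 14.11 | Weird pixels, but not crashed |
| NVRTC 10.0 | ptxas 10.0 | None | cubin | 14.11 | OK |
| NVRTC 10.0 | - | O3 | ptx + JIT | 14.11 | OK |
| NVRTC 10.0 | - | None | ptx + JIT | 14.11 | OK |
| NVRTC 9.1 | ptxas 10.1 | O3 | cubin | 14.11 | OK |
| NVRTC 9.1 | ptxas 10.1 | None | cubin | 14.11 | OK |
| NVRTC 10.0 | ptxas 10.1 | O3 | cubin | 14.11 | OK |
| NVRTC 10.0 | ptxas 10.1 | None | cubin | 14.11 | OK |
| NVRTC 10.1 | ptxas 10.1 | O3 | cubin | 14.11 | OK |
| NVRTC 10.1 | ptxas 10.1 | None | cubin | 14.11 | OK |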
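
As a quick way to read the results, the sketch below transcribes only the NVRTC rows of the table (every NVCC row is OK, and all NVRTC rows used MSVC 14.11, so that column is omitted) and filters out the non-OK ones. The `Row` type and variable names are hypothetical; the data is copied from the table as posted.

```python
from collections import namedtuple

# Hypothetical record type; fields mirror the table columns (minus MSVC).
Row = namedtuple("Row", "frontend assembler opt fmt result")

WEIRD = "Weird pixels, but not crashed"

NVRTC_ROWS = [
    Row("NVRTC 9.1", "ptxas 9.1", "O3", "cubin", "ERROR"),
    Row("NVRTC 9.1", "ptxas 9.1", "None", "cubin", "OK"),
    Row("NVRTC 9.1", "-", "O3", "ptx + JIT", "OK"),
    Row("NVRTC 9.1", "ptxas 10.0", "O3", "cubin", WEIRD),
    Row("NVRTC 9.1", "ptxas 10.0", "None", "cubin", "OK"),
    Row("NVRTC 10.0", "ptxas 10.0", "O3", "cubin", WEIRD),
    Row("NVRTC 10.0", "ptxas 10.0", "None", "cubin", "OK"),
    Row("NVRTC 10.0", "-", "O3", "ptx + JIT", "OK"),
    Row("NVRTC 10.0", "-", "None", "ptx + JIT", "OK"),
    Row("NVRTC 9.1", "ptxas 10.1", "O3", "cubin", "OK"),
    Row("NVRTC 9.1", "ptxas 10.1", "None", "cubin", "OK"),
    Row("NVRTC 10.0", "ptxas 10.1", "O3", "cubin", "OK"),
    Row("NVRTC 10.0", "ptxas 10.1", "None", "cubin", "OK"),
    Row("NVRTC 10.1", "ptxas 10.1", "O3", "cubin", "OK"),
    Row("NVRTC 10.1", "ptxas 10.1", "None", "cubin", "OK"),
]

# Rows whose result is anything other than OK.
failures = [r for r in NVRTC_ROWS if r.result != "OK"]

# Candidate pattern: O3-optimized cubin assembled with ptxas 9.1 or 10.0.
OLD_PTXAS = ("ptxas 9.1", "ptxas 10.0")
suspects = [
    r for r in NVRTC_ROWS
    if r.opt == "O3" and r.fmt == "cubin" and r.assembler in OLD_PTXAS
]

for r in failures:
    print(f"{r.frontend} + {r.assembler}, {r.opt}, {r.fmt}: {r.result}")
print("pattern matches failures exactly:", suspects == failures)
```

In this data set the bad results coincide exactly with NVRTC-generated PTX that was assembled to cubin at O3 by ptxas 9.1 or 10.0; with optimization off, with ptxas 10.1, or with driver-side JIT, the same kernels bake correctly.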

"ptx + JIT" means loading the PTX assembly into Blender (instead of a .cubin) and letting the NVIDIA driver compile it. This requires no change to the Blender source code, as the driver can handle PTX as well as cubin.

"Weird pixels" image from NVRTC 9.1 & ptxas 10.0:

cycles_cubin_cc (which was used to compile the CUDA kernels shipped by the buildbot) used NVRTC 9.1 & ptxas 9.1 with optimization. So it seems to me that NVRTC 9.1 and ptxas 9.1 don't play nicely together on some GPUs (but NVCC 9.1 & ptxas 9.1 work fine). The latest NVIDIA driver JIT-compiles the NVRTC 9.1 PTX kernel just fine.

Running cuda-memcheck blender.exe shows:

========= Invalid __global__ read of size 4
=========     at 0x000bab78 in <>/blender/intern/cycles/kernel/../kernel/kernel_path_state.h:250:kernel_cuda_bake
=========     by thread (10,0,0) in block (14,0,0)
=========     Address 0x3e80000000000022 is misaligned
=========     Device Frame:<>/blender/intern/cycles/kernel/../kernel/kernel_bake.h:121:kernel_cuda_bake (kernel_cuda_bake : 0x2cbd8)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\SYSTEM32\nvcuda.dll (cuGetExportTable + 0x1b9c47) [0x1c82c5]
=========     Host Frame:<>\build_windows_Release_x64_vc15_Debug\bin\Debug\blender.exe (ccl::CUDADevice::shader + 0x8a3) [0x29a2c83]
=========     Host Frame:<>\build_windows_Release_x64_vc15_Debug\bin\Debug\blender.exe (ccl::CUDADevice::thread_run + 0x356) [0x29a5796]
...
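
The "misaligned" verdict in the log above can be checked by hand. The short sketch below does the arithmetic for the reported address and access size; the variable names are hypothetical. It also splits the 64-bit value into two 32-bit words, and reading anything into those words is speculation on top of the log, not something the tool reports.

```python
import struct

ADDRESS = 0x3E80000000000022  # address reported by the memory checker
ACCESS_SIZE = 4               # "read of size 4"

# A 4-byte access is aligned only if the address is a multiple of 4.
offset = ADDRESS % ACCESS_SIZE
is_aligned = offset == 0
print(f"address % {ACCESS_SIZE} = {offset}, aligned: {is_aligned}")

# Split the value into 32-bit halves to see what it resembles.
high_word = ADDRESS >> 32
low_word = ADDRESS & 0xFFFFFFFF
high_as_float = struct.unpack("<f", struct.pack("<I", high_word))[0]
print(f"high word 0x{high_word:08x} read as float32: {high_as_float}")
print(f"low word 0x{low_word:08x} read as integer: {low_word}")
```

The address is off by 2 bytes from a 4-byte boundary, which is what triggers the report. The two halves also decode to ordinary-looking values (0.25 as a float and 34 as an integer), which would at least be consistent with a pointer that was overwritten by unrelated data rather than one that was merely offset.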

After further investigation, it looks like stack corruption is happening on a random thread in:

kernel_bake_evaluate ->
    compute_light_pass (create variable "PathState state;") ->
        kernel_path_indirect (passing "state" by pointer) <corruption here> -> ... -> path_state_ao_bounce (kernel_path_state.h:250) ->
                     "access to pointer to stack allocated (and not shared) var "state", pointer points to garbage"

All subsequent calls to the CUDA API (cuCtxSynchronize, etc.) then return an error.
It seems to me that it's an optimization codegen bug in NVRTC 9.1 (and possibly in 10.0).

CUDA 10.1 resolves these issues (and we can use the latest MSVC with it!).

Tested on this file (default cube + texture):

@artem ivanov (ixd), interesting tests. There is indeed an optimization codegen bug that we reported and that was fixed in 10.1.