Crash when rendering with CUDA / OptiX
Closed, Duplicate (Public)

Description

System Information
Operating system: Windows-10-10.0.19041-SP0 64 Bits
Graphics card: GeForce RTX 3080/PCIe/SSE2 NVIDIA Corporation 4.5.0 NVIDIA 460.79

Blender Version
Broken: version: 2.92.0 Beta, branch: master, commit date: 2021-01-25 22:10, hash: rB77f73a928439
Worked: 2.91.2

Short description of error
Crashes while loading textures to GPU memory using CUDA.
After some digging in the code, it seems to happen when device memory is low and the code decides to call 'move_textures_to_host', which then ends up trying to lock 'cuda_mem_map_mutex' twice.
Example stack trace:

Attached .blend just has 16k and 8k generated black textures and highly subdivided cubes to fill GPU memory. Can't seem to get this to crash in the 2.91.2 release build.

Exact steps for others to reproduce the error
Select CUDA or OptiX as the device
Render a scene with large textures

Event Timeline

I hit this deadlock last week as well. I believe you're right: it was introduced when the scoped locks were put in. When GPU memory is full, the mechanism that frees items from device memory in order to re-allocate them on the host now deadlocks, because move_textures_to_host is called from within a scope that already holds the mutex. The method then calls free/alloc, which try to acquire the same lock (cuda_mem_map_mutex), and the thread blocks on itself.
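For anyone following along, here is a minimal sketch of the call pattern described above. It is not the Cycles code: OwnerTrackingMutex, mem_map_mutex, alloc_or_free, move_textures_to_host_locked and alloc_under_pressure are hypothetical stand-ins. Locking a plain std::mutex a second time from the same thread is undefined behavior and typically just hangs, so the sketch tracks the owning thread and reports the re-entrant attempt instead.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for cuda_mem_map_mutex: a non-recursive mutex that
// remembers its owner, so a re-entrant lock attempt is reported, not hung on.
class OwnerTrackingMutex {
 public:
  // Returns false if the calling thread already holds the mutex.
  bool try_acquire() {
    if (owner_.load() == std::this_thread::get_id()) {
      return false;  // a plain std::mutex would deadlock here
    }
    mutex_.lock();
    owner_.store(std::this_thread::get_id());
    return true;
  }

  void release() {
    assert(owner_.load() == std::this_thread::get_id());
    owner_.store(std::thread::id());
    mutex_.unlock();
  }

 private:
  std::mutex mutex_;
  std::atomic<std::thread::id> owner_{std::thread::id()};
};

OwnerTrackingMutex mem_map_mutex;  // hypothetical, models cuda_mem_map_mutex

// Models the free/alloc calls: each one takes the lock on its own.
bool alloc_or_free() {
  if (!mem_map_mutex.try_acquire()) {
    return false;
  }
  mem_map_mutex.release();
  return true;
}

// Models move_textures_to_host being entered with the lock already held.
bool move_textures_to_host_locked() {
  return alloc_or_free();
}

// Models the allocation path: lock the map, then try to make room.
bool alloc_under_pressure() {
  if (!mem_map_mutex.try_acquire()) {
    return false;
  }
  const bool ok = move_textures_to_host_locked();
  mem_map_mutex.release();
  return ok;
}
```

Calling alloc_or_free() on its own succeeds, while alloc_under_pressure() reports failure because the inner call finds the lock already held by the same thread, which is the situation that hangs with a real non-recursive mutex.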

I think the same bug is responsible for T84734.

I fixed the problem by guarding the memory calls in CUDADevice::move_textures_to_host(size_t size, bool for_texture):

/*
 * This method is called from generic_alloc(), which itself is always
 * called in a scope with a thread_scoped_lock lock on cuda_mem_map_mutex.
 * By definition, a call to device_copy_to will free and allocate memory,
 * requiring a lock on cuda_mem_map_mutex, so we need to unlock
 * cuda_mem_map_mutex in order to prevent a systematic deadlock.
 */
cuda_mem_map_mutex.unlock();
max_mem->device_copy_to();
cuda_mem_map_mutex.lock();

And again at the end of the method, since load_texture_info also tries to acquire the same lock:

/* Update texture info array with new pointers.
 * See comment above regarding the mutex unlocking */
cuda_mem_map_mutex.unlock();
load_texture_info();
cuda_mem_map_mutex.lock();
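As a side note, the unlock/lock pairs above could also be written as a small RAII guard, so the lock is re-acquired even if the guarded call returns early. This is only a sketch under assumptions: ScopedUnlock, mem_map_mutex, textures_on_host, copy_to_host, move_textures_to_host and generic_alloc below are hypothetical stand-ins, and it assumes the caller holds a std::unique_lock rather than whatever thread_scoped_lock maps to in the real code.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical helper: releases an already-held unique_lock for the current
// scope and re-acquires it when the scope ends.
class ScopedUnlock {
 public:
  explicit ScopedUnlock(std::unique_lock<std::mutex> &lock) : lock_(lock) {
    assert(lock_.owns_lock());  // the caller must hold the lock on entry
    lock_.unlock();
  }
  ~ScopedUnlock() { lock_.lock(); }

  ScopedUnlock(const ScopedUnlock &) = delete;
  ScopedUnlock &operator=(const ScopedUnlock &) = delete;

 private:
  std::unique_lock<std::mutex> &lock_;
};

std::mutex mem_map_mutex;  // hypothetical, models cuda_mem_map_mutex
int textures_on_host = 0;  // hypothetical state protected by the mutex

// Models device_copy_to / load_texture_info: takes the lock itself.
void copy_to_host() {
  std::lock_guard<std::mutex> guard(mem_map_mutex);
  ++textures_on_host;
}

// Models move_textures_to_host: entered with the lock held, drops it only
// around the call that needs to take it again.
bool move_textures_to_host(std::unique_lock<std::mutex> &held) {
  {
    ScopedUnlock unlocked(held);
    copy_to_host();
  }
  return held.owns_lock();  // lock is held again at this point
}

// Models the caller that owns the scoped lock.
bool generic_alloc() {
  std::unique_lock<std::mutex> lock(mem_map_mutex);
  return move_textures_to_host(lock);
}
```

Either way, the map is unprotected while the lock is released, so another thread can modify it in that window. Anything read before the unlock should be treated as possibly stale afterwards.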

Just saw your diff, great! (was just about to submit one, will take yours).

@James Horsley (mmdanggg2) @Olivier Maury (omaury) This is a duplicate of T84734. I will merge this ticket into that one and try to put the information you've gathered in there as well.

Thank you for investigating this. Please continue the discussion in the ticket this one has been merged into.