Threading issue in sculpt mode
Closed, Resolved (Public)

Description

System Information
Operating system: Linux-5.6.13-arch1-1-x86_64-with-arch 64 Bits
Graphics card: AMD Radeon RX 5700 (NAVI10, DRM 3.36.0, 5.6.13-arch1-1, LLVM 10.0.0) X.Org 4.6 (Core Profile) Mesa 20.0.7

Blender Version
Broken: version: 2.83 (sub 17) and an older version from the end of March (817c38f7).
Worked: (newest version of Blender that worked as expected)

Short description of error
When starting to sculpt, ASAN detects a heap-use-after-free error.

Exact steps for others to reproduce the error

  1. Apply P1400 (you can also skip this at first, this patch only helps to narrow down on one particular issue).
  2. Compile debug build with ASAN enabled.
  3. Start Blender.
  4. In default scene, switch to Sculpting workspace.
  5. Enable dynamic topology.
  6. Start making strokes. It should crash after one or two strokes with a message like P1399.

In P1400 I did the following:

  • Disabled threading for a couple of callbacks that would cause crashes as well.
  • Added BKE_pbvh_parallel_range2, which executes a given callback in parallel, but with the C++ threads library instead of TBB. The result is the same.
  • Disabled an assert that I ran into a couple of times, that does not seem to be harmful though.
  • Added a mutex to pbvh_update_draw_buffer_cb for testing.

The issue is fixed by uncommenting the mutex in pbvh_update_draw_buffer_cb so that it is executed by only one thread at a time. I tried reducing the scope of the mutex but still got crashes; it just took a bit longer.
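For readers without access to P1400, here is a minimal sketch of the experiment described above: a callback is run over a range with plain C++ threads, and a mutex inside the callback makes sure only one thread executes it at a time. All names here (parallel_range, SharedBuffers, update_node_serialized, run_serialized_update) are hypothetical stand-ins and not the code from the patch or from Blender.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: splits [0, n) into chunks and runs the callback for
// every index, one chunk per std::thread (no TBB involved).
static void parallel_range(int n, int num_threads, const std::function<void(int)> &callback)
{
  assert(num_threads > 0);
  std::vector<std::thread> threads;
  const int chunk = (n + num_threads - 1) / num_threads;
  for (int t = 0; t < num_threads; t++) {
    const int begin = t * chunk;
    const int end = std::min(n, begin + chunk);
    if (begin >= end) {
      break;
    }
    threads.emplace_back([begin, end, &callback]() {
      for (int i = begin; i < end; i++) {
        callback(i);
      }
    });
  }
  for (std::thread &thread : threads) {
    thread.join();
  }
}

// Hypothetical shared state that every callback invocation touches.
struct SharedBuffers {
  std::mutex mutex;
  std::vector<int> updated_nodes;
};

// The whole callback body is guarded, so the work is effectively serialized.
static void update_node_serialized(SharedBuffers &shared, int node_index)
{
  std::lock_guard<std::mutex> lock(shared.mutex);
  shared.updated_nodes.push_back(node_index);
}

// Runs the guarded callback over all nodes and returns how many were recorded.
static int run_serialized_update(int node_count, int num_threads)
{
  SharedBuffers shared;
  parallel_range(node_count, num_threads, [&shared](int i) { update_node_serialized(shared, i); });
  return static_cast<int>(shared.updated_nodes.size());
}
```

Locking the entire callback like this removes the parallelism, so it is only useful as a diagnostic: if the crash disappears, some state reached from the callback is shared between threads without synchronization.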

So far, the issue has only happened with dynamic topology enabled, so it may also be related to that. I have not been able to narrow it down further.


I originally investigated this in T76544, but decided to make this a more specific separate report.

Event Timeline

Jacques Lucke (JacquesLucke) changed the task status from Needs Triage to Needs Developer to Reproduce (Mon, May 18, 3:33 PM).
Jacques Lucke (JacquesLucke) triaged this task as High priority.

@Brecht Van Lommel (brecht), @Pablo Dobarro (pablodp606) Can you please check whether you can reproduce this? I can reproduce it reliably.

In release builds I did not get a crash from this yet. Maybe the memory leaks described in T76544 are a result of the threading issue.

I'm setting this to high priority, since this could be quite a bad bug that we should really get rid of before releasing 2.83.

I'm unable to reproduce the crash with a debug build using ASAN. I can reproduce the "member call on address 0x613000adefc0 which does not point to an object of type 'task'" problem.

From looking at the backtrace, I guessed there is a GPU batch referencing a freed GPU vertex buffer. I found a case where that could happen and committed a fix.

But I don't think it's causing this specific issue, and looking at the backtrace more closely I'm not sure that's actually what is going on.

Some random things to try:

  • Set update_only_visible to false in BKE_pbvh_draw_cb always, and see if that stops the crash.
  • Try adding PBVH_RebuildDrawBuffers in BKE_pbvh_node_mark_update, BKE_pbvh_node_mark_update_mask or BKE_pbvh_node_mark_redraw, to force a full rebuild of the draw buffers instead of a partial. Maybe partial rebuilding has some issue.
  • Does smooth or flat shading make a difference? If so GPU_pbvh_bmesh_buffers_update_free could be changed to free more buffers with flat shading like it does for smooth shading.

If I could repro this I could probably figure out what is going on. I'm not sure why I can't or how to make it work.

That TBB warning seems to be a known issue in TBB.
https://github.com/oneapi-src/oneTBB/issues/140

It's unclear if this is a bug in TBB or a false positive from address sanitizer, but I guess it's not related to this bug.

I'll try to reproduce it on another OS/computer tomorrow.

Trying this morning, I cannot get an ASAN report, but I can 'reliably' reproduce a crash (including in the debugger) with the following backtrace (which looks similar to the ASAN report from @Jacques Lucke (JacquesLucke)):

1  __GI_raise  raise.c  50  0x7f52bb8ef781
2  __GI_abort  abort.c  79  0x7f52bb8d955b
3  __assert_fail_base  assert.c  92  0x7f52bb8d942f
4  __GI___assert_fail  assert.c  101  0x7f52bb8e8092
5  GPU_vertbuf_data_alloc  gpu_vertex_buffer.c  123  0x1c90d235
6  gpu_pbvh_vert_buf_data_set  gpu_buffers.c  154  0x1c857fc2
7  GPU_pbvh_bmesh_buffers_update  gpu_buffers.c  925  0x1c866d8e
8  pbvh_update_draw_buffer_cb  pbvh.c  1320  0x3373fa8
9  RangeTask::operator()(tbb::blocked_range<int> const&) const::{lambda()#1}::operator()() const  task_range.cc  96  0x1d258882
10  tbb::interface7::internal::delegated_function<RangeTask::operator()(tbb::blocked_range<int> const&) const::{lambda()#1} const, void>::operator()() const  task_arena.h  96  0x1d25d456
11  tbb::interface7::internal::isolate_within_arena(tbb::interface7::internal::delegate_base&, long)  0x7f52c1ca5615
12  tbb::interface7::internal::isolate_impl<void, RangeTask::operator()(tbb::blocked_range<int> const&) const::{lambda()#1} const>(RangeTask::operator()(tbb::blocked_range<int> const&) const::{lambda()#1} const&)  task_arena.h  216  0x1d25a24b
13  tbb::interface7::this_task_arena::isolate<RangeTask::operator()(tbb::blocked_range<int> const&) const::{lambda()#1}>(tbb::interface7::internal::return_type_or_void const&)  task_arena.h  472  0x1d258da3
14  RangeTask::operator()  task_range.cc  92  0x1d258a33
15  tbb::interface9::internal::start_for<tbb::blocked_range<int>, RangeTask, tbb::auto_partitioner const>::run_body  parallel_for.h  115  0x1d262fe9
16  tbb::interface9::internal::dynamic_grainsize_mode<tbb::interface9::internal::adaptive_mode<tbb::interface9::internal::auto_partition_type>>::work_balance<tbb::interface9::internal::start_for<tbb::blocked_range<int>, RangeTask, tbb::auto_partitioner const>, tbb::blocked_range<int>>  partitioner.h  423  0x1d260d00
17  tbb::interface9::internal::partition_type_base<tbb::interface9::internal::auto_partition_type>::execute<tbb::interface9::internal::start_for<tbb::blocked_range<int>, RangeTask, tbb::auto_partitioner const>, tbb::blocked_range<int>>  partitioner.h  256  0x1d25ff17
18  tbb::interface9::internal::start_for<tbb::blocked_range<int>, RangeTask, tbb::auto_partitioner const>::execute  parallel_for.h  142  0x1d25d7d3
19  ??  0x7f52c1cabb35
20  ??  0x7f52c1cabdfb
21  ??  0x7f52c1ca5327
22  ??  0x7f52c1ca3c30
23  ??  0x7f52c1ca01fc
24  ??  0x7f52c1ca0409
25  start_thread  pthread_create.c  479  0x7f52c1c67f27
26  clone  clone.S  95  0x7f52bb9b131f

To do so, I have to generate some 'broken' geometry, typically (with dyntopo enabled):

  • Sculpt a first stroke at a rather large, coarse scale (i.e. generating fairly large faces).
  • Sculpt a stroke at a very fine, detailed scale (by simply zooming as close as possible to the sculpted object).
  • Repeat those steps on the same region of the mesh; you should quickly get it to crash.

Operating system: Linux-5.6.13-arch1-1-x86_64-with-arch-Arch-Linux 64 Bits
Graphics card: Mesa DRI Intel(R) HD Graphics 4600 (HSW GT2) Intel Open Source Technology Center 4.5 (Core Profile) Mesa 20.0.7

I just freshly installed EndeavourOS on my laptop (so a different device from the one in my original report). I can reproduce the same issues on that device as well.

Regarding the things to try suggested above (update_only_visible, forcing PBVH_RebuildDrawBuffers, smooth vs. flat shading): unfortunately, none of that helped.

Here is a quick video showing the issue, just to make sure we are on the same page. Note that I get a different ASAN output here, because I did not deactivate threading for a couple of functions. Here are two other ASAN reports I get in a Blender 2.83 build: P1405.

When calling make, I get a couple of CMake warnings (P1406). I have no idea if those are related to the problem. It does not look like they are.

I have not gotten the error on Windows yet.

This turned out to be a harmless assert and a problem only when using --debug-memory.

Thanks Brecht! That solved the issue indeed, and it looks like a couple of issues were solved in the process as well, so it was not all for nothing.

I was actually stepping through the allocator to find the issue, because I could not find it anywhere else. There I saw mem_lock_thread and mem_unlock_thread and then wrongly assumed that it was thread safe... Wouldn't it be better to make sure that our memory allocator is always thread safe? I find the idea of having a non-thread-safe main allocator a bit uncomfortable. This might become more of an issue if we decide to use TBB in more places directly, instead of using our own abstractions.
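To make the allocator concern concrete, here is a minimal sketch of an allocator that keeps bookkeeping about live blocks and always takes its own lock around that bookkeeping, regardless of which threading library the caller uses. TrackingAllocator and stress_allocator are hypothetical names for illustration; this is not Blender's allocator and makes no claim about how mem_lock_thread and mem_unlock_thread are actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <unordered_set>
#include <vector>

// Hypothetical allocator that tracks live blocks, similar in spirit to a
// debug allocator. The tracking set is shared state, so every access to it
// is guarded by the allocator's own mutex.
class TrackingAllocator {
 public:
  void *allocate(std::size_t size)
  {
    void *ptr = std::malloc(size);
    if (ptr != nullptr) {
      std::lock_guard<std::mutex> lock(mutex_);
      live_blocks_.insert(ptr);
    }
    return ptr;
  }

  void deallocate(void *ptr)
  {
    if (ptr == nullptr) {
      return;
    }
    {
      // Untrack before freeing, so the address cannot be handed out to and
      // tracked by another thread while it is still in the set.
      std::lock_guard<std::mutex> lock(mutex_);
      const std::size_t erased = live_blocks_.erase(ptr);
      assert(erased == 1);
      (void)erased;
    }
    std::free(ptr);
  }

  std::size_t live_block_count()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return live_blocks_.size();
  }

 private:
  std::mutex mutex_;
  std::unordered_set<void *> live_blocks_;
};

// Hammers the allocator from several threads and returns the number of
// blocks still tracked afterwards (0 if the bookkeeping stayed consistent).
static std::size_t stress_allocator(int num_threads, int allocations_per_thread)
{
  TrackingAllocator allocator;
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; t++) {
    threads.emplace_back([&allocator, allocations_per_thread]() {
      for (int i = 0; i < allocations_per_thread; i++) {
        void *ptr = allocator.allocate(64);
        allocator.deallocate(ptr);
      }
    });
  }
  for (std::thread &thread : threads) {
    thread.join();
  }
  return allocator.live_block_count();
}
```

The design point is that the lock lives inside the allocator rather than being something callers or a task scheduler must remember to enable, which is what would matter if TBB were used directly in more places.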