Threading issue in sculpt mode #76858

Closed
opened 2020-05-18 15:31:08 +02:00 by Jacques Lucke · 19 comments
Member

System Information
Operating system: Linux-5.6.13-arch1-1-x86_64-with-arch 64 Bits
Graphics card: AMD Radeon RX 5700 (NAVI10, DRM 3.36.0, 5.6.13-arch1-1, LLVM 10.0.0) X.Org 4.6 (Core Profile) Mesa 20.0.7

Blender Version
Broken: version: 2.83 (sub 17) and an older version from the end of March (817c38f7).
Worked: (newest version of Blender that worked as expected)

Short description of error
When starting to sculpt, ASAN detects a heap-use-after-free error.

Exact steps for others to reproduce the error

  1. Apply P1400 (https://archive.blender.org/developer/P1400.txt). You can also skip this at first; the patch only helps to narrow down one particular issue.
  2. Compile debug build with ASAN enabled.
  3. Start Blender.
  4. In default scene, switch to Sculpting workspace.
  5. Enable dynamic topology.
  6. Start making strokes. It should crash after one or two strokes with a message like P1399 (https://archive.blender.org/developer/P1399.txt).

In P1400 I did the following:

  • Disabled threading for a couple of callbacks that would otherwise cause crashes as well.
  • Added BKE_pbvh_parallel_range2, which executes a given callback in parallel using the C++ threads library instead of TBB. The result is the same.
  • Disabled an assert that I ran into a couple of times but that does not seem to be harmful.
  • Added a mutex to pbvh_update_draw_buffer_cb for testing.

The issue is fixed by uncommenting the mutex in pbvh_update_draw_buffer_cb, so that the callback is only executed by a single thread at a time. I tried reducing the scope of the mutex, but still got crashes; it just took a bit longer.
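For readers without access to the patch, the experiment can be sketched roughly as follows. This is not the code from P1400: `parallel_range` and `serialized` are hypothetical names, and the sketch only assumes a callback that takes an index. It runs a callback over a range with plain C++ threads, and optionally wraps the callback so that only one thread executes it at a time.

```cpp
#include <algorithm>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a parallel range helper built on C++ threads.
// Splits [start, stop) into contiguous chunks, one chunk per thread.
void parallel_range(int start, int stop, const std::function<void(int)> &callback)
{
  const int total = std::max(0, stop - start);
  const int hw = static_cast<int>(std::thread::hardware_concurrency());
  const int num_threads = std::max(1, std::min(total, hw > 0 ? hw : 4));
  const int chunk = (total + num_threads - 1) / num_threads;

  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; t++) {
    const int begin = start + t * chunk;
    const int end = std::min(stop, begin + chunk);
    if (begin >= end) {
      break;
    }
    threads.emplace_back([begin, end, &callback]() {
      for (int i = begin; i < end; i++) {
        callback(i);
      }
    });
  }
  // Wait for all chunks to finish before returning.
  for (std::thread &thread : threads) {
    thread.join();
  }
}

// Wraps a callback so that only one thread can be inside it at a time.
// This mirrors the "mutex around the whole callback" test described above.
std::function<void(int)> serialized(std::function<void(int)> callback)
{
  auto mutex = std::make_shared<std::mutex>();
  return [mutex, callback](int i) {
    std::lock_guard<std::mutex> lock(*mutex);
    callback(i);
  };
}
```

Calling `parallel_range(0, n, serialized(cb))` removes all concurrency inside `cb`, which is why it hides a data race without showing where the race is.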

So far, the issue has only happened with dynamic topology enabled, so it may be related to that. I have not been able to narrow it down further.


I originally investigated this in #76544, but decided to make this a more specific separate report.

Author
Member

Added subscriber: @JacquesLucke

Author
Member

Changed status from 'Needs Triage' to: 'Needs Developer To Reproduce'

Author
Member

Added subscribers: @PabloDobarro, @brecht

Author
Member

@brecht, @PabloDobarro Can you please check whether you can reproduce this? I can reproduce it reliably.

In release builds I have not gotten a crash from this yet. Maybe the memory leaks described in #76544 are a result of the threading issue.

I'm setting this to high priority, since this could be quite a bad bug that we should really get rid of before releasing 2.83.

This issue was referenced by e0ae229acb2aa63ffba05dc53045577dbb876acf

I'm unable to reproduce the crash with a debug build using ASAN. I can reproduce the "member call on address 0x613000adefc0 which does not point to an object of type 'task'" problem.

From looking at the backtrace, I guessed there is a GPU batch referencing a freed GPU vertex buffer. I found a case where that could happen and committed a fix.

But I don't think it's causing this specific issue, and looking at the backtrace more closely I'm not sure that's actually what is going on.
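To make the guessed failure mode concrete, here is a minimal sketch of a batch that refers to a vertex buffer it does not own. The `VertexBuffer` and `Batch` types are hypothetical and are not Blender's GPU API. The sketch uses `std::weak_ptr` so that the dangling reference can be detected instead of dereferenced, which a raw pointer would not allow.

```cpp
#include <memory>
#include <vector>

// Hypothetical vertex buffer; stands in for any GPU-side resource.
struct VertexBuffer {
  std::vector<float> data;
};

// Hypothetical batch that references, but does not own, a vertex buffer.
struct Batch {
  std::weak_ptr<VertexBuffer> vertex_buffer;

  // True only while the referenced buffer is still alive.
  bool is_drawable() const
  {
    return !vertex_buffer.expired();
  }
};
```

With a raw pointer in place of the `std::weak_ptr`, drawing the batch after the buffer was freed would be exactly the kind of heap-use-after-free that ASAN reports.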

Some random things to try:

  • Set update_only_visible to false in BKE_pbvh_draw_cb always, and see if that stops the crash.
  • Try adding PBVH_RebuildDrawBuffers in BKE_pbvh_node_mark_update, BKE_pbvh_node_mark_update_mask or BKE_pbvh_node_mark_redraw, to force a full rebuild of the draw buffers instead of a partial. Maybe partial rebuilding has some issue.
  • Does smooth or flat shading make a difference? If so GPU_pbvh_bmesh_buffers_update_free could be changed to free more buffers with flat shading like it does for smooth shading.

If I could repro this I could probably figure out what is going on. I'm not sure why I can't or how to make it work.

That TBB warning seems to be a known issue in TBB.
https://github.com/oneapi-src/oneTBB/issues/140

It's unclear if this is a bug in TBB or a false positive from address sanitizer, but I guess it's not related to this bug.

Author
Member

I'll try to reproduce it on another OS/computer tomorrow.

Added subscriber: @mont29

Trying this morning, I cannot get an ASAN report, but I can 'reliably' reproduce a crash (including in the debugger) with the following backtrace (which looks similar to the ASAN report from @JacquesLucke):

1  __GI_raise  raise.c 50  0x7f52bb8ef781
2  __GI_abort  abort.c 79  0x7f52bb8d955b
3  __assert_fail_base  assert.c 92  0x7f52bb8d942f
4  __GI___assert_fail  assert.c 101  0x7f52bb8e8092
5  GPU_vertbuf_data_alloc  gpu_vertex_buffer.c 123  0x1c90d235
6  gpu_pbvh_vert_buf_data_set  gpu_buffers.c 154  0x1c857fc2
7  GPU_pbvh_bmesh_buffers_update  gpu_buffers.c 925  0x1c866d8e
8  pbvh_update_draw_buffer_cb  pbvh.c 1320  0x3373fa8
9  RangeTask::operator()(tbb::blocked_range<int> const&) const::{lambda()#1}::operator()() const  task_range.cc 96  0x1d258882
10 tbb::interface7::internal::delegated_function<RangeTask::operator()(tbb::blocked_range<int> const&) const::{lambda()#1} const, void>::operator()() const  task_arena.h 96  0x1d25d456
11 tbb::interface7::internal::isolate_within_arena(tbb::interface7::internal::delegate_base&, long)  0x7f52c1ca5615
12 tbb::interface7::internal::isolate_impl<void, RangeTask::operator()(tbb::blocked_range<int> const&) const::{lambda()#1} const>(RangeTask::operator()(tbb::blocked_range<int> const&) const::{lambda()#1} const&)  task_arena.h 216  0x1d25a24b
13 tbb::interface7::this_task_arena::isolate<RangeTask::operator()(tbb::blocked_range<int> const&) const::{lambda()#1}>(tbb::interface7::internal::return_type_or_void const&)  task_arena.h 472  0x1d258da3
14 RangeTask::operator()  task_range.cc 92  0x1d258a33
15 tbb::interface9::internal::start_for<tbb::blocked_range<int>, RangeTask, tbb::auto_partitioner const>::run_body  parallel_for.h 115  0x1d262fe9
16 tbb::interface9::internal::dynamic_grainsize_mode<tbb::interface9::internal::adaptive_mode<tbb::interface9::internal::auto_partition_type>>::work_balance<tbb::interface9::internal::start_for<tbb::blocked_range<int>, RangeTask, tbb::auto_partitioner const>, tbb::blocked_range<int>>  partitioner.h 423  0x1d260d00
17 tbb::interface9::internal::partition_type_base<tbb::interface9::internal::auto_partition_type>::execute<tbb::interface9::internal::start_for<tbb::blocked_range<int>, RangeTask, tbb::auto_partitioner const>, tbb::blocked_range<int>>  partitioner.h 256  0x1d25ff17
18 tbb::interface9::internal::start_for<tbb::blocked_range<int>, RangeTask, tbb::auto_partitioner const>::execute  parallel_for.h 142  0x1d25d7d3
19 ??  0x7f52c1cabb35
20 ??  0x7f52c1cabdfb
21 ??  0x7f52c1ca5327
22 ??  0x7f52c1ca3c30
23 ??  0x7f52c1ca01fc
24 ??  0x7f52c1ca0409
25 start_thread  pthread_create.c 479  0x7f52c1c67f27
26 clone  clone.S 95  0x7f52bb9b131f

To do so, I have to generate some 'broken' geometry, typically (with dyntopo enabled):

  • Sculpt a first stroke at a rather large, coarse scale (i.e. generating fairly large faces).
  • Sculpt a stroke at a very fine, detailed scale (by simply zooming as close as possible to the sculpted object).
  • Repeat those steps on the same region of the mesh; you should quickly get it to crash.
Author
Member

Operating system: Linux-5.6.13-arch1-1-x86_64-with-arch-Arch-Linux 64 Bits
Graphics card: Mesa DRI Intel(R) HD Graphics 4600 (HSW GT2) Intel Open Source Technology Center 4.5 (Core Profile) Mesa 20.0.7

I just freshly installed EndeavourOS on my laptop (so a different device than the one in my original report). I can reproduce the same issues on that device as well.

Author
Member

In #76858#934258, @brecht wrote:
Some random things to try:

  • Set update_only_visible to false in BKE_pbvh_draw_cb always, and see if that stops the crash.
  • Try adding PBVH_RebuildDrawBuffers in BKE_pbvh_node_mark_update, BKE_pbvh_node_mark_update_mask or BKE_pbvh_node_mark_redraw, to force a full rebuild of the draw buffers instead of a partial. Maybe partial rebuilding has some issue.
  • Does smooth or flat shading make a difference? If so GPU_pbvh_bmesh_buffers_update_free could be changed to free more buffers with flat shading like it does for smooth shading.

Unfortunately, none of that helped.

Here is a quick video showing the issue, just to make sure we are on the same page. Note that I get a different ASAN output here, because I did not deactivate threading for a couple of functions. Here are two other ASAN reports I get in a Blender 2.83 build: P1405 (https://archive.blender.org/developer/P1405.txt).
2020-05-19 13-45-00.mp4 (https://archive.blender.org/developer/F8542934/2020-05-19_13-45-00.mp4)

When calling make, I get a couple of CMake warnings (P1406, https://archive.blender.org/developer/P1406.txt). I have no idea whether they are related to the problem, but it does not look like they are.

I have not gotten the error on Windows yet.

This issue was referenced by 8d63d7337cec3b2dea06b3a520af467d9fc3e66d

This issue was referenced by 59cfb20fa112e636a6ae85cb6fdc048f16af5637

This issue was referenced by 499c0229f7e5a3bee0c2292fc58f3c7bbaf23240

Changed status from 'Needs Developer To Reproduce' to: 'Resolved'

Brecht Van Lommel self-assigned this 2020-05-20 00:39:18 +02:00

This turned out to be a harmless assert and a problem only when using --debug-memory.

Author
Member

Thanks Brecht! That solved the issue indeed, and it looks like a couple of issues were solved in the process as well, so it was not all for nothing.

I was actually stepping through the allocator to find the issue, because I could not find it anywhere else. There I saw mem_lock_thread and mem_unlock_thread and then wrongly assumed that it was thread safe... Wouldn't it be better to make sure that our memory allocator is always thread safe? I find the idea of having a non-thread-safe main allocator a bit uncomfortable. This might become more of an issue if we decide to use TBB in more places directly, instead of using our own abstractions.

I agree, I committed 183ba284f2 after this fix.

Thomas Dinges added this to the 2.83 LTS milestone 2023-02-08 16:35:48 +01:00