Crash when changing torus properties
Closed, Resolved | Public | BUG

Description

System Information
Operating system: Darwin-19.6.0-x86_64-i386-64bit 64 Bits (macOS Catalina) 10.15.6
Graphics card: NVIDIA GeForce GT 650M OpenGL Engine NVIDIA Corporation 4.1 NVIDIA-14.0.32 355.11.11.10.10.143

Blender Version
Broken: version: 2.90.0 Beta, branch: master, commit date: 2020-08-28 14:51, hash: rBddbf41d88d43
Worked: -

Short description of error
When changing the properties of a torus mesh, Blender crashes.
I recorded a video:
https://imgur.com/a/ghQjmGG

Exact steps for others to reproduce the error
Disable Undo Legacy (the crash does not occur with legacy undo enabled).
On a Release or RelWithDebInfo build: Default scene > add torus > move one radius slider, then ferociously move the other radius slider.
If the crash doesn't occur, add cone, do the same with its radius sliders.

Related Objects

Mentioned In
T89307: Blender crashes when using operator panels on macOS
T84789: Crash when switching away from Blender and back again
T84397: Creating and removing many objects very quickly causes a crash
T84356: Constant crashes in Mac with version 2.91.0
T84368: Segmentation fault when changing size of added objects
T84015: Crash when staying / using a slider field for too long
T80272: Same problem with task T77840, Blender 2.83.0 crashes on macOS Catalina.
T82353: Blender crashes and destroys my scene
T80857: Blender 2.8-2.9 crashing on simple operations Mac OS
Mentioned Here
T84387: Crash Changing Size of Cube
rBabbc43e4e419: Fix T84397: Creating and removing many objects very quickly causes a crash
D10077: Fix T84397, T80203: use `session_uuid` instead of ID pointers in depsgraph storage.
T84397: Creating and removing many objects very quickly causes a crash
T18: no entry for command line arguments yet
T84368: Segmentation fault when changing size of added objects
T81077: Build Bot: MacOS X test fails
T60695: Optimized per-datablock global undo
D6580: WIP/Demonstration patch about undo speedup project.
rBbb872b25f219: CMake/macOS: Search for headers in Frameworks last.
T74067: Crash: Custom Panel UI Toggle while Undoing
D7795: Fix T74067: Crash when the UI accesses stale data
P1709 T80203 attempts to get a trace
P1710 T80203 crash report when when blender is run without a debugger
rB63c906e0a7e5: Fix T81340: UBSan: addition of unsigned offset causes overflow
rBddbf41d88d43: Fix T80104: Crash on making material local.
rBdefe21a7bbac: Doversion: move (fix) 2.80 checks to 2.90
rBcb0b0416f454: Fix T80258: UILayout.prop_search() issues with datablock names
rB0330d1af29c0: Fix T77900: File Browser in macOS fullscreen crashes

Event Timeline

Ankit Meel (ankitm) added a comment. Edited Oct 3 2020, 12:27 PM

Can confirm on Release and RelWithDebInfo builds (rB63c906e0a7e59).
No configuration built with ASan crashes.

Bastien Montagne (mont29) changed the task status from Confirmed to Needs Information from User. Oct 15 2020, 11:47 AM
Bastien Montagne (mont29) changed the subtype of this task from "Report" to "Bug".

Can you please check whether it still crashes for you if you disable the new undo system? In user preferences, Experimental section, Debugging panel, enable the Undo Legacy checkbox.

Can you please check whether it still crashes for you if you disable the new undo system? In user preferences, Experimental section, Debugging panel, enable the Undo Legacy checkbox.

I tried with 2.90 and 2.91 Alpha with legacy undo enabled. 20 minutes playing with different properties of different meshes and no crashes.

Thanks, then it is probably some kind of memory corruption... Will see if I can reproduce.

Bastien Montagne (mont29) changed the task status from Needs Information from User to Needs Information from Developers. Oct 15 2020, 2:40 PM
Bastien Montagne (mont29) triaged this task as High priority. Oct 15 2020, 2:46 PM
Bastien Montagne (mont29) edited projects, added BF Blender (2.91); removed BF Blender.

Tried to reproduce the issue here on Linux for quite some time, with release and debug+ASan builds, with one or thousands of objects already in the scene... Not a single crash. :(

@Ankit Meel (ankitm) since you can reproduce it, I am afraid this is on your desk for now.

Ankit Meel (ankitm) updated the task description.

debug+Asan builds,

Enabling Asan makes the crash disappear for me too, no matter what config.
Did you forget to disable legacy undo?

I had tried to get a trace the same day I confirmed it (RelWithDebInfo), but the crash was happening in too many places, with no consistent stack trace. In some places, even C was NULL. Is that normal?

note to self: https://stackoverflow.com/questions/14045208/how-to-set-a-breakpoint-in-malloc-error-break-to-debug/

@Ankit Meel (ankitm) I obviously had legacy undo disabled! Crash just does not happen on linux apparently...

Got lucky: a Debug build (without ASan) also crashes. Four attempts, four traces. https://developer.blender.org/P1709
Then launched Blender without a debugger and got this crash report: https://developer.blender.org/P1710 Line 7 sticks out. Is that the bug?

Ankit Meel (ankitm) added a comment. Edited Oct 15 2020, 9:57 PM

Got its malloc history and corresponding --debug-all event log too. At the end of the latter are the malloc messages about corruption.

Hmmmm, those traces remind me a lot of that other issue: T74067: Crash: Custom Panel UI Toggle while Undoing

That this only happens with the new undo (and never with an ASan build) could be explained by the fact that the operation triggering that out-of-sync situation in UI has to be very fast; otherwise the UI gets updated in between and the issue does not happen.

Anyway, can you try and see if D7795: Fix T74067: Crash when the UI accesses stale data fixes it for you? Hopefully this is still applicable easily on current master.

With D7795 rebased on master rBbb872b25f219d1a9bc2446228b6dc, the crash still happens.
Crash report/trace: https://developer.blender.org/P1709#8866 (there is also a black screen on the panel in this crash).
On another attempt: https://developer.blender.org/P1709#8867

the operation triggering that out-of-sync situation in UI has to be very fast,

yeah, I move the cursor from one corner of the screen to the other really fast.

With the same patch (as in my last comment), and running Blender without a debugger: https://developer.blender.org/P1710#8869
See the lines that wrongly set a (null) property where the radii were intended:

bpy.data.window_managers["WinMan"].(null) = 3  # Property
..
bpy.data.window_managers["WinMan"].(null) = 11.2  # Property

One more thing @Yevgeny Makarov (jenkm) also mentioned above:
with legacy undo, the cursor disappears while moving the slider.
with new undo, the cursor moves on the screen while moving the slider, and returns back to its original state (on the slider) when released.
Ignore if it's intended.

Don't really know what to add here... This is likely some stale data somewhere in the UI code, but besides that, being unable to reproduce...

@Sebastián Barschkis (sebbas) Can you please have a look at this, see if you can reproduce and then investigate it? I cannot do anything about it here on linux it'd seem, but this is a rather annoying issue still, and the bug is unlikely to actually be mac-only... Thanks. :)

On an old MacBook Pro running 10.15.6 I can reproduce the problem with 2.91.0 Beta, branch: master, commit date: 2020-11-13 00:36, hash: rB2e08500d047e. Blender crashes completely and disappears, leading to the macOS "has unexpectedly crashed" message.

With today's (November 17th) version, rB906ff7b8fea8, the same steps freeze Blender completely, and the mouse cursor shows Apple's famous "Beachball of Death". I was able to limp back into the Finder and launch a Terminal from there, where "top" showed Blender still running at 100% CPU usage. Moving the mouse back into the Blender window, however, the beachball is still there and Blender does not respond to any clicks or keypresses. The mouse cursor also remains completely hidden.

This definitely seems like some sort of memory error. Most likely related to undo/redo.

Each time the crash log is entirely different. Once in the depsgraph code, once in drawing, once in the UI code... Oh and once, every change to the radius value even created a new torus object (as seen in the Outliner), no other weird things and no crash happened.
That is with recent master. --debug-memory doesn't make a difference. So it doesn't seem like a use-after-free, more like an invalid write or so.

Apparently Valgrind isn't available for modern macOS? A memory tool like this would be useful.

I was going to do a bisect (from 2.82 to 2.83), but realized it will likely just end up being the new undo system merge or its enable-by-default commit. If somebody still wants to give it a go, by all means, feel welcome.


So I guess we need to give @Bastien Montagne (mont29) access to a Mac to debug this.
Or maybe somebody can reproduce on Linux+Clang?

I don't see a reason to believe this is caused by stale UI data just because some of the seemingly random logs show this. After redo is performed, most code that runs is UI code, so of course the logs after an invalid memory operation on undo/redo are biased towards that. And if you move the mouse faster, more value changes happen and it's more likely to fail in that period. I was able to reproduce it with slow mouse movements as well.

Not a fun one at all!

Apparently Valgrind isn't available for modern macOS? A memory tool like this would be useful.

valgrind: stable 3.16.1 (bottled), HEAD
==> Requirements
Required: macOS <= 10.13 ✘

Even if it were available on our macOS versions, it would introduce a slowdown, and that would make the bug go away. That's why we couldn't reproduce it with ASan or Debug + ASan builds.

I can reliably reproduce the crash with other mesh objects as well. E.g. with cubes and spheres.
However I can not reproduce it with non-mesh objects, e.g. curves, lamps and meta-balls.


Even if it were available on our macOS versions, it would introduce a slowdown, and that would make the bug go away. That's why we couldn't reproduce it with ASan or Debug + ASan builds.

I'm not at all convinced that it's the overhead of ASan that makes it go away. There are many more variables; AFAIK ASan has its own allocator.
Like I said I was able to reproduce it with really slow movements. Based on how mouse moves are handled in the main-loop (we drop all mouse moves but the most recent one), I kinda doubt that these could lead to a data-race.
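To make the "drop all mouse moves but the most recent one" behaviour concrete, here is a minimal sketch in Python. It is not Blender's event code; the Event type, the event kinds and the coalesce_mouse_moves name are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event record; the kinds used here are placeholders.
    kind: str
    payload: tuple = ()


def coalesce_mouse_moves(queue):
    """Return the queue with every mouse move dropped except the most recent one."""
    last_move = None
    for index, event in enumerate(queue):
        if event.kind == "MOUSEMOVE":
            last_move = index
    return [
        event
        for index, event in enumerate(queue)
        if event.kind != "MOUSEMOVE" or index == last_move
    ]


queue = [
    Event("MOUSEMOVE", (10, 10)),
    Event("KEY", ("A",)),
    Event("MOUSEMOVE", (20, 25)),
    Event("MOUSEMOVE", (40, 60)),
]
coalesced = coalesce_mouse_moves(queue)
```

With coalescing like this, a faster mouse does not put more moves through per main-loop iteration, which is the basis for doubting that mouse speed alone creates a data race.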

Two suggestions that I was unable to find in the comments so far, given that this looks like a race condition.

Helgrind on Linux

https://www.valgrind.org/docs/manual/hg-manual.html

Helgrind will sometimes report failing synchronization *even when it doesn't lead to crashes*, so this would be my suggestion for any Valgrind-savvy developers on Linux who want to track this down.

And it might be that this problem exists on Linux as well, but it just blows up a lot less often. In that case Helgrind might help.

I spent some time a number of years ago fixing threading issues in a largeish piece of code and Helgrind was gold! Super slow to execute but back then I got good diagnostics out of it and was able to fix some really non-obvious real world problems.

Thread Sanitizer on macOS

Asan is mentioned a lot in this thread, but if this is a race condition of some kind then Tsan might be what should really be used:
https://clang.llvm.org/docs/ThreadSanitizer.html

Haven't used Tsan, can't vouch for it, except for the fact that it exists and claims to have macOS support on its web page.

FWIW I removed address sanitiser, and added thread sanitizer. See the flags here. Couldn't get the crash in debug or relwithdebinfo builds.

Or maybe somebody can reproduce on Linux+Clang?

Tried again and clang (9 and 11) does not crash for me either on linux.

nos (nos) added a subscriber: nos (nos). Edited Dec 12 2020, 3:55 PM

I am not a developer but wanted to add that it also crashes on my Mac when changing the torus radius.

Nothing else in the project. Just a new torus, and it crashes.

MacBook Pro 2015, Mojave, AMD Radeon R9 M370X.

edit: just tried on 2.83.9 and the same thing happens.

At this point I am wondering why I am bothering with learning a program that has such a basic bug unsolved for at least the last 2 publicly released versions

Perhaps a better quality control should take place?

*edit2: "No compatible GPU's found" either for OpenCL or CUDA* Perhaps that's the problem?

*edit3:* After installing all versions from 2.82a to 2.90.1, I can confirm that only version 2.82a works without crashing

This comment was removed by Robert Guetzkow (rjg).
Johan Walles (walles) added a comment. Edited Jan 6 2021, 8:08 PM

Let's say somebody would want to use git bisect to figure out where this problem started.

  • Is there a git repo with complete history for the new undo system? Where?
  • On which branch / SHA did development start?

This problem already existed when the undo speedup first entered Blender master, I could repro at the below change.

If you want to try this, first enable Interface / Developer Extras, then in the new Experimental tab check Undo Speedup.

Two ways forward that I can come up with:

  • Re-review b852db57ba24adfcfaa0ada7e9ff513a79a399a2 with the new knowledge that it introduces this issue. @Bastien Montagne (mont29), since your name is on the change, do you think you could find the problem this way?
  • Find the history for that change and try to git bisect. Maybe look for this in T60695 or D6580, since those are mentioned?
commit b852db57ba24adfcfaa0ada7e9ff513a79a399a2 (HEAD)
Author: Bastien Montagne <b.mont29@gmail.com>
Date:   Tue Mar 17 12:29:36 2020 +0100

    Add experimental global undo speedup.
    
    The feature is hidden behind an experimental option, you'll have to
    enable it in the preferences to try it.
    
    This feature is not yet considered fully stable, crashes may happen, as
    well as .blend file corruptions (very unlikely, but still possible).
    
    In a nutshell, the ideas behind this code are to:
    * Detect unchanged IDs across an undo step.
    * Reuse as much as possible existing IDs memory, even when its content
      did change.
    * Re-use existing depsgraphs instead of building new ones from scratch.
    * Store accumulated recalc flags, to avoid needless re-compute of things
      that did not change, when the ID itself is detected as modified.
    
    See T60695 and D6580 for more technical details.

 release/scripts/startup/bl_ui/space_userpref.py  |  21 ++
 source/blender/blenkernel/BKE_blender_undo.h     |   5 +-
 source/blender/blenkernel/BKE_main.h             |   5 +
 source/blender/blenkernel/BKE_undo_system.h      |   3 +
 source/blender/blenkernel/intern/blender_undo.c  |  13 +-
 source/blender/blenkernel/intern/blendfile.c     |  38 +--
 source/blender/blenloader/BLO_readfile.h         |   9 +-
 source/blender/blenloader/BLO_undofile.h         |   4 +
 source/blender/blenloader/intern/readblenentry.c |  12 +-
 source/blender/blenloader/intern/readfile.c      | 320 ++++++++++++++++++++---
 source/blender/blenloader/intern/readfile.h      |  23 +-
 source/blender/blenloader/intern/undofile.c      |   9 +-
 source/blender/blenloader/intern/writefile.c     |   6 +
 source/blender/editors/undo/memfile_undo.c       | 120 ++++++++-
 source/blender/makesdna/DNA_ID.h                 |   4 +
 source/blender/makesdna/DNA_userdef_types.h      |   3 +-
 source/blender/makesrna/intern/rna_userdef.c     |   8 +
 17 files changed, 534 insertions(+), 69 deletions(-)
Johan Walles (walles) added a comment. Edited Jan 7 2021, 7:18 AM

Found the undo-experiments branch.

The latest commit into that branch did not seem to have this problem, or at least I was unable to repro.

commit 449aeb6a43c28e60728a46926545ca4cd570196c (HEAD, origin/undo-experiments)
Author: Bastien Montagne <b.mont29@gmail.com>
Date:   Tue Mar 3 12:15:49 2020 +0100

    Fix stupid mistake in key generation for temp deg storage.

 source/blender/blenkernel/intern/scene.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

@Johan Walles (walles) I don't think bisect is going to help really, once again: new undo revealed several issues in existing code, it did not create them.

The proper way to solve this imho is to actually investigate the reasons for the crash (which requires being able to reproduce it), using debug sessions etc.

Also, since this only happens on a specific OS/compiler version, I would suspect one of those two sources as being the issue here:

  • A hidden strict-C standard violation that only fails with highly optimized build from a specific, stricter compiler, like what we had some months ago already with strict aliasing (T81077: Build Bot: MacOS X test fails).
  • Some weird threading issue (not sure if someone already tried to run Blender from command line with -t 1 option and see if they could still reproduce the crash?).

...but again, these are wild guesses.

Crash also happens with 1 thread, see T84368 for debug output.

I ran Thread Sanitizer (tsan) on Blender version
bc788929aa2bd259670a5562a1f403f25cad4625 (recent master) on macOS.

Changing a torus radius one notch got me 30 warnings.

To find some common theme I did this:

grep -E ' #[0-9]+ ' tsan-change-torus-radius-bc788929.txt|sort|uniq -c|sort -g

... and found 26 of these (out of 30):

blender::deg::Depsgraph::add_id_node(ID*, ID*) depsgraph.cc:123 (Blender:x86_64+0x1088dd6fa)

To find examples in the attached file, just search it for add_id_node.
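For anyone without a Unix shell, a rough Python equivalent of the grep pipeline above might look like this. Only the ' #N ' frame pattern is taken from the command; the sample text is an invented placeholder, not real tsan output, and count_frames is a hypothetical helper name.

```python
import re
from collections import Counter

# Same pattern as the grep -E expression: a space, '#', digits, a space.
FRAME = re.compile(r" #[0-9]+ ")


def count_frames(text):
    """Count identical stack-frame lines, least frequent first (like sort | uniq -c | sort -g)."""
    lines = [line for line in text.splitlines() if FRAME.search(line)]
    return sorted(Counter(lines).items(), key=lambda item: item[1])


# Placeholder input for illustration only; NOT actual Thread Sanitizer output.
sample = """\
WARNING: placeholder report 1
    #0 func_a file_a.c:1
    #1 func_b file_b.c:2
WARNING: placeholder report 2
    #0 func_c file_c.c:3
    #1 func_b file_b.c:2
"""

counts = count_frames(sample)
```

The most frequent frame ends up last in counts, the same way the most common line ends up at the bottom of the shell output.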

Read Before Write

One way to read a tsan warning is to convert it into a timeline.

Picking the first warning in the tsan output, which is one of the add_id_node() ones, a timeline could look like this (see attachment for full backtraces):

  1. In the main thread, add_id_node() allocates a new node by calling init_copy_on_write()
  2. This node is read by thread T18, which looks like a worker thread, in BKE_mesh_minmax() mesh.c:1475
  3. The node's contents are written to by the main thread, in BKE_mesh_update_customdata_pointers() mesh.c:767
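The three steps above can be replayed as a small Python sketch. This is not Blender code: Node, bounds and the events are hypothetical, and the events exist only to force the ordering from the timeline so the example is deterministic. In a real race nothing enforces any particular order, which is exactly the problem.

```python
import threading


class Node:
    """Hypothetical stand-in for a freshly allocated node."""

    def __init__(self):
        self.bounds = None  # contents not filled in yet


shared = {}
published = threading.Event()
read_done = threading.Event()
observed = []


def worker():
    published.wait()
    # Step 2: the worker reads the node before the main thread has written to it.
    observed.append(shared["node"].bounds)
    read_done.set()


thread = threading.Thread(target=worker)
thread.start()

# Step 1: the main thread allocates the node and makes it visible.
shared["node"] = Node()
published.set()

# Demonstration only: wait for the read so that the write below comes last.
read_done.wait()

# Step 3: the main thread writes the node's contents after the worker read them.
shared["node"].bounds = (0.0, 1.0)
thread.join()
```

Here observed holds the unfilled value the worker saw, even though the node ends up with proper contents afterwards.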

Not sure how much this helps, but I do think it would be worth it for somebody familiar with how synchronization is meant to work to have a look at these diagnostics and see if something can be improved.

For reference, possibly related issues are being investigated in T84397.

@Johan Walles (walles) Thank you for looking into this. Could you check what this looks like when you start Blender with --debug-depsgraph-no-threads -t 1 as well?

I can't redo this bug with the patch in https://developer.blender.org/T84397#1089432 applied.
But it could be that I didn't move the slider fast enough, so would be nice if someone else also tests it.

For reference, possibly related issues are being investigated in T84397.

@Johan Walles (walles) Thank you for looking into this. Could you check what this looks like when you start Blender with --debug-depsgraph-no-threads -t 1 as well?

With --debug-depsgraph-no-threads -t 1 I get no thread warnings whatsoever.

With just --debug-depsgraph-no-threads I get a similar amount of warnings as above, but none about add_id_node().

With just -t 1 I get no thread warnings whatsoever.

Robert Guetzkow (rjg) added a comment. Edited Jan 9 2021, 2:48 PM

It seems like you might have found another issue then, as the problem from T84397 still happens with --debug-depsgraph-no-threads -t 1.

I can't redo this bug with the patch in https://developer.blender.org/T84397#1089432 applied.
But it could be that I didn't move the slider fast enough, so would be nice if someone else also tests it.

I tried with that patch and I couldn't repro either after I applied it.

Without the patch I was able to crash Blender by resizing the torus.

Modified version of D10077 has been committed as rBabbc43e4e419, please report if this fixes this issue or not.
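The title of D10077 says it uses `session_uuid` instead of ID pointers in depsgraph storage. As a rough illustration of why pointer keys can be fragile, here is a toy sketch in Python. Everything in it is an assumption made for the example (ToyHeap, the two dictionaries, the cached strings); it does not show Blender's actual storage or the actual patch.

```python
import itertools


class ToyHeap:
    """Hypothetical allocator that hands out integer 'addresses' and reuses freed ones."""

    def __init__(self):
        self._free = []
        self._next = 0x1000

    def alloc(self):
        if self._free:
            return self._free.pop()  # a freed address is handed out again
        address = self._next
        self._next += 0x10
        return address

    def free(self, address):
        self._free.append(address)


heap = ToyHeap()
session_ids = itertools.count(1)  # never repeats within a session

by_address = {}     # storage keyed by pointer-like address
by_session_id = {}  # storage keyed by a session-unique id

# Create a torus and cache something for it under both kinds of key.
torus_address, torus_sid = heap.alloc(), next(session_ids)
by_address[torus_address] = "cached data for torus"
by_session_id[torus_sid] = "cached data for torus"

# The torus is freed, but (in this toy scenario) the cached entries are left behind.
heap.free(torus_address)

# A new cone happens to get the torus's old address, but a fresh session id.
cone_address, cone_sid = heap.alloc(), next(session_ids)

stale_hit = by_address.get(cone_address)   # finds the torus's old entry
clean_miss = by_session_id.get(cone_sid)   # finds nothing, as it should
```

Creating and deleting many objects quickly, as in T84397, is the kind of workload where address reuse of this sort becomes likely.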

Julian Eisel (Severin) closed this task as Resolved. Jan 12 2021, 5:41 PM
Julian Eisel (Severin) claimed this task.

No more crash for me! Double checked the commit before, and that still crashes reliably.

Glad to see such a nasty one gone.

Can confirm. rBabbc43e4e419 also solved T84387. Thanks.