Build Bot: MacOS X test fails
Closed, Resolved | Public | BUG

Description

Since we released Blender 2.90.0, the build bot tests have been failing on macOS.

        Start  51: script_pyapi_prop_array
 51/157 Test  #51: script_pyapi_prop_array ...................   Passed    0.38 sec
        Start  52: id_management
 52/157 Test  #52: id_management .............................***Exception: SegFault  2.89 sec
......Blender 2.90.1 (hash 3e85bb34d0d7 built 2020-09-23 07:24:50)
found bundled python: /Users/blender/blender-buildbot/macos_290/install/Blender.app/Contents/Resources/2.90/python

----------------------------------------------------------------------
Ran 6 tests in 2.646s

OK
Writing: /var/folders/5s/6pmgq7ns62ng1r77k17kv6fm0000gn/T/blender.crash.txt

        Start  53: blendfile_io
 53/157 Test  #53: blendfile_io ..............................   Passed    0.35 sec
        Start  54: blendfile_liblink
 54/157 Test  #54: blendfile_liblink .........................   Passed    0.35 sec
        Start  55: bmesh_bevel

The failure occurs most of the time and only on the Mac build bot. It is always the same test, but it is not tied to a specific commit.
2.90.0 was released with passing tests, but the test started to fail the day after. Strangely enough, the test did pass once two days ago, after we added all the fixes for 2.90.1.

This needs investigation. I set it to Unbreak now as this halts the release for 2.90.1. Is there anything I can do?

Event Timeline

Jeroen Bakker (jbakker) changed the task status from Needs Triage to Confirmed. Sep 23 2020, 9:44 AM
Jeroen Bakker (jbakker) triaged this task as Unbreak Now! priority.
Jeroen Bakker (jbakker) created this task.

I am not sure why I'm in the subscribers. This is not specific to the buildbot setup; it will happen on any macOS build. Compiling with ASAN will make it easier to catch the issue.

Is there anything I can do?

I do not think so. Someone on a Mac should dig into it and see whether something is wrong in the test itself or in the code.

I set it to Unbreak now as this halts the release for 2.90.1

Not sure why this should be done on the day of release. This is not a newly introduced issue. For the release it is safer NOT to make code changes at this point.

That doesn't mean we should not fix the issue; it is just that, to me, this is not a showstopper for 2.90.1.

Jeroen Bakker (jbakker) lowered the priority of this task from Unbreak Now! to High. Sep 23 2020, 10:32 AM

Lowering the priority: after testing with the dmg, we decided to continue with the release as is.

I got these tests failing on my macOS machine with today's master (a6b16cfd801f):

11 - id_management (SEGFAULT)
29 - export_ply_vertices (Failed)
50 - cycles_volume (Failed)

Needs further investigation.

Bastien Montagne (mont29) changed the subtype of this task from "Report" to "Bug". Sep 24 2020, 9:54 AM

Regarding id_management, did someone check that it was not a mere 'out of RAM' issue? Those tests are run in parallel now, iirc this can consume quite a lot of memory… Would also explain why it passes sometimes, and sometimes not?

Not sure how valid this remark is though, don't know the specs of our buildbots.

Those tests are run in parallel now

Where is this information coming from?

The issue can be easily reproduced on macOS by:

  1. Compile Blender with ASAN
  2. ctest -R id_management

Please look into actual problems rather than speculating that something is wrong on the buildbot.

I am not speculating, I am asking a question, after facing the same out-of-memory issue here. And I would like to know how I am supposed to investigate an issue that only shows up on an OS I have absolutely no access to.

@Bastien Montagne (mont29), I'm not sure why you're the one who is supposed to look into the issue. As I've mentioned above, someone on macOS should look into it; it is easy to reproduce and not specific to the buildbot.
At this time I don't think you should be looking into this issue. Give the mac people some time to dig deeper, and maybe eventually assist with addressing the root cause (after it is identified).

Removed. I couldn't reproduce the original crash and thought that what I fixed was what was happening on the buildbot.

Ankit Meel (ankitm) added a comment. Edited Sep 26 2020, 1:48 AM

Please ignore the previous comment, it is a separate issue.
The Debug build didn't crash at all, so I built Release with ASan and got a heap-use-after-free caused by the experimental method batch_remove(..): P1659

The day started with P1726#8937 showing that id->us is not 0 and MECube was being freed when there was still a user. Output was:

id_delete: deleting MECube (1)
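
The invariant that this output shows being broken can be stated in a few lines. The sketch below is not Blender's id_delete; Block, block_is_unused and block_user_release are hypothetical names chosen for illustration, with the us field standing in for the user count that the output above prints in parentheses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a datablock; `us` mirrors the user count
 * that the debug output above prints in parentheses. */
typedef struct Block {
  const char *name;
  int us;
} Block;

/* A block is only safe to free once nothing references it any more. */
bool block_is_unused(const Block *block)
{
  assert(block != NULL);
  return block->us == 0;
}

/* Drop one user reference, never going below zero, and return the new count. */
int block_user_release(Block *block)
{
  assert(block != NULL);
  if (block->us > 0) {
    block->us--;
  }
  return block->us;
}
```

With us equal to 1, as in the MECube line, block_is_unused returns false; that is the state in which the datablock was nonetheless being freed.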

Later, @Julian Eisel (Severin) shared a patch that had fixed it for him, but not for me. https://pasteall.org/4OjK/slim

Later, after a lot of debug statements and misguided breakpoints, I found that the body of the for-loop for (id = last_remapped_id->next; id; id = id->next) { is not even being executed. While trying to debug why that is, P1726#8932 surprisingly fixed the test and also brought id->us from 1 down to 0.
Ray suggested P1726#8936 and that is also a fix.

The crash/test failure happens only in Release and RelWithDebInfo builds, not Debug ones (ASAN doesn't affect that).

From my uneducated point of view, this sounds like the Clang optimizer being overly aggressive here, to say the least...

Those patches are useful for investigating, but none of them is an acceptable fix of course. They are all ways to 'hide' the issue, either with extra processing that somehow forces the compiler to generate correct code again or by disabling optimization.

I will try with clang on linux tomorrow out of curiosity (what is the version on OSX btw?), but did you try a full explicit init of tagged_deleted_ids, with two NULL pointers? That's the only obvious thing I can see from quickly checking the code again?

And obviously, big thanks to everybody for investigating this hairy issue!

buildbot is using "AppleClang 12.0.0.12000032"
Julian is using "Apple clang version 12.0.0 (clang-1200.0.32.21)"
I'm using LLVM "clang version 12.0.0 (https://github.com/llvm/llvm-project.git e139450166a7c23ad42f839eddb1e34553967d78)"
I also tested "AppleClang 10.0.1.10010046".
Same results in all four.

did you try a full explicit init of tagged_deleted_ids, with two NULL pointers?

P1726#8940, this? It crashes with this patch applied.

Yes, it was the only potentially fuzzy thing I could spot (though I would not have expected it to be an issue)...

No issues here with clang 9, trying with clang 11 now...

And Clang 11 also passes fine here :(

Comparison of clang and gcc: https://godbolt.org/z/nKzaqE
Code of interest:

if (last_remapped_id == NULL) {
  dummy_link.next = tagged_deleted_ids.first;
  last_remapped_id = (ID *)(&dummy_link);
}
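
To make the shape of that snippet easier to reason about, here is a minimal, self-contained sketch of the same "start from a dummy when nothing was processed yet" traversal, next to a variant that picks the start node explicitly. It is an illustration only, not Blender's code and not the fix that was committed. Node and the count_pending_* functions are hypothetical names. Unlike the original, the dummy here has the same type as the list nodes, so no cast is involved; whether the cast plays any role in the miscompile is not established in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified list node; not Blender's ID or Link struct. */
typedef struct Node {
  struct Node *next;
  struct Node *prev;
} Node;

/* Shape of the snippet above: if nothing was processed yet, a stack dummy
 * whose `next` points at the list head lets the loop always start at ->next. */
int count_pending_with_dummy(Node *first, Node *last_processed)
{
  Node dummy = {NULL, NULL};
  if (last_processed == NULL) {
    dummy.next = first;
    last_processed = &dummy;
  }
  int count = 0;
  for (Node *n = last_processed->next; n != NULL; n = n->next) {
    count++;
  }
  return count;
}

/* Same traversal without a dummy: pick the start node explicitly. */
int count_pending_direct(Node *first, Node *last_processed)
{
  Node *start = (last_processed != NULL) ? last_processed->next : first;
  int count = 0;
  for (Node *n = start; n != NULL; n = n->next) {
    count++;
  }
  return count;
}

/* Usage: both variants must agree on a three-node list a -> b -> c. */
void check_pending_variants(void)
{
  Node c = {NULL, NULL};
  Node b = {&c, NULL};
  Node a = {&b, NULL};
  assert(count_pending_with_dummy(&a, NULL) == 3);
  assert(count_pending_direct(&a, NULL) == 3);
  assert(count_pending_with_dummy(&a, &a) == 2);
  assert(count_pending_direct(&a, &a) == 2);
}
```

A check like check_pending_variants is only meaningful when built with the same optimization level as the failing configuration, since the failure reported here shows up in Release and RelWithDebInfo builds but not in Debug.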

Julian found this fix P1726#8941 (making last_remapped_id volatile)