
Cycles HIP rendered viewport crashes system/GPU on Linux with RDNA2 GPU
Closed, Archived, Public

Description

System Information
Operating system: Fedora Linux Release 36 - Linux-5.18.16-200.fc36.x86_64-x86_64-with-glibc2.35 64 Bits
Graphics card: AMD Radeon RX 6900 XT (sienna_cichlid, LLVM 14.0.0, DRM 3.46, 5.18.16-200.fc36.x86_64) AMD 4.6 (Core Profile) Mesa 22.1.5
ROCm Version: rocm-5.2.0

Blender Version
Broken: version: 3.2.2, branch: master, commit date: 2022-08-02 18:15, hash: rBbcfdb14560e7
Worked: never

Short description of error
When using HIP on an AMD GPU with Cycles GPU rendering, my entire system freezes when I use the rendered viewport, and I have to hard-reset the PC. I think it's the GPU or the GPU driver that crashes.
Kernel messages from journalctl suggest that amdgpu (the kernel module?) is crashing (see attached).

The crash is not always immediate and might instead only happen after using the rendered viewport for a while or switching back and forth between textured and rendered a few times.

Rendering with F12 works fine and I have observed no other crashes as long as I don't use the rendered viewport.

I only started using Blender a few days ago. While I was creating procedural textures in the node editor, I used the rendered viewport all the time and the crash didn't happen once.
The problem started once I began working with image-texture-based assets like the ones in the attached .blend file.
Maybe it's related to this issue: https://developer.blender.org/T97591

Can I somehow get more detailed debug output from Blender? The "-d" switch didn't produce much additional relevant output.

Exact steps for others to reproduce the error
I could reliably reproduce the issue by following these steps:

  1. Open two Blender instances with the attached .blend file loaded and have the viewport running in rendered mode.
  2. Move the viewport in the first instance to get it to "refresh"/re-render.
  3. While the first instance is still working, move the viewport in the second instance. For me the system/GPU freeze/crash happens right there.

Let me know what info I can provide to help.

Thank you!

Event Timeline

That issue is not known to happen with RDNA2 cards.

But it would be good to verify whether it happens under the same conditions, with image textures whose resolution is not a multiple of 128. The attached .blend does not contain the image textures, so we can't tell.

I assume Blender itself doesn't "crash" when my system locks up, because I can't find a /tmp/blender-crash.txt file.

I started it with the parameters listed in the manual (thanks for the links!)

blender --factory-startup --debug-all

and attached logfiles for both blender sessions (blender1.log and blender2.log)
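For anyone who wants to collect the same kind of logs, here is a minimal sketch of how the two sessions could be started from a script. It only uses the two flags quoted above; the helper names are made up for illustration, and the file names test.blend, blender1.log and blender2.log are just the ones mentioned in this thread. Moving the viewports still has to be done by hand.

```python
import subprocess


def blender_debug_command(blend_file, blender_bin="blender"):
    """Build the command line quoted above, plus the .blend file to open."""
    # Options are placed before the file so they apply when it is loaded.
    return [blender_bin, "--factory-startup", "--debug-all", str(blend_file)]


def launch_logged(blend_file, log_path, blender_bin="blender"):
    """Start one Blender instance and write its stdout and stderr to log_path.

    Hypothetical helper: it returns the Popen handle and does not wait.
    """
    with open(log_path, "w") as log_file:
        # stderr is merged into stdout so the log keeps the message order.
        return subprocess.Popen(
            blender_debug_command(blend_file, blender_bin),
            stdout=log_file,
            stderr=subprocess.STDOUT,
        )
```

Calling `launch_logged("test.blend", "blender1.log")` and then `launch_logged("test.blend", "blender2.log")` would start both sessions. If the whole system freezes, the last lines may never reach the disk, so the logs can end before the interesting part.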

I hope this attached blend file contains the textures now - (I used the "Automatically Pack Resources" button)

Thank you for your help.

Thanks. The file contains some images with a resolution that is not a multiple of 128, but those are not used by any shader. So I think this is likely a different issue than T97591.
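A check like this can be scripted. The following is only a rough sketch of one way to do it, not necessarily how it was done here: the function names are invented, and the Blender part only looks at Image Texture nodes placed directly in material node trees (node groups and world shaders are not followed).

```python
def is_multiple_of_128(width, height):
    """True if both dimensions are multiples of 128."""
    return width % 128 == 0 and height % 128 == 0


def report_off_size_images(images, used_names):
    """List (name, width, height, used_by_shader) for images whose size
    is not a multiple of 128.

    images: dict mapping image name to (width, height)
    used_names: set of image names referenced by shader nodes
    """
    rows = []
    for name, (width, height) in images.items():
        if not is_multiple_of_128(width, height):
            rows.append((name, width, height, name in used_names))
    return rows


def collect_from_blender():
    """Gather image sizes and shader usage. Only works inside Blender."""
    import bpy  # available only in Blender's bundled Python

    images = {img.name: tuple(img.size) for img in bpy.data.images}
    used = set()
    for mat in bpy.data.materials:
        if mat.node_tree is None:
            continue
        for node in mat.node_tree.nodes:
            # Image Texture nodes have type 'TEX_IMAGE'.
            if node.type == "TEX_IMAGE" and node.image is not None:
                used.add(node.image.name)
    return images, used
```

Inside Blender's Python console, `report_off_size_images(*collect_from_blender())` would print the candidates. Images without pixel data report a size of 0 x 0 and are therefore not listed.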

There is a bug with a similar backtrace that may affect your kernel version 5.18.16.
https://gitlab.freedesktop.org/drm/amd/-/issues/2050
https://bugzilla.kernel.org/show_bug.cgi?id=216173
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1016548

I don't know if it's the same issue, but it sounds similar.

Attempting to render test.blend by pressing F12 with "GPU Compute" enabled resulted in Blender crashing on my system.

System Information
Operating system: Fedora Linux Release 36 - Linux-5.18.16-200.fc36.x86_64-x86_64-with-glibc2.35 64 Bits
Graphics card: AMD Radeon RX 5700 XT (navi10, LLVM 14.0.0, DRM 3.46, 5.18.16-200.fc36.x86_64) AMD 4.6 (Core Profile) Mesa 22.1.6
ROCm Version: rocm-5.2.0

Blender version
version: 3.2.1, branch: unknown, commit date: 1970-01-01 00:00, hash: unknown, type: Release (installed from the Fedora 36 updates-testing repository)

The crash log looks a lot like T97591.

@flavonol (flavonol), that's almost certainly the same bug as T97591 and different than this report.

I looked a bit more into the kernel bug with similar backtrace, and it's not exactly the same issue. That's a bug introduced in a release candidate of 5.19 that was fixed in the final release. So it wouldn't affect 5.18.16. Still this area seems to be under active development so maybe 5.19 or newer kernel versions have a fix.

In any case, if the system freezes that's a bug in the kernel. Maybe there is also a bug in Blender that triggers it, but it doesn't seem all that likely to me.

CC @Brian Savery (bsavery) @Sayak Biswas (sayakAMD).

I guess the next step would be to test with a newer kernel version, or get some input from AMD.
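When comparing reports from different machines, it can help to compare kernel versions programmatically instead of by eye. A small sketch, assuming release strings shaped like the ones in this report (for example 5.18.16-200.fc36.x86_64); the function names are illustrative only.

```python
import platform
import re


def parse_kernel_release(release):
    """Turn a release string like '5.18.16-200.fc36.x86_64' into (5, 18, 16)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch level is treated as 0.
    return int(major), int(minor), int(patch or 0)


def is_at_least(release, minimum):
    """Compare a release string against a (major, minor, patch) tuple."""
    return parse_kernel_release(release) >= minimum


def running_kernel_is_at_least(minimum):
    """Same check for the kernel this script is running on."""
    return is_at_least(platform.release(), minimum)
```

For example, `running_kernel_is_at_least((5, 19, 0))` tells you whether the machine is already on 5.19 or newer. Distribution suffixes such as -200.fc36 are ignored, so backported fixes are not visible to this check.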

I am unable to reproduce this issue on Ubuntu 20.04.4; it is running kernel 5.15 though. I will upgrade to 5.19.1 via a PPA and see if it reproduces.

Okay, with 5.19 kernel I am able to kind of reproduce the issue. It doesn't cause a system freeze for me, but both instances of blender freeze.

I don't have access to the 5.19 Kernel right now but I tried with the 6.0-rc1 from the fedora-rawhide repository and it's still the same.
I think I'm going to open a bug report on bugzilla.redhat.com for the system crash (I hope that's the correct place).

Same problem with an RX 6800 XT.
Any version of Blender.
Arch Linux, kernel 5.19.3.
HIP version 22.20.3.50203-1 (the previous version gave the same error).

The problem also occurs only with viewport rendering, especially after enabling wireframe.
After the "freeze", a second TTY shows the message "failed to initialize parser -125!" repeated multiple times.
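If the machine has to be hard-reset, the kernel messages from the frozen session can usually still be read after reboot, provided the journal is stored persistently. Below is a rough sketch for pulling out the relevant lines; it assumes `journalctl -k -b -1` (kernel messages of the previous boot) works on your system, and the function names are invented for this example.

```python
import subprocess


def amdgpu_lines(journal_text):
    """Keep only the lines that mention amdgpu (case-insensitive)."""
    return [line for line in journal_text.splitlines() if "amdgpu" in line.lower()]


def count_message(journal_text, needle):
    """Count how many lines contain the given message (case-insensitive)."""
    needle = needle.lower()
    return sum(1 for line in journal_text.splitlines() if needle in line.lower())


def previous_boot_kernel_log():
    """Read the kernel messages of the previous boot from the journal."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `count_message(previous_boot_kernel_log(), "failed to initialize parser -125!")` would show how often that message was logged before the reset.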

I also tried the same software configuration (same versions of Arch, Blender, HIP, etc.) on a laptop with a 6700M, and everything works perfectly.

Also, to be sure the problem is not in the video card itself,
I tested the RX 6800 XT under Windows 11 and there was no problem.

Thanks for the additional info!

It's interesting that your 6700M laptop doesn't have this issue. Maybe mobile and desktop GPUs are somehow treated differently?

Does your laptop maybe have a different CPU than your desktop (Intel/AMD)?

My CPU and GPU are both AMD: RX 6900XT with Ryzen 7 5800x

I created a bug report on RedHat Bugzilla for this issue:
https://bugzilla.redhat.com/show_bug.cgi?id=2119986

viktor (viktor3d) added a comment. Edited Aug 27 2022, 1:31 PM

CPU and GPU are AMD on both PCs:
laptop: Ryzen 7 5800H, RX 6700M
desktop: Ryzen 9 3950X, RX 6800 XT

If I'm right, the 6600-6700 XT and the 6800-6900 XT have technologically different GPU chips,
so I guess part of the HIP driver may be different too.

One user on the fedora forum mentioned he has the same issue on a 6700XT:
https://discussion.fedoraproject.org/t/blender-amd-and-opencl/40199/12?u=joni999

Well, I have no idea then...
Same system/kernel/software; even all the settings were simply moved to the laptop by copying and mounting the home folder.
I also tried kernel linux-lts-5.15.63-1 on the desktop today and it did not change the situation; the same problem still happened.

But I found a workaround (two rules) to avoid the freezes. First, you should have only ONE area with a 3D VIEWPORT. Mine looked like this:

To avoid the freezes it should look like this:


The second rule is that OVERLAYS should be OFF. Then the freezes do not happen for me. Tested on kernel 5.15 and on 5.19 too, same result.
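The two rules can also be checked from Blender's Python console. This is only a sketch of the workaround described above, with invented function names. It works on any list of area-like objects, so inside Blender you would pass `bpy.context.screen.areas`.

```python
def view3d_areas(areas):
    """Return the areas that currently show a 3D Viewport."""
    return [area for area in areas if area.type == "VIEW_3D"]


def disable_overlays(areas):
    """Turn overlays off in every 3D Viewport; return how many were changed."""
    changed = 0
    for area in view3d_areas(areas):
        for space in area.spaces:
            # Only 3D Viewport spaces have overlay settings.
            if space.type == "VIEW_3D" and space.overlay.show_overlays:
                space.overlay.show_overlays = False
                changed += 1
    return changed
```

In Blender: `import bpy`, then `len(view3d_areas(bpy.context.screen.areas))` should be 1 for the first rule, and `disable_overlays(bpy.context.screen.areas)` applies the second rule to the current screen. Other open windows or workspaces are not touched by this sketch.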

If multiple 3D viewports or overlays are the problem, I guess there is some issue with the GPU driver handling HIP and OpenGL work simultaneously. Or there is some kind of race condition or other problem that shows up while the driver is under heavy load. If it were an issue in the Cycles kernel code or the HIP compiler that leads to the kernel going into an infinite loop, for example, it would most likely also hang when rendering with Cycles alone.

I believe the most relevant bug tracker is this one, where some similar-sounding hangs/freezes were reported.
https://gitlab.freedesktop.org/drm/amd/-/issues

For example this:
https://gitlab.freedesktop.org/drm/amd/-/issues/2083

Thanks for the link to the bug tracker - I created a ticket:
https://gitlab.freedesktop.org/drm/amd/-/issues/2145

Thank you viktor for the workaround - I'll try that out (although losing the overlays is a high price in my opinion)

Update:

I'm pretty sure it's an issue with the amd driver setup.

On the drm/amd bug tracker I was asked to reproduce the issue on Red Hat Enterprise Linux 8.6, because Fedora is not officially supported.
So I installed RHEL and then the amdgpu-install package via the official RPM.

Then running
#> sudo amdgpu-install --usecase=hip
worked without issues and I could not reproduce the error on RHEL 8.6 - it works like it's supposed to.

So I tried the same thing on Fedora 36. There, the amdgpu-dkms package, which seems to be a kernel module that is built against the current kernel, does not build or install and throws errors instead. That leaves me with a functional ROCm, but I still experience the issue.

I'm trying to get the kernel module to compile on Fedora, but it seems there are some checks for certain kernel settings that I honestly have little clue about.

Here's the build error:

[joni@linuxjoni02 yum.repos.d]$ cat /var/lib/dkms/amdgpu/5.16.9.22.20-1438747.el9/build/make.log
DKMS make.log for amdgpu-5.16.9.22.20-1438747.el9 for kernel 5.19.4-200.fc36.x86_64 (x86_64)
Fri Sep  2 10:20:45 PM CEST 2022
make: Entering directory '/usr/src/kernels/5.19.4-200.fc36.x86_64'
/var/lib/dkms/amdgpu/5.16.9.22.20-1438747.el9/build/Makefile:16: *** dma_resv->seq is missing., exit....  Stop.
make: *** [Makefile:1851: /var/lib/dkms/amdgpu/5.16.9.22.20-1438747.el9/build] Error 2
make: Leaving directory '/usr/src/kernels/5.19.4-200.fc36.x86_64'
[joni@linuxjoni02 yum.repos.d]$

TL;DR: It's not a Blender issue. I'll close this ticket tomorrow and continue here:
https://discussion.fedoraproject.org/t/blender-amd-and-opencl/40199

Thanks to everyone!

Jonathan (joni999) closed this task as Archived. Sun, Sep 4, 3:39 PM
Jonathan (joni999) added a comment. Edited Sat, Sep 24, 1:19 PM

If you are experiencing the same issue and you are on one of the distros supported by ROCm, please share your experience in this AMD ticket:
https://gitlab.freedesktop.org/drm/amd/-/issues/2145
I hope this can be fixed soon, so Cycles will be usable again!