Description of problem:
After upgrading to kernel 5.4.2-1, my system has been experiencing 2D lock-ups that recover after approximately five to ten seconds. In dmesg, this message appears immediately after a hang:

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

Version-Release number of selected component (if applicable):
Mageia 7.1 x86_64

How reproducible:
This issue is obviously tied to amdgpu, so a system with a Volcanic Island GPU is probably required. My system is running an AMD 2400G Raven Ridge with the integrated Vega 11 GPU. This problem did not manifest with the 5.3.13 or any earlier kernel on this hardware.

Steps to Reproduce:
1. Install the 5.4.2 kernel.
2. Watch a YouTube video, or just bring up a web page and scroll.
3. Wait for the screen to hang and, once it "unhangs", run "dmesg -T".

Occasionally this shows up in dmesg -T:

[Sat Dec 14 17:14:48 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[Sat Dec 14 17:14:53 2019] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[Sat Dec 14 17:14:58 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

I have not observed this phenomenon on any other hardware. It's certainly tied to amdgpu, but it may also be unique to the Raven Ridge Vega 11 GPU.
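For anyone triaging a similar hang, the relevant amdgpu errors can be filtered out of the kernel log with a simple grep. This is a sketch, not part of the original report: the sample log text below stands in for real dmesg output.

```shell
# Filter the amdgpu ring-timeout / fence errors out of a kernel log.
# The sample below stands in for real `dmesg -T` output; on a live
# system pipe dmesg itself:
#   dmesg -T | grep -E 'amdgpu_(job_timedout|dm_atomic_commit_tail)'
dmesg_sample='[Sat Dec 14 17:14:48 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[Sat Dec 14 17:14:53 2019] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[Sat Dec 14 17:14:55 2019] usb 1-2: new high-speed USB device'

matches=$(printf '%s\n' "$dmesg_sample" | grep -E 'amdgpu_(job_timedout|dm_atomic_commit_tail)')
printf '%s\n' "$matches"
```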
Thank you for reporting this, and for all the details you provided. Assigning to the kernel/drivers group.
Assignee: bugsquad => kernel
I'm bumping this one up a bit, since it would be a show-stopper if there weren't a workaround (using the 5.3 kernel). The combination of Raven Ridge (2400G) + Mesa 19.3.1 + 5.4.6 kernel = unusable machine. The combination of Raven Ridge (2400G) + Mesa 19.3.1 + 5.3.13 = perfectly usable machine with the ACO shader compiler.

This appears to be a regression involving the amdgpu module and the 5.4.x kernel, and it is exacerbated by Mesa 19.3.1. With Mesa 19.2.7 the system could recover from the regression on its own, but with 19.3.1 it can be recovered only once, by restarting sddm; the second sddm restart requires a reboot. To make sure it wasn't some odd setting I was using, I created a new user and set autologin for that user, but it had no effect.

To replicate: autologin on an AMD Raven Ridge system running a 5.4.(2 or 6) kernel and Mesa 19.3.1, then start google-earth and firefox. The screen will freeze, and in dmesg -T this will appear:

[Fri Dec 27 21:06:24 2019] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[Fri Dec 27 21:06:29 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

To reiterate: 5.3.7 has no issues whatsoever; 5.4.2 and 5.4.6 have issues with Mesa 19.2.7, but the system recovers (after about 10 seconds); with Mesa 19.3.1 there is a full hang, with restarting sddm giving a one-time recovery. I don't think this is a Mesa problem; it is more likely a Raven Ridge issue.

My other systems are not having any issues: a 2400G with its integrated GPU disabled, using a Polaris 570, has no problems, and none of my other systems running 5.4.6 with video from Intel, AMD, or nVidia are manifesting any problems. This appears to be unique to Raven Ridge.
Priority: Normal => High
Severity: normal => major
Looks like there are still upstream issues with this, even in drm-next:

https://gitlab.freedesktop.org/drm/amd/issues/847
https://gitlab.freedesktop.org/drm/amd/issues/934

and some others. You could try a workaround that has worked for some. As root, do:

echo "export AMD_DEBUG=nodcc" > /etc/profile.d/amd_fix.sh
chmod +x /etc/profile.d/amd_fix.sh

and reboot. Does that help?
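The two commands above can be wrapped in a small script. This sketch writes to a temporary directory so the result can be inspected first; the real target path is the /etc/profile.d/amd_fix.sh given above, written as root and followed by a reboot.

```shell
# Sketch of the suggested workaround: a profile.d snippet that exports
# AMD_DEBUG=nodcc for every login shell. Written to a temp directory
# here for inspection; on the real system use /etc/profile.d/amd_fix.sh
# (as root), then reboot.
target_dir=$(mktemp -d)
echo 'export AMD_DEBUG=nodcc' > "$target_dir/amd_fix.sh"
chmod +x "$target_dir/amd_fix.sh"
cat "$target_dir/amd_fix.sh"
```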
CC: (none) => tmb
Thank you, Thomas. Your suggestion did help, but it only works after the first hang, once I ssh in and restart sddm. Thank you for the links; I'll follow those around and see where they go. I had not seen them before, but it's nice to know I'm not alone. Since this does not appear to be a Mageia-specific issue, and I have a working system at least with 5.3.13, I suppose this issue can be closed. Thank you for your help, Thomas.
Does adding amdgpu.lockup_timeout=0, amdgpu.gpu_recovery=1, or both to the kernel command line help the system keep going?
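One quick way to confirm that options like these actually reached the booted kernel is to check /proc/cmdline. This is a sketch, not from the original thread: the sample string below stands in for the real file contents.

```shell
# Check whether the amdgpu options are present on the booted kernel's
# command line. The sample string stands in for /proc/cmdline; on a
# live system use:  cmdline=$(cat /proc/cmdline)
cmdline='BOOT_IMAGE=/boot/vmlinuz-5.4.6 root=/dev/sda1 ro amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1'

result=''
for opt in amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1; do
    case " $cmdline " in
        *" $opt "*) result="$result ${opt}:present" ;;
        *)          result="$result ${opt}:missing" ;;
    esac
done
echo "$result"
```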
Negative, the hangs still take place. The first hang returns just a single

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

but after restarting sddm from a remote system, the next hang generates a series of "*ERROR* ring gfx timeout" messages popping up in dmesg -Tw every 10-15 seconds. The next sddm restart brings the system back again, but as far as I can tell, adding both of those kernel options has no effect.
A fix for this issue appears to have been found in this link:

https://gitlab.freedesktop.org/drm/amd/issues/934

The fix is to add "amdgpu.noretry=0" to the kernel command line. I've been running with this change on kernel 5.4.6 and Mesa 19.3.1 and have had no hangs using OpenGL as well as Vulkan programs. This certainly appears to have resolved the problem. I can't see any reason not to close this ticket.
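To make the parameter persist across reboots, the usual GRUB2 route is to append it in /etc/default/grub and regenerate the configuration. This is a sketch, not from the original thread: the other options shown on the line are placeholders, and the mkconfig output path (/boot/grub2/grub.cfg here, as typically used on Mageia) can vary by setup.

```shell
# /etc/default/grub (excerpt) -- append amdgpu.noretry=0 to the
# existing default kernel options (the other options here are
# illustrative placeholders for whatever the line already contains):
GRUB_CMDLINE_LINUX_DEFAULT="splash quiet amdgpu.noretry=0"

# Then, as root, regenerate the GRUB configuration and reboot:
# grub2-mkconfig -o /boot/grub2/grub.cfg
```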
(In reply to Alan Richter from comment #7) > A fix appears to have been found for this issue in this link: > > https://gitlab.freedesktop.org/drm/amd/issues/934 > > The fix is to add "amdgpu.noretry=0" to the kernel command line. Ah, so another tradeoff between performance and stability... > > I've been running this change with kernel 5.4.6, mesa 19.3.1 and have had no > hangs using OpenGL as well as Vulkan programs. This certainly appears to > have resolved this problem. > > I can't see any reason not to close this ticket. Well, since you are the reporter, you are free to do so :) Maybe add an entry in the errata to help others with a similar issue...
For issues regarding Raven Ridge GPUs (2400G, 2200G, 2400U, 3400G, etc.) hanging with the 5.4.x and later kernels, adding amdgpu.noretry=0 to the kernel command line seems to resolve the problem.
Status: NEW => RESOLVED
Resolution: (none) => FIXED
FYI, the default noretry=1 is now being reverted to noretry=0 upstream, and I'll add that to the next kernel build (kernel >= 5.4.8-3) that I plan to submit tonight to the build system, so after that you won't need the extra kernel command line anymore...
It looks like you will only have to do this with 5.4.8; it appears that the boffins at AMD are backing this change out until a better solution is found:

https://www.phoronix.com/scan.php?page=news_item&px=AMD-Restore-Retry-Faults-Raven

And you've probably already recreated the patch for the problem:

https://lists.freedesktop.org/archives/amd-gfx/2020-January/044477.html
An update for this issue has been pushed to the Mageia Updates repository. https://advisories.mageia.org/MGASA-2020-0036.html
Kernel command line option removed; using 5.4.10 + Mesa 19.3.2-1, everything is working correctly on Raven Ridge 2400G. Resolved + Fixed really sums it up nicely.