Bug 25882 - amdgpu is having trouble with kernel 5.4.2-1 on Raven Ridge.
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 7
Hardware: x86_64 Linux
Priority: High major
Target Milestone: ---
Assignee: Kernel and Drivers maintainers
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-12-15 01:18 CET by Alan Richter
Modified: 2020-01-14 05:37 CET

See Also:
Source RPM: kernel-5.4.2-1.mga7.src.rpm
CVE:
Status comment:



Description Alan Richter 2019-12-15 01:18:29 CET
Description of problem:
After upgrading to kernel 5.4.2-1, my system has been experiencing 2D lockups, although it recovers after approximately five to ten seconds.  This message appears in dmesg immediately after a hang:

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

Version-Release number of selected component (if applicable):
Mageia 7.1 x86_64

How reproducible:

This issue is obviously tied to amdgpu, so a system with a Volcanic Island GPU is probably required.  My system is running an AMD 2400G Raven Ridge with the integrated Vega 11 GPU.  This problem did not manifest with 5.3.13 or any earlier kernel on this hardware.

Steps to Reproduce:
1.  Install the 5.4.2 kernel.
2.  Watch a YouTube video or just bring up a web page and scroll.
3.  Wait for the screen to hang and, once it "unhangs", run "dmesg -T".

Occasionally this shows up in dmesg -T:
[Sat Dec 14 17:14:48 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[Sat Dec 14 17:14:53 2019] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[Sat Dec 14 17:14:58 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
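
For anyone trying to confirm the same symptom, the timeout messages can be counted rather than eyeballed. This is only a minimal sketch: the function name count_gfx_timeouts is made up for illustration, and the pattern assumes the message wording shown in the log lines above.

```shell
# Hypothetical helper (not part of any package): count amdgpu gfx ring
# timeout messages in kernel log text read from stdin.
count_gfx_timeouts() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's status at success.
    grep -c 'amdgpu_job_timedout.*ring gfx timeout' || true
}

# Example use on a live system (may need root, depending on dmesg settings):
#   dmesg -T | count_gfx_timeouts
```

A count that keeps growing while scrolling or playing video would match the behaviour described here.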

I have not observed this phenomenon on any other hardware.  It is certainly tied to amdgpu but may also be unique to the Raven Ridge Vega 11 GPU.
Comment 1 Lewis Smith 2019-12-17 15:42:55 CET
Thank you for reporting this, and for all the details you provided.

Assigning to kernel/drivers group.

Assignee: bugsquad => kernel

Comment 2 Alan Richter 2019-12-28 05:40:53 CET
I'm bumping this one up a bit since it would be a showstopper if there weren't a workaround (using the 5.3 kernel).

The combination of Raven Ridge (2400g) + Mesa 19.3.1 + 5.4.6 kernel = unusable machine.  

Now the combination of Raven Ridge (2400g) + Mesa 19.3.1 + 5.3.13 = perfectly usable machine with the ACO shader compiler.  

This appears to be a regression involving the amdgpu module and the 5.4.x kernel, and it is exacerbated by Mesa 19.3.1.  With Mesa 19.2.7 the system recovers from the hang on its own, but with 19.3.1 it recovers only once, after restarting sddm; the next time, a reboot is required.

To make sure it wasn't some odd setting I was using, I created a new user and set autologin for that user but it had no effect.  

To replicate: autologin on an AMD Raven Ridge system running a 5.4.(2 or 6) kernel, mesa 19.3.1, start google-earth and firefox.  The screen will freeze and in dmesg -T this will appear:

[Fri Dec 27 21:06:24 2019] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[Fri Dec 27 21:06:29 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

To reiterate: 5.3.7 has no issues whatsoever; 5.4.6 and 5.4.2 have issues with Mesa 19.2.7, but the system recovers (after about 10 seconds).  With Mesa 19.3.1 there is a full hang, and restarting sddm gives a one-time recovery.

I don't think this is a mesa problem but is more likely to be a Raven Ridge issue.  My other systems that are not having any issues consist of:

2400g with GPU disabled using polaris 570: no problems.

None of my other systems running 5.4.6, with video from Intel, AMD and NVIDIA, are manifesting any problems.  This appears to be unique to Raven Ridge.

Priority: Normal => High
Severity: normal => major

Comment 3 Thomas Backlund 2019-12-28 12:17:14 CET
Looks like there are still upstream issues with this, even in drm-next...

https://gitlab.freedesktop.org/drm/amd/issues/847
https://gitlab.freedesktop.org/drm/amd/issues/934

and some others...

You could try a workaround that has worked for some:

do as root:

echo "export AMD_DEBUG=nodcc" > /etc/profile.d/amd_fix.sh

chmod +x /etc/profile.d/amd_fix.sh

and reboot. Does that help?
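
One way to check that the variable actually reached the session after the reboot is sketched below. It assumes AMD_DEBUG holds a comma-separated list of flags; the helper name has_amd_debug_flag is hypothetical.

```shell
# Hypothetical helper: succeed if the given flag is present in AMD_DEBUG.
# Assumption: AMD_DEBUG is a comma-separated list such as "nodcc,nohyperz".
has_amd_debug_flag() {
    # Wrap both sides in commas so only whole flag names match.
    case ",${AMD_DEBUG:-}," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use from a terminal inside the graphical session:
#   has_amd_debug_flag nodcc && echo "nodcc is active for this session"
```

If the check fails in a desktop terminal, the display manager may not be sourcing /etc/profile.d for graphical logins, which would be worth ruling out before judging the workaround.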

CC: (none) => tmb

Comment 4 Alan Richter 2019-12-28 21:39:45 CET
Thank you Thomas, your suggestion did help, but it only works after the first hang, once I ssh in and restart sddm.

Thank you for the links, I'll follow those around and see where they go.  I had not seen those links before but it's nice to know I'm not alone.  

Since this does not appear to be a Mageia-specific issue, and I have a working system at least with 5.3.13, I suppose this issue can be closed.

Thank you for your help Thomas.
Comment 5 Thomas Backlund 2019-12-28 22:31:02 CET
Does adding:
amdgpu.lockup_timeout=0 

or

amdgpu.gpu_recovery=1

or both on the kernel command line help the system keep going?
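
To confirm that the options really made it onto the booted command line, something like the following could be used. It is a rough sketch: cmdline_param is a hypothetical helper, and it only splits on whitespace, so quoted values containing spaces are not handled.

```shell
# Hypothetical helper: print the value of NAME=VALUE from a kernel command
# line. Reads /proc/cmdline unless a command line string is passed as $2.
cmdline_param() {
    name=$1
    cmdline=${2-$(cat /proc/cmdline)}
    value=
    found=1
    # Simple whitespace split; the last occurrence of the option wins.
    for word in $cmdline; do
        case $word in
            "$name="*) value=${word#*=}; found=0 ;;
        esac
    done
    if [ "$found" -eq 0 ]; then
        printf '%s\n' "$value"
    fi
    return "$found"
}

# Example use after rebooting with the options added:
#   cmdline_param amdgpu.lockup_timeout
#   cmdline_param amdgpu.gpu_recovery
```

The function returns non-zero when the option is absent, so it can also be used in a test such as "cmdline_param amdgpu.gpu_recovery || echo missing".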
Comment 6 Alan Richter 2019-12-29 00:32:51 CET
Negative, the hangs still take place, the first hang just returns a single 

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

but after restarting sddm from a remote system, the next hang generates a series of "*ERROR* ring gfx timeout" messages popping up in dmesg -Tw every 10-15 seconds.

The next sddm restart brings the system back again, but as far as I can tell, adding both of those kernel options has no effect.
Comment 7 Alan Richter 2020-01-01 22:18:54 CET
A fix appears to have been found for this issue in this link:

https://gitlab.freedesktop.org/drm/amd/issues/934

The fix is to add "amdgpu.noretry=0" to the kernel command line.  
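
After rebooting, the value the driver is actually using can be read back. The sketch below assumes the amdgpu module exposes the parameter at /sys/module/amdgpu/parameters/noretry, which is the usual sysfs location for readable module parameters but is not confirmed anywhere in this report; module_param_value is a hypothetical helper.

```shell
# Hypothetical helper: print a loaded module's parameter value from sysfs.
# $1 = module name, $2 = parameter name, $3 = sysfs module root
# (defaults to /sys/module; overridable for testing).
module_param_value() {
    file="${3:-/sys/module}/$1/parameters/$2"
    # Fails (non-zero status, no output) if the file is missing or unreadable.
    [ -r "$file" ] && cat "$file"
}

# Example use on the affected machine:
#   module_param_value amdgpu noretry
```

With the option applied, the expectation would be that this reports 0.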

I've been running this change with kernel 5.4.6, mesa 19.3.1 and have had no hangs using OpenGL as well as Vulkan programs.  This certainly appears to have resolved this problem.  

I can't see any reason not to close this ticket.
Comment 8 Thomas Backlund 2020-01-02 00:22:53 CET
(In reply to Alan Richter from comment #7)
> A fix appears to have been found for this issue in this link:
> 
> https://gitlab.freedesktop.org/drm/amd/issues/934
> 
> The fix is to add "amdgpu.noretry=0" to the kernel command line.  

Ah, so another tradeoff between performance and stability...

> 
> I've been running this change with kernel 5.4.6, mesa 19.3.1 and have had no
> hangs using OpenGL as well as Vulkan programs.  This certainly appears to
> have resolved this problem.  
> 
> I can't see any reason not to close this ticket.

Well, since you are the reporter, you are free to do so :)

Maybe add an entry in the errata to help others with similar issue...
Comment 9 Alan Richter 2020-01-02 04:40:09 CET
For Raven Ridge GPUs (2400g, 2200g, 2400u, 3400g etc.) that hang with the 5.4.x and later kernels, adding amdgpu.noretry=0 to the kernel command line seems to resolve the problem.

Status: NEW => RESOLVED
Resolution: (none) => FIXED

Comment 10 Thomas Backlund 2020-01-07 16:09:44 CET
FYI, the default noretry=1 is now being reverted to noretry=0 upstream, and I'll add that to the next kernel build (kernel >= 5.4.8-3), which I plan to submit to the buildsystem tonight, so after that you won't need the extra kernel command line option anymore...
Comment 11 Alan Richter 2020-01-07 21:34:08 CET
It looks like you will only have to do this with 5.4.8; it appears that the boffins at AMD are backing this change out until a better solution is found.

https://www.phoronix.com/scan.php?page=news_item&px=AMD-Restore-Retry-Faults-Raven

And you've probably already recreated the patch for the problem:

https://lists.freedesktop.org/archives/amd-gfx/2020-January/044477.html
Comment 12 Thomas Backlund 2020-01-13 17:59:36 CET
An update for this issue has been pushed to the Mageia Updates repository.

https://advisories.mageia.org/MGASA-2020-0036.html
Comment 13 Alan Richter 2020-01-14 05:37:33 CET
Kernel command line removed, using 5.4.10 + Mesa 19.3.2-1, everything is working correctly on Raven Ridge 2400g.   Resolved + Fixed really sums it up nicely.
