Bug 24841 - Screen and Graphic intermittent freezes with amdgpu
Summary: Screen and Graphic intermittent freezes with amdgpu
Status: NEW
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 8
Hardware: All Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assignee: Kernel and Drivers maintainers
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-05-19 17:41 CEST by Jüri Ivask
Modified: 2021-05-17 02:37 CEST

See Also:
Source RPM: kernel-5.10.37-1.mga8.src.rpm
CVE:
Status comment:


Description Jüri Ivask 2019-05-19 17:41:16 CEST
Description of problem: On a Mageia 7 installation running Plasma Desktop, the desktop freezes from time to time for up to about 5 seconds.


Version-Release number of selected component (if applicable): 5.15.4

This happens quite regularly, every few minutes. Everything freezes for about 5 seconds and then returns to normal. For example: start a slideshow in gwenview and use the space key to change pictures. From time to time the slideshow does not advance; you have to press space again, and then it skips that image and continues with the one after it.

Any ideas how to find the cause of it?

Hardware: Lenovo ThinkPad X260
https://www.thinkwiki.org/wiki/Category:X260

It seems to be something specific to Mageia 7, as the KDE neon installation on another partition of the same laptop does not show this behaviour.
Comment 1 Alan Richter 2019-05-21 03:52:10 CEST
I believe that plasmashell is causing GPU crashes.  On my MGA7 rig I've found that hovering over panel icons or going through menus causes this: (sorry for the SPAM)



[Mon May 20 19:42:16 2019] Generic Realtek PHY r8169-400:00: attached PHY driver [Generic Realtek PHY] (mii_bus:phy_addr=r8169-400:00, irq=IGNORE)
[Mon May 20 19:42:17 2019] r8169 0000:04:00.0 enp4s0: Link is Down
[Mon May 20 19:42:17 2019] cfg80211: Loading compiled-in X.509 certificates for regulatory database
[Mon May 20 19:42:17 2019] cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[Mon May 20 19:42:19 2019] r8169 0000:04:00.0 enp4s0: Link is Up - 1Gbps/Full - flow control rx/tx
[Mon May 20 19:42:19 2019] IPv6: ADDRCONF(NETDEV_CHANGE): enp4s0: link becomes ready
[Mon May 20 19:42:21 2019] NET: Registered protocol family 17
[Mon May 20 19:42:30 2019] fuse init (API version 7.29)
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:4 pasid:32773, for process plasmashell pid 5712 thread plasmashel:cs0 pid 5841)
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0:   in page starting at address 0x0000800103740000 from 27
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00401031
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:4 pasid:32773, for process plasmashell pid 5712 thread plasmashel:cs0 pid 5841)
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0:   in page starting at address 0x0000800103740000 from 27
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:4 pasid:32773, for process plasmashell pid 5712 thread plasmashel:cs0 pid 5841)
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0:   in page starting at address 0x0000800103740000 from 27
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:4 pasid:32773, for process plasmashell pid 5712 thread plasmashel:cs0 pid 5841)
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0:   in page starting at address 0x0000800103740000 from 27
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:4 pasid:32773, for process plasmashell pid 5712 thread plasmashel:cs0 pid 5841)
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0:   in page starting at address 0x0000800103740000 from 27
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:4 pasid:32773, for process plasmashell pid 5712 thread plasmashel:cs0 pid 5841)
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0:   in page starting at address 0x0000800103742000 from 27
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:4 pasid:32773, for process plasmashell pid 5712 thread plasmashel:cs0 pid 5841)
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0:   in page starting at address 0x0000800103742000 from 27
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:4 pasid:32773, for process plasmashell pid 5712 thread plasmashel:cs0 pid 5841)
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0:   in page starting at address 0x0000800103740000 from 27
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:4 pasid:32773, for process plasmashell pid 5712 thread plasmashel:cs0 pid 5841)
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0:   in page starting at address 0x0000800103742000 from 27
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:4 pasid:32773, for process plasmashell pid 5712 thread plasmashel:cs0 pid 5841)
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0:   in page starting at address 0x0000800103742000 from 27
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mon May 20 19:45:34 2019] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[Mon May 20 19:45:39 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=6470, emitted seq=6473
[Mon May 20 19:45:39 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 5712 thread plasmashel:cs0 pid 5841
[Mon May 20 19:45:39 2019] amdgpu 0000:08:00.0: GPU reset begin!
[Mon May 20 19:45:39 2019] amdgpu 0000:08:00.0: GPU mode1 reset
[Mon May 20 19:45:39 2019] [drm] psp mode 1 reset not supported now! 
[Mon May 20 19:45:39 2019] amdgpu 0000:08:00.0: GPU reset succeeded, trying to resume
[Mon May 20 19:45:39 2019] [drm] PCIE GART of 1024M enabled (table at 0x000000F400300000).
[Mon May 20 19:45:39 2019] [drm] PSP is resuming...
[Mon May 20 19:45:39 2019] [drm] reserve 0x400000 from 0xf400600000 for PSP TMR SIZE
[Mon May 20 19:45:39 2019] [drm] psp command failed and response status is (-65529)


****************************** And eventually this:

[Mon May 20 19:48:11 2019] Workqueue: events drm_sched_job_timedout [gpu_sched]
[Mon May 20 19:48:11 2019] Call Trace:
[Mon May 20 19:48:11 2019]  ? __schedule+0x253/0x860
[Mon May 20 19:48:11 2019]  ? __switch_to_asm+0x35/0x70
[Mon May 20 19:48:11 2019]  schedule+0x28/0x70
[Mon May 20 19:48:11 2019]  schedule_timeout+0x268/0x380
[Mon May 20 19:48:11 2019]  ? __schedule+0x25b/0x860
[Mon May 20 19:48:11 2019]  dma_fence_default_wait+0x204/0x270
[Mon May 20 19:48:11 2019]  ? dma_fence_release+0x90/0x90
[Mon May 20 19:48:11 2019]  dma_fence_wait_timeout+0xdd/0x100
[Mon May 20 19:48:11 2019]  drm_sched_stop+0xf1/0x130 [gpu_sched]
[Mon May 20 19:48:11 2019]  amdgpu_device_pre_asic_reset+0x3f/0x204 [amdgpu]
[Mon May 20 19:48:11 2019]  amdgpu_device_gpu_recover+0x7b/0x72b [amdgpu]
[Mon May 20 19:48:11 2019]  amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[Mon May 20 19:48:11 2019]  drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[Mon May 20 19:48:11 2019]  process_one_work+0x200/0x400
[Mon May 20 19:48:11 2019]  worker_thread+0x2d/0x3d0
[Mon May 20 19:48:11 2019]  ? process_one_work+0x400/0x400
[Mon May 20 19:48:11 2019]  kthread+0x112/0x130
[Mon May 20 19:48:11 2019]  ? kthread_create_on_node+0x60/0x60
[Mon May 20 19:48:11 2019]  ret_from_fork+0x22/0x40

I've run with icewm as my desktop without the GPU hang.  My GPU is a Raven Ridge 2400g and aside from the plasmashell issue it's stable.  

IIRC, a Plasma update and the move to Mesa 19.1 seemed to coincide with the start of the hangs. I've not tried rolling back to older packages yet.
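A log excerpt like the one above can be summarised mechanically when comparing boots or package versions. The following is a minimal Python sketch, assuming the dmesg output has been saved as text. The function name `summarize_amdgpu_log` and the pattern labels are hypothetical and chosen for illustration; the regular expressions only match message texts that appear in the excerpt above.

```python
import re
from collections import Counter

# Hypothetical labels; each regex matches a message text seen in the log above.
PATTERNS = {
    "page_fault": re.compile(r"amdgpu .*no-retry page fault"),
    "ring_timeout": re.compile(r"\*ERROR\* ring \w+ timeout"),
    "gpu_reset": re.compile(r"GPU reset begin!"),
}
# Process name and pid as printed in the page fault lines.
PROCESS_RE = re.compile(r"for process (\S+) pid (\d+)")


def summarize_amdgpu_log(text):
    """Count fault, timeout and reset lines, and the processes named in page faults."""
    events = Counter()
    processes = Counter()
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                events[name] += 1
        match = PROCESS_RE.search(line)
        if match:
            processes[match.group(1)] += 1
    return events, processes


# Three lines copied from the excerpt above.
SAMPLE = """\
[Mon May 20 19:45:29 2019] amdgpu 0000:08:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:4 pasid:32773, for process plasmashell pid 5712 thread plasmashel:cs0 pid 5841)
[Mon May 20 19:45:39 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=6470, emitted seq=6473
[Mon May 20 19:45:39 2019] amdgpu 0000:08:00.0: GPU reset begin!
"""

events, processes = summarize_amdgpu_log(SAMPLE)
print(dict(events))
print(dict(processes))
```

Running the same summary on logs captured under different Mesa or kernel versions would show whether the fault count changes, which may help narrow down a regression.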

CC: (none) => arichter

Comment 2 Alan Richter 2019-05-21 03:54:21 CEST
Unless this crash only affects me, I would raise the severity of this bug to high, since it requires a reboot on systems whose GPUs cannot be restarted.
Comment 3 Jüri Ivask 2019-05-31 08:20:11 CEST
No, I do not see such plasmashell crash messages (running journalctl -f in konsole).

However, after the freeze the log fills with several "qt.qpa.xcb: QXcbConnection: XCB error" messages (see also bug 24865), like these:

mai   31 09:04:49 tpkrom kwin_x11[1519]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 30454, resource id: 90177649, major code: 18 (ChangeProperty), minor code: 0
mai   31 09:05:41 tpkrom plasmashell[1525]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 21356, resource id: 92274724, major code: 141 (Unknown), minor code: 3
etc...

Not sure if these messages are the cause or the result of the freezing...

Is nobody else experiencing such freezes?
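To judge whether the XCB errors cluster around the freezes, it may help to tally them per process. The following is a minimal Python sketch, assuming the journal lines have the shape shown above. The function name `tally_xcb_errors` is hypothetical, and the regular expression is derived only from the two sample lines, so other message variants may not match.

```python
import re
from collections import Counter

# Matches "<process>[<pid>]: qt.qpa.xcb: QXcbConnection: XCB error: <code> (<name>)".
XCB_RE = re.compile(
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]: qt\.qpa\.xcb: QXcbConnection: "
    r"XCB error: (?P<code>\d+) \((?P<name>\w+)\)"
)


def tally_xcb_errors(lines):
    """Count XCB errors per (process name, error name) pair."""
    counts = Counter()
    for line in lines:
        match = XCB_RE.search(line)
        if match:
            counts[(match.group("proc"), match.group("name"))] += 1
    return counts


# The two journal lines quoted above.
SAMPLE = [
    "mai   31 09:04:49 tpkrom kwin_x11[1519]: qt.qpa.xcb: QXcbConnection: "
    "XCB error: 3 (BadWindow), sequence: 30454, resource id: 90177649, "
    "major code: 18 (ChangeProperty), minor code: 0",
    "mai   31 09:05:41 tpkrom plasmashell[1525]: qt.qpa.xcb: QXcbConnection: "
    "XCB error: 3 (BadWindow), sequence: 21356, resource id: 92274724, "
    "major code: 141 (Unknown), minor code: 3",
]

for (proc, name), count in sorted(tally_xcb_errors(SAMPLE).items()):
    print(proc, name, count)
```

A tally alone does not show whether the errors cause the freezes or result from them; comparing their timestamps with the moments of the freezes would be the next step.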
Comment 4 Alan Richter 2019-05-31 16:00:07 CEST
The GPU hangs I was having started shortly after Mesa 19.1-rc3 was introduced; I haven't had any GPU hangs associated with plasmashell with later builds. Mesa 19.1-rc4 seems to be working very well.
Comment 5 Jüri Ivask 2020-04-09 17:41:06 CEST
This seems to be unrelated to KDE software, as the freezes also happen in IceWM, and unrelated to Mageia, as similar freezes also happened on the same hardware in KDE Neon.
Searching for similar problems gives the impression that the issue is related to recent 5.x series kernels, where some i915 module options cause these freezes. For example:
https://bbs.archlinux.org/viewtopic.php?id=246841&p=2
For me, the workaround was to use a 4.15 series kernel in KDE Neon, or to boot with the kernel parameter "i915.enable_psr=0" in Mageia Cauldron or in KDE Neon with a 5.3 series kernel.
Not sure what to do with the current bug...
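When testing the workaround, it is worth confirming that the parameter actually reached the running kernel. The following is a minimal Python sketch, assuming a Linux system where the boot command line is readable from /proc/cmdline. The function names and the sample command line are hypothetical; only the "i915.enable_psr=0" parameter comes from the comment above.

```python
def parse_kernel_cmdline(cmdline):
    """Split a kernel command line into a dict of parameter -> value.

    Parameters without '=' map to None. This naive split does not handle
    quoted values that contain spaces.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def psr_disabled(cmdline):
    """Return True if i915.enable_psr=0 is present on the command line."""
    return parse_kernel_cmdline(cmdline).get("i915.enable_psr") == "0"


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Hypothetical command line, used here instead of the live /proc/cmdline.
sample = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 i915.enable_psr=0 quiet"
print(psr_disabled(sample))
```

On a live system, `psr_disabled(read_cmdline())` would report whether the workaround is active for the current boot.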
Jüri Ivask 2020-04-09 17:43:59 CEST

Summary: Mageia 7 Plasma Desktop intermittent freezes => Mageia 7 and 8 Plasma Desktop intermittent freezes

Comment 6 Aurelien Oudelet 2021-05-17 02:37:53 CEST
This seems to be an amdgpu bug with your hardware.

Please install the inxi package (urpmi inxi) and run:

$ inxi -SGxx



Assigning to Kernel and Drivers maintainers.

Source RPM: (none) => kernel-5.10.37-1.mga8.src.rpm
Summary: Mageia 7 and 8 Plasma Desktop intermittent freezes => Screen and Graphic intermittent freezes with amdgpu
Version: 7 => 8
Assignee: kde => kernel
CC: (none) => ouaurelien

