Bug 26544

Summary: Hard kernel lock on Raven Ridge hardware.
Product: Mageia Reporter: Alan Richter <arichter>
Component: RPM Packages Assignee: Kernel and Drivers maintainers <kernel>
Status: RESOLVED WORKSFORME QA Contact:
Severity: critical    
Priority: Normal    
Version: 7   
Target Milestone: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Source RPM: kernel-desktop-5.6.6-1.mga7-1-1 CVE:
Status comment:

Description Alan Richter 2020-04-26 17:59:08 CEST
Description of problem:
Kernel 5.5.6 is causing a hard lock on my Raven Ridge 2400g using the integrated Vega graphics.

Version-Release number of selected component (if applicable):
Kernel 5.5.6, 

How reproducible:
Always.

Steps to Reproduce:
1.  Start Steam
2.  Start Talos Principle (although I suspect any Croteam game would work)
3.  Lock.

BTW, I also see the graphics issues described for the Nouveau problems after upgrading to 5.5.6.  Furthermore, my system hung completely after I had been using Google Chrome for approximately 30 minutes, but that's harder to reproduce.

Here's the output from "dmesg -Tw" when attempting to bring up Talos Principle:

[Sun Apr 26 09:36:21 2020] fuse: init (API version 7.31)
[Sun Apr 26 09:42:21 2020] general protection fault, probably for non-canonical address 0xe535b7ba108039e5: 0000 [#1] SMP NOPTI
[Sun Apr 26 09:42:21 2020] CPU: 5 PID: 29205 Comm: Talos Not tainted 5.6.6-desktop-1.mga7 #1
[Sun Apr 26 09:42:21 2020] Hardware name: Gigabyte Technology Co., Ltd. AB350M-DS3H/AB350M-DS3H-CF, BIOS F50a 11/27/2019
[Sun Apr 26 09:42:21 2020] RIP: 0010:ttm_tt_unpopulate+0x22/0x60 [ttm]
[Sun Apr 26 09:42:21 2020] Code: 84 00 00 00 00 00 66 90 0f 1f 44 00 00 83 7f 3c 02 74 4a f6 47 19 01 75 2f 48 83 7f 20 00 74 28 48 8b 57 10 31 c0 48 8b 0c c2 <48> c7 41 18 00 00 00 00 48 8b 0c c2 48 83 c0 01 48 c7 41 20 00 00
[Sun Apr 26 09:42:21 2020] RSP: 0018:ffffbe3f0c497cb8 EFLAGS: 00010246
[Sun Apr 26 09:42:21 2020] RAX: 0000000000000000 RBX: ffff96fc086dea80 RCX: e535b7ba108039e5
[Sun Apr 26 09:42:21 2020] RDX: ffff96fb84c30000 RSI: 7fffffffffffffff RDI: ffff96fc086dea80
[Sun Apr 26 09:42:21 2020] RBP: ffff96fcc5a84f50 R08: 0000000000135446 R09: ffff96fc43f292a0
[Sun Apr 26 09:42:21 2020] R10: ffff96fbe2349c98 R11: ffff96fccbdee638 R12: ffff96fb84dcbdbc
[Sun Apr 26 09:42:21 2020] R13: ffff96fccbdee650 R14: ffffbe3f0c497e10 R15: 0000000000000008
[Sun Apr 26 09:42:21 2020] FS:  00007f967c754700(0000) GS:ffff96fcd0940000(0000) knlGS:0000000000000000
[Sun Apr 26 09:42:21 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sun Apr 26 09:42:21 2020] CR2: 00007f96740b4318 CR3: 0000000306282000 CR4: 00000000003406e0
[Sun Apr 26 09:42:21 2020] Call Trace:
[Sun Apr 26 09:42:21 2020]  ttm_tt_destroy.part.11+0x49/0x50 [ttm]
[Sun Apr 26 09:42:21 2020]  ttm_bo_cleanup_memtype_use+0x32/0x80 [ttm]
[Sun Apr 26 09:42:21 2020]  ttm_bo_put+0x2b1/0x330 [ttm]
[Sun Apr 26 09:42:21 2020]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[Sun Apr 26 09:42:21 2020]  amdgpu_gem_object_free+0x30/0x50 [amdgpu]
[Sun Apr 26 09:42:21 2020]  drm_gem_object_release_handle+0x6e/0x90 [drm]
[Sun Apr 26 09:42:21 2020]  drm_gem_handle_delete+0x55/0x80 [drm]
[Sun Apr 26 09:42:21 2020]  ? drm_gem_handle_create+0x40/0x40 [drm]
[Sun Apr 26 09:42:21 2020]  drm_ioctl_kernel+0xac/0xf0 [drm]
[Sun Apr 26 09:42:21 2020]  drm_ioctl+0x201/0x3a0 [drm]
[Sun Apr 26 09:42:21 2020]  ? drm_gem_handle_create+0x40/0x40 [drm]
[Sun Apr 26 09:42:21 2020]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[Sun Apr 26 09:42:21 2020]  ksys_ioctl+0x86/0xc0
[Sun Apr 26 09:42:21 2020]  __x64_sys_ioctl+0x16/0x20
[Sun Apr 26 09:42:21 2020]  do_syscall_64+0x5f/0x220
[Sun Apr 26 09:42:21 2020]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Sun Apr 26 09:42:21 2020] RIP: 0033:0x3fb08fa0f7
[Sun Apr 26 09:42:21 2020] Code: 0f 1f 00 64 48 8b 14 25 00 00 00 00 48 8b 05 90 8d 0c 00 c7 04 02 26 00 00 00 48 c7 c0 ff ff ff ff c3 90 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 8d 0c 00 f7 d8 64 89 01 48
[Sun Apr 26 09:42:21 2020] RSP: 002b:00007f967c752de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Sun Apr 26 09:42:21 2020] RAX: ffffffffffffffda RBX: 00007f96740b3c50 RCX: 0000003fb08fa0f7
[Sun Apr 26 09:42:21 2020] RDX: 00007f967c752e28 RSI: 0000000040086409 RDI: 000000000000002b
[Sun Apr 26 09:42:21 2020] RBP: 00007f967c752e28 R08: 0000000005926bc8 R09: 000000000000000e
[Sun Apr 26 09:42:21 2020] R10: 0000000000000123 R11: 0000000000000246 R12: 0000000040086409
[Sun Apr 26 09:42:21 2020] R13: 000000000000002b R14: 00000000058ef270 R15: 00007f95d0098574
[Sun Apr 26 09:42:21 2020] Modules linked in: fuse xt_recent ipt_IFWLOG ipt_psd xt_set ip_set_hash_ip ip_set ip6t_REJECT nf_reject_ipv6 xt_comment xt_hashlimit xt_mark xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat xt_CT xt_tcpudp ip6table_raw iptable_mangle iptable_nat nf_nat nfnetlink_log xt_NFLOG nf_log_ipv6 nf_log_common nf_tables xt_LOG nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane nf_conntrack_pptp nf_conntrack_netlink nfnetlink nf_conntrack_netbios_ns iptable_filter nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 tun nf_conntrack_ftp bridge ts_kmp nf_conntrack_amanda nf_conntrack stp llc nf_defrag_ipv4 ip6table_filter ip6_tables af_packet cfg80211 rfkill binfmt_misc msr sunrpc nls_iso8859_1 nls_cp437 vfat fat input_leds hid_generic kvm_amd kvm snd_hda_codec_realtek irqbypass crc32_pclmul snd_hda_codec_generic crc32c_intel ghash_clmulni_intel ledtrig_audio r8169 snd_hda_codec_hdmi usbhid snd_hda_intel snd_intel_dspcfg aesni_intel
[Sun Apr 26 09:42:21 2020]  snd_hda_codec crypto_simd cryptd glue_helper realtek hid snd_hda_core libphy snd_hwdep wmi_bmof snd_pcm k10temp sp5100_tco i2c_piix4 snd_timer ccp snd soundcore sha1_generic thermal evdev gpio_amdpt gpio_generic button acpi_cpufreq sch_fq_codel efivarfs ip_tables x_tables ipv6 crc_ccitt nf_defrag_ipv6 autofs4 xhci_pci xhci_hcd usbcore sr_mod usb_common amdgpu amd_iommu_v2 gpu_sched i2c_algo_bit ttm wmi drm_kms_helper cec video drm
[Sun Apr 26 09:42:21 2020] ---[ end trace 1ea9c745790ad6e8 ]---
[Sun Apr 26 09:42:21 2020] RIP: 0010:ttm_tt_unpopulate+0x22/0x60 [ttm]
[Sun Apr 26 09:42:21 2020] Code: 84 00 00 00 00 00 66 90 0f 1f 44 00 00 83 7f 3c 02 74 4a f6 47 19 01 75 2f 48 83 7f 20 00 74 28 48 8b 57 10 31 c0 48 8b 0c c2 <48> c7 41 18 00 00 00 00 48 8b 0c c2 48 83 c0 01 48 c7 41 20 00 00
[Sun Apr 26 09:42:21 2020] RSP: 0018:ffffbe3f0c497cb8 EFLAGS: 00010246
[Sun Apr 26 09:42:21 2020] RAX: 0000000000000000 RBX: ffff96fc086dea80 RCX: e535b7ba108039e5
[Sun Apr 26 09:42:21 2020] RDX: ffff96fb84c30000 RSI: 7fffffffffffffff RDI: ffff96fc086dea80
[Sun Apr 26 09:42:21 2020] RBP: ffff96fcc5a84f50 R08: 0000000000135446 R09: ffff96fc43f292a0
[Sun Apr 26 09:42:21 2020] R10: ffff96fbe2349c98 R11: ffff96fccbdee638 R12: ffff96fb84dcbdbc
[Sun Apr 26 09:42:21 2020] R13: ffff96fccbdee650 R14: ffffbe3f0c497e10 R15: 0000000000000008
[Sun Apr 26 09:42:21 2020] FS:  00007f967c754700(0000) GS:ffff96fcd0940000(0000) knlGS:0000000000000000
[Sun Apr 26 09:42:21 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sun Apr 26 09:42:21 2020] CR2: 00007f96740b4318 CR3: 0000000306282000 CR4: 00000000003406e0
[Sun Apr 26 09:42:22 2020] stack segment: 0000 [#2] SMP NOPTI
[Sun Apr 26 09:42:22 2020] CPU: 1 PID: 6787 Comm: net_applet Tainted: G      D           5.6.6-desktop-1.mga7 #1
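
Note on the first trace: the kernel reports a general protection fault "probably for non-canonical address 0xe535b7ba108039e5", which is the value in RCX. The marked bytes in the Code: line (48 c7 41 18 00 00 00 00) appear to decode to a 64-bit store of zero to 0x18(%rcx), so the fault looks like a write through a garbage pointer. The following is a minimal sketch of the canonical-address rule, assuming 48-bit virtual addresses (4-level paging); the function name is hypothetical and only illustrates why that value cannot be a valid pointer.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if addr is a canonical x86-64 virtual address.

    Assumption: va_bits address bits are implemented (48 with 4-level
    paging), so bits 63 down to va_bits-1 must all be identical.
    """
    upper = addr >> (va_bits - 1)              # bits 63..47 when va_bits is 48
    all_ones = (1 << (64 - va_bits + 1)) - 1   # 17 one-bits when va_bits is 48
    return upper == 0 or upper == all_ones


fault_addr = 0xE535B7BA108039E5  # RCX in the trace above
stack_ptr = 0xFFFFBE3F0C497CB8   # RSP in the same trace, for comparison

print(hex(fault_addr), is_canonical(fault_addr))
print(hex(stack_ptr), is_canonical(stack_ptr))
```

The RSP value passes the check while the RCX value does not, which is consistent with the pointer itself being corrupt rather than merely pointing at an unmapped page.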
Comment 1 Lewis Smith 2020-04-26 21:26:05 CEST
Please can you clarify exactly what kernel you are referring to:
 In Source RPM you say '5.6.6-1', which I do not think is out yet.
 In comment0, you cite several times '5.5.6' [it should be 5.5.6-2].
If the last is what you have, we have since issued '5.5.9-1' and '5.5.15-3'.
Can you please update to the latest kernel and report back.

CC: (none) => lewyssmith

Comment 2 Alan Richter 2020-04-26 22:52:30 CEST
The offending kernel is:

kernel-5.6.6-1.mga7.src.rpm  

which is now in:
http://mirrors.kernel.org/mageia/distrib/7.1/SRPMS/core/updates/

I'm not using updates testing or anything else.

BTW kernel-desktop-5.5.15-3.mga7-1-1.mga7 works without issue.
Comment 3 Lewis Smith 2020-04-27 19:44:42 CEST
Thanks for the clarifications.
Just got the -5.6.6-1 update!

Assigning to kernel team.

CC: lewyssmith => (none)
Assignee: bugsquad => kernel

Comment 4 Alan Richter 2020-04-28 05:14:59 CEST
This may not be directly connected to kernel 5.6.6. I've been running 5.5.15 and had multiple GPU hangs, and I had to poke the reset button to get my system back. I was able to operate (well, play Talos Principle) for a few minutes, but my system inevitably hung.

I tried switching to lib64drm_amdgpu1-2.4.100 and lib64drm2-2.4.100 while running kernel 5.5.15 but the hangs were the same.
Comment 5 Alan Richter 2020-04-29 18:37:41 CEST
This issue is definitely tied to the Vega 11 integrated graphics. I swapped in an RX 570 and haven't had any issues whatsoever.
Comment 6 Alan Richter 2020-05-06 05:07:43 CEST
No joy with kernel 5.6.8 and mesa 20.0.6-1: still hard GPU crashes.  The discrete RX 570 works perfectly.

I sure hope that Renoir works better than Raven Ridge.
Comment 7 Alan Richter 2020-05-11 01:20:13 CEST
I tried again with my 2400g running Talos Principle and Black Mesa, and there was no instability. Perhaps something in the latest round of patches, which I applied around May 8, resolved the problem.

I can't see anything that would have affected AMDGPU but it's nice not to have to run a heavy graphics card on light fluffy games like Black Mesa and Talos.

I'll give it a couple of days then close this bug.
Comment 8 Alan Richter 2020-05-12 05:27:54 CEST
I must have some dodgy hardware because Talos is still causing problems.

Not a hard kernel lock but a GPU hang:

[   75.388344] BUG: kernel NULL pointer dereference, address: 0000000000000018
[   75.388350] #PF: supervisor write access in kernel mode
[   75.388352] #PF: error_code(0x0002) - not-present page
[   75.388353] PGD 332aa1067 P4D 332aa1067 PUD 31ee87067 PMD 0 
[   75.388359] Oops: 0002 [#1] SMP NOPTI
[   75.388363] CPU: 0 PID: 12878 Comm: Talos Not tainted 5.6.8-desktop-1.mga7 #1
[   75.388366] Hardware name: Gigabyte Technology Co., Ltd. AB350M-DS3H/AB350M-DS3H-CF, BIOS F50a 11/27/2019
[   75.388375] RIP: 0010:ttm_tt_unpopulate+0x22/0x60 [ttm]
[   75.388379] Code: 84 00 00 00 00 00 66 90 0f 1f 44 00 00 83 7f 3c 02 74 4a f6 47 19 01 75 2f 48 83 7f 20 00 74 28 48 8b 57 10 31 c0 48 8b 0c c2 <48> c7 41 18 00 00 00 00 48 8b 0c c2 48 83 c0 01 48 c7 41 20 00 00
[   75.388381] RSP: 0018:ffffa94c03717cc8 EFLAGS: 00010246
[   75.388383] RAX: 0000000000000000 RBX: ffff8e33e8448840 RCX: 0000000000000000
[   75.388385] RDX: ffff8e33e7c90000 RSI: 7fffffffffffffff RDI: ffff8e33e8448840
[   75.388387] RBP: ffff8e3506284f50 R08: 0000000000132771 R09: ffff8e34ff8805a0
[   75.388388] R10: ffff8e34ff880340 R11: 0000000000000001 R12: ffff8e341eb44dbc
[   75.388390] R13: ffff8e3484778c50 R14: ffffa94c03717e10 R15: 0000000000000008
[   75.388392] FS:  00007fbbce1ec940(0000) GS:ffff8e3510800000(0000) knlGS:0000000000000000
[   75.388394] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.388396] CR2: 0000000000000018 CR3: 00000003307c2000 CR4: 00000000003406f0
[   75.388398] Call Trace:
[   75.388407]  ttm_tt_destroy.part.11+0x49/0x50 [ttm]
[   75.388415]  ttm_bo_cleanup_memtype_use+0x32/0x80 [ttm]
[   75.388422]  ttm_bo_put+0x275/0x2d0 [ttm]
[   75.388519]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[   75.388614]  amdgpu_gem_object_free+0x30/0x50 [amdgpu]
[   75.388635]  drm_gem_object_release_handle+0x6e/0x90 [drm]
[   75.388656]  drm_gem_handle_delete+0x55/0x80 [drm]
[   75.388676]  ? drm_gem_handle_create+0x40/0x40 [drm]
[   75.388695]  drm_ioctl_kernel+0xac/0xf0 [drm]
[   75.388716]  drm_ioctl+0x201/0x3a0 [drm]
[   75.388736]  ? drm_gem_handle_create+0x40/0x40 [drm]
[   75.388741]  ? kmem_cache_free+0x270/0x280
[   75.388831]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[   75.388836]  ksys_ioctl+0x86/0xc0
[   75.388840]  __x64_sys_ioctl+0x16/0x20
[   75.388844]  do_syscall_64+0x5f/0x220
[   75.388849]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   75.388852] RIP: 0033:0x3fb08fa0f7
[   75.388855] Code: 0f 1f 00 64 48 8b 14 25 00 00 00 00 48 8b 05 90 8d 0c 00 c7 04 02 26 00 00 00 48 c7 c0 ff ff ff ff c3 90 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 8d 0c 00 f7 d8 64 89 01 48
[   75.388857] RSP: 002b:00007ffc38b28248 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   75.388859] RAX: ffffffffffffffda RBX: 00007fbbc00a3260 RCX: 0000003fb08fa0f7
[   75.388861] RDX: 00007ffc38b28288 RSI: 0000000040086409 RDI: 000000000000002c
[   75.388863] RBP: 00007ffc38b28288 R08: 0000000006212238 R09: 000000000000000e
[   75.388864] R10: 000000000000010a R11: 0000000000000246 R12: 0000000040086409
[   75.388866] R13: 000000000000002c R14: 0000000000000000 R15: 0000000000000000
[   75.388870] Modules linked in: fuse xt_recent ipt_IFWLOG ipt_psd xt_set ip_set_hash_ip ip_set ip6t_REJECT nf_reject_ipv6 xt_comment xt_hashlimit xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 iptable_mangle iptable_nat xt_mark ip6table_mangle ip6table_nat nf_nat nf_tables xt_CT xt_tcpudp ip6table_raw nfnetlink_log xt_NFLOG nf_log_ipv6 nf_log_common xt_LOG iptable_filter nf_conntrack_tftp tun nf_conntrack_sip nf_conntrack_sane bridge nf_conntrack_pptp nf_conntrack_netlink stp llc nfnetlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp ts_kmp nf_conntrack_amanda nf_conntrack nf_defrag_ipv4 ip6table_filter ip6_tables af_packet cfg80211 rfkill binfmt_misc msr sunrpc nls_iso8859_1 nls_cp437 vfat fat input_leds hid_generic snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec usbhid snd_hda_core hid snd_hwdep snd_pcm snd_timer snd soundcore kvm_amd ccp kvm
[   75.388907]  irqbypass sha1_generic crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof sp5100_tco k10temp r8169 i2c_piix4 realtek libphy thermal evdev gpio_amdpt gpio_generic acpi_cpufreq button sch_fq_codel efivarfs ip_tables x_tables ipv6 crc_ccitt nf_defrag_ipv6 autofs4 xhci_pci xhci_hcd usbcore sr_mod usb_common amdgpu amd_iommu_v2 gpu_sched i2c_algo_bit ttm wmi drm_kms_helper cec video drm
[   75.388930] CR2: 0000000000000018
[   75.388933] ---[ end trace d01f2db04a3aed33 ]---
[   75.388939] RIP: 0010:ttm_tt_unpopulate+0x22/0x60 [ttm]
[   75.388942] Code: 84 00 00 00 00 00 66 90 0f 1f 44 00 00 83 7f 3c 02 74 4a f6 47 19 01 75 2f 48 83 7f 20 00 74 28 48 8b 57 10 31 c0 48 8b 0c c2 <48> c7 41 18 00 00 00 00 48 8b 0c c2 48 83 c0 01 48 c7 41 20 00 00
[   75.388944] RSP: 0018:ffffa94c03717cc8 EFLAGS: 00010246
[   75.388946] RAX: 0000000000000000 RBX: ffff8e33e8448840 RCX: 0000000000000000
[   75.388947] RDX: ffff8e33e7c90000 RSI: 7fffffffffffffff RDI: ffff8e33e8448840
[   75.388949] RBP: ffff8e3506284f50 R08: 0000000000132771 R09: ffff8e34ff8805a0
[   75.388950] R10: ffff8e34ff880340 R11: 0000000000000001 R12: ffff8e341eb44dbc
[   75.388952] R13: ffff8e3484778c50 R14: ffffa94c03717e10 R15: 0000000000000008
[   75.388954] FS:  00007fbbce1ec940(0000) GS:ffff8e3510800000(0000) knlGS:0000000000000000
[   75.388956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.388957] CR2: 0000000000000018 CR3: 00000003307c2000 CR4: 00000000003406f0
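
Note on the second trace: RCX is 0 here and CR2 is 0x18, which matches a store to offset 0x18 from a NULL pointer, and the RIP line is the same as in the first report (ttm_tt_unpopulate+0x22/0x60 [ttm]). The sketch below is one way to confirm that two oops logs fault at the same site. The regular expression and the function name are assumptions made for illustration, not part of any kernel tooling.

```python
import re

# Matches kernel-mode lines such as "RIP: 0010:ttm_tt_unpopulate+0x22/0x60 [ttm]".
# User-mode lines such as "RIP: 0033:0x3fb08fa0f7" have no "+offset/size"
# part and therefore do not match.
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)


def fault_site(log_text):
    """Return (symbol, offset, module) for the first kernel RIP line, or None."""
    m = RIP_RE.search(log_text)
    if m is None:
        return None
    return m.group("symbol"), int(m.group("offset"), 16), m.group("module")


first = "[Sun Apr 26 09:42:21 2020] RIP: 0010:ttm_tt_unpopulate+0x22/0x60 [ttm]"
second = "[   75.388375] RIP: 0010:ttm_tt_unpopulate+0x22/0x60 [ttm]"

print(fault_site(first))
print(fault_site(first) == fault_site(second))
```

Comparing the parsed tuples rather than the raw lines ignores the differing timestamp formats of "dmesg -T" and plain "dmesg".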
Comment 9 Alan Richter 2020-05-13 18:50:57 CEST
Watching my system boot I noticed this:

kfd kfd: Failed to resume IOMMU for device . . . 

So I added iommu=pt to the kernel command line and rebooted; so far the system seems stable.
Comment 10 Alan Richter 2020-05-16 04:21:17 CEST
It appears that adding iommu=pt to the kernel command line resolves this problem.  I have had several days of stable operation and am closing this report.

Status: NEW => RESOLVED
Resolution: (none) => WORKSFORME
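
Note on the workaround: kernel parameters are case-sensitive, and the IOMMU pass-through option is spelled iommu=pt on the command line. A minimal sketch for checking that the option is actually present after a reboot is given below. On a running Linux system the command line can be read from /proc/cmdline. The helper names and the example command line are hypothetical.

```python
def has_kernel_param(cmdline: str, name: str, value: str) -> bool:
    """Return True if the whitespace-separated cmdline contains name=value."""
    return f"{name}={value}" in cmdline.split()


def running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read().strip()


# Hypothetical command line, used here so the example runs anywhere.
example = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 ro quiet iommu=pt"

print(has_kernel_param(example, "iommu", "pt"))
print(has_kernel_param(example.replace("iommu=pt", "IOMMU=pt"), "iommu", "pt"))
```

On the affected machine, has_kernel_param(running_cmdline(), "iommu", "pt") would perform the same check against the booted kernel.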