| Summary: | Hard kernel lock on Raven Ridge hardware. | ||
|---|---|---|---|
| Product: | Mageia | Reporter: | Alan Richter <arichter> |
| Component: | RPM Packages | Assignee: | Kernel and Drivers maintainers <kernel> |
| Status: | RESOLVED WORKSFORME | QA Contact: | |
| Severity: | critical | ||
| Priority: | Normal | ||
| Version: | 7 | ||
| Target Milestone: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Source RPM: | kernel-desktop-5.6.6-1.mga7-1-1 | CVE: | |
| Status comment: | |||
Description
Alan Richter
2020-04-26 17:59:08 CEST
Please can you clarify exactly which kernel you are referring to? In Source RPM you say '5.6.6-1', which I do not think is out yet. In comment 0 you cite '5.5.6' several times [it should be 5.5.6-2]. If the latter is what you have, we have since issued '5.5.9-1' and '5.5.15-3'. Can you please update to the latest kernel and report back?

CC: (none) => lewyssmith

The offending kernel is kernel-5.6.6-1.mga7.src.rpm, which is now in http://mirrors.kernel.org/mageia/distrib/7.1/SRPMS/core/updates/ (I'm not using updates testing or anything else). BTW, kernel-desktop-5.5.15-3.mga7-1-1.mga7 works without issue.

Thanks for the clarifications. Just got the -5.6.6-1 update! Assigning to kernel team.

CC:
lewyssmith => (none)

This may not be directly connected to kernel 5.6.6. I've been running 5.5.15 and had multiple GPU hangs, and had to poke the reset button to get my system back. I was able to operate (well, play Talos Principle) for a few minutes, but my system inevitably hung. I tried switching to lib64drm_amdgpu1-2.4.100 and lib64drm2-2.4.100 while running kernel 5.5.15, but the hangs were the same.

This issue is definitely tied to the Vega 11 integrated graphics. I swapped in an RX 570 and haven't had any issues whatsoever.

No joy with kernel 5.6.8 and mesa 20.0.6-1: hard GPU crashes. Discrete RX 570 works perfectly. I sure hope that Renoir works better than Raven Ridge.

I tried again with my 2400G running Talos Principle and Black Mesa, and there was no instability. Perhaps something in the latest round of patches, which I applied ~May 8, resolved the problem. I can't see anything that would have affected AMDGPU, but it's nice not to have to run a heavy graphics card on light fluffy games like Black Mesa and Talos. I'll give it a couple of days, then close this bug.

I must have some dodgy hardware because Talos is still causing problems. Not a hard kernel lock but a GPU hang:

[ 75.388344] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 75.388350] #PF: supervisor write access in kernel mode
[ 75.388352] #PF: error_code(0x0002) - not-present page
[ 75.388353] PGD 332aa1067 P4D 332aa1067 PUD 31ee87067 PMD 0
[ 75.388359] Oops: 0002 [#1] SMP NOPTI
[ 75.388363] CPU: 0 PID: 12878 Comm: Talos Not tainted 5.6.8-desktop-1.mga7 #1
[ 75.388366] Hardware name: Gigabyte Technology Co., Ltd. AB350M-DS3H/AB350M-DS3H-CF, BIOS F50a 11/27/2019
[ 75.388375] RIP: 0010:ttm_tt_unpopulate+0x22/0x60 [ttm]
[ 75.388379] Code: 84 00 00 00 00 00 66 90 0f 1f 44 00 00 83 7f 3c 02 74 4a f6 47 19 01 75 2f 48 83 7f 20 00 74 28 48 8b 57 10 31 c0 48 8b 0c c2 <48> c7 41 18 00 00 00 00 48 8b 0c c2 48 83 c0 01 48 c7 41 20 00 00
[ 75.388381] RSP: 0018:ffffa94c03717cc8 EFLAGS: 00010246
[ 75.388383] RAX: 0000000000000000 RBX: ffff8e33e8448840 RCX: 0000000000000000
[ 75.388385] RDX: ffff8e33e7c90000 RSI: 7fffffffffffffff RDI: ffff8e33e8448840
[ 75.388387] RBP: ffff8e3506284f50 R08: 0000000000132771 R09: ffff8e34ff8805a0
[ 75.388388] R10: ffff8e34ff880340 R11: 0000000000000001 R12: ffff8e341eb44dbc
[ 75.388390] R13: ffff8e3484778c50 R14: ffffa94c03717e10 R15: 0000000000000008
[ 75.388392] FS: 00007fbbce1ec940(0000) GS:ffff8e3510800000(0000) knlGS:0000000000000000
[ 75.388394] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 75.388396] CR2: 0000000000000018 CR3: 00000003307c2000 CR4: 00000000003406f0
[ 75.388398] Call Trace:
[ 75.388407] ttm_tt_destroy.part.11+0x49/0x50 [ttm]
[ 75.388415] ttm_bo_cleanup_memtype_use+0x32/0x80 [ttm]
[ 75.388422] ttm_bo_put+0x275/0x2d0 [ttm]
[ 75.388519] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[ 75.388614] amdgpu_gem_object_free+0x30/0x50 [amdgpu]
[ 75.388635] drm_gem_object_release_handle+0x6e/0x90 [drm]
[ 75.388656] drm_gem_handle_delete+0x55/0x80 [drm]
[ 75.388676] ? drm_gem_handle_create+0x40/0x40 [drm]
[ 75.388695] drm_ioctl_kernel+0xac/0xf0 [drm]
[ 75.388716] drm_ioctl+0x201/0x3a0 [drm]
[ 75.388736] ? drm_gem_handle_create+0x40/0x40 [drm]
[ 75.388741] ? kmem_cache_free+0x270/0x280
[ 75.388831] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 75.388836] ksys_ioctl+0x86/0xc0
[ 75.388840] __x64_sys_ioctl+0x16/0x20
[ 75.388844] do_syscall_64+0x5f/0x220
[ 75.388849] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 75.388852] RIP: 0033:0x3fb08fa0f7
[ 75.388855] Code: 0f 1f 00 64 48 8b 14 25 00 00 00 00 48 8b 05 90 8d 0c 00 c7 04 02 26 00 00 00 48 c7 c0 ff ff ff ff c3 90 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 8d 0c 00 f7 d8 64 89 01 48
[ 75.388857] RSP: 002b:00007ffc38b28248 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 75.388859] RAX: ffffffffffffffda RBX: 00007fbbc00a3260 RCX: 0000003fb08fa0f7
[ 75.388861] RDX: 00007ffc38b28288 RSI: 0000000040086409 RDI: 000000000000002c
[ 75.388863] RBP: 00007ffc38b28288 R08: 0000000006212238 R09: 000000000000000e
[ 75.388864] R10: 000000000000010a R11: 0000000000000246 R12: 0000000040086409
[ 75.388866] R13: 000000000000002c R14: 0000000000000000 R15: 0000000000000000
[ 75.388870] Modules linked in: fuse xt_recent ipt_IFWLOG ipt_psd xt_set ip_set_hash_ip ip_set ip6t_REJECT nf_reject_ipv6 xt_comment xt_hashlimit xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 iptable_mangle iptable_nat xt_mark ip6table_mangle ip6table_nat nf_nat nf_tables xt_CT xt_tcpudp ip6table_raw nfnetlink_log xt_NFLOG nf_log_ipv6 nf_log_common xt_LOG iptable_filter nf_conntrack_tftp tun nf_conntrack_sip nf_conntrack_sane bridge nf_conntrack_pptp nf_conntrack_netlink stp llc nfnetlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp ts_kmp nf_conntrack_amanda nf_conntrack nf_defrag_ipv4 ip6table_filter ip6_tables af_packet cfg80211 rfkill binfmt_misc msr sunrpc nls_iso8859_1 nls_cp437 vfat fat input_leds hid_generic snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec usbhid snd_hda_core hid snd_hwdep snd_pcm snd_timer snd soundcore kvm_amd ccp kvm
[ 75.388907] irqbypass sha1_generic crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof sp5100_tco k10temp r8169 i2c_piix4 realtek libphy thermal evdev gpio_amdpt gpio_generic acpi_cpufreq button sch_fq_codel efivarfs ip_tables x_tables ipv6 crc_ccitt nf_defrag_ipv6 autofs4 xhci_pci xhci_hcd usbcore sr_mod usb_common amdgpu amd_iommu_v2 gpu_sched i2c_algo_bit ttm wmi drm_kms_helper cec video drm
[ 75.388930] CR2: 0000000000000018
[ 75.388933] ---[ end trace d01f2db04a3aed33 ]---
[ 75.388939] RIP: 0010:ttm_tt_unpopulate+0x22/0x60 [ttm]
[ 75.388942] Code: 84 00 00 00 00 00 66 90 0f 1f 44 00 00 83 7f 3c 02 74 4a f6 47 19 01 75 2f 48 83 7f 20 00 74 28 48 8b 57 10 31 c0 48 8b 0c c2 <48> c7 41 18 00 00 00 00 48 8b 0c c2 48 83 c0 01 48 c7 41 20 00 00
[ 75.388944] RSP: 0018:ffffa94c03717cc8 EFLAGS: 00010246
[ 75.388946] RAX: 0000000000000000 RBX: ffff8e33e8448840 RCX: 0000000000000000
[ 75.388947] RDX: ffff8e33e7c90000 RSI: 7fffffffffffffff RDI: ffff8e33e8448840
[ 75.388949] RBP: ffff8e3506284f50 R08: 0000000000132771 R09: ffff8e34ff8805a0
[ 75.388950] R10: ffff8e34ff880340 R11: 0000000000000001 R12: ffff8e341eb44dbc
[ 75.388952] R13: ffff8e3484778c50 R14: ffffa94c03717e10 R15: 0000000000000008
[ 75.388954] FS: 00007fbbce1ec940(0000) GS:ffff8e3510800000(0000) knlGS:0000000000000000
[ 75.388956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 75.388957] CR2: 0000000000000018 CR3: 00000003307c2000 CR4: 00000000003406f0

Watching my system boot I noticed this:

kfd kfd: Failed to resume IOMMU for device . . .

So I added iommu=pt to the kernel command line and rebooted; so far the system seems stable.

It appears that adding iommu=pt to the kernel command line resolves this problem. I have had several days of stable operation and am closing this report.

Status: NEW => RESOLVED
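
For anyone applying the same workaround, it can help to confirm after the reboot that the parameter actually reached the running kernel. The following is a minimal Python sketch, not part of the original report. It assumes the standard lowercase spelling `iommu=pt` (kernel parameters are case-sensitive) and that `/proc/cmdline` is readable; the helper names `has_kernel_param` and `read_cmdline` are made up for illustration.

```python
from pathlib import Path


def has_kernel_param(cmdline: str, name: str, value: str) -> bool:
    """Return True if `name=value` appears as a whole token in the command line.

    Matching on whole whitespace-separated tokens means that a different
    parameter such as `amd_iommu=pt` is not mistaken for `iommu=pt`.
    """
    return f"{name}={value}" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    return Path(path).read_text().strip()


if __name__ == "__main__":
    try:
        cmdline = read_cmdline()
    except OSError:
        # /proc may be unavailable (non-Linux host, restricted container).
        cmdline = ""
    print("iommu=pt active:", has_kernel_param(cmdline, "iommu", "pt"))
```

If the check reports that the parameter is missing, the bootloader configuration was probably not regenerated or the entry booted was not the edited one.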