Bug 22663 - AER related hang/oops with AHCI
Summary: AER related hang/oops with AHCI
Status: RESOLVED OLD
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 6
Hardware: All, OS: Linux
Priority: Normal, Severity: critical
Target Milestone: ---
Assignee: Kernel and Drivers maintainers
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-02-27 11:50 CET by Herbert Poetzl
Modified: 2020-08-16 16:16 CEST
CC List: 3 users

See Also:
Source RPM: kernel-4.14.20-1.mga6.src.rpm
CVE:
Status comment:



Description Herbert Poetzl 2018-02-27 11:50:21 CET
Description of problem:
When a PCIe AER (Advanced Error Reporting) error occurs, which I can trigger fairly reliably on this system, the kernel becomes unresponsive because the error is not handled. The only way to recover is a cold reboot.

Version-Release number of selected component (if applicable):
kernel-server-4.14.20-1.mga6-1-1.mga6

How reproducible:
Fairly reliably.

Steps to Reproduce:
1. Connect 5 SSDs through an Addonics AD4ES6GPX4 HBA with a Marvell 9705PM.
2. Read from all five disks in parallel: for n in a b c d e; do time dd if=/dev/sd$n bs=128k count=200000 of=/dev/null & done
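
For anyone trying to reproduce this on other hardware, the parallel read load from step 2 can be wrapped in a small function. This is only a sketch: read_all and DRY_RUN are hypothetical names, not part of the original report, and the sd$n device naming is assumed to match the reporter's system. By default it only prints the dd commands so the device list can be checked first.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): start one sequential
# read per disk, all in parallel, using the same dd parameters as step 2.
# With DRY_RUN=1 (the default in this sketch) the commands are only printed.
read_all() {
    for n in "$@"; do
        cmd="dd if=/dev/sd$n of=/dev/null bs=128k count=200000"
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "$cmd"
        else
            $cmd &          # run in the background so all disks are read at once
        fi
    done
    wait                    # returns immediately when nothing was started
}
```

Running "read_all a b c d e" prints the five commands; "DRY_RUN=0 read_all a b c d e" (as root) actually starts the reads.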



[  161.530384] pcieport 0000:00:1d.0: AER: Uncorrected (Fatal) error received: id=00e8
[  161.538263] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=00e8(Receiver ID)
[  161.549885] pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00020000/00010000
[  161.558412] pcieport 0000:00:1d.0:    [17] Receiver Overflow      (First)
[  161.565324] pcieport 0000:00:1d.0: broadcast error_detected message
[  161.565326] ahci 0000:44:00.0: device has no AER-aware driver
[  161.565327] ahci 0000:45:00.0: device has no AER-aware driver
[  162.630587] pcieport 0000:00:1d.0: Root Port link has been reset
[  162.630604] pcieport 0000:00:1d.0: AER: Device recovery failed
[  192.450092] ------------[ cut here ]------------
[  192.454813] WARNING: CPU: 0 PID: 0 at kernel/rcu/tree.c:2725 rcu_process_callbacks+0x4d6/0x4f0
[  192.463545] Modules linked in: xt_recent ip6table_nat nf_nat_ipv6 nf_nat xt_comment ip6t_REJECT nf_reject_ipv6 xt_addrtype bridge stp llc xt_mark ip6table_mangle nf_conntrack_snmp xt_tcpudp xt_CT ip6table_raw xt_multiport nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack xt_NFLOG nfnetlink_log xt_LOG nf_log_ipv6 nf_log_common nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nfnetlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp ts_kmp nf_conntrack_amanda nf_conntrack ip6table_filter ip6_tables x_tables af_packet msr sunrpc ses enclosure snd_hda_codec_hdmi snd_hda_codec_ca0132 i915 drm_kms_helper drm i2c_algo_bit intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel
[  192.535227]  kvm_intel snd_hda_codec iTCO_wdt iTCO_vendor_support kvm snd_hda_core nls_utf8 irqbypass intel_cstate nls_cp437 snd_hwdep mxm_wmi intel_uncore snd_pcm e1000e vfat intel_rapl_perf fat ixgbe snd_timer snd i2c_i801 soundcore alx ptp mpt3sas joydev pps_core mdio evdev dca input_leds raid_class scsi_transport_sas fan thermal wmi video acpi_pad button mei_me mei shpchp sch_fq_codel gpio_it87 it87 hwmon_vid efivarfs ipv6 crc_ccitt autofs4 algif_skcipher af_alg dm_crypt uas usb_storage hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc xhci_pci xhci_hcd aesni_intel aes_x86_64 crypto_simd cryptd glue_helper usbcore usb_common dm_mirror dm_region_hash dm_log dm_mod ide_pci_generic ide_core
[  192.601260] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.20-server-1.mga6 #1
[  192.608653] Hardware name: Gigabyte Technology Co., Ltd. Z170X-Gaming 7/Z170X-Gaming 7, BIOS F22j 01/11/2018
[  192.618654] task: ffffffff992124c0 task.stack: ffffffff99200000
[  192.624695] RIP: 0010:rcu_process_callbacks+0x4d6/0x4f0
[  192.630024] RSP: 0018:ffff8a908ec03f10 EFLAGS: 00010002
[  192.635327] RAX: 0000000000000000 RBX: ffff8a908ec23180 RCX: ffff8a907c73e118
[  192.642565] RDX: ffffffffffffd801 RSI: ffff8a908ec03f20 RDI: ffff8a908ec231b8
[  192.649854] RBP: ffffffff99250380 R08: 0000000000000246 R09: 0000000000000001
[  192.657143] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8a908ec231b8
[  192.664414] R13: ffffffff992124c0 R14: fffffffffffffffb R15: 7fffffffffffffff
[  192.671692] FS:  0000000000000000(0000) GS:ffff8a908ec00000(0000) knlGS:0000000000000000
[  192.679927] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  192.685822] CR2: 00007f108506e140 CR3: 000000042a20a005 CR4: 00000000003606f0
[  192.693093] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  192.700399] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  192.707694] Call Trace:
[  192.710199]  <IRQ>
[  192.712255]  __do_softirq+0xf5/0x295
[  192.715885]  irq_exit+0xae/0xb0
[  192.719075]  smp_apic_timer_interrupt+0x70/0x130
[  192.723788]  apic_timer_interrupt+0x7d/0x90
[  192.728070]  </IRQ>
[  192.730229] RIP: 0010:cpuidle_enter_state+0xa4/0x300
[  192.735298] RSP: 0018:ffffffff99203e88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  192.743004] RAX: ffff8a908ec22480 RBX: 0000000000000006 RCX: 000000000000001f
[  192.750240] RDX: 0000000000000000 RSI: 00000000258f0602 RDI: 0000000000000000
[  192.757528] RBP: ffffffff992b8960 R08: ffff8a908ec214c4 R09: 0000000000000018
[  192.764781] R10: 000000000000248b R11: 0000000000004d0c R12: ffff8a908ec2b400
[  192.772062] R13: ffffffff992b8bb8 R14: 0000002cce52e067 R15: 0000002cceeb23d9
[  192.779334]  ? cpuidle_enter_state+0x92/0x300
[  192.783798]  do_idle+0x185/0x1e0
[  192.787091]  cpu_startup_entry+0x6f/0x80
[  192.791078]  start_kernel+0x4cb/0x4eb
[  192.794812]  secondary_startup_64+0xa5/0xb0
[  192.799111] Code: 17 01 0f 8f 80 fd ff ff 48 8b 15 f6 25 17 01 48 89 93 b0 00 00 00 e9 6d fd ff ff 4c 89 f6 4c 89 e7 e8 5f 73 72 00 e9 eb fb ff ff <0f> 0b e9 9e fd ff ff 0f 0b e9 9d fc ff ff e8 e7 4f f9 ff 0f 1f 
[  192.818368] ---[ end trace 7872bf286971a2ac ]---
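
The "error status/mask=00020000/00010000" and "id=00e8" fields in the AER lines above can be decoded by hand. The sketch below assumes the bit layout of the PCIe AER Uncorrectable Error Status register as I understand it from the specification (bits 4, 5 and 12 to 20 only); decode_uncor, bit_name and decode_id are hypothetical helper names, not kernel or pciutils tools.

```shell
#!/bin/sh
# Map a bit number of the AER Uncorrectable Error Status register to a name.
# Only bits 4, 5 and 12..20 are named here (an assumption of this sketch).
bit_name() {
    case $1 in
        4)  echo "Data Link Protocol Error" ;;
        5)  echo "Surprise Down" ;;
        12) echo "Poisoned TLP" ;;
        13) echo "Flow Control Protocol Error" ;;
        14) echo "Completion Timeout" ;;
        15) echo "Completer Abort" ;;
        16) echo "Unexpected Completion" ;;
        17) echo "Receiver Overflow" ;;
        18) echo "Malformed TLP" ;;
        19) echo "ECRC Error" ;;
        20) echo "Unsupported Request" ;;
        *)  echo "(not named in this sketch)" ;;
    esac
}

# Print every set bit of a hex status or mask value such as 00020000.
decode_uncor() {
    status=$(( 0x$1 ))
    bit=0
    while [ "$bit" -lt 32 ]; do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            printf '[%d] %s\n' "$bit" "$(bit_name "$bit")"
        fi
        bit=$((bit + 1))
    done
}

# Split a 16-bit ID such as 00e8 into bus:device.function.
decode_id() {
    id=$(( 0x$1 ))
    printf '%02x:%02x.%x\n' $((id >> 8)) $(( (id >> 3) & 0x1f )) $((id & 7))
}
```

With the values from the log, "decode_uncor 00020000" reports bit 17 (Receiver Overflow), "decode_uncor 00010000" reports bit 16 (Unexpected Completion) as the only masked error, and "decode_id 00e8" gives 00:1d.0, which matches the reporting root port 0000:00:1d.0.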
Marja Van Waes 2018-02-27 11:57:11 CET

Assignee: bugsquad => kernel
CC: (none) => marja11

Comment 1 Herbert Poetzl 2018-02-27 23:28:20 CET
I got a very similar trace with an ASMedia controller, this time without any preceding AER error:

[10105.578852] ------------[ cut here ]------------
[10105.583583] WARNING: CPU: 0 PID: 0 at kernel/rcu/tree.c:2725 rcu_process_callbacks+0x4d6/0x4f0
[10105.592403] Modules linked in: xt_recent ip6table_nat nf_nat_ipv6 nf_nat xt_comment ip6t_REJECT nf_reject_ipv6 xt_addrtype bridge stp llc xt_mark ip6table_mangle nf_conntrack_snmp xt_tcpudp xt_CT ip6table_raw xt_multiport nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack xt_NFLOG nfnetlink_log xt_LOG nf_log_ipv6 nf_log_common nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nfnetlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp ts_kmp nf_conntrack_amanda nf_conntrack ip6table_filter ip6_tables x_tables af_packet msr sunrpc ses enclosure snd_hda_codec_hdmi snd_hda_codec_ca0132 i915 intel_rapl nls_utf8 x86_pkg_temp_thermal intel_powerclamp drm_kms_helper nls_cp437 coretemp vfat drm i2c_algo_bit
[10105.664977]  fat kvm_intel ixgbe kvm snd_hda_intel e1000e snd_hda_codec snd_hda_core snd_hwdep irqbypass snd_pcm intel_cstate snd_timer intel_uncore snd iTCO_wdt iTCO_vendor_support ptp soundcore mxm_wmi intel_rapl_perf i2c_i801 pps_core wmi alx evdev input_leds joydev mpt3sas mdio raid_class scsi_transport_sas dca mei_me fan thermal video acpi_pad mei button shpchp sch_fq_codel gpio_it87 it87 hwmon_vid efivarfs ipv6 crc_ccitt autofs4 algif_skcipher af_alg dm_crypt uas usb_storage hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc xhci_pci aesni_intel xhci_hcd aes_x86_64 crypto_simd cryptd glue_helper usbcore usb_common dm_mirror dm_region_hash dm_log dm_mod ide_pci_generic ide_core
[10105.730080] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.20-server-1.mga6 #1
[10105.737413] Hardware name: Gigabyte Technology Co., Ltd. Z170X-Gaming 7/Z170X-Gaming 7, BIOS F22j 01/11/2018
[10105.747414] task: ffffffff9e2124c0 task.stack: ffffffff9e200000
[10105.753457] RIP: 0010:rcu_process_callbacks+0x4d6/0x4f0
[10105.758777] RSP: 0018:ffff888e0ec03f10 EFLAGS: 00010002
[10105.764106] RAX: 0000000000000000 RBX: ffff888e0ec23180 RCX: 0000000180040002
[10105.771393] RDX: ffffffffffffd801 RSI: ffff888e0ec03f20 RDI: ffff888e0ec231b8
[10105.778639] RBP: ffffffff9e250380 R08: 00000000fb733b01 R09: 0000000180040002
[10105.785895] R10: ffff888e0ec03e30 R11: 0000000000000000 R12: ffff888e0ec231b8
[10105.793148] R13: ffffffff9e2124c0 R14: fffffffffffffffc R15: 7fffffffffffffff
[10105.800428] FS:  0000000000000000(0000) GS:ffff888e0ec00000(0000) knlGS:0000000000000000
[10105.808643] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10105.814477] CR2: 00007f41a1383000 CR3: 000000036820a003 CR4: 00000000003606f0
[10105.821782] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10105.829010] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[10105.836246] Call Trace:
[10105.838745]  <IRQ>
[10105.840825]  __do_softirq+0xf5/0x295
[10105.844447]  irq_exit+0xae/0xb0
[10105.847670]  smp_apic_timer_interrupt+0x70/0x130
[10105.852377]  apic_timer_interrupt+0x7d/0x90
[10105.856632]  </IRQ>
[10105.858772] RIP: 0010:cpuidle_enter_state+0xa4/0x300
[10105.863833] RSP: 0018:ffffffff9e203e88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[10105.871511] RAX: ffff888e0ec22480 RBX: 0000000000000006 RCX: 000000000000001f
[10105.878791] RDX: 0000000000000000 RSI: 00000000258f0602 RDI: 0000000000000000
[10105.886089] RBP: ffffffff9e2b8960 R08: ffff888e0ec214c4 R09: 0000000000000018
[10105.893360] R10: 000000000000194b R11: 0000000000002547 R12: ffff888e0ec2b400
[10105.900622] R13: ffffffff9e2b8bb8 R14: 00000930e2d78395 R15: 00000930e36fd284
[10105.907861]  ? cpuidle_enter_state+0x92/0x300
[10105.912299]  do_idle+0x185/0x1e0
[10105.915599]  cpu_startup_entry+0x6f/0x80
[10105.919578]  start_kernel+0x4cb/0x4eb
[10105.923321]  secondary_startup_64+0xa5/0xb0
[10105.927576] Code: 17 01 0f 8f 80 fd ff ff 48 8b 15 f6 25 17 01 48 89 93 b0 00 00 00 e9 6d fd ff ff 4c 89 f6 4c 89 e7 e8 5f 73 72 00 e9 eb fb ff ff <0f> 0b e9 9e fd ff ff 0f 0b e9 9d fc ff ff e8 e7 4f f9 ff 0f 1f 
[10105.946764] ---[ end trace 9332ec21af281a5e ]---

CC: (none) => herbert

Comment 2 Aurelien Oudelet 2020-08-16 16:16:36 CEST
Mageia 6 changed to end-of-life (EOL) status on 2019-09-30. It is no longer 
maintained, which means that it will not receive any further security or bug 
fix updates.

Package Maintainer: If you wish for this bug to remain open because you plan 
to fix it in a currently maintained version, simply change the 'version' to 
a later Mageia version.

Bug Reporter: Thank you for reporting this issue and we are sorry that we 
weren't able to fix it before Mageia 6's end of life. If you are able to 
reproduce it against a later version of Mageia, you are encouraged to click 
on "Version" and change it against that version of Mageia.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a more recent
Mageia release includes newer upstream software that fixes bugs or makes them
obsolete.

If you would like to help fix bugs in the future, don't hesitate to join the
packager team via our mentoring program [1] or join the teams that suit you 
best [2].

[1] https://wiki.mageia.org/en/Becoming_a_Mageia_Packager
[2] http://www.mageia.org/contribute/

Best regards,
Aurélien
Bugsquad Team

CC: (none) => ouaurelien
Status: NEW => RESOLVED
Resolution: (none) => OLD

