Bug 30405 - System lockups
Summary: System lockups
Status: RESOLVED WORKSFORME
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: Cauldron
Hardware: All Linux
Priority: Normal, Severity: normal
Target Milestone: ---
Assignee: Mageia Bug Squad
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-05-10 01:14 CEST by Pierre Fortin
Modified: 2022-05-21 07:06 CEST

See Also:
Source RPM:
CVE:
Status comment:


Attachments
Entire journal for first lockup (274.62 KB, text/plain)
2022-05-10 01:17 CEST, Pierre Fortin
Entire journal of second lockup (514.08 KB, text/plain)
2022-05-10 01:28 CEST, Pierre Fortin
dmesg errors on boot (3.72 KB, text/plain)
2022-05-10 03:19 CEST, Pierre Fortin
latest lockup (5.91 KB, text/plain)
2022-05-10 03:26 CEST, Pierre Fortin
htop showing 7 procs pegged at 100% (164.49 KB, image/jpeg)
2022-05-10 15:38 CEST, Pierre Fortin

Description Pierre Fortin 2022-05-10 01:14:43 CEST
Description of problem:  Reporting this in case more happen...

Rebooted for recent updates.  System stayed up for a few moments; then locked up.  Using old laptop, could only ping this machine; ssh did not respond.  
Hard power down, and reboot.  Stayed up a little longer, and locked up.
Hard power down, and reboot.  Still up as I write this...  


Version-Release number of selected component (if applicable):


How reproducible: Happened twice


Steps to Reproduce:  unknown. No commonality I could discern.
Comment 1 Pierre Fortin 2022-05-10 01:17:09 CEST
Created attachment 13240 [details]
Entire journal for first lockup
Comment 2 Pierre Fortin 2022-05-10 01:28:55 CEST
Created attachment 13241 [details]
Entire journal of second lockup

CORRECTION: system does not respond to pings.  I accidentally typed "ping 192168.1.46" (note missing dot) to which my router responds. 

As I was about to upload this attachment, system locked up again...

This log ends with:
May 09 19:15:49 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 09 19:15:49 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 09 19:17:00 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 09 19:17:00 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 09 19:17:50 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 09 19:17:50 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 09 19:17:50 pf.pfortin.com kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000

Besides the kernel bug, I think the bluetooth timeouts are new...  will try to check if the system will stay up long enough...
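
One way to check whether the Bluetooth timeouts really sit right before the kernel BUG is to parse the journal lines and measure the gap. The following is only a minimal sketch: the lines are copied from the excerpt above, while the function names, the regular expression, and the assumed year (2022, which the journal lines themselves do not carry) are illustrative choices, not part of any existing tool.

```python
import re
from datetime import datetime

# Journal lines copied from the excerpt above.
JOURNAL = """\
May 09 19:15:49 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 09 19:15:49 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 09 19:17:00 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 09 19:17:00 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 09 19:17:50 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 09 19:17:50 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 09 19:17:50 pf.pfortin.com kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
"""

# "<Mon DD HH:MM:SS> <host> kernel: <message>"
LINE_RE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ kernel: (.*)$")


def parse_kernel_lines(text, year=2022):
    """Return (timestamp, message) pairs for kernel lines; the year is an assumption."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((stamp, match.group(2)))
    return events


def timeouts_before_bug(events):
    """Count 'link tx timeout' lines before the first BUG line.

    Returns (count, seconds from the last timeout to the BUG line),
    or None if no BUG line is present.
    """
    timeouts = []
    for stamp, message in events:
        if message.startswith("BUG:"):
            gap = (stamp - timeouts[-1]).total_seconds() if timeouts else None
            return len(timeouts), gap
        if "link tx timeout" in message:
            timeouts.append(stamp)
    return None


if __name__ == "__main__":
    print(timeouts_before_bug(parse_kernel_lines(JOURNAL)))
```

For this excerpt the sketch counts three timeouts, the last one logged in the same second as the BUG line. That is consistent with a Bluetooth connection to the crash, but it does not prove one.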
Comment 3 Pierre Fortin 2022-05-10 02:00:20 CEST
Had laptop connected via BT; disconnected and I may be avoiding lockup...  the uptime before lockup appears random based on a sample of 3 lockups.
Pierre Fortin 2022-05-10 02:00:55 CEST

Summary: FYI: System lockup => System lockups

Comment 4 Pierre Fortin 2022-05-10 03:19:21 CEST
Created attachment 13242 [details]
dmesg errors on boot

With a total lockup, I can only guess at where to look, so I will add oddities as I find them...  Here, various services report "lacks a native systemd unit file"
Comment 5 Pierre Fortin 2022-05-10 03:26:50 CEST
Created attachment 13243 [details]
latest lockup

Did something in systemsettings5 get clobbered?
Comment 6 Morgan Leijström 2022-05-10 08:26:19 CEST
Could you run a RAM check, using all cores, and let it run overnight?

(I had peculiar faults a while ago when a core in my CPU went bad; I saw it when running a RAM check.)

There is a RAM check option on our install media.

CC: (none) => fri

Comment 7 Pierre Fortin 2022-05-10 15:38:39 CEST
Created attachment 13244 [details]
htop showing 7 procs pegged at 100%

A photo was the only way to get a screenshot when system hung this morning.

Will try to run memory checks this evening...  I have a zoom meeting shortly -- hopefully, I can get through it.

Now running on 
$ uname -a
Linux pf.pfortin.com 5.17.6-server-1.mga9 #1 SMP PREEMPT Mon May 9 18:34:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

and leaving "journalctl -f" on screen in case something appears there that doesn't make it to disk... 

Amazingly, the system ran all night, until... I woke up to bluetooth at 100% and SIX instances of:
  /usr/lib64/sa/sadc -F -L 600 6 /var/log/sa 
(System Activity Data Collector -- I did not know of this) each running at 100%. They appear to have started one per hour, +/- 1 second.

According to htop, memory: 57.7G/62.5G (swap 0K/4.00G) and curiously, processors 2, 4, 6, 8, 10, 12, 14 (of 20) pegged at 100% (matches BT + 6x sadc at 100%) -- only even numbered procs... :?

$ ll /var/log/sa
total 3100
-rw-r--r-- 1 root root  39244 May  6 23:51 sa06
-rw-r--r-- 1 root root 460060 May  7 23:51 sa07
-rw-r--r-- 1 root root 460060 May  8 23:51 sa08
-rw-r--r-- 1 root root 399824 May  9 23:51 sa09
-rw-r--r-- 1 root root  45788 May 10 09:01 sa10
-rw-r--r-- 1 root root  70464 May  7 04:02 sar06
-rw-r--r-- 1 root root 829128 May  8 04:02 sar07
-rw-r--r-- 1 root root 829128 May  9 04:02 sar08

The sarNN files are human readable; saNN are binary -- any utility to read those? 
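
The binary saNN files are normally read with `sar` itself (part of the same sysstat package as `sadc`), using its `-f` option to name the data file. The following is a minimal sketch that wraps the call. The helper names are made up for illustration, and `-u` (CPU utilization) is just one report choice. It assumes the directory from the listing above and that sysstat is still installed.

```python
import shutil
import subprocess
from pathlib import Path

SA_DIR = Path("/var/log/sa")  # directory shown in the listing above


def sar_command(day, sa_dir=SA_DIR):
    """Build the argv for a CPU-utilization report from one daily data file."""
    return ["sar", "-u", "-f", str(Path(sa_dir) / f"sa{day:02d}")]


def read_sa(day, sa_dir=SA_DIR):
    """Run sar on the data file; return its text output, or None if unavailable."""
    cmd = sar_command(day, sa_dir)
    if shutil.which(cmd[0]) is None or not Path(cmd[-1]).exists():
        return None  # sysstat not installed, or no data file for that day
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


if __name__ == "__main__":
    # Only prints the command; call read_sa(10) to actually run it.
    print(" ".join(sar_command(10)))
```

Calling `read_sa(10)` would run the equivalent of `sar -u -f /var/log/sa/sa10`.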

In case it's related: all this started after I reloaded to bring in latest glibc and current running kernel.
Comment 8 Thomas Backlund 2022-05-10 19:52:13 CEST
(In reply to Pierre Fortin from comment #4)
> Created attachment 13242 [details]
> dmesg errors on boot
> 
> With a total lockup, I can only guess at where to look. so will add oddities
> as I find them...  Here, various services report "lacks a native systemd
> unit file"

Those are only informational messages for some packages not yet converted from init scripts...
Comment 9 Thomas Backlund 2022-05-10 20:00:05 CEST
(In reply to Pierre Fortin from comment #7)
> Created attachment 13244 [details]
> htop showing 7 procs pegged at 100%
> 


Looks like sysstat is overloading your system.

If you remove that package, does your system work OK then?
Comment 10 Thomas Backlund 2022-05-10 20:03:47 CEST
(In reply to Pierre Fortin from comment #2)

> May 09 19:17:50 pf.pfortin.com kernel: BUG: kernel NULL pointer dereference,
> address: 0000000000000000
> 

This is a kernel crash, but since the rest of the crash info is missing, one can only guess...
Comment 11 Pierre Fortin 2022-05-10 21:50:57 CEST
This is a super-weird situation...  the lockups started out of the blue and hit me every few minutes.  Now, I've been up for 7 hours and all looks good.
The kernel crash was the only one I saw; the others just locked up tight; screens frozen.  A picture was the only option; reminded me of the camera dumps of neon bulb control panels on NORAD SAGE computers in the 1960s  LOL

Just removed sysstat and killed sadc...
Comment 12 Pierre Fortin 2022-05-11 15:04:08 CEST
Good morning! Feeling confident the lockups are resolved with the 5.17.6 kernel.
Thanks All!!
Comment 13 Pierre Fortin 2022-05-12 19:39:42 CEST
Nope...  just had another lockup after being up 50 hours and a couple minutes.  I had walked away to chat with a visitor; came back to the system repeating a 1 second audio clip over and over. This was from a news stream.

Could not ssh into or ping the system.  Sadly, other than seeing the times in the journal to discern uptime, there's nothing abnormal therein.  :(

Operating System: Mageia 9
KDE Plasma Version: 5.24.4
KDE Frameworks Version: 5.93.0
Qt Version: 5.15.2
Kernel Version: 5.17.6-server-2.mga9 (64-bit)  # latest kernel via mcc
Graphics Platform: X11
Processors: 20 × 12th Gen Intel® Core™ i7-12700K
Memory: 62.5 GiB of RAM
Graphics Processor: AMD Radeon RX 6600 XT
Comment 14 Pierre Fortin 2022-05-13 03:31:51 CEST
another kernel NULL pointer dereference...  (5.17.6-server-2.mga9)
This is everything at the end of this journal:
May 12 18:28:42 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 12 18:28:42 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 12 18:29:32 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 12 18:29:32 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 12 18:30:38 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 12 18:30:38 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 12 18:32:04 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 12 18:32:04 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 12 18:33:15 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 12 18:33:15 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 12 18:33:19 pf.pfortin.com kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000

While I had just started typing here, got another...  (next)
Comment 15 Pierre Fortin 2022-05-13 03:47:33 CEST
Should have included this on comment 14:
May 12 11:01:05 pf.pfortin.com kernel: Linux version 5.17.6-server-2.mga9 (iurt@rabbit.mageia.org) (gcc (Mageia 12.1.1-0.20220507.1.mga9) 12.1.1 20220507, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Tue May 10 16:14:21 UTC 2022
May 12 11:01:05 pf.pfortin.com kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.17.6-server-2.mga9 root=UUID=957cd552-a8ad-4d8a-a90d-1c6eb3871ebd ro splash quiet noiswmd resume=UUID=63b04ef6-fb15-4f03-b7af-6573fb6070ec audit=0
May 12 11:01:05 pf.pfortin.com kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks

New lockup...

Starting to suspect some patterns...
* Bluetooth errors just before the lockup.
* kernel NULL pointer dereference -- sometimes in journal
* lockup occurs very near a screen power save
  - just disabled Screen Energy Saving...

May 12 18:44:03 pf.pfortin.com kernel: microcode: microcode updated early to revision 0x1f, date = 2022-03-03
May 12 18:44:03 pf.pfortin.com kernel: Linux version 5.17.7-server-1.mga9 (iurt@ecosse.mageia.org) (gcc (Mageia 12.1.1-0.20220507.1.mga9) 12.1.1 20220507, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Thu May 12 12:54:42 UTC 2022
May 12 18:44:03 pf.pfortin.com kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.17.7-server-1.mga9 root=UUID=957cd552-a8ad-4d8a-a90d-1c6eb3871ebd ro splash quiet noiswmd resume=UUID=63b04ef6-fb15-4f03-b7af-6573fb6070ec audit=0
May 12 18:44:03 pf.pfortin.com kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
[snip]
May 12 21:06:58 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 12 21:06:58 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 12 21:07:58 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 12 21:07:58 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 12 21:09:20 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 12 21:09:20 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 12 21:12:24 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 12 21:12:24 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 12 21:14:10 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 12 21:14:10 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
May 12 21:16:54 pf.pfortin.com kernel: Bluetooth: hci0: link tx timeout
May 12 21:16:54 pf.pfortin.com kernel: Bluetooth: hci0: killing stalled connection a0:a8:cd:ad:3b:75
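
To see how regular the timeouts are, the spacing between them can be computed directly. This is a rough sketch using only the six "link tx timeout" timestamps from the May 12 excerpt above. The function name is illustrative, and it assumes all timestamps fall on the same day (no midnight wrap).

```python
from datetime import datetime

# Timestamps of the "link tx timeout" lines in the excerpt above (May 12).
TIMEOUTS = ["21:06:58", "21:07:58", "21:09:20", "21:12:24", "21:14:10", "21:16:54"]


def gaps_seconds(stamps):
    """Return the number of seconds between each pair of consecutive timestamps."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp in stamps]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    print(gaps_seconds(TIMEOUTS))  # [60, 82, 184, 106, 164]
```

The gaps range from one to about three minutes, so the timeouts recur often but not on a fixed period.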
Comment 16 Pierre Fortin 2022-05-13 05:44:58 CEST
Another lockup: May 12 23:26:37
Bluetooth messages starting to bother me.  This machine and the old laptop(mga8) INSIST on staying connected.  Kinda like the two computers in the movie "The Forbin Project" -- one letter off from my name, so it's easy for me to remember that title.

The machines are so aggressive in staying connected that I finally disabled BT in this machine.  I don't see a BT disable in mga8 settings.
Comment 17 Pierre Fortin 2022-05-13 19:02:28 CEST
Had this machine less than 70 days.  I ALWAYS check for new BIOS when working on a new machine (mine or friends')...  BIOS was up to date at 1.0.8...
Just discovered BIOS is now at 1.0.13...!!!  In two months!
Looks like 1.0.12 is the one that may resolve my issues; 1.0.13 it is!
A friend informed me he got an email from Dell about the BIOS. He got his similar machine a week or two before me.
Comment 18 Pierre Fortin 2022-05-19 17:55:10 CEST
Looks like this may have been cleared up with BIOS 1.0.13...
Closing for now.

Resolution: (none) => WORKSFORME
Status: NEW => RESOLVED

Comment 19 Pierre Fortin 2022-05-21 07:06:40 CEST
New BIOS released:
File Name: XPS_8950_1.2.1_x64.exe
File Size: 8.25 MB
Importance: Urgent
Fixes & Enhancements
- Firmware updates to address security vulnerabilities including
(Common Vulnerabilities and Exposures - CVE) such as CVE-2021-3712,
CVE-2019-14584, CVE-2021-28210, and CVE-2021-28211.

Will update this weekend. No lockups lately.
