Bug 23043

Summary: Ryzen 7 processor / Kernel crashes system at random
Product: Mageia
Reporter: DariuszSki <linuxstuff>
Component: RPM Packages
Assignee: Kernel and Drivers maintainers <kernel>
Status: RESOLVED FIXED
QA Contact:
Severity: critical
Priority: Normal
CC: fri, linuxstuff, marja11, rolfpedersen
Version: 6
Target Milestone: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Source RPM: kernel-desktop-4.14.70-2.mga6
CVE:
Status comment:
Attachments:
  Crash text from syslog
  Edited syslog of latest crash
  Another crash that left some information behind

Description DariuszSki 2018-05-17 09:30:25 CEST
Mageia 6, x86_64.

Description of problem:
New Ryzen 7 processor / motherboard system. Since I got it, the system has crashed at random times, but 99% of the time during the night (when it gets very light use). Syslog usually contains nothing about the crash, because all logging stops when the system locks up. Whatever was displayed on the screen stays there, and no other processes run once the system has locked up. The system is very stable during the daytime when I'm using it.


Version-Release number of selected component (if applicable):
kernel-desktop-4.14.30-3.mga6

How reproducible:
Just use the system as normal.

Processor: Ryzen 7 1800X
Motherboard: CROSSHAIR VI HERO


Syslog crash information:
https://pastebin.com/A00Hxm2F
Comment 1 Marja Van Waes 2018-05-18 08:17:37 CEST
Hi DariuszSki

Please attach your logs instead of giving a link to them :-)

Thanks!

Marja

Assignee: bugsquad => kernel
CC: (none) => marja11

Comment 2 DariuszSki 2018-05-18 09:39:31 CEST
Created attachment 10165 [details]
Crash text from syslog

This extract from syslog indicates what happened to cause a system crash. 99% of the time the computer crashes with nothing in syslog to indicate what happened.
Comment 3 DariuszSki 2018-05-18 12:04:52 CEST
Just to add, the system is NOT overclocked, and I am using the latest BIOS for the motherboard (as the log text shows).
Comment 4 Morgan Leijström 2018-05-18 13:17:02 CEST
(In reply to DariuszSki from comment #0)
> whatever was displayed on the screen stays

So when you leave the system for the night, you could open a terminal window, become root in it, and issue journalctl -f, so that it writes the log to the window until it hangs.


> Version-Release number of selected component (if applicable):
> kernel-desktop-4.14.30-3.mga6

4.14.40 is in the updates_testing repo; you could try it and see if it works better.

CC: (none) => fri

Comment 5 DariuszSki 2018-05-18 20:54:05 CEST
I have installed the latest kernel you suggested, which has just shown up. I will let the system run as normal and see if the new kernel does anything different.
Comment 6 DariuszSki 2018-05-25 19:29:22 CEST
After a number of reboots and trying to force things over a number of days, the newest kernel seems to have fixed the problem with the processor stall. If it happens again I'll re-open the bug.

Resolution: (none) => FIXED
Status: NEW => RESOLVED

Comment 7 DariuszSki 2018-06-05 16:25:02 CEST
Report was originally for: kernel-desktop-4.14.30-3.mga6

Latest version kernel still affected: kernel-desktop-4.14.44-2.mga6

I am re-opening this bug as the system lockup is still happening. After the original bug report there was a kernel update. That one did have one lockup / crash, but I didn't report it because the next day there was a new kernel (kernel-desktop-4.14.44-2.mga6), which is the latest kernel.

After leaving the machine on 24/7, it managed to get to 3 days 15 hours before it fell over, as always during the night, when the PC is doing very little actual work.

When I run the command you suggested, "journalctl -f", in Konsole, it seems to give the processor just enough work to keep it from falling over during the night. But when I don't run it overnight (as last night), the processor locks up the machine, and there is nothing in /var/log/syslog to show what caused the lockup. I am assuming it was a processor stall, like the one I had managed to capture in comment #2.

It's impractical to keep using "journalctl -f" to keep a machine running during the night. Any other suggestions to see what's going on?

Source RPM: kernel-desktop-4.14.30-3.mga6 => kernel-desktop-4.14.44-2.mga6
Status: RESOLVED => REOPENED
Resolution: FIXED => (none)

Comment 8 Morgan Leijström 2018-06-05 18:37:33 CEST
Possibly writes of all kinds, to disk or network, do not get flushed when it hangs.

I do not know how to do it (never tried), but could you use another PC to log in and run "journalctl -f", so you get the output on another machine directly? (Hoping that does not prevent it from crashing, so we can log the problem.)
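
A minimal sketch of that idea, assuming the second PC can reach the crashing machine over SSH and that the remote account is allowed to read the journal. The target name `root@ryzen-box`, the log file name, the script name, and both function names are hypothetical, chosen only for illustration. Each line is appended to a local file as soon as it arrives, so whatever was printed last before the hang is kept on the second machine.

```python
import subprocess
import sys


def remote_follow_command(target):
    """Build the ssh command that follows the journal on the remote machine.

    `target` is a user@host string such as "root@ryzen-box" (hypothetical).
    """
    return ["ssh", target, "journalctl", "-f"]


def stream_to_file(command, log_path):
    """Run `command` and append each line of its output to `log_path`.

    Lines are flushed one by one so the local file is current up to the
    moment the remote machine stops responding. Returns the line count.
    """
    count = 0
    with open(log_path, "a", encoding="utf-8") as log:
        with subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep ssh error messages in the log too
            text=True,
            errors="replace",
        ) as proc:
            for line in proc.stdout:
                log.write(line)
                log.flush()  # push the line out of Python's buffer right away
                count += 1
    return count


if __name__ == "__main__":
    # Usage: python3 follow_remote_journal.py user@host local-log-file
    if len(sys.argv) == 3:
        stream_to_file(remote_follow_command(sys.argv[1]), sys.argv[2])
    else:
        print("usage: follow_remote_journal.py user@host logfile", file=sys.stderr)
```

Run it on the second PC, for example `python3 follow_remote_journal.py root@ryzen-box ryzen-journal.log`, and leave it running overnight. Whether the extra SSH traffic is itself enough load to stop the lockup, as the local "journalctl -f" appeared to be, is not known.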

Not a solution, but a workaround may be to give it some work. Sleeping silicon is a waste ;) I let my workstation run BOINC practically always.
Comment 9 Morgan Leijström 2018-06-05 19:57:27 CEST
I see kernel-linus-4.14.48 was built a couple of hours ago, in mga6 updates_testing...
Comment 10 DariuszSki 2018-09-30 12:16:17 CEST
This bug was originally reported for kernel-desktop-4.14.30-3.mga6, but it has affected ALL kernel updates since, with multiple random crashes. I am currently on the latest kernel, kernel-desktop-4.14.70-2.mga6.

Source RPM: kernel-desktop-4.14.44-2.mga6 => kernel-desktop-4.14.70-2.mga6

Comment 11 DariuszSki 2018-09-30 12:34:03 CEST
Created attachment 10390 [details]
Edited syslog of latest crash

Most crashes left no information in syslog, or nothing useful. However, this log, taken from syslog this morning after another crash, seems the most detailed. It logged everything from the crash until I got up and rebooted the machine some hours later. During this period syslog kept repeating the same information over and over. I've attached a shortened version.

The motherboard is using the latest BIOS, although there have been no updates to it in about three months.

CC: (none) => linuxstuff

Comment 12 Rolf Pedersen 2018-09-30 14:27:00 CEST
I have a newer (Intel) machine that would crash overnight. I removed a (suspect) NVMe card from the hot M.2 slot on the back of the motherboard and have not had that crash since. The partition on that drive was not even mounted. Long shot. ;)

CC: (none) => rolfpedersen

Comment 13 DariuszSki 2018-09-30 17:02:47 CEST
(In reply to Rolf Pedersen from comment #12)
> I have a newer (Intel) machine that would crash overnight. I removed a
> (suspect) NVMe card from the hot M.2 slot on the back of the motherboard and
> have not had that crash since. The partition on that drive was not even
> mounted. Long shot. ;)

I haven't seen anything to indicate temperatures go silly. During the day when the system is in use, graphics card is ok, and the CPU is water cooled and I don't feel any real heat from the system. Hard drives are in fan airflow, so they should never get hot. Memory passes error checks I've performed. SMART isn't showing any problems with the hard drives.
Comment 14 DariuszSki 2018-10-16 09:30:57 CEST
Created attachment 10404 [details]
Another crash that left some information behind

Syslog entries for another crash that left some information behind about what happened. Syslog usually records nothing about a crash.
Comment 15 DariuszSki 2019-03-31 16:15:21 CEST
After many months, a new BIOS was released, and with the last two kernels (the latest being 4.14.104-desktop-2.mga6) the computer has appeared very stable; it has not locked up at all. I will keep an eye on the system and close this bug as "Fixed", but will re-open it if I need to. Thanks.

Resolution: (none) => FIXED
Status: REOPENED => RESOLVED