Bug 32199

Summary: Nvidia problem
Product: Mageia
Reporter: Mészáros Csaba <csablak>
Component: RPM Packages
Assignee: Kernel and Drivers maintainers <kernel>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: Normal
CC: csablak, davidwhodgins, fri
Version: 9
Target Milestone: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Source RPM: pass
CVE:
Status comment:
Attachments: journanctl -k output
             journalctl -b -1 --no-h

Description Mészáros Csaba 2023-08-28 21:13:28 CEST
Description of problem:

With the nouveau driver the desktop is sluggish: windows do not open promptly.
With the nvidia driver, the machine takes 5 minutes to shut down.
The output of journalctl -k seems to report an error?

$ hostnamectl 
 Static hostname: csablakPC
       Icon name: computer-desktop
         Chassis: desktop 🖥
      Machine ID: 498c5855421b41c5937cc93973e49231
         Boot ID: 96c65ec23c3242aeb831cb87eaec8807
Operating System: Mageia 9                        
          Kernel: Linux 6.4.9-desktop-4.mga9
    Architecture: x86-64
 Hardware Vendor: ASUSTeK Computer INC.
  Hardware Model: M3N78-EMH HDMI
Firmware Version: 0602
   Firmware Date: Tue 2008-11-04

$ cat /proc/cpuinfo | grep 'model name'
model name      : AMD Phenom(tm) 9350e Quad-Core Processor
model name      : AMD Phenom(tm) 9350e Quad-Core Processor
model name      : AMD Phenom(tm) 9350e Quad-Core Processor
model name      : AMD Phenom(tm) 9350e Quad-Core Processor

The same nvidia driver is installed on another partition, under Linux Mint, and there is no problem there.

$ rpm -qa | grep nvidia
dkms-nvidia470-470.199.02-1.mga9.nonfree
lib64nvidia-egl-wayland1-1.1.11-1.mga9
nvidia470-utils-470.199.02-1.mga9.nonfree
nvidia470-doc-html-470.199.02-1.mga9.nonfree
x11-driver-video-nvidia470-470.199.02-1.mga9.nonfree

The kernel command line parameters:

# journalctl -b 0 | grep "Unknown kernel"
aug 28 16:47:28 csablakPC kernel: Unknown kernel command line parameters "nokmsboot BOOT_IMAGE=/boot/vmlinuz-6.4.9-desktop-4.mga9 pti=auto vga=34C", will be passed to user space.
Unknown - vga?

$ cat /etc/X11/xorg.conf.d/20-vgaDriver.conf
# File generated by XFdrake (rev 262502)

# File generated by csablak

# **********************************************************************
# Refer to the xorg.conf man page for details about the format of
# this file.
# **********************************************************************

Section "Device"
    Identifier          "Videocard0"
    VendorName          "NVIDIA Corporation"
    BoardName           "NVIDIA GeForce GT710"
    Driver              "nvidia"
    Option              "ConnectToAcpid" "0"
    Option              "UseEdidDpi" "True"
    Option              "DPMS" "true"
EndSection

Section "Screen"
    Identifier          "Screen0"
    Device              "Videocard0"
    Monitor             "Monitor0"
    Option              "nvidiaXineramaInfoOrder" "DFP-0, DFP-1"
    Option              "TripleBuffer" "true"
    Option              "AddARGBGLXVisuals"
    Option              "metamodes" "1920x1080_75 +0+0; 1280x720 +0+0"
    Option              "SLI" "Off"
    Option              "MultiGPU" "Off"
    Option              "BaseMosaic" "off"
    Subsection "Display"
        Depth       24
    EndSubsection
EndSection

I already use xorg.conf.d for everything.
$ ll /etc/X11/xorg.conf.d/
total 28
-rw-r--r-- 1 root    root     309 aug   19 13:50 00-keyboard.conf
-rw-r--r-- 1 csablak csablak  309 aug    4  2022 10-custom-kbd.conf
lrwxrwxrwx 1 root    root      10 dec    8  2021 20-vgaDriver.conf -> nvidia.ini
-rw-r--r-- 1 root    root    1138 jan    4  2023 50-monitor.conf
drwxr-xr-x 2 root    root    4096 aug   20 13:19 nouveau/
-rw-rw-r-- 1 csablak csablak  694 dec   24  2022 nouveau.ini
drwxr-xr-x 2 root    root    4096 aug   28 21:08 nvidia/
-rw-rw-r-- 1 csablak csablak 1066 aug   28 21:06 nvidia.ini

I see that I can't attach a file in the initial post.
I just thought of something to try. If it changes anything, I will report it. If not, this is the bug.
Comment 1 Mészáros Csaba 2023-08-28 21:15:06 CEST
Created attachment 13954 [details]
journanctl -k output

CC: (none) => csablak

Comment 2 Dave Hodgins 2023-08-29 00:31:28 CEST
The unknown kernel parameters message is normal when there are parameters
present that are not handled by the kernel itself. They are left for things
like xorg or systemd to handle.
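
To see exactly which parameters were passed on to user space, the quoted list can be pulled out of the log message. This is only a minimal sketch, assuming a POSIX shell with sed; the function name unknown_params is hypothetical and not part of any Mageia tool:

```shell
# Hypothetical helper: read kernel log text on stdin and print the quoted
# parameter list from an "Unknown kernel command line parameters" message.
unknown_params() {
    sed -n 's/.*Unknown kernel command line parameters "\([^"]*\)".*/\1/p'
}

# Example using the line from the description. On a live system the input
# could instead come from: journalctl -k -b 0
line='aug 28 16:47:28 csablakPC kernel: Unknown kernel command line parameters "nokmsboot BOOT_IMAGE=/boot/vmlinuz-6.4.9-desktop-4.mga9 pti=auto vga=34C", will be passed to user space.'

# Print one parameter per line.
printf '%s\n' "$line" | unknown_params | tr ' ' '\n'
```

Comparing that list with the contents of /proc/cmdline shows which of the boot parameters the kernel did not consume itself.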

CC: (none) => davidwhodgins

Comment 3 Dave Hodgins 2023-08-29 00:50:58 CEST
According to https://lore.kernel.org/linux-kernel/882633660a17792b574bcedea1431a6c@dogben.com/T/
the error at mm/gup.c:1101 is a false positive triggered by Google Chrome.

After the nvidia module causes the delay on shutdown, use
"journalctl -b -1 --no-h|xz>journal.txt.xz" to get the prior boot's journal,
and attach journal.txt.xz to this bug report.
Comment 4 Mészáros Csaba 2023-08-29 07:57:11 CEST
Created attachment 13955 [details]
journalctl -b -1 --no-h

OK. The requested file is attached.
Regards: Csaba
Comment 5 Dave Hodgins 2023-08-29 18:50:24 CEST
If the kernel bug wasn't a false positive, the important part would have been
from the line with
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000258
up to the end trace.

In this case I think the important parts are
aug 29 07:36:10 kernel: note: Xorg[1670] exited with irqs disabled
aug 29 07:36:10 kernel: note: Xorg[1670] exited with preempt_count 1
aug 29 07:36:10 kernel: Fixing recursive fault but reboot is needed!
aug 29 07:36:10 kernel: BUG: scheduling while atomic: Xorg/1670/0x00000000

which seems to trigger
aug 29 07:36:24 systemd[1]: lightdm.service: State 'final-sigterm' timed out. Killing.
aug 29 07:36:24 systemd[1]: lightdm.service: Killing process 1670 (Xorg) with signal SIGKILL.
aug 29 07:36:35 systemd[1]: lightdm.service: Processes still around after final SIGKILL. Entering failed mode.
aug 29 07:36:35 systemd[1]: lightdm.service: Failed with result 'timeout'.
aug 29 07:36:35 systemd[1]: lightdm.service: Unit process 1670 (Xorg) remains running after unit stopped.
aug 29 07:36:35 systemd[1]: Stopped lightdm.service.
aug 29 07:36:35 systemd[1]: lightdm.service: Consumed 24.297s CPU time.

It may be that the kernel NULL pointer dereference is not a false positive,
or that the false positive is causing the problem for xorg.
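
The lines quoted above can be picked out of a saved journal with a single grep. This is only a minimal sketch, assuming the journal is available as plain text on stdin; the function name hang_lines is hypothetical, and the pattern list is an illustrative choice based on the messages in this report, not an exhaustive set:

```shell
# Hypothetical helper: keep only the kernel and systemd messages that point
# at Xorg failing to exit during shutdown. Reads journal text on stdin.
hang_lines() {
    grep -E 'exited with (irqs disabled|preempt_count)|scheduling while atomic|Fixing recursive fault|lightdm\.service: (State|Killing|Processes still around|Failed)'
}
```

With the attachment from comment 4 this could be run as "xz -dc journal.txt.xz | hang_lines".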

Assigning to the kernel and drivers team.

Assignee: bugsquad => kernel

Comment 6 Mészáros Csaba 2023-10-24 19:30:05 CEST
Hi guys!

Can we expect any changes in this area? The system is significantly faster with the nvidia driver, but I don't use it because of the long shutdown time.

Regards: Csaba
Comment 7 Morgan Leijström 2023-11-20 13:13:32 CET
There are new nvidia driver versions in nonfree updates testing.

Also try new kernel 6.5.* in updates testing.

CC: (none) => fri

Comment 8 Mészáros Csaba 2023-11-20 13:55:21 CET
OK. I ask for a little patience, because I got pissed off and, since I had the opportunity, I ended up replacing my entire 15-year-old machine with a more modern one. However, I'm going to assemble this old motherboard and all its stuff in another house, but I'll have to get that first, and I'll also need a power supply.

Thanks.
Comment 9 Morgan Leijström 2023-12-15 18:15:15 CET
We now have an even newer kernel, 6.5.13, if you want to test.

Bug 32628, Bug 32623

There is also an update to nvidia-current (with no bug assigned) in nonfree updates_testing.
Comment 10 Morgan Leijström 2024-03-03 11:18:06 CET
(In reply to Mészáros Csaba from comment #8)
> I'm going to assemble this old motherboard and all

Any news here?
Comment 11 Mészáros Csaba 2024-03-03 13:13:15 CET
I ended up selling the old machine, but I think it's fine now. My friend also has a video card like this, and he doesn't have a long wait at shutdown.
Comment 12 Morgan Leijström 2024-03-03 13:25:18 CET
Thank you.
Assuming fixed then :)

Resolution: (none) => FIXED
Status: NEW => RESOLVED