Bug 26025 - Desktop crash maybe related to x11 nvidia driver update
Summary: Desktop crash maybe related to x11 nvidia driver update
Status: RESOLVED OLD
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 7
Hardware: x86_64 Linux
Priority: Normal, Severity: critical
Target Milestone: ---
Assignee: Kernel and Drivers maintainers
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-01-06 09:42 CET by Christian C
Modified: 2021-09-07 14:10 CEST
CC List: 2 users

See Also:
Source RPM: x11-driver-video-nvidia340-340.108-1.mga7.nonfree.x86_64 ??
CVE:
Status comment:


Attachments
Extract from /var/log/messages at the time of crash (22.75 KB, text/plain)
2020-01-06 09:49 CET, Christian C

Description Christian C 2020-01-06 09:42:52 CET
Description of problem:

My desktop crashed after losing its connection to the X11 server.
See the attached extract from /var/log/messages.

I had recently updated the x11 video driver:
Jan  2 09:08:04 localhost [RPM][27231]: install x11-driver-video-nvidia340-340.108-1.mga7.nonfree.x86_64: success

After the reboot, I saw the following traces in dmesg:

[  188.228647] NVRM: Your system is not currently configured to drive a VGA console
[  188.228654] NVRM: on the primary VGA device. The NVIDIA Linux graphics driver
[  188.228658] NVRM: requires the use of a text-mode VGA console. Use of other console
[  188.228661] NVRM: drivers including, but not limited to, vesafb, may result in
[  188.228663] NVRM: corruption and stability problems, and is not supported.
[  188.279320] ------------[ cut here ]------------
[  188.279325] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'nvidia_stack_t' (offset 11864, size 3)!
[  188.279338] WARNING: CPU: 2 PID: 4328 at mm/usercopy.c:80 usercopy_warn+0x7d/0xa0
[  188.279339] Modules linked in: ip6t_REJECT nf_reject_ipv6 xt_comment ip6table_mangle ip6table_nat ip6table_raw nf_log_ipv6 ip6table_filter ip6_tables xt_recent ipt_IFWLOG ipt_psd xt_set ip_set_hash_ip ip_set ipt_REJECT nf_reject_ipv4 xt_conntrack xt_hashlimit xt_addrtype xt_mark iptable_mangle iptable_nat xt_CT xt_tcpudp iptable_raw nfnetlink_log xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_nat nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_netlink nfnetlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp nf_conntrack nf_defrag_ipv4 iptable_filter af_packet cfg80211 rfkill vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia(PO) joydev raid1 kvm_amd ccp snd_hda_codec_hdmi kvm irqbypass sha1_generic input_leds wmi_bmof r8169 k10temp realtek libphy snd_hda_codec_via
[  188.279366]  snd_hda_codec_generic sp5100_tco ledtrig_audio i2c_piix4 snd_hda_intel snd_intel_nhlt snd_hda_codec snd_hda_core asus_atk0110 snd_hwdep snd_pcm snd_timer ide_pci_generic jmicron snd ide_core soundcore acpi_cpufreq evdev sch_fq_codel ip_tables x_tables ipv6 crc_ccitt nf_defrag_ipv6 autofs4 hid_generic usbhid hid uas usb_storage sr_mod ohci_pci serio_raw xhci_pci xhci_hcd ehci_pci ehci_hcd ohci_hcd usbcore ata_generic pata_acpi usb_common pata_jmicron video mxm_wmi i2c_algo_bit drm_kms_helper ttm wmi button drm dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nouveau]
[  188.279387] CPU: 2 PID: 4328 Comm: Xorg Tainted: P           O      5.4.6-desktop-2.mga7 #1
[  188.279388] Hardware name: System manufacturer System Product Name/M4A87TD EVO, BIOS 2001    03/08/2011
[  188.279390] RIP: 0010:usercopy_warn+0x7d/0xa0
[  188.279392] Code: 0d 95 41 51 4d 89 d8 48 c7 c0 c7 7f 0c 95 49 89 f1 48 89 f9 48 0f 45 c2 48 c7 c7 18 a1 0d 95 4c 89 d2 48 89 c6 e8 ac 8d e1 ff <0f> 0b 48 83 c4 18 c3 48 c7 c6 9f 69 0c 95 49 89 f1 49 89 f3 eb 96
[  188.279393] RSP: 0018:ffffa5db4080bbb8 EFLAGS: 00010286
[  188.279395] RAX: 0000000000000000 RBX: ffff94e9acba5e58 RCX: 0000000000000006
[  188.279395] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff94e9afa974c0
[  188.279396] RBP: 0000000000000003 R08: 0000000000000457 R09: 0000000000000004
[  188.279397] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[  188.279397] R13: ffff94e9acba5e5b R14: ffff94e9acba5e58 R15: ffff94e9acba5ea0
[  188.279399] FS:  00007f905c5a9940(0000) GS:ffff94e9afa80000(0000) knlGS:0000000000000000
[  188.279400] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  188.279400] CR2: 00007f9057a2cf40 CR3: 00000002194ec000 CR4: 00000000000006e0
[  188.279401] Call Trace:
[  188.279406]  __check_object_size+0x162/0x173
[  188.279548]  os_memcpy_to_user+0x21/0x40 [nvidia]
[  188.279701]  _nv001372rm+0xa5/0x260 [nvidia]
[  188.279862]  ? _nv004782rm+0x4eba/0x5500 [nvidia]
[  188.280005]  ? _nv004329rm+0xec/0xf0 [nvidia]
[  188.280135]  ? _nv004324rm+0xca/0x650 [nvidia]
[  188.280266]  ? _nv015124rm+0x576/0x5c0 [nvidia]
[  188.280403]  ? _nv000694rm+0x2e/0x60 [nvidia]
[  188.280532]  ? _nv000789rm+0x5f5/0x8b0 [nvidia]
[  188.280657]  ? rm_ioctl+0x73/0x100 [nvidia]
[  188.280784]  ? nvidia_ioctl+0x148/0x490 [nvidia]
[  188.280924]  ? nvidia_frontend_ioctl+0x2d/0x50 [nvidia]
[  188.281051]  ? nvidia_frontend_unlocked_ioctl+0x19/0x20 [nvidia]
[  188.281054]  ? do_vfs_ioctl+0xa4/0x630
[  188.281056]  ? ksys_ioctl+0x60/0x90
[  188.281058]  ? ksys_write+0x59/0xd0
[  188.281060]  ? __x64_sys_ioctl+0x16/0x20
[  188.281062]  ? do_syscall_64+0x5f/0x200
[  188.281064]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  188.281066] ---[ end trace 3522e411b9a71731 ]---


Version-Release number of selected component (if applicable):
108-1.mga7.nonfree.x86_64 

How reproducible:
??

Steps to Reproduce:
1. ??
2.
3.
Comment 1 Christian C 2020-01-06 09:49:02 CET
Created attachment 11442
Extract from /var/log/messages at the time of crash
Comment 2 Thomas Backlund 2020-01-06 13:35:35 CET
Crap.

340.108 was supposed to have official kernel 5.4 support and was tested by some nvidia340 users without issues.

Does it work at all, or does it always crash?


I guess the nVidia devs forgot to test their changes with HARDENED_USERCOPY-enabled kernels :(

Technically it should still work, as we have enabled HARDENED_USERCOPY_FALLBACK, which prints the kernel trace as information but keeps working...


And the nvidia_stack_t symbol is in the binary-only code, so we can't patch it out :/
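
To see how a given kernel was built with respect to the two options mentioned above, one can filter its config file. This is only a sketch: the helper name usercopy_options is made up for illustration, and the path /boot/config-$(uname -r) is an assumption about where the distribution installs the kernel config.

```shell
#!/bin/sh
# Hypothetical helper: read a kernel config on stdin and print only the
# CONFIG_HARDENED_USERCOPY and CONFIG_HARDENED_USERCOPY_FALLBACK lines,
# whether they are set ("=y") or commented out ("# ... is not set").
usercopy_options() {
    grep -E '^(# )?CONFIG_HARDENED_USERCOPY(_FALLBACK)?[= ]'
}

# Example usage (assumes the config file is installed at this path):
# usercopy_options < "/boot/config-$(uname -r)"
```

If both options show up as "=y", the usercopy warning is expected to be logged while the copy itself is still allowed, as described in this comment.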


If you want to go back to the older driver:

dkms-nvidia340-340.107-12.mga7.nonfree.x86_64.rpm
nvidia340-cuda-opencl-340.107-12.mga7.nonfree.x86_64.rpm
nvidia340-devel-340.107-12.mga7.nonfree.x86_64.rpm
nvidia340-doc-html-340.107-12.mga7.nonfree.x86_64.rpm
x11-driver-video-nvidia340-340.107-12.mga7.nonfree.x86_64.rpm


Check which rpms are installed with rpm -qa | grep nvidia340

and then downgrade them, for example if you have dkms-nvidia340 and x11-driver-video-nvidia340 you can do:

urpmi --downgrade dkms-nvidia340-340.107-12.mga7.nonfree x11-driver-video-nvidia340-340.107-12.mga7.nonfree

and then add the following lines to /etc/urpmi/skip.list so that the packages are not updated again:
/^dkms-nvidia340/
/^x11-driver-video-nvidia340/


and so on...
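
The two steps above (list the installed nvidia340 packages, then downgrade each one to an explicit version) can be sketched as a small helper. The function name build_downgrade_cmd is hypothetical, and the sketch assumes that rpm -qa prints names in the name-version-release form shown in this report. It only prints the command so that it can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper: read package names (as printed by `rpm -qa`) on stdin
# and print a `urpmi --downgrade` command that pins every nvidia340 package
# to the version-release given as the first argument.
build_downgrade_cmd() {
    target="$1"   # e.g. 340.107-12.mga7.nonfree
    pkgs=""
    while IFS= read -r nvr; do
        case "$nvr" in
            *nvidia340*) ;;      # keep nvidia340 packages
            *) continue ;;       # ignore everything else
        esac
        # Strip the trailing "-version-release" to get the bare package name.
        name="${nvr%-*-*}"
        pkgs="$pkgs $name-$target"
    done
    if [ -n "$pkgs" ]; then
        printf 'urpmi --downgrade%s\n' "$pkgs"
    fi
}

# Example usage (prints the command, does not run it):
# rpm -qa | build_downgrade_cmd 340.107-12.mga7.nonfree
```

Appending the target version to every package name is the point of the sketch: a downgrade request naming the currently installed version leaves that version in place.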

CC: (none) => tmb
Assignee: bugsquad => kernel

Comment 3 Martin Whitaker 2020-01-06 20:43:54 CET
FWIW, that log message has been present for a long time - see bug 24663 - without any apparent ill effects. So it may be a red herring.

CC: (none) => mageia

Comment 4 Thomas Backlund 2020-01-06 21:40:13 CET
Yeah, I know, that's the stack trace printed out by HARDENED_USERCOPY_FALLBACK to notify users about it but "keep working" as I wrote in comment 2,

but then in the log in comment 1 I see:

Jan  5 04:56:45 localhost kernel: [130066.500696] NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ChID 0029, Class 00008597, Offset 00001b0c, Data 0000f000
Jan  5 04:56:45 localhost kernel: [130066.556059] NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ChID 0029, Class 00008597, Offset 00001b0c, Data 0000f000
Jan  5 04:57:18 localhost kernel: [130099.586739] NVRM: Xid (PCI:0000:05:00): 6, PE0001
Jan  5 04:57:18 localhost okular[12927]: The X11 connection broke (error 1). Did the X11 server die?


which is why I asked:

Does it work at all, or does it always crash?

The downgrade info is there to find out whether the problem goes away.
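
To check how often such NVRM Xid messages appear in a log, the lines can be reduced to a count per Xid code. This is a minimal sketch: the helper name xid_summary is hypothetical, and it assumes the messages keep the "NVRM: Xid (PCI:...): <code>," layout seen in the lines quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on stdin and print one line per
# Xid code in the form "<code> <count>", sorted numerically by code.
xid_summary() {
    # Keep only the numeric code that follows "NVRM: Xid (<device>): ".
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' |
        awk '{ n[$1]++ } END { for (k in n) print k, n[k] }' |
        sort -n
}

# Example usage (assumes the log is readable at this path):
# xid_summary < /var/log/messages
```

Running it on the log before and after a driver change gives a rough comparison of whether the Xid errors still occur.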
Comment 5 Christian C 2020-01-06 22:15:38 CET
I rebooted at Jan  5 17:59:19 and for the moment, my x11 server is still alive.

But half an hour later, I got these messages, which I hadn't seen this morning:

Jan  5 18:33:01 localhost kglobalaccel5[4882]: The X11 connection broke (error 1). Did the X11 server die?
Jan  5 18:33:01 localhost kscreen_backend_launcher[4892]: The X11 connection broke (error 1). Did the X11 server die?
Jan  5 18:33:01 localhost kuiserver5[6765]: The X11 connection broke (error 1). Did the X11 server die?
Jan  5 18:33:01 localhost org.a11y.Bus[5005]: XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
Jan  5 18:33:01 localhost org.a11y.Bus[5005]:       after 1633 requests (1633 known processed) with 0 events remaining.
Jan  5 18:33:01 localhost kactivitymanagerd[4968]: The X11 connection broke (error 1). Did the X11 server die?

I'll try the previous nvidia package the next time it crashes (or I have to reboot). I have some heavy tasks to finish.
Or I'll look into my old /var/log/messages to find some X11 errors.
Comment 6 Christian C 2020-01-07 09:36:26 CET
It finally crashed at Jan  7 02:32:50 with the same errors :

Jan  7 02:32:32 localhost kernel: [117101.322769] NVRM: GPU at PCI:0000:05:00: GPU-2d5ce2d6-32ab-88b3-e5cd-97d122043eb4
Jan  7 02:32:33 localhost kernel: [117101.322774] NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ChID 0028, Class 00008597, Offset 00001b0c, Data 0000f000
Jan  7 02:32:33 localhost kernel: [117101.415830] NVRM: Xid (PCI:0000:05:00): 13, Graphics Exception: ChID 0028, Class 00008597, Offset 00001b0c, Data 0000f000
Jan  7 02:32:50 localhost ksmserver[30719]: The X11 connection broke (error 1). Did the X11 server die?

I downgraded with :

urpmi --downgrade x11-driver-video-nvidia340-340.108-1.mga7.nonfree dkms-nvidia340-340.108-1.mga7.nonfree nvidia340-doc-html-340.108-1.mga7.nonfree

and what is surprising is that it seems to have installed the same versions :

rpm -qa |grep nvidia340
dkms-nvidia340-340.108-1.mga7.nonfree
x11-driver-video-nvidia340-340.108-1.mga7.nonfree
nvidia340-doc-html-340.108-1.mga7.nonfree

I got the same oops after reboot.

Well, I'll see the difference in the coming hours.
Comment 7 Christian C 2020-01-07 10:09:53 CET
I checked the files installed by urpmi --downgrade and they are the same as before.
And in the repository, there is no 340.107 package except the devel rpm: nvidia340-devel-340.107-9.mga7.nonfree.x86_64

What to do ?
Comment 8 Thomas Backlund 2020-01-07 10:13:03 CET
You need to specify the version to downgrade to, as I wrote in comment 2:

urpmi --downgrade dkms-nvidia340-340.107-12.mga7.nonfree x11-driver-video-nvidia340-340.107-12.mga7.nonfree
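
One way to confirm that a downgrade really changed the installed versions is to list any nvidia340 package that is not at the intended version-release. This is a sketch under assumptions: the helper name not_at_version is hypothetical, and the input is expected in the form printed by rpm -qa in this report.

```shell
#!/bin/sh
# Hypothetical helper: read `rpm -qa` output on stdin and print every
# nvidia340 package whose name does not contain "-<target>", where
# <target> is the version-release given as the first argument.
not_at_version() {
    target="$1"   # e.g. 340.107-12.mga7.nonfree
    # -F treats the target as a fixed string; -e is needed because the
    # pattern starts with a dash. "|| true" keeps an empty result from
    # being reported as a failure.
    grep 'nvidia340' | grep -v -F -e "-$target" || true
}

# Example usage (empty output means every nvidia340 package matches):
# rpm -qa | not_at_version 340.107-12.mga7.nonfree
```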
Comment 9 Christian C 2020-01-07 10:18:28 CET
Sorry, I hadn't seen the version in the command line.

But as I said in my last comment, there is no dkms-nvidia340-340.107-* in my repositories!
Comment 10 Christian C 2020-01-07 10:28:55 CET
Well, you were right.
I ran :
urpmi --downgrade x11-driver-video-nvidia340-340.107-12.mga7.nonfree dkms-nvidia340-340.107-12.mga7.nonfree nvidia340-doc-html-340.107-12.mga7.nonfree
 and it completed !
Comment 11 Christian C 2020-01-09 11:26:59 CET
started at Jan  7 10:53:25

crashed at :
Jan  9 00:19:10 rottennvidiadriver kscreenlocker_greet[7591]: The X11 connection broke: I/O error (code 1)
Jan  9 00:19:10 rottennvidiadriver ksmserver[31822]: The X11 connection broke (error 1). Did the X11 server die?

my conf :
rpm -qa|grep nvidia
dkms-nvidia340-340.107-12.mga7.nonfree
nvidia340-doc-html-340.107-12.mga7.nonfree
x11-driver-video-nvidia340-340.107-12.mga7.nonfree

any advice ?
Comment 12 Aurelien Oudelet 2021-07-06 13:14:55 CEST
Mageia 7 has been EOL since July 1st 2021.
There will not be any further bugfixes for this release.

You are encouraged to upgrade to Mageia 8 as soon as possible.

@reporter, if this bug still applies to Mageia 8, please let us know.

@packager, if you work on the Mageia 7 version of your package, please check whether the issue is also present in the Mageia 8 package. In this case, please fix the Mageia 8 version instead.

This bug report will be closed as OLD if there is no further notice by September 1st 2021.
Comment 13 Marja Van Waes 2021-09-07 14:10:18 CEST
Hi bug reporter and hi assignee and others involved,

Please reopen this bug report if it is still valid for Mageia 8 or 9 (cauldron), and change "Version:" in the upper left of this report accordingly.

This report is being closed as OLD because it was filed against Mageia 7, for which support ended on June 30th 2021.

Thanks,
Marja

Status: NEW => RESOLVED
Resolution: (none) => OLD

