Bug 31982 - In some cases the upgrade process started by mgaapplet stalls, due to dkms hanging when building modules for a Mageia 8 kernel
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: Installer
Version: Cauldron
Hardware: x86_64 Linux
Priority: release_blocker critical
Target Milestone: ---
Assignee: Mageia tools maintainers
QA Contact:
URL:
Whiteboard:
Keywords: IN_ERRATA9, IN_RELEASENOTES9
Depends on: 32093 32094
Blocks:
Reported: 2023-05-31 22:47 CEST by Len Lawrence
Modified: 2023-07-19 21:59 CEST
CC List: 6 users

See Also:
Source RPM: make-4.4.1-1.mga9
CVE:
Status comment:


Attachments
Results of log search for mgaapplet upgrade (1.77 KB, application/octet-stream)
2023-06-02 08:19 CEST, Len Lawrence
gurpmi_upgrade_to_9_M708E2ns.log (69.80 KB, application/octet-stream)
2023-06-02 10:07 CEST, Len Lawrence

Description Len Lawrence 2023-05-31 22:47:13 CEST
Description of problem:
This applies, so far, to a Mate session on Intel CPU and nvidia graphics hardware with a high bandwidth wired connection to the internet.  Plasma, Xfce and Cinnamon are also installed.
An upgrade from Mageia 8 to 9 via the mgaapplet looks fine at first but stops without any apparent reason after dealing with about two thirds of the available packages.  In this particular case the process stopped at the installation of dkms-nvidia-current.  There have been other similar reports of this happening where dkms has been involved.  Some of the logs from the most recent session are available but further investigation is needed, perhaps by way of htop monitoring in a background terminal session.  gkrellm shows that the computer is continuing to work - half the cores are running at 10% or more.


Version-Release number of selected component (if applicable): Mageia 9 x86_64


How reproducible: Twice, once with nouveau driver, followed by another with the proprietary nvidia driver.


Steps to Reproduce:
1. Start mgaapplet afresh in a normal Mageia 8 desktop session
   $ killall mgaapplet
   $ mgaapplet --session 
2. Using the applet choose to upgrade to the latest Mageia 9 distribution
3. Watch the packages being downloaded and installed until nothing more seems to be happening.
4. A monitor such as gkrellm will show what is happening with the CPUs and disk drives.
   Some idea of how far the installation has proceeded can be gained by
   $ rpm -qa|grep mga8|wc -l
   $ rpm -qa|grep mga9|wc -l
   Those numbers should not change after the stall point.
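The two counts above can be wrapped into a small helper so that repeated snapshots are easy to compare. This is only a sketch: the function names `count_release` and `snapshot` are made up for illustration, and the package list is read from a saved file so that the same list is counted for both releases.

```shell
#!/bin/sh
# count_release TAG: count the lines on stdin that contain TAG (e.g. mga8).
count_release() {
    # grep -c prints 0 but exits 1 when nothing matches; keep the count, drop the status.
    grep -c -- "$1" || true
}

# snapshot LIST_FILE: print "mga8=<n> mga9=<n>" for a saved package list.
snapshot() {
    printf 'mga8=%s mga9=%s\n' \
        "$(count_release mga8 < "$1")" \
        "$(count_release mga9 < "$1")"
}

# Example use on a live system (not run here):
#   rpm -qa > /tmp/pkgs.txt && snapshot /tmp/pkgs.txt
```

Taking a snapshot every few minutes and seeing identical numbers each time would be consistent with the stall described here.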
Comment 1 Len Lawrence 2023-05-31 22:58:42 CEST
Darn.  Half asleep - should have been filed against mgaapplet not Mageia9.
Comment 2 Len Lawrence 2023-05-31 23:22:24 CEST
The description should be amended to reference mgaonline-3.31-3.mga9.src.rpm.
Sorry about that.
Comment 3 sturmvogel 2023-06-01 07:31:05 CEST
How long did you wait? Because this sounds like you reached the point where the kernel modules and initrd get built. And depending on your hardware this can take some time...
While the kernel modules and initrd are being built, there is no graphical progress display. You can see the progress only when you use the CLI way of upgrading...
Comment 4 Morgan Leijström 2023-06-01 08:49:08 CEST
Per comment 0 this also happened with nouveau; surely no dkms module should be built for that?

--

Another tester reported a stall when it seemed to be doing a dkms build for wifi
https://ml.mageia.org/l/arc/qa-discuss/2023-05/msg00347.html

--

For next testing, tip from Dave H in same thread:

When testing upgrades on real hardware, before starting the upgrade I make sure
htop and strace are installed. Open a second terminal (alt+ctrl+f2 etc.), and
start htop running there.
If things stall, I can just switch terminals to see what's going on, including
running strace from within htop if needed. 


And I can add: maybe also have another terminal open running journalctl -f,
since dkms build progress should be visible there

CC: (none) => fri

Comment 5 Len Lawrence 2023-06-01 10:16:59 CEST
In reply to comment 3.  In both cases the wait was several hours - probably 6 or more.

Shall retry with Morgan's journalctl tip.
Comment 6 Morgan Leijström 2023-06-01 10:28:53 CEST
Speaking of mgaapplet upgrades, do you use Wayland?
I never get a reply when asking if it is still relevant for mga9...
Bug 29182 - mgaapplet-upgrade-helper crashed under Wayland session
Comment 7 Len Lawrence 2023-06-01 11:06:04 CEST
I know nothing about Wayland so I only log in to Wayland sessions for testing purposes.  I use Mate exclusively for everything else, but it looks like we need to cover Wayland in testing upgrades.
Comment 8 Len Lawrence 2023-06-01 18:32:07 CEST
Restarted mgaonline upgrade on a fully updated Mageia8 system with nvidia proprietary driver.  Chose download all at once, storing the RPMs in the default urpmi location.  That took about 90 minutes.  Installation started.  Some four minutes to install 1734 packages and then the dkms build of nvidia-current started.  make seems to be the busiest process, at about 9% CPU usage.
Leaving this to run for as long as it takes.
Comment 9 Len Lawrence 2023-06-01 22:17:08 CEST
More than 3 hours later, make has used 22 minutes of CPU time, preload 3 minutes, top, scheduler and pipewire and a few other processes a minute each.  dkms has used about 3 seconds.  Does this all seem reasonable?
Still going.
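To put those figures in proportion, CPU time can be expressed as a share of elapsed wall-clock time. The helper below is a rough sketch with a made-up name (`cpu_share`); it uses integer arithmetic only and takes both values in seconds.

```shell
#!/bin/sh
# cpu_share CPU_SECONDS ELAPSED_SECONDS: integer percentage of wall time spent on CPU.
cpu_share() {
    if [ "$2" -le 0 ]; then
        echo 0
        return
    fi
    echo $(( $1 * 100 / $2 ))
}

# Figures from this comment: 22 minutes of CPU time for make over 3 hours.
cpu_share $((22 * 60)) $((3 * 3600))   # prints 12
```

Since more than 3 hours had passed, 12% is an upper bound for make. A compile job that is making progress would normally be expected to show a much higher share, so a low value is a hint (not proof) that the process is mostly waiting. On systems where ps is the procps-ng version, `ps -o etimes=,times= -p PID` should report the two inputs directly, though that is an assumption about the installed ps.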
Comment 10 Len Lawrence 2023-06-01 23:04:35 CEST
Checked to see how many kernels were installed - just one.  Only kernel-userspace-headers has been updated to mga9 so far.
Comment 11 Dave Hodgins 2023-06-01 23:13:49 CEST
"time urpmi dkms-nvidia-current" in an m9 x86_64 vb guest shows ...
real    4m32.621s
user    5m33.009s
sys     0m48.870s

also with just one kernel.

Check in /var/lib/dkms/nvidia-current/*/build/ for a make.log file. It's
created during the make, but appears to be deleted once it finishes ok.

If the make.log is there, any obvious errors? What are the last dozen or so
lines?
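The check suggested above could be scripted roughly as follows. The function name `show_make_logs` is hypothetical; the default path is the one given in this comment, and the 12-line tail matches the "last dozen or so lines" asked for.

```shell
#!/bin/sh
# show_make_logs [DKMS_ROOT]: print the last 12 lines of every build/make.log found.
show_make_logs() {
    root="${1:-/var/lib/dkms}"
    find "$root" -path '*/build/make.log' -type f 2>/dev/null |
    while IFS= read -r log; do
        printf '== %s ==\n' "$log"
        tail -n 12 "$log"
    done
}

# Example use as root (not run here):
#   show_make_logs
```

No output means no make.log was found, which, as noted, can simply mean the build finished cleanly or never started.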

CC: (none) => davidwhodgins

Comment 12 Len Lawrence 2023-06-01 23:39:18 CEST
The nvidia 470 branch has a build directory but no make.log.  The 525 branch has an empty build directory.  It looks like these directories have never been visited since the original installation.

The other thing that puzzles me is what is the dkms build aimed at?  There are no mga9 kernels installed at this stage AFAICS.
Comment 13 Len Lawrence 2023-06-02 00:14:23 CEST
Tried brute force, in caveman fashion; used urpmi directly on the cached kernel-desktop-latest rpm but that immediately raised a conflict with the resident dkms-nvidia-current and I don't know how to resolve that safely.  Time to crash out of this.
Comment 14 Dave Hodgins 2023-06-02 00:41:12 CEST
What does the following show for the order?
grep -e dkms -e nvidia -e kernel /root/.MgaOnline/*.log
Comment 15 Len Lawrence 2023-06-02 08:15:12 CEST
$ su -
# cd .MgaOnline
# grep -e dkms -e nvidia -e kernel *.log > mgaonline_log_search
# ll
total 1652
-rw-r--r-- 1 root root 1665612 Jun  1 23:39 gurpmi_upgrade_to_9_M708E2ns.log
-rw-r--r-- 1 root root   10022 Jun  2 06:27 mgaonline_log_search
-rw-r--r-- 1 root root    5905 Jun  1 15:21 urpmi.cfg.backup.68698

Thanks Dave.  Results attached.
Comment 16 Len Lawrence 2023-06-02 08:19:08 CEST
Created attachment 13861 [details]
Results of log search for mgaapplet upgrade
Comment 17 Dave Hodgins 2023-06-02 08:52:29 CEST
Looks like it would be better to compress a copy of
gurpmi_upgrade_to_9_M708E2ns.log
and attach that. The results of the search don't show what is going wrong, but
it does show that there are desktop, server, and linus kernels to build for.

The first dkms build is for the running m8 kernel. I don't think the dkms
build for the other kernels will happen until they are booted.
Comment 18 Dave Hodgins 2023-06-02 08:53:24 CEST
Also, what does df -h show?
Comment 19 Len Lawrence 2023-06-02 10:01:02 CEST
$ df -h
Filesystem               Size  Used Avail Use% Mounted on
devtmpfs                  16G     0   16G   0% /dev
tmpfs                     16G     0   16G   0% /dev/shm
tmpfs                     16G  2.1M   16G   1% /run
/dev/sda3                 53G   16G   36G  30% /
tmpfs                     16G   48K   16G   1% /tmp
/dev/sda1                2.4G  324K  2.4G   1% /boot/EFI
/dev/nvme0n1p2            11G  4.5G  6.0G  43% /localrepo
/dev/nvme0n1p1           905G  501G  405G  56% /home
/dev/sdb2                3.6T  2.5T  984G  72% /data
tmpfs                    3.2G  292K  3.2G   1% /run/user/1000
/dev/sdc1                916G  608G  263G  70% /run/media/lcl/gemma
gomeisa:/home/lcl/topaz  705G  110G  559G  17% /home/lcl/pad
gomeisa:/home/lcl/ruby   705G  110G  559G  17% /home/lcl/quinckler
tmpfs                    2.0G   98M  2.0G   5% /home/lcl/.cache

Shall attach the gurpmi log.
Comment 20 Len Lawrence 2023-06-02 10:07:11 CEST
Created attachment 13862 [details]
gurpmi_upgrade_to_9_M708E2ns.log

As it says on the can.
Comment 21 Brian Rockwell 2023-06-02 17:09:56 CEST
This happened to me as well.  

# less gurpmi_upgrade_to_9_pTWTmNir.log

here is from the last output when it hung up

Preparing kernel 5.15.110-desktop-2.mga8 for module build:
(This is not compiling a kernel, just preparing kernel symbols)
Storing current .config to be restored when complete
Running Generic preparation routine
make mrproper....
using /proc/config.gz
make oldconfig....
make prepare....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

CC: (none) => brtians1

Comment 22 Lewis Smith 2023-06-03 21:17:49 CEST
To put some of Len's comments about 'long times' into perspective, he has very powerful hardware. Things like dkms builds should happen quickly.

There is clearly a problem with upgrades involving dkms (terra incognita for me, so I can say nothing constructive about that). 

The compressed log file attached in comment 20 should tell the cognoscenti something.

Source RPM: (none) => mgaonline-3.31-3.mga9.src.rpm
CC: (none) => lewyssmith

Comment 23 Brian Rockwell 2023-06-03 23:58:10 CEST
The thing I noticed, from my amateur perspective, is that it is linking dkms to the wrong kernel: the old MGA8 kernel rather than the new MGA9 one.
Comment 24 katnatek 2023-06-04 00:24:35 CEST
(In reply to Brian Rockwell from comment #21)
> This happened to me as well.  
> 
> # less gurpmi_upgrade_to_9_pTWTmNir.log
> 
> here is from the last output when it hung up
> 
> Preparing kernel 5.15.110-desktop-2.mga8 for module build:
> (This is not compiling a kernel, just preparing kernel symbols)
> Storing current .config to be restored when complete
> Running Generic preparation routine
> make mrproper....
> using /proc/config.gz
> make oldconfig....
> make
> prepare......................................................................
> .............................................................................
> .............................................................................
> .............................................................................
> .............................................................................
> .............................................................................
> .............................................................................
> .............................................................................
> .............................................................................
> ......................

This reminds me of my bug https://bugs.mageia.org/show_bug.cgi?id=31621. Perhaps, in an upgrade by mgaonline, it would be good if the dkms build could be skipped until the system reboots?

BTW, this issue has never bitten me when upgrading with urpmi.
Comment 25 Dave Hodgins 2023-06-04 01:17:11 CEST
Tried to recreate the issue by downloading dkms-nvidia-current-525.116.04-1.mga9.nonfree.x86_64.rpm
and installing it on an m8 system. It installed cleanly for the running
kernel-desktop-5.15.110-2.mga8
Comment 26 katnatek 2023-06-04 03:03:12 CEST
(In reply to Dave Hodgins from comment #25)
> Tried to recreate the issue by downloading
> dkms-nvidia-current-525.116.04-1.mga9.nonfree.x86_64.rpm
> and installing it on an m8 system. It installed cleanly for the running
> kernel-desktop-5.15.110-2.mga8

That is because it is an mga8 system. I think that at the moment dkms rebuilds during the upgrade, we have a system that is a mix of mga9 and mga8. If you try to run the mga8 kernel on an mga9 system, you will hit the issue when the dkms build is triggered. At least that is what happened to me in my report, although I upgraded with the classic installer, and the cause of the system still running the mga8 kernel after the reboot has since been fixed. The main point is that the system is still running the mga8 kernel and tries to build the module with mga9 "tools".
Comment 27 Lewis Smith 2023-06-04 21:02:02 CEST
Assigning this to the 'tools' people for the upgrade process.

Assignee: bugsquad => mageiatools
CC: lewyssmith => (none)

Comment 28 Morgan Leijström 2023-06-26 16:48:54 CEST
Setting for errata for now.

Keywords: (none) => FOR_ERRATA9

Comment 29 Morgan Leijström 2023-07-09 22:13:38 CEST
This really is severe

https://ml.mageia.org/l/arc/qa-discuss/2023-07/msg00100.html

Priority: Normal => release_blocker

Comment 30 Martin Whitaker 2023-07-09 23:47:50 CEST
I have reproduced the root cause as follows:

1. In VirtualBox, install a minimal system from the Mageia-9-rc1-x86_64 ISO. I selected the Xfce DE plus the Configuration and Console Tools categories, but this should be reproducible on any Mageia 9 system.

2. Install a dkms module package. I chose dkms-broadcom-wl, but this should be reproducible with any other dkms module. This should build and install the dkms module for the 6.3.9 kernel without any problem.

3. Add a full set of Mageia 8 urpmi media by

  urpmi.addmedia --distrib <mirror-url>/distrib/8/x86_64

4. Install the current Mageia 8 kernel plus its development package by

  urpmi kernel-desktop-5.15.117-2.mga8-1-1.mga8
  urpmi kernel-desktop-devel-5.15.117-2.mga8-1-1.mga8

5. Attempt to build the dkms module for that kernel by e.g.

  dkms build -m broadcom-wl -v 6.30.223.271-66.mga9.nonfree -k 5.15.117-desktop-2.mga8

This will hang at the 'make prepare' step.
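When repeating step 5, it may be convenient to bound the build with a time limit instead of waiting on a hang. The sketch below relies on the coreutils `timeout` command, which exits with status 124 when the limit is reached; the wrapper name `run_with_deadline` and the 900-second limit are arbitrary choices for illustration.

```shell
#!/bin/sh
# run_with_deadline SECONDS COMMAND...: run COMMAND, report if it had to be stopped.
run_with_deadline() {
    limit="$1"
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*"
    fi
    return "$status"
}

# Example use (not run here), with the module and kernel from step 5:
#   run_with_deadline 900 dkms build -m broadcom-wl \
#       -v 6.30.223.271-66.mga9.nonfree -k 5.15.117-desktop-2.mga8
```

Comment 11 measured about four and a half minutes for a complete dkms-nvidia-current install in a VirtualBox guest, so a limit of several times that should separate a slow build from a hung one on most hardware.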

Adding tmb to CC.

CC: (none) => mageia, tmb
Summary: In some cases the upgrade process started by mgaapplet stalls, for no obvious reason. => In some cases the upgrade process started by mgaapplet stalls, due to dkms hanging when building modules for a Mageia 8 kernel

Comment 31 Martin Whitaker 2023-07-10 09:49:27 CEST
Downgrading 'make' to make-4.3-2.mga8 allows the dkms build to complete.

Adding make to /etc/urpmi/skip.list before performing the online upgrade allows the upgrade to complete without error.
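One way to apply the skip.list workaround without creating duplicate entries is sketched below. The helper name `add_skip_entry` is made up; the entry `make` and the file path are the ones named in this comment. The exact matching rules for skip.list entries are described in the urpmi documentation and are not checked here.

```shell
#!/bin/sh
# add_skip_entry ENTRY [FILE]: append ENTRY to the skip list unless already present.
add_skip_entry() {
    entry="$1"
    file="${2:-/etc/urpmi/skip.list}"
    touch "$file"
    # -x matches whole lines, -F treats ENTRY as a fixed string.
    grep -qxF -- "$entry" "$file" || printf '%s\n' "$entry" >> "$file"
}

# Example use as root before starting the online upgrade (not run here):
#   add_skip_entry make
```

The entry would presumably need to be removed again after the upgrade so that make itself can be updated.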

Source RPM: mgaonline-3.31-3.mga9.src.rpm => make-4.4.1-1.mga9

Comment 32 Morgan Leijström 2023-07-10 12:27:12 CEST
For now I made a helpful temporary entry *to be reverted for release* (assuming it gets fixed), pointing to comment 31

https://wiki.mageia.org/mw-en/index.php?title=Mageia_9_Release_Notes&action=historysubmit&type=revision&diff=58987&oldid=58981

Keywords: (none) => IN_RELEASENOTES9

Comment 33 Pascal Terjan 2023-07-10 12:33:33 CEST
I remember Thomas had fixed the kernel Makefile for make 4.4 last year, but that will not help for older kernels and we need a workaround :(

CC: (none) => pterjan

Comment 34 Thomas Backlund 2023-07-10 13:06:05 CEST
Heh, good memory :)

So technically I could make next mga8 kernel update "make 4.4" compliant as we require fully updated mga8 before running distro upgrades and that would take care of this...
Comment 35 Morgan Leijström 2023-07-10 13:08:24 CEST
thumbs up :)
Comment 36 Pascal Terjan 2023-07-10 14:19:02 CEST
I am not sure if this is enough, as I believe new versions of dkms packages will be rebuilt for older, still-installed kernels
Comment 37 Morgan Leijström 2023-07-10 14:24:54 CEST
IIRC dkms modules are only built at boot time if you boot the older kernel, or if you are running the older kernel when installing something for dkms.

We can add to the upgrade instructions that users need to be running the latest kernel from updates or backports when starting an online upgrade.
Comment 38 Thomas Backlund 2023-07-10 14:31:41 CEST
(In reply to Pascal Terjan from comment #36)
> I am not sure if this is enough as new versions of dkms packages will be
> rebuilt for older still installed kernel I believe

hm, it should only build for running kernel, not for older ones...
Comment 39 Thomas Andrews 2023-07-10 14:40:57 CEST
A little out of my element, but as I recall the new versions for installed-but-not-running kernels are not built until those kernels are booted. 

Even so, having a fully-updated kernel in place before an upgrade attempt doesn't mean that the user is actually USING that updated kernel. 

We can warn users that they MUST be using the latest kernel, but how many actually read those warnings? In my observations, the more experience people have, the less likely they are to read documentation.
Comment 40 Morgan Leijström 2023-07-10 16:21:41 CEST
Can mgaapplet be improved to check both that the system is fully updated *and* that it is running the latest installed kernel, before proposing an upgrade?
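The "running the latest installed kernel" part of such a check might look something like the sketch below. This is not how mgaapplet works; it is only an illustration in shell with made-up function names. It assumes that the directory names under /lib/modules match the strings printed by `uname -r`, and that only one kernel flavour is installed, since a plain version sort does not distinguish desktop, server and linus kernels.

```shell
#!/bin/sh
# newest_version: read kernel release strings on stdin, print the highest by version sort.
newest_version() {
    sort -V | tail -n 1
}

# is_running_newest RUNNING: succeed if RUNNING equals the highest release on stdin.
is_running_newest() {
    [ "$1" = "$(newest_version)" ]
}

# Example use (not run here):
#   ls /lib/modules | is_running_newest "$(uname -r)" \
#       || echo "reboot into the newest installed kernel before upgrading"
```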

Users advanced enough to choose urpmi probably know where to search for information and help if they fail to read release notes.
Comment 41 Dave Hodgins 2023-07-10 17:49:07 CEST Comment hidden (obsolete)
Comment 42 Thomas Backlund 2023-07-10 21:59:07 CEST
A fix for this is now in kernel and kernel-linus 5.15.120-2.mga8, currently building
Comment 43 katnatek 2023-07-11 01:25:56 CEST
The ideas of comment#38 and comment#40 don't work, because if you have a fully updated Mageia 8 system (which is recommended for the upgrade process) you usually have the latest kernel for Mageia 8, so the upgrade will be triggered, and if the user has any dkms module the issue will occur.

Possible solutions

1: Fix the kernel make files to be compatible with make 4.4, as suggested in comment#34 

or 

2: In an upgrade by mgaonline, block the dkms build until the reboot
Thomas Backlund 2023-07-11 07:13:28 CEST

Depends on: (none) => 32093

Thomas Backlund 2023-07-11 07:13:38 CEST

Depends on: (none) => 32094

Comment 44 Morgan Leijström 2023-07-11 09:25:41 CEST
What about the mga8 backport kernels?
Comment 45 Thomas Backlund 2023-07-11 10:54:32 CEST
(In reply to Morgan Leijström from comment #44)
> What about the mga8 backport kernels?

already fixed since 6.0.8-3
Comment 46 Len Lawrence 2023-07-11 14:44:59 CEST
Mageia8 -> 9, x86_64, Mate

Intel Core i9 and GTX1080 with nvidia driver.

With the .120 kernel in place the mgaapplet upgrade worked without a hitch.
Download all at once, installation finished within two hours and rebooted smoothly to the desktop with nvidia and virtualbox drivers rebuilt.
All Desktop functions seem to be working.
Comment 47 Len Lawrence 2023-07-11 14:54:43 CEST
Amendment to comment 46 - starting with kernel desktop 5.15.120-desktop-2.mga8.
Rebooted to 6.3.9-1.mga9.
Comment 48 Morgan Leijström 2023-07-17 11:50:01 CEST
Updated rel notes under
https://wiki.mageia.org/en/Mageia_9_Release_Notes#Online-Upgrade

Added note in Errata under
https://wiki.mageia.org/en/Mageia_9_Errata#If_upgrade_failed

Keywords: FOR_ERRATA9 => IN_ERRATA9

Comment 49 Thomas Backlund 2023-07-19 21:59:40 CEST
An update for this issue has been pushed to the Mageia Updates repository.

https://advisories.mageia.org/MGASA-2023-0237.html
https://advisories.mageia.org/MGASA-2023-0238.html

Resolution: (none) => FIXED
Status: NEW => RESOLVED

