Bug 29564

Summary: unable to remove old kernels
Product: Mageia Reporter: Pierre Fortin <pf>
Component: RPM Packages Assignee: Mageia Bug Squad <bugsquad>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: Normal CC: davidwhodgins, joselp, lewyssmith
Version: 8   
Target Milestone: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Source RPM: CVE:
Status comment:
Attachments: output of script from attachment 12921
Results after 18 hours

Description Pierre Fortin 2021-10-19 00:01:32 CEST
Description of problem: This is a follow-on from bug 29422...
Using the script in attachment 12921, urpme wants to re-install files with:
<many>.ko.xz:
 - Uninstallation
   - Deleting from: /lib/modules/<kernel>.mga8/dkms-binary/extra/
 - Original module
   - No original module was found for this module on this kernel.
   - Use the dkms install command to reinstall any previous module version.

Version-Release number of selected component (if applicable):


How reproducible:  always


Steps to Reproduce:
1. Execute the script from attachment 12921
Comment 1 Pierre Fortin 2021-10-19 00:02:59 CEST
Created attachment 12956
output of script from attachment 12921
Comment 2 Dave Hodgins 2021-10-19 01:55:55 CEST
Something caused the rpm db to become corrupt, resulting in rpm -qa
showing the packages as installed even though urpme doesn't show them as
installed.

Possible causes I can think of include file system becoming full during
install/uninstall or system shutting down during an install or uninstall. 

I'd try uninstalling each of the three packages using "rpm -e $package"
directly.
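A minimal Python sketch of that one-package-at-a-time approach, assuming the package names are available one per line (for example in a text file). The function names are hypothetical and this is not the script from attachment 12921; running one `rpm -e` per package simply keeps each failure separate and easy to spot.

```python
import subprocess
from typing import Callable, Iterable, List


def _run(cmd: List[str]) -> int:
    """Run a command with its output going to the terminal; return its exit status."""
    return subprocess.run(cmd, check=False).returncode


def remove_one_by_one(packages: Iterable[str],
                      run: Callable[[List[str]], int] = _run,
                      extra_args: Iterable[str] = ()) -> List[str]:
    """Call `rpm -e` once per package name and return the names that failed."""
    failed = []
    for line in packages:
        name = line.strip()
        if not name:
            continue  # ignore blank lines in the list
        # One rpm invocation per package, so each failure is reported on its own.
        if run(["rpm", "-e", *extra_args, name]) != 0:
            failed.append(name)
    return failed
```

As root, something like `remove_one_by_one(open("old-packages.txt"))` (hypothetical file name) returns the packages that still need attention; `extra_args` allows passing further rpm options if needed.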

As to the uninstall wanting to reinstall files, that's just a misunderstanding
of the messages from dkms. It's simply saying that the module has been
uninstalled, that there is no original module of the same name for that kernel
for dkms to reinstall automatically, and that the user should manually install
a copy of the module with that name for that kernel if they want to have it.
It's not trying to reinstall the module, just informing the user they should
do so if they want the module.

CC: (none) => davidwhodgins

Comment 3 Jose Manuel López 2021-10-19 16:23:36 CEST
I have always uninstalled old kernels with "urpme --auto-orphans" and haven't had problems here.

I think that we could create a GUI tool for "urpme --auto-orphans", where the user can select which packages they want to uninstall, and we could add a highlighted option for uninstalling the kernels, with a warning or notes about uninstalling the marked packages.

CC: (none) => joselpddj

Comment 4 Pierre Fortin 2021-10-20 03:38:47 CEST
(In reply to Dave Hodgins from comment #2)
> Something caused the rpm db to become corrupt, resulting in rpm -qa
> somehow showing the packages as installed though urpme doesn't show it as
> installed.

Shouldn't upgrades have cleaned up the old stuff?  Note that there are mga6 and mga7 packages listed...

> Possible causes I can think of include file system becoming full during
> install/uninstall or system shutting down during an install or uninstall. 

Neither has ever happened. This system has two 1TB SSDs (Samsung SSD 860) and a 250GB mSATA card (Samsung SSD 850). The root partition is 48GB.

> I'd try uninstalling each of the three packages using "rpm -e $package"
> directly.

Just did that on each package (308) in the previous list. After running for 18 hours on my Intel i7-4710MQ CPU @ 2.50GHz, with 8 threads each averaging 25% to 60% (while I watched for a few minutes), I'm left with a smaller, but still significant, list (65)[1]. I'm absolutely blown away by the amount of computing power needed to remove old kernels... Surely there must be a design flaw somewhere...?

[1] Actually 61. Four packages are now listed twice...
Comment 5 Pierre Fortin 2021-10-20 03:44:53 CEST
Created attachment 12957
Results after 18 hours

Before I run this file to try to delete those that weren't removed on the first try, does anyone have any idea why there are now duplicate entries?
Are there any files I can provide that might help figure this out?
Comment 6 Dave Hodgins 2021-10-20 03:49:59 CEST
Try deleting the files /var/lib/rpm/__db.00* and then running rpm --rebuilddb
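A small sketch of the first step, assuming the database lives in /var/lib/rpm as on this system. It only lists the `__db.00*` files unless `dry_run=False` is passed, and `rpm --rebuilddb` still has to be run afterwards as root. The function name is hypothetical.

```python
import glob
import os
from typing import List


def remove_db_env_files(rpm_dir: str = "/var/lib/rpm", dry_run: bool = True) -> List[str]:
    """Find the __db.00* files in rpm_dir; delete them only when dry_run is False."""
    matches = sorted(glob.glob(os.path.join(rpm_dir, "__db.00*")))
    if not dry_run:
        for path in matches:
            os.remove(path)  # removes only the matched __db.00* files
    return matches
```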
Comment 7 Pierre Fortin 2021-10-20 03:59:57 CEST
(In reply to Jose Manuel López from comment #3)
> I think that we could create a GUI tool for "urpme --auto-orphans", where the
> user can select which packages they want to uninstall, and we could add a
> highlighted option for uninstalling the kernels, with a warning or
> notes about uninstalling the marked packages.

Given previous issues I've had with kernels (e.g., bugs 21979, 21753), may I suggest that in addition to preventing removal of the running kernel (Doh :) we also prevent the removal of at least one alternate kernel.
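A sketch of that rule, under stated assumptions: the installed kernel versions are given oldest first, in a form assumed to match `uname -r` (e.g. 5.10.70-desktop-1.mga8, like the /boot file names in comment 17), and the running one can be read with `os.uname().release`. The function and its `keep_alternates` parameter are hypothetical, not part of urpme.

```python
from typing import List


def removable_kernels(installed: List[str], running: str, keep_alternates: int = 1) -> List[str]:
    """Return the kernels that may be removed.

    `installed` is assumed to be ordered oldest -> newest. The running kernel is
    never offered, and the newest `keep_alternates` other kernels are kept as
    fallbacks in case the running one stops booting.
    """
    others = [k for k in installed if k != running]
    # Guard against keep_alternates == 0: others[-0:] would keep everything.
    keep = set(others[-keep_alternates:]) if keep_alternates > 0 else set()
    return [k for k in others if k not in keep]
```

If an older kernel happens to be the one booted, the newest kernel is kept as the alternate and only the one in between is offered.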
Comment 8 Pierre Fortin 2021-10-20 04:02:45 CEST
(In reply to Dave Hodgins from comment #6)
> Try deleting the files /var/lib/rpm/__db.00* and then running rpm --rebuilddb

For the record before I do.....
-rw-r--r-- 1 root root         0 Oct 12 12:45 __db.000
-rw-r--r-- 1 root root    352256 Oct 19 21:59 __db.001
-rw-r--r-- 1 root root    174264 Oct 19 21:59 __db.002
-rw-r--r-- 1 root root   1318912 Oct 19 21:59 __db.003
Comment 9 Pierre Fortin 2021-10-20 04:10:16 CEST
Sigh...  story of my life... nothing is ever easy...
# rpm --rebuilddb
error: could not delete old database at /var/lib/rpmold.3272123
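Removing a directory only succeeds once it is empty, so one way to look into this error is to list whatever is still inside any /var/lib/rpmold.* directory. A minimal sketch; the function name is hypothetical.

```python
import glob
import os
from typing import Dict, List


def leftover_rpmold(parent: str = "/var/lib") -> Dict[str, List[str]]:
    """Map each rpmold.* directory under `parent` to the entries still inside it."""
    result = {}
    for path in sorted(glob.glob(os.path.join(parent, "rpmold.*"))):
        # Skip symlinks and plain files; only real directories are of interest.
        if os.path.isdir(path) and not os.path.islink(path):
            result[path] = sorted(os.listdir(path))
    return result
```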
Comment 10 Pierre Fortin 2021-10-20 04:28:56 CEST
In addition to the above error (comment 9): just before running rpm --rebuilddb, /var/lib/rpm (now _empty_) contained:
ll
total 447884
lrwxrwxrwx 1 root root        21 Oct 16  2019 alternatives -> /var/lib/alternatives/
-rw-r--r-- 1 root root  59408384 Oct 19 18:16 Basenames
-rw-r--r-- 1 root root    135168 Oct 18 15:39 Conflictname
-rw-r--r-- 1 root root  49815552 Oct 19 18:16 Dirnames
-rw-r--r-- 1 root root      8192 Sep 12  2017 Enhancename
-rw-r--r-- 1 root root      8192 Oct  2 17:23 Filetriggername
drwxr-xr-x 2 root root      4096 Aug 15 21:41 filetriggers/
-rw-r--r-- 1 root root     90112 Oct 19 18:16 Group
-rw-r--r-- 1 root root   1615755 Oct 12 12:45 installed-through-deps.list
-rw-r--r-- 1 root root   1670369 Oct 18  2019 installed-through-deps.list.old
-rw-r--r-- 1 root root     94208 Oct 19 18:16 Installtid
-rw-r--r-- 1 root root    311296 Oct 19 18:16 Name
-rw-r--r-- 1 root root    327680 Oct 18 15:39 Obsoletename
-rw-r--r-- 1 root root 335683584 Oct 19 18:16 Packages
-rw-r--r-- 1 root root   6987776 Oct 19 18:16 Providename
-rw-r--r-- 1 root root     40960 Oct 19 18:16 Recommendname
-rw-r--r-- 1 root root   1572864 Oct 19 18:16 Requirename
-rw-r--r-- 1 root root    503808 Oct 19 18:16 Sha1header
-rw-r--r-- 1 root root    278528 Oct 19 18:16 Sigmd5
-rw-r--r-- 1 root root      8192 Oct  2 17:22 Suggestname
-rw-r--r-- 1 root root      8192 Mar  1  2021 Supplementname
-rw-r--r-- 1 root root      8192 Oct  9 12:53 Transfiletriggername
-rw-r--r-- 1 root root      8192 Oct  2 17:22 Triggername

Yet:
$ rpm -qa | wc
   4696    4696  142192

Now what?
Comment 11 Dave Hodgins 2021-10-20 05:03:06 CEST
The question is why it is failing to rebuild. On one of my working mga8 systems,
running the rebuild under strace, the only references to rpmold are ...
# grep rpmold strace.txt 
2245  rename("/var/lib/rpm", "/var/lib/rpmold.2245") = 0
2245  access("/var/lib/rpmold.2245/pubkeys", F_OK) = -1 ENOENT (No such file or directory)
2245  openat(AT_FDCWD, "/var/lib/rpmold.2245", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
2245  unlink("/var/lib/rpmold.2245/Basenames") = 0
2245  unlink("/var/lib/rpmold.2245/Conflictname") = 0
2245  unlink("/var/lib/rpmold.2245/Dirnames") = 0
2245  unlink("/var/lib/rpmold.2245/Enhancename") = 0
2245  unlink("/var/lib/rpmold.2245/Filetriggername") = 0
2245  unlink("/var/lib/rpmold.2245/Group") = 0
2245  unlink("/var/lib/rpmold.2245/Installtid") = 0
2245  unlink("/var/lib/rpmold.2245/Name") = 0
2245  unlink("/var/lib/rpmold.2245/Obsoletename") = 0
2245  unlink("/var/lib/rpmold.2245/Packages") = 0
2245  unlink("/var/lib/rpmold.2245/Providename") = 0
2245  unlink("/var/lib/rpmold.2245/Recommendname") = 0
2245  unlink("/var/lib/rpmold.2245/Requirename") = 0
2245  unlink("/var/lib/rpmold.2245/Sha1header") = 0
2245  unlink("/var/lib/rpmold.2245/Sigmd5") = 0
2245  unlink("/var/lib/rpmold.2245/Suggestname") = 0
2245  unlink("/var/lib/rpmold.2245/Supplementname") = 0
2245  unlink("/var/lib/rpmold.2245/Transfiletriggername") = 0
2245  unlink("/var/lib/rpmold.2245/Triggername") = 0
2245  unlink("/var/lib/rpmold.2245/__db.000") = 0
2245  openat(AT_FDCWD, "/var/lib/rpmold.2245", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
2245  rmdir("/var/lib/rpmold.2245")     = 0

In my case, /var/lib and /var/lib/rpm are on the same filesystem as /.

What's the output of df -h?
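The strace output shows rpm calling rename() on /var/lib/rpm, and rename() cannot move a directory to a different filesystem (it fails with EXDEV), so it may be worth confirming where these paths live. A small sketch comparing the st_dev of each path; the function name is hypothetical.

```python
import os


def same_filesystem(*paths: str) -> bool:
    """True if every given path is on the same device (same st_dev)."""
    devices = {os.stat(p).st_dev for p in paths}
    return len(devices) == 1
```

For example, `same_filesystem("/", "/var/lib", "/var/lib/rpm")` should return True on a layout like the one described above.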
Comment 12 Dave Hodgins 2021-10-20 05:04:26 CEST
Also what's the output of "df -h -i" in addition to "df -h".
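The space and inode figures that `df -h` and `df -h -i` print can also be read for a single path with os.statvfs(). A minimal sketch, assuming the filesystem of interest is the one holding /var/lib/rpm; the function name is hypothetical.

```python
import os


def fs_usage(path: str = "/var/lib/rpm") -> dict:
    """Free/total space in bytes and free/total inodes for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,   # blocks available to non-root users
        "total_bytes": st.f_blocks * st.f_frsize,  # filesystem size
        "free_inodes": st.f_ffree,                 # free inodes (f_ffree)
        "total_inodes": st.f_files,                # total inodes
    }
```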
Comment 13 Dave Hodgins 2021-10-20 05:11:54 CEST
And the output of "mount|grep ^/dev".
Comment 14 Pierre Fortin 2021-10-20 13:36:46 CEST
(In reply to Dave Hodgins from comment #12)
> Also what's the output of "df -h -i" in addition to "df -h".

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G  4.0K   16G   1% /dev/shm
tmpfs            16G   11M   16G   1% /run
/dev/sdb1        48G   44G  2.3G  95% /
/dev/sdb2       1.5G  168K  1.5G   1% /boot/EFI
/dev/sda5        32G   23G  7.5G  76% /root.old
/dev/sdc2       867G  702G  121G  86% /home.bak
/dev/sdb4       175G  158G  8.8G  95% /home.test
/dev/sda3       101G   46G   55G  46% /media/windows
/dev/sda7       741G  634G   70G  91% /home
tmpfs            16G   80M   16G   1% /tmp
tmpfs           3.2G  240K  3.2G   1% /run/user/1000
/dev/sda1        39M  114K   39M   1% /run/media/pfortin/E3A2-CAFF
tmpfs           3.2G  116K  3.2G   1% /run/user/1001
$ df -h -i
Filesystem     Inodes IUsed IFree IUse% Mounted on
devtmpfs         4.0M   741  4.0M    1% /dev
tmpfs            4.0M     2  4.0M    1% /dev/shm
tmpfs            4.0M  1.5K  4.0M    1% /run
/dev/sdb1        3.1M  1.1M  2.0M   35% /
/dev/sdb2           0     0     0     - /boot/EFI
/dev/sda5        2.1M  502K  1.6M   25% /root.old
/dev/sdc2         56M  1.5M   54M    3% /home.bak
/dev/sdb4         12M  239K   11M    3% /home.test
/dev/sda3         55M   91K   55M    1% /media/windows
/dev/sda7         48M  2.3M   45M    5% /home
tmpfs            400K   169  400K    1% /tmp
tmpfs            801K   157  801K    1% /run/user/1000
/dev/sda1           0     0     0     - /run/media/pfortin/E3A2-CAFF
tmpfs            801K    83  801K    1% /run/user/1001
Comment 15 Pierre Fortin 2021-10-20 13:38:41 CEST
(In reply to Dave Hodgins from comment #13)
> And the output of "mount|grep ^/dev".

$ mount|grep ^/dev
/dev/sdb1 on / type ext4 (rw,noatime)
/dev/sdb2 on /boot/EFI type vfat (rw,relatime,fmask=0000,dmask=0000,allow_utime=0022,codepage=437,iocharset=utf8,shortname=mixed,utf8,errors=remount-ro)
/dev/sda5 on /root.old type ext4 (rw,noatime)
/dev/sdc2 on /home.bak type ext4 (rw,noatime)
/dev/sdb4 on /home.test type ext4 (rw,noatime)
/dev/sda3 on /media/windows type fuseblk (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other,blksize=4096)
/dev/sda7 on /home type ext4 (rw,noatime)
/dev/sda1 on /run/media/pfortin/E3A2-CAFF type vfat (rw,nosuid,nodev,relatime,uid=1000,gid=1000,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,showexec,utf8,flush,errors=remount-ro,uhelper=udisks2)
Comment 16 Pierre Fortin 2021-10-20 15:11:06 CEST
rpm --rebuilddb silently ended this time. Is that a good sign?
Now re-running rpm -e on the remaining 61 packages.

BTW, the 18-hour removal of some of the old kernels only changed the free space on / (sdb1) from 2.2G to 2.3G.

Progress; down to 4 remaining packages... and removing 57 only took 18 minutes... that's a drastic difference:
  18 hours to remove 247 -- avg 4.4 minutes 
  18 minutes for 57 -- avg 0.3 minutes

Got this from trying to remove the last 61 packages:

$ . /tmp/rmoldk
dkms.conf: Error! No 'DEST_MODULE_LOCATION' directive specified.
dkms.conf: Error! No 'PACKAGE_NAME' directive specified.
dkms.conf: Error! No 'PACKAGE_VERSION' directive specified.

Error! Bad conf file.
Your dkms.conf is not valid.
error: %preun(<PACKAGE>) scriptlet failed, exit status 4
error: <PACKAGE>: erase failed

where PACKAGE is each of:

virtualbox-kernel-4.9.56-desktop-1.mga6-5.1.30-1.mga6
virtualbox-kernel-4.14.56-desktop-1.mga6-5.2.14-6.mga6
virtualbox-kernel-4.14.56-server-1.mga6-5.2.14-6.mga6
xtables-addons-kernel-4.14.56-desktop-1.mga6-2.13-48.mga6

These 4 OLD packages are each listed twice, though I get only one instance of the above error for each.

Question: any chance rpm looks through the entire filesystem rather than staying within the mga8 system? I still have (possibly) bootable mga5 and mga6 systems on other drives (3 drives in this laptop), as hinted at by:
   /dev/sda5 on /root.old type ext4 (rw,noatime)
Comment 17 Lewis Smith 2021-10-20 20:13:05 CEST
Yes, deleting old kernels can be tedious. I start with:
 $ rpm -qa | grep kernel | sort
which gives a clear ordered list:
 kernel-desktop-5.10.60-2.mga8-1-1.mga8
 kernel-desktop-5.10.62-1.mga8-1-1.mga8
 kernel-desktop-5.10.70-1.mga8-1-1.mga8
 kernel-desktop-latest-5.10.70-1.mga8
 ...
then run urpme on the older full package names, several at a time, always leaving at least the two latest. If it takes ages, this may be one of the rare systems where regenerating the Grub configuration each time (as happens for kernel updates) takes ages; this does not affect most people.
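Plain `sort` orders names as text, so a kernel such as 5.10.100 would be listed before 5.10.60 (GNU `sort -V` compares digit runs numerically instead). A sketch of the same procedure with a version-aware ordering, assuming `rpm -qa` output as input and that real kernel packages are named kernel-<flavor>-<digit>..., which skips kernel-desktop-latest and -devel packages. The function names and the `keep=2` default are illustrative.

```python
import re
from typing import Iterable, List, Tuple


def version_key(name: str) -> Tuple:
    """Split a name into digit and non-digit chunks so 5.10.100 sorts after 5.10.60.

    Each chunk is tagged (0 for numbers, 1 for text) so ints and strings are
    never compared directly.
    """
    return tuple((0, int(tok)) if tok.isdigit() else (1, tok)
                 for tok in re.findall(r"\d+|\D+", name))


def kernels_to_remove(packages: Iterable[str], flavor: str = "desktop", keep: int = 2) -> List[str]:
    """Return the kernel-<flavor> packages older than the newest `keep` ones."""
    pattern = re.compile(rf"^kernel-{re.escape(flavor)}-\d")
    names = [p.strip() for p in packages]
    kernels = sorted((n for n in names if pattern.match(n)), key=version_key)
    return kernels[:-keep] if keep > 0 else kernels
```

The result could then be passed to urpme a few names at a time, as described above.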

The end result is visible in /boot, where each kernel now includes, for example:
    config-5.10.60-desktop-2.mga8
    initrd-5.10.60-desktop-2.mga8.img
   symvers-5.10.60-desktop-2.mga8.xz
System.map-5.10.60-desktop-2.mga8
   vmlinuz-5.10.60-desktop-2.mga8
but this list was shorter in the past.

CC: (none) => lewyssmith

Comment 18 Dave Hodgins 2021-10-21 02:40:18 CEST
It would not be looking on other file systems.
 
For each of those packages, I'm not sure what is causing the problem. Rather than
investigate them further, since it's likely one of the changes in dkms since mga6
that is causing it, I'd next try "rpm -e --noscripts $packagename" to remove
them. Then manually delete each directory shown by
"tree -ifa /var/lib/dkms|grep mga6".
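A Python equivalent of the `tree -ifa /var/lib/dkms|grep mga6` check, as a sketch: it only lists the top-most matching paths (nothing is deleted), so each entry can be reviewed before removing it by hand. The function name is hypothetical.

```python
import os
from typing import List


def matching_paths(root: str = "/var/lib/dkms", needle: str = "mga6") -> List[str]:
    """Return the top-most paths under `root` whose name contains `needle`.

    Once a directory matches, its contents are not listed separately,
    since deleting that directory covers them.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for d in list(dirnames):
            if needle in d:
                found.append(os.path.join(dirpath, d))
                dirnames.remove(d)  # prune: do not descend into a matched directory
        found.extend(os.path.join(dirpath, f) for f in filenames if needle in f)
    return sorted(found)
```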
Comment 19 Pierre Fortin 2021-10-21 03:11:26 CEST
Wow... rpm -e --noscripts finally got rid of those last four old packages and freed 6.5G:
  /dev/sdb1        48G   37G  8.8G  81% /

tree -ifa /var/lib/dkms|grep mga6  gave no output, so all looks fine now.

THANKS!!  Hope this results in a better old kernel removal tool...
Comment 20 Pierre Fortin 2021-10-21 03:12:23 CEST
Thanks!

Resolution: (none) => FIXED
Status: NEW => RESOLVED

Comment 21 Dave Hodgins 2021-10-21 03:39:22 CEST
I have no idea why having those mga6 packages installed caused such problems,
especially with the space. I'm guessing it's due to a combination of changes
in dkms and rpm since mga6.

As the author of the script being used, I don't see any changes needed at
this point. I'm not going to change it to try and handle mga6 packages, but
will keep in mind their impact if others report a similar problem.

I recommend running the script more often and uninstalling packages from
the prior release shortly after each upgrade. I've been doing that on my main
personal-use install (not one normally used for QA testing), which started
as a Mageia 3 install.
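A sketch of that check, assuming `rpm -qa` output as input and that the last `.mgaN` tag in each name is the release the package was built for; names without such a tag (like gpg-pubkey) are skipped. The function name and the `current=8` default are illustrative.

```python
import re
from typing import Iterable, List

RELEASE_TAG = re.compile(r"\.mga(\d+)")


def from_older_release(packages: Iterable[str], current: int = 8) -> List[str]:
    """Return packages whose last .mgaN tag is older than `current`."""
    old = []
    for line in packages:
        name = line.strip()
        tags = RELEASE_TAG.findall(name)
        # Use the last tag: it belongs to the release part of the package name.
        if tags and int(tags[-1]) < current:
            old.append(name)
    return old
```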