Bug 28801

Summary: MGA Applet still brings up the 'can't open the rpm database' error at each boot of the system when trying to check for updates...
Product: Mageia Reporter: Jean-François Bachelet <jfbachelet>
Component: RPM Packages    Assignee: Mageia tools maintainers <mageiatools>
Status: REOPENED --- QA Contact:
Severity: major    
Priority: Normal CC: davidwhodgins, fri, jfbachelet, neoser10, ouaurelien, rolfpedersen
Version: 8   
Target Milestone: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Source RPM: CVE:
Status comment:
Attachments: ps results after 'crash'
dmesg and journal
proof
journal -b -2
new boot of today and crash
journal good :)

Description Jean-François Bachelet 2021-04-18 16:41:16 CEST
Description of problem:
Each time I boot my new install of MGA8, when mgaapplet starts checking for updates it brings up the infamous 'can't open the rpm database' error.

It has been like that since I started using Mageia (at version 4); I get this bug with each new install.

BTW, until now I was able to cure this by running the 'rpm -vv --rebuilddb' command once or twice, but since MGA8 the bug resurfaces at each reboot of the system.

Version-Release number of selected component (if applicable):
mgaapplet 3.31-1.mga8, and all other versions since MGA4

How reproducible:
Just boot and wait for the automatic update check; if you have the same 'luck' as me you will see the bug come up.

Steps to Reproduce:
1. here, simply boot the system
2. wait for the update check
3. get the bug
4. issue 'rpm -vv --rebuilddb'
5. force a new update check using mgaapplet
6. this time it works
7. reboot and go to 2...
Jean-François Bachelet 2021-04-18 16:41:23 CEST

CC: (none) => jfbachelet

Comment 1 Aurelien Oudelet 2021-04-18 17:24:35 CEST
Hi, thanks for reporting this.
Sadly, I can't really reproduce this, even though I have both urpmi and dnf installed on my system.

If mgaapplet complains that the RPM database is locked, it is because another process involving RPM is running. It could be PackageKit with the DNF backend, which also computes updates at the same time. It could also be another mgaapplet instance running for another user.

To really try fixing this:

1- Quit mgaapplet by right-clicking on it and choosing Quit.
2- Remove the .mgaapplet folder in your user home directory.

3- Do you also use DNF? If you don't use it,
disable dnf-makecache.timer with the following commands as root:
systemctl stop dnf-makecache.timer
systemctl mask dnf-makecache.timer

You can also remove DNF from your system if you only use Urpmi.

4- If you are sure no other process is using RPM, rerun mgaapplet in a terminal as your user:
mgaapplet &

After 5 minutes, see whether it correctly checks for updates.
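Collected into one sketch (step 1, quitting mgaapplet from its tray icon, stays in the GUI; the systemctl lines need root and apply only if dnf is not otherwise used):

```shell
# Steps 2-4 from comment 1, as a sketch; only step 2 is executed here.
rm -rf "$HOME/.mgaapplet"        # step 2: drop the applet's state folder
# Step 3, as root, only if dnf is not used:
#   systemctl stop dnf-makecache.timer
#   systemctl mask dnf-makecache.timer
# Step 4, as the desktop user, then watch for ~5 minutes:
#   mgaapplet &
```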

CC: (none) => ouaurelien

Comment 2 Dave Hodgins 2021-04-18 17:58:30 CEST
It's been long enough since I've seen this problem go unhandled
automatically that I had to dig through my old messages to find it. The last
time I replied to someone with a corrupted rpm db was in 2014.

The fix is to run
rm -f /var/lib/rpm/__db.00*
rpm --rebuilddb

The typical causes of database corruption are things like a hard poweroff,
or I/O errors (bad sectors, intermittent connections due to faulty cables or
connectors, etc.).

In most cases rpm can now handle it without having to manually run the above
commands. If it recurs after running those commands, the cause of the db
corruption needs to be identified and fixed.
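Rehearsed on a scratch directory, the two-step fix looks like this (the scratch path and fake file names are illustrative; the real run, as root, targets /var/lib/rpm):

```shell
# Rehearsal of the two-step fix. The real commands, run as root, are:
#   rm -f /var/lib/rpm/__db.00*
#   rpm --rebuilddb
db=/tmp/rpmdb-demo
mkdir -p "$db"
touch "$db/__db.001" "$db/__db.002" "$db/Packages"  # fake stale env files + db
rm -f "$db"/__db.00*   # step 1: remove the stale Berkeley DB environment files
ls "$db"               # only Packages remains; `rpm --rebuilddb` would then
                       # rebuild the index files from it
```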

CC: (none) => davidwhodgins

Comment 3 Jean-François Bachelet 2021-04-18 20:29:47 CEST
(In reply to Aurelien Oudelet from comment #1)

> To really try fixing this:
> 
> 1- Quit mgaapplet by right-clicking on it and choosing Quit.
> 2- Remove the .mgaapplet folder in your user home directory.
> 
> 3- Do you also use DNF? If you don't use it,
> disable dnf-makecache.timer with the following commands as root:
> systemctl stop dnf-makecache.timer
> systemctl mask dnf-makecache.timer
> 
> You can also remove DNF from your system if you only use Urpmi.
> 
> 4- If you are sure no other process is using RPM, rerun mgaapplet in
> a terminal as your user:
> mgaapplet &
> 
> After 5 minutes, see whether it correctly checks for updates.

Done all that and it seems to have cured the problem. ^^)

BTW, I've never used anything other than rpm, so it would be nice to ask the user at install time which update software they want to use; that would prevent this kind of problem in the future.

Thanks Aurélien ^^)
Comment 4 Jean-François Bachelet 2021-04-18 20:35:14 CEST
(In reply to Dave Hodgins from comment #2)

> The fix is to run
> rm -f /var/lib/rpm/__db.00*
> rpm --rebuilddb
> 
Hum, that's what I do with 'rpm -vv --rebuilddb'...
advantage is I don't have to erase anything before ^^)

> The typical causes of any database corruption are things like hard poweroff,
> or i/o errors (bad sectors, intermittent connections due to faulty cable or
> connectors, etc).
Well, I've had this problem with every install I've done on numerous machines since Mageia 4, old and brand new...


> In most cases rpm can now handle it without having to manually run the above
> commands. If it recurs after running those commands, the cause of the db
> corruption needs to be identified and fixed.

Yeah, I've seen the new 'auto rpm db rebuild' switch in the installer software; I enabled it but got the problem as if it weren't there...

Let's see if the solution from Aurélien works well in the long run.

However, this can be closed as resolved for now...

Status: NEW => RESOLVED
Resolution: (none) => FIXED

Comment 5 Aurelien Oudelet 2021-04-18 20:46:51 CEST
(In reply to Jean-François Bachelet from comment #3)
> 
> Thanks Aurélien ^^)

If this works, the problem was a race with dnf-makecache.timer, which was triggered on this system at the same time mgaapplet was looking for updates...

Normally,
- mgaapplet looks for updates 5 minutes after user logs on AND there is a network connection.

- dnf-makecache.timer is
OnBootSec=10min
OnUnitInactiveSec=1h
RandomizedDelaySec=60m
Unit=dnf-makecache.service

dnf-makecache should be triggered 10 minutes after boot, then every hour, not every 5 minutes. Or perhaps you set mgaapplet to look for updates 10 minutes after boot in MCC...

Note that if you have removed dnf, the Plasma Discover and GNOME Software applications will not be able to install new rpms from the Mageia repos. You should then use Rpmdrake.
Comment 6 Rolf Pedersen 2021-04-18 21:52:47 CEST
(In reply to Dave Hodgins from comment #2)
> It's been long enough since I've seen the problem and it wasn't handled
> automatically I had to dig through my old messages to find it. The last
> time I replied to a person with a corrupted rpm db was in 2014.
> 
> The fix is to run
> rm -f /var/lib/rpm/__db.00*
> rpm --rebuilddb
> 
It's been a long time for me, also, and I had to search the web.  It must be more than 20 years since I first became aware of this sequence of commands, in Mandrake, when the need was more common.
(In reply to Jean-François Bachelet from comment #4)
> (In reply to Dave Hodgins from comment #2)
> 
> > The fix is to run
> > rm -f /var/lib/rpm/__db.00*
> > rpm --rebuilddb
> > 
> Hum, that's what I do with 'rpm -vv --rebuilddb'...
> advantage is I don't have to erase anything before ^^)
> 
I don't see where you use
> rm -f /var/lib/rpm/__db.00*
or, do you say 'rpm -vv --rebuilddb' includes it or makes it unnecessary?
I tried
# rpm -vv --rebuilddb 2>&1 | tee SomeFile.txt
and looked through the text file for some evidence of this function, but the file is long and I am not that much of an expert.

Not knowing any better, I'd suggest that not issuing the commands as Dave wrote them would prevent the database rebuild from fixing the problem.

CC: (none) => rolfpedersen

Comment 7 Dave Hodgins 2021-04-18 22:44:02 CEST
My understanding is that without first deleting /var/lib/rpm/__db.00*
the rebuild, even with the verbose option specified twice, will do nothing
to fix a corrupt database.
Comment 8 Jean-François Bachelet 2021-04-19 06:29:57 CEST
arg. that's not enough, sadly.

I got the bug again; after two successful checks, the third brought the bug back...

I completely killed dnf, so what can trigger that again ???

Resolution: FIXED => (none)
Status: RESOLVED => REOPENED

Comment 9 Aurelien Oudelet 2021-04-19 11:35:07 CEST
(In reply to Jean-François Bachelet from comment #8)
> arg. that's not enough, sadly.
> 
> I got the bug again; after two successful checks, the third brought the bug back...
> 
> I completely killed dnf, so what can trigger that again ???

Perhaps bad hard-drive sectors... or a nearly faulty device...?

Also, you can paste here the process list from when mgaapplet complains, obtained with:
$ ps aux >> ~/process_running.txt

Then we can try to see what's wrong on your system.
Comment 10 Dave Hodgins 2021-04-19 17:49:59 CEST
Check the output of dmesg and "journalctl --no-hostname -b" for any obvious
errors. Also "smartctl -a /dev/sda" (run smartctl for each hard drive).
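Dave's "run smartctl for each hard drive" can be looped; a sketch, to be run as root, assuming SATA-style /dev/sdX device nodes (NVMe drives would need /dev/nvme* instead):

```shell
# Run smartctl against every /dev/sdX disk node found.
found=0
for dev in /dev/sd?; do
    [ -b "$dev" ] || continue                      # glob matched nothing real
    found=$((found + 1))
    command -v smartctl >/dev/null && smartctl -a "$dev" || true  # needs root
done
echo "checked $found drive(s)"
loop_done=yes
```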
Comment 11 Jean-François Bachelet 2021-04-19 22:11:40 CEST
(In reply to Dave Hodgins from comment #10)
> Check the output of dmesg and "journalctl --no-hostname -b" for any obvious
> errors. Also "smartctl -a /dev/sda" (run smartctl for each hard drive).

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.30-desktop-1.mga8] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD10EZEX-00MFCA0
Serial Number:    WD-WCC6Y2KV1XRC
LU WWN Device Id: 5 0014ee 2b801e360
Firmware Version: 01.01A01
User Capacity:    1 000 204 886 016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr 19 22:09:47 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (12240) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 127) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   171   168   021    Pre-fail  Always       -       2408
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1340
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15490
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1340
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       6689
194 Temperature_Celsius     0x0022   111   095   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Comment 12 Jean-François Bachelet 2021-04-19 22:16:29 CEST
(In reply to Aurelien Oudelet from comment #9)
> (In reply to Jean-François Bachelet from comment #8)
> > arg. that's not enough, sadly.
> > 
> > I got the bug again after two succesful checks the third brang the bug...
> > 
> > I completely killed dnf, so what can trigger that again ???
> 
> Perhaps bad hard-drive sectors... or a nearly faulty device...?
> 
> Also, you can copy paste here the process list while mgaapplet complains
> with:
> $ ps aux >> ~/process_running.txt
> 
> And we try to see what's wrong on your system.

As I said, it's not only my system; I get this bug on ALL the systems I've installed since Mageia v4... and they were nearly all different.

BTW, where do I find the result of this command? In my /home?
Comment 13 Aurelien Oudelet 2021-04-20 09:45:40 CEST
(In reply to Jean-François Bachelet from comment #12)
> > 
> > Also, you can copy paste here the process list while mgaapplet complains
> > with:
> > $ ps aux > ~/process_running.txt

> BTW, where I find the result of this command ? in my /home ?

With the above command, you will find the process_running.txt file at the root of YOUR user home directory.

Please run the command with as few processes running as possible.
Close all applications like Firefox, KMail, etc. beforehand, so we may eventually find the culprit.
Comment 14 Jean-François Bachelet 2021-04-21 07:29:02 CEST
Created attachment 12659 [details]
ps results after 'crash'

(In reply to Aurelien Oudelet from comment #13)
> (In reply to Jean-François Bachelet from comment #12)
> > > 
> > > Also, you can copy paste here the process list while mgaapplet complains
> > > with:
> > > $ ps aux > ~/process_running.txt

> Please run the command with as few processes running as possible.
> Close all applications like Firefox, KMail, etc. beforehand, so we may
> eventually find the culprit.

Here are the results:
Comment 15 Jean-François Bachelet 2021-04-22 13:09:20 CEST
Hello Aurelien, did you see anything in the text file that could be causing that havoc?
Comment 16 Aurelien Oudelet 2021-04-22 21:37:13 CEST
(In reply to Jean-François Bachelet from comment #15)
> hello Aurelien, did you see something in the textfile that can be causing
> that havoc ?

Sadly, no.
I see mgaapplet, but no other program that could lock the RPM database.

Is the /var directory on a different partition than / (the system root)?
Comment 17 Dave Hodgins 2021-04-22 23:13:12 CEST
Also please post the output of (run as root) "df -h|grep -v tmpfs" and
"df -hi|grep -v tmpfs".
Comment 18 Jean-François Bachelet 2021-04-23 00:26:02 CEST
(In reply to Aurelien Oudelet from comment #16)

> Is /var directory on other partition than / (system root) ?

nope :( only /, /home and /usr/local have their own partitions.
Comment 19 Jean-François Bachelet 2021-04-23 00:29:15 CEST
(In reply to Dave Hodgins from comment #17)
> Also please post the output of (run as root) "df -h|grep -v tmpfs" 

Filesystem       Size   Used  Avail Use% Mounted on
/dev/sda2           32G     15G   16G  50% /
/dev/sdi1          9,1T    7,3T  1,3T  85% /stk03
/dev/sdj1          2,7T    2,6T   39G  99% /sve0
/dev/sdh1          2,7T    2,4T  178G  94% /sve1
/dev/sdg1          2,7T    1,4T  1,2T  54% /stk1
/dev/sde1          2,7T    2,1T  535G  80% /sve2
/dev/sdf1          2,7T     18G  2,6T   1% /sve3
/dev/sda4           63G    175M   59G   1% /usr/local
/dev/sda5          807G    641G  166G  80% /home

and

> "df -hi|grep -v tmpfs".

Filesystem       Inodes IUsed  IFree IUse% Mounted on
/dev/sda2          2,0M   379K   1,7M   19% /
/dev/sdi1          292M   4,6K   292M    1% /stk03
/dev/sdj1          175M   1,3M   174M    1% /sve0
/dev/sdh1          175M    41K   175M    1% /sve1
/dev/sdg1          175M   583K   175M    1% /stk1
/dev/sde1          175M   604K   175M    1% /sve2
/dev/sdf1          175M     32   175M    1% /sve3
/dev/sda4          4,0M   4,4K   4,0M    1% /usr/local
/dev/sda5           52M   287K    51M    1% /home
Comment 20 Jean-François Bachelet 2021-04-25 17:39:00 CEST
One excerpt from the 'journalctl --pager-end --catalog' command:


░░ The unit UNIT has stopped and consumed the indicated resources.
avril 25 17:15:39 localhost.localdomain mgaapplet-update-checker[518285]: updating inactive backport media Core Backports, Nonfree Backports, Tainted Backports, Core 32bit Backports, Nonfree 32bit Backports, Tainted 32bit Backports
avril 25 17:15:39 localhost.localdomain mgaapplet-update-checker[518285]: running: urpmi.update Core Backports
avril 25 17:15:39 localhost.localdomain pkexec[519099]: pam_unix(polkit-1:session): session opened for user root by (uid=1000)
avril 25 17:15:39 localhost.localdomain pkexec[519099]: voxel: Executing command [USER=root] [TTY=unknown] [CWD=/] [COMMAND=/usr/libexec/urpmi.update Core Backports]
avril 25 17:15:40 localhost.localdomain mgaapplet-update-checker[518285]: running: urpmi.update Nonfree Backports
avril 25 17:15:40 localhost.localdomain pkexec[519113]: pam_unix(polkit-1:session): session opened for user root by (uid=1000)
avril 25 17:15:40 localhost.localdomain pkexec[519113]: voxel: Executing command [USER=root] [TTY=unknown] [CWD=/] [COMMAND=/usr/libexec/urpmi.update Nonfree Backports]
avril 25 17:15:40 localhost.localdomain mgaapplet-update-checker[518285]: running: urpmi.update Tainted Backports
avril 25 17:15:40 localhost.localdomain pkexec[519139]: pam_unix(polkit-1:session): session opened for user root by (uid=1000)
avril 25 17:15:40 localhost.localdomain pkexec[519139]: voxel: Executing command [USER=root] [TTY=unknown] [CWD=/] [COMMAND=/usr/libexec/urpmi.update Tainted Backports]
avril 25 17:15:40 localhost.localdomain vnstatd[4287]: Info: Monitoring (1): eno1 (1000 Mbit)
avril 25 17:15:41 localhost.localdomain mgaapplet-update-checker[518285]: running: urpmi.update Core 32bit Backports
avril 25 17:15:41 localhost.localdomain pkexec[519207]: pam_unix(polkit-1:session): session opened for user root by (uid=1000)
avril 25 17:15:41 localhost.localdomain pkexec[519207]: voxel: Executing command [USER=root] [TTY=unknown] [CWD=/] [COMMAND=/usr/libexec/urpmi.update Core 32bit Backports]
avril 25 17:15:41 localhost.localdomain mgaapplet-update-checker[518285]: running: urpmi.update Nonfree 32bit Backports
avril 25 17:15:41 localhost.localdomain pkexec[519220]: pam_unix(polkit-1:session): session opened for user root by (uid=1000)
avril 25 17:15:41 localhost.localdomain pkexec[519220]: voxel: Executing command [USER=root] [TTY=unknown] [CWD=/] [COMMAND=/usr/libexec/urpmi.update Nonfree 32bit Backports]
avril 25 17:15:41 localhost.localdomain mgaapplet-update-checker[518285]: running: urpmi.update Tainted 32bit Backports
avril 25 17:15:42 localhost.localdomain pkexec[519229]: pam_unix(polkit-1:session): session opened for user root by (uid=1000)
avril 25 17:15:42 localhost.localdomain pkexec[519229]: voxel: Executing command [USER=root] [TTY=unknown] [CWD=/] [COMMAND=/usr/libexec/urpmi.update Tainted 32bit Backports]
avril 25 17:15:44 localhost.localdomain mgaapplet[7450]: Failed to open urpmi database

I haven't seen anything else connected with mgaapplet there; BTW this comes up each time it tries to check for updates.
And rebuilding the database only works for the next one or two checks that day; after a reboot the problem resurfaces...

it's really #!?@ !!!
Comment 21 Dave Hodgins 2021-04-25 18:38:13 CEST
How do you normally shut down the system?

What's the output of "urpmq --list-url |grep core/release" and
"journalctl --no-hostname -b -1 |tail -n 20"?
Comment 22 Jean-François Bachelet 2021-04-25 18:45:43 CEST
(In reply to Dave Hodgins from comment #21)
> How do you normally shut down the system?
> 
I always use the GUI shutdown button.

> What's the output of "urpmq --list-url |grep core/release" 

# urpmq --list-url | grep core/release
Core Release http://magie.remi.lu/distrib/8/x86_64/media/core/release


and
> "journalctl --no-hostname -b -1 |tail -n 20"?

# journalctl --no-hostname -b -l | tail -n 20
avril 25 18:39:53 systemd[4637]: Started Firefox - Navigateur Web.
avril 25 18:39:53 plasmashell[7308]: file:///usr/share/plasma/plasmoids/org.kde.plasma.taskmanager/contents/ui/Task.qml:379: Unable to assign [undefined] to QString
avril 25 18:39:54 mgaapplet[7450]: Failed to open urpmi database
avril 25 18:39:56 rtkit-daemon[963]: Supervising 7 threads of 3 processes of 1 users.
avril 25 18:39:56 rtkit-daemon[963]: Supervising 7 threads of 3 processes of 1 users.
avril 25 18:39:56 rtkit-daemon[963]: Supervising 7 threads of 3 processes of 1 users.
avril 25 18:39:56 rtkit-daemon[963]: Supervising 7 threads of 3 processes of 1 users.
avril 25 18:39:56 rtkit-daemon[963]: Supervising 7 threads of 3 processes of 1 users.
avril 25 18:39:56 rtkit-daemon[963]: Supervising 7 threads of 3 processes of 1 users.
avril 25 18:39:56 rtkit-daemon[963]: Successfully made thread 893457 of process 893280 owned by '1000' RT at priority 10.
avril 25 18:39:56 rtkit-daemon[963]: Supervising 8 threads of 4 processes of 1 users.
avril 25 18:39:56 rtkit-daemon[963]: Supervising 8 threads of 4 processes of 1 users.
avril 25 18:39:56 rtkit-daemon[963]: Supervising 8 threads of 4 processes of 1 users.
avril 25 18:39:57 rtkit-daemon[963]: Supervising 8 threads of 4 processes of 1 users.
avril 25 18:39:57 rtkit-daemon[963]: Supervising 8 threads of 4 processes of 1 users.
avril 25 18:40:00 vnstatd[4287]: Info: Monitoring (1): eno1 (1000 Mbit)
avril 25 18:41:19 plasmashell[7308]: file:///usr/share/plasma/plasmoids/org.kde.plasma.kicker/contents/ui/ItemListDialog.qml:51: TypeError: Cannot read property 'separatorCount' of null
avril 25 18:41:19 plasmashell[7308]: trying to show an empty dialog
Comment 23 Dave Hodgins 2021-04-25 23:56:02 CEST
journalctl --no-hostname -b -1 |tail -n 20

Note, number one, not letter l. The purpose is to see if there were any
errors in the prior shutdown. Use -2 etc., if you've rebooted more than
once since the last good urpmi db open.
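To see which boot each offset refers to, journalctl can also list the recorded boots; a sketch (guarded so it degrades gracefully where journalctl or journal access is unavailable):

```shell
# Map boot indices (-0 current, -1 previous, ...) to dates, then show the
# tail of the previous boot, where shutdown-time errors would appear.
if command -v journalctl >/dev/null; then
    journalctl --list-boots | tail -n 5 || true
    journalctl --no-hostname -b -1 | tail -n 20 || true
else
    echo "journalctl not available here"
fi
checked=yes
```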
Comment 24 Jean-François Bachelet 2021-04-26 02:01:47 CEST
(In reply to Dave Hodgins from comment #23)
> journalctl --no-hostname -b -1 |tail -n 20
> 
> Note, number one, not letter l. The purpose is to see if there were any
> errors in the prior shutdown. Use -2 etc., if you've rebooted more than
> once since the last good urpmi db open.

oops!

Well, I've gone back as far as -10, and the only error I see there is about the watchdog, which sometimes does not stop before the full shutdown...

# journalctl --no-hostname -b -9 | tail -n 20

avril 21 14:34:30 systemd[1]: systemd-remount-fs.service: Succeeded.
avril 21 14:34:30 systemd[1]: Stopped Remount Root and Kernel File Systems.
avril 21 14:34:30 systemd[1]: systemd-fsck-root.service: Succeeded.
avril 21 14:34:30 systemd[1]: Stopped File System Check on Root Device.
avril 21 14:34:30 systemd[1]: systemd-tmpfiles-setup-dev.service: Succeeded.
avril 21 14:34:30 systemd[1]: Stopped Create Static Device Nodes in /dev.
avril 21 14:34:30 systemd[1]: lvm2-monitor.service: Succeeded.
avril 21 14:34:30 systemd[1]: Stopped Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
avril 21 14:34:30 systemd[1]: Reached target Shutdown.
avril 21 14:34:30 systemd[1]: Reached target Final Step.
avril 21 14:34:30 systemd[1]: systemd-reboot.service: Succeeded.
avril 21 14:34:30 systemd[1]: Finished Reboot.
avril 21 14:34:30 systemd[1]: Reached target Reboot.
avril 21 14:34:30 systemd[1]: Shutting down.
avril 21 14:34:30 systemd[1]: Hardware watchdog 'iTCO_wdt', version 0
avril 21 14:34:30 systemd[1]: Set hardware watchdog to 10min.
avril 21 14:34:30 kernel: watchdog: watchdog0: watchdog did not stop!
avril 21 14:34:31 systemd-shutdown[1]: Syncing filesystems and block devices.
avril 21 14:34:31 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
avril 21 14:34:31 systemd-journald[595]: Journal stopped




# journalctl --no-hostname -b -10 | tail -n 20

avril 21 11:09:44 systemd[1]: Removed slice system-systemd\x2dfsck.slice.
avril 21 11:09:44 systemd[1]: Stopped target Local File Systems (Pre).
avril 21 11:09:44 systemd[1]: Stopping Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
avril 21 11:09:44 systemd[1]: systemd-remount-fs.service: Succeeded.
avril 21 11:09:44 systemd[1]: Stopped Remount Root and Kernel File Systems.
avril 21 11:09:44 systemd[1]: systemd-fsck-root.service: Succeeded.
avril 21 11:09:44 systemd[1]: Stopped File System Check on Root Device.
avril 21 11:09:44 systemd[1]: systemd-tmpfiles-setup-dev.service: Succeeded.
avril 21 11:09:44 systemd[1]: Stopped Create Static Device Nodes in /dev.
avril 21 11:09:44 systemd[1]: lvm2-monitor.service: Succeeded.
avril 21 11:09:44 systemd[1]: Stopped Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
avril 21 11:09:44 systemd[1]: Reached target Shutdown.
avril 21 11:09:44 systemd[1]: Reached target Final Step.
avril 21 11:09:44 systemd[1]: systemd-poweroff.service: Succeeded.
avril 21 11:09:44 systemd[1]: Finished Power-Off.
avril 21 11:09:44 systemd[1]: Reached target Power-Off.
avril 21 11:09:44 systemd[1]: Shutting down.
avril 21 11:09:44 systemd-shutdown[1]: Syncing filesystems and block devices.
avril 21 11:09:44 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
avril 21 11:09:44 systemd-journald[601]: Journal stopped
Comment 25 Dave Hodgins 2021-04-26 23:47:02 CEST
At this point I have no idea why the local rpm database is becoming corrupted
to the point that it cannot be recovered automatically.

My only suggestion is to ensure it's working now, rebuild the db as per
comment 2, and the next time it fails immediately save the output of the
commands "dmesg", "journalctl --no-hostname -b" and
"journalctl --no-hostname -b -1" for analysis. There has to be some sort of
file conflict or I/O error triggering the database corruption, which should
be shown by messages in those log outputs.
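A sketch for capturing those three outputs the moment the failure shows up (the rpmdb-logs directory and file names are arbitrary; each command is guarded so the script keeps going if one is unavailable or unprivileged):

```shell
# Save dmesg plus the current and previous boot journals for later analysis.
outdir="$HOME/rpmdb-logs"
mkdir -p "$outdir"
{ command -v dmesg      >/dev/null && dmesg > "$outdir/dmesg.txt"; } 2>/dev/null || true
{ command -v journalctl >/dev/null && journalctl --no-hostname -b    > "$outdir/journal-current.txt"; }  2>/dev/null || true
{ command -v journalctl >/dev/null && journalctl --no-hostname -b -1 > "$outdir/journal-previous.txt"; } 2>/dev/null || true
ls "$outdir"
```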
Comment 26 Jean-François Bachelet 2021-04-27 15:02:14 CEST
Created attachment 12679 [details]
dmesg and journal

(In reply to Dave Hodgins from comment #25)
> At this point I have no idea why the local rpm database is becoming corrupted
> to the point it can not be recovered automatically.
> 
> My only suggestion is to ensure it's working now, rebuild the db as per
> comment 2 and the next time it fails immediately save the output of the
> commands
> "dmesg", 

in the attachment

> "journalctl --no-hostname -b" and

is way tooooo big


> "journalctl --no-hostname -b -1"

in the attachment


> for
> analysis. There has to be some sort of file conflict or io error triggering
> the database corruption, which should be shown by messages in those log
> outputs.
Comment 27 Jean-François Bachelet 2021-04-27 15:05:52 CEST
A little note: I've rebuilt the databases, then rebooted, waited for the auto update check, which worked fine, then waited for the next update check, which failed.
Comment 28 Jean-François Bachelet 2021-04-27 15:17:42 CEST
another note :

When I don't need to go on the internet I cut the ethernet link (disconnect) until I need it again.

In those cases, as soon as I reconnect, mgaapplet sends me the 'can't open the rpm database' error...

so there is a bug there as well...
Comment 29 Rolf Pedersen 2021-04-27 16:08:57 CEST
(In reply to Jean-François Bachelet from comment #27)
> A little note: I've rebuilt the databases....

After claiming

(In reply to Jean-François Bachelet from comment #4)
> (In reply to Dave Hodgins from comment #2)
> 
> > The fix is to run
> > rm -f /var/lib/rpm/__db.00*
> > rpm --rebuilddb
> > 
> Hum, that's what I do with 'rpm -vv --rebuilddb'...
> advantage is I don't have to erase anything before ^^)

Nowhere do you acknowledge the requirement to run 'rm -f /var/lib/rpm/__db.00*' as the first step in rebuilding the database.

Could you confirm that you are now following these *two* steps Dave supplied in comment #2 please?  Thanks.
Comment 30 Jean-François Bachelet 2021-04-27 16:26:50 CEST
Created attachment 12680 [details]
proof

(In reply to Rolf Pedersen from comment #29)
> (In reply to Jean-François Bachelet from comment #27)
> > A little note: I've rebuilt the databases....
> 
> Could you confirm that you are now following these *two* steps Dave supplied
> in comment #2 please?  Thanks.

YES, I have done it as said in comment 2!

Here is the content of the /var/lib/rpm directory after the rebuild today at 13:32:

in attachment
Comment 31 Rolf Pedersen 2021-04-27 16:45:02 CEST
Good!
Comment 32 Dave Hodgins 2021-04-27 18:51:43 CEST
The dmesg output doesn't show whatever is going wrong.
What the "journalctl --no-hostname -b -1" output shows is that
at 12:12:46 mgaapplet failed
at 12:12:57 su to root was run, presumably to fix the db
at 12:13:18 mgaapplet succeeded
at 12:32:37 the system journal ends as part of the reboot

If you still have the output of the "journalctl --no-hostname -b" command,
run the command
$ grep -e 'Packages are up to date' -e 'Failed to open urpmi database' journal.txt
note: the above grep command is all one line. Replace journal.txt with the
file name where the journal output is stored.

That will show what time the last "Packages are up to date" message was
displayed before the error was detected.

Delete all lines in that journal file from before the last line with the
mgaapplet "Packages are up to date" message.
Delete all lines in the journal file from after the line with
"Failed to open urpmi database" message.

Compress the remaining lines. Is that still too big?
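The trimming described above can be scripted. A sketch, demonstrated on a tiny inline sample; journal.txt stands in for the real saved journal file:

```shell
# Keep only the lines from the LAST "Packages are up to date" through the
# next "Failed to open urpmi database". The sample below is illustrative.
cat > journal.txt <<'EOF'
apr 27 mgaapplet[7450]: Packages are up to date
apr 27 unrelated noise
apr 28 mgaapplet[7450]: Packages are up to date
apr 28 mgaapplet-update-checker[1]: running: urpmi.update Core Backports
apr 28 mgaapplet[7450]: Failed to open urpmi database
apr 28 trailing noise
EOF
# Line number of the last "up to date" message...
start=$(grep -n 'Packages are up to date' journal.txt | tail -n 1 | cut -d: -f1)
# ...and of the first failure at or after it.
end=$(awk -v s="$start" 'NR >= s && /Failed to open urpmi database/ {print NR; exit}' journal.txt)
sed -n "${start},${end}p" journal.txt > journal-trimmed.txt
wc -l journal-trimmed.txt
```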
Comment 33 Jean-François Bachelet 2021-04-27 20:20:53 CEST
Created attachment 12681 [details]
journal -b -2

The first time I tried, it printed more than 12,000 lines...

This one is 2,500 lines; BTW I had to go back to -2 to get the two messages.
Comment 34 Dave Hodgins 2021-04-27 21:31:56 CEST
That one again shows mgaapplet failing, the db being fixed and then mgaapplet
working.

What's needed are the journal entries from in between when it's working and
when it first fails.
Comment 35 Jean-François Bachelet 2021-04-27 22:34:46 CEST
(In reply to Dave Hodgins from comment #34)

> What's needed are the journal entries from in between when it's working and
> when it first fails.

I'll do it as soon as it breaks again ^^)
Comment 36 Jean-François Bachelet 2021-04-28 12:57:23 CEST
Created attachment 12683 [details]
new boot of today and crash

Hello ^^)

well, look at the attachment. In fact, I rebuilt the database when mgaapplet threw me the error, then tested for updates, then kept working, and then shut the machine down properly when going to bed; in between, mgaapplet worked fine.

then the next day I booted my machine, and after 3 minutes mgaapplet threw the error again...
Comment 37 Jean-François Bachelet 2021-04-28 14:11:52 CEST
(In reply to Dave Hodgins from comment #25)
> At this point I have no idea why the local rpm database is becoming corrupted
> to the point it can not be recovered automatically.
> 
> My only suggestion is to ensure it's working now, rebuild the db as per
> comment 2 and, the next time it fails, immediately save the output of the
> commands "dmesg", "journalctl --no-hostname -b" and
> "journalctl --no-hostname -b -1" for analysis. There has to be some sort
> of file conflict or io error triggering the database corruption, which
> should be shown by messages in those log outputs.

Hello Dave ^^)

A little question: how can I make 'journalctl ...' automatically skip the 'msec' messages? (more than 300000 lines -and I stopped way before the end- is wayyyy tooo long ;))

Thanks in advance.

Jeff
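[Editor's note] One possible answer, as an assumption by the editor rather than from the thread: pipe the journal through `grep -v`. The real invocation would be `journalctl --no-hostname -b | grep -v msec > journal.txt`; the filter itself is demonstrated here on two sample lines:

```shell
# Hypothetical filter; the real command would be:
#   journalctl --no-hostname -b | grep -v msec > journal.txt
# Demonstrated on sample text (the sample lines are invented):
filtered=$(printf '%s\n' 'msec: periodic security check' \
                         'mgaapplet[8882]: Failed to open urpmi database' \
           | grep -v msec)
echo "$filtered"   # only the mgaapplet line survives
```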
Comment 38 Jean-François Bachelet 2021-04-29 05:30:35 CEST
Created attachment 12687 [details]
journal good :)

Hello ^^)

this time it should be good enough to see what's wrong with mgaapplet failing repeatedly...
Comment 39 Dave Hodgins 2021-04-29 09:40:02 CEST
Thanks. Looking through this, the only obvious problem I see are the lines with
smartd[1015]: Device: /dev/sdj [SAT], 1 Currently unreadable (pending) sectors

If sdj is not where the database is stored, that is likely not relevant.

I'll look through it more after I get some sleep.
Comment 40 Dave Hodgins 2021-04-29 19:02:41 CEST
There's also ...
Device: /dev/sdg [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 113 to 114

I'm surprised there is no device error message associated with the message
mgaapplet[8882]: Failed to open urpmi database

While sdg and sdj look somewhat questionable (likely to fail soon), there
is no indication of any problems with sda, where the rpm db is stored.

I see nothing in the log to indicate why the database is failing to open.

I'm at a loss for suggestions on how to debug this further.

Assigning to the kernel team, who may have some idea or see something I've missed in attachment 12687 [details].

Assignee: bugsquad => kernel

Comment 41 Mauricio Andrés Bustamante Viveros 2022-09-27 04:35:27 CEST
I had this issue too:
1. When I configured the paranoid security level and, after that, the msec policies
2. After a few restarts of the VM

In the first case, I did not pay attention to it, because of my modifications.
But in the second case, I (without knowledge of this bug 28801) deleted the Mageia mirror listing, because that mirror was slow to download packages, and mgaapplet worked again.

If I hit the issue again, I will upload the journal... CCing myself to keep this as a tracker.

CC: (none) => neoser10

Comment 42 Jean-François Bachelet 2022-12-04 11:55:27 CET
Hello folks ^^)

well, another set of new machines (four) set up with new disks and fresh latest Mageia 8 installs and... still that bug!

this time I disabled the install CD-ROM 'core' and 'Core update' repos at install time, so ONLY the online repos are used when installing and for later updates.

that changed nothing; the bug still resurfaces every now and then. It works after a db rebuild for one or two boots, then breaks again...

that's getting very unpleasant.

DNF is NOT installed nor running.

what I've seen is that after each use of the MCC, whatever I need it for, the rpm database stays locked after quitting and the __db.00* files are left in the rpm directory...

so could it be the auto-update of the repos that triggers the bug?
Comment 43 Morgan Leijström 2022-12-04 18:55:48 CET
Can you see in the journal the last operation before you experience the problem, and if/how it fails?

> so could it be the auto-update of repos that bring the bug ?

I wonder if this could be an effect of the downloader failing.
https://wiki.mageia.org/en/Mageia_8_Errata#Downloading_software
So, as a shot in the dark, change the downloader to wget.
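[Editor's note] For reference, the downloader can be set globally in /etc/urpmi/urpmi.cfg; the fragment below is the editor's recollection of the syntax and should be checked against the urpmi.cfg(5) man page (or changed via the MCC software media settings) before use:

```
{
  downloader: wget
}
```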

CC: (none) => fri

Comment 44 Jean-François Bachelet 2022-12-06 09:13:07 CET
Hello ^^)

here are my findings from investigating what happens before the mgaapplet failure:

1/ each time you use the MCC, it locks the rpm database and creates the __db.00x files, and when you exit, it leaves those files in the /var/lib/rpm directory :(

so when mgaapplet starts, it finds the __db.00x files and fails...

(I have observed that when it finds the xxx.lock files but no __db.00x files, that doesn't stop it from doing its job fine)

--> possible cure 1: make MCC delete those __db.00x files before exiting.


2/ each time MSEC starts, it soon runs scripts from the /usr/share/msec/scripts directory; one of these scripts is devoted to rpm ==> 03_rpm.sh.
When run, this script also creates the __db.00x files and leaves them in the /var/lib/rpm directory after its job, just like the MCC :(

so when mgaapplet starts, it finds those __db.00x files and fails again :(

--> possible cure 2: modify the 03_rpm.sh script so it deletes the __db.00x files from the /var/lib/rpm directory before it exits.

for now I've not found anything else that could create those ?!#*! __db.00x files...

what I understand is that those __db.00x files are useful to MCC and the 03_rpm.sh script only while those two are running, and serve nothing and no one afterwards.

and when running 'rpm -vv --rebuilddb', it seems to ignore and then erase those __db.00x and xxx.lock files upon exiting...

So, from these findings, I think it would be wise to make the two modifications I suggest to MCC and the 03_rpm.sh script, to possibly cure mgaapplet's 'unable to open the rpm database' failure for good.

another way would be to NOT store those __db.00x files on disk, but make MCC and the 03_rpm.sh script keep them in memory or /tmp, so they vanish by themselves when the two exit and don't interfere with mgaapplet's job...

what do you think of that, Mr Devs?
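[Editor's note] The proposed cure 2 could look something like the sketch below. This is the editor's assumption of what such a cleanup would be, not the actual Mageia 03_rpm.sh code; an EXIT trap removes the stale environment files whichever way the script terminates. RPMDB_DIR would be /var/lib/rpm in reality; a temp directory is used here so the sketch is safe to run:

```shell
# Hypothetical cleanup for a script like 03_rpm.sh (not the real one).
RPMDB_DIR=$(mktemp -d)   # stand-in for /var/lib/rpm
(
    cleanup() { rm -f "$RPMDB_DIR"/__db.00*; }
    trap cleanup EXIT               # runs on any exit path
    touch "$RPMDB_DIR/__db.001"     # simulate rpm leaving env files behind
    # ... the script's rpm queries would go here ...
)   # subshell exits -> trap fires -> stale files removed
ls -A "$RPMDB_DIR"   # empty
```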
Comment 45 Morgan Leijström 2022-12-06 09:27:42 CET
This really is one for our tools experts.

Assignee: kernel => mageiatools