Bug 27580 - Mageia Live 8b1 persistent: urpmi killed by timeout while removing localisations - rpm database get irrepairably broken
Summary: Mageia Live 8b1 persistent: urpmi killed by timeout while removing localisati...
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages (show other bugs)
Version: Cauldron
Hardware: All Linux
Priority: High critical
Target Milestone: ---
Assignee: Martin Whitaker
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-11-08 22:37 CET by Morgan Leijström
Modified: 2020-12-06 01:59 CET (History)
3 users (show)

See Also:
Source RPM: drakxtools-18.34
CVE:
Status comment:


Attachments
File from running # LC_ALL=C remove-unused-packages 2>&1 | tee 1_rmu.txt (68.82 KB, text/plain)
2020-11-11 15:09 CET, Morgan Leijström
Details
Relevant part of journal (82.07 KB, text/plain)
2020-11-11 15:09 CET, Morgan Leijström
Details

Description Morgan Leijström 2020-11-08 22:37:30 CET
On first boot of a live system without doing anything before, i executed our tool remove-unused-packages to get rid of a half thousand language packages, to make updating quicker.
But it quit because wrong owner of XDG_RUNTIME_DIR and rpm database is now broken :(

To me, it raises too many questions to understand, i.e why did it finish 366 uninstalls before reacting to that??



Steps to Reproduce:
1. Create a persistent USB stick from Mageia-8-beta1-Live-Xfce-x86_64.iso
2. Boot it, open xfce terminal app, and there
3. Attempt removing unused locales, per https://wiki.mageia.org/en/Removing_packages#Removing_unused_localisation_and_hardware_support

 $ su
 # remove-unused-packages

And in there i deselected to remove hardware support and let it go.  Result: After operating a while it interrupts, complaining "XDG_RUNTIME_DIR (/run/user/1000) is not owned by us":

...

tar bort paket mageia-doc-mcc-es-7.3-2.mga8.noarch
  365/505: tar bort mageia-doc-mcc-es-7.3-2.mga8.noarch
                                 #############################################
tar bort paket hunspell-yi-1.1-12.mga8.noarch
  366/505: tar bort hunspell-yi-1.1-12.mga8.noarch
                                 #############################################
XDG_RUNTIME_DIR (/run/user/1000) is not owned by us (uid 0), but by uid 1000! (This could e.g. happen if you try to connect to a non-root PulseAudio as a root user, over the native protocol. Don't do that.)
[root@localhost live]#

I found rpm database is broken, tried to repair:

[root@localhost live]# LC_ALL=C rpm --rebuilddb
error: failed to replace old database with new database!
error: replace files in /var/lib/rpm with files from /var/lib/rpmrebuilddb.4881 to recover

(And I tried that file replacing: still no joy, but that is not important for me)


I have earlier used drakrpm GUI to remove localisations, selecting manually, before i knew about our tool and that finished without error.  But I did not remove so many in one pass.
Comment 1 Morgan Leijström 2020-11-08 22:40:34 CET
I guess you Martin can see where the problem is?

I am about to try repeating using mga7 xfce live.

Assignee: bugsquad => mageiatools
CC: (none) => mageia

Comment 2 Dave Hodgins 2020-11-08 23:04:04 CET
Please try again after using "su -" instead of just su.

CC: (none) => davidwhodgins

Comment 3 Pascal Terjan 2020-11-08 23:15:04 CET
My basic guess would be that the problem was triggered by using su instead of "su -" meaning that all environment variables from the current user remain defined, and root ends up using all the stuff from the user (and possibly breaking other things by having root owned files in user directory).

Still it's not great that somehow pulseaudio kills things running a command as root with a non root user environment. The strange thing is what would be using pulseaudio while urpme is uninstalling packages? If it was a %preun script, I would expect it to only kill the script.
Comment 4 Martin Whitaker 2020-11-08 23:18:36 CET
Yes, the error message you report is because you used 'su' not 'su -'. But it is otherwise harmless. I have just tried to reproduce the problem, and it completed without error for me, but gave me the same error message at the end (as well as several times on the way through. Check the system log for other errors.
Comment 5 Martin Whitaker 2020-11-08 23:20:42 CET
I think the reference to pulseaudio is just an example of what might trigger that error message.
Comment 6 Morgan Leijström 2020-11-08 23:27:28 CET
Thank you for the very quick replies :)

Yes getting shot in the back is no fun.

I just verified that there is not the same problem with mga7.1 64bit xfce live.  But it may just be coincidence.  No occurrence of that message (at least in the terminal) but maybe I am just lucky.

Yes i should learn to always use su - 
- and we should also write it in all relevant user instructions...

Now i am loading that stick again with the 8b1 iso to try with su -.

Any new iso on the horizon?


Hardware i am testign on: Very fast USB stick, but limited by USB2 port; on a Thinkpad T400.
Comment 7 Morgan Leijström 2020-11-09 00:15:06 CET
Yes, the " -" after "su"  did the trick :)

OK, so how to best avoid other users fall in same hole as me?
Morgan Leijström 2020-11-09 00:16:08 CET

Summary: Mageia Live 8b1 persistent: rpm database get irrepairably broken by removing localisations => Mageia Live 8b1 persistent: rpm database get irrepairably broken by simple mistake while removing localisations

Comment 8 Dave Hodgins 2020-11-09 05:11:20 CET
Using "su --login" (which can be abbreviated as just "su -" is described in
the man page for su. It's something all linux users have to learn, usually
the hard way as they won't think to look for the information until they
encounter the bad effects it can have. For example, using just "su", and
then running any program that updates files in ~/.config will change
ownership of those files to root:root. When the user tries running the
same program under their own id they will fail since it can no longer
update the files in ~/.config

You're the first person I'm aware of that hadn't already learned the
impact of using "su" before trying to use the remove-unused-packages
script.

I frequently explain it in various linux related usenet newsgroups and
other places, and instruct them to read the su man page.

I'm closing the bug as invalid since it's a linux education problem rather
then a Mageia specific problem.

Feel free to continue the discussion here, even though the bug has been closed
if you have an idea of where it can be better documented.

Status: NEW => RESOLVED
Resolution: (none) => INVALID

Comment 9 Morgan Leijström 2020-11-09 10:17:14 CET
Lesson learned.

Any suggestion about where to put such information in our wiki or other docs?

"su --login" is nowhere to be found currently.

( And searching for "su -" in our wiki result in a database query error :/ ) 



( For this specific case i have suggested:  Bug 27372 - MCC: Link to our tool for removing unused localisations and translations, remove-unused-packages )
Comment 10 Dave Hodgins 2020-11-09 21:13:56 CET
I've created https://wiki.mageia.org/en/Never_use_%22su%22! and asked the docteam
to pretty it up etc.
Comment 11 Morgan Leijström 2020-11-09 22:36:15 CET
Great!  :)

I think the page should mention sudo, and some words when and when not sudo can be used instead of "su -".  Could you, who know much better than me, add some words?  Use link to [[Configuring_sudo|Configuring sudo]].



Announced on forum: https://forums.mageia.org/en/viewtopic.php?f=41&t=13769

I edited https://wiki.mageia.org/en/Removing_packages#Removing_unused_localisation_and_hardware_support to reference your page.
Comment 12 Morgan Leijström 2020-11-09 22:43:26 CET
Correct headline ;)

Summary: Mageia Live 8b1 persistent: rpm database get irrepairably broken by simple mistake while removing localisations => Mageia Live 8b1 persistent: rpm database get irrepairably broken by forgetting dash after su while removing localisations

Comment 13 Martin Whitaker 2020-11-09 23:44:52 CET
(In reply to Morgan Leijström from comment #12)
> Correct headline ;)

As said in comment 4, using 'su' is responsible for the XDG_RUNTIME_DIR message, but I'm not convinced it caused the rpm database error - it didn't for me.
Comment 14 Morgan Leijström 2020-11-09 23:46:43 CET
OK. Details are too convoluted for me.

Should we reopen and dig more?
Comment 15 Martin Whitaker 2020-11-09 23:51:20 CET
Only if you can reproduce the fault. If you can, capture the system log as well as the full terminal output.
Comment 16 Morgan Leijström 2020-11-09 23:56:22 CET
OK then.  In a few days i will try.

Resolution: INVALID => (none)
Status: RESOLVED => REOPENED

Comment 17 Morgan Leijström 2020-11-10 08:28:58 CET
Wiki page moved to https://wiki.mageia.org/en/Never_use_just_su
Comment 18 Morgan Leijström 2020-11-11 15:07:37 CET
Caught a crash now on first attempt.
Logs of journal and terminal output follows as attachments.

And, this time I _did_ use "su -" !
And no error visible nor noted in the log during the process.
Journal tells process urpmi runaway.
And again rpm fail to rebuild database.

Strangely, it now quit a hundred packages earlier in the process.
This happens to be a few seconds after screensaver kicked in.
During the process i reported in comment #0, I used the keyboard/pointer so there screensaver did not kick in, so it is probably not that problem.


This is the same machine and same stick.
To make the install it juvenile, I first output the stick in another system and renamed the folder memory/ in the persistent partition.
(so I can easily revert to my system in use.)
During boot I selected the same options as before, for Swedish localisation.

But how come it crashed this time and not last time??
Both times I used "su -".
So Remi is right, it is some other problem.

I wonder if it may be a hardware problem.  I bought this old laptop together with four other T400 machines, sold one but had to exchange it because sometimes it hangs completely. But i find it strange it is only during removal of many packages this happened.  I could update 1000+ packages and install lxde, and have used it for surfing an evening without any visible problem.

I can not see a pattern.

It often amuse me that system tells me how exiting things are ("### Program is exiting ###")  :^)

Summary: Mageia Live 8b1 persistent: rpm database get irrepairably broken by forgetting dash after su while removing localisations => Mageia Live 8b1 persistent: urpmi runaway while removing localisations - rpm database get irrepairably broken

Comment 19 Morgan Leijström 2020-11-11 15:09:15 CET
Created attachment 11985 [details]
File from running # LC_ALL=C remove-unused-packages 2>&1 | tee 1_rmu.txt
Comment 20 Morgan Leijström 2020-11-11 15:09:49 CET
Created attachment 11986 [details]
Relevant part of journal
Comment 21 Pascal Terjan 2020-11-11 15:32:17 CET
This means the command took longer than the given timeout.

The code does not set one:

    run_program::rooted($o_prefix, 'urpme', '--auto',
                     if_($hardware, @unused_hardware_packages),
                     if_($locales, @unselected_locales),
        );

So it is using a default of 10 minutes.

CC: (none) => pterjan

Comment 22 Dave Hodgins 2020-11-11 15:49:51 CET
See bug 44
Comment 23 Morgan Leijström 2020-11-11 16:47:23 CET
Yes you are correct:  it got killed after exactly 10 minutes :)

The variable results from testing could come from variable write times to the USB stick.

Do that timeout also affect urpmi commands given by user, i.e 
urpmi --auto-up --auto    ?

I used the GUI to update another install on same stick and machine, and that ran for very long without problems.

Also note, this is on a rather fast USB stick.  (but on USB2 port)

Priority: Normal => High
Summary: Mageia Live 8b1 persistent: urpmi runaway while removing localisations - rpm database get irrepairably broken => Mageia Live 8b1 persistent: urpmi killed by timeout while removing localisations - rpm database get irrepairably broken
Source RPM: (none) => drakxtools-curses

Comment 24 Martin Whitaker 2020-11-11 17:14:04 CET
The timeout is only applied to commands run from inside one of the drakx tools, so no, it won't affect urpm* commands you run from the command line.

We had a similar problem when I added the option to install updates in the Live installer (see thread starting at https://ml.mageia.org/l/arc/qa-discuss/2020-07/msg00002.html). I fixed that by disabling the timeout for that case.
Comment 25 Martin Whitaker 2020-11-14 22:16:20 CET
Fixed in git, will be fixed in next release of drakxtools.

Source RPM: drakxtools-curses => drakxtools-18.34
Assignee: mageiatools => mageia

Comment 26 Morgan Leijström 2020-11-18 19:39:54 CET
Great :)

BTW, I tested Mageia 7.1 i586 xfce live persistent:

There it did not get interrupted.
(I tried to interrupt after 9  minutes using ctrl-C but it did not stop.)
It took 80 minutes to remove unused localisations, 501 packages.
And that was not on a very slow USB, writing the iso took 50 seconds.

Hmm that makes me wonder why so slow and what can be done about it...
If buffering could be increased, or flushing/syncing relaxed?
At least for us enthusiast with some CLI command?
Comment 27 Martin Whitaker 2020-11-19 19:00:45 CET
I don't know why some people see such poor performance with persistence (you aren't the first to report this). What make/model of USB stick are you using?

If you install a system onto that USB stick, is it similarly slow when you are running that system?

On my 6 year old laptop with a SanDisk Ultra USB 3.0 memory stick, when plugged into a USB 3.0 port it took 2.5 minutes to remove the 520 unused packages, and when plugged into a USB 2.0 port it took 3 minutes.
Comment 28 Morgan Leijström 2020-11-19 22:34:20 CET
Whoa - yes very different speed with another USB drive!!

Now i tried the exact same .iso put on another stick with persistence.
It finishes removing unused locales, 501 packages in just exactly one minute.
 !! That is *80 times* faster !!
It is still USB2 port on the same one core i586 1,86GHz CPU Thinkpad T43.

What info can I provide?

For starter, maybe i should test mga8b1 xfce 32 bit and we work from there?

journal excerpts follow, first part of lines deleted for readability


The stick where persistence is  s l o w :

usb 5-2: new high-speed USB device number 5 using xhci_hcd
usb 5-2: New USB device found, idVendor=1b1c, idProduct=1a90, bcdDevice=11.00
usb 5-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 5-2: Product: Voyager GT
usb 5-2: Manufacturer: Corsair
usb 5-2: SerialNumber: A164012700015819
usb-storage 5-2:1.0: USB Mass Storage device detected
scsi host10: usb-storage 5-2:1.0
scsi 10:0:0:0: Direct-Access     Corsair  VoyagerGT        1100 PQ: 0 ANSI: 0 CCS
sd 10:0:0:0: [sdc] 31719424 512-byte logical blocks: (16.2 GB/15.1 GiB)
sd 10:0:0:0: [sdc] Write Protect is off
sd 10:0:0:0: [sdc] Mode Sense: 43 00 00 00
sd 10:0:0:0: [sdc] No Caching mode page found
sd 10:0:0:0: [sdc] Assuming drive cache: write through
sdc: sdc1 sdc2 sdc3
sd 10:0:0:0: [sdc] Attached SCSI removable disk


The stick where persistence is lightning fast:

usb 6-1: new SuperSpeed Gen 1 USB device number 4 using xhci_hcd
usb 6-1: New USB device found, idVendor=1b1c, idProduct=1a0e, bcdDevice= 1.00
usb 6-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1
usb 6-1: Product: Voyager GTX
usb 6-1: Manufacturer: Corsair
usb 6-1: SerialNumber: 511190606267A0D6018B
scsi host10: uas
scsi 10:0:0:0: Direct-Access     Corsair  Voyager GTX      0    PQ: 0 ANSI: 6
sd 10:0:0:0: [sdc] Spinning up disk...
.ready
sd 10:0:0:0: [sdc] 250069680 512-byte logical blocks: (128 GB/119 GiB)
sd 10:0:0:0: [sdc] Write Protect is off
sd 10:0:0:0: [sdc] Mode Sense: 43 00 00 00
sd 10:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 10:0:0:0: [sdc] Optimal transfer size 33553920 bytes
sdc: sdc1 sdc2 sdc3
sd 10:0:0:0: [sdc] Attached SCSI disk
Comment 29 Morgan Leijström 2020-11-19 22:39:50 CET
Feel free to open a separate bug for the speed issue if you start working on that.
Comment 30 Martin Whitaker 2020-11-19 23:20:25 CET
From

https://usb.userbenchmark.com/Corsair-Flash-Voyager-GT-USB-30-16GB/Rating/1272 https://usb.userbenchmark.com/SpeedTest/13691/Corsair-Voyager-GTX

Corsair Voyager GT 16GB scores 9.74/0.37/0.55 MB/s on random 4K read/write/mixed
Corsair Voyager GTX 128GB scores 17/19.2/7.45 MB/s on random 4K read/write/mixed

I think that explains the difference!

My SanDisk Ultra USB 3.0 scores 5.32/1.45/2.51 MB/s, so is between your two.
Comment 31 Morgan Leijström 2020-11-19 23:56:51 CET
Oh, that performance of the slow stick on random writes is really ugly yes...

But I still think that is only part of explanation:
Averages number difference factor is 52 ( 19,2/0,37 ).
If i had logged a factor 52 in my tests, i would also have had to change CPU, memory, buses and USB port to 52 times faster devices.  
But I logged even greater difference - a factor 80 - with everything else but the stick unchanged.

Could it be that package management happens to be a worse test case than that linked tests, and some USB are perform extremely poor then... 

Or something regarding how kernel handles USB memories is not optimal for some?

To make it less random, can we buffer more, or aggregate writes?

Would a flash optimised file system perform better for persistence?

How much data is written when i *remove* 512 packages?
Comment 32 Martin Whitaker 2020-11-20 00:48:06 CET
It's likely most of the writes when deleting packages are less than 4K, so yes it will be a worse test case. And if CPU/memory/USB ports can keep up with the faster device, they don't matter.

You can try experimenting with different file systems for the persistent partition (just reformat it, and make sure to give it the volume label 'mgalive-persist'). I would guess a non-journaled file system would do better.

But really, for what you are trying to do, you want to be using a USB stick with better performance.
Comment 33 Morgan Leijström 2020-11-20 14:46:29 CET
I think you are correct.

I will try to write something in wiki about how to select USB stick for persistence use, especially for tasks like updating.  The userbenchmark site is a good information source, thanks.
Comment 34 Martin Whitaker 2020-11-22 21:43:59 CET
Fix released in drakxtools 18.35.

Resolution: (none) => FIXED
Status: REOPENED => RESOLVED

Comment 35 Morgan Leijström 2020-12-06 01:59:11 CET
Confirming fixed :)

Note You need to log in before you can comment on or make changes to this bug.