Bug 25541 - urpmi activity causes occasional system hang in syncfs() on root filesystem
Summary: urpmi activity causes occasional system hang in syncfs() on root filesystem
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: Cauldron
Hardware: All Linux
Priority: Normal, Severity: normal
Target Milestone: ---
Assignee: Mageia tools maintainers
QA Contact:
URL:
Whiteboard:
Keywords: NEEDINFO
Depends on:
Blocks:
 
Reported: 2019-10-09 20:25 CEST by Frank Griffin
Modified: 2019-12-15 04:11 CET
CC List: 3 users

See Also:
Source RPM: kernel, urpmi, rpm, grub2, os-prober
CVE:
Status comment:


Attachments
Last 1000 lines of strace (82.67 KB, text/plain)
2019-10-16 17:36 CEST, Frank Griffin
tail end of urpmi stdout (16.96 KB, text/plain)
2019-10-16 17:37 CEST, Frank Griffin

Description Frank Griffin 2019-10-09 20:25:32 CEST
Install of the current kernel packages is blocking, and only responds to kill -9:

virtualbox (6.0.12-2.mga8): Installing module.
.............
.......
Creating: target|kernel|dracut args|basicmodules
remove-boot-splash: Format of /boot/initrd-5.3.5-desktop-1.mga8.img not recognized
^C^C^C^C^C^C^C^C^C^C^Cwarning: %posttrans(kernel-desktop-5.3.5-1.mga8-1-1.mga8.x86_64) scriptlet failed, signal 9
ERROR: 'script' failed for x11-driver-video-intel-2.99.917-56.mga7.x86_64

Update: This has now happened 3 times, on both a laptop and a desktop system, and the last two times "kill -9" had no effect and a reboot was required.  The freeze appears to happen during the install of different co-requisites of the kernel, e.g. VirtualBox.

"ps ax" shows the status of the hung urpmi as D.
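
For reference, state D in ps output means uninterruptible sleep: the process is blocked inside the kernel, usually on I/O, and signals are generally not delivered until it wakes up, which fits "kill -9" having no effect. Below is a minimal sketch for listing every process currently in that state, assuming a Linux /proc filesystem. The function names are hypothetical and are not part of urpmi or any Mageia tool.

```python
import os

def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    left, right = stat_line.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state

def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # no procfs here (assumption: Linux only)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found

if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Running this in a loop while the hang develops would show whether other processes join urpmi in state D over time.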
Comment 1 Lewis Smith 2019-10-09 21:27:26 CEST
Can you please give some background information? e.g.
* Is this happening in a VB virtual machine, or a real hardware Mageia installation? You mention "both a laptop and a desktop system".
* If real hardware, can you say what graphics hardware you have? (since there is an error re an Intel video driver).
* You say "installing the kernel", and mention urpmi. So, is this happening when updating an existing system? Can you give the failing command?
* Can you give your current/previous kernel version; and that of the one failing to install?
* After the failed/aborted kernel install, can you still boot successfully to the previous one?
TIA

Assigning to the kernel team.

Assignee: bugsquad => kernel

Comment 2 Lewis Smith 2019-10-09 21:43:17 CEST
See also bug 25542.

CC: (none) => lewyssmith

Comment 3 Frank Griffin 2019-10-09 22:09:49 CEST
(In reply to Lewis Smith from comment #1)
> Can you please give some background information? e.g.
> * Is this happening in a VB virtual machine, or a real hardware Mageia
> installation? You mention "both a laptop and a desktop system".

Real hardware, during "urpmi --auto-update"

> * If real hardware, can you say what graphics hardware you have? (since
> there is an error re an Intel video driver).

The laptop has two cards:

Identification
Vendor: Intel Corporation
Description: UHD Graphics 620
Media class: VGA compatible controller
Connection
Bus: PCI Express
PCI domain: 0
Bus PCI #: 0
PCI device #: 2
PCI function #: 0
PCI revision: 0x07
Vendor ID: 0x8086
Device ID: 0x5917
Sub vendor ID: 0x1043
Sub device ID: 0x163e
Misc
Module: Card:Intel 810 and later

Identification
Vendor: NVIDIA Corporation
Description: GP108M [GeForce MX150]
Media class: 3D controller
Connection
Bus: PCI Express
PCI domain: 0
Bus PCI #: 1
PCI device #: 0
PCI function #: 0
PCI revision: 0xa1
Vendor ID: 0x10de
Device ID: 0x1d10
Sub vendor ID: 0x1043
Sub device ID: 0x163e
Misc
Module: Card:NVIDIA GeForce 635 series and later

As far as I know, it's using the Intel card.

For the desktop, it's

Identification
Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
Description: RS780D [Radeon HD 3300]
Media class: VGA compatible controller
Connection
Bus: PCI
PCI domain: 0
Bus PCI #: 1
PCI device #: 5
PCI function #: 0
Vendor ID: 0x1002
Device ID: 0x9614
Sub vendor ID: 0x1565
Sub device ID: 0x0217
Misc
Module: Card:ATI Radeon HD 4870 and earlier

> * You say "installing the kernel", and mention urpmi. So, is this happening
> when updating an existing system? Can you give the failing command?

"urpmi --auto-update", as above.

> * Can you give your current/previous kernel version; and that of the one
> failing to install?

This appears to have happened with the last 3 kernels to hit cauldron.  The current one is 5.3.4-desktop-1.mga8

> * After the failed/aborted kernel install, can you still boot successfully
> to the previous one?

In general, you can even boot to the current one, but there may be incompletely installed packages.  In one case, every Konsole window that opened showed an error claiming that a lib64gdk library was not long enough, and I had to reinstall the rpm containing that library using --replacepkgs to correct this.  However, see also bug#25542.
Comment 4 Thomas Backlund 2019-10-09 22:19:16 CEST
The kernel is not at fault; it only calls out to /sbin/installkernel, and the toolchain / utils take over...


Could be grub2, os-prober, some other thing hanging...

What happens if you actually wait it out?

CC: (none) => tmb
Assignee: kernel => mageiatools
Source RPM: kernel => bootloader-utils, drakxtools-backend, grub(2)?

Comment 5 Frank Griffin 2019-10-09 22:27:30 CEST
(In reply to Thomas Backlund from comment #4)
> What happens if you actually wait it out?

I left one occurrence hanging for several hours with no progress.
Comment 6 Thomas Backlund 2019-10-09 22:33:41 CEST
Ok, so something (even maybe the new rpm) is hanging, as iirc we normally should time out after ~10 minutes ...

So we'd need something like strace or gdb backtrace to see what/where we get stuck.

Of course, if you happen to have an older kernel (5.1 or 5.2 series) installed, it would be nice to see whether the same hang happens when you install a new 5.3.5 kernel.
Comment 7 Frank Griffin 2019-10-09 23:04:50 CEST
OK, I'll do the updating with strace and get a gdb backtrace if I get a hang.
Comment 8 Frank Griffin 2019-10-16 17:35:24 CEST
I was just about to close this when I got a hit.

I have the strace file, which is unfortunately about 1 GB in size.  I'll attach the last 1000 lines, as well as the stdout lines leading up to the hang.

I can't get a gdb backtrace, since gdb won't attach to a process being straced.

The hang appears to happen in the midst of a syncfs() call.
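
Since gdb cannot attach while strace is tracing the process, one alternative is to read what the kernel itself exposes under /proc for the hung PID. The following is a sketch only, assuming Linux procfs. wchan and syscall are standard procfs entries; stack is typically readable by root only and depends on the kernel configuration. The function names are hypothetical.

```python
import os

def read_proc_file(pid, name):
    """Return the stripped contents of /proc/<pid>/<name>, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/{name}", errors="replace") as f:
            return f.read().strip()
    except OSError:
        return None  # missing entry, no permission, or process already gone

def blocked_location(pid):
    """Collect the kernel's view of where a process is currently waiting."""
    return {
        "wchan": read_proc_file(pid, "wchan"),      # kernel symbol the task sleeps in
        "syscall": read_proc_file(pid, "syscall"),  # syscall number and raw arguments
        "stack": read_proc_file(pid, "stack"),      # kernel stack, usually root only
    }

if __name__ == "__main__":
    import sys
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for key, value in blocked_location(pid).items():
        print(f"{key}: {value}")
```

Passing the PID of the hung urpmi as the first argument would print whatever the kernel is willing to reveal; entries that cannot be read come back as None.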
Comment 9 Frank Griffin 2019-10-16 17:36:56 CEST
Created attachment 11321
Last 1000 lines of strace
Comment 10 Frank Griffin 2019-10-16 17:37:54 CEST
Created attachment 11322
tail end of urpmi stdout
Comment 11 Frank Griffin 2019-10-16 17:40:48 CEST
As before, the hung process does not respond to CTRL-C, kill, or kill -9.
Comment 12 Frank Griffin 2019-10-16 20:45:59 CEST
The syncfs() appears to be holding some sort of lock, because although it was only the Konsole window running urpmi that was hung initially, other open Konsole windows hung as time passed.  Eventually, everything including Plasma and X became unresponsive, requiring a magic key reboot.
Lewis Smith 2019-10-17 10:16:12 CEST

CC: lewyssmith => (none)

Comment 13 Frank Griffin 2019-10-17 20:47:02 CEST
Got another hit, same exact strace signature - openat() of "/" followed by syncfs() of that fd.  This one had nothing to do with kernel, but was installing task-obsolete and nothing else.
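
The strace signature described here (openat() of "/" followed by syncfs() on that descriptor) can be exercised in isolation to check whether syncfs() on the root filesystem stalls outside urpmi. This is a minimal sketch, assuming Linux and a libc that exports syncfs(). timed_syncfs is a hypothetical helper, not something rpm or urpmi provides. If the underlying problem is present, the call may block in the same unkillable way.

```python
import ctypes
import os
import time

def timed_syncfs(path="/"):
    """Open `path`, call syncfs() on the descriptor, return elapsed seconds."""
    # CDLL(None) resolves symbols from the running process, which includes libc.
    libc = ctypes.CDLL(None, use_errno=True)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        start = time.monotonic()
        if libc.syncfs(fd) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
        return time.monotonic() - start
    finally:
        os.close(fd)

if __name__ == "__main__":
    print(f"syncfs('/') took {timed_syncfs('/'):.3f} s")
```

A consistently small elapsed time would suggest the filesystem flushes normally when rpm is not involved; a call that never returns would point away from urpmi itself.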

Summary: Kernel install freezes, requiring "kill -9" or reboot => urpmi activity causes occasional symptom hang in syncfs() on root filesystem
Source RPM: bootloader-utils, drakxtools-backend, grub(2)? => urpmi, rpm

Frank Griffin 2019-10-17 20:54:22 CEST

Summary: urpmi activity causes occasional symptom hang in syncfs() on root filesystem => urpmi activity causes occasional system hang in syncfs() on root filesystem

Comment 14 Frank Griffin 2019-10-17 23:06:23 CEST
After the latest hang, I tried to verify whether bug#25573 was still occurring, and when I tried to invoke drakboot from MCC it eventually timed out with:

The "drakboot" program has crashed with the following error:

  update-grub2 failed:  at /usr/lib/libDrakX/any.pm line 697.
  	...propagated at /usr/libexec/drakboot line 49.
  Perl's trace:
  drakbug::bug_handler() called from /usr/libexec/drakboot:49

Used theme: oxygen-gtk

To submit a bug report, click on the report button.  
This will open a web browser window on Bugzilla where you'll find a form to fill in.  The information displayed above will be transferred to that server
Things useful to attach to your report are the output of the following commands: 'lspcidrake -v', 'blkid'.
You should also attach the following files: /etc/modprobe.conf, /etc/fstab, /boot/grub/menu.lst, /boot/grub/devices.map as well as /etc/lilo.conf.
Comment 15 Frank Griffin 2019-10-18 20:34:54 CEST
I found another oddity in a subsequent update on the same system as the two above.  The update for a tex package appeared to stall midway, and when I used "tail -f" on the strace file, it was reading and writing as fast as it could.  That update never completed, but the strace activity continued.

As in the syncfs() cases, various components in the system stopped responding, and I ended up rebooting.  On a hunch, I did an "rpm --rebuilddb", which succeeded, and then restarted the update which also succeeded.

I'd hold off pursuing this unless it happens again.  It's possible that the rpm database was screwed up and causing infinite I/O loops.
Comment 16 papoteur 2019-10-19 08:04:11 CEST
Hello,
I saw hangs with another application, konversation. When I looked at the activity, konversation was "waiting for disk". This occurred after installation of kernel-desktop-5.3.2-1.mga7-1-1.mga7. Each time this occurred and I tried to kill the unresponsive application, the whole desktop became unresponsive, except for mouse movement. The magic keys were of no help.
I am now using a previous kernel, which is fine.

CC: (none) => yves.brungard_mageia

Comment 17 Frank Griffin 2019-10-21 21:37:11 CEST
It's happened again, but with a variation.  This time the hang occurs in a different sort of sync:

23451 getpid()                          = 23451
23451 getpid()                          = 23451
23451 getpid()                          = 23451
23451 getpid()                          = 23451
23451 getpid()                          = 23451
23451 getpid()                          = 23451
23451 getpid()                          = 23451
23451 pread64(6, "\0\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0*\1\"\6\0\r\373\17\366\17\361\17"..., 4096, 4096) = 4096
23451 pwrite64(6, "\0\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0*\1\"\6\0\r\373\17\366\17\361\17"..., 4096, 4096) = 4096
23451 fdatasync(6
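
With a trace file of about 1 GB, finding the call that never returned by eye is tedious. Below is a rough sketch that scans strace output in the format shown above (PID prefix on each line, as from strace -f -o) and reports, per PID, the last syscall that started but has no return value. The regular expressions are an assumption about that format and will misjudge lines whose string arguments happen to contain ") = ". pending_syscalls is a hypothetical helper name.

```python
import re

LINE = re.compile(r"^(\d+)\s+(.*)$")
RETURNED = re.compile(r"\)\s+= ")

def pending_syscalls(lines):
    """Map pid -> text of the last syscall that was started but never returned."""
    pending = {}
    for raw in lines:
        m = LINE.match(raw.rstrip("\n"))
        if not m:
            continue
        pid, rest = int(m.group(1)), m.group(2)
        if rest.startswith("+++"):
            pending.pop(pid, None)  # process exit notice
            continue
        if rest.startswith("---"):
            continue  # signal delivery notice, not a syscall
        if rest.startswith("<...") and "resumed>" in rest:
            pending.pop(pid, None)  # earlier unfinished call completed
            continue
        if rest.endswith("<unfinished ...>") or not RETURNED.search(rest):
            pending[pid] = rest
        else:
            pending.pop(pid, None)
    return pending

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], errors="replace") as f:
        for pid, call in pending_syscalls(f).items():
            print(pid, call)
```

The file is read line by line, so memory use stays small even for very large traces.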
Comment 18 Frank Griffin 2019-10-22 15:42:28 CEST
Yet another variation.  In this case, the urpmi completes, both with a new prompt on the command line and the strace indicating "exited with rc 0".  However, the Konsole window in which urpmi ran has stopped responding and will not receive the focus.  The same is true for other Konsole windows ***on the same virtual desktop***, but not Konsole windows on other virtual desktops.

I don't understand this.  My only guess is that some kernel callback related to sync activity but running asynchronously to the client app request is hanging while holding some lock needed by other processes.  What this should have to do with virtual desktops is anybody's guess.

One other symptom I've seen when switching to a tty to initiate reboot is that the tty is receiving messages to the effect that journald-stop has timed out, so perhaps the hang is related to journald not responding.
Comment 19 Thierry Vignaud 2019-10-31 09:50:47 CET
It could be update-grub2 (via os-prober) doing something harmful on some partitions.
Did you try disabling "Probe Foreign OS" in drakboot?
That should fix the timeouts while running update-grub2.

AFAIC, this is not an urpmi/rpm bug, calling sync() is not a bug.

CC: (none) => thierry.vignaud
Keywords: (none) => NEEDINFO
Source RPM: urpmi, rpm => kernel, urpmi, rpm, grub2, os-prober

Comment 20 Frank Griffin 2019-12-15 04:11:46 CET
I'm closing this as RESOLVED.  I have been running urpmi under strace on both systems with no hangs for a couple of months.  Wherever this bug was, it has apparently been fixed.

Status: NEW => RESOLVED
Resolution: (none) => FIXED

