Bug 10484 - dracut/initrd appears to NOT be honoring "rootdelay" option in grub2
Summary: dracut/initrd appears to NOT be honoring "rootdelay" option in grub2
Status: RESOLVED OLD
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 3
Hardware: i586 Linux
Priority: Normal normal
Target Milestone: ---
Assignee: Mageia Bug Squad
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-06-11 02:42 CEST by George Mitchell
Modified: 2015-03-31 16:06 CEST

See Also:
Source RPM: dracut-025-8.mga3.src
CVE:
Status comment:



Description George Mitchell 2013-06-11 02:42:57 CEST
Description of problem:

I am having an open_ctree problem with btrfs at boot.  One of the recommended band-aids for this problem is the rootdelay option on the kernel line in the grub2 configuration.  But no matter what value I give this option, even a ridiculously large one, the boot order remains unchanged: the initial ro root mount is attempted at exactly the same point in the boot order, and when it fails, exactly the same in-progress processes spill out on boot.  It is really frustrating to be dealing with one bug only to run into another that becomes a gotcha in the process.  I have reviewed countless examples of the use of rootdelay and it certainly should work.  I have tried all manner of different delay values and the process follows exactly the same pattern every time.  Here is the exact line from grub:

linux   /vmlinuz-server root=/dev/sda7 rootdelay=120 ro


Manuel Hiebel 2013-06-11 20:13:58 CEST

CC: (none) => mageia, zen25000

Comment 1 George Mitchell 2013-06-14 16:02:41 CEST
It also appears that dracut is NOT doing proper initial scans on btrfs filesystems (see Bug 9714).  I have collected two cases so far of impending boot failures on a btrfs root filesystem where dracut suddenly decided to do btrfs scans and the problem immediately resolved, apparently as a result.  The first time this happened I accepted that it could have been just a coincidence.  But after seeing it happen twice, I am convinced that there is a problem with dracut that is triggering these failures.  At the very least, dracut should be intervening with btrfs scans immediately upon any initial open_ctree failure, and it is NOT doing that.  Ideally dracut should be doing initial btrfs scans preemptively on a routine basis in a btrfs root environment.
Comment 2 George Mitchell 2013-06-14 16:19:38 CEST
In any case, an open_ctree failure should cause dracut to automatically WAIT for at least one second before attempting to rescan btrfs and proceed with the boot process.  That also appears likely to resolve this problem.  I am more convinced than ever that this problem is occurring because the btrfs filesystem has not had sufficient time to prepare itself for the boot process.  In that way it is similar to a problem that can occur with conventional raid arrays, whereby they are not ready when the boot process first tries to mount them and the boot fails as a result.  This problem is only going to get worse as disk sizes grow larger and filesystems grow more complex.  The easy fix would be to resolve the problems with the "rootdelay" kernel option in dracut/initrd.
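
The behaviour requested in this comment (wait, rescan, retry) can be sketched in shell.  This is only an illustration and not dracut's actual code: the function name retry_mount and its four arguments are hypothetical, and the mount point used in the usage note is an assumption.

```shell
#!/bin/sh
# Hypothetical helper, not part of dracut: try a mount command, and on each
# failure wait a fixed delay, run a rescan command, then try again.
# Usage: retry_mount MAX_TRIES DELAY_SECONDS MOUNT_CMD SCAN_CMD
retry_mount() {
    max_tries=$1
    delay=$2
    mount_cmd=$3
    scan_cmd=$4
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # Left unquoted on purpose so "cmd arg arg" splits into words.
        if $mount_cmd; then
            return 0
        fi
        sleep "$delay"
        $scan_cmd || true   # a failed rescan should not abort the retries
        try=$((try + 1))
    done
    return 1
}
```

Under these assumptions, something like retry_mount 5 1 "mount -o ro /dev/sda7 /sysroot" "btrfs device scan" would make up to five attempts, one second apart, rescanning after each failure.  /dev/sda7 is the root device from this report; /sysroot as the mount point is assumed.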
Comment 3 Barry Jackson 2013-06-14 16:56:07 CEST
The bug title seems to imply that this only happens with grub2, however I'm guessing that this is not the case?
Does it happen using grub?
If it does not only happen with grub2, please edit the title.
Comment 4 George Mitchell 2013-06-14 17:20:23 CEST
Since legacy grub does not handle btrfs root file systems, there is really no way I can test this with legacy grub.  So I guess the answer is "I don't know."
Comment 5 Sander Lepik 2013-06-14 17:23:38 CEST
Just a question (bit OT): why not use /boot partition?

CC: (none) => sander.lepik

Comment 6 George Mitchell 2013-06-14 17:25:24 CEST
If there is any way I can be of further help with this let me know.  For example, I can run it with an rd.debug option and post the results if that would help.
Comment 7 George Mitchell 2013-06-14 17:28:03 CEST
Sander, I AM using a separate /boot partition.  The problem is with the initial ro mount of the  /(root) filesystem.  The /boot partition is being read successfully by grub2, but does not actually get mounted until far later in the boot process.  The initial mount is the ro mount of /, and that is what is failing.
Comment 8 Barry Jackson 2013-06-14 20:20:14 CEST
(In reply to George Mitchell from comment #4)
> Since legacy grub does not handle btrfs root file systems, there is really
> no way I can test this with legacy grub.  So I guess the answer is "I
> don't know."

Ah yes - my mistake.

Did you try "rootwait"?
IIUC it should wait indefinitely for the root device to show up, compared to rootdelay which takes a timeout parameter.
Comment 9 Sander Lepik 2013-06-14 20:41:39 CEST
(In reply to George Mitchell from comment #7)
> Sander, I AM using a separate /boot partition.  The problem is with the
> initial ro mount of the  /(root) filesystem.  The /boot partition is being
> read successfully by grub2, but does not actually get mounted until far
> later in the boot process.  The initial mount is the ro mount of /, and that
> is what is failing.

If you have a separate /boot partition then what is stopping you from using legacy grub?
Comment 10 George Mitchell 2013-06-14 20:44:22 CEST
Thanks for the suggestion, but rootwait does not work because the root device HAS shown up, but it is NOT READY.  That is the problem.  Only an arbitrary wait can resolve this problem because the root device APPEARS to be ready, but it is NOT ready.
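
The distinction drawn in comments 8 and 10 can be sketched as two small shell functions.  This is an illustration under stated assumptions, not kernel or dracut code, and both function names are hypothetical: one pauses for a fixed time, the other only waits for a device node to exist.

```shell
#!/bin/sh
# Illustration only: the two waiting strategies discussed in this thread.

# rootdelay-style: pause for a fixed number of seconds, regardless of state.
fixed_delay() {
    sleep "$1"
}

# rootwait-style: poll until the path exists, with no upper limit.
# It returns as soon as the node appears, even if the filesystem
# behind it is not yet ready to be mounted.
wait_for_path() {
    while [ ! -e "$1" ]; do
        sleep 1
    done
}
```

In this sketch, wait_for_path /dev/sda7 returns immediately once the device node is present, which matches the situation described here: the device has shown up, so only a fixed pause such as fixed_delay 5 adds any time.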
Comment 11 George Mitchell 2013-06-14 20:53:24 CEST
Sander, the /boot partition is also on btrfs in order to use the raid feature.  That is why it does not work with legacy grub.  I realize I could solve this whole thing in a snap simply by moving everything back to my old 3ware raid cards and ext4 and calling it a day.  But at some point it is going to have to be fixed, and until it is, it is a bug.  I realize that it may not be a very important bug and may not be high on the priority list to fix, and I accept that.  But please don't try to live in denial about it.

It also may well be a bug that is affecting other users using other file systems in other situations.  The rootdelay option does work with other distributions, so the problem is unique to Mageia or to whatever upstream source Mageia derived the code from, OR is some sort of regression in the upstream code.

At this point I have a satisfactory, if messy, workaround that is seemingly forcing it through eventually, since I have not had an actual boot failure in the last two dozen boots.  But there is a problem, and that problem is revealed by the dracut trail:

Jun 14 06:28:09 localhost.localdomain dracut: dracut-3 dracut-025-
Jun 14 06:28:09 localhost.localdomain dracut: Starting plymouth daemon
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Scanning for all btrfs devices
Jun 14 06:28:09 localhost.localdomain dracut: failed to read /dev/sr0
Jun 14 06:28:09 localhost.localdomain dracut: Scanning for Btrfs filesystems
Jun 14 06:28:09 localhost.localdomain dracut: Checking, if btrfs device complete
Jun 14 06:28:09 localhost.localdomain dracut: Remounting /dev/sda7 with -o relatime,ro
Jun 14 06:28:09 localhost.localdomain dracut: Mounted root filesystem /dev/sda7
Jun 14 06:28:09 localhost.localdomain dracut: Mounting /usr with -o subvol=USR,relatime,ro
Jun 14 06:28:09 localhost.localdomain dracut: Switching root
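
A trail like the one above is easier to read when consecutive duplicate messages are collapsed into a count.  A minimal sketch follows; the function name collapse_dracut_log is hypothetical, and it assumes syslog-style lines in which the message follows the text "dracut: ".

```shell
#!/bin/sh
# Hypothetical helper: read syslog-style lines on stdin, keep only the text
# after "dracut: ", and collapse consecutive duplicates with a count.
collapse_dracut_log() {
    sed -n 's/^.*dracut: //p' | uniq -c
}
```

It could be fed from a saved copy of the log, for example collapse_dracut_log < boot.log, where the file name boot.log is an assumption.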
Comment 12 George Mitchell 2013-06-15 02:06:52 CEST
OK Sander, I think I get what you are saying.  It has taken me a while to think this over.  I think the problem here is that:

1) This could be a grub2 bug whereby grub2 is not forwarding the option to the kernel.

or even

2) This could be an obscure kernel bug whereby the kernel itself is not honoring the option.

Thanks! (and my apologies for not thinking this over before answering).  I am going to try to get further information for you.
Comment 13 Barry Jackson 2013-06-15 15:09:43 CEST
George,
There is no known reason (according to upstream) why grub2 would not pass rootdelay to the kernel.
cat /proc/cmdline will verify this:

[root@jackodesktop baz]# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-desktop root=UUID=2315b9d2-dc3c-4f3c-9d0c-c72fb612c011 ro splash rootdelay=2

(I edited /etc/default/grub with GRUB_CMDLINE_LINUX_DEFAULT="splash rootdelay=2" then ran update-grub)

However, I did not notice a 2 second delay during boot :\
I will test again with 10 seconds to be sure, which should double my boot time.
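
The check against /proc/cmdline shown in this comment can be scripted.  The sketch below is an illustration only: the function name karg_value is hypothetical, and taking the last occurrence when a key appears more than once is an assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY from a kernel command line
# string.  If KEY appears more than once, the last occurrence is printed
# (an assumption of this sketch).  Exit status is 1 when KEY is absent.
# Usage: karg_value KEY CMDLINE
karg_value() (
    set -f          # no glob expansion while splitting the string
    key=$1
    status=1
    for arg in $2; do
        case $arg in
            "$key"=*) value=${arg#"$key"=}; status=0 ;;
        esac
    done
    if [ "$status" -eq 0 ]; then
        printf '%s\n' "$value"
    fi
    exit "$status"  # the function body is a subshell, so this ends only it
)
```

On a running system it would be called as karg_value rootdelay "$(cat /proc/cmdline)"; with the command line quoted in this comment it prints 2.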
Comment 14 Barry Jackson 2013-06-15 15:19:09 CEST
It does seem to be ignored - I see no change with it set at 10.
Comment 15 Sander Lepik 2013-06-15 15:29:53 CEST
Not even if you move it before root=UUID=2315b9d2-dc3c-4f3c-9d0c-c72fb612c011 ?
I remember I have had problems when parameters are at the end. nokmsboot was probably one of those that didn't work at the end.
Comment 16 George Mitchell 2013-06-15 15:55:17 CEST
Thanks so much Barry.  I KNEW there was somewhere on the system where the kernel command line was stored but couldn't remember where it was.  I was looking all through the logs in /var for it.  /proc!  Of course.  Thanks for finding that.

So that means the culprit IS the kernel after all.  The kernel is receiving the rootdelay option and for some obscure reason is ignoring it.  Why?  That is the question.  So I guess this bug should be charged against the kernel and not dracut?
Comment 17 Barry Jackson 2013-06-15 18:38:18 CEST
(In reply to Sander Lepik from comment #15)
> Not even if you move it before
> root=UUID=2315b9d2-dc3c-4f3c-9d0c-c72fb612c011 ?
> I remember I have had problems when parameters are at the end. nokmsboot was
> probably one of those that didn't work at the end.

Nope - I just booted with:

[baz@jackodesktop ~]$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-desktop rootdelay=20 root=UUID=2315b9d2-dc3c-4f3c-9d0c-c72fb612c011 ro splash

...and no delay :(
Comment 18 Marja Van Waes 2015-03-31 16:06:27 CEST
Mageia 3 changed to end-of-life (EOL) status 4 months ago.
http://blog.mageia.org/en/2014/11/26/lets-say-goodbye-to-mageia-3/ 

Mageia 3 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Mageia,
please feel free to click on "Version", change it to that version of Mageia,
and reopen this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

--
The Mageia Bugsquad

Status: NEW => RESOLVED
Resolution: (none) => OLD

