Description of problem:
I am using the x86 version.
After upgrading my AMD64 system from Mga3 to Cauldron, on the first boot, the splash screen appears then it drops to the console screen with a few messages, such as:
"/dev/resume does not exist", followed by something about /run/initramfs/rdsosreport.txt. It then dropped into a dracut shell prompt for debugging.
I have tried booting an older kernel, but the same issue persists: /dev/resume does not exist.
I am able to boot in Safe Mode and get a complete desktop environment. I'm writing this bug report using safe mode, and now can't find the rdsosreport.txt to attach to this bug report. I will have to get it from the cli when I boot into the regular mode.
Version-Release number of selected component (if applicable):
/dev/resume does not exist. This is booting into Mga4 (Cauldron) after the upgrade.
Not sure how reproducible this is for anyone.
Steps to Reproduce:
Created attachment 4787 [details]
Grub Boot Files
This is a list of my grub boot files that have been installed. I plan to attach other related files when I get them available.
mageia, thierry.vignaud, tmb
Could you add to the bug report how you performed the upgrade and also the contents of /boot/grub/menu.lst please.
This can happen if you reformat your swap partition.
The kernel command line has a resume= argument which specifies which device to resume from (i.e. hibernation).
In an ideal world this wouldn't be any kind of fatal error, but sadly it seems it is - I will take a look at whether we can downgrade it.
Anyway, just edit your kernel command line and remove the resume= argument and you should be able to boot.
If you use the command blkid (as root), you can find the new UUID of your swap partition and add that in place of the old UUID.
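For anyone following along, the two workarounds above (dropping resume= or replacing its UUID) can be sketched in shell. This is only an illustration: resume_arg and strip_resume are made-up helper names, and the command lines they are fed are whatever your bootloader entry contains.

```shell
# Print the value of the resume= argument found in a kernel command line.
# $1 is the whole command line as one string. (Hypothetical helper.)
resume_arg() (
    set -f                      # no glob expansion while splitting words
    for word in $1; do
        case "$word" in
            resume=*) printf '%s\n' "${word#resume=}" ;;
        esac
    done
)

# Print the command line with any resume= argument removed.
strip_resume() (
    set -f
    out=""
    for word in $1; do
        case "$word" in
            resume=*) ;;                          # drop the stale argument
            *) out="${out:+$out }$word" ;;
        esac
    done
    printf '%s\n' "$out"
)

# On a real system (assumption: run as root):
#   resume_arg "$(cat /proc/cmdline)"      # what the kernel was told to resume from
#   blkid -t TYPE=swap -o value -s UUID    # the UUID the swap partition has now
```

If the two values differ, either edit the bootloader entry to use the UUID blkid reports, or boot with the output of strip_resume as a stopgap.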
Ok, I was informed to remove the line in the kernels about resume=, which I have done and I am able to boot into the regular desktop.
I'll look into adding the new UUID in the swap partition and keep you all informed of how it turns out.
Out of curiosity, did you reformat the swap partition at any point or run e.g. mkswap as part of your upgrade procedure? I'm not aware of this being done automatically during upgrade (it would actually be quite hard) and if you used the upgrade GUI (i.e. via DVD) it *should* (in theory at least) preserve the UUID even if you do reformat it.
No I did not partition or mkswap on the swap partition. Now that you mention swap, I have to turn on swap manually, (swapon /dev/sda6), it doesn't activate when I boot. It's been like that at least since Mga2, I think. I guess that would create a different UUID?
How do I have swap active each time I boot Mageia?
(In reply to Stephen Pettin from comment #6)
> How do I have swap active each time I boot Mageia?
Just correct the entry in /etc/fstab to use the correct UUID and it should activate on boot.
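To see which entry needs correcting, something like the following may help. It is a minimal sketch, assuming a conventional fstab layout where the third field is the filesystem type; fstab_swap_specs is a made-up name.

```shell
# Print the device spec (first field) of every swap entry in an fstab file.
# $1 is the path to the file; commented-out lines are skipped.
fstab_swap_specs() {
    awk '$1 !~ /^#/ && $3 == "swap" { print $1 }' "$1"
}

# On a real system:
#   fstab_swap_specs /etc/fstab            # what fstab expects
#   blkid -t TYPE=swap -o value -s UUID    # what actually exists (run as root)
```

If fstab prints UUID=<something> that blkid does not report, that is the line to update.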
OK, so I think it is just a product of having an old swap UUID in your fstab rather than a fundamental issue, so I'll close this report.
Feel free to reopen if you disagree with this analysis!
(In reply to Stephen Pettin from bug #12416 comment #13) (accidental bug mixup)
> That sounds good for the long term fix. What happens, in my case, when the
> swap partition changes or any partition UUID changes? Is it a way to fall
> back to Safe Mode instead of dropping to the console screen with some
> cryptic error messages?
Well, there are several bits to this really.
1. Our installer tries very hard to preserve UUIDs over formats so as to mitigate this, but obviously users can do this outside our installer.
2. As far as dracut knows, the device that's needed isn't known at that particular stage to be a (relatively) unimportant swap partition. It just knows "I need this device, and it's not shown up". Passing around the metadata that it's wanted but optional is a bit tricky.
> Maybe the error message can give a better explanation on how to solve it.
> I understand it may take quite a bit to do this and may need more man power
> also. This is just a suggestion.
Well my aim for MGA5 is to use systemd in the initrd too, thus this bit of infrastructure will change and the error messages will be different, so I would prefer to put effort into this for the future.
Hopefully it'll also be easier to add the necessary optional deps+timeout needed for making the resume= bit optional (but slows down boot by e.g. 5-10s if not found - we need to give it some time to appear after all).
Hope that makes sense.
PS we're no worse/better than Mageia 3 I believe... although that's a bit of a lame excuse :p
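To illustrate the "optional dependency plus timeout" idea mentioned above: the sketch below waits a bounded time for a path and then gives up. It is not how dracut's initqueue works, only the general shape; wait_for_path is a made-up name and the timeout values are arbitrary.

```shell
# Wait up to $2 seconds for path $1 to appear.
# Returns 0 if it showed up, 1 if the timeout expired first.
wait_for_path() (
    remaining=$2
    while [ ! -e "$1" ]; do
        [ "$remaining" -le 0 ] && exit 1
        sleep 1
        remaining=$((remaining - 1))
    done
    exit 0
)

# Intended use for an optional device: warn and carry on instead of failing.
#   wait_for_path /dev/resume 10 || echo "Warning: resume device not found, continuing" >&2
```

The cost is visible here: when the device never appears, boot is delayed by the full timeout.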
Thanks for the reply. I understand.
As several users were affected by this after upgrades, we should make sure, via the installer or as the last step before we recommend a reboot, that the UUID of the swap partition is the same as the one in fstab.
Marking as release_blocker for Mageia 5.
Release (media or process) =>
I was also affected by this issue while upgrading from Mageia 3.
(In reply to Colin Guthrie from comment #8)
> OK, so I think it is just a product of having an old swap UUID in your fstab
> rather than a fundamental issue, so I'll close this report.
> Feel free to reopen if you disagree with this analysis!
Nope, the UUID will be embedded into initrd via /etc/dracut.conf.d/51-mageia-resume.conf and maybe also something else, and it will fail badly when the UUID has changed since last boot (e.g. installation of another distro which calls mkswap or similar).
There's no easy way to work around that, or at least none that I know of, and this issue happens quite often. Also see the recent threads on the -dev ml, and one back from February: "is dracut's initrd married to /'s UUID"
Is this valid for Mageia 5?
colin any thoughts on that one ?
(In reply to Anne Nicolas from comment #15)
> colin any thoughts on that one ?
We could always try and kill off the /etc/dracut.conf.d/51-mageia-resume.conf file these days. Perhaps dracut is better at enabling LVMs when swap is on LVM?
Someone would need to check to see if it works OK with resume when swap is on an LVM (both an LV from the same VG and an LV from completely separate VG), and on RAID and make sure that in all cases dracut can activate it so it can check for resuming properly without that file.
I'm afraid I won't really have the time (or space) to setup so many test systems.
I've just been bitten by this bug (installing another OS reformatted my swap partition). I fixed it by booting the rescue system and following steps 2.2 to 2.5 in http://www.mageialinux-online.org/wiki/dracut-warning-could-not-boot. Failing a proper fix for this bug, could we add a script to the rescue system that automates this process?
I've submitted dracut-038-13.mga5 to core/updates_testing in Cauldron with the change Colin suggested during the council meeting today. Could someone verify that it will no longer fail the boot process if the UUID for the swap partition cannot be found? I believe for this bug to be triggered, there also has to be a resume=the-UUID part in the kernel line in the bootloader config.
I just installed a second distro on Saturday (3.21.15), and it happened again. I didn't think about not formatting swap, which it did, and I got the same error message. I ended up reinstalling Mga5 Beta 3 because of a different issue.
I used the KDE Livecd version.
I don't know much about this, but is there some way it could use /dev/sda instead of the UUID?
Just a thought!
(In reply to David Walser from comment #18)
> I believe for this bug to be triggered, there
> also has to be a resume=the-UUID part in the kernel line in the bootloader
No, it triggers without that, and when booting in failsafe mode as well (that's what makes it so hard to recover from). I'll try to test the update this evening - at least I now know how to recover if it doesn't work!
(In reply to Martin Whitaker from comment #20)
> (In reply to David Walser from comment #18)
> > I believe for this bug to be triggered, there
> > also has to be a resume=the-UUID part in the kernel line in the bootloader
> > config.
> No, it triggers without that, and when booting in failsafe mode as well
> (that's what makes it so hard to recover from). I'll try to test the update
> this evening - at least I now know how to recover if it doesn't work!
Yeah, I think the config file that causes this should only be written when there is a resume= command line argument, but once the initrd is generated it will still (try to) activate it all the time.
Created attachment 6134 [details]
log file from failed boot (with no resume= on the boot command line)
OK, I tested this by
- installing dracut-038-13.mga5
- running dracut --regenerate-all
- reformatting my swap partition
With no resume= on the boot command line, this fails with the message
dracut Warning: Cancelling resume operation. Device not found.
dracut Warning: Could not boot.
dracut Warning: /dev/disk/by-uuid/04fe59ec-e023-4a45-8426-b1e8d119fdcf does not exist
Adding resume=<new swap uuid> on the boot command line causes the extra message
ln: failed to create symbolic link '/dev/resume': File exists
to be output, but otherwise fails in the same way.
Looking in the attached rdsosreport.txt file, the problem seems to be that dracut is reading the kernel command line info stored in /etc/cmdline.d which is included in the initrd image, which still contains the old swap partition uuid.
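For anyone wanting to check their own image for a stale entry, here is a rough sketch. It assumes the initrd has already been unpacked into a directory and that the files under etc/cmdline.d are named *.conf and hold kernel-command-line-style words; embedded_resume_args is a made-up name.

```shell
# List resume= settings stored under etc/cmdline.d of an unpacked initrd tree.
# $1 is the directory the image was extracted to.
# Returns non-zero when no resume= entry is found.
embedded_resume_args() {
    [ -d "$1/etc/cmdline.d" ] || return 1
    cat "$1"/etc/cmdline.d/*.conf 2>/dev/null | tr -s '[:space:]' '\n' | grep '^resume='
}
```

If the UUID printed here is not one that blkid reports for the current swap partition, the initrd needs regenerating. The lsinitrd tool shipped with dracut can also be used to look inside an image without unpacking it by hand.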
It looks like we now have dracut-038-15.mga5 in core/release.
The new release was pushed by tmb for an unrelated matter, but the fix mentioned in comment 18 is now in core/release. Judging by comment 22 the fix does not seem satisfactory though; should it be reverted, or does one of you see a way to improve it?
Created attachment 6251 [details]
Patch to make dracut treat missing swap partitions as non-fatal error
I've done a bit more investigation on this. The problem is that if you build an initrd on a system with swap enabled, dracut adds the swap partition(s) to the list of devices it requires to be present, and won't continue boot unless they are there.
As it is perfectly possible to boot the system without the swap partition(s) being present, I suggest we fix this problem by adding a job in the dracut timeout initqueue to remove the swap partitions from the finished initqueue if they haven't appeared by the time the timeout expires. The attached patch is my attempt to do this. It works on my fairly simple system, but needs to be reviewed by someone more expert.
I notice there seems to be an attempt to do something similar in dracut already - it tries to remove /dev/resume from the list of required devices when the resume job times out. This seems to be rather broken; firstly, if I understand it correctly, /dev/resume is a symbolic link which only gets created if the underlying device is detected, and isn't in the list of required devices; secondly the function cancel_wait_for_dev() has a couple of bugs (which I've fixed as I use that function in my patch).
Can you post this patch to upstream (firstname.lastname@example.org IIRC, but please double check!).
Couple points that might be asked by Harald or others upstream.
The reason for waiting for swap devices is because a system may be suspended to disk and there may be critical, unsaved work there in that image.
The idea here is that if you reboot and throw-away that state, you could destroy data (albeit data that the user *should* have been more careful to save properly! :p)
In order to support this, any swap devices (and their underlying supporting infrastructure, e.g. RAID, LVM, crypt etc.) have to be brought up in the initramfs, and we have to check for a *lack* of saved state (assuming resume= command line option) before we know we can safely continue without destroying data.
If this patch allows you to silently continue when a swap device is not seen, you do run the risk of destroying this data.
I appreciate that this might be an edge case and that for most of the time, it's probably totally safe to just boot, but there are likely cases where this is not true also.
So with that in mind, it might be that a big warning needs to be printed out and instructions given to the user e.g.
Warning: we could not activate all registered swap devices to check for saved state.
If you are OK to continue (possibly losing any saved state you may have) then please type "exit" to continue and boot.
Or something like that. Personally I don't have too strong an opinion on the matter, and this may not be a requirement of the upstream guys, so take it with a pinch of salt. Obviously if there is no resume= argument then no warning need be shown if swap devices don't appear.
I do know that, in the past, people complained that resume images were not found when swap was on RAID/LVM drives. Your approach shouldn't interfere with this, so no problems on that front.
Oh and FWIW, I'd submit any fixes for cancel_wait_for_dev as a separate patch upstream.
Also, if you commit your patch locally, you can use git send-email to send it directly upstream, or git format-patch to create a nice patch, complete with proper attribution and commit message :)
(In reply to Colin Guthrie from comment #25)
> If this patch allows you to silently continue when a swap device is not
> seen, you do run the risk of destroying this data.
IIUC, the existing code is intended to do this already - it just doesn't work :-(
> So with that in mind, it might be that a big warning needs to be printed out
> and instructions given to the user e.g.
I considered doing this, but don't know what the mechanism is to make the message visible if the user has booted with "splash quiet". Can you point me at an example bit of code that does this?
Just since it hasn't been clearly stated on this bug report, only elsewhere where we've discussed this bug, besides dracut being fixed, which is most important, the DrakX installer should also ensure that consistent and correct UUIDs are used in the bootloader, fstab, and (if applicable) dracut configurations.
If either DrakX or dracut is fixed, this needn't be a blocker anymore.
Patch posted to email@example.com. The bug in cancel_wait_for_dev() was already fixed upstream (commit 7d97c7a).
System crashed after upgrading from Mga3 to Cauldron. Stops after splash screen with message, /dev/resume does not exist. =>
dracut fails the boot process when swap (resume) partition UUID not found; installer doesn't help prevent this
(In reply to Martin Whitaker from comment #29)
> Patch posted to firstname.lastname@example.org. The bug in cancel_wait_for_dev()
> was already fixed upstream (commit 7d97c7a).
Corresponding thread on gmane: http://comments.gmane.org/gmane.linux.kernel.initramfs/4111
Martin, it looks like there's interest in your patch upstream, but it will probably take some time to get it or a modified version of it integrated in the main repo. Should we apply this patch to our dracut for Mageia 5?
I was hit by this bug yesterday; it was late so I didn't take the time to copy the report.
I was performing a "standard" upgrade from Mageia 4 to Mageia 5 rc and the system now doesn't boot, with any kernel showing:
ln: failed to create symbolic link '/dev/resume': File exists
The system is pretty standard, since it dual boots with windows, has MBR and no LVM.
The upgrade was done with urpmi, so I don't think there were any swap reformats involved :)
I'm now reading the proposed "Dracut Warning: Could not boot" article. I won't be able to access the affected system until late afternoon, so maybe the article will be enough to recover. This bug could be a show-stopper for novices.
Colin, Rémi, could we apply the patch?
I've run into this issue myself in the past in mga4 with swap on a 2nd hdd, once the hdd was repartitioned it was unable to boot and not easy to fix.
While the patch is being considered for inclusion, is there any straightforward method to recover using the dracut shell or by performing a chroot from rescue media?
I would like to follow a sure path because even if I'm a sysadmin I wouldn't spend a lot of time trying different approaches to the problem.
Or does comment #4 do the trick (temporarily removing resume= in the grub command line, rebuilding the dracut rd, and booting normally)?
Ok I corrected the problem, there were two issues, one is the one covered by this bug, the other was a missing /dev/non-hostonly-lvm, had to symlink that to /dev/null before exiting dracut and boot; and I had to rebuild the initrd with -H to let the boot complete automatically.
OK, I've spent several hours today looking into this issue.
I initially looked at Martin's patch, but sadly none of the available forms would apply (neither the attachment nor the post to the mailing list), so I mangled it around until it was working, then backported it to our version of dracut.
I can confirm that it allows me to boot without regenerating initramfs when the UUID changes. There seems to be an approx 20s timeout in dracut and then (assuming your /etc/fstab is not updated) there is about a 1m30s timeout in the main OS waiting for the device there too. This is expected.
Seems to solve the problem OK.
But I'm not convinced it's the right approach.
The swap devices are only added to the initramfs to support resume= on the command line. There is no real need to set up wait_for_dev on them when building the initramfs; we can (and indeed should) do this dynamically at runtime depending on the kernel command line.
Now, in looking into this, I've found a rather interesting bug in dracut. In the not too distant past, it was reworked to allow a dracut --print-cmdline for generating the appropriate kernel command line needed to boot (we don't use this - yet).
In so doing, it embedded the resume= argument into the initrd which IMO is a bit wrong. If we disable this part of dracut (hostonly_cmdline=no) then any /usr device on LVM or RAID etc will not be properly assembled in the hostonly initramfs. Not ideal.
Really we want the raid/lvm stuff, but not the resume= stuff (we can rely on the command line for now).
So I've got three patches to dracut that do the following:
1. Do not call wait_for_dev for swap devices at build time.
2. Properly do wait_for_dev on /dev/resume based on the kernel command line.
3. Kill off any resume= saving in the initramfs
Patch 3 isn't actually quite as critical as it might sound. The worst that can happen after patches 1 and 2 is that it'll time out, but I still think we want the rest of the hostonly_cmdline stuff turned on.
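As a rough picture of what "based on the kernel command line" means in patch 2, the device to wait for can be derived from the resume= value at runtime rather than baked in at build time. The sketch below is not the patch itself; resume_to_dev is a made-up name, and it ignores the escaping udev applies to unusual label characters.

```shell
# Map a resume= value to the device node it refers to.
#   UUID=<id>   -> /dev/disk/by-uuid/<id>
#   LABEL=<l>   -> /dev/disk/by-label/<l>
#   anything else is assumed to be a device path already
resume_to_dev() {
    case "$1" in
        UUID=*)  printf '/dev/disk/by-uuid/%s\n'  "${1#UUID=}" ;;
        LABEL=*) printf '/dev/disk/by-label/%s\n' "${1#LABEL=}" ;;
        *)       printf '%s\n' "$1" ;;
    esac
}
```

With no resume= on the command line there is nothing to map, so nothing has to be waited for.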
Note: There were two other cherry picked patches before this too.
FWIW a lot of these problems go away when using systemd in dracut. So that'll be the first port of call when MGA6 reopens.
Submitted to core/updates_testing
Please test this thoroughly, ideally with swaps on LVM and RAID devices and actually suspending and (successfully) resuming to those swap partitions to make sure all is well in the world and there are no regressions!
(In reply to Colin Guthrie from comment #35)
> FWIW a lot of these problems go away when using systemd in dracut. So
> that'll be the first port of call when MGA6 reopens.
Why don't we do this BTW?
(In reply to Thierry Vignaud from comment #37)
> (In reply to Colin Guthrie from comment #35)
> > FWIW a lot of these problems go away when using systemd in dracut. So
> > that'll be the first port of call when MGA6 reopens.
> Why don't we do this BTW?
Mainly because the initrd stuff relies on proper udev-based activation of LVM+RAID and AFAIK this is all still rather hacky (e.g. we still rely on fedora-storage-init hack which shouldn't be needed).
It'll work for relatively simple setups but anything more complex (e.g. /usr on LVM LV on top of RAID) will likely fail horribly. Frankly I didn't have the time, motivation or hardware to fix the whole LVM+RAID mess so this had to stay out until it is all confirmed working. FWIW, it works fine for simpler setups.
Frankly I'm getting sick of us supporting such crazy setups in the installer. I think we should be super strict and force users into the one true way... Screw this choice nonsense ;)
(In reply to Colin Guthrie from comment #35)
> I initially looked at Martins patch, but sadly none of the forms available
> would apply sadly (neither the attachment or the post to the mailing list),
> but I mangled it around until it was working then backported to our version
> of dracut.
Sorry about that - looks like my mailer got creative and added some unwanted white space :-( The attachment shouldn't have been too far off, though, apart from being based on an installed system, not git.
(In reply to Colin Guthrie from comment #36)
> Please test this thoroughly, ideally with swaps on LVM and RAID devices and
> actually suspending and (successfully) resuming to those swap partitions to
> make sure all is well in the world and there are no regressions!
I've tested on my system by installing the new version from updates_testing, regenerating all my initrd files, then reformatting the swap partition. On reboot, with resume=<old-uuid> still on the boot command line and fstab unchanged, dracut reported
dracut Warning: Cancelling resume operation. Device not found.
and boot continued, with systemd subsequently timing out after 90s because it couldn't find the swap partition. Updating the boot command line and fstab restores the system to normal operation. I checked suspend/resume following this, and all seems OK.
Bottom line - looks good to me. I don't have a RAID system to test on, and have no experience with LVM.
(In reply to Martin Whitaker from comment #39)
> (In reply to Colin Guthrie from comment #35)
> > I initially looked at Martins patch, but sadly none of the forms available
> > would apply sadly (neither the attachment or the post to the mailing list),
> > but I mangled it around until it was working then backported to our version
> > of dracut.
> Sorry about that - looks like my mailer got creative and added some unwanted
> white space :-(
Yeah it's a common problem.
> The attachment shouldn't have been too far off, though, apart from being
> based on an installed system, not git.
Indeed, that was a problem, but even correcting the paths it didn't apply to git master.
No problem tho', it was fairly easy to do manually.
> (In reply to Colin Guthrie from comment #36)
> > Please test this thoroughly, ideally with swaps on LVM and RAID devices and
> > actually suspending and (successfully) resuming to those swap partitions to
> > make sure all is well in the world and there are no regressions!
> I've tested on my system by installing the new version from updates_testing,
> regenerating all my initrd files, then reformatting the swap partition. On
> reboot, with resume=<old-uuid> still on the boot command line and fstab
> unchanged, dracut reported
> dracut Warning: Cancelling resume operation. Device not found.
> and boot continued, with systemd subsequently timing out after 90s because
> it couldn't find the swap partition. Updating the boot command line and
> fstab restores the system to normal operation. I checked suspend/resume
> following this, and all seems OK.
Great! That is all expected and as intended! Thanks for that.
> Bottom line - looks good to me. I don't have a RAID system to test on, and
> have no experience with LVM.
While we should really test this, it will have no effect on normal operations here, so worst case scenario is that resume will break there. But I do highly doubt this to be the case and I think all will be well.
I'll ask for this version to be pushed.
*** Bug 14834 has been marked as a duplicate of this bug. ***
Thomas pushed this now so we can close.