Bug 9446

Summary: Fails to boot with Disk Not Found error when trying to mount /usr partition
Product: Mageia Reporter: Richard Walker <richard.j.walker>
Component: RPM Packages Assignee: Mageia Bug Squad <bugsquad>
Status: RESOLVED FIXED QA Contact:
Severity: critical    
Priority: Normal CC: mageia
Version: Cauldron   
Target Milestone: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Source RPM: CVE:
Status comment:
Attachments: sosreport

Description Richard Walker 2013-03-18 20:38:29 CET
Description of problem:
For a full description please see https://forums.mageia.org/en/viewtopic.php?f=15&t=4567

After working well for more than a day following the first successful install (2nd attempt), the system failed to reboot today after a session of around 30 minutes of trouble-free use, during which some updates may have been installed. I cannot check this, as the system log has been strangely silent about my use of the machine after 16:20 yesterday.

The boot stops after complaining it cannot find the disc it is booting from, nor either of the / or /usr partitions.

blkid at the dracut prompt returns an empty list. 

The associated sosreport.txt with added rd.debug goodness has been recovered.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. boot the affected machine
Comment 1 Richard Walker 2013-03-18 20:42:57 CET
Created attachment 3634 [details]
sosreport
Manuel Hiebel 2013-03-19 00:05:29 CET

CC: (none) => mageia

Comment 2 Colin Guthrie 2013-03-19 10:43:21 CET
Hmm, it's hard to say what's going on here, but clearly none of the disks have been detected (i.e. no UUIDs are showing up, as evidenced by the blkid output).

This likely suggests some kind of controller interface problem, in that the relevant kernel modules are not loaded for your particular h/w. You seem to suggest that the modules look correct, however?

Can you double check if any /dev/sd* devices or /dev/hd* devices exist in the initrd shell?
Comment 3 Richard Walker 2013-03-19 14:22:37 CET
(In reply to Colin Guthrie from comment #2)
Definitely NO for hd*. sd* is not a problem; for example, I was able to plug in /dev/sda to capture the log output to /dev/sda8.

I don't know if it helps or is a red herring, but Bug 9450 relates to failure of the install medium on the same setup. One of the logs from that says that sd_mod cannot be found! However, as explained in the background info from the forum message referred to above, this system was booting from the expansion card's PATA drive (aka hda1) on Monday (18th March) until the reboot around 15:00-15:30. There is no record in the system's log of the immediately preceding shutdown, nor any other activity back to 16:20 on Sunday.

Richard
Comment 4 Colin Guthrie 2013-03-19 14:24:36 CET
How are you looking at the system log? Are you using journalctl or just looking in the old log files? If the latter, and rsyslog or similar is not running, then you won't see much in them!
Comment 5 Colin Guthrie 2013-03-19 14:30:13 CET
Also just to confirm, you had a working system which you presumably rebooted a few times, but then, after updates, it now refuses to boot? Is that the general problem in broad strokes?

Can you confirm that an older kernel+initrd combo still works on the same machine?

If there are no new kernels installed during the updates, perhaps the initrd was just rebuilt which caused the breakage. In that case can you try editing the grub command line and rather than using initrd-desktop.img (or similarly named file) try initrd-desktop.img.old. This would get a booting machine and allow us to compare the contents of the initrd. If it does boot, please don't apply any further updates which might overwrite things until we work out the problem.
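
One way to make that comparison concrete, as a minimal sketch only: it assumes the file list of each initrd has already been dumped to a list of paths (for example with dracut's lsinitrd tool) and reports kernel modules present in one image but not the other. The function names and the sample paths are illustrative assumptions, not taken from this report.

```python
import posixpath

# Suffixes used for kernel modules, plain or compressed.
MODULE_SUFFIXES = (".ko", ".ko.gz", ".ko.xz")


def module_names(paths):
    """Return the set of kernel module file names found in a list of initrd paths."""
    return {
        posixpath.basename(p.strip())
        for p in paths
        if p.strip().endswith(MODULE_SUFFIXES)
    }


def compare_initrd_modules(old_paths, new_paths):
    """Return (missing_in_new, added_in_new) as sorted lists of module file names."""
    old, new = module_names(old_paths), module_names(new_paths)
    return sorted(old - new), sorted(new - old)


if __name__ == "__main__":
    # Hypothetical listings; real ones would come from the two initrd images.
    old_listing = [
        "usr/lib/modules/3.8.1-desktop-1.mga3/kernel/drivers/scsi/sd_mod.ko.xz",
        "usr/lib/modules/3.8.1-desktop-1.mga3/kernel/drivers/ata/libata.ko.xz",
    ]
    new_listing = [
        "usr/lib/modules/3.8.3-desktop-1.mga3/kernel/drivers/ata/libata.ko.xz",
    ]
    missing, added = compare_initrd_modules(old_listing, new_listing)
    print("missing in new initrd:", missing)
    print("added in new initrd:", added)
```

Comparing by module file name rather than full path keeps the result meaningful when the two images were built for different kernel versions.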

Cheers!
Comment 6 Richard Walker 2013-03-19 15:01:59 CET
journalctl - don't know any other way now that it has been improved. I am not a regular reader of log files so I am not sure what to expect, but in the good old days I always seemed to find records of actual activity in /var/log/messages. To find nothing happened on a machine I had used and booted and rebooted for nearly 24 hours is ... unusual I think?

The actual method used was:

1. Attach external MGA3 boot drive to sick machine
2. Boot external drive
3. Mount sick system partitions under rescue system's /mnt
4. Mount (bind) /proc, /sys, /dev, /dev/shm, /run
5. chroot to sick root
6. Poke around ... running journalctl produces output referring to the name of the sick machine, so I jumped to the conclusion that it was, indeed, the correct log.
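
Steps 3 to 5 above can be sketched roughly as follows. This is a hedged illustration, not the exact commands that were typed: the helper only builds the mount and chroot command lines without running them, and the device names in the example are hypothetical.

```python
import posixpath

# Pseudo-filesystems bind-mounted from the rescue system, as listed in step 4.
BIND_MOUNTS = ["/proc", "/sys", "/dev", "/dev/shm", "/run"]


def chroot_commands(partitions, target="/mnt"):
    """Build the commands for steps 3 to 5 without executing anything.

    partitions maps a mount point inside the sick system (e.g. "/" or "/usr")
    to a block device. Shorter mount points are handled first, so "/" is
    mounted before "/usr".
    """
    commands = []
    for mount_point in sorted(partitions, key=len):
        dest = posixpath.join(target, mount_point.lstrip("/")).rstrip("/")
        commands.append(["mount", partitions[mount_point], dest])
    for pseudo in BIND_MOUNTS:
        commands.append(["mount", "--bind", pseudo, target + pseudo])
    commands.append(["chroot", target])
    return commands


if __name__ == "__main__":
    # Hypothetical device names for the sick system's / and /usr partitions.
    for cmd in chroot_commands({"/": "/dev/sdb1", "/usr": "/dev/sdb6"}):
        print(" ".join(cmd))
```

Returning the commands as lists makes it easy to review them before passing each one to something like subprocess.run as root.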

R

Oops, mid-air collision, says Bugzilla. Here are my further responses to your second comment:

1. Broad strokes, correct.
2. On my list of things to do. I was reluctant to dive in and trash the faulty system with a fresh install; firstly because I would lose all the evidence and secondly because it was humming along quite-nicely-thankyou before it went toes up. I have another spare disc (same brand, size, age too probably) onto which I started to install MGA3B3 (see Bug 9450 for why I haven't finished that yet) but I will be able to tell early this evening if older is better. The original working install had accepted at least one kernel update with no problems - some time Saturday or Sunday so MGA3B3 install iso kernel will be at least older than last weekend.
3. Interestingly enough I do not recall seeing a saved initrd image. Could my multiple attempts using dracut -f -H and bootloader-config have something to do with that? Still I won't say for certain until I take another good look at the /boot contents. One thing I do know for certain. There are files in /boot/grub datestamped Monday afternoon (just before the boot failure?) which I _know_ were created on Sunday evening as one of them is a personal backup of menu.lst (something renumbered the partitions incorrectly, so I thought I would keep the working version in case it happened again).

Just a thought, I could maybe dd the drive to a spare and send it to you??

Richard
Comment 7 Richard Walker 2013-03-19 23:29:53 CET
OK, baby steps. Re-install from MGA3B3 iso-on-a-stick onto an alternative 10G drive connected by USB (as before) has completed successfully. It is configured to update but I haven't done that yet.

I have re-booted a few times just to be sure and I am now going to bring it up with a data drive connected to the PATA card's secondary channel - just to be sure it is recognised though I predict it will be sd* rather than hd* as that was how this PATA card's drives appear from my main transportable USB MGA3 boot drive.

Back in a mo ... or two

R
Comment 8 Richard Walker 2013-03-19 23:42:19 CET
Update from the offending system. I was wrong. The data drive is in fact on hdc. I am strangely encouraged by that. I am going to try for a boot from hda now...
Comment 9 Richard Walker 2013-03-20 02:09:49 CET
That's it for tonight. I am still getting the same boot problem starting with a fresh MGA3B3. I am going to try to find my MGA3B2 iso and repeat the original attempts.

Richard
Comment 10 Richard Walker 2013-03-20 02:40:52 CET
I spoke too soon, but this is really the last update tonight: I checked the /boot directory again to see if the server kernel had really been installed alongside the desktop version I am using now. It looks to me like it is there and could be booted, but I'm no expert - just curious as to why it should have been installed at all!

Anyway, I spotted two initrds where there should have been only one. It looks like I took the dracut man page too literally and called my rebuilt version .image instead of .img. One quick fix and a reboot later and here I am, successfully booted from /dev/hda1 into a fresh MGA3B3.
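
A naming slip like this can be caught with a small check, sketched here under assumptions: it supposes a GRUB legacy menu.lst whose entries use "initrd" lines, and it reports any referenced initrd file that does not exist in the boot directory. The function name and the file names in the example are hypothetical.

```python
import os
import re

# Matches GRUB legacy lines such as "initrd (hd0,0)/boot/initrd.img".
INITRD_LINE = re.compile(r"^\s*initrd\s+(?:\([^)]*\))?(\S+)")


def missing_initrds(menu_lst_text, boot_dir):
    """Return initrd file names referenced in menu.lst but absent from boot_dir."""
    missing = []
    for line in menu_lst_text.splitlines():
        match = INITRD_LINE.match(line)
        if not match:
            continue
        name = os.path.basename(match.group(1))
        # os.path.exists follows symlinks, so a dangling link counts as missing.
        if not os.path.exists(os.path.join(boot_dir, name)) and name not in missing:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical menu entry; the real file names may differ.
    sample = (
        "title linux\n"
        "kernel (hd0,0)/boot/vmlinuz root=/dev/hda1\n"
        "initrd (hd0,0)/boot/initrd-desktop.img\n"
    )
    print(missing_initrds(sample, "/boot"))
```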

The kernel version is 3.8.1-desktop-1. I am pretty sure I was able to update the desktop kernel at least once at the weekend and I installed and successfully ran the RT kernel. It is too late now but I will try RT again first thing tomorrow evening, and check to see which kernel the original failed boot is trying to run, and whether I have any older kernels + matching initrds (I doubt it as I normally don't leave stuff lying around if I don't expect to use it again).

Always open to suggestions,

Richard
Comment 11 Richard Walker 2013-03-22 03:29:29 CET
The trouble has passed. I kept my duplicate install (call it C) on this machine in its pristine condition; it is booting fine in the conditions which defeated its older twin (which we'll call B), but with an older kernel - the stock MGA3B3 kernel.

I am using the original failed system now. It was stuck on kernel 3.8.3-desktop-1 and now is running happily with 3.8.3-desktop-2.

I fell victim to the rpmdrake problem yesterday and wasted a lot of time trying to fix it, not realising that it hadn't been caused by me abandoning an update session which seemed to have hung. When tonight's replacements fixed my main MGA3 test system (on a portable drive, call it A) I took the plunge and first updated the C system; rpmdrake first, it worked, then the kernel (nothing else) and it worked too - 3.8.3-desktop-2.

From the working C system I mounted and chrooted to the B system, fixed its device.map, got its network running and updated it. The reboot to the repaired B system is completely successful. 

We may not know the cause, but it looks like it is fixed for me now.

Status: NEW => RESOLVED
Resolution: (none) => FIXED