| Summary: | System refuses to boot if secondary hard drive is experiencing SMART errors | ||
|---|---|---|---|
| Product: | Mageia | Reporter: | Paul Hounsell <paul.c.hounsell> |
| Component: | RPM Packages | Assignee: | Mageia Bug Squad <bugsquad> |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | critical | ||
| Priority: | Normal | CC: | ftg, marja11, nic, pterjan, thierry.vignaud, tmb |
| Version: | Cauldron | ||
| Target Milestone: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Source RPM: | | CVE: | |
| Status comment: | |||
Description
Paul Hounsell
2015-02-11 21:28:47 CET
That's interesting. I have a system with two SATA drives, and the secondary one gets SMART errors from the BIOS at boot, which require me to enter BIOS setup and press Esc to leave it before the machine proceeds to the GRUB menu. The secondary disk has only one partition in the primary system's fstab, and the Mageia boot proceeds without errors once GRUB is reached. So I think the problem may be the options on your fstab lines for the secondary partitions. That is, unless the SMART problem is so bad that the entire drive is seen as unusable, in which case you'd need to take it out of fstab and hook it up through USB for recovery. If your fstab says "I need this guy" and the kernel can't initialize the disk, that's a conflict that Linux can't really resolve for you. But whenever this happens to me, the boot drops me into a recovery shell from which I can modify fstab. How exactly does your boot fail?

CC: (none) => ftg

Hello Frank,
The system just locks up solid; I have to power-cycle the box. I don't get any shell access or anything: a 100% lockup. I shut down the system and removed the bad drive. Unfortunately I don't have logs of the problem. Here is the fstab line:

    # Entry for /dev/sdc1 :
    UUID=119e1634-ef69-4ce9-bc6e-e014ee504b4c /var/backup ext4 defaults 1 2

Installer and release bugs should always be set to Cauldron, because once the ISOs for a stable version have been created, they cannot be changed. This bug was filed long ago, for Mageia 4. Was it still valid for Mageia 5? (If so, please do _not_ set the version to "5". I'm only asking because if it was still valid for Mageia 5, we'll know we have to pay attention to this issue when testing the 6 alphas and later.) Please close this bug report if the problem was solved in Mageia 5.

Keywords: (none) => NEEDINFO

Actually not an installer bug; it could be fixed as an update. If the bug affects Mageia 5, please add MGA5TOO to the whiteboard.

Source RPM: unknown => (none)

Hi Paul,
Any more information? Or has the disk failed completely, so you can't test it any more?
Nic

CC: (none) => nic

The drive is nearly dead. Linux does not see the drive, but I am able to use Windows disk recovery tools to get back about half of the data on the disk. My complaints are the following:

1) Linux should give up trying to mount a failed drive and come up in "safe" mode, a minimal OS that lets you do something with the system. I could not edit fstab because the system would lock up completely, so I was stuck. Linux should try to mount the failed disk a few times and then drop the disk from the mounted file systems. Of course this won't work if the OS disk is the one that has failed.
2) Before a disk fails there are usually read and write errors. Linux should have a threshold of failed reads and/or writes, then pop up an alert window telling the user that the disk is failing, and try to do a backup as soon as possible.

Question: are there any Linux disk recovery tools other than dd? dd only works if the disk is good; it can't skip past bad sectors and continue. Also, dd is not what I want: I want to recover as many complete files as I can and skip the bad sectors.

I hope this helps.
Paul

Hi Paul,
I will respond to your two points and your question.

1) Mageia does have a fallback of dropping to a recovery shell, which would allow editing fstab, then either marking the drive 'nofail' or commenting it out. Your issue is that the computer locks up and never reaches the rescue shell. Might there be an inconsistent timeout on the probe? This is something that should be looked at.
2) This is eminently doable: just install a drive monitoring program.

Question: I have used testdisk to recover partitions and photorec to recover files. dd is not a recovery tool but a disk cloning tool. The idea is to clone the drive and then run recovery tools on the image; this way there is no risk of further damaging the drive. There is another tool, ddrescue, which appears to be useful, but I haven't used it.

dd rescue is like dd, but it will try exhaustively to read a bad block, e.g. forwards, backwards, and together with the blocks on either side. If you're interested in losing as little data as possible, it's probably your best bet. The idea is to get a copy of the partition onto some other partition of the same size, e.g. /dev/sda12, then mount that partition and recover your files from there.
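The copy-then-recover approach above could look roughly like the following sketch. It assumes the tool meant is GNU ddrescue (the separate dd_rescue program uses different options), with /dev/sdc1 as the failing partition and /dev/sda12 as an unused partition at least as large, as in the example above. The helper names `run` and `rescue_copy` are hypothetical. The map file records which areas have been read, so a second run can go back and retry only the bad ones; it must live on a healthy disk, never on the failing one.

```shell
# Hypothetical helpers; assumes GNU ddrescue is installed.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# rescue_copy SRC DST MAPFILE
# SRC: failing partition; DST: spare partition at least as large (its
# contents are overwritten); MAPFILE: progress file on a healthy disk.
rescue_copy() {
  src=$1 dst=$2 map=$3
  # Pass 1: copy everything that reads cleanly and skip the slow
  # scraping of bad areas (-n). -f is required to write to a device.
  run ddrescue -f -n "$src" "$dst" "$map" || return
  # Pass 2: return to the areas the map file marks as bad and retry
  # each of them up to 3 times (-r3).
  run ddrescue -f -r3 "$src" "$dst" "$map"
}
```

For example, `rescue_copy /dev/sdc1 /dev/sda12 /root/sdc1.map`, run as root with neither partition mounted. Afterwards the copy can be checked and mounted read-only (e.g. `mount -o ro /dev/sda12 /mnt`), or handed to photorec or testdisk as Nic suggests. Doing the quick pass first is a deliberate choice: it rescues the easy data before the drive has a chance to degrade further during the slow retries.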
Another option is to boot an install image (boot.iso) in rescue mode and run, on the unmounted partition:
`e2fsck -p -c -C 0 /dev/sdc1`
which will check the file system and scan the partition for bad blocks with badblocks. This will take a very long time, but it will fix your partition "in place". Any files containing bad blocks will have those blocks replaced by new good blocks, but there is no guarantee that the content of the new block will match that of the old block, so files recovered this way may be damaged and contain gaps.
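Point 2 in Paul's list (warning the user before a failing disk dies) is what Nic's "drive monitoring program" suggestion covers. As one illustration, assuming the smartmontools package is installed (it reads the same SMART data the BIOS complained about), the sketch below decodes the exit status of `smartctl -a`, which smartctl documents as a bit mask. The function names are hypothetical, and only some of the bits are decoded; for continuous monitoring with alerts, smartmontools' own `smartd` daemon would be the more usual choice than a script like this.

```shell
# Hypothetical helpers; assume smartmontools' smartctl is installed.
# describe_smart_rc CODE: turn smartctl's bit-mask exit status into words.
describe_smart_rc() {
  rc=$1
  msgs=""
  # Bits 0-2: smartctl itself failed (bad arguments, device could not be
  # opened, or a SMART command failed), so little is known about the disk.
  [ $((rc & 7)) -ne 0 ] && msgs="$msgs smartctl-error"
  # Bit 3: the overall SMART health check says the disk is failing.
  [ $((rc & 8)) -ne 0 ] && msgs="$msgs disk-failing"
  # Bit 4: a prefailure attribute is currently at or below its threshold.
  [ $((rc & 16)) -ne 0 ] && msgs="$msgs prefail-threshold"
  # Bit 6: the device's error log contains records of errors.
  [ $((rc & 64)) -ne 0 ] && msgs="$msgs errors-logged"
  [ -z "$msgs" ] && msgs=" ok"
  echo "${msgs# }"
}

# disk_health DEVICE: query the drive and print a one-line verdict.
disk_health() {
  smartctl -a "$1" >/dev/null 2>&1
  describe_smart_rc $?
}
```

Run as root against the whole disk, e.g. `disk_health /dev/sdc`; anything other than `ok` would be a reason to copy the data off before the drive gets worse.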
I don't know what to do with this bug report. Should it be changed into an enhancement request? If so, for what exactly?

Keywords: NEEDINFO => (none)

By default we run in "better safe than sorry" mode, as we can't reliably detect all reasons for failure, and we don't know what the user considers critical or not. So the choice of what to do next is an end-user / sysadmin decision, not a distro-wide one. If you want your system to boot up even if a mount point fails, add the "nofail" option to that specific mount.

Resolution: (none) => WONTFIX
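Applied to the fstab line Paul posted, the suggested change is simply turning `defaults` into `defaults,nofail` in the options field. The sketch below makes that edit from a shell, for instance the recovery shell Nic mentions once the root file system is writable. `add_nofail` is a hypothetical helper, not a Mageia tool; it prints the modified file on standard output so the result can be reviewed before it replaces /etc/fstab. Note that `nofail` only lets the boot continue past a device that fails to mount; it may not help if the lockup happens earlier, e.g. while the kernel is still probing the disk.

```shell
# add_nofail FSTAB MOUNTPOINT (hypothetical helper)
# Prints FSTAB with "nofail" appended to the options (4th field) of the
# entry mounted on MOUNTPOINT. Comments, blank lines and other entries are
# printed unchanged; an entry that already has nofail is left alone.
# Note: awk rebuilds the edited line with single spaces between fields.
add_nofail() {
  awk -v mnt="$2" '
    /^[ \t]*#/ || NF < 4 { print; next }
    $2 == mnt && $4 !~ /(^|,)nofail(,|$)/ { $4 = $4 ",nofail" }
    { print }
  ' "$1"
}
```

For example, `add_nofail /etc/fstab /var/backup > /etc/fstab.new`, then compare the two files and move the new one into place. With Paul's entry, the edited line becomes `UUID=119e1634-ef69-4ce9-bc6e-e014ee504b4c /var/backup ext4 defaults,nofail 1 2`.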