Bug 15268 - System refuses to boot if secondary hard drive is experiencing SMART errors
Summary: System refuses to boot if secondary hard drive is experiencing SMART errors
Status: RESOLVED WONTFIX
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: Cauldron
Hardware: All Linux
Priority: Normal
Severity: critical
Target Milestone: ---
Assignee: Mageia Bug Squad
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-02-11 21:28 CET by Paul Hounsell
Modified: 2016-06-09 14:13 CEST
CC List: 6 users

See Also:
Source RPM:
CVE:
Status comment:


Description Paul Hounsell 2015-02-11 21:28:47 CET
I have a Mageia system with 3 internal hard drives. One of the secondary drives has some bad blocks on it and fails the SMART test. Mageia hangs during boot while trying to mount the failing drive. I had to boot a live Linux DVD and comment out the failing drive in /etc/fstab. This is the wrong way to handle a failing drive.

Mageia should try a couple of times to mount the drive, then mark the drive bad and boot the rest of the system. It should not hang to the point where I have to power cycle the box. I should also be able to force-mount the failed drive to read back as much information as possible from it. I have since removed the failed drive, and the log files from that time seem to have been rotated off the system.

Here is what I found in the dmesg.old file:
[    2.765035] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[    2.784405] ata6.00: failed to read native max address (err_mask=0x1)
[    2.784410] ata6.00: HPA support seems broken, skipping HPA handling
[    2.784416] ata6.00: ATA-8: WDC WD20EARS-22MVWB0, 51.0AB51, max UDMA/133
[    2.784420] ata6.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 0/32)
[    2.825219] ata6.00: failed to set xfermode (err_mask=0x1)
[    8.070036] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[    8.111218] ata6.00: failed to set xfermode (err_mask=0x1)
[    8.111553] ata6.00: limiting speed to UDMA/100:PIO3
[   13.375035] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[   13.416216] ata6.00: failed to set xfermode (err_mask=0x1)
[   13.416549] ata6.00: disabled


Reproducible: 

Steps to Reproduce:
Comment 1 Frank Griffin 2015-02-11 22:41:12 CET
That's interesting. I have a system with two SATA drives, and the secondary gets SMART errors from the BIOS on bootup which require me to enter BIOS setup and ESC out before it will proceed to the grub menu.

The secondary disk has only one partition in the primary system's fstab, but the MGA boot proceeds without errors once you get to grub.

So, I think the problem may be the options on your fstab lines for the secondary partitions.  Unless the SMART problem is so bad that the entire drive is seen as unusable, in which case you'd need to get it out of fstab and hook it up through USB for recovery.  If your fstab says "I need this guy" and the kernel can't initialize the disk, that's a conflict that Linux can't really resolve for you.

But whenever this happens to me, the boot drops me into a recovery shell from which I can modify fstab. How exactly does your boot fail?

CC: (none) => ftg

Comment 2 Paul Hounsell 2015-02-12 00:25:17 CET
Hello Frank;

The system just locks up solid. I have to power cycle the box. I don't get any shell access or anything: it's a 100% lock-up. I shut down the system and removed the bad drive. Unfortunately, I don't have logs of the problem. Here is the fstab line:

# Entry for /dev/sdc1 :
UUID=119e1634-ef69-4ce9-bc6e-e014ee504b4c /var/backup ext4 defaults 1 2
Comment 3 Marja Van Waes 2015-09-21 10:51:25 CEST
Installer and Release bugs should always be set to Cauldron, because once the ISOs for a stable version have been created, they cannot be changed.

This bug was filed a very long time ago, against Mageia 4. Is it still valid for Mageia 5?

(If so, please do _not_ set the version to "5". I'm only asking because, if it is still valid for Mageia 5, then we'll know we have to pay attention to this issue when testing the Mageia 6 alphas and later.)

Please close this bug report if the problem was solved in Mageia 5.

Keywords: (none) => NEEDINFO
Version: 4 => Cauldron

Comment 4 Rémi Verschelde 2015-09-21 10:59:29 CEST
Actually, this is not an installer bug; it could be fixed as an update. If the bug affects Mageia 5, please add MGA5TOO to the whiteboard.

Source RPM: unknown => (none)
Component: Release (media or process) => RPM Packages
Hardware: i586 => All

Comment 5 Nic Baxter 2016-01-10 01:53:11 CET
Hi Paul
Any more information? Or has the disk totally failed so you can't test it any more?

Nic

CC: (none) => nic

Comment 6 Paul Hounsell 2016-01-12 04:09:11 CET
The drive is quasi-dead: Linux does not see it at all, but I am able to use Windows disk-recovery tools to get back about half of the data on the disk. My complaints are the following.

1) Linux should give up trying to mount a failed drive and come up in a "safe" mode: a minimal OS that lets you do something with the system. I could not edit fstab because the system would lock up completely, so I was stuck. Linux should try to mount the failing disk a few times and then drop the disk from the mounted file systems. Of course this won't work if it is the OS disk that has failed.

2) Before a disk fails there are usually read and write errors. Linux should have a threshold of failed reads and/or writes, and then pop up an alert telling the user that the disk is failing and that they should back it up as soon as possible.

Question: Are there any Linux disk-recovery tools other than dd? dd only works if the disk is good; it can't skip past bad sectors and continue. Besides, dd is not really what I want: I want to recover as many complete files as I can and skip the bad sectors.

I hope this helps.
Paul
Comment 7 Nic Baxter 2016-01-12 06:27:21 CET
Hi Paul
I will respond to your 2 points and question.

1) Mageia does have a fallback position of dropping to a recovery shell, which would allow editing fstab and then either marking the drive 'nofail' or commenting it out.
Your issue is that the computer locks up and never reaches the rescue shell. Might there be an inconsistent timeout on the probe? This is something that should be looked at.

2) This is eminently doable: just install a drive-monitoring program, as sketched below.
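For example, smartmontools' smartd daemon can watch SMART attributes and warn you before the drive dies completely. A minimal sketch of an /etc/smartd.conf entry (the device name is only an example; adjust it for your system):
    # watch all SMART attributes on this drive (-a), mail root when a
    # check fails (-m root), and send one test mail at startup (-M test)
    /dev/sdc -a -m root -M test
Then enable the smartd service so it starts at boot.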

Question:
I have used testdisk to recover partitions and photorec to recover files.
dd is not a recovery tool but rather a disk-cloning tool. The idea is to clone the drive and then run recovery tools on the image; that way there is no risk of further damaging the drive. There is another tool, ddrescue, which appears to be useful, but I haven't used it.
Comment 8 Frank Griffin 2016-01-12 14:20:48 CET
ddrescue is like dd, but it will try exhaustively to read a bad block, e.g. forwards, backwards, and together with the blocks on either side. If you're interested in losing as little data as possible, it's probably your best bet. The idea is to get a copy of the partition onto some other partition of the same size, e.g. /dev/sda12, and then mount that copy and recover your files from there.
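As a rough sketch, a typical GNU ddrescue run looks like this (the device and paths are examples; the image file should live on a separate, healthy disk with enough free space):
    # first pass: copy everything that reads cleanly, skipping bad areas (-n)
    ddrescue -n /dev/sdc1 /mnt/spare/sdc1.img /mnt/spare/sdc1.map
    # second pass: retry the remaining bad sectors up to 3 times (-r3),
    # resuming from the map file written by the first pass
    ddrescue -r3 /dev/sdc1 /mnt/spare/sdc1.img /mnt/spare/sdc1.map
    # mount the rescued image read-only and copy your files out
    mount -o loop,ro /mnt/spare/sdc1.img /mnt/recovered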

Another option is to boot an install image (boot.iso) in rescue mode and run 
    e2fsck -p -c -C 0 /dev/sdc1
which will check the disk and run badblocks.  This will take a very long time, but will fix your partition "in-place".  Any files containing bad blocks will have those blocks replaced by new good blocks, but there is no guarantee that the content of the new block will match the content of the old block, so files recovered this way may be damaged and contain gaps.
Comment 9 Marja Van Waes 2016-06-09 14:02:41 CEST
I don't know what to do with this bug report. Change it into an enhancement request? 

If so, for what exactly?

Keywords: NEEDINFO => (none)
CC: sysadmin-bugs => marja11, pterjan, thierry.vignaud, tmb

Comment 10 Thomas Backlund 2016-06-09 14:13:42 CEST
By default we are running in "better safe than sorry" mode, as we can't reliably detect all reasons for failure...

And we don't know what the user considers critical or not.


So the choice of what to do next is an end-user / sysadmin decision to make... not a distro-wide one...

If you want your system to boot up even if a mount point fails, add the "nofail" option to that specific mount.
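For example, taking the fstab line from comment 2 (the x-systemd.device-timeout option is systemd-specific and optional; the 10s value is just an example):
    # nofail lets the boot continue if the device is missing or broken;
    # x-systemd.device-timeout caps how long systemd waits for the device
    UUID=119e1634-ef69-4ce9-bc6e-e014ee504b4c /var/backup ext4 defaults,nofail,x-systemd.device-timeout=10s 1 2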

Resolution: (none) => WONTFIX
Status: NEW => RESOLVED

