Bug 16077

Summary: Ext4fs corruption occurred after resuming from hibernation
Product: Mageia Reporter: Mike Burgener <mburgener>
Component: RPM Packages Assignee: Thomas Backlund <tmb>
Status: RESOLVED INVALID QA Contact:
Severity: major    
Priority: Normal CC: sysadmin-bugs, thierry.vignaud
Version: Cauldron Keywords: NEEDINFO
Target Milestone: Mageia 5   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Source RPM: kernel-desktop-3.19.8-1.mga5 CVE:
Status comment:
Attachments: dmesg output

Description Mike Burgener 2015-06-04 12:52:56 CEST
Hi,

I just ran into filesystem corruption on my SSD (Samsung SSD 830). On my Windows installation (dual boot) everything is still OK, and the Samsung Magician application says the disk is OK.

The smartctl application also showed me that everything is OK.

It began with some weird messages about read errors appearing in the syslog (journalctl -f).

At the moment I'm unable to boot the machine.

I will now try to recover and get some more info out of the non-booting installation.

regards

Mike
Comment 1 Samuel Verschelde 2015-06-04 12:56:26 CEST
Exact version of the kernel would be nice, in addition to logs, if you can provide them.

Component: Release (media or process) => RPM Packages
Assignee: bugsquad => tmb
Source RPM: (none) => kernel

Comment 2 Mike Burgener 2015-06-04 13:09:54 CEST
Of course. At the moment I'm hacking around to get them: I'm writing a dd image with testdisk from the Windows installation to my NAS. Perhaps I can even provide the dd image as a download on my tuxinator servers for debugging (after I have removed my personal data, once I can read it). I hope it's a hardware and not a software issue, so the release of 5 is not in danger.
Comment 3 Mike Burgener 2015-06-04 14:10:21 CEST
Created attachment 6698 [details]
dmesg output
Comment 4 Mike Burgener 2015-06-04 14:10:33 CEST
got it booting after some manual fsck

kernel log is attached

look for the "Ext4" messages
Mike Burgener 2015-06-04 14:11:25 CEST

Priority: Normal => High

Mike Burgener 2015-06-04 14:11:53 CEST

Target Milestone: --- => Mageia 5

Comment 5 Mike Burgener 2015-06-04 14:19:14 CEST
Linux hostname 3.19.8-desktop-1.mga5 #1 SMP Mon May 11 16:35:09 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Comment 6 Thierry Vignaud 2015-06-04 16:10:56 CEST
I guess you were using RAID, weren't you?
If so, you got hit by the infamous RAID bug that was fixed in 3.19.8-desktop-2.mga5, which was released 2 weeks ago:
"  - md/raid0: fix restore to sector variable in raid0_make_request"

See http://lwn.net/Articles/645720/ for details

CC: (none) => thierry.vignaud

Comment 7 Mike Burgener 2015-06-04 18:00:39 CEST
No, there is not and never was any RAID setup on that machine. It's an Acer Aspire S3 with a Samsung EVO 530 SSD, and the SSD health is OK.

regards Mike
Thierry Vignaud 2015-06-05 10:59:55 CEST

Attachment 6698 mime type: text/x-log => text/plain
Attachment 6698 description: logfiles => dmesg output

Comment 8 Thierry Vignaud 2015-06-05 11:05:35 CEST
You forgot to say that this happened after hibernation!!!

Thomas, before hibernating, we can see several "BUG: Bad page map in process" messages.

After resuming, there are a lot of:
"EXT4-fs error (device sda5): ext4_read_block_bitmap_nowait:427: comm kworker/u16:7: Cannot get buffer for block bitmap"

Mixed with a couple:
"JBD2: Spotted dirty metadata buffer (dev = sda5, blocknr = 0). There's a risk of filesystem corruption in case of system crash."


Mike: Are you sure you didn't boot any other OS while Mageia was hibernated?

Could you have booted, e.g., a Windows with an ext4 driver or another Linux that could have mounted the Mageia partition, thus causing differences between the on-disk state and what the suspended kernel kept in its suspend image?
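
The three kinds of messages quoted above can be separated from the rest of a kernel log with a few regular expressions. The following is only a minimal sketch: the pattern names and the `summarize` helper are made up for illustration, and the sample lines are the messages quoted in this report, not the full attachment.

```python
import re
from collections import Counter

# Patterns built from the messages quoted in this comment.
# The dictionary keys are hypothetical labels chosen for this sketch.
PATTERNS = {
    "bad_page_map": re.compile(r"BUG: Bad page map in process"),
    "ext4_error": re.compile(r"EXT4-fs error \(device \w+\)"),
    "jbd2_dirty": re.compile(r"JBD2: Spotted dirty metadata buffer"),
}


def summarize(lines):
    """Count how many log lines match each pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Illustrative input: the quoted messages plus one unrelated line.
sample = [
    "BUG: Bad page map in process",
    "EXT4-fs error (device sda5): ext4_read_block_bitmap_nowait:427: "
    "comm kworker/u16:7: Cannot get buffer for block bitmap",
    "JBD2: Spotted dirty metadata buffer (dev = sda5, blocknr = 0). "
    "There's a risk of filesystem corruption in case of system crash.",
    "ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)",
]

for name, count in sorted(summarize(sample).items()):
    print(name, count)
```

To run it against a saved log, pass an open file instead of `sample`, for example `summarize(open("dmesg.txt"))`, where `dmesg.txt` is a placeholder name for a local copy of the attachment.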

Keywords: (none) => NEEDINFO
Summary: filesystem corruption occured using ext4 fs after working normaly => Ext4fs corruption occurred after resuming from hibernation

Comment 9 Mike Burgener 2015-06-05 13:52:33 CEST
Hi,

Sorry, I forgot that I sometimes use hibernation on that machine.

Yes, I'm sure I did not boot any Windows on that machine before waking up from hibernation.


And my Windows does not have any ext4 driver at the moment.

However, at the moment the system works again after the fsck repaired some stuff.

I will keep an eye on the machine and check whether I get any weird kernel messages soon. However, I think this is no real blocker, and for the moment we can change the bug state to a lower level.

regards

Mike

Status: NEW => UNCONFIRMED
Ever confirmed: 1 => 0

Mike Burgener 2015-06-05 13:52:51 CEST

Priority: High => Normal
Severity: critical => minor

Comment 10 Samuel Verschelde 2015-06-05 14:05:09 CEST
Any data corruption is still a major bug, so I'm raising the severity a little bit.

Severity: minor => major

Comment 11 Thierry Vignaud 2015-06-05 14:39:10 CEST
Mike: did you run another Linux distribution before resuming?
Also, you should update to kernel-desktop-3.19.8-2.mga5, which has important fixes.

Source RPM: kernel => kernel-desktop-3.19.8-1.mga5

Comment 12 Thomas Backlund 2015-06-05 14:44:38 CEST
One question... when did you install this system, and with what?

mga4? mga5-beta? mga5-rc? ...

There was an older ext4 bug, fixed in 3.19.7, that could have caused that... (the delayed extents bug we also squashed for mga4 in http://advisories.mageia.org/MGASA-2015-0236.html)

Having said that, I see several possibly related fixes in the upstream -stable queue, and there are also some specific to Samsung SSDs... I'll go review them...
Comment 13 Thomas Backlund 2015-06-05 18:25:53 CEST
Actually this seems to be an issue of fsck running during resume, as I found a matching Fedora bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1174945

I've added the same fix as Fedora did, in dracut-038-19.mga5.

So when that gets installed, recreate your initrd with the new dracut.
Comment 14 Mike Burgener 2015-06-05 22:35:56 CEST
Hmm, it's possible that I installed starting with beta5; I'm not sure anymore.

To me the Fedora report looks different.

regards

Mike
Comment 15 Mike Burgener 2015-06-05 22:42:05 CEST
perhaps also related to https://bugzilla.redhat.com/show_bug.cgi?id=1185640
Comment 16 Thomas Backlund 2015-06-05 22:45:07 CEST
Yes, and that's exactly what the fix I added to dracut should resolve.
Comment 17 Mike Burgener 2015-06-05 22:48:19 CEST
OK, nice. I'll use hibernation a lot once I have that update/patch, to give it a test.

regards

Mike
Comment 18 Mike Burgener 2015-06-16 21:58:48 CEST
An update: I never had the issue again. However, after changing to another SSD in the same notebook I frequently get the messages below, but everything continues to work. What I also find interesting is the UDMA/133 message, as I think UDMA would be much too slow for SATA 6 Gb/s.

Jun 16 18:35:02 localhost kernel: ata4: SATA link down (SStatus 0 SControl 300)
Jun 16 18:35:02 localhost kernel: ata5: SATA link down (SStatus 0 SControl 300)
Jun 16 18:35:02 localhost kernel: ata2: SATA link down (SStatus 0 SControl 300)
Jun 16 18:35:02 localhost kernel: usb 1-1: reset high-speed USB device number 2 using ehci-pci
Jun 16 18:35:02 localhost kernel: usb 3-1: reset full-speed USB device number 2 using xhci_hcd
Jun 16 18:35:02 localhost kernel: xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff880148c32780
Jun 16 18:35:02 localhost kernel: xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff880148c327e0
Jun 16 18:35:02 localhost kernel: usb 3-4: reset high-speed USB device number 3 using xhci_hcd
Jun 16 18:35:02 localhost kernel: usb 1-1.3: reset high-speed USB device number 3 using ehci-pci
Jun 16 18:35:02 localhost kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 16 18:35:02 localhost kernel: ata1.00: configured for UDMA/133
Jun 16 18:35:02 localhost kernel: usb 1-1.4: reset full-speed USB device number 4 usi
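
As a side note on the "SStatus 133" value in the ata1 line above: it can be split into the three fields of the SATA SStatus register. This is a minimal sketch, assuming the kernel prints the register in hexadecimal and that the usual field layout applies (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11). The function name and the speed table are made up for illustration.

```python
# Assumed meaning of the SPD (negotiated speed) field values.
SPEEDS = {
    0: "no speed negotiated",
    1: "1.5 Gbps",
    2: "3.0 Gbps",
    3: "6.0 Gbps",
}


def decode_sstatus(hex_text):
    """Split an SStatus value, given as hex text, into its three fields."""
    value = int(hex_text, 16)  # assumption: the log shows the register in hex
    return {
        "det": value & 0xF,         # device detection state
        "spd": (value >> 4) & 0xF,  # negotiated interface speed
        "ipm": (value >> 8) & 0xF,  # interface power management state
    }


fields = decode_sstatus("133")
print(fields, SPEEDS.get(fields["spd"], "unknown"))
```

Under these assumptions the speed field of "133" matches the "6.0 Gbps" the kernel reports on the same line, and "SStatus 0" on the other ports matches "link down". The "133" here is unrelated to the "UDMA/133" on the following line.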
Comment 19 Mike Burgener 2015-06-16 22:11:54 CEST
OK, the UDMA line seems to be a legacy message, as the speed looks fine:

 Timing O_DIRECT cached reads:   960 MB in  2.00 seconds = 479.95 MB/sec
 Timing O_DIRECT disk reads: 1168 MB in  3.00 seconds = 388.88 MB/sec
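
The comparison behind that conclusion can be written down as a small sketch. It parses the two timing lines above and relates the reported rate to 133 MB/s, taken here as the nominal ceiling of the UDMA/133 transfer mode. The regular expression, the constant and the function name are assumptions made for this example.

```python
import re

# Assumed nominal ceiling of the UDMA/133 mode, in MB/s.
UDMA133_MB_S = 133.0

# Matches lines such as
# "Timing O_DIRECT disk reads: 1168 MB in  3.00 seconds = 388.88 MB/sec"
TIMING = re.compile(
    r"Timing (.+?):\s+(\d+) MB in\s+([\d.]+) seconds = ([\d.]+) MB/sec"
)


def parse_timing(line):
    """Return (label, reported MB/sec) for a timing line, or None."""
    match = TIMING.search(line)
    if match is None:
        return None
    label, _megabytes, _seconds, rate = match.groups()
    return label, float(rate)


lines = [
    " Timing O_DIRECT cached reads:   960 MB in  2.00 seconds = 479.95 MB/sec",
    " Timing O_DIRECT disk reads: 1168 MB in  3.00 seconds = 388.88 MB/sec",
]

for line in lines:
    label, rate = parse_timing(line)
    ratio = rate / UDMA133_MB_S
    print(f"{label}: {rate} MB/sec, {ratio:.1f}x the UDMA/133 figure")
```

Both reported rates are well above 133 MB/s, which is consistent with the UDMA/133 line being a nominal mode label and not the actual limit of the SATA link.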
Comment 20 Mike Burgener 2016-05-05 20:24:21 CEST
It was a long-standing disk issue, not kernel or fs related.

Status: UNCONFIRMED => RESOLVED
Resolution: (none) => INVALID