16657 – Late microcode update causes boot to fail on Intel Haswell processors

Status:	RESOLVED FIXED

Alias:	None

Product:	Mageia
Classification:	Unclassified
Component:	RPM Packages
Version:	Cauldron
Hardware:	x86_64 Linux

Priority:	Normal
Severity:	normal
Target Milestone:	---
Assignee:	Thomas Backlund
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2015-08-26 22:17 CEST by Arne Spiegelhauer
Modified:	2015-08-30 00:46 CEST
CC List:	1 user

See Also:
Source RPM:	kernel-source-4.1.6-4.mga6; glibc
CVE:
Status comment:

Description Arne Spiegelhauer 2015-08-26 22:17:35 CEST

Description of problem:
When the CPU microcode is updated late in boot, systemd-udevd crashes.
This in turn causes systemd to time out waiting for disk devices to appear and drop into Emergency Mode.

From journalctl -b:
--------------------------------------------------------------------------
Aug 26 13:12:04 localhost kernel: microcode: CPU7 sig=0x306c3, pf=0x2, revision=0x19
Aug 26 13:12:04 localhost kernel: microcode: CPU7 sig=0x306c3, pf=0x2, revision=0x19
Aug 26 13:12:04 localhost kernel: microcode: CPU7 updated to revision 0x1c, date = 2014-07-03
Aug 26 13:12:04 localhost kernel: microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
Aug 26 13:12:04 localhost systemd[1]: systemd-udevd.service: Unit entered failed state.
Aug 26 13:12:04 localhost systemd[1]: systemd-udevd.service: Failed with result 'signal'.
Aug 26 13:12:04 localhost systemd[1]: systemd-udevd.service: Service has no hold-off time, scheduling restart.
Aug 26 13:12:04 localhost kernel: audit_printk_skb: 35 callbacks suppressed
Aug 26 13:12:04 localhost kernel: audit: type=1130 audit(1440587523.935:20): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-journal-flush comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 26 13:11:56 localhost systemd[1]: Started udev Coldplug all Devices.
.
.
.
Aug 26 13:13:25 localhost systemd[1]: dev-disk-by\x2duuid-292D\x2dBAF8.device: Job dev-disk-by\x2duuid-292D\x2dBAF8.device/start failed with result 'timeout'.
Aug 26 13:13:25 localhost systemd[1]: Dependency failed for /home.
Aug 26 13:13:25 localhost systemd[1]: systemd-fsck@dev-disk-by\x2duuid-e98bd1f8\x2d348b\x2d45b0\x2d8177\x2daed2a089339e.service: Job systemd-fsck@dev-disk-by\x2duuid-e98bd1f8\x2d348b\x2d45b0\x2d8177\x2daed2a089339e.service/start failed with result 'dependency'.
Aug 26 13:13:25 localhost systemd[1]: dev-disk-by\x2duuid-e98bd1f8\x2d348b\x2d45b0\x2d8177\x2daed2a089339e.device: Job dev-disk-by\x2duuid-e98bd1f8\x2d348b\x2d45b0\x2d8177\x2daed2a089339e.device/start failed with result 'timeout'.
Aug 26 13:13:25 localhost systemd[1]: dev-disk-by\x2duuid-4e57d1ec\x2d6402\x2d4a89\x2da165\x2d44113047d47a.device: Job dev-disk-by\x2duuid-4e57d1ec\x2d6402\x2d4a89\x2da165\x2d44113047d47a.device/start timed out.
Aug 26 13:13:25 localhost systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-4e57d1ec\x2d6402\x2d4a89\x2da165\x2d44113047d47a.device.
--------------------------------------------------------------------------
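To see at a glance which CPUs received a late update, the "updated to revision" lines can be pulled out of a saved journal. This is only a sketch: it assumes the journal was saved to a text file (for example with journalctl -b > boot.log), and the function name microcode_updates is made up for illustration.

```shell
# Print "CPUn revision" for every late microcode update found in a saved journal.
# Assumption: $1 is a text file containing kernel lines such as
#   "kernel: microcode: CPU7 updated to revision 0x1c, date = 2014-07-03"
microcode_updates() {
    sed -n 's/.*microcode: \(CPU[0-9]*\) updated to revision \(0x[0-9a-f]*\),.*/\1 \2/p' "$1"
}
```

Any output at all means the update happened after userspace was already running, which is the situation described in this report.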


This looks like the same issue as reported in this Red Hat bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1146967
and the issue described on LWN:
https://lwn.net/Articles/632687/

The solution agreed on (as far as I understand it from googling) is to enable "early microcode update" in the kernel build:
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_INTEL_EARLY=y
CONFIG_MICROCODE_EARLY=y

and in the initrd build by setting:
early_microcode="yes"
in dracut.conf.

A custom kernel built with these modifications boots fine on my system.
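
Whether a given kernel was built with these options can be checked by grepping its config file. The following is a minimal sketch, assuming the config is installed as a plain-text file such as /boot/config-$(uname -r); the function name check_early_microcode is hypothetical.

```shell
# Report which of the early-microcode options are set to "y" in a kernel config.
# Assumption: $1 is a plain-text kernel config, e.g. /boot/config-$(uname -r).
# Returns 0 only if all four options are built in.
check_early_microcode() {
    config_file=$1
    missing=0
    for opt in CONFIG_MICROCODE CONFIG_MICROCODE_INTEL \
               CONFIG_MICROCODE_INTEL_EARLY CONFIG_MICROCODE_EARLY; do
        if grep -q "^${opt}=y\$" "$config_file"; then
            echo "$opt: built in"
        else
            echo "$opt: not built in"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be called as check_early_microcode "/boot/config-$(uname -r)". An option set to "m" counts as not built in, since a modular driver can only load the microcode late.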

Note:
With the current Cauldron kernel, the microcode update can be skipped by adding "modprobe.blacklist=microcode" to the kernel command line.
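
To confirm that the workaround is active on a booted system, the kernel command line can be inspected. A rough sketch, assuming the command line is read from /proc/cmdline (or a file with the same format); the function name microcode_blacklisted is illustrative only.

```shell
# Return 0 if the kernel command line blacklists the microcode module.
# Assumption: $1 (default /proc/cmdline) holds the space-separated command line.
microcode_blacklisted() {
    cmdline_file=${1:-/proc/cmdline}
    # Split into one parameter per line, keep modprobe.blacklist= entries,
    # then split their comma-separated module lists and look for "microcode".
    tr ' ' '\n' < "$cmdline_file" |
        grep '^modprobe\.blacklist=' |
        cut -d= -f2 |
        tr ',' '\n' |
        grep -qx 'microcode'
}
```

For example: microcode_blacklisted && echo "late microcode update is disabled".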


Version-Release number of selected component (if applicable):


How reproducible:
Happens on every boot attempt.

Steps to Reproduce:
1.
2.
3.


Thierry Vignaud 2015-08-28 09:52:14 CEST

CC: (none) => thierry.vignaud
Assignee: bugsquad => tmb

Comment 1 Thomas Backlund 2015-08-28 10:15:58 CEST

Hm, we fixed this in glibc for mga5 to not expose the broken lock elision...

So either the upstream fix was not covering all bases or glibc 2.22 in cauldron has been broken in this regard and needs the kernel-side "fix"...

@Arne:

Since you seem to have a system that actually triggers this bug, could you try mga5 too and see if it happens there (and also with the 4.1.6 kernel in updates testing)?

I have a Haswell system here but I can't trigger it.

Comment 2 Arne Spiegelhauer 2015-08-28 15:49:13 CEST

On my system, mga5 apparently boots and runs fine with both 3.19.8 and 4.1.6 kernels.

I am writing this update on mga5 running the 4.1.6 kernel:

[root@localhost ~]# uname -a
Linux localhost 4.1.6-desktop-4.mga5 #1 SMP Tue Aug 25 20:14:21 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

and /proc/cpuinfo confirms that the microcode update has been performed:

microcode       : 0x1c
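
The same check can be scripted across all logical CPUs. This is a small sketch, assuming the "microcode : 0x.." line format shown above; the function name microcode_revisions is not from any existing tool.

```shell
# Print the distinct microcode revisions reported for all logical CPUs.
# Assumption: $1 (default /proc/cpuinfo) contains lines like "microcode : 0x1c".
microcode_revisions() {
    cpuinfo_file=${1:-/proc/cpuinfo}
    # Split each line at the colon and keep the value of the microcode field.
    awk -F': *' '/^microcode/ { print $2 }' "$cpuinfo_file" | sort -u
}
```

More than one line of output would mean the CPUs are not all running the same revision.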


I am also sure the issue started in Cauldron on the first boot after the glibc 2.22 update (although I might also have updated the kernel to 4.1.5 before rebooting).

Thierry Vignaud 2015-08-29 00:32:41 CEST

Source RPM: kernel-source-4.1.6-4.mga6 => kernel-source-4.1.6-4.mga6; glibc

Comment 3 Thomas Backlund 2015-08-30 00:46:38 CEST

@Arne, thanks for confirming mga5 is safe.

As for Cauldron...

We really want to be able to support lock elision on non-broken hardware, so I have enabled early firmware loading in dracut-038-21.mga6 and kernel-4.1.6-5.mga6, both currently building.

Status: NEW => RESOLVED
Resolution: (none) => FIXED
