Bug 14172 - Install crashes immediately during package install (SIGILL, Illegal instruction in __GI___pthread_rwlock_unlock() -> ELIDE_UNLOCK() )
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: Installer
Version: Cauldron
Hardware: x86_64 Linux
Priority: release_blocker critical
Target Milestone: ---
Assignee: Thomas Backlund
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-09-25 17:06 CEST by Frank Griffin
Modified: 2014-12-01 13:00 CET
CC List: 5 users

See Also:
Source RPM: glibc
CVE:
Status comment:


Attachments
GDB trace with symbols (Illegal instruction in ELIDE_UNLOCK) (2.16 KB, text/plain)
2014-10-10 16:26 CEST, Thierry Vignaud
GDB trace with symbols (Illegal instruction in ELIDE_UNLOCK) (663 bytes, text/plain)
2014-10-10 18:56 CEST, Thierry Vignaud
GDB trace with symbols (deadlock) (5.07 KB, text/plain)
2014-10-10 22:33 CEST, Thierry Vignaud
GDB trace with symbols (Illegal instruction in pthread_rwlock_unlock) (3.44 KB, text/plain)
2014-10-11 00:22 CEST, Thierry Vignaud

Description Frank Griffin 2014-09-25 17:06:59 CEST
Fresh network install, select all package categories.

Package selection finishes, installation panel comes up, gives a package count and about 5-7 detail lines.  At that point, X shuts down and you're flipped back to tty1 with the system fully shut down (so no response on tty2 for "bug").

tty1 shows:
    warning: /etc/fstab created as /etc/fstab.rpmnew
    exited abnormally :-( received signal 4
    (usual shutdown messages)

Last thing in tty3 is:
    filesystem not installed, Generating 12 missing indexes, please wait...

Last thing on tty4:
    Traps: runinstall2[325] trap invalid opcode ip:xxx sp:xxx error:0 in libpthread.so.0[xxx+17000]

This has been happening for a week or so now (at least, because that was the first time I've tried a fresh install for a while now), but I wanted to see if the mass rebuild would fix it.

Comment 1 Georges Eckenschwiller 2014-09-29 16:14:41 CEST
I confirm the problem. I have more or less the same messages, in particular
filesystem not installed, Generating 12 missing indexes
and
exited abnormally :-( received signal 4

I have tried several times over the past ten days or so.
I also thought the problem came from the mass rebuild, but it seems to have another cause.

CC: (none) => paiiou

Comment 2 Georges Eckenschwiller 2014-09-29 16:44:52 CEST
To be more specific, I have had this kind of problem since 16 September.
Until 15 September I had the message with systemd.
Comment 3 David Walser 2014-10-08 01:11:31 CEST
Frank, is this the same as Bug 14101 that you reported?
Comment 4 Thierry Vignaud 2014-10-10 14:10:44 CEST
The packaging issues in bug #14101 are fixed.
Let's focus on the crash here.

What's your CPU?
I saw this with an Intel E8400:
traps: runinstall2[24495] trap invalid opcode ip:7ff1d351b192 sp:7fffca665b98 error:0 in libpthread.so.0[7ff1d350f000+17000]
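
Side note for anyone reading such a trap line: the offset of the faulting instruction inside the library is `ip` minus the mapping base shown in brackets. Below is a minimal sketch of that arithmetic. The helper name `trap_offset` and the regular expression are illustrative assumptions, not part of any Mageia or kernel tool, and it assumes the bracketed address is where the library is mapped.

```python
import re

# Matches the "ip:<hex> sp:<hex> error:<n> in <object>[<base>+<size>]" part
# of a kernel trap line (all numbers are hexadecimal without a 0x prefix).
TRAP_RE = re.compile(
    r"ip:(?P<ip>[0-9a-f]+)\s+sp:(?P<sp>[0-9a-f]+)\s+error:\d+\s+"
    r"in\s+(?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def trap_offset(line):
    """Return (object name, offset of ip inside the mapping) for a trap line."""
    m = TRAP_RE.search(line)
    if m is None:
        raise ValueError("not a recognised trap line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    offset = ip - base
    if not 0 <= offset < size:
        raise ValueError("ip lies outside the reported mapping")
    return m.group("obj"), offset

line = ("traps: runinstall2[24495] trap invalid opcode ip:7ff1d351b192 "
        "sp:7fffca665b98 error:0 in libpthread.so.0[7ff1d350f000+17000]")
obj, offset = trap_offset(line)
print(obj, hex(offset))  # libpthread.so.0 0xc192
```

The resulting offset can then be looked up in a disassembly of the exact libpthread build that crashed, ideally with its debug symbols installed.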

CC: (none) => thierry.vignaud, tmb
Source RPM: (none) => glibc

Comment 5 Georges Eckenschwiller 2014-10-10 14:21:50 CEST
My CPU: Athlon XP 2500+
Comment 6 Thomas Backlund 2014-10-10 14:24:03 CEST
@Georges:

What arch? i586 or x86_64?
Comment 7 Georges Eckenschwiller 2014-10-10 14:32:01 CEST
(In reply to Thomas Backlund from comment #6)
> @Georges:
> 
> What arch? i586 or x86_64?

i586
Comment 8 Thomas Backlund 2014-10-10 14:38:12 CEST
@Georges: and what trap message do you get?
Comment 9 Georges Eckenschwiller 2014-10-10 15:15:33 CEST
(In reply to Thomas Backlund from comment #8)
> @Georges: and what trap message do you get?

I tried a new install. Local mirror, synchronised this morning. boot.iso: 6 Oct.

Personalized desktop. Deselected all packages, then selected "With X server", without suggests.

The installation seems to begin.
Then black screen with:
Warning /etc/fstab created as fstab.new
exited abnormally :-( --received signal 4

With AltF3 :
filesystem not installed, generating 12 missing indexes
Comment 10 Thierry Vignaud 2014-10-10 16:24:29 CEST
@tmb: I tried using --disable-lock-elision in glibc but it didn't help
But...
Comment 11 Thierry Vignaud 2014-10-10 16:26:13 CEST
Created attachment 5483 [details]
GDB trace with symbols (Illegal instruction in ELIDE_UNLOCK)

Trace obtained using:

CLEAN=1 drakx/tools/drakx-in-chroot /mageia/unstable/x86_64/ /T --useless_thing_accepted --flang fr --keyboard fr --lang fr --gdb

("--useless_thing_accepted --flang fr --keyboard fr --lang fr" really are just to speed up testing)

So it does crash due to elision code despite me having an E8400...
Thierry Vignaud 2014-10-10 16:27:17 CEST

Priority: Normal => release_blocker
Assignee: bugsquad => tmb
Summary: Install fails immediately during package install => Install crashes immediately during package install (SIGILL, Illegal instruction in __GI___pthread_rwlock_unlock() -> ELIDE_UNLOCK() )

Comment 12 Thomas Backlund 2014-10-10 16:35:32 CEST
Ah, that's interesting...

I used the upstream fix to disable elision:

http://svnweb.mageia.org/packages?view=revision&revision=731421


Dropping "--enable-lock-elision" means the same as "--disable-lock-elision".

So the disabling of elision is actually exposing another bug :/

And it was only supposed to trigger on Haswell-level hw...

I guess this means the elision stuff is not properly #ifdeffed when it gets disabled...
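
For context: lock elision relies on Intel TSX, which Linux reports as the `hle` and `rtm` flags in /proc/cpuinfo. CPUs such as the E8400 or the Athlon XP predate TSX, so elided lock paths should never be taken on them. Here is a rough sketch for checking a given machine. The function name `tsx_flags` is an assumption made for illustration; only the /proc/cpuinfo "flags" line format is relied upon.

```python
def tsx_flags(cpuinfo_text):
    """Return the TSX-related flags ('hle', 'rtm') found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # The first "flags" line is enough: all cores normally report the same set.
        if sep and key.strip() == "flags":
            return sorted({"hle", "rtm"} & set(value.split()))
    return []

try:
    with open("/proc/cpuinfo") as f:
        found = tsx_flags(f.read())
except OSError:
    # Not on Linux, or /proc is not mounted: report nothing rather than fail.
    found = []
print("TSX flags:", ", ".join(found) or "none")
```

An empty result means the CPU does not advertise TSX, so an illegal-instruction trap in the elision code on such a machine points at a code path that should not have been reached.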
Comment 13 Thomas Backlund 2014-10-10 17:39:05 CEST
OK, so at least ELIDE_UNLOCK is not properly protected when elision is disabled.

I've reported it upstream and made a patch, in glibc-2.20-8.mga5, that ensures none of the adaptive elision call sites in rwlocks are triggered.

I'll rebuild stage2 when the new glibc is available.
Comment 14 Thierry Vignaud 2014-10-10 18:15:22 CEST
I just did it.
Comment 15 Thierry Vignaud 2014-10-10 18:56:16 CEST
Created attachment 5484 [details]
GDB trace with symbols (Illegal instruction in ELIDE_UNLOCK)

now it deadlocks
Comment 16 Thomas Backlund 2014-10-10 19:29:52 CEST
Crap.

A new fix with a more minimal change is building.

I now only touch the ELIDE_UNLOCK path, adding the same elide check that the other ELIDE_* defines use.


If this does not work either, we can roll back to the older microcode and enable elision again for beta1 until upstream has a proper fix.
Comment 17 Thomas Backlund 2014-10-10 20:09:57 CEST
stage2 rebuilt with glibc-2.20-9
Comment 18 Thierry Vignaud 2014-10-10 22:33:28 CEST
Created attachment 5485 [details]
GDB trace with symbols (deadlock)

With -9.mga5, it looks like it deadlocks
Comment 19 Thomas Backlund 2014-10-10 22:42:04 CEST
Hm, so it looks like upstream is right: it will be a pain to disable elision, as it's not really tested that way :/

I know they had to enable it back on at least s390 for reasons like this...

I'll nuke the Haswell-specific microcodes from the tarball and re-enable elision for now...
Comment 20 Thomas Backlund 2014-10-11 00:10:30 CEST
ok, 

 microcode-0.20140913-2.mga5 dropped the problematic firmwares
 glibc-2.20-10.mga5 has lock elision enabled
 drakx-installer-* rebuilt with latest glibc & co

Let's hope it will behave better.
Comment 21 Thierry Vignaud 2014-10-11 00:22:53 CEST
Created attachment 5486 [details]
GDB trace with symbols (Illegal instruction in pthread_rwlock_unlock)

Well, not better...
Comment 22 Thomas Backlund 2014-10-11 10:04:45 CEST
Hm,
I'm starting to think this might be an rpm bug, most likely in the rpmlog part, exposed by the new glibc.

If you compare backtrace with and without elision, it's exactly the same...
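
One way to make that comparison less error-prone is to reduce each GDB backtrace to its chain of function names and compare the chains, ignoring addresses and arguments. A minimal sketch follows. The helper names are assumptions, and the sample frames are made up for illustration; they are not taken from the attachments.

```python
import re

# A GDB frame line looks like: "#N  [0xADDR in ]FUNCTION (args...) [at file:line]".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")

def frame_functions(backtrace_text):
    """Return the function names of a GDB backtrace, innermost frame first."""
    names = []
    for line in backtrace_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            names.append(m.group(2))
    return names

def same_call_chain(bt_a, bt_b):
    """True when both backtraces pass through the same functions in the same order."""
    return frame_functions(bt_a) == frame_functions(bt_b)

# Made-up frames: only the addresses differ between the two runs.
bt_a = """#0  0x0000000000001111 in inner_function (arg=0x1) at example.c:10
#1  0x0000000000002222 in caller_function () at example.c:20
"""
bt_b = """#0  0x0000000000003333 in inner_function (arg=0x2) at example.c:10
#1  0x0000000000004444 in caller_function () at example.c:20
"""
print(frame_functions(bt_a))
print(same_call_chain(bt_a, bt_b))
```

Identical chains from builds with and without elision would suggest the fault lies in the caller rather than in the elision code itself.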

And the last commit to rpmlog (Feb 19, 2013) isn't really reassuring :)
http://rpm.org/gitweb?p=rpm.git;a=commit;h=96e0cdf34b1d4b40d6565d396016f74446bd4b5f

Maybe we should ask Panu for input
Comment 23 Georges Eckenschwiller 2014-10-11 10:36:58 CEST
(In reply to Georges Eckenschwiller from comment #2)
> To be more specific, I have had this kind of problem since 16 September.
> Until 15 September I had the message with systemd.

What changed on September 15th or 16th?
Comment 24 Thierry Vignaud 2014-10-11 19:09:33 CEST
"Generating 12 missing indexes" is a normal message from rpmlib that it creates indexes in /var/lib/rpm.
The issue with the filesystem package will be dealt with later, once the crash issue is fixed.
Let's focus on the crash issue here.
Comment 25 Frank Griffin 2014-10-12 19:41:39 CEST
(In reply to Thierry Vignaud from comment #4)
> The packaging issues in bug #14101 are fixed.
> Let's focus on the crash here.
> 
> What's your CPU?
> I saw this with an Intel E8400:
> traps: runinstall2[24495] trap invalid opcode ip:7ff1d351b192
> sp:7fffca665b98 error:0 in libpthread.so.0[7ff1d350f000+17000]

Sorry to be late back to the party.  This laptop is an Asus A54C, but the Asus site for the spec sheet doesn't work.  It's fairly old, and although it is a 64-bit machine, it lacks the hardware characteristics for defining 64-bit VMs in VBox.

I don't have anything on it I can boot at the moment to run HardDrake, but the sticker on the case says Intel Pentium inside.
Comment 26 Frank Griffin 2014-10-14 17:40:35 CEST
Confirming now on a different machine with an Intel i5 M 560.
Florian Hubold 2014-10-14 21:47:30 CEST

CC: (none) => doktor5000

Pascal Terjan 2014-10-15 00:44:06 CEST

CC: (none) => pterjan

Comment 27 Thierry Vignaud 2014-10-19 08:29:20 CEST
Thomas, we are still crashing in ELIDE_UNLOCK() with an illegal instruction...
Comment 28 Thierry Vignaud 2014-10-19 08:36:49 CEST
I've reverted http://rpm.org/gitweb?p=rpm.git;a=commitdiff;h=96e0cdf3
it enables me to run the installer
Comment 29 Thomas Backlund 2014-10-19 11:24:07 CEST
(In reply to Thierry Vignaud from comment #27)
> Thomas, we are still crashing in ELIDE_UNLOCK() with an illegal instruction...

Yeah, both with and without elision we die in thread lock management...
I have not yet managed to figure out why/where...

(In reply to Thierry Vignaud from comment #28)
> I've reverted http://rpm.org/gitweb?p=rpm.git;a=commitdiff;h=96e0cdf3
> it enables me to run the installer

Nice, so maybe we can at least get beta1 out with that workaround in place.

I will try to get some time to figure this out.
Comment 30 Georges Eckenschwiller 2014-10-21 09:26:25 CEST
For me, the problem is solved

Thanks
Comment 31 Frank Griffin 2014-10-21 16:20:20 CEST
Confirming the fix.  I'll leave this open for the eventual rpm fix.
Comment 32 Thierry Vignaud 2014-10-23 16:56:31 CEST
Thanks to Panu, we now have a real fix instead of just a workaround.
Fixed in URPM-5.01

Status: NEW => RESOLVED
Resolution: (none) => FIXED
Source RPM: glibc => glibc, rpm, perl-URPM

Thierry Vignaud 2014-12-01 13:00:08 CET

Source RPM: glibc, rpm, perl-URPM => glibc

