Bug 20538 - solid boot lockup, only switchoff recovers (display_driver_helper kept removing "nokmsboot")
Summary: solid boot lockup, only switchoff recovers (display_driver_helper kept removi...
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: Cauldron
Hardware: x86_64 Linux
Importance: release_blocker critical
Target Milestone: Mageia 6
Assignee: Mageia tools maintainers
QA Contact:
URL:
Whiteboard:
Keywords: NEEDINFO
Depends on:
Blocks:
 
Reported: 2017-03-20 16:31 CET by Richard Walker
Modified: 2017-04-27 16:41 CEST

See Also:
Source RPM: plymouth-0.9.2-9.mga6, drakx-kbd-mouse-x11, kernel
CVE:
Status comment: Fix pushed for amdgpu detection


Attachments
PCI-attached hardware (6.14 KB, text/plain)
2017-03-20 17:41 CET, Richard Walker
Output from display_driver_helper --is-kms-allowed (2.71 KB, text/plain)
2017-03-20 17:43 CET, Richard Walker
badly hacked function; is_kms_allowed() (155 bytes, text/plain)
2017-03-20 17:45 CET, Richard Walker

Description Richard Walker 2017-03-20 16:31:39 CET
Description of problem:
This is the return of a problem I had worked around by forcing /sbin/display_driver_helper to believe that the "nokmsboot" kernel option is always valid.

A full treatment of this issue may be seen at https://forums.mageia.org/en/viewtopic.php?f=15&t=11584&p=68079#p68079

(It is probably safe to ignore the bits about grub and grub2 and the wandering into plasma and wayland issues towards the end of the thread.)

For convenience I summarise the salient points here:
The hardware is probably significant: AMD A10-5800 APU on ASUS F2A85-M LE motherboard.

I think this is significant as nobody has said anything in the forum thread (or elsewhere that I can find) about having the same problem.

This system runs MGA3 and MGA5 with no problems using the AMD proprietary fglrx driver or the radeon xorg driver.

A fresh boot disk has been prepared with Cauldron using the sta1 iso and it is kept up to date.

The problem first appeared as a rapid boot failure/lockup within 2 or 3 seconds of the boot process starting. Only "safe" mode booting allowed me back into the system and only power cycling would allow me to do that. 

I determined that adding the "nokmsboot" option to the grub (grub legacy in my case) command line would always produce a completely incident-free boot, though with the grey screen and three question marks.

I then discovered that a Mageia script (/sbin/display_driver_helper) was determining that my system should not use that option and was removing it on every successful boot. I fixed that by forcing the script to say that the kms boot option was needed, regardless of its check results.

Today I installed the Plymouth update (along with all the others available) which was intended to restore the boot splash screens for systems which do not use kms. The hard lockup on boot has returned and now requires that I use the "nosplash" on the grub command line in order to complete the boot process.

Version-Release number of selected component (if applicable):


How reproducible:



Steps to Reproduce:
1. Possibly hardware-specific, else everyone with my APU would have the same issue.
2. Boot Cauldron with default boot parameters after configuring the Radeon driver for the screens.
3.
Comment 1 Thierry Vignaud 2017-03-20 17:11:37 CET
Something must lead /sbin/display_driver_helper to believe that "nokmsboot" is not needed.
Can you attach (not paste) the output of "lspcidrake -v".
eg: lspcidrake -v > /tmp/lspcidrake.txt

Please attach the /tmp/debug.txt resulting from running:
sh -x /sbin/display_driver_helper --is-kms-allowed &>/tmp/debug.txt

It might help understand what's happening...
Comment 2 Thomas Backlund 2017-03-20 17:14:05 CET
Hm, I see we have forgotten to add support for amdgpu detection in ddh
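
For readers unfamiliar with this kind of check, the following is a minimal sketch of how a driver-name test can miss a newer driver. It is not the actual display_driver_helper code: the function name driver_uses_kms and the list of drivers are assumptions made purely for illustration.

```sh
# Hypothetical helper, NOT taken from display_driver_helper.
# Returns 0 (true) if the named driver is assumed to rely on KMS.
# The driver list below is an illustrative assumption.
driver_uses_kms() {
    case "$1" in
        radeon|amdgpu|nouveau|i915) return 0 ;;
        *) return 1 ;;
    esac
}
```

If "amdgpu" were missing from such a list, a system using that driver would be classified as not using KMS, which is the sort of gap this comment describes.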
Comment 3 Thomas Backlund 2017-03-20 17:15:45 CET
And for some other A10 issues there is a report upstream pointing out an 8 month regression that should be reverted... I will do that in the next kernel...
Comment 4 Thierry Vignaud 2017-03-20 17:21:50 CET
Adding kernel to the pkg list then :-)
Comment 5 Richard Walker 2017-03-20 17:41:09 CET
Created attachment 9122
PCI-attached hardware
Comment 6 Richard Walker 2017-03-20 17:43:19 CET
Created attachment 9123
Output from display_driver_helper --is-kms-allowed

Note that I have forced this probe to return 1 on line 303 of display_driver_helper
Comment 7 Richard Walker 2017-03-20 17:45:07 CET
Created attachment 9124
badly hacked function; is_kms_allowed()
Comment 8 Richard Walker 2017-03-20 17:46:58 CET
(In reply to Thierry Vignaud from comment #1)
I believe that its determination (ie. no need for "nokmsboot") is correct for the huge majority of users as the configured xorg driver, radeon, is expected to use kernel mode switching. Indeed, once the kernel has been booted I rather suspect that kernel mode switching is used during the rest of the system boot as the kernel messages switch from being large and lo-res into small and hi-res, indicating a screen mode change. 

for example:
[root@localhost ~]# lsmod | grep kms
drm_kms_helper        135168  1 radeon
drm                   335872  7 radeon,ttm,drm_kms_helper
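
The same check can be scripted. This is a small sketch, assuming the usual four-column lsmod layout (module, size, use count, used by). The function name kms_helper_users is made up for this example.

```sh
# Hypothetical helper: reads lsmod-style output on stdin and prints the
# "used by" column for the drm_kms_helper module, if that module is listed.
kms_helper_users() {
    awk '$1 == "drm_kms_helper" { print $4 }'
}
```

Typical use would be: lsmod | kms_helper_users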

I have attached the files you requested, and my modified (line 303) display_driver_helper which, brutal though it is, lets me boot on THIS system.

Note that, as detailed in the referenced forum discussion, the Nvidia card plays no part in handling any display. It also makes no difference to the boot failure should I remove that card. It is only present to allow me to use its CUDA support in Blender.

Note also that the hack of display_driver_helper was done ONLY to stop my "nokmsboot" option from being removed from my grub commandline on every successful boot.
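
To confirm after a boot whether "nokmsboot" actually survived on the kernel command line, a check along these lines may help. It is only a sketch: has_boot_option is a hypothetical name, and the optional second argument exists so the function can be tried against a sample string instead of /proc/cmdline.

```sh
# Hypothetical helper: has_boot_option OPTION [CMDLINE]
# Returns 0 if OPTION appears as a whole word in CMDLINE.
# CMDLINE defaults to the running kernel's command line.
has_boot_option() {
    opt=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $opt "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: has_boot_option nokmsboot && echo "nokmsboot is still present"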
Comment 9 Richard Walker 2017-03-20 18:00:48 CET
Speculation; the true cause of the boot failure is not Plymouth, nor the way the radeon driver uses kms on my APU. I justify these assumptions based on

1. when the system has booted correctly then kms is also used correctly
2. after the Plymouth upgrade today the "nokmsboot" option must be used with "nosplash"

The second point appears significant because it indicates that avoiding Plymouth alone will not let me boot. The "nokmsboot" option must also be present.

It suggests to me that something is done in the patched Plymouth code which mimics or duplicates something which also happens in the splash-free early seconds of the kernel boot process which triggers the particular conditions needed to lock up my hardware.
Comment 10 Rémi Verschelde 2017-03-20 18:04:21 CET
From Martin Whitaker on the dev ML:

> It looks to me like it has just re-exposed a bug that was temporarily hidden. I suspect the previous workaround only worked because it caused plymouth to fall back to text mode and not use the video device.

Setting assignee to mageiatools@ for now, please reassign as needed.
Comment 11 Richard Walker 2017-03-20 20:52:51 CET
The system is unchanged, the behaviour is different.

I have just completed two slightly different boot experiences with no changes to the kernel options. 

1. Boot but forgetting to add "nosplash" to the "nokmsboot" option.

There was no crash. There was the familiar grey screen with question marks, followed by a black screen and a couple of lines from the kernel, the second of which was something like "Checking for new hardware". This was followed by the Mageia splash screen with lots of bubbles popped out of the cauldron, but no more were observed to appear while it was on the screen. Then it may or may not have gone blank again before my autologin took me to the desktop.

2. As above. This time keeping the "splash" option was deliberate. 

There was no crash. There was the familiar grey screen with question marks, followed by a black screen and a couple of lines from the kernel. The screen went black as it changed to a higher text resolution and produced two lines of text at the top of the screen in much smaller letters, the second of which was "Checking for new hardware". This was followed a minute or so later by another black screen and my autologin took me to the desktop.

So, where previously I had to turn off Plymouth to boot the machine after the Plymouth update, now the same system has no such need and is quite happy to give me this strange Plymouth behaviour instead. Is Plymouth learning how to adapt itself to work on my kit? If so, much respect to the programmer!
Comment 12 Rick Stockton 2017-03-23 18:21:03 CET
I can also test any bugfix WRT this problem pretty easily, on nearly the same hardware (also an AMD-A10 GPU, different motherboard). 'Strange Plymouth Behavior' is bug #19890, and possibly fixable per that bug.

In my 5.1 box, attempting "online upgrade" caused this problem on the Plasma Desktop, but some other desktops ran alright. ('Gnome on X', in particular.) Plasma WITHOUT compositing also works. 'Upgrade' using the STA-2 disk crashed partway through, so I can't advise whether that would resolve the problem.
- - - - -
OpenSuse 'Tumbleweed' *does not have the problem*. It runs Plasma-5 successfully with lots of kwin compositing "features" active. But I did a 'full install' (formatting everything but /home, where I had temporarily saved only my public and private key pairs from openssl). My boot kernel string on openSUSE is currently:

BOOT_IMAGE=/boot/vmlinuz-4.10.3-1-default root=UUID=xxxx video=2560x1440 resume=xxx splash=silent quiet showopts

nothing about "nokmsboot". The radeon "ati" driver runs fine, I have not tried amdgpu.

My SWAG: Something about the replacement of legacy "fglrx" with "ati" requires a change which isn't done during an 'online upgrade'. Richard, are you doing an 'online upgrade' or a full install?
Comment 13 Richard Walker 2017-03-24 03:23:14 CET
Rick, the problem first appeared for me on a bang-up-to-date sta1 iso installation. I am still trying to get another test system set up on a fresh disc using sta2, but it isn't ready yet.

I suppose it would be nice to have the Plymouth splash screen instead of the grey??? but I would worry more about that after I get the machine to boot safely with no crash when the Mageia default boot parameters for a system with radeon graphics are used. 

As long as I need to use the "nokmsboot" where it shouldn't be needed, I don't really expect Plymouth to be able to figure out correctly what my system is doing or going to do. Nevertheless, it is fascinating to see the variability in what Plymouth does on my machine when nothing relevant has changed. The mixture of grey??? AND Mageia boot graphics in one boot was weird to behold!
Comment 14 Rick Stockton 2017-03-26 00:15:57 CET
Devs, please advise: Is STA-2 "live KDE" now so far behind as to be "uninteresting", or should I burn and test it (for comparison versus online upgrade)?

Note 1: I have an "MGA-5.1" disk, up to date and working well - and it can be cloned into a "scratch hard drive" as many times as we like. "Scratch" can also be used for any number of "clean" installation attempts (from DVD), "upgrade from 5.1" (from DVD), or re-executions of "online upgrade".

Note 2: I have "splash=silent" as a boot parameter on openSUSE, and plymouth crashes there - I don't even get the "gray screen with 3 characters". BUT - I recommend that we leave the issue of "Pretty Plymouth Screen is missing" (IMO "normal") separate from this bug, which IMO should only be about the *Blocker* of no KDE Sessions on certain AMD configurations.

Note 3: I suspect that we _might_ be passing invalid randr parameters at the X "user session" startup of plasma.... and Qt fails to handle error returns in a viable way.
Comment 15 Rick Stockton 2017-03-26 04:29:45 CEST
A "fresh" installation of STA-2 "live-KDE" has none of these problems.

Even Plymouth Bootsplash works (i.e., Cauldron growing more bubbles, very slowly- there are a few short "fallbacks" to the console terminal during the process.)

And so: I think that some legacy MGA-5.1 x-startup coding is "left behind" during the online-upgrade process, and it becomes "broken" by the replacement of the fglrx driver, and/or the switch from kwin in KDE-4 to kwin *on qt5* in Cauldron.

Kwin used to do a lot more direct calls to X11 API, and the new design (requesting qtscreen to handle the lower-level protocols) may be somewhat broken when invalid parameters are sent into X11 modules from Qt5.

The faulty parameters might be shared among all users (somewhere within /etc); or they might be present within individual user directories.
- - - - 
Richard: Did your installation include reformat and total rewrite of "/etc", or only "upgrade" of existing contents?
Comment 16 Richard Walker 2017-03-26 06:51:05 CEST
I don't think /etc has much to do with my problem as it scarcely has time to load from initrd before it crashes. In any event, I don't upgrade - only install, though that does mean I am building up quite a collection of historical bootable drives (all the way back to Mandriva 2010). For other reasons I have had to wipe and repartition between attempts with sta2 but the faulty system was installed from sta1 and updated as required over weeks.

I have just had complete success booting a fresh install from sta2 iso. No tweaks to the command line have been needed - everything is "default" and even the Plymouth screen looks as it should.

The difference between the working system and the crashing system is grub and as grub is no longer available for fresh installs it may be that this bug really will not show up for anyone else.

I will try to dummy up an "upgrade" type install, where I will be offered grub, and see if the sta2 iso installation will still work. It may be easier just to remove grub2 and install grub on this disc, so I will try that first...
Comment 17 Richard Walker 2017-03-26 17:01:30 CEST
...now posting from the re-booted test install (sta2 DVD iso) using grub-legacy as the bootloader and this was just as successful as the first and subsequent boots using the default grub2.

In summary, it looks like the inexplicable logic of the "nokmsboot" requirement on a system which ought to have fully supported kernel mode switching cannot easily be duplicated on a system created from sta2. This is good.

The downside is that the faulty behaviour was most certainly experienced and might conceivably return when something else changes to reveal again this most obscure bug.

I would have no objection to having this bug report closed if it seems to others that it is unlikely to be fixable at this stage.
Comment 18 Rick Stockton 2017-03-26 18:09:52 CEST
Richard, your system goes to pieces much earlier than mine. (Mine falls apart after 'startx'.) Is your problem possibly resolved by forcing a new initrd? (running 'dracut -f' as root.)
Comment 19 Richard Walker 2017-03-26 22:06:44 CEST
Yes Rick, this bug was always about an instant crash on booting and as its cause has not been found I can only say that the conditions under which it appears can be avoided (completely?) with a successful sta2 DVD iso install.

The system which still has the bug is essentially a throw-away one, being on an old spare 20G laptop drive. I will go back to it over the next week or so and see if I can find a way to scare it into obedience. Waving the threat of a fresh initrd may do the trick, as may switching between grub and grub2, but you are right - the focus has to be on something which happens VERY early in the boot process.
Comment 20 Rémi Verschelde 2017-03-27 18:23:48 CEST
I did not read everything thoroughly, but it appears from the last comments that the issue might be fixed?

Were the fixes mentioned in comment 2 and comment 3 done already?
Comment 21 Mageia Robot 2017-03-28 16:25:16 CEST
commit 1ecd39ba24adb8de07998efa77c78403f6cbd977
Author: Thomas Backlund <tmb@...>
Date:   Tue Mar 28 17:24:27 2017 +0300

    detect amdgpu (mga#20538)
---
 Commit Link:
   http://gitweb.mageia.org/software/drakx-kbd-mouse-x11/commit/?id=1ecd39ba24adb8de07998efa77c78403f6cbd977
Comment 22 Rémi Verschelde 2017-04-04 09:57:12 CEST
Thomas, would the bug be fixed now with comment 21, or are there still changes to do?
Comment 23 Rémi Verschelde 2017-04-26 13:16:09 CEST
Assuming fixed in drakx-kbd-mouse-x11-1.21-1.mga6, please reopen if it's not the case.
Comment 24 Richard Walker 2017-04-27 16:41:52 CEST
I am currently running Cauldron on a system which does not use the radeon/ati driver so I am not in a position to test if the problem using this driver for an AMD A10 APU has been fixed by an update to the drakx-kbd-mouse-x11-1.21-1.mga6 package.

It might be helpful to have a few words on how the crash was induced and how the fix will now correct that error so early in the grub boot process.
