Bug 8562 - Can not boot MSI/AMD based system after Mageia 2 install
Status: RESOLVED OLD
Alias: None
Product: Mageia
Classification: Unclassified
Component: Release (media or process)
Version: 2
Hardware: x86_64 Linux
Priority: Normal
Severity: critical
Target Milestone: ---
Assignee: Mageia Bug Squad
Reported: 2012-12-31 19:53 CET by Tom Cox
Modified: 2013-11-23 16:13 CET

Attachments
PCI debug output without Realtek device (4.56 KB, text/plain)
2013-01-06 21:08 CET, Tom Cox
PCI debug output with Realtek device (66.03 KB, text/plain)
2013-01-06 21:10 CET, Tom Cox
PCI debug output with Realtek device (4.89 KB, text/plain)
2013-01-06 21:12 CET, Tom Cox

Description Tom Cox 2012-12-31 19:53:57 CET
Description of problem:

I am installing Mageia 2 on a new system using an MSI 760GM-P34 motherboard with integrated Radeon graphics.  The install succeeds, but the installed system fails during kernel initialization.  The last message shown on the screen is the 'MSI quirk detected; subordinate MSI disabled' message.  Then the screen goes blank (backlight is on), then goes black, then the system resets and starts over.  The next kernel message expected is 'Boot video device', so it may have something to do with that.

The LiveCDs work correctly, both 32 and 64 bit.  So far I've tried:

Installing from 32 and 64 bit DVDs.
Installing from 32 and 64 bit LiveCDs.
Disabling everything in the BIOS.
Booting with the kernel options noapic, noacpi, nomodeset, vga=all, pci=nomsi

The result is always the same.  I also tried installing Ubuntu 12.04.01 desktop (kernel 3.2.2) and got the same result, so this problem is probably not distro or kernel version specific.  Oh yeah, I also tried Mandriva 2011 with the same results.

Any suggestions on how to proceed would be welcome.

How reproducible:


Steps to Reproduce:
1.
2.
3.
Comment 1 Tom Cox 2012-12-31 19:59:04 CET
Meant to write vga=ask, not vga=all
Comment 2 claire robinson 2012-12-31 21:55:07 CET
You might try xdriver=vesa and, if it boots, you can then enable the nonfree media and use either MCC, or XFdrake if you don't have X, to configure the graphics card and install any missing nonfree drivers/firmware.
Comment 3 Tom Cox 2013-01-01 16:47:18 CET
I've been configuring for vesa during the installation.  I went ahead and tried your suggestion as well, but it didn't help.  I'm pretty sure we're still in kernel initialization, so I don't think it's a driver issue.  I've been comparing logs from different systems:

Mandriva 2010.2 / ASUS AMD motherboard
NET: Registered protocol family 1
pci 0000:00:07.0: Enabling HT MSI Mapping
pci 0000:00:08.0: Enabling HT MSI Mapping
pci 0000:00:09.0: Enabling HT MSI Mapping
pci 0000:00:0a.0: Enabling HT MSI Mapping
pci 0000:00:0b.0: Enabling HT MSI Mapping
pci 0000:02:00.0: Boot video device
PCI: CLS 64 bytes, default 64
Trying to unpack rootfs image as initramfs...

Mageia 2 / VirtualBox VM
[    0.521173] NET: Registered protocol family 1
[    0.521187] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[    0.521219] pci 0000:00:01.0: Activating ISA DMA hang workarounds
[    0.521250] pci 0000:00:02.0: Boot video device
[    0.521578] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 11
[    0.521583] PCI: setting IRQ 11 as level-triggered
[    0.522094] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 10
[    0.522099] PCI: setting IRQ 10 as level-triggered
[    0.522195] PCI: CLS 0 bytes, default 64
[    0.522243] Trying to unpack rootfs image as initramfs...

Mageia 2 / MSI problem motherboard
[    0.??????] MSI quirk detected; subordinate MSI disabled

That message appears briefly before the screen flashes and resets.  (I didn't capture the timestamp, hence the question marks.)  The message comes from
linux-3.3.8-2.mga2/drivers/pci/quirks.c.

The next message I expect is the "Boot video device" message which comes from
./linux-3.3.8-2.mga2/arch/x86/pci/fixup.c.

So it would appear that somewhere in the PCI fixup code, the kernel is causing a reset.

I read linux-3.3.8-2.mga2/Documentation/PCI/MSI-HOWTO.txt and, as noted above, tried the pci=nomsi kernel option.  It didn't help, so the problem may or may not be MSI related.  The screen flash before the reset may indicate a PCI/video related problem.

One other thing I should have noted originally; after trying everything else, I flashed the BIOS to the latest revision.  It didn't help.

It would appear this is a kernel/hardware compatibility issue, but I haven't worked at the kernel level and don't know how to chase it.  I'm out of bullets.
Comment 4 Tom Cox 2013-01-01 17:07:05 CET
My last response got me thinking about the LiveCD 64 bit which runs correctly.  Here's the relevant dmesg output from it.

[    0.913451] pci 0000:00:01.0: MSI quirk detected; subordinate MSI disabled
[    1.302081] pci 0000:01:05.0: Boot video device
[    1.302096] PCI: CLS 64 bytes, default 64
[    1.302179] Trying to unpack rootfs image as initramfs...
Comment 5 Manuel Hiebel 2013-01-01 17:39:34 CET
Better try new stuff then, like the beta1 from mageia3
Comment 6 Tom Cox 2013-01-01 19:30:33 CET
I'll give it a shot and report back in a day or two.
Comment 7 Tom Cox 2013-01-02 00:07:20 CET
Mageia 3 beta has the same problem.  I did a standard installation and selected to use non-free packages.  The only installation problem I had was that updates failed.  I think it said aria2 failed to download.  Anyway, I've filed a bug report with MSI.  I'll report back when I hear from them.
Comment 8 Tom Cox 2013-01-06 21:08:41 CET
Created attachment 3321 [details]
PCI debug output without Realtek device
Comment 9 Tom Cox 2013-01-06 21:09:52 CET
This particular problem has fallen into /dev/null.

Using the rescue CD, I was having trouble building a kernel / initrd that would boot.  I just couldn't get lilo or grub to behave properly, and when using grub, it wasn't updating the MBR (lilo kept displaying L 99 99 errors).  So I decided to zero out the first block on both drives and do a fresh install.  For some reason, the installation did not recognize the onboard Realtek network device this time, but it did see the Intel network device.  Once the installation completed, the computer booted properly and I was able to build a kernel with CONFIG_PCI_DEBUG turned on and boot the new kernel.

Rebooted, checked the BIOS and the Realtek was enabled.  Disabled it, rebooted to Linux, rebooted again, enabled the Realtek in the BIOS and booted Linux.  It still didn't see it.  Rebooted using the CD and the Hardware Detection program.  It showed the Realtek was there.  Rebooted to Linux and this time it saw the device and still completed the boot normally.  Once I was back in, I noted using lspci that the Realtek was now being enumerated at the end of the PCI bus.  Comparing part of the PCI fixup routines' output between no Realtek and Realtek showed a minor difference.

12a13,15
> pci 0000:00:05.0: calling quirk_cardbus_legacy+0x0/0x30
> pci 0000:00:05.0: calling quirk_usb_early_handoff+0x0/0x659
> pci 0000:00:05.0: calling pci_fixup_video+0x0/0xa9
84a88,90
> pci 0000:03:00.0: calling quirk_cardbus_legacy+0x0/0x30
> pci 0000:03:00.0: calling quirk_usb_early_handoff+0x0/0x659
> pci 0000:03:00.0: calling pci_fixup_video+0x0/0xa9

I'm guessing that moving the Realtek to the end of the bus changed more than just the Realtek and, for whatever reason, eliminated the problem (see attachments).  BTW, the Realtek functions correctly...no errors, runs at full speed.
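
A comparison like the one above is easier to repeat if any kernel timestamps are stripped from the logs first, since they differ on every boot. The following is only a sketch: the function name strip_ts and the file names in the usage comment are made up for illustration, and it assumes the logs use the usual "[    0.913451] " prefix seen elsewhere in this report.

```shell
# Strip a leading "[   12.345678] " kernel timestamp from each line on stdin
# so that two boot logs can be compared with diff. Lines that carry no
# timestamp pass through unchanged.
strip_ts() {
    sed 's/^\[ *[0-9][0-9]*\.[0-9][0-9]*] //'
}

# Hypothetical usage (file names are placeholders):
#   strip_ts < pci-debug-no-realtek.txt > a.txt
#   strip_ts < pci-debug-realtek.txt    > b.txt
#   diff a.txt b.txt
```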

So now there is nothing to debug and no way to tell what the original problem was.  The only thing I can say for sure is that something in the PCI fixup code was triggering the problem.

I wish that was the end of the story, but it isn't. I can't configure the install the way I need to and I've narrowed the problem down to issues with dracut.  Here's my typical installation:

/dev/md0, ext4, /boot
/dev/md1, lvm, vg0
/dev/vg0/root, ext4
/dev/vg0/swap, swap
/dev/vg0/home, ext4 /home
etc.

I tried scenarios using RAID+LVM and RAID by itself, but couldn't get any of them to work.  A scenario using just ext4 partitions works fine.  In the scenarios using RAID, dracut issues an error about using root=fc00 and either drops to a dracut shell or the kernel panics.  Occasionally I would get a 'junk in compressed archive' message.  I tried various methods of rebuilding the initrd, but the results were always the same.  The only exception was if I built an initrd with everything (about 193M).  That would just hang during the initrd load.

This information applies to 64bit Mageia 3 beta, Mageia 2, 32bit Mageia 2 and Mandriva 2011.  BTW, all the initrd images generated during install had the following output from dracut-gencmdline:

  rd_NO_MD rd_NO_LVM root=/dev/vg0/root

When I rebuilt the initrd image, it had the correct command line, but as noted, still wouldn't boot.
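
A quick way to spot a generated command line like the one above is to test it for the two flags. This is a rough sketch, not part of dracut: the function names and the messages are invented here, and it assumes that rd_NO_MD and rd_NO_LVM are the flags that tell this dracut version to skip MD RAID and LVM setup.

```shell
# Return success if the whitespace-separated command line in $1
# contains the exact word given in $2.
cmdline_has() {
    for word in $1; do
        [ "$word" = "$2" ] && return 0
    done
    return 1
}

# Print one line for each storage layer that the command line switches off.
report_disabled_layers() {
    cmdline_has "$1" rd_NO_MD  && echo "MD RAID assembly disabled"
    cmdline_has "$1" rd_NO_LVM && echo "LVM activation disabled"
    return 0
}

# Example with the command line quoted above:
report_disabled_layers "rd_NO_MD rd_NO_LVM root=/dev/vg0/root"
```

With root on an LVM volume that sits on a RAID device, either line in the output would be a reason to rebuild the initrd.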

Mandriva 2010.1 (pre dracut) works correctly with RAID and RAID+LVM.  Based on that and all the other dracut/raid/lvm related bugs I've seen, I think the problem is either with dracut itself or with how it is being used.  As for the latter, I did try things such as using the --fstab option, but as noted, never could get it to work correctly.

In summary, since the genesis for this bug report has disappeared, this case can be closed.  I have another week or two before I have to deliver this box to the customer. If there's anything you'd like me to try and/or debug, I'll be happy to.  In this particular case, I can use 2010.0 without compromising this particular box's mission, but I'd prefer to be able to use the most currently available software if possible.
Comment 10 Tom Cox 2013-01-06 21:10:43 CET
Created attachment 3322 [details]
PCI debug output with Realtek device
Comment 11 Tom Cox 2013-01-06 21:12:22 CET
Created attachment 3323 [details]
PCI debug output with Realtek device

Attachment 3322 is obsolete: 0 => 1

Comment 12 Tom Cox 2013-01-10 18:08:31 CET
I found the problem I was having when using LVM and RAID with Mageia 2 and Mageia 3 beta.  I think several other reported bugs are probably related to my problem and may need the same fix.

As noted previously, the layout is:

/dev/md0, ext4, /boot
/dev/md1, lvm, vg0
/dev/vg0/root, ext4
/dev/vg0/swap, swap
/dev/vg0/home, ext4 /home

During installation, Mageia 2 or 3 defaults to lilo for the boot loader.  It generates a lilo.conf that sets root to the device string:

root=/dev/vg0/root

When lilo writes the kernel cmdline, it uses the major/minor numbers of the /dev/vg0/root block device in hex.  So 252, 0 becomes:

root=fc00
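
The conversion can be reproduced by hand. The sketch below assumes the classic 16-bit device number layout (major in the high byte, minor in the low byte), which matches the 252, 0 example; the function name devnum_to_hex is made up for illustration.

```shell
# Encode a major/minor pair as described above: major in the high byte,
# minor in the low byte, printed as lowercase hex.
# Assumes both numbers fit in 8 bits.
devnum_to_hex() {
    printf '%x\n' $(( ($1 << 8) | $2 ))
}

devnum_to_hex 252 0   # prints fc00
```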

During the boot from the hard drive, /usr/lib/dracut/hooks/cmdline/95-parse-block.sh gets called to get the root device from the kernel cmdline. It doesn't know how to handle root=fc00, so rootok is not set to one and the boot fails, dropping to a dracut shell.
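
When reading such a command line from the dracut shell, it can help to turn the hex value back into a major:minor pair. This is only an illustration of the decoding, under the same 16-bit layout assumption as before; it is not what 95-parse-block.sh does, and hex_to_devnum is a hypothetical name.

```shell
# Decode a hex device number such as fc00 into "major:minor".
# Assumes the high byte is the major number and the low byte the minor.
hex_to_devnum() {
    printf '%d:%d\n' $(( 0x$1 >> 8 )) $(( 0x$1 & 0xff ))
}

hex_to_devnum fc00   # prints 252:0
```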

There are two fixes for this problem.

1. During installation, change your code to set up lilo.conf using the UUID.
  I used the rescue CD to do the following.  It can probably be done during
  installation when the Reboot screen comes up.

  Ctrl+Alt+F2
  blkid /dev/vg0/root
  Changed lilo.conf to use root="UUID=9499f9f0-3a45-4641-a9ae-94b06ffa65d0"
  lilo -r /mnt (or mount --bind /dev /mnt/dev, chroot /mnt, lilo)
  exit
  reboot

2. Fix 95-parse-block.sh
  Currently there is a section that is not in the dracut git repository.  It
  appears that someone already tried to fix this problem.

   [1-9][0-9][0-9])
        rootdevnum=$root
        root=block:/dev/root
        rootok=1 ;;

  I added the following to process root=fc00:
    [a-f0-9][a-f0-9][a-f0-9][a-f0-9])
        rootdevnum=$root
        root=block:/dev/root
        rootok=1 ;;

  That solved the problem without having to use UUID in lilo.conf.
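
The effect of the two patterns can be tried outside the initrd. The function below is a simplified stand-in, not the real script: classify_root, its printed labels and the first "named" branch are invented for illustration, and only the two numeric patterns are taken from the snippets above.

```shell
# Classify a root= value roughly the way the patched case statement would.
classify_root() {
    case "$1" in
        UUID=*|LABEL=*|/dev/*)
            echo "named" ;;       # illustrative catch-all for named devices
        [1-9][0-9][0-9])
            echo "devnum" ;;      # existing three-digit pattern, e.g. 801
        [a-f0-9][a-f0-9][a-f0-9][a-f0-9])
            echo "devnum" ;;      # added four-hex-digit pattern, e.g. fc00
        *)
            echo "unhandled" ;;
    esac
}

classify_root fc00   # prints devnum
classify_root fc0    # prints unhandled
```

Note that a three-character value containing a letter, such as fc0, would still fall through, so the added pattern covers fc00 but not every possible hex device number.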

For those that are curious, fix two went like this:
  From rescue, mounted partitions then dropped to console.
  mount --bind /dev /mnt/dev
  chroot /mnt
  mkdir initrd.tmp; cd initrd.tmp
  gunzip -c /boot/initrd-3.7.0-desktop-1.mga3.img | cpio -i
  vi usr/lib/dracut/hooks/cmdline/95-parse-block.sh and make changes
  find . |cpio -R 0:0 -H newc -o | gzip -9 > /boot/initrd-3.7.0-desktop-1.mga3.img
  lilo
  exit
  reboot
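
Fix one (pointing lilo.conf at a UUID) could be scripted instead of edited by hand. This is a rough sketch under a few assumptions: set_lilo_root_uuid is a made-up helper, every root= line in the file is meant to be replaced, and the UUID is whatever blkid reported for the root volume. lilo still has to be run afterwards.

```shell
# Rewrite every root= line in a lilo.conf-style file to use a filesystem UUID.
# $1 = path to the config file, $2 = UUID string (for example from blkid).
# The result is written to a temporary file first, so a failed sed
# leaves the original file untouched.
set_lilo_root_uuid() {
    tmp="$1.new.$$"
    sed "s|^\([[:space:]]*\)root=.*|\1root=\"UUID=$2\"|" "$1" > "$tmp" &&
        mv "$tmp" "$1"
}

# Hypothetical usage from the rescue shell, with the root filesystem on /mnt:
#   set_lilo_root_uuid /mnt/etc/lilo.conf 9499f9f0-3a45-4641-a9ae-94b06ffa65d0
#   lilo -r /mnt
```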

There's also a Mageia 2/3 bug that's been reported elsewhere.  From the rescue CD, it's difficult to mount RAID and/or LVM partitions.

In Mageia 2, I could run the rescue mount twice and it would mount everything.  In Mageia 3, it almost works, but not quite.

Pass 1: dm and raid modules are loaded. The raid drives are assembled and scanned, but root is not found and device nodes are not created for the LVM partitions.

Pass 2: Same as before, but in Mageia 2 the dm-? device nodes and the LVM device links are created and the partitions are mounted.  In Mageia 3, only the dm-? device nodes are created and root is not found.

Drop to shell.  Run lvm2 vgmknodes.  Go back to rescue-gui and mount your partitions.  It succeeds this time.
Comment 13 Manuel Hiebel 2013-10-22 12:10:32 CEST
This message is a reminder that Mageia 2 is nearing its end of life.
Approximately one month from now Mageia will stop maintaining and issuing updates for Mageia 2. At that time this bug will be closed as WONTFIX (EOL) if it remains open with a Mageia 'version' of '2'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Mageia version prior to Mageia 2's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Mageia 2 is end of life.  If you would still like to see this bug fixed and are able to reproduce it against a later version of Mageia, you are encouraged to click on "Version" and change it against that version of Mageia.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Mageia release includes newer upstream software that fixes bugs or makes them obsolete.

-- 
The Mageia Bugsquad
Comment 14 Manuel Hiebel 2013-11-23 16:13:59 CET
Mageia 2 changed to end-of-life (EOL) status on 22 November. Mageia 2 is no
longer maintained, which means that it will not receive any further security or
bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Mageia
please feel free to click on "Version", change it against that version of Mageia
and reopen this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

--
The Mageia Bugsquad

Status: NEW => RESOLVED
Resolution: (none) => OLD

