Bug 4772 - boot hangs at 'Starting cpufreq (via systemctl)'
Summary: boot hangs at 'Starting cpufreq (via systemctl)'
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages (show other bugs)
Version: Cauldron
Hardware: x86_64 Linux
Priority: release_blocker critical
Target Milestone: ---
Assignee: Mageia Bug Squad
QA Contact:
URL:
Whiteboard:
Keywords:
: 4826 4984 (view as bug list)
Depends on:
Blocks: 2120
  Show dependency treegraph
 
Reported: 2012-03-02 21:17 CET by Florin Pulbere
Modified: 2012-05-12 12:10 CEST (History)
14 users (show)

See Also:
Source RPM: cpufreq
CVE:
Status comment:


Attachments
1st boot (14.03 KB, text/plain)
2012-03-07 01:51 CET, Florin Pulbere
Details
2nd boot (23.33 KB, text/plain)
2012-03-07 01:52 CET, Florin Pulbere
Details
Output of journalctl (29.83 KB, text/plain)
2012-03-22 14:29 CET, Doug Lytle
Details
bug 4772 2012.03.29 log (786.32 KB, application/x-bzip)
2012-03-29 13:45 CEST, Kamil Rytarowski
Details
cpu freq - infinite (?) loop with an error - photo of a monitor (806.61 KB, image/jpeg)
2012-04-02 17:26 CEST, Kamil Rytarowski
Details

Description Florin Pulbere 2012-03-02 21:17:56 CET
Description of problem:

System stops loading after starting cpufreq. After forcing shutdown the system starts normally.

How reproducible:

1. Install using Beta1 64bit DVD
2. Reboot
3. System stops loading after starting cpufreq service
4. Force shutdown by pressing the power button
5. System starts normaly
Comment 1 Manuel Hiebel 2012-03-06 00:53:20 CET
Are you sure it's cpufreq ?

test with disabling the daemon ? (systemctl disable cpufreq.service )

Keywords: (none) => NEEDINFO
Source RPM: (none) => cpufreq

Manuel Hiebel 2012-03-06 23:50:53 CET

Keywords: NEEDINFO => (none)
CC: (none) => eandry, n54
Blocks: (none) => 2120
Summary: Unable to start all services at first boot. => boot hangs at 'Starting cpufreq (via systemctl)'

Comment 2 Manuel Hiebel 2012-03-06 23:51:12 CET
*** Bug 4826 has been marked as a duplicate of this bug. ***

CC: (none) => herbert

Comment 3 Florin Pulbere 2012-03-07 01:51:17 CET
Created attachment 1691 [details]
1st boot
Comment 4 Florin Pulbere 2012-03-07 01:52:51 CET
Created attachment 1692 [details]
2nd boot
Manuel Hiebel 2012-03-15 14:17:10 CET

Priority: Normal => release_blocker

Comment 5 Manuel Hiebel 2012-03-16 18:10:42 CET
*** Bug 4984 has been marked as a duplicate of this bug. ***

CC: (none) => mageiabugs

Comment 6 Wolfgang Bornath 2012-03-16 18:16:48 CET
I can confirm this with Beta2 x86_64.
Logs are identical.

Difference: From the hanging system at first reboot I could shutdown with Ctrl+Alt+Del.

CC: (none) => molch.b

Manuel Hiebel 2012-03-16 18:20:30 CET

CC: (none) => mageia

Comment 7 Colin Guthrie 2012-03-17 00:27:36 CET
I suspect *something* is going on here, even tho' you think it's hanged.

Some questions:
 1. Does it boot if you just leave it long enough (i.e. >5mins (maybe even 15-20 mins!) This should give all the timeouts time to finish.

 2. Can you switch to tty2 and get a login prompt?
Comment 8 Wolfgang Bornath 2012-03-17 07:37:39 CET
(In reply to comment #7)
> I suspect *something* is going on here, even tho' you think it's hanged.

Yes. When I did Ctrl+Alt+Del to reboot out of that I saw a long list of messages flying by (all of them starting with "Starting...." and "Started ....") so I guess you are right.

> Some questions:
>  1. Does it boot if you just leave it long enough (i.e. >5mins (maybe even
> 15-20 mins!) This should give all the timeouts time to finish.
> 
>  2. Can you switch to tty2 and get a login prompt?

1. I waited almost exactly 10 minutes (according to my wrist watch)

2. No. I checked all of the possible ttys (F2 -> F12).
Comment 9 Randy Wilson 2012-03-18 02:32:15 CET
Having just installed Beta 2 three times, I can confirm that the first boot takes over ten minutes.That's from grub to KDM screen.  The "starting cpufreq" does eventually time out and the boot continues. Second boot is about 45 seconds.

CC: (none) => randy

Comment 10 Remco Rijnders 2012-03-21 07:47:58 CET
Just adding a "me too" on this for Beta 2. On a fresh install, it hangs for 10 minutes on the cpufreq message, then continues the boot process. The problem is gone on subsequent boots.

CC: (none) => remco

Comment 11 Sander Lepik 2012-03-21 18:51:18 CET
But what is causing it?

systemd-analyze blame - what's the output after delayed boot?

CC: (none) => sander.lepik

Comment 12 Colin Guthrie 2012-03-22 12:39:22 CET
Yup, we need to do that first boot and get as much debug about it as possible. Finding out exactly what services caused the delay and time out.

This will include the systemd-analyze blame output, but any other log output would be critical too e.g. it's worth doing systemd-journalctl -a >journal.log too and maybe capturing syslog output as well.

That would be most helpful.
Comment 13 Wolfgang Bornath 2012-03-22 12:48:41 CET
Being a newbie I would need a detailed description what to do before, while and after the first boot to get all info you need.

I will do a fresh installation later this afternoon (standard x86_64 KDE with as many defaults as possible), so the opportunity is there. :)
Comment 14 Colin Guthrie 2012-03-22 12:55:53 CET
@wobo: Well pretty much what I said above.

As the first boot seems to be the bad case here, it could be something we only do once (e.g. mdv firsttime service etc.) or it could just be something that needs to write out some config and exits, but in a way that it doesn't track itself properly... who knows.

But let the first boot happen, leave it to time out (as Remco reported and as I suspected earlier, it's really a matter of leaving it for a long time to let the timeouts happen and then letting the boot continue. It's critical that it is this boot that you let run to it's completion such that the data on the problems experienced then can be collated.

The output from "systemd-analyze blame" is useful, as is the output from "systemd-journalctl -a". Also, "systemctl --failed" output and the subsequent "systemctl status xxx.service" for any services listed in the --failed output. Any other syslog output could also be useful.

That's all I can think of for now - hopefully it'll be enough to point the finger.

Thanks as always wobo :) KUTGW!
Comment 15 Wolfgang Bornath 2012-03-22 13:21:14 CET
Sorry to ask again:

1. I do the installation and reboot. I let the machine do what it wants until the system is completely up.
Then I go to a tty and do the following commands:

systemd-analyze blame > blame.txt
systemd-journalctl -a > journalctl.txt
systemctl --failed > failed.txt
systemctl status [failed services].service >> failed_services.txt
syslog (part before and after the delay)

Is that correct?
Comment 16 Colin Guthrie 2012-03-22 13:24:25 CET
@wobo: Perfect! Thanks.

If I think of anything else I'll post it here (hopefully before you reboot :D)
Comment 17 Wolfgang Bornath 2012-03-22 13:29:15 CET
I should add lspci output to give information about the hardware?
I should wrap all files into a bz2?

Don't hurry, on my schedule for now is a nice after-lunch-nap (don't forget, I'm old & retired. :)
Comment 18 Luc Menut 2012-03-22 13:51:47 CET
quickly for your information,
the "faulty" service could be harddrake in mandrake_everytime; recently, I tried to add
  HARDDRAKE_ONBOOT=no
in /etc/sysconfig/system at the end of the install before the first reboot, and in this case, the first boot was not delayed.
Sadly, I have very few time currently, and I didn't have time to retry on a new install and verify that this is repeatable and not just a coincidence.

CC: (none) => lmenut

Comment 19 Wolfgang Bornath 2012-03-22 14:00:11 CET
Ok, I will do that first before the end of the installation. If it works - is all the rest (as described in comment #15) still needed ?
In other words: is that a possible solution or just a workaround until the real cause is found and fixed?
Comment 20 Luc Menut 2012-03-22 14:10:31 CET
(In reply to comment #19)

> In other words: is that a possible solution or just a workaround until the real
> cause is found and fixed?
Harddrake is probably still usefull and needed, so it's probably only an indication about the service/tool that needs a fix.
Comment 21 Doug Lytle 2012-03-22 14:29:03 CET
Created attachment 1824 [details]
Output of journalctl
Comment 22 Doug Lytle 2012-03-22 14:29:45 CET
I ran the following command:

systemd-analyze blame

command not found

updatedb
locate systemd-analyze returns no results

This is Mageia 2 BETA 2 32bit

The system sat for 11 minutes before continuing


cat failed_services.txt
fedora-l...odules.service loaded failed failed        Load legacy module configuration
Comment 23 Doug Lytle 2012-03-22 14:40:04 CET
journalctl shows:

Mar 22 08:57:34 laptop.dd.inet service_harddrake[663]: running: /etc/rc.d/init.d/cpufreq start
Mar 22 09:07:34 laptop.dd.inet service_harddrake[663]: ERROR: killing runaway process (process=/etc/rc.d/init.d/cpufreq, pid=690, args=start, error=ALARM at /usr/lib/libDrakX/run_program.pm line 92.
Comment 24 Colin Guthrie 2012-03-22 14:46:05 CET
This is quite interesting Doug.

For reference, systemd-analyze is in the systemd-tools package so you'll need to install that first (urpmi systemd-tools).

It's interesting that harddrake itself runs the cpufreq stuff and that it ultimately seems to have it's own watchdog that takes 10 minutes.....

All seems to tie in really.

Will take some time to debug.
Comment 25 Wolfgang Bornath 2012-03-22 15:20:29 CET
Just a "me too" confirmation:

Same message as Doug, so I need not upload an attachment. This really seems to be the spot.
Comment 26 Colin Guthrie 2012-03-24 16:28:43 CET
Hi,

Do you have:
cpufreq_performance

in /etc/modprobe.preload.d/cpufreq.conf?

This is no longer wanted. Seems I commented it out on my machine way back in January....

I'm not sure this is fundamentally the issue, but I'll certainly take a look.
Comment 27 Doug Lytle 2012-03-24 16:54:41 CET
I do have a cpufreq (No .conf) in the /etc/modprobe.preload.d

The contents of which are:

acpi-cpufreq
cpufreq_performance
cpufreq_powersave
cpufreq_conservative
cpufreq_ondemand


Doug
Comment 28 Colin Guthrie 2012-03-24 17:28:36 CET
Yup I'm just looking in drakx code... need to remove the performance one. Not sure it's relevant.
Comment 29 Colin Guthrie 2012-03-24 17:55:40 CET
I need to do a test install, but I'm hoping that simply removing the performance module from the cpufreq.pm module in harddrake will avoid this runaway process problem... that's a bit optimistic perhaps however :)
Comment 30 Kamil Rytarowski 2012-03-29 13:40:39 CEST
I can confirm it. I have installed my system twice.
The first boot: it was hanging 15-20 minutes at the cpufreq message.

I think (I'm not sure), that after 50% of this time, there is appearing '[ok]' and new line. This was just an imporession I had.. maybe wrong.
I'm attaching all the needed files (except blame).
Comment 31 Kamil Rytarowski 2012-03-29 13:45:28 CEST
Created attachment 1883 [details]
bug 4772 2012.03.29 log
Comment 32 Kamil Rytarowski 2012-04-02 17:23:10 CEST
I have reproduced it again a few days ago...
I have succeed to switch to TTY2-4 (I'm not sure what was that), and I saw a loop of errors. I have got a photo of it. And then I switched again to TTY1.. and it was hanging for 6 hours.

I can confirm that the "[ok]" part is in the middle of pause.
Comment 33 Kamil Rytarowski 2012-04-02 17:26:34 CEST
Created attachment 1905 [details]
cpu freq - infinite (?) loop with an error - photo of a monitor
Comment 34 Colin Guthrie 2012-04-03 09:52:57 CEST
OK, I think I worked out what is wrong here... it came to me in a dream :p (I fixed a similar issue in networkmanager yesterday, so my dreams were "primed"!!)

It's basically a deadlock as systemctl waits to ensure it's ordering is correct.

In harddrake (via mandrake_everytime) we're telling it to start a service that is order to start after a target that mandriva_everytime itself starts before... this is not good!!!

I've worked around this by adding the --no-block option to systemctl. I've not had time to test, but this should fix the issue.

I'd really appreciate it if people could test an install once the new initscripts rel = 15) is available to them in order to see if this is resolved.
Comment 35 Thierry Vignaud 2012-04-03 10:14:53 CEST
I've reduced harddrake timeout from 10 to 2 mn in my git

CC: (none) => thierry.vignaud

Comment 36 Dave Hodgins 2012-04-03 18:31:41 CEST
(In reply to comment #35)
> I've reduced harddrake timeout from 10 to 2 mn in my git

As per bug 44, the timeout should be increased, or preferably made
into a config variable that can easily be overridden via a
kernel commandline option.

CC: (none) => davidwhodgins

Comment 37 Thierry Vignaud 2012-04-04 11:20:31 CEST
I don't see the link with bug #44.
harddrake service only run a couple basic stuff at startup.
If they take more than a couple minutes, something is very wrong as seen in this bug report
Comment 38 Dave Hodgins 2012-04-04 18:16:32 CEST
(In reply to comment #37)
> I don't see the link with bug #44.
> harddrake service only run a couple basic stuff at startup.
> If they take more than a couple minutes, something is very wrong as seen in
> this bug report

The 10 minute timeout is hard coded in line 87 of
/usr/lib/libDrakX/run_program.pm.  That's the same code that is used
to run mkinitrd during bootloader-config.  If that is reduced to 2
minutes, people with older i586 systems like mine, won't be able to
install a kernel, even on real hardware, let alone under virtualbox.
Comment 39 Thierry Vignaud 2012-04-05 01:17:44 CEST
I'm very aware of where came this timeout as a drakx manager for +10 years.

If you actually read what I wrote or see what has been commited in SVN,
you'll see that default timeout has _NOT_ been changed.
What has been changed is timeout for commands runned by harddrake service and only those.
Which doesn't include running bootloader-config btw.
Comment 40 AL13N 2012-04-08 01:33:50 CEST
(In reply to comment #34)
> OK, I think I worked out what is wrong here... it came to me in a dream :p (I
> fixed a similar issue in networkmanager yesterday, so my dreams were
> "primed"!!)
> 
> It's basically a deadlock as systemctl waits to ensure it's ordering is
> correct.
> 
> In harddrake (via mandrake_everytime) we're telling it to start a service that
> is order to start after a target that mandriva_everytime itself starts
> before... this is not good!!!
> 
> I've worked around this by adding the --no-block option to systemctl. I've not
> had time to test, but this should fix the issue.
> 
> I'd really appreciate it if people could test an install once the new
> initscripts rel = 15) is available to them in order to see if this is resolved.

i have rel = 17

the first boot takes ages, i'm gonna try running it overnight to see if it ever gets out of it.

is there a place where i can find these cyclic dependencies while in a rescue?

CC: (none) => alien

Comment 41 AL13N 2012-04-08 02:10:08 CEST
well, after 30min still no dice...
Comment 42 AL13N 2012-04-08 02:10:40 CEST
btw: bugzilla time seems to be 2hours extra
Comment 43 Colin Guthrie 2012-04-29 13:26:49 CEST
OK, so I think I can reproduce a similar problem in virtualbox. On first boot, there is a large delay (somewhere around 120s) due to numlock.

I think harddrake does similar things with numlock as it does with cpufreq.

I can confirm that it is numlock at fault but chrooting in to the install before reboot and removing the numlock rpm. Simply disabling it (chkconfig --del) will automatically reenable it again (first time boot works out whether or not it should be enabled).

I'll try and dig further.
Comment 44 Colin Guthrie 2012-04-29 14:48:28 CEST
OK, so it seems my systemd support patches likely reintroduced this problem! Hopefully re-resolved in latest drakxtools.

Again, I'm not 100% sure about the cpufreq issue, but I suspect it's the same thing and should be solved too, but I'll leave it open until we can confirm with RC installs on real systems.
Comment 45 Colin Guthrie 2012-05-10 00:08:54 CEST
So final RC is out. Can people experiencing this issue please check it with the RC? I'm hoping it's resolved...
Comment 46 isadora 2012-05-10 08:23:06 CEST
Just installed Mageia 2 RC into VMPlayer virtual machine.
During boot following messages passed by:

Failed to start Load legacy module configuration      [FAILED]

CC: (none) => isis2000

Comment 47 Colin Guthrie 2012-05-10 12:32:24 CEST
(In reply to comment #46)
> Just installed Mageia 2 RC into VMPlayer virtual machine.
> During boot following messages passed by:
> 
> Failed to start Load legacy module configuration      [FAILED]

Completely unrelated and also harmless (it just means your modprobe.preload* files have some strange things in it, but it's not a big problem overall. It's likely that the modules are just already loaded via udev and thus don't need to be manually loaded.

The bigger questions is whether or not you saw *this* bug?
Comment 48 isadora 2012-05-10 13:09:32 CEST
(In reply to comment #47)
> (In reply to comment #46)
> > Just installed Mageia 2 RC into VMPlayer virtual machine.
> > During boot following messages passed by:
> > 
> > Failed to start Load legacy module configuration      [FAILED]
> 
> Completely unrelated and also harmless (it just means your modprobe.preload*
> files have some strange things in it, but it's not a big problem overall. It's
> likely that the modules are just already loaded via udev and thus don't need to
> be manually loaded.
> 
> The bigger questions is whether or not you saw *this* bug?

Thanks for your reply Colin, and apologies for disturbing this thread.
Answering your question: no i didn't see *this* bug ;)
Comment 49 Wolfgang Bornath 2012-05-10 17:57:09 CEST
x86_64:
Tested KDE LiveCD and installation DVD - bug not showing.
Comment 50 Kamil Rytarowski 2012-05-10 22:56:45 CEST
I have not got this bug anymore.
Comment 51 Remco Rijnders 2012-05-12 12:10:51 CEST
Thanks all for testing. Colin, thanks for the time spent on this. On my netbook, which also had this problem, a fresh install of the RC no longer had this problem. Merci!

Status: NEW => RESOLVED
Resolution: (none) => FIXED


Note You need to log in before you can comment on or make changes to this bug.