Bug 5262 - prefdm.service does not start on boot (ordering cycles due to lack of LSB headers in sysvinit scripts)
Summary: prefdm.service does not start on boot (ordering cycles due to lack of LSB hea...
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages (show other bugs)
Version: Cauldron
Hardware: x86_64 Linux
Priority: Normal normal
Target Milestone: ---
Assignee: Colin Guthrie
QA Contact:
URL:
Whiteboard:
Keywords:
: 5298 6882 (view as bug list)
Depends on: 5148
Blocks: 2120
  Show dependency treegraph
 
Reported: 2012-04-06 23:55 CEST by Colin Guthrie
Modified: 2012-08-03 17:57 CEST (History)
6 users (show)

See Also:
Source RPM: systemd
CVE:
Status comment:


Attachments
Martin's systemctl dump output (430.99 KB, text/plain)
2012-04-08 20:30 CEST, Colin Guthrie
Details
rinetd sysv init script (1.28 KB, text/plain)
2012-04-08 21:39 CEST, Martin Whitaker
Details
dmesg output with rinetd and ativeventsd both enabled (122.19 KB, text/plain)
2012-04-08 22:44 CEST, Martin Whitaker
Details
Copy of /var/log/dmesg (122.33 KB, text/plain)
2012-04-08 23:22 CEST, Martin Whitaker
Details
output from dmesg command (122.19 KB, application/octet-stream)
2012-04-08 23:26 CEST, Martin Whitaker
Details
dmesg, with "systemd.log_level=debug systemd.log_target=kmsg" (122.57 KB, text/plain)
2012-04-11 03:44 CEST, Anssi Hannula
Details
systemctl dump (786.42 KB, text/plain)
2012-04-11 03:46 CEST, Anssi Hannula
Details
Frank's dmesg output (55.24 KB, text/plain)
2012-05-17 15:33 CEST, Frank Griffin
Details
Frank's /var/log/dmesg (52.25 KB, text/plain)
2012-05-17 15:34 CEST, Frank Griffin
Details
Frank's dmesg output (121.84 KB, text/plain)
2012-05-17 15:49 CEST, Frank Griffin
Details
Frank's /var/log/dmesg (122.27 KB, text/plain)
2012-05-17 15:50 CEST, Frank Griffin
Details

Description Colin Guthrie 2012-04-06 23:55:30 CEST
+++ This bug was initially created as a clone of Bug #5148 +++

On some systems, it seems that prefdm.service never starts.
Comment 1 Colin Guthrie 2012-04-06 23:56:35 CEST
Please see the attachments from bug #5148

systemctl show dm.service: https://bugs.mageia.org/attachment.cgi?id=1936
systemctl --all --full: https://bugs.mageia.org/attachment.cgi?id=1937
Colin Guthrie 2012-04-06 23:57:17 CEST

CC: (none) => anssi.hannula
Assignee: bugsquad => mageia

Comment 2 Colin Guthrie 2012-04-08 15:14:34 CEST
I have my suspicions about which patch broke this, please can one of you ping me when you have a moment to test and I'll provide a testing package.
Comment 3 Colin Guthrie 2012-04-08 20:13:29 CEST
OK, we've found the culprit! That pesky ATI driver!

  systemctl disable atieventsd.service 

solves the problem and allows us to boot fine.

Now to work out why (it should start after prefdm. I'm guessing it's also supposed to start before something that prefdm relies on and thus blocks it from starting).
Comment 4 Colin Guthrie 2012-04-08 20:30:56 CEST
Created attachment 1947 [details]
Martin's systemctl dump output
Comment 5 Colin Guthrie 2012-04-08 21:02:02 CEST
OK, so this is fun!

I've still not really worked out why this is.

As this service is configured to run at runlevel5, this translates in a systemd world to "WantedBy=graphical.target".

This inherently means that systemd will start this service in order to get to graphical.target, which means that it is ordered Before=graphical.target.

Now, in the LSB headers, it has "Required-Start: dm". This means that it has to start After=prefdm.service in systemd (the translation is not a hard requires (aka Wants) as sysvinit did not enforce this)

As prefdm is part of graphical.target, it should be perfectly possible to start atieventsd.service and graphical.service. This shouldn't be a problem.And add to this, I installed this package on a test VM. It didn't fail. Both prefdm and atieventd started fine (the latter failed due to t me not having the h/w but that's expected and seems also to be what happens in Martin's case.

So looking at it closer, it seems on Martin's machine there is also an rinetd.service. This is not something supplied in mga (perhaps some leftover from Mandriva days? There is some kind of ghost in the SVN... but no actual package).

Anyway, this service seems to want to start Before: multi-user.target, but keep in mind atieventd also has: Before: rinetd.service. Sadly, even this shouldn't be a problem. graphical.target comes after multi-user.target but this shouldn't prevent everything starting.

So all in all, I'm still not quite "seeing" the problem.



@Martin can you attach your rinetd sysinit script for reference? Can you try re-enabling atievents and disabling rinetd and see if that also works?


@Anssi can you attach your "systemctl dump". As you do not have atievents installed, it cannot be the overall culprit here. Something else must be going on too, and it just so happens that fiddling with the chain in Martin's case worked around the problem.
Comment 6 Martin Whitaker 2012-04-08 21:39:11 CEST
Created attachment 1948 [details]
rinetd sysv init script

I'm using an old RPM to install rinetd as it has not been available in Mageia or Mandriva for a very long time. I keep meaning to update it - but up until now it's worked fine as is, so the incentive hasn't been there...

Disabling the rinetd service and re-enabling atieventsd also works. So on my machine it does seem to be an interaction between these two that is causing the problem.

I do have ATI hardware, and am using the fglrx driver (because the free radeon driver seems to have abandoned any form of power control :-( ), so I assume atieventsd shouldn't fail.
Comment 7 Colin Guthrie 2012-04-08 22:09:38 CEST
Thanks for all your help today Martin.

I'm still not 100% sure why things are failing tho'.

Could you do one last test for me if it's not too much to ask?

Re-enable rinetd such that the problem occurs again and add the following to the kernel command line (as a one-off is OK):

systemd.log_level=debug systemd.log_target=kmsg

Then boot normally, and when things fail, wait for the jobs list to be empty and grab a copy of dmesg output. This should list all the jobs from the start of boot and deal with conflicts etc. I'm hoping this will show explicitly why the job was cancelled.

If the dmesg has been cut off, then the appropriate chunk from /var/log/messages would be good instead.
Comment 8 Martin Whitaker 2012-04-08 22:44:47 CEST
Created attachment 1949 [details]
dmesg output with rinetd and ativeventsd both enabled

Can't see anything very helpful in here. I'll have a look in the other logs.
Comment 9 Colin Guthrie 2012-04-08 23:04:18 CEST
The interesting bits are likely in /var/log/dmesg.
Comment 10 Martin Whitaker 2012-04-08 23:22:24 CEST
Created attachment 1950 [details]
Copy of /var/log/dmesg
Comment 11 Martin Whitaker 2012-04-08 23:26:03 CEST
Created attachment 1951 [details]
output from dmesg command

I had rebooted since attaching the original dmesg output. This version goes with the copy of /var/log/dmesg.

Attachment 1949 is obsolete: 0 => 1

Comment 12 Colin Guthrie 2012-04-09 00:44:39 CEST
Yup, interesting bits indeed:

[    7.203540] systemd[1]: Found ordering cycle on prefdm.service/start
[    7.209944] systemd[1]: Walked on cycle path to plymouth-quit.service/stop
[    7.216330] systemd[1]: Walked on cycle path to rc-local.service/start
[    7.222546] systemd[1]: Walked on cycle path to rinetd.service/start
[    7.228624] systemd[1]: Walked on cycle path to atieventsd.service/start
[    7.234617] systemd[1]: Walked on cycle path to prefdm.service/start
[    7.240501] systemd[1]: Breaking ordering cycle by deleting job plymouth-quit.service/stop
[    7.246492] systemd[1]: Deleting job prefdm.service/start as dependency of job plymouth-quit.service/stop
[    7.246501] systemd[1]: Found ordering cycle on prefdm.service/stop
[    7.252433] systemd[1]: Walked on cycle path to getty@tty1.service/start
[    7.258347] systemd[1]: Walked on cycle path to plymouth-quit-wait.service/start
[    7.264249] systemd[1]: Walked on cycle path to rc-local.service/start
[    7.270058] systemd[1]: Walked on cycle path to rinetd.service/start
[    7.275704] systemd[1]: Walked on cycle path to atieventsd.service/start
[    7.281236] systemd[1]: Walked on cycle path to prefdm.service/stop
[    7.286608] systemd[1]: Breaking ordering cycle by deleting job getty@tty1.service/start

Now What I'm not sure about is *why* this is an ordering cycle. I don't see the cycle :s

Anssi will likely get different results here but probably something similar will show up.

I think certain bits could be better as e.g. plymouth-quit.service should be fully conflicted here.. not sure why prefdm.service/start was dropped as a dep... seems the wrong way round.
Comment 13 Colin Guthrie 2012-04-09 04:45:18 CEST
I have replicated this deadlock on my VM. I've also got a small fix in systemd that works around the problem, but I still fear both the logging and feedback when such drastic action is taken (dropping of jobs) is seriously lacking.

This is something I've taken up upstream and will try and get a resolution promptly.
Comment 14 Manuel Hiebel 2012-04-09 09:58:52 CEST
*** Bug 5298 has been marked as a duplicate of this bug. ***

CC: (none) => herbert

Comment 15 Anssi Hannula 2012-04-11 03:44:59 CEST
Created attachment 1968 [details]
dmesg, with "systemd.log_level=debug systemd.log_target=kmsg"

As requested, /var/log/dmesg after boot with "systemd.log_level=debug systemd.log_target=kmsg".
Comment 16 Anssi Hannula 2012-04-11 03:46:08 CEST
Created attachment 1969 [details]
systemctl dump
Comment 17 Colin Guthrie 2012-04-11 10:53:15 CEST
@Anssi, that's great! The problem is clearly shown there>

[   23.249958] systemd[1]: Activating default unit: default.target
[   23.249964] systemd[1]: Trying to enqueue job graphical.target/start/replace
[   23.250161] systemd[1]: Found ordering cycle on basic.target/start
[   23.250239] systemd[1]: Walked on cycle path to sockets.target/start
[   23.250315] systemd[1]: Walked on cycle path to syslog.socket/start
[   23.250391] systemd[1]: Walked on cycle path to basic.target/start
[   23.250466] systemd[1]: Breaking ordering cycle by deleting job syslog.socket/start

This one is interesting, but I'll look at that one separately. We have a patch that should prevent this...

I wonder if it's perhaps not needed now:
0504-hack-Fix-syslog.socket-to-not-cause-a-deadlock.patch

I'll do some experiments to see.

But this is not the real problem:

[   23.250562] systemd[1]: Found ordering cycle on prefdm.service/start
[   23.250638] systemd[1]: Walked on cycle path to getty@tty1.service/stop
[   23.250714] systemd[1]: Walked on cycle path to plymouth-quit-wait.service/start
[   23.250806] systemd[1]: Walked on cycle path to rc-local.service/start
[   23.250882] systemd[1]: Walked on cycle path to pure-ftpd.service/start
[   23.250958] systemd[1]: Walked on cycle path to preload.service/start
[   23.251073] systemd[1]: Walked on cycle path to prefdm.service/start
[   23.251149] systemd[1]: Breaking ordering cycle by deleting job getty@tty1.service/stop
[   23.251242] systemd[1]: Deleting job prefdm.service/start as dependency of job getty@tty1.service/stop

prefdm.service/start is deleted here as a dep, even although it's a a direct part of the cycle itself (and actually the job that caused the cycle in the first place!) and thus should be excluded from deletion (at least that's my proposed solution). It should at very least be excluded if this' the job on which the ordering cycle itself is found!

Now there shouldn't be ordering cycles anyway, so in your case, it's pure-ftpd that's somewhat getting in the way... (much like rinetd was for Martin). I'll need to analyse things to see *why* these apparently innocuous and simple initscripts somehow cause such loops. They should be fixed at source, but by the same token, I want to make how systemd deals with this problem more graceful anyway.

I've proposed several solutions upstream, including logging cancelled jobs in the journal and showing the job history in "systemctl status prefdm.service" output etc.

Anyway, this is very useful :)
Comment 18 Colin Guthrie 2012-04-11 11:03:00 CEST
@Anssi, I think this could be due to the lack of LSB headers in the pure-ftpd init script, and thus lots of very wide-reaching deps have been added to it.

As an experiment, could you add LSB data to:
http://svnweb.mageia.org/packages/cauldron/pure-ftpd/current/SOURCES/pure-ftpd.init?view=markup

(just on your installed system. Perhaps copy them from: http://svnweb.mageia.org/packages/cauldron/proftpd/current/SOURCES/proftpd.init?view=markup)

and see if that avoids the ordering loop?


@Martin, likewise, could you please try looking at adding some LSB headers to the rinetd initscript? Just an almost empty LSB block could be enough here (I will try in my VM when I get a moment) to prevent it getting loads of deps "by default".
Comment 19 Colin Guthrie 2012-04-11 11:32:47 CEST
Yup, looking at the code, (src/service.c@448) If the script has an LSB header, systemd does not care about the priorities defined by sysvinit scripts (startup priority is read via the chkconfig: line by default, but can be overridden by the numbers in the S??script symlinks in rc?.d), and relies solely on the LSB data.

We need to purge all scripts without LSB headers from our system.
Comment 20 Colin Guthrie 2012-04-12 11:57:41 CEST
@Anssi, I've just submitted pure-ftpd with added LSB. Can you test that once it's built and confirm that it solves the problem in your case?

Status: NEW => ASSIGNED

Comment 21 Martin Whitaker 2012-04-12 20:46:32 CEST
@Colin, yes, adding a LSB header to the rinetd initscript avoids the ordering loop. Guess I'm finally going to have to get round to updating the rinetd package.
Comment 22 Colin Guthrie 2012-04-30 01:34:06 CEST
A lot of packages are now fixed for this with only about 15 left to sort out. So nearly done.
Comment 23 Frank Griffin 2012-05-17 15:01:49 CEST
Ping ?

I still see this in current cauldron.

CC: (none) => ftg

Comment 24 Colin Guthrie 2012-05-17 15:06:52 CEST
Hmm really?

So on a fresh boot, you can do: "systemctl show prefdm.service | grep ActiveEnterTimestamp" and it shows the values of 0 for both matching lines?

If there is a value here, then it's not the same bug.
Comment 25 Frank Griffin 2012-05-17 15:10:04 CEST
Do you want that before or after I manually start dm from a getty ?
Comment 26 Colin Guthrie 2012-05-17 15:15:06 CEST
Before.

And if it does give you zeros, can you reboot and add the following to the kernel command line:

systemd.log_level=debug systemd.log_target=kmsg

Then after boot do a "dmesg > dmesg.log2" and attach both /var/log/dmesg.log and the dmesg.log2 files (you can also combine (trim - there will be some duplicated lines) them into one file if you like... the only reason to grab the /var/log/dmesg.log file is if the start of the dmesg buffer has been lost (it's a circular buffer) to get a complete picture of the the boot process).

Hope that makes sense :D
Comment 27 Frank Griffin 2012-05-17 15:24:41 CEST
Doing this once the GUI is up gives two matching lines both with nonzero values.  Doing it from a tty before gives only one matching line (ActiveEnterTimestampMonotonic=0).

The last thing I see on the systemd VC is:
  Failed to start Wait for Plumouth Boot Screen to Quit
  See systemctl status plymouth-quit wait.service for details

Doing that from the tty gives:

plymouth-quit-wait.service - Wait for Plymouth Boot Screen to Quit
          Loaded: loaded (/lib/systemd/system/plymouth-quit-wait.service; static)
          Active: failed (Result: timeout) since Thu, 17 May 2012 09:16:29 -0400; 2min 36s ago
        Main PID: 10981
          CGroup: name=systemd:/system/plymouth-quit-wait.service

I'll try the other test now.
Comment 28 Frank Griffin 2012-05-17 15:33:36 CEST
Created attachment 2327 [details]
Frank's dmesg output
Comment 29 Colin Guthrie 2012-05-17 15:34:21 CEST
Yup, it's the values before starting the GUI that matter hear.

OK the other test will be useful, but in addition to that can you also double check that you are starting graphical.target:

ls -l /etc/systemd/system/default.target

(/me is keen to get this bug fixed as it's the last one in the tracker :D)
Comment 30 Frank Griffin 2012-05-17 15:34:45 CEST
Created attachment 2328 [details]
Frank's /var/log/dmesg
Comment 31 Colin Guthrie 2012-05-17 15:36:39 CEST
(In reply to comment #28)
> Created attachment 2327 [details]
> Frank's dmesg output

Hmmm, I don't see any systemd debug in there and it looks like you did not pass the kernel command line arguments at the grub prompt as mentioned in comment 26. Can you retry with those arguments passed in? Thanks.
Comment 32 Frank Griffin 2012-05-17 15:41:35 CEST
[root@ftgme2 ftg]# ls -l /etc/systemd/system/default.target lrwxrwxrwx 1 root root 36 Mar 20 12:47 /etc/systemd/system/default.target -> /lib/systemd/system/runlevel5.target [root@ftgme2 ftg]# 

As for the kernel parms, you're right.  I pasted them into menu.lst and ran ./install.sh, but they're not there now.  FM.  I'll try again.
Comment 33 Colin Guthrie 2012-05-17 15:43:10 CEST
(In reply to comment #32)
> [root@ftgme2 ftg]# ls -l /etc/systemd/system/default.target lrwxrwxrwx 1 root
> root 36 Mar 20 12:47 /etc/systemd/system/default.target ->
> /lib/systemd/system/runlevel5.target

OK, that's all correct. Thanks for confirming.

> As for the kernel parms, you're right.  I pasted them into menu.lst and ran
> ./install.sh, but they're not there now.  FM.  I'll try again.

You can just add them ad-hoc at boot in grub. Just hit escape, then e and e again, type in the change, then hit b to boot it.
Comment 34 Frank Griffin 2012-05-17 15:49:15 CEST
Created attachment 2331 [details]
Frank's dmesg output

Attachment 2327 is obsolete: 0 => 1

Colin Guthrie 2012-05-17 15:50:21 CEST

Attachment 2331 mime type: application/octet-stream => text/plain

Comment 35 Frank Griffin 2012-05-17 15:50:41 CEST
Created attachment 2333 [details]
Frank's /var/log/dmesg

OK, try these.

Attachment 2328 is obsolete: 0 => 1

Comment 36 Colin Guthrie 2012-05-17 15:57:10 CEST
Yup, that latter one clearly shows the problem:


[    7.250153] systemd[1]: Found ordering cycle on prefdm.service/start
[    7.250156] systemd[1]: Walked on cycle path to plymouth-quit.service/stop
[    7.250158] systemd[1]: Walked on cycle path to rc-local.service/start
[    7.250160] systemd[1]: Walked on cycle path to oracle-xe.service/start
[    7.250163] systemd[1]: Walked on cycle path to atieventsd.service/start
[    7.250165] systemd[1]: Walked on cycle path to prefdm.service/start
[    7.250167] systemd[1]: Breaking ordering cycle by deleting job plymouth-quit.service/stop
[    7.250170] systemd[1]: Deleting job prefdm.service/start as dependency of job plymouth-quit.service/stop
[    7.250177] systemd[1]: Found ordering cycle on prefdm.service/stop
[    7.250179] systemd[1]: Walked on cycle path to getty@tty1.service/start
[    7.250181] systemd[1]: Walked on cycle path to rc-local.service/start
[    7.250183] systemd[1]: Walked on cycle path to oracle-xe.service/start
[    7.250185] systemd[1]: Walked on cycle path to atieventsd.service/start
[    7.250187] systemd[1]: Walked on cycle path to prefdm.service/stop
[    7.250190] systemd[1]: Breaking ordering cycle by deleting job getty@tty1.service/start


In this case I suspect strongly that it's the oracle-xe initscript that is causing this problem. Chances are it does not have a proper LSB header and thus we can *only* rely on the symlink S number (e.g. the file /etc/rc5.d/S??oracle-xe where the ?? is the start sequence number.

The correct fix here is to add an LSB header to this initscript. As it's a third part script and not something shipped by Mageia (urpmf doesn't find it), there is little we can do about this.



Can you confirm that disabling this service (systemctl disable oracle-xe.service) allows boot to work as expected?
Comment 37 Frank Griffin 2012-05-17 16:02:22 CEST
I'll do that.  What LSB header(s) should be added ?
Comment 38 Colin Guthrie 2012-05-17 16:05:14 CEST
You only need to add enough headers to properly tell systemd what that services dependances are. e.g. if it needs network or whatever.

Have a look in the other scripts in the /etc/init.d/ folder for inspiration. All you really need is to have a comment block near the top saying:


### BEGIN INIT INFO
# Provides: oracle-xe
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Required-Start:
# Required-Stop:
# Short-Description: Whatever
# Description: Whatever Whatever Whatever Whatever Whatever Whatever
#       Whatever Whatever Whatever Whatever Whatever Whatever Whatever
#       Whatever Whatever Whatever Whatever Whatever Whatever Whatever
### END INIT INFO


If it needs e.g. network, then put $network in the Required-Start/Stop lines.
Comment 39 Frank Griffin 2012-05-17 16:08:22 CEST
That did it, thanks.  I'll try it with the LSB comments now.
Comment 40 Colin Guthrie 2012-05-17 16:13:54 CEST
Great, thanks for confirming. Feel free to keep commenting on this bug with your progress, but as we've confirmed that it's an external initscript that's causing issues and as all the initscrpts we provide have either been replaced by native units or had LSB headers added, I think we can close this bug.

Status: ASSIGNED => RESOLVED
Resolution: (none) => FIXED
Summary: prefdm.service does not start on boot. => prefdm.service does not start on boot (ordering cycles due to lack of LSB headers in sysvinit scripts)

Comment 41 Frank Griffin 2012-05-17 16:21:50 CEST
The comments do the trick as well.

Does upstream consider this a bug ?  Because the LSB comments reall have nothing specific to the service in them other than the Provides, and systemd could assume that as a default by itself....
Comment 42 Colin Guthrie 2012-05-17 17:09:48 CEST
(In reply to comment #41)
> The comments do the trick as well.

Glad to hear.

> Does upstream consider this a bug ?  Because the LSB comments reall have
> nothing specific to the service in them other than the Provides, and systemd
> could assume that as a default by itself....

No this isn't really a bug, tho' upstream is open to suggestions on how to handle ordering cycles better (even if most of the suggestions have been discussed already).

So to explain there are three types of "init scripts", sysv, lsb and native systemd units. In the case of lsb and native units the ordering of things (e.g. the fact that a given services needs another to be available) is explicitly stated. With sysvinit, all we have is an "order" as specified by the S??* symlinks.

Orders are horrible to work with and if we're honest much of the ordering information is pretty buggy by design anyway (there is no guarantee that services were really ready even if they were started before).

So systemd does try to use the ordering information to gauge when it should be started, but it often results in some impossible ordering cycle (if you remove the LSB header, do a systemctl --systemd daemon-reload and then a systemctl show oracle-xe.service you'll see the Before and After values that dictate ordering).

The same fundamental deadlock could occur with native units... e.g. 

serviceA.service: Before=serviceB.service
serviceB.service: Before=serviceA.service

This is an impossible situation. This kind of unsatisfiable situation wasn't possible to achieve with non-lsb sysvinit scripts as the worst you could do was set them to start with the same ordering value eg. S10serviceA S10serviceB and both will be started but in a somewhat non-deterministic order.

So in many cases it's just fundamentally hard to resolve the ordered services with dependency services.

The full and complete solution is just to do away completely with ordered services and then all will be well!

Hope that helps explain things.
Comment 43 Frank Griffin 2012-08-03 17:57:05 CEST
*** Bug 6882 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.