+++ This bug was initially created as a clone of Bug #5148 +++ On some systems, it seems that prefdm.service never starts.
Please see the attachments from bug #5148 systemctl show dm.service: https://bugs.mageia.org/attachment.cgi?id=1936 systemctl --all --full: https://bugs.mageia.org/attachment.cgi?id=1937
CC: (none) => anssi.hannulaAssignee: bugsquad => mageia
I have my suspicions about which patch broke this, please can one of you ping me when you have a moment to test and I'll provide a testing package.
OK, we've found the culprit! That pesky ATI driver! systemctl disable atieventsd.service solves the problem and allows us to boot fine. Now to work out why (it should start after prefdm. I'm guessing it's also supposed to start before something that prefdm relies on and thus blocks it from starting).
Created attachment 1947 [details] Martin's systemctl dump output
OK, so this is fun! I've still not really worked out why this is. As this service is configured to run at runlevel5, this translates in a systemd world to "WantedBy=graphical.target". This inherently means that systemd will start this service in order to get to graphical.target, which means that it is ordered Before=graphical.target. Now, in the LSB headers, it has "Required-Start: dm". This means that it has to start After=prefdm.service in systemd (the translation is not a hard requires (aka Wants) as sysvinit did not enforce this) As prefdm is part of graphical.target, it should be perfectly possible to start atieventsd.service and graphical.service. This shouldn't be a problem.And add to this, I installed this package on a test VM. It didn't fail. Both prefdm and atieventd started fine (the latter failed due to t me not having the h/w but that's expected and seems also to be what happens in Martin's case. So looking at it closer, it seems on Martin's machine there is also an rinetd.service. This is not something supplied in mga (perhaps some leftover from Mandriva days? There is some kind of ghost in the SVN... but no actual package). Anyway, this service seems to want to start Before: multi-user.target, but keep in mind atieventd also has: Before: rinetd.service. Sadly, even this shouldn't be a problem. graphical.target comes after multi-user.target but this shouldn't prevent everything starting. So all in all, I'm still not quite "seeing" the problem. @Martin can you attach your rinetd sysinit script for reference? Can you try re-enabling atievents and disabling rinetd and see if that also works? @Anssi can you attach your "systemctl dump". As you do not have atievents installed, it cannot be the overall culprit here. Something else must be going on too, and it just so happens that fiddling with the chain in Martin's case worked around the problem.
Created attachment 1948 [details] rinetd sysv init script I'm using an old RPM to install rinetd as it has not been available in Mageia or Mandriva for a very long time. I keep meaning to update it - but up until now it's worked fine as is, so the incentive hasn't been there... Disabling the rinetd service and re-enabling atieventsd also works. So on my machine it does seem to be an interaction between these two that is causing the problem. I do have ATI hardware, and am using the fglrx driver (because the free radeon driver seems to have abandoned any form of power control :-( ), so I assume atieventsd shouldn't fail.
Thanks for all your help today Martin. I'm still not 100% sure why things are failing tho'. Could you do one last test for me if it's not too much to ask? Re-enable rinetd such that the problem occurs again and add the following to the kernel command line (as a one-off is OK): systemd.log_level=debug systemd.log_target=kmsg Then boot normally, and when things fail, wait for the jobs list to be empty and grab a copy of dmesg output. This should list all the jobs from the start of boot and deal with conflicts etc. I'm hoping this will show explicitly why the job was cancelled. If the dmesg has been cut off, then the appropriate chunk from /var/log/messages would be good instead.
Created attachment 1949 [details] dmesg output with rinetd and ativeventsd both enabled Can't see anything very helpful in here. I'll have a look in the other logs.
The interesting bits are likely in /var/log/dmesg.
Created attachment 1950 [details] Copy of /var/log/dmesg
Created attachment 1951 [details] output from dmesg command I had rebooted since attaching the original dmesg output. This version goes with the copy of /var/log/dmesg.
Attachment 1949 is obsolete: 0 => 1
Yup, interesting bits indeed: [ 7.203540] systemd[1]: Found ordering cycle on prefdm.service/start [ 7.209944] systemd[1]: Walked on cycle path to plymouth-quit.service/stop [ 7.216330] systemd[1]: Walked on cycle path to rc-local.service/start [ 7.222546] systemd[1]: Walked on cycle path to rinetd.service/start [ 7.228624] systemd[1]: Walked on cycle path to atieventsd.service/start [ 7.234617] systemd[1]: Walked on cycle path to prefdm.service/start [ 7.240501] systemd[1]: Breaking ordering cycle by deleting job plymouth-quit.service/stop [ 7.246492] systemd[1]: Deleting job prefdm.service/start as dependency of job plymouth-quit.service/stop [ 7.246501] systemd[1]: Found ordering cycle on prefdm.service/stop [ 7.252433] systemd[1]: Walked on cycle path to getty@tty1.service/start [ 7.258347] systemd[1]: Walked on cycle path to plymouth-quit-wait.service/start [ 7.264249] systemd[1]: Walked on cycle path to rc-local.service/start [ 7.270058] systemd[1]: Walked on cycle path to rinetd.service/start [ 7.275704] systemd[1]: Walked on cycle path to atieventsd.service/start [ 7.281236] systemd[1]: Walked on cycle path to prefdm.service/stop [ 7.286608] systemd[1]: Breaking ordering cycle by deleting job getty@tty1.service/start Now What I'm not sure about is *why* this is an ordering cycle. I don't see the cycle :s Anssi will likely get different results here but probably something similar will show up. I think certain bits could be better as e.g. plymouth-quit.service should be fully conflicted here.. not sure why prefdm.service/start was dropped as a dep... seems the wrong way round.
I have replicated this deadlock on my VM. I've also got a small fix in systemd that works around the problem, but I still fear both the logging and feedback when such drastic action is taken (dropping of jobs) is seriously lacking. This is something I've taken up upstream and will try and get a resolution promptly.
*** Bug 5298 has been marked as a duplicate of this bug. ***
CC: (none) => herbert
Created attachment 1968 [details] dmesg, with "systemd.log_level=debug systemd.log_target=kmsg" As requested, /var/log/dmesg after boot with "systemd.log_level=debug systemd.log_target=kmsg".
Created attachment 1969 [details] systemctl dump
@Anssi, that's great! The problem is clearly shown there> [ 23.249958] systemd[1]: Activating default unit: default.target [ 23.249964] systemd[1]: Trying to enqueue job graphical.target/start/replace [ 23.250161] systemd[1]: Found ordering cycle on basic.target/start [ 23.250239] systemd[1]: Walked on cycle path to sockets.target/start [ 23.250315] systemd[1]: Walked on cycle path to syslog.socket/start [ 23.250391] systemd[1]: Walked on cycle path to basic.target/start [ 23.250466] systemd[1]: Breaking ordering cycle by deleting job syslog.socket/start This one is interesting, but I'll look at that one separately. We have a patch that should prevent this... I wonder if it's perhaps not needed now: 0504-hack-Fix-syslog.socket-to-not-cause-a-deadlock.patch I'll do some experiments to see. But this is not the real problem: [ 23.250562] systemd[1]: Found ordering cycle on prefdm.service/start [ 23.250638] systemd[1]: Walked on cycle path to getty@tty1.service/stop [ 23.250714] systemd[1]: Walked on cycle path to plymouth-quit-wait.service/start [ 23.250806] systemd[1]: Walked on cycle path to rc-local.service/start [ 23.250882] systemd[1]: Walked on cycle path to pure-ftpd.service/start [ 23.250958] systemd[1]: Walked on cycle path to preload.service/start [ 23.251073] systemd[1]: Walked on cycle path to prefdm.service/start [ 23.251149] systemd[1]: Breaking ordering cycle by deleting job getty@tty1.service/stop [ 23.251242] systemd[1]: Deleting job prefdm.service/start as dependency of job getty@tty1.service/stop prefdm.service/start is deleted here as a dep, even although it's a a direct part of the cycle itself (and actually the job that caused the cycle in the first place!) and thus should be excluded from deletion (at least that's my proposed solution). It should at very least be excluded if this' the job on which the ordering cycle itself is found! Now there shouldn't be ordering cycles anyway, so in your case, it's pure-ftpd that's somewhat getting in the way... (much like rinetd was for Martin). I'll need to analyse things to see *why* these apparently innocuous and simple initscripts somehow cause such loops. They should be fixed at source, but by the same token, I want to make how systemd deals with this problem more graceful anyway. I've proposed several solutions upstream, including logging cancelled jobs in the journal and showing the job history in "systemctl status prefdm.service" output etc. Anyway, this is very useful :)
@Anssi, I think this could be due to the lack of LSB headers in the pure-ftpd init script, and thus lots of very wide-reaching deps have been added to it. As an experiment, could you add LSB data to: http://svnweb.mageia.org/packages/cauldron/pure-ftpd/current/SOURCES/pure-ftpd.init?view=markup (just on your installed system. Perhaps copy them from: http://svnweb.mageia.org/packages/cauldron/proftpd/current/SOURCES/proftpd.init?view=markup) and see if that avoids the ordering loop? @Martin, likewise, could you please try looking at adding some LSB headers to the rinetd initscript? Just an almost empty LSB block could be enough here (I will try in my VM when I get a moment) to prevent it getting loads of deps "by default".
Yup, looking at the code, (src/service.c@448) If the script has an LSB header, systemd does not care about the priorities defined by sysvinit scripts (startup priority is read via the chkconfig: line by default, but can be overridden by the numbers in the S??script symlinks in rc?.d), and relies solely on the LSB data. We need to purge all scripts without LSB headers from our system.
@Anssi, I've just submitted pure-ftpd with added LSB. Can you test that once it's built and confirm that it solves the problem in your case?
Status: NEW => ASSIGNED
@Colin, yes, adding a LSB header to the rinetd initscript avoids the ordering loop. Guess I'm finally going to have to get round to updating the rinetd package.
A lot of packages are now fixed for this with only about 15 left to sort out. So nearly done.
Ping ? I still see this in current cauldron.
CC: (none) => ftg
Hmm really? So on a fresh boot, you can do: "systemctl show prefdm.service | grep ActiveEnterTimestamp" and it shows the values of 0 for both matching lines? If there is a value here, then it's not the same bug.
Do you want that before or after I manually start dm from a getty ?
Before. And if it does give you zeros, can you reboot and add the following to the kernel command line: systemd.log_level=debug systemd.log_target=kmsg Then after boot do a "dmesg > dmesg.log2" and attach both /var/log/dmesg.log and the dmesg.log2 files (you can also combine (trim - there will be some duplicated lines) them into one file if you like... the only reason to grab the /var/log/dmesg.log file is if the start of the dmesg buffer has been lost (it's a circular buffer) to get a complete picture of the the boot process). Hope that makes sense :D
Doing this once the GUI is up gives two matching lines both with nonzero values. Doing it from a tty before gives only one matching line (ActiveEnterTimestampMonotonic=0). The last thing I see on the systemd VC is: Failed to start Wait for Plumouth Boot Screen to Quit See systemctl status plymouth-quit wait.service for details Doing that from the tty gives: plymouth-quit-wait.service - Wait for Plymouth Boot Screen to Quit Loaded: loaded (/lib/systemd/system/plymouth-quit-wait.service; static) Active: failed (Result: timeout) since Thu, 17 May 2012 09:16:29 -0400; 2min 36s ago Main PID: 10981 CGroup: name=systemd:/system/plymouth-quit-wait.service I'll try the other test now.
Created attachment 2327 [details] Frank's dmesg output
Yup, it's the values before starting the GUI that matter hear. OK the other test will be useful, but in addition to that can you also double check that you are starting graphical.target: ls -l /etc/systemd/system/default.target (/me is keen to get this bug fixed as it's the last one in the tracker :D)
Created attachment 2328 [details] Frank's /var/log/dmesg
(In reply to comment #28) > Created attachment 2327 [details] > Frank's dmesg output Hmmm, I don't see any systemd debug in there and it looks like you did not pass the kernel command line arguments at the grub prompt as mentioned in comment 26. Can you retry with those arguments passed in? Thanks.
[root@ftgme2 ftg]# ls -l /etc/systemd/system/default.target lrwxrwxrwx 1 root root 36 Mar 20 12:47 /etc/systemd/system/default.target -> /lib/systemd/system/runlevel5.target [root@ftgme2 ftg]# As for the kernel parms, you're right. I pasted them into menu.lst and ran ./install.sh, but they're not there now. FM. I'll try again.
(In reply to comment #32) > [root@ftgme2 ftg]# ls -l /etc/systemd/system/default.target lrwxrwxrwx 1 root > root 36 Mar 20 12:47 /etc/systemd/system/default.target -> > /lib/systemd/system/runlevel5.target OK, that's all correct. Thanks for confirming. > As for the kernel parms, you're right. I pasted them into menu.lst and ran > ./install.sh, but they're not there now. FM. I'll try again. You can just add them ad-hoc at boot in grub. Just hit escape, then e and e again, type in the change, then hit b to boot it.
Created attachment 2331 [details] Frank's dmesg output
Attachment 2327 is obsolete: 0 => 1
Attachment 2331 mime type: application/octet-stream => text/plain
Created attachment 2333 [details] Frank's /var/log/dmesg OK, try these.
Attachment 2328 is obsolete: 0 => 1
Yup, that latter one clearly shows the problem: [ 7.250153] systemd[1]: Found ordering cycle on prefdm.service/start [ 7.250156] systemd[1]: Walked on cycle path to plymouth-quit.service/stop [ 7.250158] systemd[1]: Walked on cycle path to rc-local.service/start [ 7.250160] systemd[1]: Walked on cycle path to oracle-xe.service/start [ 7.250163] systemd[1]: Walked on cycle path to atieventsd.service/start [ 7.250165] systemd[1]: Walked on cycle path to prefdm.service/start [ 7.250167] systemd[1]: Breaking ordering cycle by deleting job plymouth-quit.service/stop [ 7.250170] systemd[1]: Deleting job prefdm.service/start as dependency of job plymouth-quit.service/stop [ 7.250177] systemd[1]: Found ordering cycle on prefdm.service/stop [ 7.250179] systemd[1]: Walked on cycle path to getty@tty1.service/start [ 7.250181] systemd[1]: Walked on cycle path to rc-local.service/start [ 7.250183] systemd[1]: Walked on cycle path to oracle-xe.service/start [ 7.250185] systemd[1]: Walked on cycle path to atieventsd.service/start [ 7.250187] systemd[1]: Walked on cycle path to prefdm.service/stop [ 7.250190] systemd[1]: Breaking ordering cycle by deleting job getty@tty1.service/start In this case I suspect strongly that it's the oracle-xe initscript that is causing this problem. Chances are it does not have a proper LSB header and thus we can *only* rely on the symlink S number (e.g. the file /etc/rc5.d/S??oracle-xe where the ?? is the start sequence number. The correct fix here is to add an LSB header to this initscript. As it's a third part script and not something shipped by Mageia (urpmf doesn't find it), there is little we can do about this. Can you confirm that disabling this service (systemctl disable oracle-xe.service) allows boot to work as expected?
I'll do that. What LSB header(s) should be added ?
You only need to add enough headers to properly tell systemd what that services dependances are. e.g. if it needs network or whatever. Have a look in the other scripts in the /etc/init.d/ folder for inspiration. All you really need is to have a comment block near the top saying: ### BEGIN INIT INFO # Provides: oracle-xe # Default-Start: 2 3 4 5 # Default-Stop: 0 1 6 # Required-Start: # Required-Stop: # Short-Description: Whatever # Description: Whatever Whatever Whatever Whatever Whatever Whatever # Whatever Whatever Whatever Whatever Whatever Whatever Whatever # Whatever Whatever Whatever Whatever Whatever Whatever Whatever ### END INIT INFO If it needs e.g. network, then put $network in the Required-Start/Stop lines.
That did it, thanks. I'll try it with the LSB comments now.
Great, thanks for confirming. Feel free to keep commenting on this bug with your progress, but as we've confirmed that it's an external initscript that's causing issues and as all the initscrpts we provide have either been replaced by native units or had LSB headers added, I think we can close this bug.
Status: ASSIGNED => RESOLVEDResolution: (none) => FIXEDSummary: prefdm.service does not start on boot. => prefdm.service does not start on boot (ordering cycles due to lack of LSB headers in sysvinit scripts)
The comments do the trick as well. Does upstream consider this a bug ? Because the LSB comments reall have nothing specific to the service in them other than the Provides, and systemd could assume that as a default by itself....
(In reply to comment #41) > The comments do the trick as well. Glad to hear. > Does upstream consider this a bug ? Because the LSB comments reall have > nothing specific to the service in them other than the Provides, and systemd > could assume that as a default by itself.... No this isn't really a bug, tho' upstream is open to suggestions on how to handle ordering cycles better (even if most of the suggestions have been discussed already). So to explain there are three types of "init scripts", sysv, lsb and native systemd units. In the case of lsb and native units the ordering of things (e.g. the fact that a given services needs another to be available) is explicitly stated. With sysvinit, all we have is an "order" as specified by the S??* symlinks. Orders are horrible to work with and if we're honest much of the ordering information is pretty buggy by design anyway (there is no guarantee that services were really ready even if they were started before). So systemd does try to use the ordering information to gauge when it should be started, but it often results in some impossible ordering cycle (if you remove the LSB header, do a systemctl --systemd daemon-reload and then a systemctl show oracle-xe.service you'll see the Before and After values that dictate ordering). The same fundamental deadlock could occur with native units... e.g. serviceA.service: Before=serviceB.service serviceB.service: Before=serviceA.service This is an impossible situation. This kind of unsatisfiable situation wasn't possible to achieve with non-lsb sysvinit scripts as the worst you could do was set them to start with the same ordering value eg. S10serviceA S10serviceB and both will be started but in a somewhat non-deterministic order. So in many cases it's just fundamentally hard to resolve the ordered services with dependency services. The full and complete solution is just to do away completely with ordered services and then all will be well! Hope that helps explain things.
*** Bug 6882 has been marked as a duplicate of this bug. ***