Bug 16260 - systemd detects ordering cycle on prefdm.service/start - in return deletes plymouth-quit.service/stop and getty@tty1.service/stop effectively preventing X to start
Summary: systemd detects ordering cycle on prefdm.service/start - in return deletes pl...
Status: RESOLVED FIXED
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages (show other bugs)
Version: 5
Hardware: All Linux
Priority: Normal major
Target Milestone: ---
Assignee: Colin Guthrie
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-06-30 21:39 CEST by Florian Hubold
Modified: 2017-01-17 10:29 CET (History)
2 users (show)

See Also:
Source RPM: systemd-217-11.mga5.src.rpm
CVE:
Status comment:


Attachments
journalctl -ab output of boot with ordering cycle issues (111.69 KB, text/plain)
2015-07-01 22:55 CEST, Florian Hubold
Details
journalctl -ab output of boot with ordering cycle issues in debug mode ( boot option systemd.log_level=debug ) (345.14 KB, text/plain)
2015-07-02 16:37 CEST, Florian Hubold
Details
proper journalctl -ab output of boot with ordering cycle issues in debug mode ( boot option systemd.log_level=debug ) (578.54 KB, text/plain)
2015-07-02 19:43 CEST, Florian Hubold
Details

Description Florian Hubold 2015-06-30 21:39:49 CEST
Description of problem:

A user in the german forum reported this after his upgrade to mga5 via https://forums.mageia.org/de/viewtopic.php?f=8&t=2507 . From his point of view, the system boots normally, but plymouth never goes away and X never starts. He has to switch to another tty, login and run "startx" to get his normal desktop.

Looking further, it's due to an ordering cycle where systemd shoots itself in the foot, killing getty on tty1 where X will start and also killing plymouth-wait-quit stop, hence plymouth will display forever. There are some other small issues on his box, but they seem unrelated and I have advised him for solutions on those already, awaiting feedback ...

----

Relevant excerpt from journalctl -ab via grep -i -E "plymouth|prefdm" journal.log


Jun 30 17:26:06 localhost.localdomain systemd[1]: Found ordering cycle on prefdm.service/start
Jun 30 17:26:06 localhost.localdomain systemd[1]: Breaking ordering cycle by deleting job plymouth-quit.service/stop
Jun 30 17:26:06 localhost.localdomain systemd[1]: Job plymouth-quit.service/stop deleted to break ordering cycle starting with prefdm.service/start
Jun 30 17:26:06 localhost.localdomain systemd[1]: Found ordering cycle on prefdm.service/stop
Jun 30 17:26:06 localhost.localdomain systemd[1]: Breaking ordering cycle by deleting job getty@tty1.service/stop
Jun 30 17:26:06 localhost.localdomain systemd[1]: Job getty@tty1.service/stop deleted to break ordering cycle starting with prefdm.service/stop
Jun 30 17:26:06 localhost.localdomain systemd[1]: tmp.mount: Directory /tmp to mount over is not empty, mounting anyway.
 
[snip - boot progresses here]
 
Jun 30 17:26:18 localhost.localdomain shorewall6[2738]: done.
Jun 30 17:26:36 localhost.localdomain systemd[1]: plymouth-quit-wait.service start operation timed out. Terminating.
Jun 30 17:26:36 localhost.localdomain systemd[1]: Failed to start Wait for Plymouth Boot Screen to Quit.
Jun 30 17:26:36 localhost.localdomain systemd[1]: Unit plymouth-quit-wait.service entered failed state.
Jun 30 17:26:36 localhost.localdomain systemd[1]: plymouth-quit-wait.service failed.


I have a complete journalctl -ab log from that system, which I can also attach if necessary. I have also asked him to give me the output of "systemd-delta" and "systemctl show -p Requires,Wants,Requisite,BindsTo,PartOf,Before,After" for prefdm and plymouth-quit-wait. 

Please advise what is required for further analysis and to break this ordering cycle. Maybe a dump via "systemd-analyze dump" at the point when the issue persists or when he has started his X session normally?

Reproducible: 

Steps to Reproduce:
Comment 1 David Walser 2015-06-30 22:10:56 CEST
See also:
Bug 15168
Bug 15691
Bug 16086
Comment 2 Colin Guthrie 2015-07-01 09:58:20 CEST
Ideally a full log with systemd.log_level=debug on the kernel command line from the boot would be good.
Comment 3 Florian Hubold 2015-07-01 22:55:13 CEST
Created attachment 6796 [details]
journalctl -ab output of boot with ordering cycle issues

(In reply to Colin Guthrie from comment #2)
> Ideally a full log with systemd.log_level=debug on the kernel command line
> from the boot would be good.

Yep, would probably be. Unfortunately the issue already vanished, and I've got no clue what particular action fixed it, only vague guess about rc.local. I've attached the journalctl -ab output when all the issues where there, after obtaining that I advised the user to fix some of his basic issues.

This is what I asked him to do:

- assign a hostname via "hostnamectl set-hostname"
- disable rc.local via "systemctl mask rc-local" as it failed beforehand, so might have caused some ordering issues.

This is what he did in addition:

- unplug all external HDDs
- unplug all USB devices
- plug keyboard/mouse receiver directly into mainboard
- move away /etc/rc.d/rc.local file somewhere else


After several reboots and plugging his devices again as they were beforehand, everything is fine again. I assume that systemd doesn't really care about the hostname, so the only relevant change was masking rc.local ...



So my question would be, for such ordering issues, booting with "systemd.log_level=debug" and then "journalctl -ab" output should be sufficient to check why such cycles appeared?

CC: (none) => doktor5000

Comment 4 Florian Hubold 2015-07-01 22:56:29 CEST
Closing as the issue has vanished after masking rc.local ...

:/

Status: NEW => RESOLVED
Resolution: (none) => WORKSFORME

Comment 5 Colin Guthrie 2015-07-02 09:58:09 CEST
(In reply to Florian Hubold from comment #3)
> Created attachment 6796 [details]
> journalctl -ab output of boot with ordering cycle issues
> 
> (In reply to Colin Guthrie from comment #2)
> > Ideally a full log with systemd.log_level=debug on the kernel command line
> > from the boot would be good.
> 
> So my question would be, for such ordering issues, booting with
> "systemd.log_level=debug" and then "journalctl -ab" output should be
> sufficient to check why such cycles appeared?

It can certainly help. For most issues it can provide a lot of insights that regular logs do not, so it's a good one to keep in the armoury. You can just use "debug" on the command line too but the kernel spews out a lot of stuff then too.
Comment 6 Florian Hubold 2015-07-02 14:25:46 CEST
Reopening, after the user unmasked rc-local the issue is back. I cannot really comprehend how rc-local has an effect of the ordering of prefdm and getty/plymouth-quit-wait ... but anyways, at least it's reproducible somehow.

Will ask him to create a debug journal log and attach it here.

Resolution: WORKSFORME => (none)
Status: RESOLVED => REOPENED

Comment 7 Florian Hubold 2015-07-02 16:37:40 CEST
Created attachment 6798 [details]
journalctl -ab output of boot with ordering cycle issues in debug mode ( boot option systemd.log_level=debug )

@Colin: Here's the debug log, although it looks a bit weird. Not one occurence of prefdm in there, no ordering cycle issues.

And when the last child process of plymouth died, it seems to automatically invoke an emergency shell, although the user says he still just saw the regular plymouth bootsplash ...

Attachment 6796 is obsolete: 0 => 1

Comment 8 Florian Hubold 2015-07-02 19:43:33 CEST
Created attachment 6799 [details]
proper journalctl -ab output of boot with ordering cycle issues in debug mode ( boot option systemd.log_level=debug )

(In reply to Florian Hubold from comment #7)
> @Colin: Here's the debug log, although it looks a bit weird. Not one
> occurence of prefdm in there, no ordering cycle issues.

Talked with the user again, this is now the correct debug log, which also shows the prefdm ordering cycle as expected. And it is easily reproducible, unmasking rc-local.server makes it appear, masking it makes it vanish.

Attachment 6798 is obsolete: 0 => 1

Comment 9 Florian Hubold 2015-08-08 13:15:39 CEST
@colin: Another issue, reported via https://forums.mageia.org/en/viewtopic.php?f=7&t=10134 - a restart of prefdm seems to help there for normal login. But this thing with the ordering cycles seems ridiculous to me:


Aug 07 19:26:19 gilraen systemd[1]: Found ordering cycle on preload.service/start
Aug 07 19:26:19 gilraen systemd[1]: Breaking ordering cycle by deleting job prefdm.service/start
Aug 07 19:26:19 gilraen systemd[1]: Job prefdm.service/start deleted to break ordering cycle starting with preload.service/start
Aug 07 19:26:19 gilraen systemd[1]: Found ordering cycle on preload.service/start
Aug 07 19:26:19 gilraen systemd[1]: Breaking ordering cycle by deleting job remote-fs.target/start
Aug 07 19:26:19 gilraen systemd[1]: Job remote-fs.target/start deleted to break ordering cycle starting with preload.service/start
Aug 07 19:26:19 gilraen systemd[1]: Found ordering cycle on preload.service/start
Aug 07 19:26:19 gilraen systemd[1]: Breaking ordering cycle by deleting job time-sync.target/start
Aug 07 19:26:19 gilraen systemd[1]: Job time-sync.target/start deleted to break ordering cycle starting with preload.service/start
Comment 10 Florian Hubold 2015-08-08 14:44:40 CEST
And one more time, masking the triggering unit, in this case preload.service, fixes the issue. 

@colin: Can we please get this fixed? As it doesn't seem like an upstream problem, what can be done about this?
Florian Hubold 2015-08-09 16:48:43 CEST

Blocks: (none) => 15527

Comment 11 Colin Guthrie 2015-08-09 22:03:51 CEST
I think the solution decided upon was just to kill off preload. It's not really been tested and probably provides minimal benefits even if it did work (due to it not being tested or integrated).

We can likely just release a new systemd version that obsoletes it.
Comment 12 Florian Hubold 2015-08-10 21:04:06 CEST
I meant "can we please get this fixed" in regards to the ridiculous ordering cycles, not especially for preload - the initial report was triggered by rc.local causing ordering cycles. I mean, c'mon, how is rc.local, whether it runs or not anyhow relevant to prefdm?

Sorry but I simply fail to see the logic upon which systemd decides what ordering cycles are - those seem completely arbitrary with no actual relation.
Comment 13 Colin Guthrie 2015-08-11 10:06:18 CEST
They are not completely arbitrary. They are very much written down in the rules of the units/init scripts. It's very specifically the exact opposite of arbitrary.

In this case, preload has Required-Start $local_fs and $remote_fs and $time These translate to systemd having an After rule in MGA5 for preload with: 

After=remote-fs.target time-sync.target systemd-journald.socket basic.target system.slice prefdm.service

With prefdm having:
After=livesys-late.service systemd-user-sessions.service getty@tty1.service plymouth-quit.service systemd-journald.socket basic.target system.slice

And rc-local having:
After=network.target systemd-journald.socket basic.target system.slice


And before rules:

preload:
Before=resolvconf.service graphical.target shutdown.target partmon.service network.service numlock.service irqbalance.service postfix.service icecream.service icecream-scheduler.service msec.service virtualbox.service network-up.service


prefdm:
Before=dracut-shutdown.service shutdown.target preload.service graphical.target

rc-local:
Before=plymouth-quit.service shutdown.target getty@tty1.service plymouth-quit-wait.service



Now the main problem here is that prefdm has to start before preload but after after systemd-user-sessions.service. s-u-s has to start after remote-fs.target which has to start after remote-fs-pre.target which has to start after nfs-lock.target which has to start after network.target.

So s-u-s has to start after network.target

So preferdm has to start before preload but after network.target. preload itself has to start after network.target, and thus there is a deadlock (aka an ordering cycle).

All of this only shows up and is activated under certain circumstances - i.e. if remote-fs.target is not started or there are no remote fs's detected then a lot of these loops will be closed. There are likely other ways that generators (such as the rc-local one) can get in the way by adding extra deps.


So these things are far from arbitrary. They are very much determined, although the use of generators (such as in the rc-local case or when remote filesystems are detected in /etc/fstab) can make things harder to spot.

But while they may not be arbitrary, they are bloody annoying :p
Comment 14 Henrik Christiansen 2015-08-18 10:26:12 CEST
(In reply to Florian Hubold from comment #0)
> Description of problem:
> 
> A user in the german forum reported this after his upgrade to mga5 via
> https://forums.mageia.org/de/viewtopic.php?f=8&t=2507 . From his point of
> view, the system boots normally, but plymouth never goes away and X never
> starts. He has to switch to another tty, login and run "startx" to get his
> normal desktop.
> 

I just upgraded from 4 to 5 with
urpmi --replacefiles --auto-update --auto

and got almost the same problem, except that my boot goes directly to login and after login, I have to run startx to get to the desktop.

systemctl mask preload.service
make it boot normal.

CC: (none) => hc

Comment 15 Florian Hubold 2015-12-07 18:47:34 CET
So one half of the bug should be solved by removing preload via Bug 16086
As the originally reported issue also involved preload as a dependency in the  ordering cycle on prefdm, I'll close this one.

But we should reopen against cauldron if we see this again before or after mga6 release ...

Status: REOPENED => RESOLVED
Resolution: (none) => FIXED

Samuel Verschelde 2017-01-17 10:29:39 CET

Blocks: 15527 => (none)


Note You need to log in before you can comment on or make changes to this bug.