Bug 5137

Summary: When uninstalling and reinstalling openssh-server, openssh-server does not start back up.
Product: Mageia Reporter: Remco Rijnders <remco>
Component: RPM Packages Assignee: Colin Guthrie <mageia>
Status: RESOLVED FIXED QA Contact:
Severity: critical    
Priority: Normal CC: gdzien, misc
Version: Cauldron   
Target Milestone: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Source RPM: openssh-server CVE:
Status comment:

Description Remco Rijnders 2012-03-27 17:47:02 CEST
Description of problem:
When issuing the following command while logged in remotely:
urpme openssh-server; urpmi openssh-server
the connection gets dropped and, worse, openssh does not start back up.

The same might happen when doing a simple upgrade of this package. (I did not check this)

When logged in to the machine, issuing the command:
systemctl start sshd.service
gives the following:
Warning: Unit file of created job changed on disk, 'systemctl --system daemon-reload' recommended.
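
Following the warning's own advice, recovery from a local console would presumably be:

systemctl --system daemon-reload   # re-read the changed unit file
systemctl start sshd.service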

I think that when the uninstall and reinstall are done from a local session rather than over SSH, openssh-server starts back up as expected.

As discussed with colin on IRC:

<coling> remmy, enabled yes, (depending on msec security level) but not actually started.
<coling> remmy, so I could see how this might be slightly annoying when doing it in a datacentre :D
<coling> (for various values of "slightly" :p)
* coling wonders if we need some pam_systemd stuff in pam.d/ssh
<remmy> systemctl start sshd.service gives: Warning: Unit file of created job changed on disk, 'systemctl --system daemon-reload' recommended
<remmy> coling: Yeah, I would not be happy if I gave this command on my server in Canada
<coling> remmy, yeah re that warning I've fixed that in rpm-helper now.. just not released.
<coling> Anyway, If my thinking is right, the user session in ssh should change the cgroup from sshd's cgroup to a user cgroup, and thus "escape" from being killed when sshd is shut down.
<coling> That would then prevent this turn of events.
<remmy> And I wonder if one would not run into the same issue with a simple upgrade
<coling> remmy, in retrospect, can you open a bug for this? Basically just say "please check user session cgroup for SSH logins" or something....
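
For reference, one way to check which control group an SSH login landed in (the example paths are assumptions and vary with the systemd version):

cat /proc/self/cgroup
# with pam_systemd active, expect something like:  1:name=systemd:/user/remmy/1
# without it, the login stays under the daemon:    1:name=systemd:/system/sshd.service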

Priority set to normal for now as this is not a common use case, but if we want Mageia to be viable as a server platform, this should be fixed.
Comment 1 Grzegorz Dzien 2012-04-12 17:00:02 CEST
This killed my ssh session during upgrade.

CC: (none) => gdzien

Comment 2 Manuel Hiebel 2012-04-12 17:00:11 CEST
*** Bug 5363 has been marked as a duplicate of this bug. ***
Manuel Hiebel 2012-04-12 17:00:56 CEST

CC: (none) => misc
Severity: normal => critical

Comment 3 Colin Guthrie 2012-04-12 17:55:53 CEST
This is not a dupe of #5363.

#5363 relates to killing sshd *sessions*; this bug relates to the sshd daemon not restarting properly.
Comment 4 Colin Guthrie 2012-04-12 18:03:03 CEST
Actually, apologies to Manuel: it is at least half a dupe :)

Still, we should really debug why it didn't start back up, so let's re-target this bug to that issue specifically (it's already what the subject refers to anyway, which is why I was a bit overzealous in calling it "not a dupe" before!).
Comment 5 Grzegorz Dzien 2012-04-12 21:34:17 CEST
It's not starting back up because the shell that was waiting for sshd to stop was killed before it had a chance to run the command after the semicolon.
So my bug remains part of this one, basically a duplicate; mine is about the session dropping as well, it just points more accurately at the actual problem.
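
A minimal demonstration of the mechanism, nothing openssh-specific, just shell semantics:

sh -c 'kill -9 $$; echo this never prints'
# the first command kills the shell itself, so the command
# after the semicolon is never executed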
Comment 6 Grzegorz Dzien 2012-04-12 22:00:37 CEST
My case was solved by changing UsePAM from no to yes. The same should fix this one, together with my explanation of why sshd was not started again (a shell that is killed when sshd restarts cannot fork the next command).
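
For reference, a minimal sketch of the change, assuming the stock /etc/ssh/sshd_config location:

sed -i 's/^UsePAM no/UsePAM yes/' /etc/ssh/sshd_config
systemctl restart sshd.service   # pick up the new setting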

Thanks for the help; it seems this sshd_config problem is already fixed in my sshd_config.rpmnew.

Best Regards
Greg

Status: NEW => RESOLVED
Resolution: (none) => FIXED

Comment 7 Colin Guthrie 2012-04-12 22:08:55 CEST
Well, I'm not really sure that's a totally perfect explanation. The "restart" command issued should have been "systemctl try-restart sshd.service", which is just one command to issue, and the service should then be restarted. It actually hands the job over to systemd to do the restarting, so even if the shell is killed the service should have been restarted OK, as nothing needs to fork from that shell. This is actually one of the features of systemd: services are launched from a pure, clean, stable environment, not forked off a user environment which could have all manner of weird stuff in it.
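
In other words, the whole restart reduces to a single command whose job systemd carries out on its own:

systemctl try-restart sshd.service
# restarts the service only if it is currently running; the job is executed
# by systemd itself, so it survives even if the invoking shell is killed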

So if this problem can be reproduced (with UsePAM=yes), it could still be a real issue.
Comment 8 Grzegorz Dzien 2012-04-12 22:36:37 CEST
How come it killed my session, then, when I ran "service sshd restart" with UsePAM=no?
Comment 9 Colin Guthrie 2012-04-12 22:52:17 CEST
Well, basically, if you have UsePAM=no, any child processes of the sshd process are part of the same Control Group as sshd itself. You can verify this by setting UsePAM=no, logging in remotely and doing "systemctl status sshd.service": you will see your user's shell and individual sshd process shown as part of the service. When you set UsePAM=yes, the PAM module pam_systemd is used and your user's own sshd process, shell etc. are placed into a user session Control Group rather than the service's control group.
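
Illustrative output (the PIDs and cgroup paths here are made up and vary with the systemd version):

systemctl status sshd.service
#   ...
#   CGroup: name=systemd:/system/sshd.service
#           |- 1234 sshd: remmy [priv]
#           `- 1240 -bash
# with UsePAM=yes, the login's processes move to something like
# name=systemd:/user/remmy/1 instead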

When systemd restarts a service, it ensures that all processes in the Control Group are killed off as part of the "stop job". In the UsePAM=no case, this includes your user shell etc. In the UsePAM=yes case, as explained above, it does not.

Control Groups are pretty great for partitioning things nicely. Keeping track of forked processes in services is really handy (want to know what that rogue CGI script forked off? No problem!). It's useful when dealing with user sessions too: you can set various options such that all user processes are killed when logging out via SSH etc. (requires pam tweaks) or simply when logging out of a graphical DM.
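
The whole partitioning can be inspected at a glance with the cgroup tree viewer that ships with systemd:

systemd-cgls
# prints the control-group hierarchy, with each service's and
# session's processes grouped under their own node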

So that explains why your user session was killed when the service was restarted. But it doesn't explain why the service itself didn't come back up. As I said, the jobs should have been scheduled already when the user's shell was killed. I guess there is a chance that, because the systemctl command (which blocks waiting for the operations to end) was killed, systemd somehow cancelled the start job. I don't think this is how it works, but it's a possibility. If that were the case, it would be easily worked around in rpm-helper by passing the --no-block option to systemctl.
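
If a blocking job really is cancelled along with its requester (an assumption, not confirmed), the workaround would look like:

systemctl --no-block try-restart sshd.service
# queue the restart job and return immediately instead of waiting on it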

If anyone can offer any more insights or do some more tests related to the restarting issue, please do :)
Comment 10 Remco Rijnders 2012-04-12 22:57:45 CEST
@Colin,

Thanks for the kind and professional manner in which you explain this.

For what it is worth, when I updated my Cauldron box over SSH yesterday, which included an update of openssh-server, my connection was killed. However, unlike the issue we first spoke about two weeks ago, SSH was properly restarted and I could log in again to continue the upgrade process.
Comment 11 Grzegorz Dzien 2012-04-12 23:06:05 CEST
Sorry for my question; you just threw me off track for a second. I know how Linux works, thanks for the explanation though.

Anyway, look at his problem description; let me quote it for you:

"When issueing the following command:
urpme openssh-server; urpmi openssh-server
when logged in remotely, the connection gets dropped, and, worse, openssh does
not start back up."

Do you see the problem yet?

"urpme openssh-server;" (sshd is stopped here, his shell dies, "urpmi openssh-server" - this part never gets executed, because his shell is dead so it can not fork new process. Yes, he is logged up remotely.

BTW: you should modify the SPEC file for the new openssh-server rpm with some macro (I am not good with specs, so will not give an example) that will first run "rpm -Qf /etc/ssh/sshd_config" and, if it finds the file to be identical to the standard installation, replace it with the new fixed config file. As it stands, the package just creates an .rpmnew file.
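
A hedged sketch of that scriptlet logic, in plain shell as rpm scriptlets are (note that it is "rpm -V", not "rpm -Qf", that compares an installed file against the packaged checksums; the grep pattern here is illustrative):

if ! rpm -V openssh-server | grep -q '/etc/ssh/sshd_config$'; then
    # no verify complaint, so the admin never modified the file:
    # adopt the new packaged default instead of leaving an .rpmnew behind
    if [ -f /etc/ssh/sshd_config.rpmnew ]; then
        mv /etc/ssh/sshd_config.rpmnew /etc/ssh/sshd_config
    fi
fi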

Best Regards
Grzegorz Dzien
Comment 12 Colin Guthrie 2012-04-12 23:26:22 CEST
@Grzegorz: Gah! Yeah, you're right that that particular part of this bug report does say that, but I know from previous conversations that this originally stemmed from the problem occurring "in the wild", and that the comments above were really just an example reproduction case. As you point out, it's not really a valid reproduction case. I see now where you were coming from in your analysis, and I certainly cannot fault it :D