Bug 32409 - NetworkManager: 100% CPU and 10+ minutes to reboot (no network if NM is enabled)
Summary: NetworkManager: 100% CPU and 10+ minutes to reboot (no network if NM is enabled)
Status: NEW
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: Cauldron
Hardware: All
OS: Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assignee: Mageia Bug Squad
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-18 22:56 CEST by Pierre Fortin
Modified: 2024-02-03 06:35 CET
CC List: 3 users

See Also:
Source RPM:
CVE:
Status comment:


Attachments
photo of bootup after NM installed (241.32 KB, image/jpeg)
2023-10-18 23:07 CEST, Pierre Fortin
NM even takes a long time to shutdown... (66.19 KB, image/jpeg)
2023-10-18 23:08 CEST, Pierre Fortin
journal (4.57 KB, text/plain)
2023-10-21 06:07 CEST, Pierre Fortin

Description Pierre Fortin 2023-10-18 22:56:55 CEST
Description of problem: Finally got around to switching to NetworkManager (previous attempts months ago were aborted due to issues I don't recall).
Today, I seriously regret enabling it...
NM is running at 100% CPU.
Worse, after rebooting, I had NO networking at all (except "lo"). I got ethernet back up via https://bugs.mageia.org/show_bug.cgi?id=32373#c5,
but still no WiFi...  Below are the most recent networking-related updates.  I now have my main system and a laptop with no working WiFi...  While using mcc to try to get them working, they can see the various SSIDs available for connection, but they don't connect. Other devices connect to the same router without issue.

With a process running at 100% CPU, I normally expect a LOT of output from strace; but in this case, all I get is:
$ strace -p 248775
strace: Process 248775 attached
restart_syscall(<... resuming interrupted read ...>
^Cstrace: Process 248775 detached

 <detached ...>

gdb gives nothing useful.
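
For reference, a few other ways to see where a spinning process is stuck (same PID as the strace above; these assume gdb and perf are installed, which may not be the case here):
# cat /proc/248775/stack                           # kernel-side view of what the task is blocked in (root)
# cat /proc/248775/wchan; echo                     # same, as a single symbol name
$ gdb -p 248775 -batch -ex 'thread apply all bt'   # one-shot userspace backtrace
# perf top -p 248775                               # live profile of where the CPU time is going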

These are the most recent network-related updates that are likely suspects.

Version-Release number of selected component (if applicable):
$ rpm -qa --last | grep -i network
python3-pyside6-networkauth-6.5.2-2.mga10.x86_64 Tue 17 Oct 2023 07:06:33 AM EDT
python3-pyside6-network-6.5.2-2.mga10.x86_64  Tue 17 Oct 2023 07:06:33 AM EDT
lib64qtnetwork4-4.8.7-46.mga10.x86_64         Fri 13 Oct 2023 10:50:31 AM EDT
python3-qt6-networkauth-6.5.0-1.mga10.x86_64  Wed 11 Oct 2023 02:19:19 AM EDT
lib64qt6network-devel-6.5.2-2.mga10.x86_64    Wed 11 Oct 2023 02:10:47 AM EDT
lib64qt6network6-6.5.2-2.mga10.x86_64         Wed 11 Oct 2023 02:10:47 AM EDT
lib64qt5network-devel-5.15.7-8.mga10.x86_64   Wed 11 Oct 2023 02:10:47 AM EDT
lib64qt5network5-5.15.7-8.mga10.x86_64        Wed 11 Oct 2023 02:10:47 AM EDT
python3-networkx-3.1-1.mga10.noarch           Mon 09 Oct 2023 06:56:45 PM EDT
lib64glib-networking-2.78.0-1.mga10.x86_64    Sun 24 Sep 2023 04:27:40 PM EDT
glib-networking-2.78.0-1.mga10.x86_64         Sun 24 Sep 2023 04:27:40 PM EDT
lib64glib-networking-gnutls-2.78.0-1.mga10.x86_64 Sun 24 Sep 2023 04:27:38 PM EDT
lib64qt6networkauth6-6.5.2-1.mga10.x86_64     Sat 23 Sep 2023 07:27:01 PM EDT
python3-qt6-network-6.5.2-1.mga10.x86_64      Sat 23 Sep 2023 07:25:00 PM EDT
networkmanager-qt-5.110.0-1.mga10.x86_64      Tue 19 Sep 2023 02:37:07 PM EDT
lib64kf5networkmanagerqt6-5.110.0-1.mga10.x86_64 Tue 19 Sep 2023 02:37:07 PM EDT

I was about to open this issue with severity Major ("A major feature is broken"), but tried once again to set up WiFi and it worked this time, after nearly 2 weeks without WiFi...

However, NetworkManager is still consuming 100% CPU on one core.

Also, I still have:
$ cat .net_applet
AUTOSTART=TRUE

$ systemctl status network.service
○ network.service - LSB: Bring up/down networking
     Loaded: loaded (/etc/rc.d/init.d/network; generated)
     Active: inactive (dead)
       Docs: man:systemd-sysv-generator(8)
# unmasked as in https://bugs.mageia.org/show_bug.cgi?id=32373#c5


Will reboot at a convenient time to see if the network comes up on its own.
Surprised to see WiFi working again, though not without lots of manual intervention...

How reproducible: not sure; but having tried to switch to NM months ago, and after this near-disaster today, I'm less than impressed...

NM is still at 100%, and the big question remains: why does it take 10+ minutes to boot up???

Steps to Reproduce:
1. apply updates
2. lose WiFi
3. switch to NetworkManager as directed in journal comments
4. reboot
5. undo some NM changes to get networking (partial?)
Comment 1 Pierre Fortin 2023-10-18 23:07:00 CEST
Created attachment 14069 [details]
photo of bootup after NM installed
Comment 2 Pierre Fortin 2023-10-18 23:08:39 CEST
Created attachment 14070 [details]
NM even takes a long time to shutdown...
Comment 3 sturmvogel 2023-10-19 04:23:51 CEST
It seems you did not follow the correct procedure, since you still have net_applet running. See here how to switch properly to NM:
https://wiki.mageia.org/en/Switching_to_networkmanager
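
In short, the switch amounts to something like this (a rough sketch only -- see the wiki for the authoritative steps; the ~/.net_applet edit is an assumption based on the AUTOSTART line shown in the report above):
# systemctl enable --now NetworkManager.service
# systemctl mask network.service; systemctl mask network-up
$ sed -i 's/AUTOSTART=TRUE/AUTOSTART=FALSE/' ~/.net_applet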
sturmvogel 2023-10-19 04:24:14 CEST

Source RPM: networkmanager-qt-5.110.0-1.mga10.x86_64 Tue 19 Sep 2023 02:37:07 PM EDT => (none)

Comment 4 sturmvogel 2023-10-19 04:30:10 CEST
As described in the wiki, long boot times are caused by not properly masking the legacy network startup services. According to your output, they are not masked at all...
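
A quick way to verify (masked units report "masked" here):
$ systemctl is-enabled network.service network-up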
Comment 5 Pierre Fortin 2023-10-19 06:07:55 CEST
(In reply to sturmvogel from comment #3)
> It seems you did not follow the correct procedure when you still have
> net_applet services running. See here how to switch properly to NM:
> https://wiki.mageia.org/en/Switching_to_networkmanager

I followed that EXACT procedure using copy/paste of every step.  When I pasted the last step: (systemctl mask network.service; systemctl mask network-up), the commands did not return the prompt.  
See https://bugs.mageia.org/show_bug.cgi?id=32373#c4 and note the 3 "Created symlink" messages -- maybe they were delayed from a previous command.  

This issue is about 100% CPU usage, which is a bug no matter what.  Was it coincidence that issuing "systemctl mask network.service; systemctl mask network-up" did not return the prompt, or did NetworkManager going to 100% CPU cause one of these commands not to return?  Either way, this indicates a flaw in the procedure and/or a bug.

I provided what information I could; sorry if some of it was gathered after I had done what I could to restore networking.

>As described in the wiki, long boot times are caused by not properly masking
>the legacy network startup services. According your output they are not masked 
>at all...

Already addressed this in https://bugs.mageia.org/show_bug.cgi?id=32373#c5 where making the 2 changes identified therein was the ONLY way I could get mcc to get past "Please wait" to get any network interface up -- I was surprised when it was the WiFi that came up, given the problem I reported in
 https://bugs.mageia.org/show_bug.cgi?id=32373
where I subsequently changed:

2023-10-18 20:23:47 CEST
Summary: no WiFi after reboot => no networking after switch to NetworkManager and reboot

as a result of this issue.

From my perspective, if unmasking got any networking back, the slow boot is secondary.  Quoting the wiki:
     It is also recommended disabling the legacy network startup services by
     running
     # systemctl mask network.service; systemctl mask network-up
     as otherwise this would introduce unnecessary delays during boot. 

OK; but there's something wrong when, at the very point of issuing this pair, the commands didn't return and NM went to 100% CPU, which is still the case...
$ top | grep Network
 451003 root      20   0  508104 190592  17472 R 100.0   0.1   7:41.32 NetworkManager                                                                                                                                           
 451003 root      20   0  509104 191552  17472 R 100.0   0.1   7:44.33 NetworkManager
Comment 6 Pierre Fortin 2023-10-21 06:07:44 CEST
Created attachment 14076 [details]
journal

Migrated the laptop to NetworkManager (same procedure, and there NM is idle, not at 100%)

Found: https://bugs.archlinux.org/task/61688

On main system:
systemctl restart NetworkManager
just goes into S state and only returns when killed. No journal entries occur when the restart command is issued/killed.

https://unix.stackexchange.com/questions/700464/check-current-active-network-manager
$ nmcli connection
Warning: nmcli (1.44.0) and NetworkManager (Unknown) versions don't match. Restarting NetworkManager is advised.
Error: NetworkManager is not running.

$ NetworkManager --version
1.44.0

So I can't issue any nmcli commands.
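
It might be worth checking whether the daemon is actually registered on D-Bus, since that is what nmcli talks to (assuming systemd's busctl is available):
$ systemctl status NetworkManager.service
$ busctl status org.freedesktop.NetworkManager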

Attached is what I see in the journal about every 10 minutes.
Marja Van Waes 2023-10-25 22:06:04 CEST

See Also: (none) => https://bugs.mageia.org/show_bug.cgi?id=32373

Comment 7 Lewis Smith 2023-11-03 12:47:03 CET
Having no experience of NetworkManager, nor wishing to try it ... I wonder whether the related bug 32373 is not a variation on this one - at least from its c4.
Both bugs show multiple problems including the 100% CPU usage. The ArchLinux URL given in comment 6 is about that, but dates from 2019 and talks mostly of Curl, also IPv6 and firewalls. It is long & implies various remedies which I do not think apply here.

I am unsure whether Pierre has yet got NM (strictly alone) working on any machine. Ethernet &/or WiFi. Frank does (https://bugs.mageia.org/show_bug.cgi?id=32373#c6) but not without some manipulations apparently complicated by Plasma...

Without seeing the Wiki, common sense says that if you change to NM you should first stop then inhibit our networking as noted, configure NM, and re-boot after the switch. Some comments note trying one alongside the other, which to say the least complicates the issue. Is the Wiki procedure watertight? I see these issues:

* The fact that the command:
 systemctl mask network.service; systemctl mask network-up
did not return. And when interrupted with ^Z, it shows:
[11]+  Stopped                 systemctl enable --now NetworkManager.service
which suggests that NM had already been configured before our networking was killed.

* the 100% CPU usage

* The huge & accumulating number of files named ifcfg-veth* in /etc/sysconfig/network-scripts, created at _random_ times; though on average about 4 per minute most of the time; some repeated within one second -- no repeating time pattern (https://bugs.mageia.org/show_bug.cgi?id=32373#c7).

CC: (none) => lewyssmith

Comment 8 Frank Griffin 2023-11-03 13:17:49 CET
You do have to systemctl enable/start NetworkManager since we install it disabled.

When using NM I don't mess with systemctl mask, I just use mcc/drakconnect to remove the ifcfg interfaces that the install created and then remove the ifcfg support from /etc/NetworkManager/NetworkManager.conf.  Finally, execute "nmtui" to activate the interfaces you want to use.  NM will remember them thereafter.

The Plasma bit is that you have to install plasma-nm-applet because we don't do that by default.  NM support is enabled automatically for GNOME, but not for Plasma.
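
Roughly, the sequence is (a sketch only -- check what your NetworkManager.conf actually contains first; "ifcfg-rh" as the plugin name is an assumption on my part):
# grep -rn plugins /etc/NetworkManager/NetworkManager.conf /etc/NetworkManager/conf.d/
(drop any ifcfg entry from that plugins= line, remove the installer-created ifcfg-* interfaces with mcc/drakconnect, then)
# systemctl restart NetworkManager
# nmtui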

CC: (none) => ftg

Comment 9 Lewis Smith 2023-11-06 22:14:55 CET
(In reply to Lewis Smith from comment #7)
> I see these main issues:
> 
> * the 100% CPU usage
> 
> * The huge & accumulating number of files named ifcfg-veth* in
> /etc/sysconfig/network-scripts, created at _random_ times; though on average
> about 4 per minute most of the time; some repeated within one second -- no
> repeating time pattern (https://bugs.mageia.org/show_bug.cgi?id=32373#c7).
Is this the ongoing situation?
Does any other NetworkManager user here see these things, or just Pierre (and are you still seeing them?).
Comment 10 Pierre Fortin 2023-11-07 16:00:13 CET
(In reply to Frank Griffin from comment #8)
> You do have to systemctl enable/start NetworkManager since we install it
> disabled.

Enabling it is covered in https://wiki.mageia.org/en/Switching_to_networkmanager

> When using NM I don't mess with systemctl mask, I just use mcc/drakconnect
> to remove the ifcfg interfaces that the install created and then remove the
> ifcfg support from /etc/NetworkManager/NetworkManager.conf.  Finally,
> execute "nmtui" to activate the interfaces you want to use.  NM will
> remember them thereafter.

Sounds like https://wiki.mageia.org/en/Switching_to_networkmanager needs some updating as it contains:
systemctl mask network.service; systemctl mask network-up

nmtui: first time I'm hearing about this; maybe add it to https://wiki.mageia.org/en/Switching_to_networkmanager ...  Also, in nmtui, what does "Please select an option: Radio" even mean?  "Radio" needs an action; guessing "enable/disable Radio"...?
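
For what it's worth, nmcli has a "radio" object for the WiFi/WWAN radios, so presumably the nmtui entry does the same thing (guessing):
$ nmcli radio            # show WiFi/WWAN radio state
$ nmcli radio wifi off   # disable the WiFi radio
$ nmcli radio wifi on    # re-enable it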

> The Plasma bit is that you have to install plasma-nm-applet because we don't
> do that by default.  NM support is enabled automatically for GNOME, but not
> for Plasma.

That should also be in: https://wiki.mageia.org/en/Switching_to_networkmanager
but plasma-nm-applet is not available:

$ urpmi plasma-nm-applet
No package named plasma-nm-applet

Did you mean?:
urpmi plasma-applet-nm
To satisfy dependencies, the following packages are going to be installed:
  Package                        Version      Release       Arch    
(medium "Core Release")
  plasma-applet-nm               5.27.9       1.mga10       x86_64  
  plasma-applet-nm-libreswan     5.27.9       1.mga10       x86_64  
  plasma-applet-nm-openvpn       5.27.9       1.mga10       x86_64  
4.1KB of additional disk space will be used.
1.2MB of packages will be retrieved.

Installed it. 

$ findcmd applet
/usr/bin:
    mate-panel-test-applets
    mgaapplet
    mgaapplet-config
    mgaapplet-update-checker
    mgaapplet-upgrade-helper
    net_applet
    nm-applet

$ nm-applet --help
Usage: nm-applet

This program is a component of NetworkManager (https://networkmanager.dev).
It is not intended for command-line interaction but instead runs in the GNOME desktop environment.

So, plasma-applet-nm installed the Gnome stuff?

Not seeing any network applet in systray; is there a command to get a systray applet like I used to see before NetworkManager?
Comment 11 Pierre Fortin 2023-11-07 16:21:59 CET
(In reply to Lewis Smith from comment #9)
> (In reply to Lewis Smith from comment #7)
> > I see these main issues:
> > 
> > * the 100% CPU usage
> > 
> > * The huge & accumulating number of files named ifcfg-veth* in
> > /etc/sysconfig/network-scripts, created at _random_ times; though on average
> > about 4 per minute most of the time; some repeated within one second -- no
> > repeating time pattern (https://bugs.mageia.org/show_bug.cgi?id=32373#c7).
> Is this the ongoing situation?

No; no idea where those came from. They only appeared between these reboots:
Fri Sep 29 11:35:01 PM EDT 2023
Wed Oct 11 02:34:24 AM EDT 2023
See https://bugs.mageia.org/show_bug.cgi?id=32373#c0

I keep track of all installed RPMs via this in my root crontab:
@reboot rpm -qa | sort > /home/ROOT/RPM.history/RPMS.`/bin/date +%Y%m%d`
Hmmm...  I apparently had @daily until:
-rw-r--r-- 1 root root 176811 May 24 00:00 RPMS.20230524
-rw-r--r-- 1 root root 176811 May 25 00:00 RPMS.20230525
-rw-r--r-- 1 root root 176811 May 26 00:00 RPMS.20230526
but when I changed it to @reboot -- the command has not worked since; though running it manually works:
-rw-r--r-- 1 root root 196093 Nov  7 09:19 RPMS.20231107
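
Aside: one classic cron gotcha that might explain it -- in a crontab line an unescaped '%' is treated as a newline (everything after it is fed to the command as stdin), so the date format may need escaping. Something like (untested sketch):
@reboot rpm -qa | sort > /home/ROOT/RPM.history/RPMS.`/bin/date +\%Y\%m\%d`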


> Does any other NetworkManager user here see these things, or just Pierre
> (and are you still seeing them?).

Maybe ask on the 'discuss' mailing list...
Comment 12 Dave Hodgins 2023-11-07 17:48:30 CET
"man veth" - Virtual Ethernet Device has very little info on the uses.

Are you using docker, golang, or fop-javadoc? I suspect the devices are
being created by badly configured containers of one sort or another, and then
being detected by NetworkManager, which automatically adds a configuration
file for any unconfigured network device.
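
If that is what is happening, NetworkManager can be told not to create automatic profiles for devices it discovers; no-auto-default is a real NetworkManager.conf option, but whether it is the right fix here is only a guess:
# e.g. in /etc/NetworkManager/conf.d/no-auto-default.conf
[main]
no-auto-default=*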

CC: (none) => davidwhodgins

Comment 13 Pierre Fortin 2023-11-07 20:45:30 CET
I tried with a docker-compose image briefly. My journal for that period is expunged; but from 'history', I started docker-compose Sep 24 and installed jitsi, days before the Sep 29 reboot; and left it running until reboot on Oct 11.  Haven't done anything with docker since.
Comment 14 Lewis Smith 2023-11-07 22:19:39 CET
(In reply to Pierre Fortin from comment #11)
> > > * The huge & accumulating number of files named ifcfg-veth* in
> > > /etc/sysconfig/network-scripts, created at _random_ times; though on average
> > > about 4 per minute most of the time; some repeated within one second -- no
> > > repeating time pattern (https://bugs.mageia.org/show_bug.cgi?id=32373#c7).
> > Is this the ongoing situation?
> No; no idea where those came from. They only appeared between these reboots:
> Fri Sep 29 11:35:01 PM EDT 2023
> Wed Oct 11 02:34:24 AM EDT 2023

(In reply to Pierre Fortin from comment #13)
> I tried with a docker-compose image briefly. My journal for that period is
> expunged; but from 'history', I started docker-compose Sep 24 and installed
> jitsi, days before the Sep 29 reboot; and left it running until reboot on
> Oct 11.  Haven't done anything with docker since.

(In reply to Dave Hodgins from comment #12)
> "man veth" - Virtual Ethernet Device has very little info on the uses.
> Are you using docker, golang, or fop-javadoc? I suspect the devices are
> being created by badly configured containers of one sort or another, and then
> being detected by network manager which automatically adds a configuration
> file for any un-configured network device.
This corresponds to Dave's suggestion.
So at least they have gone.

Accepting the need to refine the NM Wiki, does that leave just the 100% CPU utilisation?
Ah - but what about the very long startup time? Gone or ongoing?

(In reply to Pierre Fortin from comment #10)
>  urpmi plasma-nm-applet
> No package named plasma-nm-applet
> Did you mean?:
>  urpmi plasma-applet-nm
A fair cop!
It seems that all these Plasma NM pkgs exist:
plasma-applet-nm
plasma-applet-nm-fortisslvpnui
plasma-applet-nm-l2tp
plasma-applet-nm-libreswan
plasma-applet-nm-openconnect
plasma-applet-nm-openvpn
plasma-applet-nm-pptp
plasma-applet-nm-ssh
plasma-applet-nm-strongswan
plasma-applet-nm-vpnc
Comment 15 Pierre Fortin 2024-02-03 05:49:27 CET
(In reply to Dave Hodgins from comment #12)
> "man veth" - Virtual Ethernet Device has very little info on the uses.
> 
> Are you using docker, golang, or fop-javadoc? I suspect the devices are
> being created by badly configured containers of one sort or another, and then
> being detected by network manager which automatically adds a configuration
> file for any un-configured network device.

Docker removed months ago.  Now, I found another crazy issue:  while I'm not seeing new veth* interfaces being created in /etc/sysconfig/network-scripts, iptables has been growing with entries like (example):
$ iptables -L -n | grep vetha5188d3
vetha5188d3_in  0    --  0.0.0.0/0            0.0.0.0/0           
vetha5188d3_fwd  0    --  0.0.0.0/0            0.0.0.0/0           
vetha5188d3_out  0    --  0.0.0.0/0            0.0.0.0/0           
Chain vetha5188d3_fwd (1 references)
Chain vetha5188d3_in (1 references)
Chain vetha5188d3_out (1 references)

One wouldn't be a big deal; but:
$ iptables -L -n | grep Chain | grep veth | grep _fwd | wc -l
5066
^^^^!!
$ grep -Rls vetha5188d3 /etc
/etc/shorewall/interfaces
$ ll /etc/shorewall/interfaces
-rw------- 1 root root 117027 Sep 29 12:51 /etc/shorewall/interfaces
$ grep veth /etc/shorewall/interfaces | wc -l
5066
$ grep -v veth /etc/shorewall/interfaces 
net     p5p1    detect                   # ethernet - not connected
net     br-b8ea9ef8ed7d detect  bridge
net     enp5s0  detect
net     docker0 detect  bridge
net     wlp10s0 detect                   # WiFi
net     br-935570e85ea1 detect  bridge
net     enp9s0  detect
net     vboxnet0        detect

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute 
       valid_lft forever preferred_lft forever
2: p5p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether 74:86:e2:14:83:3d brd ff:ff:ff:ff:ff:ff
    altname enp9s0
3: wlp10s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 44:e5:17:fd:11:87 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.46/24 brd 192.168.1.255 scope global noprefixroute wlp10s0
       valid_lft forever preferred_lft forever
    inet6 fe80::46e5:17ff:fefd:1187/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

So why does iptables contain all these rules with no actual interfaces?
# from: https://stackoverflow.com/questions/31989426/how-to-identify-orphaned-veth-interfaces-and-how-to-delete-them
$ for name in $(ifconfig -a | sed 's/[ \t].*//;/^\(lo\|\)$/d' | grep veth)
do
    echo $name
    # ip link delete $name # uncomment this 
done
#  nothing...
but...
$ wc -l /var/lib/shorewall/.iptables-restore-input 
76226 /var/lib/shorewall/.iptables-restore-input

Deleted all the veth interfaces from /etc/shorewall/interfaces and rebooting...
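
(For the record, the cleanup amounts to roughly this -- back up first; "reload" may be "restart" on older shorewall versions:)
# cp /etc/shorewall/interfaces /etc/shorewall/interfaces.bak
# grep -v veth /etc/shorewall/interfaces.bak > /etc/shorewall/interfaces
# shorewall check && shorewall reload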
Comment 16 Pierre Fortin 2024-02-03 06:20:39 CET
iptables -L -n  is now clean...
Comment 17 Dave Hodgins 2024-02-03 06:35:34 CET
If it doesn't stay clean, try uninstalling mandi-ifw and mandi.
