26154 – Internet connection doesn't work, but it works after sleep return

Status:	RESOLVED FIXED

Alias:	None

Product:	Mageia
Classification:	Unclassified
Component:	Installer
Version:	7
Hardware:	x86_64 Linux

Priority:	Normal
Severity:	minor
Target Milestone:	---
Assignee:	Mageia Bug Squad
Keywords:	IN_ERRATA7

Reported:	2020-01-30 21:06 CET by Etienne Etienne
Modified:	2020-02-04 00:23 CET
CC List:	3 users

Attachments
command # dmesg (60.59 KB, text/plain) 2020-02-03 18:37 CET, Etienne Etienne
command # journalctl -ab (202.30 KB, text/plain) 2020-02-03 18:38 CET, Etienne Etienne

Description Etienne Etienne 2020-01-30 21:06:34 CET

Description of problem:

My network card is an NVIDIA Corporation MCP61 Ethernet.
This computer previously ran Windows.
The Internet connection did not work with the Mageia 7 live DVD, nor with the Ubuntu 18.04 live DVD.
It still did not work after a full install from the Mageia 7 DVD.

But when I put the computer to sleep, the connection works once it comes back from sleep.

I fixed the bug like this:
I created a file /etc/modprobe.d/forcedeth.conf containing:
options forcedeth msi=0 msix=0

Or it may be these commands that fixed it:
[root@localhost tieno]# ifconfig enp0s7 down
[root@localhost tieno]# modprobe -r forcedeth
[root@localhost tieno]# modprobe forcedeth msi=0 msix=0
[root@localhost tieno]# dhclient enp0s7
RTNETLINK answers: File exists

For me it is now OK, but the MLO forum (French) advised me to file a bug report so that this can be fixed definitively.

Thank you !

Here is some information about my hardware and release:
[tieno@localhost ~]$ lspcidrake -v |grep -i net
forcedeth       : NVIDIA Corporation|MCP61 Ethernet [BRIDGE_OTHER] (vendor:10de device:03ef subv:1849 subd:03ef) (rev: a2)
[tieno@localhost ~]$ uname -a
Linux localhost.localdomain 5.4.12-desktop-1.mga7 #1 SMP Tue Jan 14 21:14:55 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux


And here is the journal from before the bug was fixed, covering starting the computer, putting it to sleep, and returning from sleep:

journalctl -x -b 0 >journal.txt
chown 1000:1000 journal.txt

(Sorry, the text report is too long to paste here; I get an error message when I try to post it.)

Comment 1 Lewis Smith 2020-01-31 11:33:53 CET

Thank you for reporting this. The solutions you have found are very expert, and point to the 'forcedeth' driver.
It is surprising that the journal was so large. Do the sleep-wakeup manipulation quickly after booting to minimise it.
For information to attach, whenever it is large, compress the file first. We recommend 'xz' :
 $ xz <filename>        [creates filename.xz]
---
Please post the trimmed output (just the section for the Ethernet controller) from :
 $ lspci -v

Then please, from the UNcorrected system with the fault :
1.
 After booting, before sleep, save to post just the [dead] ethernet section from:
 # ifconfig
2.
 Do the sleep-wakeup manipulation [which you say kick-starts the connection].
3.
 Save to post just the [live] ethernet section from:
 # ifconfig
4.
 $ dmesg > dmesg.txt
to attach compressed to this bug.
5.
 # journalctl -ab > journal.txt          [as root gives all messages]
to attach compressed to this bug.
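
Steps 1 and 3 above could be done with a small filter like the following. This is only a sketch: it assumes net-tools style ifconfig output in which each interface stanza starts at column 0 with "name:", and the helper name iface_section is hypothetical, not part of any Mageia tool.

```shell
# Print only the stanza for one interface from ifconfig output on stdin.
# $1: interface name (e.g. enp0s7)
iface_section() {
    awk -v ifc="$1" '
        # A line starting at column 0 opens a new stanza; remember whether
        # it belongs to the interface we want.
        /^[^ \t]/ { show = ($1 == (ifc ":")) }
        # Print the non-empty lines of the wanted stanza.
        show && NF
    '
}

# Typical use, as root (not executed here):
#   ifconfig | iface_section enp0s7 > ifconfig-before.txt
#   ... sleep and wake up ...
#   ifconfig | iface_section enp0s7 > ifconfig-after.txt
```

The two saved files are small enough to paste directly; dmesg.txt and journal.txt are the ones worth compressing with xz before attaching.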

CC: (none) => lewyssmith

Comment 2 Etienne Etienne 2020-02-03 18:36:48 CET

Oh, I'm sorry, I didn't see how to attach a file to my post, so I tried to copy all the text into the post.

Anyway, here is the output of the commands you asked for:

[tieno@localhost ~]$ lspci -v
[...]
00:07.0 Bridge: NVIDIA Corporation MCP61 Ethernet (rev a2)
        Subsystem: ASRock Incorporation 939NF6G-VSTA Board
        Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 28, NUMA node 0
        Memory at edffd000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at d080 [size=8]
        Capabilities: <access denied>
        Kernel driver in use: forcedeth
        Kernel modules: forcedeth

(Same output whether the bug is present or not)

Then, from the uncorrected system:

[root@localhost tieno]# ifconfig
enp0s7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether d0:50:99:82:1a:03  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 13  bytes 3487 (3.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

After sleep-wakeup :

[root@localhost tieno]# ifconfig
enp0s7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.0.11  netmask 255.255.255.0  broadcast 192.168.0.255
        ether d0:50:99:82:1a:03  txqueuelen 1000  (Ethernet)
        RX packets 1  bytes 590 (590.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 25  bytes 5493 (5.3 KiB)
        TX errors 0  dropped 36 overruns 0  carrier 0  collisions 0

Then :

# dmesg > dmesg.txt
[see the attachment]

# journalctl -ab > journal.txt
[see the attachment]

Good Luck !

Comment 3 Etienne Etienne 2020-02-03 18:37:56 CET

Created attachment 11493
command # dmesg

Comment 4 Etienne Etienne 2020-02-03 18:38:56 CET

Created attachment 11494
command # journalctl -ab

Comment 5 Dave Hodgins 2020-02-03 19:24:19 CET

The critical line in the output appears to be
forcedeth 0000:00:07.0 enp0s7: Got tx_timeout. irq status: 00000032

As per https://forums.gentoo.org/viewtopic-t-860574-start-0.html
try adding pci=nomsi to the kernel command line parameters.

The easiest way to do that is using mcc/boot/Set up boot system, on the
second screen presented in that function.

If that works, an entry describing the problem/fix should be added to the
Mageia 7 errata.
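
To confirm after rebooting that the parameter actually reached the kernel, the running command line can be inspected. A small sketch, assuming a POSIX shell; the function name has_nomsi is made up for illustration.

```shell
# Succeed if the kernel command line on stdin contains pci=nomsi,
# either alone or inside a comma-separated pci= option list.
has_nomsi() {
    tr -s ' \t' '\n' | grep -Eq '^pci=([^,]+,)*nomsi(,.*)?$'
}

# Typical use on the running system (not executed here):
#   if has_nomsi < /proc/cmdline; then echo "pci=nomsi is active"; fi
```

If the check fails after the change made in mcc, the parameter was probably added to a different boot entry than the one that was booted.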

CC: (none) => davidwhodgins

Comment 6 Etienne Etienne 2020-02-03 20:27:30 CET

Ok, it works like this.

I opened the MCC, chose the section Boot / Set up boot system, went to the second screen (just click "Next" on the first), and added pci=nomsi at the end of the kernel command line parameters.

I restarted my computer and the Internet connection is working.

Thank you !

Comment 7 Dave Hodgins 2020-02-03 20:50:43 CET

Added to errata https://wiki.mageia.org/en/Mageia_7_Errata#Nvidia_corporation_MCP61_ethernet_fails_to_connect

Closing the bug report. Please reopen if the problem does show up again.

Resolution: (none) => FIXED
Keywords: (none) => IN_ERRATA7
Status: NEW => RESOLVED

Comment 8 Frank Griffin 2020-02-03 21:26:08 CET

Looking at the journalctl, it appears that the problem is with your DHCP server during initialization.  If you search for "link beat", you'll see that link beat is detected during initialization (so there isn't a problem with the NIC or the driver), but then it issues a DHCPREQUEST for 192.168.0.11 (because this is the IP it had last) and gets no response.  Then it tries DHCPDISCOVER, which means "I'll take any IP", and gets no response.  Then it seems to give up.

Later, when you sleep and wake up, it drops link beat, immediately finds it again, and goes through the DHCP process above, except now the DHCPREQUEST works on the first try.

So it appears that during initialization, your DHCP server isn't available, but becomes available at some later time, and is available when you do the sleep/wake activity.

So there are two issues here.  

One is why the DHCP server isn't there during initialization.  Why it isn't at exactly the time you boot the machine is a mystery, unless you're running it on this system and it just hasn't started yet.  You know where your DHCP server is, and presumably why it's not available during boot, but *is* available and working properly later.

The other is why net-applet is giving up and not continuing to retry DHCP until it manages to connect.  Possibly there is something in your dhclient conf or even in the ifcfg file that shuts off retry after one or two tries.
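
One way to pull just those events out of the journal attachment, as a rough sketch. The function name dhcp_events is hypothetical, and the exact message wording depends on the DHCP client and link monitor in use, so the pattern is an assumption that may need adjusting.

```shell
# Keep only link-beat and DHCP handshake lines from journal text on stdin.
dhcp_events() {
    grep -Ei 'DHCP(DISCOVER|REQUEST|OFFER|ACK|NAK)|link beat'
}

# Typical use (not executed here):
#   dhcp_events < journal.txt
```

Reading the filtered lines in time order shows which requests went unanswered before sleep and which were answered after wake-up.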

Comment 9 Dave Hodgins 2020-02-03 21:37:52 CET

As I see it, the dhcp lookup fails because the ethernet device is timing out
trying to send the lookup packet, which is fixed by using nomsi.

Comment 10 Lewis Smith 2020-02-03 21:55:38 CET

Well solved!
Thank you Etienne for all the evidence you provided, and the work for that.
And Dave & Frank for your inputs.

That Gentoo thread (comment 5) is 9 years old, and looks like a different problem.
The suggestion "try booting with "pci=nomsi" appended to the kernel commandline" was not verified there; but it is here (suggested in comment 5, confirmed in comment 6).

The last word was "forcedeth did not like my switch being hard coded to 100M full duplex even though the NIC is still hard coded. On a whim I decided to auto-negotiate the port speed and it hasn't happened since"

I was puzzled by the differences before/after sleep/wakeup from ifconfig, which agrees with Frank's (and my) supposition about the Internet 'box' synchronisation not happening initially; from comment 2:
BEFORE
enp0s7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500    [RUNNING?]
        ether d0:50:99:82:1a:03  txqueuelen 1000  (Ethernet)
AFTER
enp0s7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.0.11  netmask 255.255.255.0  broadcast 192.168.0.255
        ether d0:50:99:82:1a:03  txqueuelen 1000  (Ethernet)

Comment 11 Frank Griffin 2020-02-03 22:14:54 CET

(In reply to Dave Hodgins from comment #9)
> As I see it, the dhcp lookup fails because the ethernet device is timing out
> trying to send the lookup packet, which is fixed by using nomsi.

Dave, I don't dispute that this fixed it, but I don't understand why.  MSI, IIRC, is just a replacement for hardware IRQs.  Why does this have anything to do with whether the lookup packet gets to the DHCP server?  Or does MSI not initialize in time to service dhclient?

CC: (none) => ftg

Comment 12 Dave Hodgins 2020-02-03 22:33:25 CET

In the dmesg log the line
[ 31.994626] NETDEV WATCHDOG: enp0s7 (forcedeth): transmit queue 0 timed out
is the first indication of a problem. The corresponding line in the journal is
févr. 03 18:08:00 localhost.localdomain kernel: NETDEV WATCHDOG: enp0s7 (forcedeth): transmit queue 0 timed out

The dhcp lookup has been sent from the dhcp client to the kernel, but the
kernel module is failing to transmit that over the pci bus to the ethernet
device. The ethernet device doesn't get the lookup to try and send over the
network, so the dhcp server never sees it.

When the system is recovering from sleep, it doesn't time out sending the
packet from the kernel to the ethernet device, so the lookup is then sent
to the network, and works. Why it works when recovering from sleep and not
during normal boot, is not clear to me, but likely due to the order devices
are powered up, or recovering from sleep needing fewer amps than booting.

Using the kernel option pci=nomsi disables the use of MSI interrupts
https://en.wikipedia.org/wiki/Message_Signaled_Interrupts
That forces the kernel to fall back to using the older (slightly slower) APIC
method of handling interrupts. The APIC method is more reliable for some
pci devices, as appears to be the case here.
https://www.tldp.org/HOWTO/Plug-and-Play-HOWTO-7.html

In the transmission chain, when working the steps involved are
1. dhcp client program sends lookup request to kernel
2. kernel calls the appropriate module to handle the packet
3. forcedeth module sends the packet to the ethernet device over the pci bus
4. pci ethernet device sends the packet over the network to the server

During boot, step 3 is timing out, not step 4, but works when recovering from
sleep. Using APIC interrupt handling instead of MSI allows step 3 to work
during boot.
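
Whether the device is currently being serviced through MSI or through the older interrupt path can be read from /proc/interrupts. The following is a sketch only: it assumes the interface name appears on the device's interrupt line and that MSI lines carry the text PCI-MSI, which is how the kernel usually labels them. The function name irq_mode is hypothetical.

```shell
# Report how an interface's interrupt is delivered, given
# /proc/interrupts-style text on stdin.
# $1: interface name. Prints "msi", "legacy", or "none".
irq_mode() {
    awk -v ifc="$1" '
        index($0, ifc) {
            found = 1
            if ($0 ~ /PCI-MSI/) msi = 1
        }
        END {
            if (!found)   print "none"      # no interrupt line mentions the interface
            else if (msi) print "msi"       # delivered as a message-signaled interrupt
            else          print "legacy"    # e.g. routed through the IO-APIC
        }'
}

# Typical use (not executed here):
#   irq_mode enp0s7 < /proc/interrupts
```

On the affected machine one would expect "msi" before the workaround and "legacy" after booting with pci=nomsi, though that was not captured in this report.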

Comment 13 Dave Hodgins 2020-02-03 22:46:35 CET

It's the timeout message coming from the network device watchdog, not from
the dhcp client that indicates it's the pci bus that's timing out, not the
ip network.
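
To check a dmesg or journal capture for that watchdog message, something like the following could be used. It is a sketch that assumes the message format quoted in comment 12; the function name watchdog_timeouts is made up for illustration.

```shell
# Summarise NETDEV WATCHDOG transmit timeouts from log text on stdin.
# Prints one line per device: "<interface> <driver> <count>".
watchdog_timeouts() {
    sed -n 's/.*NETDEV WATCHDOG: \([^ ]*\) (\([^)]*\)): transmit queue [0-9]* timed out.*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Typical use (not executed here):
#   dmesg | watchdog_timeouts
#   watchdog_timeouts < journal.txt
```

No output means the watchdog never fired in that capture, which would point away from the transmit path and back towards the network.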

Comment 14 Frank Griffin 2020-02-03 23:55:54 CET

So, it sounds like there is a bug here.  I get that the DHCP packet never gets sent, so the fault is not in the network.  But is the fault in the forcedeth driver timing out too soon, the NETDEV watchdog timing out too soon, or the MSI support not initializing soon enough to service what the driver is asking of it ?

Regardless of whether the OP's problem is solved by the nomsi workaround, it seems like somebody's dropping the ball here and we should find out why.

Comment 15 Dave Hodgins 2020-02-04 00:23:23 CET

Given that it works when recovering from sleep, but not on boot, I think
it's more likely to be a problem with the hardware or the firmware on the
ethernet card rather than the kernel module. Confirming that would require
resources I don't have.
