Description of problem:
Got lucky for a while watching for bug 30405 to trigger. Decided to remove some old kernels (desktop[+vbox] and server). Removed 5.16.11 OK. Removed 5.16.{12,13} OK. Tried to remove 3 sets (5.16.{14,17,18}), and mcc quit responding after a while, leaving:

$ rpm -qa | grep kernel | grep 5.16
kernel-desktop-5.16.14-1.mga9
kernel-server-5.16.14-1.mga9
kernel-server-5.16.17-1.mga9
kernel-desktop-5.16.17-1.mga9
kernel-desktop-5.16.17-2.mga9
kernel-server-5.16.17-2.mga9
kernel-server-5.16.18-1.mga9
kernel-desktop-5.16.18-1.mga9

Trying to kill mcc failed. Killed the main mcc and package-removal progress windows with xkill. The mcc splash stayed stuck on screen, with no way to get rid of it. Decided to reboot via App Menu->Power/Session->Restart -- now the menu is stuck on screen too.

$ ps aux | grep -e mcc -e drak
root 953592 0.0 0.1 356240 86784 pts/5 D 23:33 0:00 /usr/bin/perl /usr/libexec/drakconf
root 969807 0.0 0.1 356240 86804 pts/5 D 23:37 0:00 /usr/bin/perl /usr/libexec/drakconf
root 978895 0.0 0.1 356236 86456 pts/5 D 23:38 0:00 /usr/bin/perl /usr/libexec/drakconf
root 1074571 0.1 0.0 74324 45700 pts/5 T 23:56 0:00 /usr/bin/perl /usr/bin/mcc
root 1074572 0.2 0.1 356216 86652 pts/5 D 23:56 0:00 /usr/bin/perl /usr/libexec/drakconf

Version-Release number of selected component (if applicable):

How reproducible:
Not sure. Removing one or two sets of kernels at a time works, but three must have been too much... Will issue a reboot after submitting this info.

Steps to Reproduce:
1. probably remove old kernels
2.
3.
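For anyone trying to narrow this down from a terminal instead of mcc, a minimal sketch using Mageia's urpme (the package names are just the ones listed above; running the removal in a shell keeps the rpm output visible if the transaction stalls):

# urpme kernel-desktop-5.16.14-1.mga9 kernel-server-5.16.14-1.mga9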
Rebooted into:

$ uname -a
Linux pf.pfortin.com 5.17.6-server-1.mga9 #1 SMP PREEMPT Mon May 9 18:34:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Removing 5.16.14: quick and OK. Removing 5.16.17.1: quick and OK. Removing 5.16.17.2 is hung.

$ ps aux | grep -e mcc -e drak
root 56923 0.0 0.0 74312 45900 pts/4 S 00:15 0:00 /usr/bin/perl /usr/bin/mcc
root 56937 0.0 0.1 103715496 106920 pts/4 Sl 00:15 0:00 /usr/bin/perl /usr/libexec/drakconf
root 57029 3.6 0.8 1039644 566544 pts/4 Sl 00:15 0:34 /usr/bin/perl /usr/libexec/drakrpm --embedded 92274873

Looks like I'll have to reboot again... Waiting to see if bug 30405 will kick in...
$ rpm -qa | grep kernel | grep 5.16
kernel-desktop-5.16.17-2.mga9
kernel-server-5.16.18-1.mga9
kernel-desktop-5.16.18-1.mga9

Removed server; but apparently hung on removing desktop.
Surprise... finally got this in the journalctl window:

May 10 00:38:42 pf.pfortin.com [RPM][57029]: erase kernel-desktop-5.16.17-2.mga9.x86_64: success
May 10 00:38:42 pf.pfortin.com [RPM][57029]: Transaction ID 6279e79e finished: 0

Confirmed:
$ rpm -qa | grep kernel | grep 5.16
kernel-server-5.16.18-1.mga9
kernel-desktop-5.16.18-1.mga9

First 2 removes completed in about a minute. This one took nearly 10 minutes. :?
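For anyone watching a stuck removal, a simple way to follow the rpm transaction live (a minimal sketch; it just tails the journal and filters on the [RPM] tag seen above):

$ journalctl -f | grep RPM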
Is there a difference from your bug 29422, or is this a dup?
I'd be checking the sata cables for a poor connection. Not actually failing, just causing massive i/o delays.
CC: (none) => davidwhodgins
29422 was on a Dell M6800 running mga8. This one is on a Dell XPS 8950 running Cauldron.

Current situation: at first, it appeared de-selecting multiple kernels was the problem. So I went to de-selecting only one version at a time (kernel-desktop, which auto-de-selected virtualbox; and kernel-server); that worked for a couple, then the 3rd took a very long time.

It's interesting that 5.16.14 and 5.16.17.1 were quick and 5.16.17.2 was really slow. Had me wondering if .1 removed components that .2 couldn't find to remove... just a gut feeling... Oh... 5.16.17.1 and .2 were in the previous multi-de-select that never finished. Is each kernel completely self-contained, or could some (.1 and .2) possibly be sharing files? Interesting corner case if so...

$ inxi -CMSGxx
System: Host: pf.pfortin.com Kernel: 5.17.6-server-1.mga9 x86_64 bits: 64 compiler: gcc v: 12.1.1
  Desktop: MWM wm: kwin_x11 dm: LightDM, LXDM, SDDM Distro: Mageia 9 mga9
Machine: Type: Desktop System: Dell product: XPS 8950 v: N/A serial: 14FDLM3
  Chassis: type: 3 serial: 14FDLM3
  Mobo: Dell model: 0R6PCT v: A01 serial: .14FDLM3.CNFCW0021L00P3. UEFI: Dell v: 1.0.8 date: 12/22/2021
CPU: Info: 10-Core model: 12th Gen Intel Core i7-12700K bits: 64 type: MT MCP arch: Alder Lake rev: 2
  cache: L1: 1024 KiB L2: 25 MiB L3: 25 MiB
  flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 144383
  Speed: 2447 MHz min/max: 800/5000 MHz Core speeds (MHz): 1: 2447 2: 1142 3: 1369 4: 801 5: 807 6: 781 7: 812 8: 817 9: 852 10: 800 11: 800 12: 800 13: 803 14: 898 15: 891 16: 800 17: 800 18: 800 19: 1450 20: 850
Graphics: Device-1: Advanced Micro Devices [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] vendor: Dell driver: amdgpu v: kernel bus-ID: 0000:03:00.0 chip-ID: 1002:73ff
  Device-2: YGTek Webcam type: USB driver: snd-usb-audio,uvcvideo bus-ID: 1-5.4.3:6 chip-ID: 1d6c:1278
  Display: server: Mageia X.org 1.21.1.3 compositor: kwin_x11 driver: loaded: amdgpu,ati unloaded: fbdev,modesetting,radeon,vesa resolution: 1: 1920x1080~60Hz 2: 1920x1080~60Hz s-dpi: 96
  OpenGL: renderer: AMD Radeon RX 6600 XT (dimgrey_cavefish LLVM 14.0.0 DRM 3.44 5.17.6-server-1.mga9) v: 4.6 Mesa 22.0.3 direct render: Yes
(In reply to Dave Hodgins from comment #5)
> I'd be checking the sata cables for a poor connection. Not actually failing,
> just causing massive i/o delays.

The EFI, /, swap, and /home are on a PCIe M.2 1TB stick factory-mounted to the MB (no cables); never removed or even touched.

$ fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 953.87 GiB, 1024209543168 bytes, 2000409264 sectors
Disk model: PC SN810 NVMe WDC 1024GB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 63B074DD-9E88-4F24-B13E-DCF7AEBAA67A

Device Start End Sectors Size Type
/dev/nvme0n1p1 2048 614433 612386 299M EFI System
/dev/nvme0n1p2 616448 106086433 105469986 50.3G Linux filesystem
/dev/nvme0n1p3 106088448 114475041 8386594 4G Linux swap
/dev/nvme0n1p4 114477056 2000409230 1885932175 899.3G Linux filesystem

$ hdparm -tT /dev/nvme0n1p3
/dev/nvme0n1p3:
 Timing cached reads: 38492 MB in 2.00 seconds = 19271.12 MB/sec
 Timing buffered disk reads: 4094 MB in 1.69 seconds = 2417.00 MB/sec

Fastest 'disk' I have; over 90x faster than my fastest platter... The system has the above 1TB SSD & a 2TB platter internally, plus 4 external 6TB USB 3.2 Gen 2 (5 Gbps) drives. All external drives contain no code, data only -- never accessed while using mcc.
Shortly after boot, run "time fstrim -av". While fstrim is running, any other process trying to access the disk will remain in a device wait state. How long does it take?
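For reference, the requested check is just (a sketch; -a trims every mounted filesystem that supports discard, -v reports how much was trimmed, and time shows how long the run blocked):

# time fstrim -av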
I just checked: fstrim.timer is disabled by default. Systems using ssd drives need to either enable the timer or add the discard option to each fstab entry for filesystems on an ssd drive. If neither is done, the ssd drive will perform at its maximum rate until it runs out of unused pages; then every physical write has to wait for an unused page to be found and discarded, making it available again. That slows writes down to something like a floppy disc. Using the fstab entries slows down every write (slightly), so in my opinion it's preferable to enable fstrim.timer, which causes longer waits, but only once a week. Assigning to the kernel team to decide whether or not fstrim.timer should be enabled by default (in my opinion, it should be).
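For reference, the fstab alternative looks like this (a minimal sketch; the device, mount point and filesystem type are examples only -- the discard option turns on continuous/online trim for that filesystem):

/dev/nvme0n1p2  /  ext4  defaults,discard  0 0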
Assignee: bugsquad => kernel
@Dave: Our wiki about SSDs already has this hint:

fstrim.timer
For several versions now, systemd has managed the trim command via a timer. In Mageia the timer is set to run the command once a week for all SSDs, but it is not enabled by default.

Activating the timer:
# systemctl enable fstrim.timer

Starting the timer:
# systemctl start fstrim.timer

Make sure the timer is active:
# systemctl status fstrim.timer
The term "active" appears after this command.

Finally, these stages can be reduced to a single command that enables and launches the timer:
# systemctl enable --now fstrim.timer
In my opinion, the wiki entry is not enough. People who are new to using ssd drives will not think of searching for it, especially since the system will perform extremely well until it runs out of unused pages, with no obvious reason as to why it's suddenly like using floppies instead of ssd drives.
People who prefer to use the discard option can still disable the timer.
The discard option is no longer recommended. Intel, for example, strongly recommends not using it, as SSD controller functions are far more advanced nowadays.
https://www.intel.com/content/dam/support/us/en/documents/ssdc/data-center-ssds/Intel_Linux_NVMe_Guide_330602-002.pdf
We need more up-to-date information. Some years ago the story was that, without (or with) trim, an SSD on Linux would wear out very quickly (but not on Windows or OSX). IIRC the news was this: https://www.algolia.com/blog/engineering/when-solid-state-drives-are-not-that-solid/; since then trim has mostly been enabled via fstrim. Then it emerged that trim was not necessary, and fstrim even detrimental to the life of an SSD. That's why we need more up-to-date information.

It should also be noted that fstrim doesn't work if the disk is under some device-mapper device (e.g. luks, lvm, etc.); it ends with an error saying the discard operation is not supported.
CC: (none) => ghibomgx
It works under lvm if lvm.conf has issue_discards = 1 and under luks if --allow-discards is used.
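For reference, minimal sketches of both settings (device and mapping names are examples only):

# /etc/lvm/lvm.conf -- issue_discards lives in the devices section
devices {
    issue_discards = 1
}

# LUKS, one-shot at open time:
# cryptsetup open --allow-discards /dev/nvme0n1p4 cryptdata

# or persistently via /etc/crypttab:
# cryptdata  /dev/nvme0n1p4  none  luks,discard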
As to the recommendation from Intel not to use discard, that may be true for the newest Intel controllers, but from what I've seen it does not apply to most systems currently in use, such as my AMD systems.

I've been using a 256GB OCZ-AGILITY4 as my main drive since 2013. smartctl shows Lifetime_Writes 85470642116. I've been running fstrim -a in cron.daily since shortly after I started using it. The Media_Wearout_Indicator shows 92, so it's getting close to time to retire it.

I've since added 3 additional sata ssd drives that I use for backup of my primary install, and new release testing. I've yet to have an ssd drive fail, though I expect when one does it will not be usable at all, so I back it up frequently. On my laptop, I have two pcie nvme ssd drives.

Pierre, does running fstrim -a once appear to resolve the lockups for now?
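For anyone wanting to check their own drive, something like this works on SATA SSDs that expose the attribute (a sketch; /dev/sda is an example, and as noted later in this thread not all drives report a wear value):

# smartctl -A /dev/sda | grep -Ei 'wearout|lifetime'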
(In reply to Pierre Fortin from comment #0)
> Description of problem: Got lucky for a while watching for bug 30405 to
> trigger. Decided to remove some old kernels (desktop[+vbox] and server).

Does this "hang" only happen when your system is overloaded like in bug 30405? Running out of cpu and memory can cause processes to get "stalled" for a long time if other processes are getting (or demanding) higher priority...
(In reply to Dave Hodgins from comment #9)
> Assigning to the kernel team to decide whether or not fstrim.timer should be
> enabled by default (in my opinion, it should be).

Nope. Most modern ssds (since several years back) do their garbage collection quite nicely on their own (with pretty much no notable slowdowns for normal users).

Then there is the fact that several ssds can actually get in trouble with os/userspace trim, which has led to the kernel having to collect a list of known borked ssds:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-core.c#n3938

Technically, the best "designed use" of ssds is to ensure enough free space so the filesystem "wear levelling" spreads out the writes... and the ssd firmware also does its own wear levelling, depending on how much "reserved space" the manufacturer has designed the disc with...

As for using the "discard" mount option, that also has a side effect of slowing down some i/o paths... and not all ssds play nicely with it either... :/

So there is no nice "one size fits all" solution...
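A quick way to see whether a given drive (and the stack above it) actually advertises discard support (a sketch; non-zero DISC-GRAN and DISC-MAX columns mean discard requests can get through):

$ lsblk -D /dev/nvme0n1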
Is that info related only to the SSD's internal controller, or is the onboard NVMe or SATA controller also involved in some way? In other words, is the endurance of an SSD exactly the same when connected to a SATA or NVMe port as when moved to an external USB port (supposing UASP) inside an enclosure (either for NVMe or SATA SSDs)?
This machine is two months old, and so fast I never noticed 7 procs pegged until I saw them in htop. I used to run gkrellm on my laptop; 8 CPUs nearly filled the vertical space -- but 20 CPU displays is too much for the available vertical real estate -- I miss gkrellm on the side... Interestingly, when 7 CPUs were pegged, memory didn't call on swap...

You mentioned smartctl, so here's the output. My first time seeing this, so posting it all for now. The power cycles count is a surprise... unless the factory does a number of cycles; I wouldn't have guessed beyond a couple dozen... You say your drive is at 92; which value would that be here...? 1759 hours is 73 days, yet the system was delivered 64 days ago...

$ smartctl -a /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.6-server-1.mga9] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: PC SN810 NVMe WDC 1024GB
Serial Number: 220144803278
Firmware Version: 61510012
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity: 0
Controller ID: 8224
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 8b45be5075
Local Time is: Tue May 10 16:07:30 2022 EDT
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 83 Celsius
Critical Comp. Temp. Threshold: 88 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
 0 + 8.25W 8.25W - 0 0 0 0 0 0
 1 + 3.50W 3.50W - 0 0 0 0 0 0
 2 + 2.60W 2.60W - 0 0 0 0 0 0
 3 - 0.0250W - - 3 3 3 3 5000 10000
 4 - 0.0035W - - 4 4 4 4 3900 45700

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
 0 + 512 0 2
 1 - 4096 0 1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 38 Celsius
Available Spare: 100%
Available Spare Threshold: 50%
Percentage Used: 0%
Data Units Read: 599,473 [306 GB]
Data Units Written: 4,333,080 [2.21 TB]
Host Read Commands: 6,750,226
Host Write Commands: 53,903,044
Controller Busy Time: 39
Power Cycles: 55
Power On Hours: 1,759
Unsafe Shutdowns: 42
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

This issue was seen while I was waiting for the system to lock up (bug 30405); I know, highly risky -- gotta have faith... :) Right now, it feels like it's back to normal...

I often get gut feelings when things are abnormal; this is going to sound strange, but... While watching more closely, I spotted an IPv6 packet from my phone dropped in the journal. Turned off the phone's WiFi. Then I decided to drop BT (it's back up now). KDEconnect was connected to the old laptop; disconnected it too. This morning when I took the picture of the sysstat processes, I was having serious problems with my phone being really slow; it wouldn't even save the first pictures of htop until the system locked up...
I'll have to watch for interactions... At least, for now, looking normal again...
(In reply to Thomas Backlund from comment #17)
> Does this "hang" only happen when your system is overloaded like in bug
> 30405?

No. Since I got it 2 months ago, I was keeping it very busy processing several TB of data. Lately, I was waiting on new drives (I now have 4 external 6TB platters), so I can get back to processing the data. So in comparison, the system was mostly idling when this started. The lockups were most unexpected -- probably my worst in 20 years...

The SSD only handles the OS and minimal /home data -- most of the /home data is still on the laptop. The hardest-hit drive has been the internal 2TB platter; with the new 6TB platters, most work will be drive-to-different-drive, not much within a single drive.
(In reply to Pierre Fortin from comment #21)
> (In reply to Thomas Backlund from comment #17)
> > Does this "hang" only happen when your system is overloaded like in bug
> > 30405?
>
> No. Since I got it 2 months ago, I was keeping it very busy processing
> several TB of data. Lately, I was waiting on new drives (I now have 4
> external 6TB platters), so I can get back to processing the data. So in
> comparison, the system was mostly idling when this started. The lockups
> were most unexpected -- probably my worst in 20 years...

Ok, I wonder if you hit an issue with core scheduling on Alder Lake. It was IIRC supposed to be fixed in the 5.17.x series, but maybe you found another "corner case"...
(In reply to Thomas Backlund from comment #22)
> Ok, I wonder if you hit an issue with core scheduling on Alder Lake.
> It was IIRC supposed to be fixed in the 5.17.x series, but maybe you found
> another "corner case"...

Hmm... I've been wondering why my gut was telling me something was different when watching htop. Dunno if this matches your comment, but last week and before, I noticed that most tasks seemed to use just one CPU. Now most of the CPUs show activity simultaneously... I saw firefox peaks today (the few times I happened to be looking) reach around 170% with no single CPU higher than about 25%... Would that be the result of the new core scheduling? Sounds like it...
Just uninstalled 5.16.18.1 and 5.17.2.2 -- well under a minute each. Sounds like fixed core scheduling may have solved both 30405 & 30406... I'm going to keep working normally in faith that all is good again... :)
(In reply to Pierre Fortin from comment #20)
> dozen... You say your drive is at 92; which value would that be here...?

Not all ssd drives report the wear value. My oldest and newest sata drives do; the other two sata drives and the two pcie nvme drives don't.
Closing
Status: NEW => RESOLVED
Resolution: (none) => OLD