in reply to Re: [OT] Reminder: SSDs die silently
in thread [OT] Reminder: SSDs die silently
I'm considering repeating that process on my machine at work. That machine is basically a copy of my main computer, using a slightly newer mainboard and a slightly faster CPU. And it has no backup at all.
Well, this is turning into a kind of "Alexander's blog of dying SSDs".
I installed the UrBackup client on the PC at work, so it will be backed up by our regular backup server. I let the machine run over a weekend to get both an image and a file backup. Backup problem solved.
Originally, the PC at work had a relatively cheap, well-used small SSD for the system and a major brand large SSD for the data; the latter had recently been upgraded. I ordered two more major brand SSDs (one small, one large), started creating two RAIDs (using the onboard SATA fake RAID), and began copying the data and system drives to the RAIDs. Things went slightly wrong right at the start: the first RAID (the one for the data drive) switched to degraded mode (i.e. one drive was marked as failed) while data was still being copied to it. Oh well, I was using a slightly wonky setup, with cables and SSDs dangling around and old, abused data cables, so it might have been my fault.
But during the next days, the data RAID failed almost every day, and almost always on one of the new major brand SSDs. Rebooting made the disks show up again, and I could start a RAID reconstruction. I changed cables, I changed SATA ports, I even replaced the power supply because it made some load-related noise. No change. Almost every day, the RAID failed, with one of the new SSDs disappearing. I updated the RAID drivers and the RAID monitoring/controlling software to the newest versions available from the chipset manufacturer, and then rolled back to exactly the versions running at home. No change. I ordered a new large SSD, again the same major brand, to replace the SSD that seemed to be broken.
I booted Linux from a USB stick and luckily, that Linux did not know anything about the fake RAID. It saw just four SSDs, and using smartmontools, I confirmed that all four SSDs were healthy and diagnosed themselves as healthy.
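For reference, the check itself is quick. Here is a minimal sketch of it in Perl, assuming the live Linux shows the disks as /dev/sda to /dev/sdd (the device names and the exact smartctl calls are assumptions, not a transcript of my session):

```perl
#!/usr/bin/perl
# Minimal sketch: ask smartctl for the overall health verdict (-H) and the
# attribute table (-A) of each SSD. Device names are assumptions.
use strict;
use warnings;

for my $dev (map { "/dev/sd$_" } 'a' .. 'd') {
    print "=== $dev ===\n";
    system 'smartctl', '-H', $dev;   # "SMART overall-health ...: PASSED"?
    system 'smartctl', '-A', $dev;   # reallocated sectors, wear level, ...
}
```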
I googled A LOT while testing. Many, many people suggested using almost any other RAID solution, but not THIS onboard SATA fake RAID (AMD RAIDXpert). From what I read in the generated logs, the driver change logs, and the behaviour of my work PC, my guess is that the RAID driver very aggressively reads the disk identification and maybe also the SMART data. If one disk fails to respond within a very short time, it is considered offline and the RAID switches to degraded mode. The RAID failed especially under high disk load (e.g. when booting two VMs at the same time).
So I decided to order a different fake RAID controller: a relatively cheap SATA controller, but from a manufacturer with a good reputation and a lot of RAID experience. It came with a set of new SATA cables. I did some more disk juggling and copying, and finally got my system and data volumes onto the new controller. During setup, the new RAID controller complained about the old cheap SSD: it failed to execute the TRIM command needed to get rid of the onboard RAID metadata on the SSD. So I used the SSD I had ordered last to replace that SSD, instead of the one I had suspected to be broken.
The work PC has now run for two weeks without a single complaint. The RAID monitoring software of the new controller logs some issues at boot-up, but does not complain about them. So it seems the SSDs show some slightly unexpected, but completely tolerable, behaviour at boot.
So I was left with the old, cheap SSD, containing my entire OS including my user profile. It won't go into the junk bin in that state. I grabbed another computer, pushed in another fake RAID controller from the same manufacturer, and used its extension ROM to try deleting the SSD again. It failed even in that machine, which has almost nothing in common with my work PC. So I removed the fake RAID controller from the temporary PC, connected the SSD and the spare hard disk (used while copying my data) to the onboard SATA controller, fired up Linux, tried the TRIM command one more time (and failed again), and finally wiped both disks using ddrescue, writing /dev/zero to each of them until the disk was full. The HDD wrote at about 130 MB/s, while the SSD had a hard time reaching even 30 MB/s. Both were successfully filled with zeros. The HDD goes back to the cold spare shelf; the SSD will be subjected to a nice 4 kV burn-in test (suggested by our hardware expert) before going to the junk bin.
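For illustration, the zero-fill step boils down to streaming zeros at the raw device until the kernel reports it full. The real wipe used ddrescue with /dev/zero as the input; the following Perl sketch only shows the idea, and the device path is a placeholder (pointing it at the wrong disk will of course destroy that disk):

```perl
#!/usr/bin/perl
# Rough Perl equivalent of the ddrescue zero-fill, for illustration only.
# CAUTION: irrevocably overwrites whatever device you name on the command line.
use strict;
use warnings;

my $dev   = shift // die "usage: $0 /dev/sdX\n";
my $block = "\0" x (1024 * 1024);          # write in 1 MiB chunks

open my $fh, '>', $dev or die "open $dev: $!\n";
binmode $fh;

my $written = 0;
while (1) {
    my $n = syswrite $fh, $block;
    last unless defined $n && $n > 0;      # undef or 0 => device full (ENOSPC)
    $written += $n;
}
close $fh;
printf "wrote %.1f GiB of zeros to %s\n", $written / 2**30, $dev;
```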
So what happened? Why did the data RAID break when the system RAID had a malfunctioning SSD?
My guess is that this is an issue in the mainboard's fake RAID drivers. I guess that they use a timer to poll the identification and/or SMART data, plus async I/O. Once that timer expires, a single(!) timeout timer is started, and the results are read from all disks. The cheap SSD was the first drive and took quite long to answer, but just not long enough for a timeout. The other three SSDs normally answered more quickly, but under high I/O load they were busy doing other stuff. In that case, due to the slow first SSD, the timeout timer expired before the third (unlucky) SSD could answer. Sometimes the fourth SSD was unlucky, rarely even the second.
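To make that guess concrete, here is a toy model of the suspected logic in a few lines of Perl. The timings and the strictly sequential read order are pure assumptions for illustration; the point is only that a single shared deadline lets a slow-but-healthy first disk push a perfectly fine later disk past the timeout:

```perl
#!/usr/bin/perl
# Toy model of the suspected bug: ONE timeout for the whole poll cycle
# instead of one timeout per disk. All numbers are invented.
use strict;
use warnings;

my $budget_ms = 100;                       # single shared deadline (assumed)
my @disks = (
    [ 'cheap system SSD' => 90 ],          # slow to answer, but never "fails"
    [ 'system SSD #2'    =>  4 ],
    [ 'data SSD #1'      => 15 ],          # briefly busy: two VMs booting
    [ 'data SSD #2'      =>  4 ],
);

my $elapsed = 0;
for my $disk (@disks) {
    my ($name, $reply_ms) = @$disk;
    $elapsed += $reply_ms;                 # results are read one after another
    if ($elapsed > $budget_ms) {
        print "$name: no answer before the shared deadline -> marked failed\n";
        last;                              # one lost member degrades the mirror
    }
    print "$name: answered in time (${elapsed} ms into the cycle)\n";
}
```

With a per-disk timeout, the briefly busy data SSD would still answer well within its own budget; only the shared deadline turns the cheap SSD's slowness into a phantom failure on a different, healthy disk.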
Why is that not a known bug? I guess using two RAID-1s of two SSDs each is not a common use case for that mainboard. Having one of the SSDs slow down is probably even rarer. And people with SSD problems are very unlikely to report them to the mainboard and/or chipset manufacturer.
Will it be fixed? Unlikely. I did not bother to report the problem; the chipset is 12 years old, and I guess no one will fix drivers for consumer hardware that old.
Lessons learned:
Alexander
A note on RAID jargon:
Replies are listed 'Best First'.
Re^3: [OT] Reminder: SSDs die silently
  by afoken (Chancellor) on Jan 12, 2024 at 10:15 UTC
    by NERDVANA (Priest) on Jan 12, 2024 at 18:36 UTC
    by afoken (Chancellor) on Jan 12, 2024 at 22:33 UTC
    by NERDVANA (Priest) on Jan 15, 2024 at 06:49 UTC
Re^3: [OT] Reminder: SSDs die silently
  by hippo (Archbishop) on Jun 10, 2023 at 22:39 UTC
    by afoken (Chancellor) on Jun 11, 2023 at 10:42 UTC