in reply to Re: [OT] Reminder: SSDs die silently
in thread [OT] Reminder: SSDs die silently

I'm considering repeating that process on my machine at work. That machine is basically a copy of my main computer, using a slightly newer mainboard and a slightly faster CPU. And it has no backup at all.

Well, this is turning into a kind of "Alexander's blog of dying SSDs".

I installed the Urbackup client on the PC at work, so it will be backed up by our regular backup server. I let the machine run over a weekend to get both an image and a file backup. Backup problem solved.

Originally, the PC at work had a relatively cheap, well-used small SSD for the system and a major-brand large SSD for the data; the latter had recently been upgraded. I ordered two more major-brand SSDs (one small, one large), started creating two RAIDs (using the onboard SATA fake RAID), and began copying the system and data drives to the RAIDs. Things went slightly wrong right at the start: the RAID for the data drive switched to degraded mode (i.e. one drive was marked as failed) while data was still being copied to it. Oh well, I used a slightly wonky setup with cables and SSDs dangling around and old, abused data cables, so it might have been my fault.

But over the next few days, the data RAID failed almost every day, almost always on one of the new major-brand SSDs. Rebooting made the disks show up again, and I could start the RAID reconstruction. I changed cables, I changed SATA ports, I even replaced the power supply because it made some load-related noise. No change. Almost every day, the RAID failed, with one of the new SSDs disappearing. I updated the RAID drivers and the RAID monitoring/controlling software to the newest versions available from the chipset manufacturer, and then went back to exactly the versions running at home. No change. I ordered a new large SSD, again the same major brand, to replace the SSD that seemed to be broken.

I booted Linux from a USB stick, and luckily that Linux did not know anything about the fake RAID. It saw just four SSDs, and using smartmontools I confirmed that all four SSDs reported no errors and diagnosed themselves as healthy.
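
In essence, that check is just asking every drive for its SMART self-assessment. A minimal Perl sketch of what it boils down to (the device names are examples; smartctl is part of smartmontools and needs root):

    #!/usr/bin/perl
    # Minimal sketch: ask each SSD for its SMART overall health self-assessment.
    # The device names are examples; adjust to whatever the live Linux enumerates.
    use strict;
    use warnings;

    for my $dev (qw(/dev/sda /dev/sdb /dev/sdc /dev/sdd)) {
        my $out = `smartctl -H $dev`;                # from the smartmontools package
        if ($out =~ /PASSED/) {
            print "$dev: self-assessment PASSED\n";
        }
        else {
            print "$dev: look at this one manually (smartctl -a $dev)\n";
        }
    }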

I googled A LOT while testing. Many, many people suggested using almost any other RAID solution, but not THIS onboard SATA fake RAID (AMD RAIDXpert). From what I read in the generated logs and the drivers' change logs, and from the behaviour of my work PC, my guess is that the RAID driver polls the disk identification, and maybe also the SMART data, very aggressively. If one disk fails to respond within a very short time, it is considered offline and the RAID switches to degraded mode. The RAID failed especially under high disk load (e.g. booting two VMs at the same time).

So I decided to order another fake RAID controller, using a relatively cheap SATA controller, but from a manufacturer with a good reputation and a lot of RAID experience. It came with a set of new SATA cables. I did some more disk juggling and copying, and finally got my system and data volumes onto the new controller. During setup, the new RAID controller complained about the old cheap SSD: it failed to execute the TRIM command needed to get rid of the onboard RAID metadata on the SSD. So I used the SSD I had ordered last to replace that SSD instead of the one I had suspected to be broken.

The work PC has worked for two weeks without a single complaint. The RAID monitoring software of the new controller logs some issues at boot-up, but it does not complain about them. So it seems the SSDs show some slightly unexpected, but completely tolerable, behaviour at boot.

So I was left with the old, cheap SSD, containing my entire OS including my user profile. That won't go into the junk bin in that state. I grabbed another computer, pushed in another fake RAID controller from the same manufacturer, and used its extension ROM to try deleting the SSD again. It failed even in that machine, which has almost nothing in common with my work PC. So I removed the fake RAID controller from the temporary PC, connected the SSD and the spare harddisk (used while copying my data) to the onboard SATA controller, fired up Linux, tried and failed one more time to use the TRIM command, and finally wiped both using ddrescue, writing /dev/zero to each of the disks until they were full. The HDD wrote at about 130 MB/s, while the SSD had a hard time reaching even 30 MB/s. Both were successfully filled with zeros. The HDD goes back to the cold-spare shelf; the SSD will be subject to a nice 4 kV burn-in test (suggested by our hardware expert) before going to the junk bin.
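
For the record, the zero-fill itself is nothing magic. A minimal Perl sketch of the same idea, for anyone without ddrescue at hand (the device name is a placeholder, and of course this irrevocably destroys everything on that device):

    #!/usr/bin/perl
    # Minimal zero-fill sketch, roughly what ddrescue did here: keep writing
    # zero blocks to the device until it is full. /dev/sdX is a placeholder --
    # triple-check it before running this as root, it destroys all data.
    use strict;
    use warnings;

    my $dev   = '/dev/sdX';
    my $block = "\0" x (1024 * 1024);                # 1 MiB of zeros per write

    open my $fh, '>', $dev or die "Cannot open $dev: $!";
    binmode $fh;

    my $mib = 0;
    while (1) {
        my $written = syswrite($fh, $block);
        # A short or failed write means the device is full -- the expected end.
        last unless defined $written && $written == length $block;
        $mib++;
    }
    close $fh;
    print "Wrote about $mib MiB of zeros to $dev\n";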

So what happened? Why did the data RAID break when the system RAID had a malfunctioning SSD?

My guess is that this is an issue in the mainboard fake RAID drivers. I suspect they use a timer to poll the identification and/or SMART data, plus async I/O. Once that timer expires, a single(!) timeout timer is started, and the results are read from all disks. The cheap SSD was the first drive and took quite long to answer, but just not long enough for a timeout. The other three SSDs answered more quickly, but under high I/O load the SSDs were busy doing other stuff. And in that case, thanks to the slow first SSD, the timeout timer expired before the third (unlucky) SSD could answer. Sometimes the fourth SSD was unlucky, rarely even the second one.
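
To make that guess concrete, here is a tiny Perl simulation of the suspected failure mode. The timings and the "one shared deadline" logic are pure speculation on my part, not the vendor's actual code:

    #!/usr/bin/perl
    # Purely hypothetical model of the suspected driver bug -- my guess, not
    # the actual RAIDXpert code. One shared timeout covers a whole polling batch.
    use strict;
    use warnings;

    # Invented response times (ms) under heavy I/O load: the cheap system SSD
    # answers very late, the healthy SSDs a bit late because they are busy.
    my @disks = (
        [ 'cheap system SSD'  => 400 ],
        [ 'system mirror SSD' =>  50 ],
        [ 'data SSD A'        =>  90 ],
        [ 'data SSD B'        =>  70 ],
    );

    my $shared_deadline_ms = 500;    # ONE timeout timer for the whole batch
    my $elapsed_ms         = 0;

    # Results are read one after the other, so the slow first disk eats most
    # of the shared budget before the later disks are even looked at.
    for my $d (@disks) {
        my ($name, $answer_ms) = @$d;
        $elapsed_ms += $answer_ms;
        if ($elapsed_ms > $shared_deadline_ms) {
            print "$name: missed the shared deadline -> marked FAILED, RAID degraded\n";
            last;    # the first miss gets blamed, the array drops to degraded mode
        }
        print "$name: answered in time ($elapsed_ms ms elapsed)\n";
    }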

Why is that not a known bug? I guess that using two RAID-1s of two SSDs each is not a common use case for that mainboard. Having one of the SSDs slow down is probably even rarer. And people with SSD problems actually reporting them to the mainboard and/or chipset manufacturer is very unlikely.

Will it be fixed? Unlikely. I did not bother to report the problem; the chipset is 12 years old, and I guess no one will fix drivers for consumer hardware that old.

Lessons learned:

Alexander


A note on RAID jargon: "Hardware RAID" means a dedicated controller with its own processor does all the RAID work and presents the finished logical volume to the OS. "Software RAID" is done entirely by the operating system (e.g. Linux md). "Fake RAID" is the in-between thing discussed here: an onboard or cheap add-on controller provides little more than a boot ROM and a metadata format, while the actual RAID work is done by an OS driver on the main CPU.

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^3: [OT] Reminder: SSDs die silently
by afoken (Chancellor) on Jan 12, 2024 at 10:15 UTC

    Time to really finish this story:

    the SSD will be subject to a nice 4 kV burn-in test

    That was spectacularly unspectacular. A few sparks from the 4 kV probe, but no burn marks, no fire, no exploding parts. Our 4 kV supply is just way too limited; it can deliver only a few mA. The next misbehaving SSD will simply see plain mains voltage: 230 V with a slow-blow 16 A fuse.

    I decided to order another fake RAID controller, using a relatively cheap SATA controller, but from a manufacturer with a good reputation and a lot of RAID experience.

    That fake RAID controller is really a nice piece of hard- and software. But it is not completely free of problems. In the factory default configuration it still had trouble when running more than one VirtualBox VM at the same time, both on my work machine and on my home machine. So I finally called tech support. The manufacturer insists on phone calls, which is a little bit odd, but it took just one phone call to get rid of my problem. The supporter told me that no, this should not happen, not with my machines and not with any others. Since I was already using the newest firmware and drivers available, I was told to try disabling Native Command Queuing (NCQ) for all SSDs right in the controller's BIOS; the drivers will respect that setting. I also disabled sleep mode, just to be sure. Disabling NCQ costs a little bit of performance, but both machines now work fine. I don't care if disk performance goes down by a few percent; the SSDs are sufficiently fast even without NCQ. If the onboard SATA fake RAID had a way to disable NCQ, I would try going back to the onboard RAID. It is there, it has power, it has a sufficient number of SATA ports, and it does not need a PCIe slot.

    A little detail: the RAID software does write a log file to aid debugging. But that does not help if the log file is written to the very RAID volume that has problems and needs to be debugged. The supporter proposed the obvious solution: add a USB flash drive and have the RAID software log to that drive instead. I don't do that; my problem is solved.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      For local development work, I'd recommend skipping RAID altogether. RAID is all about uptime, and development work doesn't really benefit from that much. Just pop in a 2 TB NVMe drive and make daily backups. As for speed, the other week I had an amazing new experience: 1.2 GB/s data xfer between NVMe drives, and that was over 700 GB, so not just into cache. That's almost double the maximum theoretical SATA III speed, and my NVMe drive was able to *write* that fast.
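
      A quick back-of-the-envelope check of that claim in Perl; whether you count the 8b/10b encoding overhead decides if it comes out as "double" or "almost double":

          #!/usr/bin/perl
          # Rough comparison: theoretical SATA III ceiling vs. the observed
          # 1.2 GB/s NVMe-to-NVMe write. All figures are approximations.
          use strict;
          use warnings;

          my $sata3_gbit_s   = 6;                            # SATA III line rate
          my $raw_mbyte_s    = $sata3_gbit_s * 1000 / 8;     # 750 MByte/s ignoring encoding
          my $usable_mbyte_s = $raw_mbyte_s * 8 / 10;        # ~600 MByte/s after 8b/10b encoding
          my $nvme_mbyte_s   = 1200;                         # observed NVMe write speed

          printf "SATA III raw:        %4.0f MByte/s\n", $raw_mbyte_s;
          printf "SATA III usable:     %4.0f MByte/s\n", $usable_mbyte_s;
          printf "NVMe observed write: %4.0f MByte/s (%.1fx the usable SATA III ceiling)\n",
                 $nvme_mbyte_s, $nvme_mbyte_s / $usable_mbyte_s;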

      On my backup server I started using ZFS, which can do RAID on its own. I'm doing the equivalent of RAID 5 across three 10 TB drives for 20 TB of storage.

        For local development work, I'd recommend skipping RAID altogether. RAID is all about uptime, and development work doesn't really benefit from that much. Just pop in a 2 TB NVMe drive and make daily backups.

        Handling a failed SSD:

                                     RAID                         Backup
        Urgency of getting a new SSD low                          high
        System downtime              half an hour (swap the SSD)  hours
        Downtime plannable           yes                          no
        Annoyance                    low                          very high

        Yes, I do have backups of all SSDs, at home and at work. But the RAID buys me time to fix the hardware problem. At home, it's mostly annoying to have to restore a backup. At work, the delay is simply not acceptable on some days, when projects are on fire.


        So, let's assume a major-brand NVMe SSD. c't recently tested some 2 TB SSDs (issue 1/2024). I will pick the one with the best sequential write performance across the entire SSD, because I would need to write my backup to a fresh SSD. The write performance over a 5 min run is way higher for all SSDs, but caches and other tricks won't help when writing a large part of the capacity. The fastest one is a "Gigabyte Aorus Gen5 12000 SSD" at 2350 MByte/s. That is a big hunk of metal, two heatpipes, and a tiny PCB, with a street price of about 300 €, more than double the price of the cheapest tested SSD (138 €). The three slowest SSDs tested can write only 134 MByte/s. The cheapest SSD tested can write 1140 MByte/s.

        I will also assume that the SSD was filled up to 75% before it died (that's how full my SSDs at home are). So we'll need to write 1.5 TByte = 1_500 GByte = 1_500_000 MByte. Assuming a sufficiently fast backup source (i.e. an equivalent SSD in a PCIe slot), the slowest SSD will finish after 11195 sec = 3 hours, the cheapest one after 1316 sec = 22 min, the fastest one after 638 sec = 11 min. Impressive.

        But unfortunately, my backup is not on another expensive SSD. It's on a cheap harddisk RAID on the network. Both at home and at work, it's on a small server on a switched Gigabit network, so we can't get faster than 1 GBit/s without changing hardware. Completely ignoring any protocol overhead, network usage by other users, and assuming sufficiently fast disks in the server, we'll max out at 125 MByte/s. That's even slower than the slowest SSDs in the test, and needs 12000 sec = 3 hours 20 min. With protocol overhead and real hard disks, that's more like four or five hours, perhaps even more.

        Yes, I could upgrade to 10 GBit/s at home, but we won't get 10 GBit/s Ethernet any time soon at work. But let's pretend we would upgrade cabling, switches, servers, and workstations to 10 GBit/s. Again ignoring protocol overhead and network usage by other users, we'll max out at 1250 MByte/s, i.e. 1200 sec = 20 min. I'm quite sure the server harddisks won't be able to deliver 1000 MByte/s, so these numbers are just nonsense.

        I could connect the NVMe SSD using USB 3.0 (that's the limit of the server). USB 3.0 runs at 5 GBit/s. Again, I'm completely ignoring any protocol overhead, so that's 625 MByte/s, 2400 sec = 40 min. This is not supported by the backup software, but let's pretend it could restore that way. Again, the server harddisks probably won't be able to deliver 500 MByte/s.
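
        For anyone who wants to check those numbers, the whole estimate fits into a few lines of Perl:

            #!/usr/bin/perl
            # Restore-time estimates for 1.5 TByte (75% of a 2 TB SSD), ignoring
            # protocol overhead, exactly as in the estimates above.
            use strict;
            use warnings;

            my $data_mbyte = 1_500_000;    # 1.5 TByte to restore

            my %mbyte_per_s = (
                'slowest tested SSD (134 MByte/s)'   =>  134,
                'cheapest tested SSD (1140 MByte/s)' => 1140,
                'Aorus Gen5 12000 (2350 MByte/s)'    => 2350,
                'Gigabit Ethernet (1 GBit/s)'        =>  125,
                'USB 3.0 (5 GBit/s)'                 =>  625,
                '10 Gigabit Ethernet (10 GBit/s)'    => 1250,
            );

            for my $path (sort { $mbyte_per_s{$b} <=> $mbyte_per_s{$a} } keys %mbyte_per_s) {
                my $sec = $data_mbyte / $mbyte_per_s{$path};
                printf "%-36s %6.0f sec = %s\n", $path, $sec,
                       $sec >= 3600 ? sprintf("%.1f h", $sec / 3600)
                                    : sprintf("%.0f min", $sec / 60);
            }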

        Getting a new SSD by driving to the nearest computer store might take an hour, and I'll have to take whatever SSD is available.

        So, to sum up, restoring a 2 TB NVMe SSD that was filled to 75 % will take more than half of a work day, and it needs to be done ASAP, even if other things are burning.

        The RAID solution needs a few clicks on my favorite web shop, the replacement SSD is my favorite model, and it's on my desk within two work days. I can delay that while things are burning. Some time later, I'll have a planned downtime of half an hour for swapping the SSD, and can continue working right after that. Reconstructing the RAID-1 can happen at almost maximum write performance, if I allow that to happen. During that time, disk performance will suffer. With the crazy fast SSD, that would be done entirely within the lunch break. With my existing SATA SSDs, it will take two or three hours, with acceptable remaining disk performance. It does not really matter at all. What matters is that I can continue working even when an SSD fails.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^3: [OT] Reminder: SSDs die silently
by hippo (Archbishop) on Jun 10, 2023 at 22:39 UTC

    In response to your sig, here is my experience. Always use software RAID. Always. I have never had an issue with Linux software RAID that was not ultimately down to a failure of the drives themselves and thus easily rectifiable. OTOH, hardware RAID (real or fake) has caused no end of problems and given the choice I would never go back there again.


    🦛

      In response to your sig, ....

      You are sooooo right.

      ... here is my experience. Always use software RAID.

      I do, wherever I set up a Linux server. I "inherited" one of my first Linux servers, and it came with two IDE hardware (or fake, I don't remember) RAID controllers. Of course, the two were incompatible with each other and caused nothing but trouble with old cables and old disks. I switched both controllers to act as plain old IDE controllers, set up a software RAID, and never went back to hardware or fake RAID.

      Another server (a major brand) that I managed to get my hands on as a development server at my previous job could use only SCSI disks, and only as a RAID. The onboard RAID controller refused all attempts to just pass the individual disks through to the OS (Linux, of course). Imagine the joy when one day the f-ing onboard RAID controller died and I lost access to all data on the server. Using parts of the server from the previous paragraph, one or two really old PCI SCSI controllers, a bunch of SCSI cables, terminators, and adapters from 50-pin SCSI to the all-in-one power-and-SCSI connectors on the disks, I got all the SCSI disks running. Somehow, Linux could read the RAID, because it followed an old industry standard for the metadata, and so I spent a long night copying system and data to that old server and making the system boot from IDE disks in a software RAID.

      Windows machines are a different thing. In my mind, there shouldn't be any valuable data on a Windows machine, so it can be reinstalled from scratch. Or restored from a backup happening in the background. That takes some time and usually does not hurt, because there is always another Windows machine available.

      Having some kind of RAID never sounded reasonable, because harddisks are friendly devices. They start shaking their heads, ticking, and finally grinding before separating you from your data. SMART data on newer disks is usually reliable, and warns you even before the disk starts making noises. You really have a lot of time before you need to copy your data to another disk. No wonder, after decades of development. As explained in [OT] Reminder: SSDs die silently and Re^2: [OT] Reminder: SSDs die silently, SSDs are faster, but evil. No warnings, SMART data unreliable at best, just what you would expect from a much younger technology.

      Working with VMs on Windows (instead of running them on a server) means that a single dying disk suddenly kills several machines (one host and several VMs). Due to the size of the VM disk images, restoring a backup takes a lot of time. And the next junk PC with a running Windows usually does NOT have those VMs available. So suddenly, having an exact and current copy of the disk online in no time (i.e. a RAID-1 mirror) becomes a real time saver.

      Yes, fake RAIDs suck. But if the fake RAID does its one and only job, keeping a copy of the disk available even if the disk dies, it avoids restoring from backup, and that's good enough.

      In the case of my work PC, the fake RAID had way too many false alarms, and failed to properly detect a dying SSD, wrongly claiming a different SSD had problems. I replaced one fake RAID with another fake RAID. It has better software, and I don't have to dig into how Windows implements its software RAID. Getting Windows to boot after hardware changes is hard enough, I don't want a software RAID layer on top of that nightmare.

      Yes, I should really get rid of Windows on physical hardware. But that's a completely different story.

      And there is another completely different story why I started to run VMs on Windows.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)