The RAID Crisis: Recovering 22.8TB from a Dead Array

The RAID Crisis

Time: 10:47 AM, December 17, 2024
Location: Home office
Heart rate: Elevated
Coffee consumed: Not enough

It started with a beep.

That shrill, incessant beep of a Synology NAS warning you that your life is about to get complicated. I was mid-sentence in a work call when I heard it. Muted myself. Walked to the closet where the NAS lives.

The front panel was blinking red.


The Discovery

I logged into DSM, already knowing it would be bad. The dashboard confirmed it:

Volume 1: CRASHED
Drive 2: FAILED
Drive 3: I/O Errors (Multiple Sectors)

I had a four-drive Synology Hybrid RAID (SHR) array, which with equal-sized drives is effectively RAID 5. You can lose one drive. You cannot lose two.

I was looking at 22.8 terabytes of data. Family photos going back to 2008. Project archives. Years of accumulated media. Things that existed nowhere else.

The volume wouldn’t mount. DSM just showed a red icon and the word “Crashed.”


The Anatomy of the Problem

Here’s the thing about Synology: it doesn’t use a simple filesystem. It layers technologies like an onion, and if you don’t understand the layers, you can’t recover anything.

The Synology Stack:

Layer 7: Your Files
Layer 6: Btrfs Filesystem
Layer 5: Logical Volume (vg1-volume_1)
Layer 4: Volume Group (vg1)
Layer 3: LVM Physical Volume
Layer 2: MD RAID Array (/dev/md127)
Layer 1: Partitions (/dev/sdX5 - linux_raid_member)
Layer 0: Physical Disks (/dev/sda, /dev/sdb, /dev/sdc, /dev/sdd)

Miss any layer and you see nothing. Filesystem tools can’t find files because they’re looking at raw blocks. RAID tools can’t help you with a broken LVM layer. LVM tools can’t help you with scrambled Btrfs metadata.

My situation: The MD RAID was technically intact. The LVM physical volume was visible. But the Btrfs filesystem on top? Corrupted beyond what mount could handle.
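
If you ever need to map this stack yourself, a handful of read-only commands will tell you what each layer thinks is going on (device names here match my array; yours will differ):

# Inspect each layer without writing anything
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT   # partitions, md devices, LVM volumes
sudo mdadm --examine /dev/sd[abcd]5         # RAID superblocks on the member partitions
cat /proc/mdstat                            # assembled MD arrays and their state
sudo pvs; sudo vgs; sudo lvs                # LVM physical volumes, volume groups, logical volumes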


The Mistake

I did what everyone tells you not to do.

I panicked.

I clicked “Repair” in DSM. It failed. I rebooted the NAS. It failed to mount. I tried again. Same result.

Every action I took was potentially writing to the drives. Every write was potentially destroying evidence that recovery tools would need later.

The first rule of data recovery is STOP TOUCHING THINGS.

I learned this rule about 45 minutes too late.


The Recovery Strategy

Once I calmed down (read: after pacing around my office for 20 minutes), I formulated a plan.

Native Synology tools were useless. The volume was crashed at a level DSM couldn’t comprehend. I needed to pull the drives and work with them directly from a Linux recovery environment.

Phase 1: Get the Drives Out

I shut down the NAS properly. Pulled all four 6TB HGST drives. Labeled them with masking tape: “Bay 1”, “Bay 2”, “Bay 3”, “Bay 4.”

This matters. RAID reconstruction cares about drive order. Mix them up and you’re reconstructing garbage.
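
Masking tape works, but serial numbers are better. What I’d do now is record which serial lives in which bay before pulling anything, so the mapping survives even if a label falls off. A quick loop with smartctl, run wherever the drives are attached:

# Note each bay's drive serial before removal
for d in /dev/sd?; do
    echo -n "$d: "
    sudo smartctl -i "$d" | grep -i 'serial number'
done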

Phase 2: Imaging (The Patience Test)

Before doing anything else, I needed bit-for-bit copies of every drive. Running recovery tools directly on failing hardware is how you turn a partial failure into a total loss.

I bought four 6TB external drives. Yes, this was expensive. No, I didn’t have a choice.

Connected everything to my Gentoo workstation via USB docks (one drive at a time—the dock couldn’t handle multiple simultaneous loads).

dd if=/dev/sdb of=/backup/disk1.img bs=4M status=progress conv=sync,noerror

The conv=sync,noerror options are critical. They tell dd to keep going when it hits bad sectors, padding the unreadable stretch with zeros so the image stays aligned instead of stopping. Without them, the first bad sector kills the entire clone. The trade-off: at bs=4M, a single bad sector can zero out the rest of that 4 MiB block in the image.
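
In hindsight, GNU ddrescue is the better tool for imaging a drive that is actively failing: it does a fast first pass, skips around bad regions, retries them later, and keeps a map file so an interrupted clone can resume. A rough equivalent of the command above, which is not what I ran at the time:

# ddrescue: fast copy first, then up to 3 retry passes over the bad areas
sudo ddrescue -d -r3 /dev/sdb /backup/disk1.img /backup/disk1.map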

Time elapsed: 14 hours per drive. 56 hours total.

I slept in shifts.

Phase 3: The First Attempt (Failed)

With the images safely on my backup drives, I tried the standard Linux approach.
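
The images first have to show up as block devices. A sketch of that step (file names follow the disk1.img pattern from the dd phase; -P makes the kernel scan each image’s partition table so the sdX5 RAID members appear as /dev/loopNp5):

# Attach each disk image read-only, with partition scanning
sudo losetup -r -fP --show /backup/disk1.img   # prints /dev/loop0
sudo losetup -r -fP --show /backup/disk2.img   # /dev/loop1
sudo losetup -r -fP --show /backup/disk3.img   # /dev/loop2
sudo losetup -r -fP --show /backup/disk4.img   # /dev/loop3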

# Assemble the RAID in read-only mode
sudo mdadm --assemble --readonly /dev/md127 \
    /dev/loop0p5 /dev/loop1p5 /dev/loop2p5 /dev/loop3p5

# Scan for LVM
sudo pvscan
sudo vgscan
sudo vgchange -ay

# Try to mount
sudo mount -o ro,recovery /dev/vg1/volume_1 /mnt/recovery

The RAID assembled. LVM found the volume group. But the mount?

mount: /mnt/recovery: wrong fs type, bad option, bad superblock

Btrfs was corrupted. The metadata that tells the filesystem where files live was scrambled.

Do NOT run btrfs check --repair on a broken filesystem without a backup.

I repeat: DO NOT. That flag can make things worse. It “repairs” by deleting things it can’t understand. I’ve seen it turn recoverable situations into total losses.
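
What is safe to try first is everything that does not write: a read-only check, btrfs restore (which copies files out without touching the source), and a mount using the backup root. Roughly, with my volume path and an illustrative /mnt/salvage destination:

# Read-only diagnostics and salvage -- none of these modify the damaged filesystem
sudo btrfs check --readonly /dev/vg1/volume_1
sudo btrfs restore -v /dev/vg1/volume_1 /mnt/salvage/
sudo mount -o ro,usebackuproot /dev/vg1/volume_1 /mnt/recovery   # usebackuproot is the newer name for "recovery"

In my case the recovery mount and btrfs restore both came up short, which is what eventually pushed me to UFS Explorer.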

Phase 4: Hardware Betrayal

I tried to run some deeper analysis tools, but hit a different problem entirely.

My USB dock kept disconnecting under sustained read load. dmesg filled with errors:

[42069.123456] usb 3-1: reset high-speed USB device number 4 using xhci_hcd
[42069.234567] usb 3-1: device descriptor read/64, error -71
[42071.345678] usb 3-1: USB disconnect, device number 4

Consumer USB enclosures aren’t built for continuous multi-hour reads. The controller overheats. The connection drops. Your recovery session dies.

The fix: I swapped to a USB NVMe enclosure with a Samsung 970 EVO Plus. Solid-state, no moving parts, better thermal management. The disconnects stopped.

Lesson: USB is fine for transport. It’s not fine for torture.

Phase 5: UFS Explorer (The Nuclear Option)

Linux native tools couldn’t mount the Btrfs volume safely. I needed something that could understand the entire Synology stack and reconstruct the directory tree from raw blocks.

I bought UFS Explorer Standard Recovery. $70. Proprietary software. Runs on Linux.

The thing about UFS Explorer: it doesn’t care if your Btrfs superblock is garbage. It scans the raw disk for file signatures, reconstructs the directory structure from whatever metadata survived, and presents you with a browsable tree.

I loaded all four disk images. Told it the RAID type (RAID 5), let it auto-detect the stripe size and parity rotation.

Ten minutes of scanning.

And then… a directory tree appeared.

/Photos/2023/Christmas/
/Documents/Projects/
/Media/Movies/

It was there. All of it.

I may have said some words that aren’t appropriate to print.


The Extraction

I didn’t have a 22TB drive lying around. I had to extract data in chunks to whatever storage I could scavenge.

Prioritization saved me:

Tier 1 - Irreplaceable (1TB)
Family photos. Personal documents. Code repositories with unpushed commits. These went straight to an NVMe SSD. No compression, no delays, just raw speed.

Tier 2 - Hard to Replace (10TB)
Rare media. Linux ISOs I’d spent hours finding. Project archives that would be painful to recreate. These went to slower USB HDDs.

Tier 3 - Replaceable (12TB)
Movies and TV shows that exist on every torrent tracker. Music I could re-download. These got queued last, and honestly, some of it I just let go.

Total extraction time: 3 days of continuous copying, monitoring for USB disconnects, and praying.


The Aftermath

Final tally:

  • 100% of Tier 1 recovered
  • 98% of Tier 2 recovered
  • ~80% of Tier 3 recovered (some files had unrecoverable corruption)

The 2% I lost from Tier 2 were files whose stripes crossed bad sectors on the failing Drive 3. With Drive 2 already dead, there was no parity left to rebuild those stripes, and even UFS Explorer couldn’t reconstruct them.


What I Learned

RAID is not a backup.

I knew this intellectually. I’d read the articles. I’d nodded along to the podcasts. And then I treated my RAID array like it was invincible because “redundancy.”

RAID gives you uptime. If a drive fails, your data stays online while you replace it. That’s it. RAID does not protect against:

  • Controller failures
  • Filesystem corruption
  • Accidental deletion
  • Ransomware
  • Fire
  • Theft
  • Your own stupidity

Scrub your arrays.

I hadn’t run a data scrub in 6 months. Btrfs has a scrub command that reads every block and verifies checksums. If I’d been running monthly scrubs, I would have caught Drive 3’s developing bad sectors before Drive 2 died.

btrfs scrub start /volume1

Schedule this. Cron it. Don’t be me.
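
A minimal cron line for that, assuming root’s crontab and the volume path above (DSM’s Task Scheduler can do the same job from the GUI):

# Monthly scrub at 02:00 on the 1st; use the full path `command -v btrfs` reports on your box
0 2 1 * * /usr/sbin/btrfs scrub start /volume1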

Have a recovery plan before you need it.

Knowing how to use dd, mdadm, and losetup before the crisis would have saved me an hour of panic-Googling. I was learning recovery procedures while my data was in jeopardy.

Now I have documented runbooks. Step-by-step procedures. Tested quarterly.

Commercial tools have their place.

I’m a Linux purist. I believe in open source. I also believe that when 22TB of irreplaceable data is on the line, you pay for the tool that works.

UFS Explorer was worth every penny of that $70 license. It understood the Synology stack better than I did. It found files that btrfs restore couldn’t see.

Never skip layers.

You can’t mount /dev/sdb5 directly and expect to see files. You have to assemble the full stack: partition → RAID → LVM → filesystem. Each layer depends on the one below.

When I first started, I kept trying to mount the wrong device and getting confused by the errors. Understanding the layer model would have saved hours.


The New Strategy

I now follow 3-2-1 backup religiously:

  • 3 copies of important data
  • 2 different media types (SSD + HDD, or local + cloud)
  • 1 offsite (encrypted to Google Drive via rclone)

Critical files get backed up daily. The full NAS gets a weekly snapshot sent to a different physical location.
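
The offsite leg is rclone on a nightly schedule. A minimal sketch, assuming an encrypted remote named gdrive-crypt has already been set up with rclone config (the remote name and share paths are illustrative):

# Push the critical shares to the encrypted Google Drive remote
rclone sync /volume1/Photos    gdrive-crypt:Photos    --transfers 4 --log-file /var/log/rclone-photos.log
rclone sync /volume1/Documents gdrive-crypt:Documents --transfers 4 --log-file /var/log/rclone-docs.log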

I also:

  • Run btrfs scrub monthly
  • Check SMART data weekly (a quick check loop is sketched after this list)
  • Replace drives proactively at 4 years, not “when they fail”
  • Keep UFS Explorer installed and licensed, just in case
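
The weekly SMART pass doesn’t need to be fancy. A sketch of mine (attribute names differ slightly between drive vendors):

# Weekly SMART check: overall verdict plus the counters that predict trouble
for d in /dev/sd?; do
    echo "=== $d ==="
    sudo smartctl -H "$d"
    sudo smartctl -A "$d" | grep -Ei 'reallocated|pending|uncorrect'
done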

The Cost

Financial:

  • 4x 6TB external drives for imaging: $400
  • UFS Explorer license: $70
  • USB NVMe enclosure: $40
  • Total: ~$510

Emotional:

  • Three nights of bad sleep
  • One very understanding spouse
  • A new appreciation for mortality (of hard drives, anyway)

Educational:

  • Priceless, unfortunately

Would I Do It Again?

I mean, I’d rather not. But if another array fails, I know exactly what to do now.

  1. Stop. Don’t click repair. Don’t reboot. Don’t panic (okay, panic a little, then stop).
  2. Image everything before touching the original drives.
  3. Work from images, not originals.
  4. Understand the layer stack before trying to mount anything.
  5. Use the right tools, even if they cost money.
  6. Prioritize extraction by replaceability.

And most importantly: have backups so you never need to do this again.


December 17, 2024. The day I learned that RAID is not a backup. The hard way.