user@argobox:~/journal/2025-12-06-the-raid-that-refused-to-rebuild
$ cat entry.md

The RAID That Refused to Rebuild

Date: December 5-6, 2025
Duration: Two days
Messages: 86 (across Claude sessions)
Issue: RAID array wouldn’t accept replacement drive
Result: Recovered, but not how I expected


The Alert

Storage Manager: “Storage Pool 1 is degraded.”

One of the four drives in the Synology had failed. Normal enough - drives die. That’s why we have RAID.

Pulled the dead drive. Inserted a new one. Waited for the rebuild.

And waited.

And waited.


The Symptom

Storage Manager showed the new drive. It was detected. It was healthy. But the “Repair” option was grayed out.

The array refused to accept its replacement.


First Hypothesis: Wrong Drive Size

The original array was 4x4TB drives. The replacement was… also 4TB. Same model family, even.

# SSH to NAS
cat /proc/partitions

The new drive showed fewer sectors. Only marginally smaller - close enough for most purposes. Not close enough for Synology’s RAID implementation.

The array wanted exactly the same size or larger. This drive, despite being “4TB,” had slightly fewer usable sectors.
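
For a direct comparison, the kernel will report raw 512-byte sector counts (note that /proc/partitions shows 1024-byte blocks, not sectors). A quick check, assuming a surviving member is /dev/sda and the replacement is /dev/sdd:

# 512-byte sector count per disk - the replacement must match or exceed
blockdev --getsz /dev/sda
blockdev --getsz /dev/sdd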


Second Hypothesis: Bad Sectors

Maybe the new drive had issues.

smartctl -a /dev/sdd

Clean. Zero reallocated sectors. Zero pending sectors. The drive was healthy.

Not the problem.
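
The full smartctl report is long; for this question only a couple of attributes matter. A filtered view (attribute names as smartctl prints them for most ATA drives - vendors vary):

# Zero raw values here = no remapped or suspect sectors
smartctl -A /dev/sdd | grep -E "Reallocated_Sector|Current_Pending"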


Third Hypothesis: Partition Table Corruption

The failed drive might have left garbage in the partition scheme.

cat /proc/mdstat
md2 : active raid5 sdc5[2] sdb5[1] sda5[0]
      11708923392 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

[UUU_] - three drives active, one missing. But no fourth drive was trying to join.

The array knew a drive was missing. It just didn’t want the replacement.
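
mdadm’s detail view says the same thing more explicitly - the fourth slot listed as removed, with no spare attached. Assuming /dev/md2 is the data array, as mdstat indicated:

# Per-slot device states for the degraded array
mdadm --detail /dev/md2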


The Actual Problem

Deep in DSM’s logs:

Volume1: Drive 4 partition mismatch - expected 3907018584 sectors, got 3907018240

The replacement drive was 344 sectors smaller than the original. At 512 bytes per sector, that’s about 176KB.

176KB difference on a 4TB drive. 0.000004% smaller.

And that was enough to fail the rebuild.
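
Since the complaint is about partition 5 (the RAID member), the comparison that matters is at the partition level, not the whole disk. Sysfs exposes the same 512-byte sector counts, assuming DSM has already partitioned the replacement:

# Sector count of each md member partition
cat /sys/block/sda/sda5/size
cat /sys/block/sdd/sdd5/size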


The Solution Options

Option 1: Find an identical drive

Hunt down the exact same model with the exact same sector count. Possible, but annoying.

Option 2: Shrink the existing partitions

Theoretically possible. Practically terrifying on a live RAID.
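
For scale: on a plain md array (Synology layers LVM on top, which adds steps) the shrink would go filesystem first, then per-device array size. A sketch with hypothetical numbers, never attempted here - miscalculating either step destroys the volume:

# Sketch only. Shrink the filesystem below the new array size first (ext4, offline):
resize2fs /dev/md2 11708000000K
# Then reduce how much of each member device md uses (value in KiB):
mdadm --grow /dev/md2 --size=3902970000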

Option 3: Use a larger drive

A 6TB drive would definitely have enough sectors. Wasteful, but works.


What I Actually Did

Checked my spare drives. Found a 5TB that I’d forgotten about.

# Check sector count
smartctl -i /dev/sdd | grep "User Capacity"

5TB = 9,767,541,168 sectors. Way more than needed.

Swapped in the 5TB. Storage Manager immediately offered the Repair option.

Repairing Storage Pool 1...
Time remaining: 18 hours

The extra 1TB would go unused (RAID 5 matches the smallest drive), but the array accepted it.


The 18-Hour Wait

Rebuild started at 11 PM. Finished the next afternoon.
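
DSM’s progress bar aside, the kernel reports the same rebuild in /proc/mdstat, with a percentage and time estimate on the recovery line. Assuming watch is available on the box (DSM’s shell is minimal):

# Rebuild progress, refreshed every minute
watch -n 60 cat /proc/mdstat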

During rebuild:

  • Read/write performance dropped to maybe 30% of normal
  • System stayed accessible
  • Didn’t lose any data

The whole time, the array was running with zero redundancy. If another drive had failed during the rebuild, total loss.


Post-Rebuild Verification

cat /proc/mdstat
md2 : active raid5 sdd5[4] sdc5[2] sdb5[1] sda5[0]
      11708923392 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

[UUUU] - all four drives active. Array healthy.

Scrub to verify:

echo check > /sys/block/md2/md/sync_action

Scrub completed clean. No mismatches.
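
The scrub’s verdict is also readable directly from sysfs - the kernel counts every parity inconsistency it finds during the check pass:

# 0 after a completed check = clean scrub
cat /sys/block/md2/md/mismatch_cnt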


Lessons Learned

Sector count matters. “4TB” isn’t a precise specification. Different models, different manufacturers, even different batches can have different sector counts. Always use the same size or larger for RAID replacements.

Check before you buy. Look up the exact sector count of your existing drives. Match or exceed.
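
One way to capture them while everything is healthy, assuming the members are sda through sdd (sysfs counts 512-byte sectors):

# Record the exact sector count of every current member disk
for d in sda sdb sdc sdd; do
  echo "$d: $(cat /sys/block/$d/size) sectors"
done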

Keep a larger spare. My 5TB spare saved the day. The cost difference between 4TB and 5TB is nothing compared to the convenience of “definitely fits.”

RAID is not backup. During the 18-hour rebuild, I had no redundancy. If I’d lost another drive, the array would have been gone. The important data was also on a different NAS. RAID protects against drive failure, not against data loss.


The Hardware Lesson

Drive manufacturers advertise capacity, not sectors. Two “4TB” drives can differ by millions of sectors.

For RAID:

  • Use drives from the same batch when possible
  • When replacing, go larger
  • Never assume “same capacity” means “same size”

176 kilobytes. That’s all it took to fail a 16 terabyte array rebuild.