
How I Recovered My Proxmox Ceph Cluster After Multiple OSD Failures
December 9, 2025
When my Proxmox Ceph cluster went into HEALTH_WARN and three OSDs crashed, every virtual machine in my lab was at risk. Only one OSD was still holding all the data. If that single disk died, everything would have gone with it.
This article explains how I:
- Brought all Ceph OSDs back online
- Let Ceph rebalance the data
- Increased the replication size so I am no longer running with a single risky copy
It’s a simple, real-world Ceph OSD recovery story that should help anyone running Proxmox with Ceph storage.
What Went Wrong: Ceph HEALTH_WARN and Single Replica Risk
My starting point looked like this:
ceph -s
- HEALTH_WARN
- Only 1 OSD up and in (for example, osd.4)
- Pool DiskPool had size 1, min_size 1
- All VM disks were stored on that one remaining OSD
In other words:
- No redundancy, every RBD image (VM disk) existed on a single physical disk.
- If that disk failed, Ceph had no other copies. The cluster would not be able to recover that data.
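If you want to run the same kind of assessment on your own cluster before touching anything, a few read-only commands give the full picture (a minimal sketch; your pool and OSD names will differ):
ceph -s
ceph health detail
ceph osd tree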
For Ceph newbies, the important terms are:
- size – how many copies of each object Ceph keeps.
  - size 1 = one copy only (very risky).
  - size 2 = two copies (safer).
- min_size – minimum number of copies that must be available for Ceph to serve I/O.
  - With min_size 1, Ceph will still read/write even with a single copy.
  - With min_size 2, Ceph refuses I/O if it cannot guarantee at least two copies.
I was running DiskPool size 1 / min_size 1, so the cluster was “working” but one more disk failure would have meant complete data loss.
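If you are not sure what your own pools are set to, you can query the two values directly (shown here with my pool name DiskPool; substitute yours):
ceph osd pool get DiskPool size
ceph osd pool get DiskPool min_size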
Step-by-Step: How I Recovered the Ceph Cluster
Always make sure you have good backups before doing any Ceph surgery.
1. Safely remove the broken OSDs from the cluster
On each node with a failed OSD (for example osd.1, osd.2, osd.3), I ran:
ceph osd out <id>
ceph osd crush remove osd.<id>
ceph auth del osd.<id>
ceph osd rm <id>
systemctl stop ceph-osd@<id>
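As a concrete example, here is the same sequence with the ID filled in for osd.1 (one of the failed OSDs mentioned above); repeat with the matching ID on each affected node:
ceph osd out 1
ceph osd crush remove osd.1
ceph auth del osd.1
ceph osd rm 1
systemctl stop ceph-osd@1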
This tells Ceph: these OSDs are gone, stop trying to use them.
Running lsblk, I could still see the old OSD block devices (for example /dev/sdb mapped to a Ceph LV), but they were now free to be reused.
2. Recreate fresh OSDs on the same drives
On each node (s2, s3, s4), I created a brand-new OSD using ceph-volume:
ceph-volume lvm create --data /dev/sdb
Ceph did the following automatically:
- Created a new VG and LV for the OSD block
- Generated a new OSD ID (0, 1, 2, etc.)
- Wrote a fresh Bluestore filesystem
- Enabled and started ceph-osd@<id>.service
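If you want to double-check a freshly created OSD before moving on, something like this works (a sketch; substitute the OSD ID that ceph-volume printed):
ceph-volume lvm list /dev/sdb
systemctl status ceph-osd@<id>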
After doing this on all three nodes, my ceph osd tree looked like:
ceph osd tree
4 OSDs: osd.0, osd.1, osd.2, osd.4
All up and in
3. Allow Ceph to rebalance and backfill
Earlier, I had set some safety flags during the disaster:
- noout – stops Ceph from marking OSDs as out
- nobackfill and norecover – stops rebalancing and recovery
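For anyone handling a similar incident, the flags are set with the mirror-image commands (shown here for reference; this article only covers removing them):
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover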
Once the fresh OSDs were ready, I removed those flags:
ceph osd unset noout
ceph osd unset nobackfill
ceph osd unset norecover
Ceph immediately started backfilling and remapping PGs to spread data from the full OSD to the new ones. During this time, ceph -s showed:
- Many PGs in active+remapped+backfill_wait / backfilling
- High recovery bandwidth in MiB/s
I just let it run. Depending on your network and disk speed, this can take hours.
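If you want to keep an eye on the progress without staring at raw output, a couple of read-only commands are enough (a sketch; adjust the watch interval to taste):
watch -n 10 ceph -s
ceph pg stat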
4. Increase pool size to get real redundancy
Once everything was stable and all OSDs were up, I changed the pool replication:
ceph osd pool set DiskPool size 2
ceph osd pool set DiskPool min_size 2
This did two important things:
- size 2 – Ceph now keeps two copies of every object in DiskPool.
- min_size 2 – Ceph will not serve I/O if it cannot guarantee two copies.
This protects you from silent “single copy” mode. If one OSD disappears, Ceph will complain loudly instead of quietly running with no redundancy.
Raising the size triggered another round of backfill while Ceph created the second copies. Once that rebalancing finished, my final ceph -s looked like:
health: HEALTH_OK
osd: 4 osds: 4 up, 4 in
pools: 33 pgs
pgs: 33 active+clean
And:
ceph osd pool ls detail | grep -A3 DiskPool
pool 2 'DiskPool' replicated size 2 min_size 2 ...
Now every VM disk stored on DiskPool has two copies on different OSDs.
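If you want to confirm that the two copies really do land on different OSDs, ceph pg ls-by-pool lists the acting set for every PG in the pool; each PG should now show two distinct OSD IDs (read-only, safe to run at any time):
ceph pg ls-by-pool DiskPool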
Key Lessons for Anyone Running Proxmox and Ceph
- Don’t run production workloads with Ceph pool size 1 – it’s basically no better than a single disk.
- Always check ceph osd pool ls detail to see your size and min_size values.
- After OSD failures, it’s often safer to remove and recreate OSDs than fight with heavy corruption.
- Let Ceph handle backfill and recovery; do not rush or interrupt unless absolutely necessary.
By rebuilding the failed OSDs and increasing the pool size to 2, my Proxmox Ceph storage moved from “one disk away from disaster” to a proper redundant Ceph cluster.