Rescuing a locked-out EC2 instance

Recovery · Draft

Losing SSH to a running instance feels like losing the instance. It isn't. If the workload is healthy and only the access path is broken, you can repair it by treating the root disk as data.

The idea

An EBS root volume is just a disk. Detach it from the broken instance, attach it to a working one, fix the file, and put it back.
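
Before detaching anything, it helps to record which volume is the root disk and which device name it uses. The sketch below is one way to look both up with the AWS CLI, assuming the CLI is installed and configured; the function names root_device_of and root_volume_of are illustrative, not part of any tool.

```shell
# root_device_of INSTANCE_ID
# Prints the instance's root device name, for example /dev/xvda.
root_device_of() {
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].RootDeviceName' --output text
}

# root_volume_of INSTANCE_ID
# Prints the ID of the EBS volume attached at the instance's root device.
root_volume_of() {
  local root_device
  root_device=$(root_device_of "$1") || return 1
  aws ec2 describe-volumes \
    --filters "Name=attachment.instance-id,Values=$1" \
              "Name=attachment.device,Values=$root_device" \
    --query 'Volumes[0].VolumeId' --output text
}
```

Keep both values somewhere safe: the volume ID identifies the disk to move, and the device name is needed when the disk goes back.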

The steps

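Stop the affected instance (don't terminate it). Stopping discards any instance store data and releases an auto-assigned public IPv4 address, so plan for both.
Detach its root EBS volume, noting the device name it was attached under.
Attach that volume to a temporary helper instance in the same availability zone, as a secondary device such as /dev/sdf.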
Mount it and correct the offending file:

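lsblk   # confirm the device; a volume attached as /dev/sdf may appear as /dev/xvdf or /dev/nvme1n1
sudo mkdir -p /mnt/rescue
# if the volume is XFS and shares a UUID with the helper's root (same AMI), add: -o nouuid
sudo mount /dev/xvdf1 /mnt/rescue
# fix sshd_config, authorized_keys, or permissions under /mnt/rescue
sudo umount /mnt/rescue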

Detach the volume from the helper, reattach it to the original instance under its original root device name (commonly /dev/xvda or /dev/sda1), and start the instance.
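
The steps above can be sketched with the AWS CLI. This is a minimal outline rather than a hardened script: the function names rescue_attach and rescue_return are hypothetical, every ID passed in is a placeholder, and /dev/sdf is assumed to be a free device name on the helper.

```shell
# rescue_attach BROKEN_INSTANCE VOLUME HELPER_INSTANCE
# Stops the broken instance, detaches the given volume, and attaches it
# to the helper as /dev/sdf (assumed free). Each command runs only if the
# previous one succeeded.
rescue_attach() {
  local broken="$1" volume="$2" helper="$3"
  aws ec2 stop-instances --instance-ids "$broken" &&
  aws ec2 wait instance-stopped --instance-ids "$broken" &&
  aws ec2 detach-volume --volume-id "$volume" &&
  aws ec2 wait volume-available --volume-ids "$volume" &&
  aws ec2 attach-volume --volume-id "$volume" --instance-id "$helper" --device /dev/sdf &&
  aws ec2 wait volume-in-use --volume-ids "$volume"
}

# rescue_return BROKEN_INSTANCE VOLUME ROOT_DEVICE
# Run after unmounting on the helper. Moves the volume back to the
# original instance under ROOT_DEVICE and starts that instance.
rescue_return() {
  local broken="$1" volume="$2" root_device="$3"
  aws ec2 detach-volume --volume-id "$volume" &&
  aws ec2 wait volume-available --volume-ids "$volume" &&
  aws ec2 attach-volume --volume-id "$volume" --instance-id "$broken" --device "$root_device" &&
  aws ec2 wait volume-in-use --volume-ids "$volume" &&
  aws ec2 start-instances --instance-ids "$broken"
}
```

The mount-and-fix step happens on the helper between the two calls. ROOT_DEVICE should be the name the volume originally had on the broken instance; attaching it under a different name leaves the instance without a root device to boot from.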

What to watch

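The availability-zone constraint trips people up: a volume can only attach to an instance in its own AZ. Device names and the exact partition to mount are the other easy mistakes. A volume attached as /dev/sdf may show up inside the helper as /dev/xvdf or, on Nitro-based instances, as an NVMe device such as /dev/nvme1n1, so check lsblk before mounting.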
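
The AZ constraint can be checked before attaching. The following is a small sketch using the AWS CLI; same_az is a hypothetical helper name, and the volume and instance IDs are whatever you pass in.

```shell
# same_az VOLUME_ID INSTANCE_ID
# Returns 0 when the volume and the instance are in the same availability
# zone, 1 when they differ, and 2 when either lookup fails.
same_az() {
  local vol_az inst_az
  vol_az=$(aws ec2 describe-volumes --volume-ids "$1" \
    --query 'Volumes[0].AvailabilityZone' --output text) || return 2
  inst_az=$(aws ec2 describe-instances --instance-ids "$2" \
    --query 'Reservations[0].Instances[0].Placement.AvailabilityZone' --output text) || return 2
  [ "$vol_az" = "$inst_az" ]
}
```

If the check fails, the simplest fix is usually to launch the helper in the volume's AZ rather than to move the volume.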

The real win is having written the procedure down before you need it. An outage is a bad time to be improvising.