EC2 SSH Disaster Recovery

Incident Response · GitHub repository

A broken SSH configuration can make an EC2 instance unreachable while the workload itself is perfectly healthy. This writeup walks through recovering access by treating the root disk as data: detach it, attach it to a helper instance, repair the configuration, and reattach.

Technologies

EC2 · EBS · Linux · SSH · Incident Response

Problem

After a configuration change, SSH access to a running instance was lost. The instance could no longer be reached over SSH, but terminating and rebuilding it was not acceptable, because the data and state on the root volume had to be preserved.

Architecture

The recovery uses the EBS volume-rescue pattern. Stop the affected instance and detach its root EBS volume. Attach that volume as a secondary disk on a temporary helper instance in the same Availability Zone, mount it there, and correct the offending configuration. Then unmount and detach the volume, reattach it to the original instance under its original root device name, and start the instance.
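The sequence above can be sketched in Python. This is a minimal illustration, not the procedure used in the repository: it assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`), and the function name, parameter names, and the `/dev/sdf` helper device are hypothetical. The root device name differs between AMIs, so it is passed in rather than guessed.

```python
def rescue_root_volume(ec2, broken_id, helper_id, volume_id,
                       root_device, helper_device="/dev/sdf"):
    """Move a root volume onto a helper instance.

    Returns a function that moves the volume back and starts the
    original instance once the configuration has been repaired.
    `ec2` is assumed to be a boto3 EC2 client; all IDs are supplied
    by the caller.
    """

    def wait(waiter_name, **kwargs):
        # Block until EC2 reports the expected state.
        ec2.get_waiter(waiter_name).wait(**kwargs)

    # 1. Stop the affected instance so its root volume can be detached.
    ec2.stop_instances(InstanceIds=[broken_id])
    wait("instance_stopped", InstanceIds=[broken_id])

    # 2. Detach the root volume from the stopped instance.
    ec2.detach_volume(VolumeId=volume_id)
    wait("volume_available", VolumeIds=[volume_id])

    # 3. Attach it to the helper as a secondary disk.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=helper_id,
                      Device=helper_device)
    wait("volume_in_use", VolumeIds=[volume_id])

    def restore():
        # 4. After unmounting on the helper, detach the repaired volume.
        ec2.detach_volume(VolumeId=volume_id)
        wait("volume_available", VolumeIds=[volume_id])

        # 5. Reattach it under the original root device name.
        ec2.attach_volume(VolumeId=volume_id, InstanceId=broken_id,
                          Device=root_device)
        wait("volume_in_use", VolumeIds=[volume_id])

        # 6. Start the original instance again.
        ec2.start_instances(InstanceIds=[broken_id])
        wait("instance_running", InstanceIds=[broken_id])

    return restore
```

The mount-and-repair step happens on the helper between the call to `rescue_root_volume` and the call to the returned `restore` function. The original root device name can be read from the instance description (the `RootDeviceName` field) before the volume is detached.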

Security considerations

Access was restored without exposing the instance to additional risk: no password authentication was enabled as a shortcut, and the helper instance was disposable.
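One way to confirm that password authentication was not left enabled is to inspect the repaired `sshd_config` while the volume is still mounted on the helper. The sketch below is an assumption about how such a check could look, not part of the original procedure. The function names are hypothetical, and the parser is deliberately simple: it reads only whitespace-separated `Keyword value` lines, stops at the first `Match` block, and does not follow `Include` directives. It relies on two OpenSSH behaviours: the first value found for a keyword wins, and `PasswordAuthentication` defaults to `yes` when unset.

```python
def first_setting(config_text, keyword):
    """Return the first global value set for `keyword`, or None if unset."""
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        name = parts[0].lower()
        if name == "match":
            break  # conditional blocks are not evaluated in this sketch
        if name == keyword.lower() and len(parts) == 2:
            return parts[1].strip()
    return None


def password_auth_disabled(config_text):
    """True only if PasswordAuthentication is explicitly set to 'no'."""
    value = first_setting(config_text, "PasswordAuthentication")
    # An unset keyword falls back to the OpenSSH default of 'yes',
    # so only an explicit 'no' counts as disabled.
    return value is not None and value.lower() == "no"
```

On the helper, the function would be given the contents of the file under the rescue mount point, for example `/mnt/rescue/etc/ssh/sshd_config` if the volume were mounted at the hypothetical path `/mnt/rescue`.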

Challenges

An EBS volume can only be attached to an instance in the same Availability Zone, so the helper instance had to be launched in the zone of the original. Getting the device names and mount points right under pressure was the other main friction point.
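The Availability Zone constraint can be checked before anything is stopped or detached. The following is a small sketch under stated assumptions: the function name is hypothetical, and the two arguments are assumed to be the dictionaries returned by the boto3 calls `describe_instances(InstanceIds=[helper_id])` and `describe_volumes(VolumeIds=[volume_id])`.

```python
def check_same_zone(instances_response, volumes_response):
    """Raise ValueError if the helper and the volume are in different zones.

    Both arguments are assumed to be boto3 EC2 responses that each
    describe exactly one resource.
    """
    instance = instances_response["Reservations"][0]["Instances"][0]
    volume = volumes_response["Volumes"][0]

    instance_zone = instance["Placement"]["AvailabilityZone"]
    volume_zone = volume["AvailabilityZone"]

    if instance_zone != volume_zone:
        raise ValueError(
            f"helper is in {instance_zone}, volume is in {volume_zone}; "
            "launch the helper in the volume's Availability Zone"
        )
    return volume_zone
```

Device names deserve the same care. On Nitro-based instance types a volume attached as `/dev/sdf` typically shows up inside the guest as an NVMe device such as `/dev/nvme1n1`, so it is worth listing block devices on the helper (for example with `lsblk`) before mounting anything.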

Lessons learned

Treating the root volume as recoverable data changes how you think about a locked instance. Writing the procedure down beforehand turns a stressful outage into a checklist.
