Fly.io Postgres failover fix (flyctl pg failover)

This is a note to myself, meant to be succinct and helpful. I’m sharing it publicly to save others time.

Most of the time Fly.io works as I expect it to, but occasionally there are edge cases that lack documentation, public announcements, or both.

It’s possible that at some point Fly.io announced a breaking change and I missed it, but the behavior I observed deserves more than an announcement or silently released documentation.

Today, I wanted to perform a manual leader failover to one of my cluster’s followers, but it quickly failed, and the reason wasn’t immediately clear.

The problem

Performing flyctl pg failover can fail for opaque reasons because the underlying error is eaten.

Example

$ flyctl pg failover -a <APP> --debug
Performing a failover
Connecting to fdaa:9:1d32:a7b:94:1aeb:7b94:2... complete
Stopping current leader...  328725ec309d85
Starting new leader
Promoting new leader...  e784126feee248
Connecting to fdaa:9:1d32:a7b:94:1aeb:7b94:2... complete
WARNING: unable to connect to remote host "fdaa:9:1d32:a7b:e:e63c:7516:2" via SSH
WARNING: unable to connect to remote host "fdaa:9:1d32:a7b:2b5:4c5e:f3b8:2" via SSH
...

Now this isn’t entirely opaque – we can see that something is having a problem connecting to two cluster nodes via SSH. But what is trying to connect via SSH? My local machine, or one of the replicas in the Postgres cluster?

If you aren’t aware of how Fly Postgres clustering works: under the hood it’s simply repmgr, which in turn uses passwordless SSH sessions to orchestrate changes with cluster members.

fly pg failover executes repmgr standby switchover --siblings-follow (amongst other things) on one of the cluster’s follower nodes, which takes over the primary role in the cluster.
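For context, here’s a rough sketch of that follower-side command. It assumes the command runs as the postgres user and that repmgr’s config lives at a path like /data/repmgr.conf (both details are my assumptions, not verified against Fly’s image):

# Sketch: roughly what fly pg failover triggers on the chosen follower.
# The `su postgres` wrapper and the config path are assumptions.
su postgres -c "repmgr standby switchover --siblings-follow -f /data/repmgr.conf"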

So the errors above are coming from a follower cluster node attempting SSH connections to the other cluster nodes. The failover has to fail: without SSH sessions to those nodes, repmgr cannot orchestrate any changes.

Aside: fly pg failover should absolutely accept repmgr’s --dry-run switch for performing dry runs. Currently, it does not. Typically fly pg failover stops the leader machine to make way for the new leader; a --dry-run switch should avoid that service disruption, just as repmgr’s own dry-run mode does.
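Until then, you can approximate a dry run yourself by hopping onto a follower and asking repmgr directly. This is only a sketch, with the same assumptions as above about the postgres user and the config path:

fly ssh console -a <APP> -s    # select a follower machine, not the current leader
# Inside the machine (config path is an assumption):
su postgres -c "repmgr standby switchover --siblings-follow --dry-run -f /data/repmgr.conf"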

Unfortunately, repmgr can eat errors from ssh and won’t show you exactly why some connections to remote hosts aren’t possible.

By using SSH directly (ssh postgres@fdaa:9:1d32:a7b:e:e63c:7516:2), I quickly spotted the underlying error.

postgres@fdaa:9:1d32:a7b:e:e63c:7516:2: Permission denied (publickey).
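If you want to run the same check yourself, here’s a sketch. It assumes repmgr’s SSH key lives under the postgres user on each machine, which is why the test runs as that user:

fly ssh console -a <APP> -s    # pick any cluster machine
# Test SSH to a peer the way repmgr would (address taken from the warnings above);
# -v shows which keys are offered, which makes publickey failures obvious.
su postgres -c "ssh -v postgres@fdaa:9:1d32:a7b:e:e63c:7516:2"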

At this point, the problem was clearly an SSH public key problem, but I wasn’t aware of the command below that quickly sets your cluster members up with working SSH keys.

The fix

To fix this issue, re-distribute SSH public keys throughout the cluster with:

fly pg renew-certs -a <APP>

Your SSH certificate(s) have been renewed and are set to expire in 36525 day(s)
Run fly deploy --app <APP> --image docker-hub-mirror.fly.io/flyio/postgres-flex:15.8@sha256:5016ffb34e66eca43d4f9ef2f898c166257bd28bd5095c41d049a5e3be15caf5 to apply the changes!

Don’t forget to re-deploy your app after renewing certificates.
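Once the deploy has rolled through, a quick sanity check is to retry the original operation (a sketch; <APP> is your Postgres app name and the image is whatever renew-certs printed):

fly deploy --app <APP> --image <image printed by renew-certs>
# SSH between cluster members should now work, so the failover should complete:
fly pg failover -a <APP>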