Aurora, MySQL & PostgreSQL

Aurora MySQL Lessons From Real Production Systems

Replication lag, failover timing, schema migration safety, and monitoring thresholds worth setting.


Introduction

Running Aurora MySQL in production teaches you things the documentation doesn’t cover directly. This article collects the lessons that came up repeatedly across real systems.

Replication lag deserves its own alert

Aurora’s reader endpoint routes connections to replicas. When a replica falls behind the writer, reads from that endpoint silently return stale data. Set a CloudWatch alarm on AuroraReplicaLag (reported in milliseconds) with a threshold that matches your application’s tolerance for stale reads, rather than relying on a generic default.

For most transactional workloads, lag above 500 ms should trigger a warning. Lag above 5 seconds should page someone.
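
As a rough illustration, the warning and paging thresholds could be written as CloudWatch alarm definitions. The sketch below only builds the keyword arguments accepted by boto3’s put_metric_alarm and sends nothing to AWS. The alarm naming scheme, the instance identifier my-aurora-replica-1, the 60-second period, and the evaluation counts are all assumptions to adapt.

```python
def replica_lag_alarm(instance_id, severity, threshold_ms, evaluation_periods):
    """Build keyword arguments for CloudWatch put_metric_alarm (no AWS call is made here)."""
    return {
        "AlarmName": f"aurora-replica-lag-{severity}-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/RDS",
        "MetricName": "AuroraReplicaLag",  # CloudWatch reports this metric in milliseconds
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,  # seconds per datapoint (assumed)
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
    }


alarms = [
    # Warn only after sustained lag; page as soon as lag is severe (both counts are assumptions).
    replica_lag_alarm("my-aurora-replica-1", "warning", 500, evaluation_periods=3),
    replica_lag_alarm("my-aurora-replica-1", "page", 5000, evaluation_periods=1),
]

for alarm in alarms:
    print(alarm["AlarmName"], alarm["Threshold"])
```

With boto3 installed and credentials configured, each dictionary could be passed as boto3.client("cloudwatch").put_metric_alarm(**alarm), usually with an AlarmActions entry pointing at your own notification topic.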

Failover timing is not instantaneous

Aurora’s automatic failover typically completes in 30–60 seconds. Your application needs to handle that window. This means:

  • Connection retry logic with exponential backoff
  • Queue workers that tolerate temporary write failures
  • Health checks that don’t kill containers faster than the cluster recovers
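
The first item can be sketched as a small retry helper. This is a minimal version under stated assumptions: the exception types are standard-library stand-ins (a real driver raises its own error classes), and the attempt count and delays are placeholder values that should be tuned so the total retry time comfortably exceeds the failover window you actually observe.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=10, base_delay=0.5, max_delay=10.0,
                       retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation() until it succeeds, using capped exponential backoff with jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Sleep between half the delay and the full delay so clients do not retry in lockstep.
            sleep(delay / 2 + random.uniform(0, delay / 2))


calls = {"count": 0}


def flaky_write():
    # Stand-in for a database write that fails while the cluster is failing over.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("writer unavailable")
    return "committed"


# The demo replaces sleep with a no-op so it finishes immediately.
result = retry_with_backoff(flaky_write, sleep=lambda seconds: None)
print(result, calls["count"])
```

The same budget matters for the third item: if a container health check gives up sooner than the retry helper does, the orchestrator restarts healthy containers in the middle of a failover.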

Schema migrations need a checklist

Zero-downtime migrations require more discipline than just running ALTER TABLE on a staging environment. Before any schema change ships:

  1. Verify the migration is backwards-compatible with the current code version
  2. Check whether the table is large enough to require pt-online-schema-change or similar
  3. Confirm binlog retention is long enough to replay changes if something goes wrong
  4. Have an explicit rollback query ready before you start
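
One way to make the checklist hard to skip is to encode it as a pre-flight check that a deploy script runs before the migration. The sketch below is illustrative only: the MigrationPlan fields, the 10-million-row cutoff for a "large" table, and the example table and column names are all assumptions, not values from any particular system.

```python
from dataclasses import dataclass

LARGE_TABLE_ROWS = 10_000_000  # assumed cutoff; choose one based on your own ALTER timings


@dataclass
class MigrationPlan:
    backwards_compatible: bool     # current code version works against the new schema
    table_rows: int                # approximate row count of the table being altered
    uses_online_tool: bool         # pt-online-schema-change or similar
    binlog_retention_hours: int    # retention currently configured
    required_retention_hours: int  # how far back you might need to replay
    rollback_sql: str              # explicit rollback statement, written in advance


def blocking_issues(plan):
    """Return the checklist items that are not yet satisfied."""
    issues = []
    if not plan.backwards_compatible:
        issues.append("migration is not backwards-compatible with the current code version")
    if plan.table_rows >= LARGE_TABLE_ROWS and not plan.uses_online_tool:
        issues.append("large table needs an online schema change tool")
    if plan.binlog_retention_hours < plan.required_retention_hours:
        issues.append("binlog retention is too short to replay changes")
    if not plan.rollback_sql.strip():
        issues.append("no rollback query prepared")
    return issues


plan = MigrationPlan(
    backwards_compatible=True,
    table_rows=50_000_000,
    uses_online_tool=False,
    binlog_retention_hours=24,
    required_retention_hours=48,
    rollback_sql="ALTER TABLE orders DROP COLUMN shipped_at",  # hypothetical table and column
)

for issue in blocking_issues(plan):
    print("BLOCKED:", issue)
```

An empty list means every item on the checklist has been addressed; anything else should stop the migration before it starts.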

Monitoring thresholds worth setting

Beyond replication lag, the following CloudWatch metrics are worth alarming on:

  • DatabaseConnections: alert before you hit max_connections
  • FreeLocalStorage: per-instance local storage that Aurora uses for temporary tables and log files
  • DMLThroughput: sudden spikes often precede lock contention incidents
  • Deadlocks: any deadlock in production is worth investigating rather than accepting as normal
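
For the connection alarm, "before you hit max_connections" has to become a concrete number. A minimal sketch, assuming a 20% headroom and a max_connections value of 1000 (both placeholders; the real limit depends on your instance class and parameter group):

```python
def connection_alarm_threshold(max_connections, headroom_percent=20):
    """Return the DatabaseConnections count at which to alert, keeping some headroom."""
    if not 0 < headroom_percent < 100:
        raise ValueError("headroom_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return max_connections * (100 - headroom_percent) // 100


max_connections = 1000  # assumed; check the real value with SHOW VARIABLES LIKE 'max_connections'
threshold = connection_alarm_threshold(max_connections)
print(f"alert when DatabaseConnections >= {threshold}")
```

The headroom should be large enough that someone can react (or a connection pool can shed load) before new connections start being refused.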

Conclusion

Most Aurora incidents are preventable with the right monitoring in place before something goes wrong. The time to set thresholds is during a calm week, not during an incident.