Aurora MySQL Lessons From Real Production Systems
Replication lag, failover timing, schema migration safety, and monitoring thresholds worth setting.
Introduction
Running Aurora MySQL in production teaches you things the documentation doesn’t cover directly. This article collects the lessons that came up repeatedly across real systems.
Replication lag deserves its own alert
Reader endpoints in Aurora use replicas. When a replica falls behind the writer, reads from
that endpoint return stale data — silently. Set a CloudWatch alarm on AuroraReplicaLag
with a threshold appropriate for your application’s tolerance, not just AWS defaults.
For most transactional workloads, lag above 500ms should trigger a warning. Above 5 seconds should page someone.
Failover timing is not instantaneous
Aurora’s automatic failover typically completes in 30–60 seconds. Your application needs to handle that window. This means:
- Connection retry logic with exponential backoff
- Queue workers that tolerate temporary write failures
- Health checks that don’t kill containers faster than the cluster recovers
Schema migrations need a checklist
Zero-downtime migrations require more discipline than just running ALTER TABLE on a staging
environment. Before any schema change ships:
- Verify the migration is backwards-compatible with the current code version
- Check whether the table is large enough to require pt-online-schema-change or similar
- Confirm binlog retention is long enough to replay changes if something goes wrong
- Have an explicit rollback query ready before you start
Monitoring thresholds worth setting
Beyond replication lag, the following CloudWatch metrics are worth alarming on:
DatabaseConnections— alert before you hitmax_connectionsFreeLocalStorage— Aurora local storage for temp tables and binary logsDMLThroughput— sudden spikes often precede lock contention incidentsDeadlockCount— any deadlocks in production are worth investigating, not just accepting
Conclusion
Most Aurora incidents are preventable with the right monitoring in place before something goes wrong. The time to set thresholds is during a calm week, not during an incident.