~/blog/aws-elb-mysql-replication-lessons
Infrastructure2012·20 November 2012

AWS Load Balancers and MySQL Replication: What Nobody Tells You

Running MySQL replication and AWS ELB in production teaches you things the documentation doesn't cover. Here are the hard-won lessons from scaling TouchNote through the 2012 Olympics period.

AWSELBMySQLreplicationproduction-lessons

Connection Draining Is Not Optional

The first thing ELB beginners get wrong: not enabling connection draining. When an instance is deregistered from the ELB (during scale-in, deployments, or instance termination), without draining, in-flight requests are immediately severed. Users get errors.

Enable connection draining with a sensible timeout (60-120 seconds for most web applications). This tells the ELB to stop sending new connections to the deregistering instance while allowing existing connections to complete.

Health Check Tuning Matters

The default ELB health check settings are often wrong for production:

  • ·Check interval: Default 30 seconds. Consider 10 seconds for faster failure detection.
  • ·Unhealthy threshold: Default 2 consecutive failures. Fine for most cases.
  • ·Healthy threshold: Default 10 consecutive successes to mark an instance healthy. This means a newly launched instance takes 100 seconds minimum before receiving traffic. In an auto-scaling event, that's 100 seconds of capacity you're waiting for.

We reduced our healthy threshold to 3 (30 seconds) after the Olympics pre-warm exercise revealed this delay.

Sticky Sessions: Usually Wrong

ELB sticky sessions (session affinity) route a user to the same EC2 instance for their session duration. This sounds convenient — no need for distributed session storage. It's a trap.

The problem: uneven load distribution. If you have 3 instances and sticky sessions are enabled, a user with a long session on instance 1 keeps that instance busy while instances 2 and 3 sit lighter. Scale-in is also complicated — you can't terminate an instance with active sticky sessions without disrupting those users.

Use Redis or ElastiCache for session storage. Stateless application servers are infinitely easier to scale.

MySQL Replication Gotchas

Binary log format: Use ROW-based binary logging (binlog_format=ROW) rather than STATEMENT-based for replication. Statement-based replication can produce different results on replicas if queries use non-deterministic functions (NOW(), RAND(), etc.). ROW-based logging captures the actual row changes — deterministic and safe.

innodb_flush_log_at_trx_commit: Setting this to 1 (the safe default) means MySQL flushes the InnoDB log to disk on every transaction commit. On a write-heavy replica, this creates I/O pressure. Some teams set it to 2 on replicas (flush to OS cache, not disk — a small data loss risk on OS crash but not on MySQL crash). Know the tradeoff before you change it.

Read your replicas: If you're not routing read queries to replicas, you're not getting the benefit of replication. This requires connection pool logic that distinguishes reads from writes. We wrapped our database access layer with a simple read/write router.

These aren't glamorous lessons. But they're the difference between an architecture that holds under pressure and one that fails when it matters.