The Samsung Campaign
In summer 2012, Samsung partnered with TouchNote to run a campaign tied to the London Olympics. Users could send branded Samsung/Olympics postcards to friends and family. The campaign was heavily promoted on Samsung devices during the Games — and when the marketing kicked in, so did the traffic.
At peak, we were processing approximately 100,000 transactions per hour. For context, our normal throughput was a fraction of that. We had a few weeks of warning. Here's how we prepared — and what we learned.
The Baseline Architecture
TouchNote ran on AWS in 2012, which already put us ahead of many peers. Our baseline stack:
- EC2 instances: PHP application servers behind an Elastic Load Balancer
- MySQL: Master-replica setup for read scaling
- S3: Image storage (user photos, rendered cards)
- SES: Email delivery
- Redis: Session storage and caching
The ELB handled horizontal scaling of application servers reasonably well. The database was the concern.
Database Scaling: Master-Master MySQL Replication
For the campaign, we moved to a Master-Master MySQL replication setup. Unlike standard master-replica (where replicas are read-only), Master-Master allows writes to either node — providing both redundancy and a failover path.
The configuration required careful attention to:
Auto-increment offsets: With two masters, you need them to generate non-conflicting primary keys. We set auto_increment_offset=1 on master 1 and auto_increment_offset=2 on master 2, both with auto_increment_increment=2. Master 1 generates odd IDs, master 2 generates even IDs.
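In my.cnf terms, the split looks like this (an illustrative fragment, one per server, not our exact config files):

```ini
# Master 1 (/etc/my.cnf) -- generates IDs 1, 3, 5, ...
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 1

# Master 2 (/etc/my.cnf) -- generates IDs 2, 4, 6, ...
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 2
```

Both variables can also be set at runtime with SET GLOBAL, but baking them into the config means a restarted node can't accidentally come back with conflicting defaults.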
Write routing: Despite both nodes being writable, we routed all writes to master 1 normally — master 2 was for failover only. Concurrent writes to both masters create conflict resolution headaches. Keep it simple.
Replication lag monitoring: We added alerting on replication lag. High lag means your "replica" is serving stale data. During the campaign, we had alerts fire twice — both times from I/O spikes that resolved within minutes.
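The check itself is simple. A minimal sketch of the kind of monitor we ran, assuming the vertical (`\G`) output of MySQL's SHOW SLAVE STATUS; the function names and 30-second threshold are illustrative, not our exact alerting config:

```python
def parse_slave_status(raw: str) -> dict:
    """Parse the vertical (\\G) output of SHOW SLAVE STATUS into a dict."""
    status = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip the "*** 1. row ***" separator lines
            status[key.strip()] = value.strip()
    return status

def replication_lag_ok(status: dict, threshold_seconds: int = 30) -> bool:
    """MySQL reports Seconds_Behind_Master as NULL when replication is
    broken (not merely slow), so NULL is also an alert condition."""
    lag = status.get("Seconds_Behind_Master", "NULL")
    if lag == "NULL":
        return False
    return int(lag) < threshold_seconds

SAMPLE = """\
*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 4
"""
status = parse_slave_status(SAMPLE)
```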
Auto-Scaling EC2 Instances
We configured EC2 Auto Scaling with CloudWatch metrics triggers. When average CPU across the ASG exceeded 70% for 5 minutes, we'd add instances. Below 30% for 10 minutes, we'd remove them.
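The policy reduces to a simple decision rule. A sketch of the logic (the function is ours for illustration; CloudWatch evaluates the alarms itself):

```python
def scaling_action(avg_cpu_percent: float, minutes_sustained: int) -> str:
    """Mirror our CloudWatch thresholds: scale out at >70% CPU sustained
    for 5 minutes, scale in at <30% sustained for 10 minutes.
    The asymmetric windows act as hysteresis, so a brief dip doesn't
    tear down capacity that a spike has just added."""
    if avg_cpu_percent > 70 and minutes_sustained >= 5:
        return "scale_out"
    if avg_cpu_percent < 30 and minutes_sustained >= 10:
        return "scale_in"
    return "hold"
```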
The gotcha: EC2 instance boot time. A new instance takes 3-4 minutes to boot, apply its configuration, and pass health checks before the ELB sends it traffic. That's 3-4 minutes of degraded performance if traffic spikes suddenly, so we pre-warmed the ASG ahead of expected traffic peaks.
What Actually Stressed the System
The bottleneck during the campaign peak wasn't the application servers — it was the image rendering pipeline. Each postcard required compositing a user photo with the campaign template. This was CPU-intensive, slow, and blocking.
We moved rendering to background jobs (queued via a simple MySQL job table — SQS didn't exist yet in the form it does now). Users got immediate acknowledgment; cards were rendered and dispatched asynchronously. Queue depth became our key operational metric during the campaign.
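The job-table pattern is straightforward to sketch. The schema and names below are illustrative, with SQLite standing in for MySQL so the example is self-contained; the key detail is that a worker claims a job with a single atomic UPDATE, so two workers can never grab the same card:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE render_jobs (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        card_id    INTEGER NOT NULL,
        status     TEXT NOT NULL DEFAULT 'pending',  -- pending | claimed | done
        claimed_by TEXT
    )
""")

# Web tier: enqueue the job and acknowledge the user immediately.
conn.execute("INSERT INTO render_jobs (card_id) VALUES (?)", (42,))
conn.commit()

def claim_next_job(conn, worker_id):
    """Atomically claim the oldest pending job; return (id, card_id) or None."""
    cur = conn.execute(
        """UPDATE render_jobs
              SET status = 'claimed', claimed_by = ?
            WHERE id = (SELECT id FROM render_jobs
                         WHERE status = 'pending'
                         ORDER BY id LIMIT 1)""",
        (worker_id,),
    )
    conn.commit()
    if cur.rowcount == 0:
        return None  # queue is empty
    return conn.execute(
        "SELECT id, card_id FROM render_jobs"
        " WHERE claimed_by = ? AND status = 'claimed'",
        (worker_id,),
    ).fetchone()

def queue_depth(conn):
    """Pending-job count: our key operational metric during the campaign."""
    return conn.execute(
        "SELECT COUNT(*) FROM render_jobs WHERE status = 'pending'"
    ).fetchone()[0]

job = claim_next_job(conn, "worker-1")  # a background renderer picks it up
```

Queue depth is the metric worth graphing: with asynchronous rendering, a growing backlog, not server CPU, is the first sign you're falling behind demand.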
The Numbers
- Peak: ~100K transactions/hour
- Peak concurrent users: ~8,000
- Application servers at peak: 12 EC2 instances (from a normal baseline of 3)
- Rendering queue backlog at peak: ~45 minutes (acceptable for physical card delivery)
- Uptime during campaign: 99.96%
The Olympic campaign was the most demanding operational period we'd had. It gave us confidence in the auto-scaling architecture — and taught us that the application server tier is rarely the first bottleneck.