The Samsung Campaign
In summer 2012, Samsung partnered with TouchNote to run a campaign tied to the London Olympics. Users could send branded Samsung/Olympics postcards to friends and family. The campaign was heavily promoted on Samsung devices during the Games — and when the marketing kicked in, so did the traffic.
At peak, we were processing approximately 100,000 transactions per hour. For context, our normal throughput was a fraction of that. We had a few weeks of warning. Here's how we prepared — and what we learned.
The Baseline Architecture
TouchNote ran on AWS in 2012, which already put us ahead of many peers. Our baseline stack:
- EC2 instances: PHP application servers behind an Elastic Load Balancer
- MySQL: Master-replica setup for read scaling
- S3: Image storage (user photos, rendered cards)
- SES: Email delivery
- Redis: Session storage and caching
The ELB handled horizontal scaling of application servers reasonably well. The database was the concern.
Database Scaling: Master-Master MySQL Replication
For the campaign, we moved to a Master-Master MySQL replication setup. Unlike standard master-replica (where replicas are read-only), Master-Master allows writes to either node — providing both redundancy and a failover path.
The configuration required careful attention to:
Auto-increment offsets: With two masters, you need them to generate non-conflicting primary keys. We set auto_increment_offset=1 on master 1 and auto_increment_offset=2 on master 2, both with auto_increment_increment=2. Master 1 generates odd IDs, master 2 generates even IDs.
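In my.cnf terms, the split looks like this (an illustrative fragment, one per server, not our exact config files):

```ini
# Master 1 (/etc/my.cnf) -- generates IDs 1, 3, 5, ...
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 1

# Master 2 (/etc/my.cnf) -- generates IDs 2, 4, 6, ...
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 2
```

Both variables can also be set at runtime with SET GLOBAL, but baking them into the config means a restarted node can't accidentally come back with conflicting defaults.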
Write routing: Despite both nodes being writable, we routed all writes to master 1 normally — master 2 was for failover only. Concurrent writes to both masters create conflict resolution headaches. Keep it simple.
Replication lag monitoring: We added alerting on replication lag. High lag means your "replica" is serving stale data. During the campaign, we had alerts fire twice — both times from I/O spikes that resolved within minutes.
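The check itself is simple. A minimal sketch of the kind of monitor we ran, assuming the vertical (`\G`) output of MySQL's SHOW SLAVE STATUS; the function names and 30-second threshold are illustrative, not our exact alerting config:

```python
def parse_slave_status(raw: str) -> dict:
    """Parse the vertical (\\G) output of SHOW SLAVE STATUS into a dict."""
    status = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip the "*** 1. row ***" separator lines
            status[key.strip()] = value.strip()
    return status

def replication_lag_ok(status: dict, threshold_seconds: int = 30) -> bool:
    """MySQL reports Seconds_Behind_Master as NULL when replication is
    broken (not merely slow), so NULL is also an alert condition."""
    lag = status.get("Seconds_Behind_Master", "NULL")
    if lag == "NULL":
        return False
    return int(lag) < threshold_seconds

SAMPLE = """\
*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 4
"""
status = parse_slave_status(SAMPLE)
```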
Auto-Scaling EC2 Instances
We configured EC2 Auto Scaling with CloudWatch metrics triggers. When average CPU across the ASG exceeded 70% for 5 minutes, we'd add instances. Below 30% for 10 minutes, we'd remove them.
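The policy reduces to a simple decision rule. A sketch of the logic (the function is ours for illustration; CloudWatch evaluates the alarms itself):

```python
def scaling_action(avg_cpu_percent: float, minutes_sustained: int) -> str:
    """Mirror our CloudWatch thresholds: scale out at >70% CPU sustained
    for 5 minutes, scale in at <30% sustained for 10 minutes.
    The asymmetric windows act as hysteresis, so a brief dip doesn't
    tear down capacity that a spike has just added."""
    if avg_cpu_percent > 70 and minutes_sustained >= 5:
        return "scale_out"
    if avg_cpu_percent < 30 and minutes_sustained >= 10:
        return "scale_in"
    return "hold"
```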
The gotcha: EC2 instance boot time. A new instance takes 3-4 minutes to boot, apply its configuration, and pass health checks before the ELB sends it traffic. That's 3-4 minutes of degraded performance if traffic spikes suddenly, so we pre-warmed the ASG ahead of expected traffic peaks.
What Actually Stressed the System
The bottleneck during the campaign peak wasn't the application servers — it was the image rendering pipeline. Each postcard required compositing a user photo with the campaign template. This was CPU-intensive, slow, and blocking.
We moved rendering to background jobs (queued via a simple MySQL job table — SQS didn't exist yet in the form it does now). Users got immediate acknowledgment; cards were rendered and dispatched asynchronously. Queue depth became our key operational metric during the campaign.
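The job-table pattern is straightforward to sketch. The schema and names below are illustrative, with SQLite standing in for MySQL so the example is self-contained; the key detail is that a worker claims a job with a single atomic UPDATE, so two workers can never grab the same card:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE render_jobs (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        card_id    INTEGER NOT NULL,
        status     TEXT NOT NULL DEFAULT 'pending',  -- pending | claimed | done
        claimed_by TEXT
    )
""")

# Web tier: enqueue the job and acknowledge the user immediately.
conn.execute("INSERT INTO render_jobs (card_id) VALUES (?)", (42,))
conn.commit()

def claim_next_job(conn, worker_id):
    """Atomically claim the oldest pending job; return (id, card_id) or None."""
    cur = conn.execute(
        """UPDATE render_jobs
              SET status = 'claimed', claimed_by = ?
            WHERE id = (SELECT id FROM render_jobs
                         WHERE status = 'pending'
                         ORDER BY id LIMIT 1)""",
        (worker_id,),
    )
    conn.commit()
    if cur.rowcount == 0:
        return None  # queue is empty
    return conn.execute(
        "SELECT id, card_id FROM render_jobs"
        " WHERE claimed_by = ? AND status = 'claimed'",
        (worker_id,),
    ).fetchone()

def queue_depth(conn):
    """Pending-job count: our key operational metric during the campaign."""
    return conn.execute(
        "SELECT COUNT(*) FROM render_jobs WHERE status = 'pending'"
    ).fetchone()[0]

job = claim_next_job(conn, "worker-1")  # a background renderer picks it up
```

Queue depth is the metric worth graphing: with asynchronous rendering, a growing backlog, not server CPU, is the first sign you're falling behind demand.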
The Numbers
- Peak: ~100K transactions/hour
- Peak concurrent users: ~8,000
- Application servers at peak: 12 EC2 instances (from a normal baseline of 3)
- Rendering queue backlog at peak: ~45 minutes (acceptable for physical card delivery)
- Uptime during campaign: 99.96%
The Olympic campaign was the most demanding operational period we'd had. It gave us confidence in the auto-scaling architecture — and taught us that the application server tier is rarely the first bottleneck.