The Synchronous API Trap
A common antipattern in web APIs is treating every operation as synchronous: the HTTP request comes in, everything executes, the HTTP response goes out. This works fine when "everything" is fast. In a subscription platform, "everything" is not fast.
A typical subscription checkout might involve:
- Creating the subscription record
- Capturing the payment (Stripe API call, 200-500ms)
- Sending a welcome email (Brevo API call, 100-300ms)
- Syncing the customer to the support CRM (100-200ms)
- Triggering the first order creation workflow
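Done synchronously: the three third-party calls above add up to roughly 400ms-1s on their own, and potentially 1-2 seconds once the remaining steps are included, all of it blocking the response. Any one of them failing could fail the entire checkout.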
Done asynchronously: the checkout response returns in ~150ms. Everything else happens in the background.
The Queue Architecture
We use dedicated SQS queues for domain-specific async processing. Each queue has a corresponding Lambda worker.
| Queue | Domain | Purpose |
| --- | --- | --- |
| AccountQueue | User management | Account creation, updates, deletion |
| EmailsQueue | Communications | Transactional emails via Brevo |
| OrdersQueue | Order processing | Order creation and status transitions |
| ChargesQueue | Payments | Payment capture — decoupled from order creation |
| SmsQueue | Communications | SMS notifications |
| BrevoQueue | Marketing | Webhook processing and contact sync |
| DixaQueue | Customer support | Customer record sync to CRM |
The critical one is ChargesQueue: payment capture is decoupled from order creation. The order is created synchronously (the customer needs confirmation), but the actual charge is enqueued and processed asynchronously. This means transient Stripe errors don't fail subscription creation — they result in a retry 60 seconds later, invisible to the user.
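The split described above can be sketched as follows. This is a minimal illustration under stated assumptions: `ChargeMessage`, `QueueSender`, `createOrder`, and the field names are hypothetical, and the sender is injected so that the actual SQS client call stays outside the sketch.

```typescript
// Hypothetical message shape for ChargesQueue; field names are assumptions.
interface ChargeMessage {
  orderId: string;
  customerId: string;
  amountCents: number;
}

// Hypothetical abstraction over an SQS send call, injected to keep the sketch testable.
type QueueSender = (queueName: string, body: string) => Promise<void>;

interface Order {
  id: string;
  customerId: string;
  amountCents: number;
  status: "pending_payment";
}

let nextOrderNumber = 1;

// Stand-in for the synchronous database write that creates the order.
function createOrder(customerId: string, amountCents: number): Order {
  const id = `order-${nextOrderNumber++}`;
  return { id, customerId, amountCents, status: "pending_payment" };
}

// Creates the order on the request path, then enqueues the charge and returns.
// No payment provider call happens before the response is sent.
async function checkout(
  customerId: string,
  amountCents: number,
  send: QueueSender,
): Promise<Order> {
  const order = createOrder(customerId, amountCents);
  const message: ChargeMessage = { orderId: order.id, customerId, amountCents };
  await send("ChargesQueue", JSON.stringify(message));
  return order;
}
```

In this sketch the queue send is the only remote call left on the request path; the charge worker picks the message up later and owns all retries.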
The Reliability Benefits
Message persistence: A message in SQS stays in the queue until a worker processes and deletes it, or until the queue's retention period expires (4 days by default, configurable up to 14 days). If a downstream service is temporarily unavailable, the message stays in the queue and processes when the service recovers. With synchronous calls, a temporary downstream outage causes user-visible failures.
Dead Letter Queues (DLQ): Messages that still fail after maxReceiveCount delivery attempts move to a DLQ rather than disappearing silently. Failed customer syncs or email sends stay visible for investigation.
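In SQS, a DLQ is attached through the source queue's `RedrivePolicy` attribute, a JSON string containing `deadLetterTargetArn` and `maxReceiveCount`. The helper below is a sketch of building that attribute; `redriveAttributes` is a hypothetical name, and the ARN and the receive count of 5 are placeholder values, not our actual settings.

```typescript
// Builds the RedrivePolicy attribute for SQS CreateQueue / SetQueueAttributes.
// The value must be a JSON-encoded string, not a nested object.
function redriveAttributes(
  deadLetterTargetArn: string,
  maxReceiveCount: number,
): { RedrivePolicy: string } {
  if (!Number.isInteger(maxReceiveCount) || maxReceiveCount < 1) {
    throw new RangeError("maxReceiveCount must be a positive integer");
  }
  return {
    RedrivePolicy: JSON.stringify({ deadLetterTargetArn, maxReceiveCount }),
  };
}

// Placeholder ARN and count, for illustration only.
const emailsQueueAttributes = redriveAttributes(
  "arn:aws:sqs:eu-west-1:123456789012:EmailsQueue-dlq",
  5,
);
```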
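Retries with backoff: SQS redelivers a failed message automatically once its visibility timeout expires, so transient failures need no custom retry loop. The redelivery interval is fixed by default; a growing schedule for your welcome email, such as 1 minute, 5 minutes, 25 minutes, requires the worker to extend the message's visibility timeout (ChangeMessageVisibility) according to its receive count.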
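A schedule of 1, 5, and 25 minutes corresponds to a delay of 60 × 5^(n−1) seconds for the n-th receive. One way to get it with SQS is for the worker to set the message's visibility timeout from its `ApproximateReceiveCount` attribute before failing; the sketch below computes that value. The function names, base delay, and factor are assumptions; the cap of 43,200 seconds is the SQS maximum visibility timeout (12 hours).

```typescript
const BASE_DELAY_SECONDS = 60; // first retry after 1 minute (assumed)
const BACKOFF_FACTOR = 5; // 1 min, 5 min, 25 min, ... (assumed)
const MAX_VISIBILITY_TIMEOUT_SECONDS = 43200; // SQS limit: 12 hours

// Returns the delay before the next delivery, given how many times
// the message has been received so far (1 for the first delivery).
function backoffSeconds(receiveCount: number): number {
  const attempt = Number.isFinite(receiveCount)
    ? Math.max(1, Math.floor(receiveCount))
    : 1;
  const delay = BASE_DELAY_SECONDS * Math.pow(BACKOFF_FACTOR, attempt - 1);
  return Math.min(delay, MAX_VISIBILITY_TIMEOUT_SECONDS);
}

// Lambda delivers ApproximateReceiveCount as a string in record.attributes.
function backoffForRecord(attributes: { ApproximateReceiveCount: string }): number {
  return backoffSeconds(Number.parseInt(attributes.ApproximateReceiveCount, 10));
}
```

The result would be passed as `VisibilityTimeout` to `ChangeMessageVisibility`, together with the record's receipt handle, before the handler reports the failure.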
Independent scaling: Each queue worker can be configured with its own Lambda concurrency settings. During peak periods, the payment worker can scale without affecting the main API.
Why SQS Over RabbitMQ, EventBridge, or Kafka
| Alternative | Why Not Chosen |
| --- | --- |
| RabbitMQ | Requires operating a message broker. SQS is fully managed. |
| AWS EventBridge | Designed for event routing across multiple consumers. Our use case is simpler point-to-point queuing. |
| Apache Kafka | Designed for very high throughput (millions of messages per day). Our volume of ~10,000 messages/day doesn't justify Kafka's operational overhead. |
The Pattern
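```typescript
import type { SQSEvent } from 'aws-lambda';

// processMessage and logger are assumed to be defined elsewhere in the worker.
export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    try {
      // Parse inside the try block so malformed bodies are logged as well
      const message = JSON.parse(record.body);
      await processMessage(message);
    } catch (error) {
      // Throwing fails the invocation, so the batch is retried (up to maxReceiveCount)
      // After maxReceiveCount, message moves to DLQ
      logger.error('Message processing failed', { error, messageId: record.messageId, body: record.body });
      throw error;
    }
  }
};
```
Important: design for idempotency from the start. Because the handler throws on the first failure, the whole batch returns to the queue, including messages that were already processed successfully, and standard SQS queues deliver at least once in any case. Some messages will be processed twice, and your handlers must handle this gracefully.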
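One way to limit duplicate work is to combine an idempotency check with Lambda's partial batch responses, where the handler returns the IDs of failed messages in `batchItemFailures` instead of throwing. Partial batch responses only take effect when `ReportBatchItemFailures` is enabled on the event source mapping. The sketch below uses local types and an in-memory `Set` as stand-ins; a real worker would need a durable store, and `processBatch`, `QueueRecord`, and `ProcessedStore` are hypothetical names.

```typescript
// Minimal local stand-ins for the SQS record and batch response shapes.
interface QueueRecord {
  messageId: string;
  body: string;
}
interface BatchResponse {
  batchItemFailures: Array<{ itemIdentifier: string }>;
}

// Hypothetical idempotency store. A Set works for the sketch; production
// code would use something durable, such as a table with a unique key.
interface ProcessedStore {
  has(id: string): boolean;
  add(id: string): unknown;
}

async function processBatch(
  records: QueueRecord[],
  processed: ProcessedStore,
  processMessage: (message: unknown) => Promise<void>,
): Promise<BatchResponse> {
  const batchItemFailures: Array<{ itemIdentifier: string }> = [];
  for (const record of records) {
    // Redelivered message that already succeeded: skip it.
    // Keying on messageId is an assumption; a business key such as an
    // order ID also catches duplicates created by the producer.
    if (processed.has(record.messageId)) continue;
    try {
      await processMessage(JSON.parse(record.body));
      processed.add(record.messageId);
    } catch {
      // Report only this message as failed; the rest of the batch is kept.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}
```

With this shape, only the failed messages return to the queue, and a message that is delivered again after succeeding is skipped rather than processed twice.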
The Surprising Finding
Moving payment capture to an async queue improved our payment success rates. When capture was synchronous, transient Stripe errors during high-traffic periods caused user-visible checkout failures. Async capture means transient errors result in a retry — invisible to the user, and ultimately successful.
The general principle: if an operation involves a third-party API call and the result is not needed for the user's immediate next action, it should be async. This is the right default for subscription platforms.