The Infrastructure Decision No One Talks About Honestly
Most infrastructure discussions assume a level of scale and team capability that doesn't match the reality of most engineering teams. The Kubernetes discourse in particular is dominated by the experience of organisations with dedicated SRE/DevOps functions, where the operational overhead is distributed across specialists.
For a lean team building a subscription platform — where engineering capacity is the most constrained resource — the infrastructure decision has to account for operational overhead as seriously as it accounts for performance or cost.
The Options
| Option | Headline Claim | Reality |
| --- | --- | --- |
| EC2 | Full control, familiar | Pay for idle capacity. Manage patching, security, monitoring. Someone is on-call for the host layer. |
| ECS / Fargate | Containers without K8s complexity | More operational overhead than Lambda. Cost advantage only at sustained high throughput. |
| EKS / Kubernetes | Scalable, orchestrated containers | Designed for orgs with dozens of microservices and dedicated DevOps. Overkill at 4 services. |
| AWS Lambda | Scale to zero, pay per invocation | Cold starts. 15-min limit. Stateless. But: no servers to manage. |
The Workload Profile That Matters
A subscription e-commerce platform has a distinctive traffic pattern: spiky, predictable peaks, with significant quiet periods. Order processing and subscription renewals cluster around specific windows — delivery cutoffs, end of month. Between those windows, the system is relatively quiet.
Lambda's pay-per-invocation model is economically well-suited to this profile. You don't pay for idle compute, and spikes are absorbed by automatic scaling without pre-provisioned capacity.
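A rough cost model makes the point concrete. This is a minimal sketch with illustrative prices, not current AWS rates; the function names and the example workload figures (2M invocations/month at 150 ms, 512 MB) are assumptions for demonstration.

```python
# Illustrative monthly cost comparison for a spiky workload.
# All prices below are example figures, NOT current AWS pricing --
# check the pricing pages before relying on them.

GB_SECOND_PRICE = 0.0000166667    # Lambda compute, per GB-second (assumed)
REQUEST_PRICE = 0.20 / 1_000_000  # Lambda requests, per invocation (assumed)
INSTANCE_HOURLY = 0.0416          # small always-on instance (assumed)

def lambda_monthly_cost(invocations, avg_duration_s, memory_gb):
    """Pay only for what runs: compute (GB-seconds) + per-request fee."""
    compute = invocations * avg_duration_s * memory_gb * GB_SECOND_PRICE
    requests = invocations * REQUEST_PRICE
    return compute + requests

def instance_monthly_cost(hours=730):
    """An always-on instance bills for every hour, busy or idle."""
    return INSTANCE_HOURLY * hours

# A spiky profile: 2M invocations/month, 150 ms average, 512 MB.
print(f"Lambda: ${lambda_monthly_cost(2_000_000, 0.150, 0.5):.2f}/month")
print(f"EC2:    ${instance_monthly_cost():.2f}/month")
```

Under these assumptions the Lambda bill is a small fraction of the always-on instance cost; the gap narrows as sustained throughput rises, which is exactly the table's point about ECS/Fargate.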
Lambda Trade-offs and Mitigations
| Trade-off | Mitigation |
| --- | --- |
| Cold starts (~200ms) | Acceptable — not a real-time trading system. Provisioned concurrency available if SLAs demand sub-50ms. |
| 15-minute max execution limit | Long-running operations are handled via SQS queues with dedicated Lambda workers. The API Lambda never has long-running tasks. |
| Statelessness | All session state in Redis (ElastiCache). No in-process state assumed anywhere. |
| ARM64 vs x86 | Not a trade-off — ARM64 (Graviton2) provides ~20% better price/performance. Use it by default. |
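The queue decomposition in the table can be sketched in a few lines. This is an in-memory stand-in for SQS, assuming a subscription-renewal workload; the handler names and message shape are hypothetical, and in the real system each handler would be a separate Lambda wired to API Gateway and an SQS event source.

```python
# Sketch of the queue-decomposition pattern: the API handler only
# enqueues and returns fast; a separate worker drains the queue.
# deque stands in for SQS; handler names are hypothetical.
import json
from collections import deque

queue = deque()  # in-memory stand-in for an SQS queue

def api_handler(event):
    """Fast path: validate, enqueue, return 202. Never runs long tasks,
    so the API Lambda stays far from the 15-minute limit."""
    queue.append(json.dumps({"subscription_id": event["subscription_id"]}))
    return {"status": 202}  # accepted for async processing

def worker_handler():
    """Worker path: each message is an independent unit of work that
    fits comfortably inside one Lambda invocation."""
    processed = []
    while queue:
        msg = json.loads(queue.popleft())
        processed.append(msg["subscription_id"])  # real renewal logic here
    return processed

api_handler({"subscription_id": "sub-123"})
api_handler({"subscription_id": "sub-456"})
print(worker_handler())  # -> ['sub-123', 'sub-456']
```

The same split also covers statelessness: the queue, not process memory, carries the work between invocations.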
Why Not Kubernetes
I've run Kubernetes in production before on self-managed EC2 (see my earlier posts). The decision framework is simple: does the complexity of Kubernetes serve this team and this workload?
For a platform with 4 services, a team without a dedicated DevOps function, and a workload Lambda handles well — the answer is no. The engineering time spent managing cluster upgrades, pod disruption budgets, etcd backups, and CNI issues is time not spent building features.
Kubernetes earns its complexity at 20+ microservices with sustained high throughput and the team to support it.
The Managed Services Principle
The infrastructure decision connects to a principle I apply consistently: a small team's most constrained resource is engineering time, not infrastructure cost.
Paying AWS for managed RDS, managed Redis, and managed SQS is a sensible trade-off. The alternative — self-managing these on EC2 — requires engineers monitoring and patching database hosts, managing Redis replication and failover, and being on-call for infrastructure incidents that AWS now handles.
The cost differential between managed and self-managed is rarely as large as teams assume. The hidden cost of self-management in engineer time and on-call burden is consistently underestimated.
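A back-of-the-envelope calculation shows why the hidden cost dominates. Every figure here is an illustrative assumption (a managed database at $180/month versus a $110/month self-managed instance consuming ~6 engineer-hours/month at $90/hour), not a quote.

```python
# Back-of-the-envelope for the hidden-cost claim. All figures are
# illustrative assumptions, not real prices or salaries.

def self_managed_monthly(instance_cost, engineer_hours, hourly_rate):
    """Self-managed = instance cost + ongoing engineer time
    (patching, backups, failover drills, on-call)."""
    return instance_cost + engineer_hours * hourly_rate

def managed_monthly(service_cost):
    """Managed service: the engineer time is in the price."""
    return service_cost

managed = managed_monthly(180)
self_managed = self_managed_monthly(110, 6, 90)
print(f"managed: ${managed}/mo, self-managed: ${self_managed}/mo")
```

With these assumptions the engineer time is roughly five times the instance cost itself, and that is before counting the on-call burden, which doesn't fit on an invoice at all.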
The Result
Lambda on ARM64, managed RDS for MySQL, ElastiCache for Redis, SQS for async processing. GitHub Actions CI/CD with Docker image tagging (commit SHA traceability). Serverless Framework for multi-stage Lambda deployments.
Eighteen months in: the infrastructure decision has not been a constraint. No engineer has been paged for a host-level infrastructure incident. That was the goal.