Microservices - Small Services, Large Coordination Problems // Megha Bose

In August 2008, Netflix suffered a major database corruption and could not ship DVDs to its members for three days. The failure was in the company’s single Oracle database - the one that held everything. User accounts, rental queues, billing, content metadata. All of it lived in that one instance, and when it failed, all of it stopped working simultaneously. Netflix had several million subscribers at the time, and one failure reached every one of them.

That outage was the beginning of a seven-year migration to microservices. By 2015, Netflix was running over 500 independent services across AWS. The question worth asking isn’t whether they should have done it - they clearly had to. The question is whether you should.

The Monolith Isn’t Wrong

A monolith deploys as a single unit. All components share a process, a codebase, and often a database. This is simple, and simple is a feature.

A developer can run the entire system on a laptop. Transactions cross domain boundaries trivially - updating a user’s balance and posting a notification happens in a single database transaction, with real ACID guarantees. Refactoring is an IDE operation: rename a function and the compiler finds every call site. Debugging follows a single call stack.
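To make the transaction point concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for the monolith’s database. The table names, columns, and the debit_and_notify helper are assumptions made up for illustration.

```python
import sqlite3

def debit_and_notify(conn, user_id, amount):
    """Debit a balance and record a notification as one atomic unit."""
    # `with conn` commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE user_id = ?",
            (amount, user_id),
        )
        row = conn.execute(
            "SELECT balance FROM accounts WHERE user_id = ?", (user_id,)
        ).fetchone()
        if row is None or row[0] < 0:
            # Raising here undoes the UPDATE above; no notification is written.
            raise ValueError("unknown user or insufficient funds")
        conn.execute(
            "INSERT INTO notifications (user_id, message) VALUES (?, ?)",
            (user_id, f"Debited {amount}"),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (user_id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE notifications (user_id INTEGER, message TEXT)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

debit_and_notify(conn, 1, 30)
```

Either both rows change or neither does. Once the balance and the notification live in different services, that guarantee has to be rebuilt by hand.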

The monolith’s problems are not imaginary, but they are problems of scale. Deploying any component requires deploying all components. A memory leak in the analytics module affects payment processing. Scaling the recommendation engine means scaling the authentication code along with it, even if auth is sitting idle. A large enough codebase starts to exceed what any one team can reason about.

None of these problems appear in a team of five engineers. They start appearing around fifty.

What Microservices Actually Buy You

Microservices enforce boundaries at the process level. Each service has its own deployment pipeline and its own runtime, and usually its own repository. Teams deploy independently. Services fail independently. A crash in the notification service does not take down checkout.

The real benefit is organizational, not technical. Amazon famously uses the “two-pizza team” rule: if a team can’t be fed by two pizzas, it’s too large. Teams that own a service - from writing it to running it in production at 3 AM - build better software and have clearer accountability. Martin Fowler and James Lewis gave this pattern a name and a formal treatment in 2014, but the idea was already operational at Amazon and eBay before they wrote about it.

The technical benefits follow from the organizational ones: independent deployment, technology choice per service (use Go for the high-throughput data pipeline, Python for the ML service), and the ability to scale individual components based on their actual load rather than the load of the entire application.

The cost: every service boundary is now a network call. Network calls have latency, fail in ways function calls cannot, require serialization, need timeouts and retries, and are dramatically harder to test than in-process calls. What was a one-line function call is now an HTTP request with circuit breakers, distributed tracing, and operational overhead.
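As a rough sketch of what the extra machinery looks like, the helper below wraps a remote call in retries with exponential backoff. The name call_with_retries and its parameters are hypothetical, and `call` stands in for whatever HTTP or gRPC client call the service actually makes (which should carry its own timeout).

```python
import time

def call_with_retries(call, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Invoke `call`, retrying transient network errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)

# Usage with a stand-in for a flaky remote call.
outcomes = iter([ConnectionError("reset"), "profile-data"])

def fetch_profile():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

result = call_with_retries(fetch_profile, sleep=lambda seconds: None)
```

Retrying is only safe when the remote operation is idempotent. A retried "charge card" request that already succeeded on the server is a double charge, which is one more thing an in-process function call never made you think about.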

The Distributed Monolith Antipattern

Here is the failure mode that doesn’t get talked about enough: the distributed monolith.

A distributed monolith has all the operational complexity of microservices with none of the independence benefits. Services are deployed separately, but they’re so tightly coupled at the data level that deployments still have to be coordinated. Service A reads directly from service B’s database. Service C’s schema change breaks service D at runtime, discovered in production. Services call each other in synchronous chains so deep that a slowdown in the user-profile service cascades into payment timeouts.

The tell: if your services always deploy together, you have a distributed monolith. If two services share a database table, you have a distributed monolith. If fixing a bug in one service requires changes in three others before it can go out, you have a distributed monolith.

Domain-driven design (DDD) gives vocabulary for avoiding this. A bounded context is a part of the domain where a particular model applies consistently. The Order context has a concept of “order” that the Inventory context doesn’t share, even if both refer to the same physical goods. Services map to bounded contexts - one service owns one bounded context, and no service reaches into another’s data store directly.

Service Communication: Sync vs Async

Synchronous communication (HTTP REST, gRPC) is the obvious starting point. Service A calls service B, waits for a response. Use it when A genuinely needs B’s result to proceed. gRPC is often a strong choice for internal service-to-service calls: a binary protocol, strongly typed contracts via Protocol Buffers, bidirectional streaming, and generated client libraries that make contract changes visible at compile time.

The coupling is temporal: if B is slow, A is slow. If B is down, A must decide what to do. Circuit breakers (popularized by Netflix’s Hystrix library) prevent A from hammering a failing B - after a threshold of failures, the circuit “opens” and A fails fast without waiting for B to time out.
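The open/closed behavior can be sketched in a few lines. This is not Hystrix’s API; the CircuitBreaker class, its thresholds, and the half-open handling are simplified assumptions meant only to show the state machine.

```python
import time

class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is marked unhealthy."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast without calling the dependency")
            # Reset timeout elapsed: half-open, so let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Service A would wrap every call to B in breaker.call(...) and treat CircuitOpenError as the cue to serve a fallback: a cached value, a default, or an honest error.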

Asynchronous communication via events decouples services in time. A publishes an event (“order placed”) and continues; B consumes it when ready. Multiple services can react to the same event independently. This is how Kafka earns its place in microservices architectures - the order service publishes to an orders topic, and the inventory service, the notification service, and the analytics pipeline all consume from it independently.
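A toy in-process stand-in shows the shape of this. It is not the Kafka client API; the Topic class, the group names, and the event fields are all assumptions for illustration.

```python
class Topic:
    """Append-only in-memory log; each consumer group tracks its own offset."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # consumer group name -> index of next unread event

    def publish(self, event):
        # The publisher appends and returns immediately; it never waits on consumers.
        self.log.append(event)

    def poll(self, group):
        # Each group reads from its own position, independently of the others.
        start = self.offsets.get(group, 0)
        events = self.log[start:]
        self.offsets[group] = len(self.log)
        return events

orders = Topic()
orders.publish({"type": "order placed", "order_id": 1, "sku": "A-100", "qty": 2})

# The order is placed, but inventory has not reacted yet.
stock = {"A-100": 10}
stock_before_consume = stock["A-100"]

for event in orders.poll("inventory"):
    stock[event["sku"]] -= event["qty"]

# A second consumer group sees the same event on its own schedule.
emails = [f"Order {e['order_id']} confirmed" for e in orders.poll("notifications")]
```

Between publish and poll, stock still reads 10. That window is the eventual consistency the next paragraph describes.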

The price of async is eventual consistency. The inventory service will update, but not necessarily before the user’s confirmation page loads.

The rule of thumb: prefer async for workflows that cross multiple services and don’t need an immediate response. Prefer sync for queries that need fresh data right now.

Service Discovery: How Services Find Each Other

In a containerized environment, service instances come and go. Pods restart, scale out, move between nodes. You cannot hardcode IP addresses.

Consul is a service registry: services register themselves on startup with an address and a health check endpoint. Clients query Consul for the current list of healthy instances. Consul also provides distributed configuration storage and integrates with DNS.
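The register-then-query cycle can be sketched with an in-memory registry. This is not Consul’s API: the ServiceRegistry class is hypothetical, and it assumes heartbeat-style (TTL) health checks rather than the registry polling a health endpoint.

```python
import time

class ServiceRegistry:
    """Instances stay listed as healthy only while they keep sending heartbeats."""

    def __init__(self, ttl=10.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.last_seen = {}  # (service name, address) -> time of last heartbeat

    def register(self, service, address):
        self.last_seen[(service, address)] = self.clock()

    # A heartbeat simply refreshes the registration timestamp.
    heartbeat = register

    def healthy_instances(self, service):
        now = self.clock()
        return sorted(
            address
            for (name, address), seen in self.last_seen.items()
            if name == service and now - seen <= self.ttl
        )
```

A client asks healthy_instances("payment-service") before each call (or caches the answer briefly) and picks one address, so an instance that stops sending heartbeats drops out of rotation on its own.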

Kubernetes DNS handles this inside a cluster more simply: every Kubernetes Service gets a stable DNS name (payment-service.default.svc.cluster.local). Kubernetes tracks which pods back the Service and routes traffic only to those that pass their readiness checks. No separate registry required.
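The DNS name follows a fixed pattern: service name, namespace, then the cluster domain. A small helper makes the pattern explicit; the function name is hypothetical, and cluster.local is the common default cluster domain, which an individual cluster may override.

```python
def service_dns_name(service, namespace="default", cluster_domain="cluster.local"):
    """Build the in-cluster DNS name for a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"

payment_host = service_dns_name("payment-service")
billing_host = service_dns_name("invoice-api", namespace="billing")
```

Callers configure the name once and never see pod IP addresses.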

The API Gateway as the Edge

External clients should not know which microservice handles which request. The API gateway is the single entry point - it handles routing, authentication (validating JWTs before requests ever reach internal services), rate limiting, and request transformation.
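A sketch of the gateway’s decision order: authenticate, rate limit, then route. The route table, service names, and the handle function are assumptions. JWT validation is represented only by its result (`claims` is None when validation failed), since real validation belongs to a vetted library.

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill for the time elapsed since the last request, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical mapping of public path prefixes to internal services.
ROUTES = {"/orders": "order-service", "/payments": "payment-service"}

def handle(path, claims, bucket):
    """Return (HTTP status, upstream service) for one incoming request."""
    if claims is None:
        return 401, None  # rejected before reaching any internal service
    if not bucket.allow():
        return 429, None  # over the client's rate limit
    for prefix, upstream in ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return 200, upstream
    return 404, None
```

In practice there would be one bucket per client or API key, and the bucket state would live somewhere shared if the gateway runs as several instances.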

AWS API Gateway, Kong, and nginx-based setups are common choices. The gateway becomes critical infrastructure: it needs redundancy, careful capacity planning, and monitoring. A gateway that becomes a bottleneck has created a new single point of failure - not eliminated the old one.

Where the Cloud Lives: AWS ECS vs EKS

AWS gives you two main paths for running microservices. ECS (Elastic Container Service) is the simpler managed option - you define tasks and services in JSON, AWS handles scheduling containers across EC2 instances (or Fargate for serverless compute). ECS is a good choice when you want managed infrastructure without learning Kubernetes.

EKS (Elastic Kubernetes Service) gives you Kubernetes - the de facto standard for container orchestration at scale. Kubernetes handles deployment rollouts, auto-scaling, self-healing (restarting failed pods), and service discovery. The tradeoff is operational complexity: Kubernetes has a steep learning curve, significant surface area to configure, and its own failure modes.

Stripe, Uber, and Airbnb all run their service meshes on top of Kubernetes. A service mesh (Istio, Linkerd) adds a sidecar proxy to each pod - all network traffic goes through the proxy, which enforces mTLS between services, handles retries and circuit breaking, and emits telemetry. It moves cross-cutting concerns out of application code. It also adds latency and significant operational complexity - the “service mesh tax.”

Observability: You Cannot Manage What You Cannot See

In a monolith, a stack trace is diagnostic. In microservices, a request might touch eight services before returning. Without tooling, debugging is close to impossible.

Distributed tracing assigns a trace ID at the entry point. Every service propagates it in outbound calls (via HTTP headers, following the W3C Trace Context standard). Each service emits spans - records of work done - associated with the trace ID. Jaeger and Zipkin collect spans and reconstruct the full call graph. A flame graph shows which service and which operation consumed the most latency.
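The propagation step is small enough to sketch directly. In W3C Trace Context, the traceparent header carries four dash-separated hex fields: version, a 16-byte trace ID, an 8-byte parent span ID, and flags. The two helper names below are hypothetical, and a production service would normally let a tracing library handle this.

```python
import secrets

def new_traceparent():
    """Start a trace at the entry point: version 00, fresh IDs, sampled flag 01."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(incoming):
    """Build the header for an outbound call: same trace ID, new span ID."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

# The gateway starts the trace; each downstream hop derives its own header.
at_gateway = new_traceparent()
to_order_service = child_traceparent(at_gateway)
to_payment_service = child_traceparent(to_order_service)
```

Every hop shares the trace ID, which is what lets the collector stitch the spans back into one call graph.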

Centralized logging with structured logs (JSON) shipped to a log aggregator (Datadog, AWS CloudWatch Logs Insights, Elasticsearch) makes it possible to query across services. Including the trace ID in every log line provides the correlation: it connects each log entry to its trace.
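A minimal sketch of a structured log line that carries the trace ID, using Python’s standard logging module. The field names and the JsonFormatter class are assumptions, and the in-memory stream stands in for whatever output the log shipper reads.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # `trace_id` arrives via the `extra` argument on the logging call.
            "trace_id": getattr(record, "trace_id", None),
        })

stream = io.StringIO()  # stand-in for stdout or a file tailed by the log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example output out of the root logger

logger.info("charge authorized", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

With the ID in a named field, "show me every log line for this request, across all services" becomes a single query in the aggregator.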

Metrics per service - request rate, error rate, latency percentiles (p50, p99) - feed dashboards and alert on anomalies. The “four golden signals” (latency, traffic, errors, saturation) apply to each service independently.
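Percentiles matter because averages hide the slow tail. The sketch below uses the nearest-rank definition of a percentile (one of several in common use) on made-up latency samples; the percentile helper is written for illustration and is not from any metrics library.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up data: 98 fast requests and 2 very slow ones, in milliseconds.
latencies_ms = [20] * 98 + [900] * 2

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
```

Here the median is 20 ms and the mean is under 40 ms, yet p99 is 900 ms: one request in fifty is painfully slow, and only the tail percentile shows it.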

When Not to Use Microservices

Most startups should not start with microservices. The “microservices-first” antipattern is common and expensive. You don’t know your domain boundaries until you’ve built the product and discovered which parts change together and which are independent. Premature decomposition locks in wrong boundaries - boundaries that are expensive to change because they’re now encoded in separate services with separate teams.

A well-structured monolith with clean module boundaries - what some call the “Majestic Monolith” - gives you most of the organizational benefits without the operational overhead. When the team grows large enough, when specific components have genuinely different scaling needs, when deployment independence becomes a real bottleneck - that is when you extract services. This is the strangler fig pattern: route one domain at a time from the monolith to a new service, keeping the monolith functional while new services grow around it.
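The routing at the heart of the strangler fig pattern can be sketched as a table that grows one entry per extracted domain. The path prefixes and service names are hypothetical.

```python
# Domains already extracted from the monolith; everything else stays put.
MIGRATED_PREFIXES = {"/billing": "billing-service"}

def upstream_for(path, migrated=MIGRATED_PREFIXES, default="monolith"):
    """Send migrated domains to their new service and the rest to the monolith."""
    for prefix, service in migrated.items():
        if path == prefix or path.startswith(prefix + "/"):
            return service
    return default
```

Extracting the next domain is one new service plus one new table entry, and removing the entry is the rollback.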

The real prerequisite for microservices is operational maturity: each service needs its own CI/CD pipeline, deployment manifests, monitoring dashboards, and on-call rotation. A team of ten engineers cannot own fifteen services responsibly.

Failure Modes Worth Knowing

Cascading failures: Service A calls B calls C calls D. D slows down, C starts queuing requests, B’s connection pool exhausts, A times out for all users. Without bulkheads (limiting how many resources one dependency can consume) and circuit breakers, a single downstream slowdown can collapse the entire request path.
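A bulkhead can be as simple as a cap on concurrent calls per dependency. The sketch below rejects excess calls immediately instead of letting them queue; the Bulkhead class and its limit are assumptions for illustration.

```python
import threading

class BulkheadFullError(Exception):
    """Raised when a dependency already has its maximum number of calls in flight."""

class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject right away rather than pile up waiting threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("dependency at capacity; rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()

# One bulkhead per downstream dependency, sized to what that dependency may consume.
profile_bulkhead = Bulkhead(max_concurrent=10)
```

If the user-profile service slows down, at most ten threads are stuck waiting on it. The rest of the pool keeps serving requests that never needed a profile.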

Data consistency across services: Operations that span services cannot use database transactions. A “saga” pattern manages distributed transactions as a sequence of local transactions with compensating rollbacks if any step fails - but it’s complex to implement correctly and debug when something goes wrong.
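The core loop of a saga fits in a few lines, which is part of why its real complexity is easy to underestimate. In this sketch each step is a pair of callables (action, compensation). The step names are hypothetical, and a production saga would also persist its progress and retry failed compensations.

```python
def run_saga(steps):
    """Run each (action, compensation) pair in order; undo completed steps on failure."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Compensate in reverse order, then let the caller see the failure.
            for undo in reversed(completed):
                undo()
            raise
        completed.append(compensate)

log = []

def reserve_stock():
    log.append("reserve stock")

def release_stock():
    log.append("release stock")

def charge_card():
    raise RuntimeError("card declined")

def refund_card():
    log.append("refund card")

try:
    run_saga([(reserve_stock, release_stock), (charge_card, refund_card)])
except RuntimeError:
    log.append("saga aborted")
```

The log ends up as reserve, release, aborted. Note that the refund never runs, because the charge never completed. Compensations are new business operations rather than database rollbacks, so other services may have observed the reserved stock in the meantime.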

The Monday morning problem: Services now deploy independently, so there is no single release in which a breaking change can land everywhere at once. Coordinating a backwards-incompatible schema change across three services that call each other requires careful versioning, a deployment order, and a rollout period where both old and new behavior coexist. This is not impossible, but it is genuinely harder than running a migration in a monolith.
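During that rollout period, consumers have to accept both payload shapes. Here is a sketch of such a tolerant reader, assuming a hypothetical change from a decimal-string `amount` field to an integer `amount_cents` field.

```python
from decimal import Decimal

def read_amount_cents(payload):
    """Accept both payload versions while old and new producers coexist."""
    if "amount_cents" in payload:
        # New schema: integer number of cents.
        return int(payload["amount_cents"])
    if "amount" in payload:
        # Old schema: decimal string such as "19.99".
        return int(Decimal(payload["amount"]) * 100)
    raise KeyError("payload has neither 'amount_cents' nor 'amount'")

old_style = read_amount_cents({"order_id": 7, "amount": "19.99"})
new_style = read_amount_cents({"order_id": 8, "amount_cents": 1999})
```

The usual order is to deploy readers like this first, then switch producers to the new field, and remove the old branch only once no old producer remains.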

Multi-Region Considerations

At global scale, microservices architectures must also answer where each service runs. AWS’s multi-region deployment model has services deployed per-region, with regional routing via Route 53 latency-based or geolocation policies. Data sovereignty requirements (GDPR in Europe, data localization in India) mean some services must be region-specific and some data must never leave a geographic boundary.

Service meshes in a multi-region configuration need to handle cross-region failover. If the EU region for the payment service goes down, does traffic fail over to us-east-1? The answer affects your data processing agreements and latency SLAs simultaneously.
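The failover question can be written down as a policy function. The sketch below is a deliberate simplification: the region-to-jurisdiction table and the failover_target helper are assumptions, and real residency rules are a legal question, not a lookup table.

```python
# Simplified mapping from AWS region to the jurisdiction its data falls under.
JURISDICTION = {"eu-west-1": "EU", "eu-central-1": "EU", "us-east-1": "US"}

def failover_target(home, healthy, residency_required):
    """Pick the first healthy region that the service is allowed to fail over to."""
    for region in healthy:
        if region == home:
            continue
        if residency_required and JURISDICTION[region] != JURISDICTION[home]:
            continue  # crossing the boundary would violate data residency
        return region
    return None  # no permitted target: the service stays down in this region

healthy_now = ["us-east-1", "eu-central-1"]
payments_target = failover_target("eu-west-1", healthy_now, residency_required=True)
catalog_target = failover_target("eu-west-1", healthy_now, residency_required=False)
```

Here the payment service fails over within the EU, while a service without residency constraints takes the first healthy region. When no compliant region is healthy, the policy returns None, and that outcome is the tradeoff worth agreeing on before the outage rather than during it.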

Future Outlook

Serverless functions (AWS Lambda, Google Cloud Functions) push the microservices model to its logical extreme - individual functions as the unit of deployment, with no container to manage. Much of the server-management overhead disappears; the vendor lock-in risk increases. WebAssembly as a server-side deployment target (via WASI) promises function-level isolation lighter than containers, potentially reshaping what a “service” means.

The pendulum is also swinging back in some engineering organizations. Modular monoliths with well-defined internal module APIs are seeing renewed interest - the recognition that organizational and technical complexity can sometimes be managed within a single deployable unit, and that the distributed systems problems introduced by microservices are genuinely hard.

Summary

Dimension	Monolith	Microservices
Deployment	Single unit, all or nothing	Per-service, independent
Failure isolation	Poor - one crash affects all	Good - services fail independently
Operational overhead	Low	High
Debugging	Stack trace	Distributed tracing required
Team scalability	Bottlenecks at scale	Enables team autonomy
Data consistency	ACID transactions available	Eventual consistency, sagas
Good fit	Startups, small teams, unknown domains	Large orgs, mature domains, scale
