systems · March 21, 2026 · 4 min read

Circuit Breakers: Stopping Cascading Failures Before They Spread

What a circuit breaker does, how it protects your system from a single slow dependency taking everything down, and when it's worth the added complexity.


What Is a Circuit Breaker?

A circuit breaker is a pattern that stops calls to a failing service before they pile up and take your whole system down.

The pattern is named after the electrical component. When too much current flows through an electrical circuit, the breaker trips and cuts the connection, protecting everything downstream. The same idea applies in software: when a dependency is failing, stop calling it. Fail fast instead of waiting.


The Problem It Solves

Imagine Service A calls Service B. Service B starts responding slowly: 30 seconds instead of 200ms. Service A's threads are now stuck waiting. New requests come in and tie up more threads, which also block on B. The thread pool is exhausted. Service A stops responding to its own callers, and their requests pile up in turn. Now the entire system is down because one dependency got slow.

This is a cascading failure. One slow service takes down everything upstream.

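A circuit breaker short-circuits this pattern. Instead of waiting 30 seconds for B to respond, Service A detects the failure pattern and immediately returns an error (or a fallback) in milliseconds. No waiting. No thread exhaustion. No cascade. For slow responses to register as failures at all, calls to B need a timeout; the breaker then counts timeouts alongside errors.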


How It Works

The circuit breaker has three states:

Closed (normal operation) — requests pass through to the downstream service. The breaker monitors success/failure rates.

Open (failing) — too many failures occurred. The breaker trips open. All requests immediately return an error without calling the downstream service. No actual calls are made.

Half-open (recovery check) — after a timeout, the breaker allows one test request through. If it succeeds, the breaker closes again. If it fails, it opens again.

```plain
Closed → (failures exceed threshold) → Open
Open → (timeout expires) → Half-open
Half-open → (test succeeds) → Closed
Half-open → (test fails) → Open
```
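The four transitions above can be sketched as a small Python class. This is a minimal illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are hypothetical, the default values are assumptions, and the class trips on consecutive failures only. It is also not thread-safe, so under concurrency more than one test request could slip through in the half-open state.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed default, tune per dependency
        self.reset_timeout = reset_timeout          # seconds to stay open; assumed default
        self._clock = clock                         # injectable so tests can control time
        self._state = State.CLOSED
        self._failures = 0                          # consecutive failures seen so far
        self._opened_at = 0.0

    @property
    def state(self):
        # Open -> Half-open once the timeout has expired.
        if self._state is State.OPEN and self._clock() - self._opened_at >= self.reset_timeout:
            self._state = State.HALF_OPEN
        return self._state

    def call(self, func, *args, **kwargs):
        # While open, fail fast without touching the dependency.
        if self.state is State.OPEN:
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Closed stays closed; Half-open -> Closed.
        self._failures = 0
        self._state = State.CLOSED

    def _on_failure(self):
        self._failures += 1
        # Half-open -> Open on any failure; Closed -> Open at the threshold.
        if self._state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self._state = State.OPEN
            self._opened_at = self._clock()
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile` function, and treat `CircuitOpenError` as the signal to use a fallback.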

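Common starting thresholds: trip open after 5 consecutive failures, or when the failure rate reaches 50% within a 10-second window. A typical wait before moving to half-open is 30–60 seconds. Treat these as defaults to tune against your own traffic.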
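The failure-rate variant needs a rolling window of recent outcomes. Here is one possible sketch, assuming a hypothetical `FailureRateWindow` class. The `min_calls` guard is an addition that is not part of the thresholds above: it keeps a single failed call on a quiet service from counting as a 100% failure rate.

```python
import time
from collections import deque


class FailureRateWindow:
    def __init__(self, window_seconds=10.0, rate_threshold=0.5, min_calls=10,
                 clock=time.monotonic):
        self.window_seconds = window_seconds
        self.rate_threshold = rate_threshold
        self.min_calls = min_calls   # assumption: ignore windows with too few samples
        self._clock = clock          # injectable so tests can control time
        self._events = deque()       # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        """Record the outcome of one call to the dependency."""
        self._events.append((self._clock(), succeeded))
        self._evict()

    def _evict(self):
        # Drop outcomes that have aged out of the window.
        cutoff = self._clock() - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def failure_rate(self):
        self._evict()
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_trip(self):
        self._evict()
        if len(self._events) < self.min_calls:
            return False
        return self.failure_rate() >= self.rate_threshold
```

A breaker would call `record(...)` after every request and trip open when `should_trip()` returns true.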


When to Add It

Add a circuit breaker when:

  • Your service has external dependencies — third-party APIs, other internal services, databases
  • You need high availability, for example a 99.9%+ uptime target
  • You're in a microservices architecture — each service call is a potential failure point
  • A slow dependency would otherwise exhaust your thread pool or connection pool

Rule of thumb

Any outbound network call that crosses a service boundary is a candidate for a circuit breaker. If it can fail, it will eventually fail, and you want to control what happens when it does.

When NOT to Add It

  • Simple monolithic applications — fewer service boundaries, less need
  • In-process function calls — circuit breakers are for network calls
  • Early-stage products, where the operational overhead isn't justified yet

Note

A circuit breaker adds complexity. You need to decide what the fallback behavior is when the circuit is open. Return a cached result? Return an empty list? Return an error? The breaker delivers its full value only when that fallback is defined.

Fallback Behavior — the Part That Matters

A circuit breaker without a fallback strategy just converts a slow failure into a fast failure. That's better, but incomplete.

Good fallback patterns:

  • Cached response — return the last known good response
  • Degraded response — return an empty result or default value
  • Queueing — buffer the request and retry when the circuit closes
  • User messaging — show the user a "this feature is temporarily unavailable" message

Common mistake

Opening the circuit and returning a generic 500 error. Users see "something went wrong" with no explanation and no retry path. Design the fallback first, then add the circuit breaker.
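To make the cached-response and degraded-response patterns concrete, here is a small sketch. Everything in it is hypothetical: `DependencyUnavailable` stands in for whatever error your breaker raises when the circuit is open, and `CachedFallback` is an illustrative wrapper, not a library API. It returns the value together with a label saying where it came from, so the caller can show a "temporarily unavailable" message when the data is stale or defaulted.

```python
class DependencyUnavailable(Exception):
    """Stand-in for the error a breaker raises when the circuit is open."""


class CachedFallback:
    def __init__(self, fetch, default=None):
        self._fetch = fetch        # function that calls the dependency (via the breaker)
        self._default = default    # degraded response when nothing is cached
        self._last_good = {}       # last known good response per key

    def get(self, key):
        """Return (value, source), where source is 'live', 'cached', or 'default'."""
        try:
            value = self._fetch(key)
        except DependencyUnavailable:
            if key in self._last_good:
                return self._last_good[key], "cached"
            return self._default, "default"
        # Remember every successful response for later fallback.
        self._last_good[key] = value
        return value, "live"
```

An unbounded dictionary is used only to keep the sketch short; a real cache would need a size limit and an expiry policy so that stale data is not served indefinitely.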

Real World

Netflix built Hystrix specifically for this problem — they had hundreds of microservices and a single slow dependency was taking down unrelated features. Hystrix let them isolate failures: if the recommendations service was slow, it didn't affect video playback.

The approach Hystrix encouraged: wrap each outbound dependency call in a circuit breaker with a defined fallback.


Takeaways

  • Circuit breakers detect failing dependencies and short-circuit calls to them — fast failure instead of slow failure
  • Three states: closed (normal) → open (failing) → half-open (recovery check)
  • They prevent cascading failures in distributed systems
  • They need a defined fallback: decide up front what happens when the circuit is open
  • Essential in microservices; overkill in simple monoliths
  • Netflix's Hystrix (Java, now in maintenance mode), Resilience4j (Java), and Polly (.NET) are common implementations