# Getting Started Guide
This guide will walk you through using Polite Retry to build resilient distributed systems.
## Installation

```bash
npm install polite-retry
```
## Quick Start

The simplest way to use Polite Retry is with the `retry()` function:
```ts
import { retry } from 'polite-retry';

const data = await retry(
  async () => {
    const response = await fetch('https://api.example.com/data');
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return response.json();
  },
  {
    maxRetries: 3,
    initialDelayMs: 100,
    jitter: 'full',
  }
);
```
## Core Concepts

### Retry Amplification
When services fail, naive retry policies can make things worse. Consider a 3-tier system:

- Tier 3 starts failing 50% of requests
- Tier 2 retries those failures, multiplying the load it sends to Tier 3
- Tier 1 retries Tier 2's failures, multiplying the load again
- Result: Tier 3 receives 6.6x its normal load

This is called retry amplification, and it's why Polite Retry exists.
### Retry Budget

A retry budget caps total retry overhead. With a 20% budget, at most 20 retries are issued for every 100 original requests, which bounds amplification.
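To make the budget arithmetic concrete, here is a minimal sketch of the check (illustrative only; `RetryBudgetSketch` is not part of polite-retry's API):

```ts
// Minimal retry-budget sketch (illustrative, not polite-retry's internals).
// A retry is allowed only while total retries stay under `budget` percent
// of the original requests seen so far.
class RetryBudgetSketch {
  private requests = 0;
  private retries = 0;

  constructor(private readonly budget: number) {} // e.g. 0.2 = 20%

  recordRequest(): void {
    this.requests++;
  }

  tryConsumeRetry(): boolean {
    if (this.retries < this.requests * this.budget) {
      this.retries++;
      return true;
    }
    return false; // budget exhausted: fail fast instead of amplifying load
  }
}

const budget = new RetryBudgetSketch(0.2);
for (let i = 0; i < 100; i++) budget.recordRequest();

// With 100 requests and a 20% budget, only 20 retries are granted.
let granted = 0;
for (let i = 0; i < 50; i++) if (budget.tryConsumeRetry()) granted++;
console.log(granted); // 20
```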
### Backpressure

Downstream services can signal when they're overloaded. Polite Retry respects these signals and stops retrying when a service is struggling.
## Basic Retry

Use `retry()` for simple scenarios with exponential backoff:
```ts
import { retry } from 'polite-retry';

const result = await retry(
  async () => fetchData(),
  {
    maxRetries: 3,         // Try up to 4 times total
    initialDelayMs: 100,   // Start with 100ms delay
    maxDelayMs: 10000,     // Cap at 10 seconds
    backoffMultiplier: 2,  // Double delay each time
    jitter: 'full',        // Add randomization
    timeoutMs: 5000,       // 5s timeout per attempt
    retryIf: (error) => {
      // Only retry server errors (5xx), not client errors (4xx)
      return !error.message.includes('HTTP 4');
    },
    onRetry: (error, attempt, delay) => {
      console.log(`Retry ${attempt} in ${delay}ms: ${error.message}`);
    }
  }
);
```
## Circuit Breaker Pattern

The circuit breaker stops all requests when a service is down, giving it time to recover:
```ts
import { retryWithCircuitBreaker, CircuitBreaker } from 'polite-retry';

// Create one circuit breaker per downstream service
const paymentBreaker = new CircuitBreaker({
  failureThreshold: 0.5,  // Open after 50% failures
  windowSize: 10,         // Look at last 10 requests
  resetTimeoutMs: 30000,  // Test again after 30s
  onStateChange: (state) => {
    console.log(`Circuit is now: ${state}`);
  }
});

// Use for all requests to this service
const result = await retryWithCircuitBreaker(
  async () => chargePayment(amount),
  paymentBreaker,
  { maxRetries: 3 }
);
```
The breaker moves between three states:

- Closed: Normal operation, requests pass through
- Open: Service is down, requests fail immediately
- Half-Open: Testing if the service has recovered
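The three states can be captured in a compact state machine. This is an illustrative sketch only; the real `CircuitBreaker` internals may differ:

```ts
// Illustrative circuit-breaker state machine (not CircuitBreaker's real code).
type State = 'closed' | 'open' | 'half-open';

class BreakerSketch {
  private state: State = 'closed';
  private failures: boolean[] = []; // rolling window of outcomes
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 0.5, // open at 50% failures
    private readonly windowSize = 10,
    private readonly resetTimeoutMs = 30000,
  ) {}

  canRequest(now: number): boolean {
    if (this.state === 'open' && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'half-open'; // let one probe request through
    }
    return this.state !== 'open';
  }

  record(success: boolean, now: number): void {
    if (this.state === 'half-open') {
      // Probe result decides: recover fully or trip again.
      this.state = success ? 'closed' : 'open';
      if (!success) this.openedAt = now;
      this.failures = [];
      return;
    }
    this.failures.push(!success);
    if (this.failures.length > this.windowSize) this.failures.shift();
    const rate = this.failures.filter(Boolean).length / this.failures.length;
    if (this.failures.length >= this.windowSize && rate >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = now;
    }
  }

  getState(): State {
    return this.state;
  }
}

const b = new BreakerSketch();
for (let i = 0; i < 10; i++) b.record(i % 2 === 0, 0); // 50% failures
console.log(b.getState()); // 'open'
```

After `resetTimeoutMs` elapses, `canRequest` transitions the sketch to half-open so a single probe can test recovery.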
## Adaptive Retry Budgeting

This is the recommended approach for production. Adaptive Retry Budgeting (ARB) dynamically limits retries based on observed failure rates:
```ts
import { retryWithBudget, AdaptiveRetryBudget } from 'polite-retry';

// Create one budget per downstream service
const apiBudget = new AdaptiveRetryBudget({
  initialBudget: 0.2,         // Allow 20% retry overhead
  highFailureThreshold: 0.3,  // Reduce budget when >30% failing
  lowFailureThreshold: 0.05,  // Restore budget when <5% failing
  adjustmentIntervalMs: 1000,
  onBudgetChange: (budget, failureRate) => {
    metrics.gauge('retry_budget', budget);
    metrics.gauge('failure_rate', failureRate);
  }
});

// Use for all requests to this service
const data = await retryWithBudget(
  async () => fetchFromAPI(),
  apiBudget,
  { maxRetries: 3, jitter: 'full' }
);

// Clean up when shutting down
process.on('SIGTERM', () => {
  apiBudget.dispose();
});
```
### How ARB Works
| Situation | Budget Action |
|---|---|
| Failure rate < 5% | Increase budget (up to initial) |
| Failure rate 5-30% | Keep budget stable |
| Failure rate > 30% | Decrease budget by 50% |
| Backpressure signal | Stop retries immediately |
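The adjustment rule in the table can be sketched as a plain function. Parameter names mirror the config above, but this is an illustration, not `AdaptiveRetryBudget`'s actual logic:

```ts
// Illustrative budget-adjustment step implementing the table above
// (not AdaptiveRetryBudget's actual internals).
// A backpressure signal would drop the budget to 0; omitted here.
function adjustBudget(
  budget: number,
  failureRate: number,
  opts = {
    initialBudget: 0.2,
    lowFailureThreshold: 0.05,
    highFailureThreshold: 0.3,
  },
): number {
  if (failureRate > opts.highFailureThreshold) {
    return budget * 0.5; // failing hard: halve the budget
  }
  if (failureRate < opts.lowFailureThreshold) {
    return Math.min(budget * 1.2, opts.initialBudget); // recovering: restore, capped at initial
  }
  return budget; // in between: hold steady
}

console.log(adjustBudget(0.2, 0.5)); // 0.1
```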
## Combined Protection

For critical systems, combine both a circuit breaker and an adaptive budget:
```ts
import {
  retryWithProtection,
  CircuitBreaker,
  AdaptiveRetryBudget
} from 'polite-retry';

const breaker = new CircuitBreaker({ failureThreshold: 0.5 });
const budget = new AdaptiveRetryBudget({ initialBudget: 0.2 });

const result = await retryWithProtection(
  async () => criticalOperation(),
  { circuitBreaker: breaker, budget },
  { maxRetries: 3, jitter: 'full' }
);
```
## Backoff Strategies

Backoff determines how long to wait between retries:
```ts
// Delays: 100ms, 200ms, 400ms, 800ms...
{
  initialDelayMs: 100,
  backoffMultiplier: 2,
  maxDelayMs: 30000  // Cap at 30 seconds
}
```
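Those delays follow the standard exponential-backoff formula. A sketch of the computation (an illustrative helper, not part of polite-retry's public API):

```ts
// Exponential backoff: delay = initialDelayMs * multiplier^attempt, capped.
// Illustrative helper, not part of polite-retry's public API.
function backoffDelay(
  attempt: number, // 0 for the first retry, 1 for the second, ...
  initialDelayMs = 100,
  backoffMultiplier = 2,
  maxDelayMs = 30000,
): number {
  return Math.min(initialDelayMs * Math.pow(backoffMultiplier, attempt), maxDelayMs);
}

console.log([0, 1, 2, 3].map((a) => backoffDelay(a))); // [ 100, 200, 400, 800 ]
console.log(backoffDelay(10)); // 102400ms, capped to 30000
```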
### Jitter Explained

Jitter adds randomness to prevent synchronized retries:
| Strategy | Formula | Best For |
|---|---|---|
| `'none'` | delay (no change) | Testing only |
| `'full'` | random(0, delay) | General use (recommended) |
| `'equal'` | delay/2 + random(0, delay/2) | When minimum delay matters |
| `'decorrelated'` | random(base, prev * 3) | Correlated sequences |
Never use `jitter: 'none'` in production. It causes synchronized retry storms.
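The formulas in the table translate directly to code. This sketch shows what each strategy computes (the library applies these internally; the function itself is illustrative):

```ts
// The jitter formulas from the table, as plain functions (illustrative).
const rand = (min: number, max: number) => min + Math.random() * (max - min);

function applyJitter(
  strategy: 'none' | 'full' | 'equal' | 'decorrelated',
  delay: number,
  base = 100, // initial delay, used by 'decorrelated'
  prev = 100, // previous jittered delay, used by 'decorrelated'
): number {
  switch (strategy) {
    case 'none':
      return delay; // deterministic: every client retries in lockstep
    case 'full':
      return rand(0, delay); // spread uniformly over [0, delay]
    case 'equal':
      return delay / 2 + rand(0, delay / 2); // always wait at least half the delay
    case 'decorrelated':
      return rand(base, prev * 3); // each delay derived from the previous one
  }
}
```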
## Backpressure Signaling

Allow downstream services to signal when they're overloaded:
### Server Side (Express)
```ts
import { RequestCounter, createBackpressureMiddleware } from 'polite-retry';

const counter = new RequestCounter();

// Track active requests
app.use(counter.middleware());

// Add backpressure headers to responses
app.use(createBackpressureMiddleware({
  getLoadLevel: () => counter.getCount() / 100,
  overloadThreshold: 0.8,
}));
```
### Client Side
```ts
import { BackpressureManager, AdaptiveRetryBudget } from 'polite-retry';

const backpressure = new BackpressureManager();
const budget = new AdaptiveRetryBudget({
  checkBackpressure: () => backpressure.isOverloaded('api-service'),
});

// Record backpressure from responses
const response = await fetch('/api/data');
backpressure.recordFromHeaders('api-service', response.headers);
```
## Metrics and Monitoring
```ts
const budget = new AdaptiveRetryBudget({
  onBudgetChange: (budget, failureRate) => {
    // Send to your metrics system
    statsd.gauge('retry.budget', budget);
    statsd.gauge('retry.failure_rate', failureRate);
  }
});

// Periodically export metrics
setInterval(() => {
  const m = budget.getMetrics();
  statsd.gauge('retry.amplification_factor', m.retryAmplificationFactor);
  statsd.counter('retry.total_requests', m.totalRequests);
  statsd.counter('retry.total_retries', m.totalRetries);
}, 10000);
```
## Dos and Don'ts

### ✅ Do
- Use jitter: Always use `jitter: 'full'`
- Share budgets: One budget per downstream service
- Limit retries: 3 retries is usually enough
- Set timeouts: Don't wait forever for a response
- Be selective: Only retry transient errors
- Monitor: Track retry rates and amplification
### ❌ Don't
- Immediate retries: Always use backoff
- Retry everything: Don't retry 4xx errors
- Infinite retries: Cap at 3-5 attempts
- Ignore backpressure: Respect overload signals
- Create a budget per request: Share one budget across requests
## Production Tips

- Start conservative: Begin with `initialBudget: 0.1` (10%)
- Monitor amplification: Alert if the retry amplification factor (RAF) exceeds 1.5
- Implement backpressure: Add headers to your services
- Clean up: Call `budget.dispose()` on shutdown
- Test failure scenarios: Use chaos engineering