Polite Retry - Adaptive retry orchestration for distributed systems

Why Polite Retry?

🛡️

Retry Storm Prevention

Adaptive Retry Budgeting limits retry volume when downstream systems are already struggling.

⚡

Circuit Breaker

A built-in circuit breaker stops sending requests to failing services, giving them time to recover.

🎲

Smart Jitter

Multiple jitter strategies prevent synchronized retries that cause periodic load spikes.

📊

Backpressure Aware

Respects backpressure signals from downstream services to avoid overwhelming them.

📈

RAF Metrics

Track retry rates, success rates, and the retry amplification factor (RAF) for observability.

🔧

AI Infra Ready

Use retry budgets for LLM APIs, embedding services, vector databases, and agent tool calls.

Retries are distributed congestion control

Most retry libraries optimize for one question: how can this request succeed?

Polite Retry optimizes for a different question: how can the overall system remain stable?

When a service experiences partial failure, naive retry policies can make things much worse:

1. The service starts failing 50% of requests.
2. All clients retry their failed requests.
3. The service now receives up to 2x the load.
4. More requests fail, triggering more retries.
5. The cascade ends in collapse.

In a 3-tier system with a 50% failure rate and up to 3 retries per tier, request volume can amplify by about 6.6x.

Normal:     100 req → Service → 100 responses ✓

With naive retries during 50% failure:
            100 req ──┐
            100 retry ┼──► Service ──► Overload! 💥
            100 retry ┘
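
One way to arrive at a figure close to 6.6x is sketched below. It assumes that each attempt fails independently with the same probability, that a caller stops retrying as soon as an attempt succeeds, and that every attempt at one tier triggers a fresh call to the next tier. The function names are illustrative and are not part of polite-retry.

```typescript
// Expected attempts per original request at a single tier.
// Attempt k (k = 0 is the first try) only happens if the previous k attempts
// all failed, which has probability failureRate^k.
function expectedAttempts(failureRate: number, maxRetries: number): number {
  let attempts = 0;
  for (let k = 0; k <= maxRetries; k++) {
    attempts += Math.pow(failureRate, k);
  }
  return attempts;
}

// Amplification across tiers, assuming each attempt at one tier
// fans out into a full retry sequence at the tier below it.
function amplification(failureRate: number, maxRetries: number, tiers: number): number {
  return Math.pow(expectedAttempts(failureRate, maxRetries), tiers);
}

console.log(expectedAttempts(0.5, 3));            // 1.875
console.log(amplification(0.5, 3, 3).toFixed(2)); // 6.59
```

Under these assumptions a single tier sends 1.875 attempts per request on average, and three stacked tiers multiply that to roughly 6.6x.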

The Solution: Adaptive Retry Budgeting

Polite Retry implements Adaptive Retry Budgeting (ARB), based on research into retry amplification, cascading failures, and system-aware retry control.

Successful requests establish retry capacity. Failures consume it. When failure rates rise or downstream systems signal overload, retry capacity shrinks automatically.

import { retryWithBudget, AdaptiveRetryBudget } from 'polite-retry';

// Create a shared budget (one per downstream service)
const budget = new AdaptiveRetryBudget({
  initialBudget: 0.2,        // Allow 20% retry overhead
  highFailureThreshold: 0.3, // Reduce budget when >30% failing
});

// All requests share this budget
const data = await retryWithBudget(
  async () => {
    const res = await fetch('https://api.example.com/data');
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return res.json();
  },
  budget,
  { maxRetries: 3, jitter: 'full' }
);

// Check metrics
console.log(budget.getMetrics());
// { retryAmplificationFactor: 1.15, failureRate: 0.08, ... }
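
To make the budgeting idea concrete, here is a minimal token-style sketch of "successes establish retry capacity, failures consume it". It is a hypothetical illustration, not the polite-retry implementation: the class name SimpleRetryBudget, its methods, and its parameters are all assumptions, and it leaves out the adaptive part (shrinking the budget when failure rates rise or downstream systems signal overload).

```typescript
// Hypothetical sketch of a retry budget; not the library's internals.
class SimpleRetryBudget {
  private tokens = 0;   // retry capacity currently available
  private requests = 0; // original (non-retry) requests seen
  private retries = 0;  // retries that were allowed

  constructor(
    private readonly ratio: number,     // retry credit earned per success
    private readonly maxTokens: number, // cap so old successes cannot fund a later storm
  ) {}

  recordRequest(): void {
    this.requests++;
  }

  // Each success earns a fraction of one retry.
  recordSuccess(): void {
    this.tokens = Math.min(this.maxTokens, this.tokens + this.ratio);
  }

  // Spends one token and returns true if a retry is allowed; otherwise returns false.
  tryAcquireRetry(): boolean {
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    this.retries++;
    return true;
  }

  // Retry amplification factor: total attempts divided by original requests.
  amplificationFactor(): number {
    return this.requests === 0 ? 1 : (this.requests + this.retries) / this.requests;
  }
}

const sketchBudget = new SimpleRetryBudget(0.25, 10);
for (let i = 0; i < 8; i++) {
  sketchBudget.recordRequest();
  sketchBudget.recordSuccess();
}
console.log(sketchBudget.tryAcquireRetry());     // true
console.log(sketchBudget.tryAcquireRetry());     // true
console.log(sketchBudget.tryAcquireRetry());     // false (budget exhausted)
console.log(sketchBudget.amplificationFactor()); // 1.25
```

Because every client of a downstream service draws from the same budget in this sketch, retry volume stays bounded by recent successes rather than growing with the number of failures.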

Built from Research

Finding	What it means
Naive retries can reduce success rates	Retrying more is not always more reliable.
Only 4.9% of detected retry configurations used jitter	Many systems remain exposed to synchronized retry waves.
Multi-tier retries amplify request volume	Local retry choices become global load problems.
Adaptive Retry Budgeting limits retry storms	Retries become a bounded resource instead of an unlimited reaction.

Paper: Retry Amplification in Distributed Systems

Choose Your Strategy

Strategy	Use Case	Amplification Risk	Complexity
`retry()`	Simple retries with backoff	Medium	Low
`retryWithCircuitBreaker()`	Stop when service is down	Low	Medium
`retryWithBudget()`	Production microservices	Very Low	Medium
`retryWithProtection()`	Critical systems	Very Low	Higher
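
For readers unfamiliar with the circuit breaker pattern behind the second row, the following is a minimal, generic sketch of its state machine. It is not how `retryWithCircuitBreaker()` is implemented; the class name, constructor parameters, and the injectable clock are assumptions made for illustration.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

// Hypothetical circuit breaker sketch; not the library's internals.
class SimpleCircuitBreaker {
  private state: BreakerState = 'closed';
  private consecutiveFailures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,      // consecutive failures before opening
    private readonly resetTimeoutMs: number,        // how long to stay open before probing
    private readonly now: () => number = Date.now,  // injectable clock for testing
  ) {}

  // Returns false while the breaker is open. Once the reset timeout has elapsed,
  // the breaker moves to half-open and lets probe requests through.
  // (This sketch does not limit how many probes run at once.)
  canRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'half-open';
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = 'closed';
  }

  // A failed probe in half-open, or too many consecutive failures, opens the breaker.
  recordFailure(): void {
    this.consecutiveFailures++;
    if (this.state === 'half-open' || this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A caller would check canRequest() before each attempt and report the outcome with recordSuccess() or recordFailure(), so that retries stop entirely while the downstream service is given time to recover.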

Quick Start

import { retry } from 'polite-retry';

// Basic retry with exponential backoff and jitter
const data = await retry(
  async () => {
    const response = await fetch('https://api.example.com/data');
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return response.json();
  },
  {
    maxRetries: 3,
    initialDelayMs: 100,
    jitter: 'full', // Prevents synchronized retry storms
    onRetry: (error, attempt) => {
      console.log(`Attempt ${attempt} failed: ${error.message}`);
    }
  }
);
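
As background for the jitter option above, this sketch shows the commonly used "full jitter" calculation: pick a uniformly random delay between zero and the capped exponential backoff. Whether polite-retry computes its 'full' strategy exactly this way is an assumption; the function name and the maxDelayMs parameter are hypothetical.

```typescript
// Full jitter sketch: random delay in [0, min(maxDelayMs, initialDelayMs * 2^attempt)).
function fullJitterDelayMs(
  attempt: number,                    // 0 for the first retry, 1 for the second, ...
  initialDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random, // injectable so the result can be made deterministic
): number {
  const ceiling = Math.min(maxDelayMs, initialDelayMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// With a fixed "random" value of 0.5, the third retry waits half of 100 * 2^2 ms.
console.log(fullJitterDelayMs(2, 100, 10000, () => 0.5)); // 200
```

Spreading each client's delay across the whole interval, rather than adding a small offset to a fixed backoff, is what keeps many clients from retrying at the same instant.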
