Retries, Timeouts, and Error Handling

Every distributed system fails eventually. Networks time out, APIs return 500s, databases lock up under load, third-party services go down for maintenance. The question is never if something will fail — it's how your system responds when it does.

In a synchronous API route, error handling is straightforward: catch the exception, return a 500, log the error. The user knows something went wrong.

In a background workflow, it's more nuanced. The user isn't watching. The failure happens silently, minutes or hours after they interacted with your app. You need a strategy for:

  • Automatically recovering from transient failures
  • Knowing when to stop retrying (because retrying will never help)
  • Handling the case where retrying partially worked (steps that already succeeded)
  • Taking action when everything fails and no recovery is possible

Inngest has built-in answers to all of these. This article covers them exhaustively.


Quick Reference

Default retry count: 4 retries (5 total attempts) per step. Each step retries independently.

Retry strategy: Exponential backoff with jitter — delays grow: ~1s, ~2s, ~4s, ~8s...

Stop retrying immediately: Throw NonRetriableError — Inngest marks the step/function as failed, no more retries.

Control retry timing: Throw RetryAfterError with a duration — Inngest waits exactly that long before the next retry.

When all retries are exhausted: Inngest marks the function as "Failed" and calls your onFailure handler (if configured).

Step-level error catching: Failed steps throw StepError — you can catch it with try/catch to build fallback and rollback logic.


What You Need to Know First

Required reading (in order):

  1. Event-Driven Architecture: Why Your App Needs It
  2. Events, Queues, and Workers: The Building Blocks
  3. Inngest: What It Is and How It Fits In
  4. Your First Inngest Function
  5. Steps: Breaking Work into Durable Units

You should understand:

  • How steps are checkpointed and re-executed on retry
  • The difference between a step-level failure and a function-level failure
  • Why side effects inside steps don't repeat on retry (memoisation)

What We'll Cover in This Article

By the end of this guide, you'll understand:

  • How Inngest's default retry behaviour works, step by step
  • Exponential backoff — what it is and why it exists
  • How to configure retry counts per function
  • The attempt variable and how to use it
  • NonRetriableError — when and how to throw it
  • RetryAfterError — controlling retry timing for rate limits
  • Step-level error catching with try/catch and .catch()
  • Rollback logic when a step permanently fails
  • The onFailure handler — running cleanup code when a function exhausts all retries
  • The global inngest/function.failed event

What We'll Explain Along the Way

  • Exponential backoff with jitter (what that phrase actually means)
  • Transient vs permanent errors (and how to tell the difference in code)
  • The StepError type (what a failed step throws to the outer function)

Part 1: How Retries Work by Default

Let's start with a completely plain Inngest function — no special error handling, no custom retry config:

export const syncUserData = inngest.createFunction(
{ id: "sync-user-data" },
{ event: "user/data.sync.requested" },
async ({ event, step }) => {
const userData = await step.run("fetch-external-data", async () => {
return await externalApi.getUser(event.data.userId);
});

await step.run("save-to-database", async () => {
await db.users.upsert(event.data.userId, userData);
});
},
);

If externalApi.getUser() throws an error, here is the exact sequence of events:

Attempt 1:  "fetch-external-data" throws → fails
Inngest waits ~1 second

Attempt 2: "fetch-external-data" runs again → fails
Inngest waits ~2 seconds

Attempt 3: "fetch-external-data" runs again → fails
Inngest waits ~4 seconds

Attempt 4: "fetch-external-data" runs again → fails
Inngest waits ~8 seconds

Attempt 5: "fetch-external-data" runs again → fails
No more retries — step is marked "Failed"
Error bubbles up to the function
Function is marked "Failed"

Five attempts total: one initial attempt plus four retries. That's the default.

Notice: the "save-to-database" step never ran at all. The function never got past the first step. This is correct — there's nothing to save if we couldn't fetch the data.

Now say the external API was briefly down but recovered. Attempt 3 succeeds:

Attempt 1:  "fetch-external-data" throws → fails
Attempt 2: "fetch-external-data" throws → fails
Attempt 3: "fetch-external-data" succeeds → result saved ✅

Now running "save-to-database"...

Attempt 1: "save-to-database" succeeds ✅
Function complete.

The fetch step's retry counter is separate from the save step's retry counter. Each step gets its own independent set of attempts. A function configured with retries: 4 means each step gets up to 4 retries — not 4 retries shared across all steps.
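To make the per-step budget concrete, here is a small self-contained simulation of the two traces above. It is a toy model, not Inngest's implementation: `simulateRun` and its types are hypothetical names invented for this sketch, steps are plain synchronous functions, and there are no delays between attempts.

```typescript
// Toy model of the behaviour described above. It is not Inngest's implementation.

type Step = { id: string; fn: () => string }; // fn returns a result or throws

interface RunReport {
  status: "completed" | "failed";
  executions: Map<string, number>; // how many times each step body actually ran
}

function simulateRun(steps: Step[], retries: number): RunReport {
  const executions = new Map<string, number>();

  for (const { id, fn } of steps) {
    let succeeded = false;
    let count = 0;

    // Fresh budget for every step: 1 initial attempt + `retries` retries.
    for (let attempt = 0; attempt <= retries && !succeeded; attempt++) {
      count += 1;
      try {
        fn();
        succeeded = true; // memoised: a step that succeeded is never run again
      } catch {
        // Failed attempt: loop round and retry if any budget remains.
      }
    }

    executions.set(id, count);
    if (!succeeded) return { status: "failed", executions }; // later steps never start
  }

  return { status: "completed", executions };
}

// "fetch-external-data" fails twice, then succeeds; "save-to-database" works first time.
let fetchCalls = 0;
const report = simulateRun(
  [
    {
      id: "fetch-external-data",
      fn: () => {
        fetchCalls += 1;
        if (fetchCalls < 3) throw new Error("API unavailable");
        return "user-data";
      },
    },
    { id: "save-to-database", fn: () => "saved" },
  ],
  4,
);

console.log(report.status, Array.from(report.executions.entries()));
```

The fetch step uses three of its five possible attempts, and the save step still starts with a full budget of its own.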

Configuring retry count

Change the default with the retries option in your function configuration:

inngest.createFunction(
{
id: "send-critical-invoice",
retries: 10, // Each step gets 10 retries (11 total attempts)
},
{ event: "invoice/send.requested" },
async ({ event, step }) => {
/* ... */
},
);

inngest.createFunction(
{
id: "update-analytics-snapshot",
retries: 0, // No retries — fail immediately on first error
},
{ event: "analytics/snapshot.requested" },
async ({ event, step }) => {
/* ... */
},
);

There's no universal right answer for retry count. Consider:

  • Mission-critical operations (payments, invoices, contract signing): high retry counts, because the consequence of failure is significant
  • Best-effort operations (analytics, cache warming, non-essential notifications): lower retry counts or even zero, because retrying repeatedly isn't worth the resource cost
  • Idempotent operations: more retries are safe, since running twice has the same result as running once
  • Non-idempotent operations: fewer retries, and careful thought about whether retrying is even appropriate

Part 2: Exponential Backoff — Why Delays Grow

The retry delays follow a pattern: they grow exponentially, roughly doubling with each attempt, with a little randomness added on top. Here is why the design works this way.

The thundering herd problem

Imagine 10,000 users all trigger a function at the same moment. The external API they all call goes down at the same time. All 10,000 functions fail at once.

If retries were immediate or fixed-interval, all 10,000 functions would retry at the same time. The external API, already struggling, gets hit with another 10,000 requests simultaneously. It fails again. All 10,000 retry again. It fails again. This cascade — every failure causing another simultaneous burst of requests — is called the thundering herd problem.

Exponential backoff with jitter solves it in two ways. The jitter spreads retries out: after the first failure, some functions wait 1 second, others 1.1 seconds, others 0.9 seconds, so they no longer arrive in a single burst. The exponential growth lowers the pressure over time: after the second failure the delays are around 2 seconds, after the third around 4 seconds, and so on. With every round the struggling API sees fewer requests per second, which gives it room to recover.

The backoff schedule

Inngest's exact backoff schedule is open source. The approximate delays are:

Attempt             Approximate wait before retry
After attempt 1     ~1 second
After attempt 2     ~2 seconds
After attempt 3     ~4 seconds
After attempt 4     ~8 seconds
After attempt 5     ~16 seconds
After attempt 6     ~30 seconds
After attempt 7     ~1 minute
After attempt 8     ~2 minutes
After attempt 9     ~4 minutes
After attempt 10    ~8 minutes

Each delay is approximate because Inngest adds random jitter to prevent the thundering herd problem described above.
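As a rough illustration of the general technique, the sketch below computes a doubling delay with a cap and a small random jitter. It is a generic formula, not Inngest's actual schedule or source code: the function name `backoffDelayMs`, the 1-second base, the 10-minute cap, and the ±10% jitter are all assumptions chosen for the example.

```typescript
// Illustrative only: a generic exponential backoff with jitter,
// not Inngest's actual schedule or source code.

/**
 * Returns the delay in milliseconds to wait after a failed attempt.
 * `attempt` is zero-indexed (0 = the delay after the first failure).
 */
function backoffDelayMs(
  attempt: number,
  baseMs = 1_000, // assumed base delay
  maxMs = 10 * 60_000, // assumed cap of 10 minutes
  jitterRatio = 0.1, // assumed jitter of plus or minus 10%
  random: () => number = Math.random,
): number {
  // Exponential growth: base * 2^attempt, capped at maxMs.
  const exponential = Math.min(baseMs * 2 ** attempt, maxMs);
  // Jitter: scale by a random factor in [1 - jitterRatio, 1 + jitterRatio).
  const factor = 1 - jitterRatio + random() * 2 * jitterRatio;
  return Math.round(exponential * factor);
}

// Print one possible schedule for the first five failures.
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`after attempt ${attempt + 1}: ~${backoffDelayMs(attempt)} ms`);
}
```

The `random` parameter is injectable so the function can be tested deterministically; in normal use the default `Math.random` gives each run a slightly different delay.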


Part 3: The attempt Variable

Every function handler receives an attempt variable alongside event and step. It tells you which attempt the current execution is.

inngest.createFunction(
{ id: "send-notification" },
{ event: "notification/requested" },
async ({ event, step, attempt }) => {
// attempt is zero-indexed:
// 0 = first attempt (not a retry)
// 1 = first retry
// 2 = second retry
// ...and so on

console.log(`Running attempt ${attempt + 1}`); // "Running attempt 1", "Running attempt 2", etc.

await step.run("send-notification", async () => {
// You can use `attempt` to adjust behaviour on retries
// For example: use a different API endpoint if the primary has failed twice
const endpoint =
attempt >= 2
? notificationService.fallbackEndpoint
: notificationService.primaryEndpoint;

await endpoint.send({
userId: event.data.userId,
message: event.data.message,
});
});
},
);

attempt is reset between steps. When a new step starts, attempt goes back to 0. It only counts retries for the current step's execution.

This is useful for:

  • Logging: Add the attempt number to your log messages for better debugging
  • Alerting: Trigger a Slack alert after X failed attempts to get human eyes on a persistent failure
  • Fallback endpoints: Try a different service after several failed attempts on the primary
  • Progressive backoff strategies: Do something different on the last retry vs. early retries

For example, to alert the team once the same step has failed three times:

async ({ event, step, attempt }) => {
  // attempt === 3 is the third retry, so three attempts have already failed
  if (attempt === 3) {
    await alertingService.notify({
      channel: "#backend-alerts",
      message: `⚠️ Sync function has failed 3 times for user ${event.data.userId}`,
    });
  }

  await step.run("sync-data", async () => {
    await externalSync.run(event.data.userId);
  });
};

Part 4: NonRetriableError — When Retrying Will Never Help

Not every error is worth retrying. Consider these scenarios:

  • A user ID is malformed — retrying won't magically fix the ID
  • A record doesn't exist in the database — retrying won't create it
  • Authentication failed because the API key is invalid — retrying won't change the key
  • A webhook payload violates your schema — retrying the same data won't make it valid

In all these cases, retrying wastes time and resources. Worse, if you have a high retry count, a permanently failing function will keep running for many minutes before finally being marked "Failed" — delaying your alerting and incident response.

NonRetriableError solves this. Throw it and Inngest immediately marks the step (or function) as "Failed" — no further retries.

import { NonRetriableError } from "inngest";

export const reindexDocument = inngest.createFunction(
{ id: "reindex-document" },
{ event: "document/updated" },
async ({ event, step }) => {
const document = await step.run("fetch-document", async () => {
const doc = await db.documents.findById(event.data.documentId);

if (!doc) {
// This document doesn't exist. Retrying won't create it.
// Fail immediately, don't waste retry attempts.
throw new NonRetriableError(
`Document ${event.data.documentId} not found — skipping reindex`,
);
}

if (doc.deleted) {
// Document was deleted after the event was sent.
// No point reindexing a deleted document.
throw new NonRetriableError(
`Document ${event.data.documentId} has been deleted`,
);
}

return doc;
});

// Only reaches here if the document was found and is not deleted
await step.run("reindex-in-search", async () => {
await searchService.index(document);
});
},
);

Preserving the original error with cause

When you throw NonRetriableError, you can pass the original error as cause. This is important for debugging — it means the original stack trace and error message are preserved in Inngest's function logs, not just the NonRetriableError message.

try {
await db.records.update(event.data.recordId, { status: "imported" });
} catch (err) {
if (err.code === "RECORD_NOT_FOUND") {
// Pass the original error as `cause` so it appears in the Dev Server logs
throw new NonRetriableError(
"Record not found — cannot mark as imported",
{ cause: err }, // ← preserves the original error
);
}
throw err; // Re-throw other errors — let them retry normally
}

A decision tree for when to use NonRetriableError

Ask yourself: "If this step ran again in 60 seconds with exactly the same inputs, would it succeed?"

  • Yes → Let Inngest retry normally (transient failure: network blip, temporary service outage)
  • No → Throw NonRetriableError (permanent failure: invalid data, missing record, logic error)

// Transient, so retry: the external API might be back up in a moment
throw new Error("External API returned 503 Service Unavailable");

// Permanent — don't retry: the API key won't fix itself
throw new NonRetriableError(
"Authentication failed — API key is invalid or revoked",
);

// Transient — retry: the database might release the lock
throw new Error("Database deadlock detected");

// Permanent — don't retry: the record genuinely doesn't exist
throw new NonRetriableError(`User ${userId} not found in database`);
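When the failure comes from an HTTP call, the same question can often be answered from the status code. The sketch below is one possible heuristic, not a rule Inngest applies for you: `classifyHttpFailure` is a hypothetical helper, and the mapping (4xx is permanent except 408 and 429, 5xx is transient) is an assumption that you should adjust for the APIs you actually call.

```typescript
type Decision = "retry" | "fail-permanently";

/**
 * Heuristic classification of a failed HTTP response.
 * Assumption: most 4xx codes mean the request itself is wrong,
 * while 5xx codes mean the server may recover on its own.
 */
function classifyHttpFailure(status: number): Decision {
  if (status < 400) {
    throw new RangeError(`${status} is not an error status`);
  }
  // Request timeout and rate limiting are worth another attempt.
  if (status === 408 || status === 429) return "retry";
  // Other client errors: same inputs will give the same answer.
  if (status < 500) return "fail-permanently";
  // Server errors: likely transient.
  return "retry";
}

console.log(classifyHttpFailure(503)); // server unavailable
console.log(classifyHttpFailure(401)); // invalid credentials
```

Inside a step, a "fail-permanently" result would translate into throwing NonRetriableError, and a "retry" result into throwing a normal Error.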

Part 5: RetryAfterError — Controlling Retry Timing

Sometimes you know exactly when a retry should happen, because the system you're calling tells you.

The most common example: rate limiting. Many APIs return a Retry-After header with responses like 429 Too Many Requests. This header tells you: "I'm rate-limited right now, but try again in 30 seconds." With normal exponential backoff, Inngest might retry too early (wasting an attempt on another 429) or too late (waiting 2 minutes when 30 seconds would have worked).

RetryAfterError lets you tell Inngest exactly when to retry:

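import { RetryAfterError } from "inngest";

export const sendSms = inngest.createFunction(
  { id: "send-sms" },
  { event: "sms/send.requested" },
  async ({ event, step }) => {
    await step.run("send-via-sms-provider", async () => {
      // Call the provider's HTTP API directly so the status code and headers
      // are visible. SMS_API_URL is a placeholder for your provider's endpoint.
      const response = await fetch(SMS_API_URL, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          to: event.data.phoneNumber,
          body: event.data.message,
        }),
      });

      if (response.status === 429) {
        // Retry-After is assumed to be in seconds; fall back to 30 if it is missing or unparseable
        const retryAfterSeconds =
          Number.parseInt(response.headers.get("retry-after") ?? "", 10) || 30;

        throw new RetryAfterError(
          "SMS provider rate limit hit",
          retryAfterSeconds * 1000, // RetryAfterError accepts milliseconds as a number
        );
      }

      if (!response.ok) {
        // Any other failure is retried with the default backoff
        throw new Error(`SMS provider returned ${response.status}`);
      }

      return await response.json();
    });
  },
);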

RetryAfterError accepts the retry delay in three formats:

// As a number — milliseconds to wait
throw new RetryAfterError("Rate limited", 30000); // retry in 30 seconds

// As a duration string — same format as step.sleep()
throw new RetryAfterError("Rate limited", "30m"); // retry in 30 minutes

// As a Date — retry at a specific point in time
const tomorrow = new Date();
tomorrow.setDate(tomorrow.getDate() + 1);
throw new RetryAfterError("Daily quota exhausted", tomorrow); // retry tomorrow
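One detail worth handling when you read the header yourself: Retry-After can carry either a number of seconds or an HTTP date. The helper below is a minimal sketch that normalises both forms to milliseconds, which is one of the formats listed above. `retryAfterToMs` and its 30-second fallback are assumptions made for this example, not part of Inngest.

```typescript
/**
 * Converts a Retry-After header value into a delay in milliseconds.
 * Returns fallbackMs when the header is missing or cannot be parsed.
 */
function retryAfterToMs(
  header: string | null | undefined,
  fallbackMs = 30_000, // assumed default when the server gives no hint
  now: Date = new Date(),
): number {
  if (!header) return fallbackMs;
  const value = header.trim();

  // Form 1: a delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(value)) return Number(value) * 1000;

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const at = Date.parse(value);
  if (!Number.isNaN(at)) return Math.max(0, at - now.getTime());

  return fallbackMs;
}

console.log(retryAfterToMs("120"));
console.log(retryAfterToMs(undefined));
```

The result can be passed straight to RetryAfterError as its second argument.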

Combining attempt with RetryAfterError

You can combine these to implement increasingly conservative backoff on rate-limited services:

async ({ event, step, attempt }) => {
await step.run("call-rate-limited-api", async () => {
try {
return await api.fetch(event.data.resourceId);
} catch (err) {
if (err.status === 429) {
// Back off more aggressively on repeated rate-limit hits
const waitSeconds = Math.pow(2, attempt) * 15; // 15s, 30s, 60s, 120s...
throw new RetryAfterError(
`Rate limited — waiting ${waitSeconds}s`,
waitSeconds * 1000,
);
}
throw err; // other errors use default backoff
}
});
};

Part 6: Step-Level Error Catching — Fallbacks and Rollbacks

So far we've looked at errors that bubble up and fail the step. But you can also catch step failures and handle them gracefully — trying an alternative, rolling back previous work, or continuing with a degraded result.

How step errors work

When a step exhausts all its retries, it throws a StepError to the outer function handler. This is catchable with standard JavaScript try/catch.

export const generateImage = inngest.createFunction(
{ id: "generate-ai-image" },
{ event: "image/generation.requested" },
async ({ event, step }) => {
let imageUrl: string;
let generatedBy: "dall-e" | "midjourney";

// Try DALL-E first
try {
imageUrl = await step.run("generate-with-dall-e", async () => {
const result = await openai.images.generate({
prompt: event.data.prompt,
size: "1024x1024",
});
return result.data[0].url;
});
generatedBy = "dall-e";
} catch (err) {
// DALL-E failed after all retries — fall back to Midjourney
console.log("DALL-E failed, falling back to Midjourney", err.message);

imageUrl = await step.run("generate-with-midjourney", async () => {
const result = await midjourneyClient.imagine(event.data.prompt);
return result.imageUrl;
});
generatedBy = "midjourney";
}

// Notify the user regardless of which model succeeded
await step.run("notify-user", async () => {
await notifications.send(event.data.userId, {
type: "image-ready",
imageUrl,
generatedBy,
});
});
},
);

This is a genuinely useful pattern. DALL-E and Midjourney both have outages from time to time. Instead of failing the entire workflow, you try one, catch the failure, and seamlessly fall back to the other.

Using .catch() on a step

You can also use the Promise .catch() chain for a more concise fallback pattern:

// Ignore a step's failure entirely — treat it as optional
await step
.run("update-analytics-cache", async () => {
await analyticsService.refreshCache(event.data.userId);
})
.catch(() => {
// Cache update is best-effort. If it fails, continue anyway.
console.log("Analytics cache update failed — continuing");
});

// Following steps run regardless of whether the cache step succeeded
await step.run("send-confirmation", async () => {
await emailService.sendConfirmation(event.data.email);
});

Rollback logic

The most sophisticated use of step-level error catching is rollbacks — undoing the work of previous steps if a later step permanently fails.

export const createSubscription = inngest.createFunction(
{ id: "create-subscription" },
{ event: "subscription/create.requested" },
async ({ event, step }) => {
// Step 1: Create the subscription record in your DB
const subscription = await step.run(
"create-subscription-record",
async () => {
return await db.subscriptions.create({
userId: event.data.userId,
plan: event.data.plan,
status: "pending",
});
},
);

// Step 2: Charge the customer via Stripe
// If this fails permanently, we need to roll back the subscription record
await step
.run("charge-via-stripe", async () => {
const paymentIntent = await stripe.paymentIntents.create({
amount: event.data.amountCents,
currency: "usd",
customer: event.data.stripeCustomerId,
confirm: true,
});

if (paymentIntent.status !== "succeeded") {
throw new NonRetriableError(
`Payment failed with status: ${paymentIntent.status}`,
);
}

return { paymentIntentId: paymentIntent.id };
})
.catch(async (err) => {
// Stripe charge failed after all retries — roll back the subscription
await step.run("rollback-subscription-record", async () => {
await db.subscriptions.update(subscription.id, {
status: "cancelled",
cancelledReason: "payment_failed",
});
});

// Re-throw so the function still fails — we just cleaned up first
throw err;
});

// Step 3: Activate the subscription
await step.run("activate-subscription", async () => {
await db.subscriptions.update(subscription.id, { status: "active" });
await emailService.sendSubscriptionConfirmation(event.data.email);
});
},
);

This plays the role of a database transaction's ROLLBACK, implemented as a compensating action rather than an atomic undo. If the charge fails, the subscription record is moved to "cancelled" before the function fails, leaving your data in a consistent state.
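With more than two steps, the same idea generalises: remember an "undo" for each piece of work that completed, and on failure run the undos in reverse order before re-throwing. The sketch below shows that pattern in plain TypeScript, with no Inngest calls, so the ordering is easy to see. `runWithCompensation` and the action names are hypothetical; in a real function each `run` and `undo` would be wrapped in its own step.run.

```typescript
interface Action {
  name: string;
  run: () => Promise<void>;
  undo?: () => Promise<void>; // compensating action, if the work can be undone
}

async function runWithCompensation(actions: Action[]): Promise<void> {
  const completed: Action[] = [];
  try {
    for (const action of actions) {
      await action.run();
      completed.push(action);
    }
  } catch (err) {
    // Undo finished work in reverse order, then surface the original error.
    for (const action of completed.reverse()) {
      if (action.undo) await action.undo();
    }
    throw err;
  }
}

async function demo(): Promise<string[]> {
  const log: string[] = [];
  const record = (message: string) => async () => {
    log.push(message);
  };

  try {
    await runWithCompensation([
      { name: "create-record", run: record("create-record"), undo: record("undo create-record") },
      { name: "reserve-seat", run: record("reserve-seat"), undo: record("undo reserve-seat") },
      {
        name: "charge-card",
        run: async () => {
          throw new Error("card declined");
        },
      },
    ]);
  } catch (err) {
    log.push(`failed: ${err instanceof Error ? err.message : String(err)}`);
  }
  return log;
}

demo().then((log) => console.log(log));
```

Undoing in reverse order matters when later work depends on earlier work, for example releasing a reservation before cancelling the record it belongs to.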


Part 7: onFailure — When Everything Fails

NonRetriableError, rollback logic, try/catch — these handle failures during a function run. But what happens when a function exhausts all its retries and is permanently "Failed"?

This is where onFailure comes in. It's a special handler you define alongside your main function. It runs once, automatically, when your function fails for the final time.

export const processPayment = inngest.createFunction(
  {
    id: "process-payment",
    retries: 5,

    // This runs once the function has permanently failed
    onFailure: async ({ event, error }) => {
      // event is the inngest/function.failed event.
      // The ORIGINAL event that triggered the failing function is nested at event.data.event.
      // error is the error that caused the final failure.
      const original = event.data.event;

      console.error("Payment processing permanently failed", {
        userId: original.data.userId,
        orderId: original.data.orderId,
        error: error.message,
      });

      // Send an alert to the engineering team
      await alertingService.page({
        severity: "high",
        title: "Payment processing failure",
        details: `Order ${original.data.orderId} for user ${original.data.userId} could not be processed. Error: ${error.message}`,
      });

      // Notify the customer that something went wrong
      await emailService.sendPaymentFailureNotification({
        to: original.data.customerEmail,
        orderId: original.data.orderId,
      });

      // Flag the order for manual review
      await db.orders.update(original.data.orderId, {
        status: "payment_failed_manual_review",
        failureReason: error.message,
      });
    },
  },
  { event: "payment/process.requested" },
  async ({ event, step }) => {
    // ... the main payment processing function
  },
);

Key things to understand about onFailure:

It only runs once. Not on every failure — only when the function is permanently done with no more retries.

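It receives the failure event, which wraps the original event. The event argument is an inngest/function.failed event, and the event that triggered the failing function is nested at event.data.event. So you still have access to all the data that triggered the failing function: the user ID, order ID, whatever you put in its data.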

It receives the error. The error parameter is the last error that caused the function to fail.

It does not retry the original function. onFailure is cleanup/alerting logic. It won't re-attempt the original payment. If you need to re-queue the work, you'd do that explicitly inside onFailure.

It is itself retried if it fails. onFailure is a normal Inngest function under the hood — if it throws, it will retry with the default retry count.

The global failure handler

Sometimes you want a single handler for all function failures across your entire app — not one per function. Use the inngest/function.failed system event:

export const handleAnyFailure = inngest.createFunction(
{ id: "handle-any-function-failure" },
{ event: "inngest/function.failed" },
async ({ event }) => {
// event.data contains details about the failed function:
// - event.data.function_id: the ID of the function that failed
// - event.data.run_id: the specific run ID
// - event.data.error: the error details
// - event.data.event: the original event that triggered the function

await monitoringService.recordFailure({
functionId: event.data.function_id,
runId: event.data.run_id,
error: event.data.error.message,
triggeredBy: event.data.event.name,
});
},
);

This is useful for centralised observability — feeding all failures into a monitoring dashboard, error tracking service (like Sentry), or on-call alerting system without duplicating that logic in every function's onFailure handler.
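If several destinations consume these failures, it can help to reduce the payload to one line in a single place. The sketch below types only the four fields listed in the comments above and formats them. `FunctionFailedData` and `toAlertLine` are hypothetical names for this example, and the real payload may contain more fields than are modelled here.

```typescript
// Shape based on the fields listed above; anything beyond them is not modelled.
interface FunctionFailedData {
  function_id: string;
  run_id: string;
  error: { message: string };
  event: { name: string; data: Record<string, unknown> };
}

/** Builds a one-line summary suitable for a log line or chat alert. */
function toAlertLine(data: FunctionFailedData): string {
  return `[${data.function_id}] run ${data.run_id} failed on "${data.event.name}": ${data.error.message}`;
}

// Sample payload with made-up values, for illustration only.
const sample: FunctionFailedData = {
  function_id: "process-payment",
  run_id: "run-123",
  error: { message: "Card declined" },
  event: { name: "payment/process.requested", data: { orderId: "order-42" } },
};

console.log(toAlertLine(sample));
```

Inside the global handler, the call would be toAlertLine(event.data).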


Part 8: Building a Complete Error Strategy

Let's pull everything together with a realistic example — an order processing function that uses every error-handling tool we've covered.

import { NonRetriableError, RetryAfterError } from "inngest";

export const processOrder = inngest.createFunction(
{
id: "process-order",
retries: 5,

onFailure: async ({ event, error }) => {
// Permanent failure: alert the team and notify the customer.
// `event` is the inngest/function.failed event; the original order event is nested inside it.
const original = event.data.event;
await Promise.all([
alerting.page("Order processing failed permanently", {
orderId: original.data.orderId,
error: error.message,
}),
emailService.sendOrderFailureNotification(original.data.customerEmail),
db.orders.update(original.data.orderId, { status: "failed" }),
]);
},
},
{ event: "order/placed" },
async ({ event, step, attempt }) => {
// ── Validate the order ─────────────────────────────────────────────
// Validation errors are permanent — no point retrying
await step.run("validate-order", async () => {
const order = await db.orders.findById(event.data.orderId);

if (!order) {
throw new NonRetriableError(`Order ${event.data.orderId} not found`);
}

if (order.status !== "pending") {
throw new NonRetriableError(
`Order ${event.data.orderId} is not in pending state (status: ${order.status})`,
);
}
});

// ── Reserve inventory ──────────────────────────────────────────────
// Transient failures (DB lock, network blip) should retry.
// If inventory is genuinely unavailable, stop immediately.
const reservation = await step.run("reserve-inventory", async () => {
const result = await inventoryService.reserve(event.data.items);

if (result.status === "out_of_stock") {
throw new NonRetriableError(
`Items out of stock: ${result.unavailableItems.join(", ")}`,
);
}

return result;
});

// ── Charge payment ─────────────────────────────────────────────────
// Rate limit: back off gracefully if the payment processor is overloaded.
// Rollback: if payment fails after all retries, release inventory.
await step
.run("charge-payment", async () => {
const response = await paymentService.charge({
customerId: event.data.customerId,
amountCents: event.data.totalCents,
orderId: event.data.orderId,
});

if (response.status === 429) {
// Payment processor rate-limited us — wait for its signal
throw new RetryAfterError(
"Payment processor rate limit",
response.retryAfterMs,
);
}

if (response.status === "declined") {
// Card declined — not a transient error, don't retry
throw new NonRetriableError(
`Payment declined: ${response.declineCode}`,
);
}

return { chargeId: response.chargeId };
})
.catch(async (err) => {
// Payment failed permanently — release the inventory we reserved
await step.run("release-inventory-reservation", async () => {
await inventoryService.release(reservation.reservationId);
});

throw err; // Re-throw so the function is marked as failed
});

// ── Send confirmation ──────────────────────────────────────────────
// Best-effort — if this fails, don't fail the whole order.
// The order was charged successfully; the email is secondary.
await step
.run("send-confirmation-email", async () => {
await emailService.sendOrderConfirmation({
to: event.data.customerEmail,
orderId: event.data.orderId,
});
})
.catch((err) => {
// Log but don't propagate — the order itself is complete
console.warn(
"Confirmation email failed, but order is complete",
err.message,
);
});

// ── Log attempt count for observability ───────────────────────────
// attempt counts retries of the most recent step only, not of the whole run.
if (attempt > 0) {
console.log(
`Order ${event.data.orderId} completed; the last step needed ${attempt + 1} attempts`,
);
}

return { status: "completed", orderId: event.data.orderId };
},
);

This function demonstrates every error-handling layer:

  • NonRetriableError for validation failures and known permanent errors (out of stock, card declined)
  • RetryAfterError for rate-limited external services
  • .catch() on a critical step to trigger rollback logic (releasing inventory if payment fails)
  • .catch() on a best-effort step to swallow the error and continue
  • onFailure for final cleanup when the entire function gives up
  • attempt for logging retry count in observability output

Common Misconceptions

❌ Misconception: onFailure runs after every failed attempt

Reality: onFailure runs once, after the function has exhausted all its retries. If you have retries: 5, onFailure runs after the sixth attempt fails (the 5 retries plus the initial attempt). It does not run after each individual retry failure.

❌ Misconception: Catching a step error prevents the function from failing

Reality: It depends on what your catch block does. Catching a step error stops it from propagating, so if the handler swallows it, the function carries on and can still complete. If you re-throw inside catch, the function still fails. Use .catch(() => { /* swallow */ }) if you genuinely want to ignore a step failure and continue. Use try/catch with a re-throw if you want to run cleanup before the function fails.
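The difference is easy to see with plain promises. In the sketch below, `failedStep` is a hypothetical stand-in for a step that has already exhausted its retries; no Inngest code is involved, so this only illustrates the JavaScript control flow.

```typescript
// Stand-in for a step that has already exhausted its retries.
async function failedStep(): Promise<string> {
  throw new Error("step failed after all retries");
}

// Pattern 1: swallow the error. The function carries on and completes.
async function bestEffort(): Promise<string> {
  await failedStep().catch(() => {
    /* best-effort work: ignore the failure */
  });
  return "completed";
}

// Pattern 2: clean up, then re-throw. The function still fails.
async function cleanupThenFail(log: string[]): Promise<string> {
  try {
    await failedStep();
  } catch (err) {
    log.push("cleanup ran");
    throw err;
  }
  return "completed";
}

async function main(): Promise<string[]> {
  const log: string[] = [];
  log.push(`bestEffort: ${await bestEffort()}`);
  try {
    await cleanupThenFail(log);
  } catch (err) {
    log.push(`cleanupThenFail: ${err instanceof Error ? err.message : String(err)}`);
  }
  return log;
}

main().then((log) => console.log(log));
```

The first pattern ends in "completed"; the second runs its cleanup and then rejects with the original error.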

❌ Misconception: NonRetriableError skips the onFailure handler

Reality: NonRetriableError skips retries, but the function is still marked as "Failed", and onFailure (if configured) will run. The distinction is: retries are skipped, but the final-failure handling still happens.

❌ Misconception: You should use retries: 0 for idempotent functions

Reality: Idempotent functions are safer to retry, not worse. Setting retries: 0 means a single transient network blip permanently fails your function. For idempotent operations, a higher retry count is actually safer — retrying can't cause duplicate side effects, and you recover from transient failures automatically.


Troubleshooting Common Issues

Problem: A function keeps retrying forever and never fails

Symptoms: The function appears "Running" in the Dev Server for a very long time and doesn't complete.

One thing to check: your step may be throwing a non-error value (like a string or a plain object) instead of an actual Error instance. Non-Error values carry no stack trace and may not be reported the way you expect, so always throw a real Error.

// ❌ Throwing a string: no stack trace, and it may not be handled reliably
throw "Something went wrong";

// ✅ Always throw a proper Error (or Inngest error class)
throw new Error("Something went wrong");
throw new NonRetriableError("Permanent failure");

Also check: is your retry count set unreasonably high (e.g. retries: 100)? A function with 100 retries and 10-minute backoffs will be "running" for hours.
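To get a feel for the numbers, the sketch below adds up the waits for a given retry count. It assumes delays that double from 1 second and are capped at 10 minutes; those figures and the `totalWaitMs` helper are assumptions for illustration, not Inngest's published schedule, and jitter is ignored.

```typescript
/** Total time spent waiting between attempts, assuming doubling delays with a cap. */
function totalWaitMs(retries: number, baseMs: number, capMs: number): number {
  let total = 0;
  for (let retry = 0; retry < retries; retry++) {
    total += Math.min(baseMs * 2 ** retry, capMs);
  }
  return total;
}

// 100 retries, 1-second base, 10-minute cap (all assumed values).
const hours = totalWaitMs(100, 1_000, 10 * 60_000) / 3_600_000;
console.log(`100 retries: about ${hours.toFixed(1)} hours of waiting`);
```

Under these assumptions the first ten waits add up to about 17 minutes and the remaining ninety contribute 10 minutes each, for roughly 15 hours in total.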

Problem: onFailure is not being called

Diagnostic steps:

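// Minimal reproduction: a function that always fails, with a logging onFailure.
// retries: 0 makes it fail permanently on the first error, so onFailure should fire right away.
inngest.createFunction(
  {
    id: "my-function", // must be unique and stable
    retries: 0,
    onFailure: async ({ event, error }) => {
      // event.data.event is the original triggering event
      console.log("onFailure called for:", event.data.event, error.message); // does this log?
    },
  },
  { event: "my/event" },
  async ({ event }) => {
    throw new Error("Intentional failure for testing");
  },
);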

Common cause: the function is still retrying and hasn't exhausted its retries yet. Check the run timeline in the Dev Server. If it shows "Retrying", the run hasn't permanently failed.

Problem: You want to retry a failed function manually after fixing a bug

Solution: In the Dev Server (and Inngest Cloud), you can replay failed function runs directly from the dashboard. Open the run, click "Replay" — the function re-runs from the beginning with the same original event data, but your fixed code. We'll cover this in detail in Article 13: Observability — Reading Inngest's Run Logs (coming soon).


Check Your Understanding

Quick Quiz

1. A function has retries: 3 and three steps. The second step fails every single time. What is the maximum total number of step executions across the entire function run?

Show Answer

Let's count:

  • Step 1 runs once and succeeds: 1 execution
  • Step 2 fails on all 4 attempts (initial + 3 retries): 4 executions
  • Step 3 never runs because step 2 never succeeded: 0 executions

Total: 5 executions

Step 1 is memoised after its first success — it never re-runs. Step 3 never gets a chance to start. The function is marked "Failed" after step 2 exhausts all attempts.

2. What's the difference between throwing new Error(...) and new NonRetriableError(...)?

Show Answer

A standard Error tells Inngest: "something went wrong, but try again — it might work next time." Inngest schedules a retry with exponential backoff.

NonRetriableError tells Inngest: "this error cannot be resolved by retrying — stop now." Inngest immediately marks the step/function as "Failed" without scheduling any further retries. onFailure still runs if configured.

Use NonRetriableError when you know for certain that retrying the same operation with the same data will never succeed — for example, when a record doesn't exist, when input data is invalid, or when a payment was definitively declined.

3. You have a step that calls a third-party API. The API returns a Retry-After: 120 header (meaning "retry in 120 seconds"). Which error class should you throw, and what should you pass it?

Show Answer

Throw RetryAfterError with the delay in milliseconds (or as a string):

throw new RetryAfterError(
"Third-party API rate limit",
120 * 1000, // 120 seconds in milliseconds
);

// Or equivalently:
throw new RetryAfterError("Third-party API rate limit", "2m");

This tells Inngest to wait exactly 120 seconds before the next retry, rather than using the default exponential backoff (which might retry too early and waste an attempt on another 429).

Hands-On Challenge

Take the handleUserSignup function from Article 4. Add these three error-handling behaviours:

  1. If the user's email domain is on a known spam blocklist (check against a static array for simplicity), throw a NonRetriableError — there's no point retrying
  2. If the email service returns a 429 rate limit response, throw a RetryAfterError with a 60-second delay
  3. If the entire function fails after all retries, use onFailure to log the failure and mark the user account as "setup_failed" in the database

See a Suggested Solution

import { NonRetriableError, RetryAfterError } from "inngest";

const SPAM_DOMAINS = ["mailnull.com", "guerrillamail.com", "throwaway.email"];

export const handleUserSignup = inngest.createFunction(
  {
    id: "handle-user-signup",
    retries: 3,

    onFailure: async ({ event, error }) => {
      console.error("Signup function permanently failed", {
        userId: event.data.userId,
        error: error.message,
      });

      await db.users.update(event.data.userId, {
        status: "setup_failed",
        setupFailureReason: error.message,
      });
    },
  },
  { event: "user/account.created" },
  async ({ event, step }) => {
    // Validate email domain
    await step.run("validate-email-domain", async () => {
      const domain = event.data.email.split("@")[1];
      if (SPAM_DOMAINS.includes(domain)) {
        throw new NonRetriableError(
          `Email domain ${domain} is on the spam blocklist`,
        );
      }
    });

    const profile = await step.run("create-user-profile", async () => {
      return await db.profiles.create({
        userId: event.data.userId,
        email: event.data.email,
        name: event.data.name,
      });
    });

    const sendEmail = step.run("send-welcome-email", async () => {
      const result = await emailService.sendWelcome({
        to: event.data.email,
        name: event.data.name,
      });

      // Handle rate limiting from the email provider
      if (result.status === 429) {
        throw new RetryAfterError("Email service rate limit", "60s");
      }

      return result;
    });

    const createTrial = step.run("create-billing-trial", async () => {
      return await billingService.createTrial(event.data.userId);
    });

    await Promise.all([sendEmail, createTrial]);

    return { profileId: profile.profileId, status: "setup complete" };
  },
);

Summary: Key Takeaways

  • Default retry behaviour: 4 retries (5 total attempts) per step. Each step retries independently.
  • Exponential backoff with jitter prevents thundering herd problems — delays grow roughly as 1s, 2s, 4s, 8s... with randomness added to spread retries across time.
  • attempt is zero-indexed and available in the handler — useful for logging, alerts, and fallback logic. It resets between steps.
  • NonRetriableError stops all retries immediately. Use it when you know retrying the same inputs will never succeed. Pass { cause: originalError } to preserve the original stack trace in logs.
  • RetryAfterError controls exactly when the next retry happens. Essential for respecting Retry-After headers from rate-limited APIs.
  • Step-level try/catch lets you catch a failed step's error and run fallback logic, rollbacks, or alternative paths — without failing the whole function.
  • .catch() on a step lets you swallow failures for best-effort operations or chain rollback steps.
  • onFailure runs once when a function permanently fails, whether it exhausted all retries or a NonRetriableError stopped it early. Use it for cleanup, alerting, and customer notification.
  • inngest/function.failed is a system event you can listen to globally — useful for centralised failure monitoring across all functions.
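
To make the backoff-with-jitter takeaway concrete, here is a minimal sketch of how such a delay can be computed. It is an illustration of the general technique only: Inngest's actual schedule is internal and may differ, and `backoffDelayMs`, `baseMs`, and `jitterRatio` are assumed names and values chosen to match the "~1s, ~2s, ~4s, ~8s" shape described in this article.

```typescript
// Illustration only: not Inngest's implementation.
// baseMs and jitterRatio are assumed values, not documented defaults.
function backoffDelayMs(
  attempt: number, // zero-indexed retry number
  random: () => number = Math.random,
  baseMs = 1000,
  jitterRatio = 0.25,
): number {
  // Nominal delay doubles with each retry: 1s, 2s, 4s, 8s...
  const exponential = baseMs * 2 ** attempt;
  // Scale the delay by a random factor in [1 - jitterRatio, 1 + jitterRatio)
  // so that many failing runs do not all retry at the same instant.
  const jitter = 1 + (random() * 2 - 1) * jitterRatio;
  return Math.round(exponential * jitter);
}

for (let attempt = 0; attempt < 4; attempt++) {
  console.log(`retry ${attempt + 1}: ~${backoffDelayMs(attempt)}ms`);
}
```

Passing the random source as a parameter keeps the function deterministic in tests while defaulting to Math.random in normal use.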

What's Next?

You now have a complete picture of how Inngest handles failure at every level: individual step retries, intelligent error classification, rollback logic, and final-failure handling.

In Article 7: Fan-Out — Triggering Multiple Tasks from One Event (coming soon), we move from resilience to architecture. You'll learn how to design systems where one event triggers multiple independent functions in parallel — the fan-out pattern that powers everything from order processing to notification pipelines.


Version Information

Tested with:

  • inngest: ^4.1.x
  • Node.js: v18.x, v20.x, v22.x
  • TypeScript: 5.x

Further reading: