Message Queues: The Problem They Solve

It's 2 AM. Your phone buzzes. A user's profile photo upload is timing out — and it's been happening for the last 40 minutes. You pull up the logs. The image resizer is taking 8 seconds per image under load. The API is waiting for it synchronously. Users who upload during peak hours get a 500 error. Their photo is gone.

You didn't have a bug in your code. You had a structural problem. The API and the image resizer were too tightly coupled — one couldn't breathe without the other.

Message queues exist to solve exactly this. Not as a fancy optimization, but as a fundamental architectural shift in how services communicate. By the end of this article, you'll understand why that shift matters, what a queue actually guarantees, and when you need a real broker instead of something you build yourself.


Quick Reference

When to use a message queue: When two services communicate and the sender doesn't need an immediate response — and reliability, independent scaling, or fault isolation matter.

Basic pattern:

// Producer: hand off work and move on
await queue.publish("image.resize", {
  userId: "u_123",
  imageKey: "uploads/photo.jpg",
});

// Consumer: pick up work independently, at its own pace
queue.consume("image.resize", async (job) => {
  await resizeImage(job.imageKey);
  await notifyUser(job.userId);
});

Common patterns:

  • Task queue — one producer, one or more competing consumers (image resize, email send)
  • Fan-out — one event delivered to multiple independent consumers (user signed up → email service + analytics + CRM)
  • Delay queue — jobs processed after a specified delay (retry in 60 seconds, send reminder tomorrow)
  • Dead letter queue — failed jobs collected for inspection and replay

Gotchas:

  • ⚠️ In-memory queues lose all jobs on process restart — never use them for durable work
  • ⚠️ At-least-once delivery means your consumer may receive the same message twice — design for idempotency
  • ⚠️ A queue is not a database — don't use it as persistent storage for application state

See also:

Read more about queues in this module as other articles become available.


Version Information

This article covers concepts, not specific library versions. Code examples in the "build your own" section use plain TypeScript with no dependencies.

Tested with:

  • Node.js: v20.x (LTS)
  • TypeScript: 5.x

What You Need to Know First

This is the first article in the Message Queues module. It's conceptual — you don't need prior knowledge of any specific queue technology.

You should be comfortable with:

  • TypeScript basics — functions, async/await, interfaces
  • HTTP APIs — what a request/response cycle looks like
  • Basic Node.js — running a script, npm install

If you're coming from the Inngest module, you already know what async job processing feels like from the consumer side. This article explains the underlying infrastructure that makes it possible.

What We'll Cover in This Article

By the end of this guide, you'll understand:

  • Why synchronous service coupling causes reliability problems at scale
  • What a message queue actually guarantees (and what it doesn't)
  • Why building your own queue breaks in production — and exactly where it breaks
  • When a third-party broker is worth the operational cost
  • What the major queue options are and how they differ at a glance

What We'll Explain Along the Way

Don't worry if you're unfamiliar with these — we'll explain them as we go:

  • Tight vs loose coupling (with diagrams)
  • Delivery guarantees: at-most-once, at-least-once, exactly-once
  • Idempotency — what it means and why it matters for queues
  • The broker landscape — RabbitMQ, Kafka, BullMQ, Inngest

Part 1: The Coupling Problem

Let's go back to the image upload scenario and look at it structurally.

When a user uploads a profile photo, your API does something like this:

// The naive implementation — everything in one request handler
app.post("/upload", async (req, res) => {
  const imageKey = await uploadToStorage(req.file); // ~200ms
  const resized = await resizeImage(imageKey); // ~2000ms under load 😬
  await updateUserProfile(req.userId, resized); // ~50ms
  res.json({ success: true });
});

This works fine with 10 users. Notice what happens with 500 concurrent uploads:

Diagram: Under load, the image resizer becomes a bottleneck. Because the API waits synchronously, a slow resizer means a failed upload for the user.

Three things went wrong here, and they're all caused by the same root issue: tight coupling.

The resizer's problem became the API's problem. If the resizer is slow, the API is slow. If the resizer crashes, the API returns errors. Two services that should be independent are now one failure domain.

The API is doing work the user doesn't need right now. Does the user need their photo resized before the upload confirmation? No. They need to know the photo was received safely. The resize can happen afterward.

You can't scale them independently. If resizing is the bottleneck, you want more resizer capacity. But the resizer is called directly from the API — you can't just add more resizer instances without load-balancing between them.

What loose coupling looks like

Here's the same flow, restructured:

Diagram: With a queue between the API and resizer, the API returns immediately. The resizer works at its own pace. A slow or crashed resizer no longer affects the upload experience.

The API now does one thing: accept the file, save it, and hand a job to the queue. The resizer subscribes to that queue and processes jobs independently. They share no direct connection.

This is what loose coupling feels like. Now let's understand what the queue has to guarantee to make this safe.


Part 2: What a Real Queue Guarantees

"Just use an array" is the natural first instinct. Let's follow that instinct all the way to production — and watch it fall apart.

The naive custom implementation

Here's a queue you might write on a Friday afternoon:

// queue.ts — the naive approach
interface Job {
  id: string;
  type: string;
  payload: unknown;
  createdAt: Date;
}

// An in-memory array acting as a queue
const jobQueue: Job[] = [];

// Producer: add jobs to the queue
export function enqueue(type: string, payload: unknown): void {
  jobQueue.push({
    id: crypto.randomUUID(),
    type,
    payload,
    createdAt: new Date(),
  });
  console.log(`[Queue] Job enqueued: ${type}. Queue depth: ${jobQueue.length}`);
}

// Consumer: poll the queue every 100ms
setInterval(async () => {
  const job = jobQueue.shift(); // Remove and return the first item
  if (!job) return;

  console.log(`[Worker] Processing job: ${job.type}`);
  await processJob(job);
}, 100);

And you'd use it like this:

// In your upload handler
app.post("/upload", async (req, res) => {
  const imageKey = await uploadToStorage(req.file);
  enqueue("image.resize", { imageKey, userId: req.userId }); // Fast — just push to array
  res.json({ success: true });
});

This works. You'll be proud of it. Then you'll deploy it, and the first production incident will begin.

Failure 1: Process restart wipes the queue

Your deployment pipeline restarts the Node.js process every time you push code. The in-memory array is gone. Every job that was waiting — gone.

[Deploy] Restarting server...
[Queue] Lost 47 pending image resize jobs.
[Users] 47 users will never see their photos updated.

What you need: Persistence. Jobs must survive process restarts. This means writing to disk, a database, or a durable broker — not an array in memory.
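
What persistence looks like at its simplest can be sketched with nothing but Node's fs module: append each job to a journal file on enqueue, and replay the journal on startup. The `enqueueDurable` and `recoverPending` names are invented for this illustration — real brokers implement the same idea with write-ahead logs, fsync guarantees, and compaction.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";

// Sketch: persist each enqueued job as one JSON line, so a restarted
// process can recover pending work. (Real brokers use write-ahead logs
// with fsync and compaction — this only makes the idea concrete.)
interface Job { id: string; type: string; payload: unknown }

const JOURNAL = path.join(os.tmpdir(), "jobs.log");

function enqueueDurable(job: Job): void {
  // The write hits the OS before returning, so the job survives a
  // process crash (though not necessarily a sudden power loss).
  fs.appendFileSync(JOURNAL, JSON.stringify(job) + "\n");
}

function recoverPending(): Job[] {
  // On startup, replay the journal to rebuild the in-memory queue.
  if (!fs.existsSync(JOURNAL)) return [];
  return fs
    .readFileSync(JOURNAL, "utf8")
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as Job);
}
```

The point is the lifecycle, not the file format: the job's record outlives the process that created it.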

Failure 2: Multiple workers pull the same job

You add a second worker process to handle the load. Now two processes are running the same setInterval loop against the same... wait. They're not sharing the same array. Each process has its own memory.

So you move the queue to a shared database. Now two workers query the database and pop jobs. But shift() is not atomic across a database query:

// Worker 1: SELECT * FROM jobs WHERE status = 'pending' LIMIT 1  → gets job #42
// Worker 2: SELECT * FROM jobs WHERE status = 'pending' LIMIT 1 → also gets job #42
// Both workers process the same image resize job.
// The user's photo is updated twice, the second write overwriting the first.

What you need: Atomic delivery — a guarantee that only one consumer receives each message. Real brokers implement this with locks, leases, or queue-level exclusivity.
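
To make "atomic delivery" concrete, here's a sketch where finding a pending job and marking it taken happen as one indivisible step — simulated in-process with a status flag. The `claimNext` name is invented for this illustration; the SQL in the comment shows the equivalent Postgres pattern.

```typescript
// Sketch of atomic claiming: check-and-set the status in one step, so two
// workers asking at the same time can never both receive job #42. In a
// shared Postgres table the equivalent is a single statement:
//   UPDATE jobs SET status = 'claimed'
//   WHERE id = (SELECT id FROM jobs WHERE status = 'pending'
//               ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED)
//   RETURNING *;
interface StoredJob { id: string; status: "pending" | "claimed" }

const jobs: StoredJob[] = [
  { id: "job-42", status: "pending" },
  { id: "job-43", status: "pending" },
];

function claimNext(): StoredJob | undefined {
  // Find-and-mark happens in one synchronous pass — no other claim can
  // interleave, so each job is handed to at most one worker.
  const job = jobs.find((j) => j.status === "pending");
  if (job) job.status = "claimed";
  return job;
}
```

Two workers calling `claimNext()` back to back receive different jobs; the SELECT-then-process pattern above has no such guarantee.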

Failure 3: A crashed worker loses the job

Your worker receives a job and starts processing. Halfway through, the process crashes — maybe an uncaught exception, maybe an OOM kill.

setInterval(async () => {
  const job = jobQueue.shift(); // Job removed from queue here
  // ⬆️ The job is gone from the queue at this point
  if (!job) return;

  await resizeImage((job.payload as { imageKey: string }).imageKey); // 💥 Process crashes here

  // The job was never completed.
  // The queue has no record of it.
  // The user's photo will never be resized.
}, 100);

You called shift() before processing completed. The job is gone from the queue, but the work was never done.

What you need: Acknowledgement semantics. A real queue only removes a message after the consumer explicitly confirms it was processed successfully. Until then, the message stays "in flight" and will be redelivered if the consumer disappears.
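
The in-flight idea can be sketched as a queue that parks delivered messages in a holding map instead of deleting them. The names here (`AckQueue`, `deliver`, `ack`, `nack`) are invented for this illustration; real brokers add a timeout so a consumer that vanishes has its messages nacked automatically.

```typescript
// Sketch of acknowledgement semantics: a delivered message moves to an
// "in flight" map rather than being deleted. Only an explicit ack removes
// it; a nack (or, in a real broker, a consumer timeout) requeues it.
interface Message { id: string; payload: unknown }

class AckQueue {
  private pending: Message[] = [];
  private inFlight = new Map<string, Message>();

  publish(msg: Message): void {
    this.pending.push(msg);
  }

  // Hand the next message to a consumer, but keep a copy until it's acked.
  deliver(): Message | undefined {
    const msg = this.pending.shift();
    if (msg) this.inFlight.set(msg.id, msg);
    return msg;
  }

  // Consumer finished successfully — now it's safe to forget the message.
  ack(id: string): void {
    this.inFlight.delete(id);
  }

  // Consumer failed or disappeared — put the message back for redelivery.
  nack(id: string): void {
    const msg = this.inFlight.get(id);
    if (msg) {
      this.inFlight.delete(id);
      this.pending.unshift(msg);
    }
  }

  depth(): number {
    return this.pending.length + this.inFlight.size;
  }
}
```

Compare this with `shift()`: the queue never forgets a message it hasn't been told is done.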

Failure 4: Producers outrun consumers

Your app goes viral. Uploads spike to 10,000/minute. Your resizer can handle 100/minute. Your in-memory array starts growing:

[Queue] Depth: 100
[Queue] Depth: 1,000
[Queue] Depth: 10,000
[Queue] Depth: 50,000
[FATAL] JavaScript heap out of memory

What you need: Back-pressure — a mechanism for producers to slow down when consumers can't keep up. Real brokers expose queue depth metrics and can block or slow producers when thresholds are exceeded.
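
A minimal sketch of back-pressure, assuming nothing beyond standard promises: `publish()` refuses to resolve while the queue is at capacity, so a fast producer is forced to wait for consumers. `BoundedQueue` and `maxDepth` are invented names; real brokers implement the same idea with connection-level flow control or by rejecting publishes past a threshold.

```typescript
// Sketch of back-pressure: publish() does not resolve while the queue is
// at capacity, so a fast producer waits for slow consumers instead of
// growing memory without bound.
class BoundedQueue<T> {
  private items: T[] = [];
  private waiters: (() => void)[] = [];

  constructor(private maxDepth: number) {}

  async publish(item: T): Promise<void> {
    while (this.items.length >= this.maxDepth) {
      // Park the producer until a consumer frees a slot.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    this.items.push(item);
  }

  consume(): T | undefined {
    const item = this.items.shift();
    // A slot opened up — wake one parked producer, if any.
    this.waiters.shift()?.();
    return item;
  }

  depth(): number {
    return this.items.length;
  }
}
```

With `maxDepth` set to 2, a third `publish` simply stays pending until a `consume` makes room — the heap never grows past the bound.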

Failure 5: You can't see what's happening

You get a support ticket: "My photo from yesterday still hasn't updated." You have no idea how many jobs are in the queue. You don't know if jobs are failing or just slow. You can't inspect the queue or replay a specific job.

What you need: Observability — queue depth, consumer throughput, failure rate, and the ability to inspect and replay individual messages.
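
As a sketch of the minimum you'd want, here's a queue instrumented with the three numbers that answer that support ticket: depth, processed, and failed. The `InstrumentedQueue` shape is invented for illustration — real brokers surface these through management APIs and UIs rather than in-process counters.

```typescript
// Sketch of minimal queue observability: depth, throughput, and failure
// counts, readable at any moment. Failed jobs are counted here; a real
// system would also route them to a dead letter queue for inspection.
interface QueueMetrics {
  depth: number;     // jobs waiting right now
  processed: number; // jobs completed since startup
  failed: number;    // jobs that errored (dead letter candidates)
}

class InstrumentedQueue<T> {
  private items: T[] = [];
  private processed = 0;
  private failed = 0;

  publish(item: T): void {
    this.items.push(item);
  }

  async consume(handler: (item: T) => Promise<void>): Promise<void> {
    const item = this.items.shift();
    if (item === undefined) return;
    try {
      await handler(item);
      this.processed++;
    } catch {
      this.failed++; // in a real system: move to a dead letter queue
    }
  }

  metrics(): QueueMetrics {
    return { depth: this.items.length, processed: this.processed, failed: this.failed };
  }
}
```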

The gap between naive and real

Here's what a real message queue gives you that an array never will:

Guarantee                                  | Array | Real broker
Survives process restart                   | ❌    | ✅
Atomic delivery (one consumer per message) | ❌    | ✅
Acknowledgement (re-deliver on crash)      | ❌    | ✅
Back-pressure                              | ❌    | ✅
Dead letter queue (inspect failures)       | ❌    | ✅
Queue depth and throughput metrics         | ❌    | ✅
Replay failed messages                     | ❌    | ✅
Every one of these is a production incident waiting to happen. Not "might be nice to have." A certainty, given enough traffic and time.


Part 3: Delivery Guarantees — What Queues Promise

Before you choose a queue, you need to understand what it's promising you. There are three delivery guarantee levels, and they mean very different things.

At-most-once delivery

The broker delivers the message once and forgets it. If the consumer crashes before processing, the message is gone.

Producer → Broker → Consumer (message deleted immediately on delivery)
                        ↓ crash
                    Message is lost. Never redelivered.

When it's acceptable: Non-critical, high-volume data where occasional loss is tolerable — real-time metrics, analytics events, live dashboard updates. You'd rather drop a data point than slow the system.

When it's dangerous: Anything the user cares about. Payment processing, file uploads, email sending.

At-least-once delivery

The broker keeps the message until the consumer explicitly acknowledges it. If the consumer crashes before acknowledging, the message is redelivered to another consumer.

Producer → Broker → Consumer (message kept in broker)
                        ↓ processes successfully
                    Consumer sends ACK → Broker deletes message ✅

OR:

Producer → Broker → Consumer (message kept in broker)
                        ↓ crashes before ACK
                    Broker times out → message redelivered to the next consumer ✅
                    (but the original consumer may have partially processed it)

The catch: The consumer may receive the same message more than once. If the consumer processes a message, then crashes before sending the ACK, the broker will redeliver it.

This is the default guarantee in RabbitMQ, and it's the right default for most work queues. But it requires your consumer to be idempotent.

Idempotency means processing the same message twice produces the same result as processing it once. Resizing an image to 800px and uploading the result is idempotent — doing it twice just overwrites the first result with an identical one. Charging a credit card is not idempotent — doing it twice charges the customer twice.

// ❌ Not idempotent — running this twice charges the user twice
async function processPayment(job: PaymentJob): Promise<void> {
  await stripe.charges.create({ amount: job.amount, customer: job.customerId });
}

// ✅ Idempotent — running this twice produces the same result
async function processPayment(job: PaymentJob): Promise<void> {
  await stripe.charges.create(
    { amount: job.amount, customer: job.customerId },
    { idempotencyKey: job.id }, // Stripe deduplicates requests by this key
  );
}

Exactly-once delivery

Each message is delivered to a consumer exactly once — no duplicates, no losses.

This sounds like the obvious choice. It is also the hardest guarantee to implement, and most brokers don't truly provide it at the broker level. Achieving it requires coordination between the broker and the consumer that is expensive and slow.

In practice, exactly-once is achieved at the application level through idempotency keys and deduplication logic, layered on top of at-least-once delivery. This is what Kafka's transactional producer and Inngest's step deduplication do under the hood; with RabbitMQ, publisher confirms guard against loss on the publish side, and deduplication is something you add in the consumer.

The practical rule: Design your consumers to be idempotent. Use at-least-once delivery. Sleep well.
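
Here's one way to apply that rule, sketched with an in-memory set of processed job IDs. The names are invented for this illustration; in production the processed-ID store would be a database table with a unique constraint so it survives restarts.

```typescript
// Sketch of application-level "effectively exactly-once": at-least-once
// delivery plus deduplication by job ID. The processed-ID store is a Set
// here; in production it would be a durable table with a unique constraint.
interface ResizeJob { id: string; imageKey: string }

const processedIds = new Set<string>();
const resizeCalls: string[] = []; // stands in for the real side effect

async function handleResize(job: ResizeJob): Promise<void> {
  if (processedIds.has(job.id)) {
    // Redelivered duplicate — safe to acknowledge and skip.
    return;
  }
  resizeCalls.push(job.imageKey); // the actual resize work would go here
  processedIds.add(job.id);
}
```

Note the ordering: the ID is recorded only after the work completes. A crash in the gap between the two still produces one duplicate execution on redelivery — which is why the work itself (overwriting a resized image) should also be safe to repeat.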


Part 4: When to Use a Third-Party Queue vs. Building Your Own

You've seen why the naive array fails. But "third-party queue" spans everything from a Redis library to a distributed broker cluster. Here's how to think about the decision.

Signals that a custom or lightweight solution is fine

You don't always need a full broker. Here's when simpler is genuinely better:

  • Volume is low — under a few thousand jobs per day, a simple database queue with SELECT ... FOR UPDATE SKIP LOCKED is operationally simple and surprisingly robust. The row locking keeps delivery atomic even with multiple workers; what you give up is throughput and broker features, not correctness.
  • Loss is acceptable — if dropping jobs occasionally is fine (metrics, analytics pings), an in-memory queue or fire-and-forget approach works.
  • Single process, single machine — if you're never running multiple worker instances, you avoid the atomic delivery problem entirely.
  • You already have Redis — BullMQ adds job persistence, retries, and a monitoring UI on top of Redis with minimal operational overhead. If Redis is already in your stack, BullMQ is often the right first broker.

Signals that you need a real broker

Reach for RabbitMQ, Kafka, or a managed alternative when:

  • Multiple services consume the same events — you need routing logic (send this event to email service AND analytics service AND CRM, but not to the report generator)
  • Consumers are in different languages — AMQP (RabbitMQ's protocol) has clients in every major language; an in-house Redis-based solution ties you to Node.js
  • Message loss is a bug, not a feature — durable queues + publisher confirms + ACK semantics
  • Queue backup must not crash your service — broker-level back-pressure and memory management
  • You need to replay past events — Kafka's log-based model makes this possible; queue-based models don't
  • Volume exceeds what a single Redis instance can handle reliably — horizontal scaling becomes important

The decision table

Signal             | In-memory | DB polling | BullMQ (Redis) | RabbitMQ | Kafka
Survives restart   | ❌ | ✅ | ✅ | ✅ | ✅
Multiple workers   | ❌ | ✅ | ✅ | ✅ | ✅
Complex routing    | ❌ | ❌ | ❌ | ✅ | ❌
Message replay     | ❌ | ✅ manual | ❌ | ❌ | ✅
Polyglot consumers | ❌ | ✅ | Node.js | Any | Any
Ops burden         | None | Low | Low-medium | Medium | High
Good for           | Throwaway scripts | Low-volume durable jobs | Node.js job queues | Routed multi-service messaging | Event streaming at scale

Part 5: The Broker Landscape at a Glance

This module focuses on RabbitMQ in depth. But you'll make better architectural decisions if you know where it sits relative to the alternatives. Here's a one-paragraph introduction to each.

BullMQ

BullMQ is a Redis-backed job queue for Node.js. You get persistence, retries with backoff, delayed jobs, priorities, rate limiting, and a UI (Bull Board) — all without running a separate broker process. If your team already has Redis and your workers are Node.js, BullMQ is frequently the right call. Its limitation is that Redis is a single point of failure unless clustered, and it doesn't support polyglot consumers or complex routing topologies.

RabbitMQ

RabbitMQ is a full AMQP message broker. It runs as a separate process (or cluster), supports clients in any language, and provides a sophisticated routing model through exchanges and bindings. It's the right choice when you need reliable message delivery across multiple services, complex routing logic, or AMQP protocol compliance. The tradeoff is operational complexity — you run it, monitor it, and manage its memory and disk pressure.

Kafka

Kafka is not a queue — it's a distributed event log. Rather than messages being consumed and deleted, they're appended to a log that consumers read at their own pace, from wherever they left off. This makes Kafka uniquely suited for event streaming, event sourcing, and scenarios where multiple independent systems need to consume the same historical event stream. The cost is significant: Kafka has a steeper learning curve, requires a cluster to operate reliably, and imposes a different mental model than a traditional queue. Article 10 introduces Kafka from scratch and helps you decide when it's the right call.

Inngest

Inngest is a managed workflow platform with a queue built in. Rather than publishing raw messages, you trigger functions over HTTP with durable retry semantics. It adds step functions — multi-step workflows where each step is independently retried, its state persisted, and its output passed to the next step. Inngest eliminates infrastructure to operate, but introduces a different cost model (per-step pricing) and requires your functions to be reachable via HTTP. If you've used the Inngest module, you know this model well.

What this module covers: Articles 2 through 8 go deep on RabbitMQ — its internals, its routing model, and its production patterns. Article 9 compares the lightweight options (BullMQ, DB polling, custom). Article 10 puts all four in a decision framework with Kafka introduced from scratch.


Common Misconceptions

❌ Misconception: "A queue is just an async function call"

Reality: An async function runs in the same process, shares the same memory, and disappears when the process restarts. A message queue is a separate system with its own storage, its own process lifecycle, and its own guarantees about delivery. The key difference: if your API process crashes, async function calls die with it. Jobs in a durable queue survive.

Example:

// ❌ This is not a queue — it's a fire-and-forget async call
app.post("/upload", async (req, res) => {
  const imageKey = await uploadToStorage(req.file);
  resizeImage(imageKey).catch(console.error); // If process restarts → lost
  res.json({ success: true });
});

// ✅ This is a queue — the job survives independently of this process
app.post("/upload", async (req, res) => {
  const imageKey = await uploadToStorage(req.file);
  await queue.publish("image.resize", { imageKey }); // Stored durably in broker
  res.json({ success: true });
});

❌ Misconception: "Exactly-once delivery is the safe default"

Reality: Exactly-once is the hardest guarantee to provide and requires coordination overhead that increases latency. Most production systems use at-least-once delivery and achieve safety through idempotent consumers. Trying to enforce exactly-once at the infrastructure level often produces a system that is simultaneously slower and less reliable.

❌ Misconception: "Adding a queue always makes things faster"

Reality: A queue adds latency to processing — the job has to be serialized, stored in the broker, and delivered to a consumer. The benefit is not speed; it's resilience and decoupling. Your API responds faster (it no longer waits for processing), but the total time from upload to completed resize may be longer. That's a good tradeoff for reliability, not a free speed boost.

❌ Misconception: "I'll add durability later when I need it"

Reality: "Later" rarely comes. By the time you notice you need durable job processing, you've built application logic around the in-memory queue's behavior, your consumers are not idempotent, and you have no way to replay lost jobs. Starting with a real broker — even BullMQ over Redis — is almost always cheaper than migrating to one after the fact.


Troubleshooting Common Issues

Problem: Jobs disappear when the server restarts

Symptoms: Pending work is silently lost after a deployment or crash.

Common causes:

  1. Using an in-memory queue (array, Map, or similar) — by far the most common cause of restart-related loss
  2. Using a durable broker but publishing non-persistent messages — messages bypass disk storage
  3. Consumer ACKs before processing completes — broker deletes message before work is done

Diagnostic steps:

// Step 1: Check if your queue is in-memory
// If you see this pattern anywhere, jobs will not survive a restart:
const queue: Job[] = []; // ❌ Array in process memory

// Step 2: Check message persistence in RabbitMQ
// Messages must be marked persistent AND the queue must be durable
await channel.assertQueue("image-resize", { durable: true }); // Queue survives restart ✅
channel.sendToQueue(
  "image-resize",
  Buffer.from(JSON.stringify(job)),
  { persistent: true }, // Message survives restart ✅ (sendToQueue returns a boolean, not a promise)
);

// Step 3: Check where you're acknowledging
channel.consume("image-resize", async (msg) => {
if (!msg) return;

// ❌ ACK before processing — job lost if process crashes during processJob()
channel.ack(msg);
await processJob(msg);

// ✅ ACK after processing — job redelivered if process crashes
await processJob(msg);
channel.ack(msg);
});

Solution: Use a durable queue with persistent messages and acknowledge only after successful processing.

Prevention: Never use in-memory structures for work that must survive restarts. Choose your queue technology based on your durability requirements before writing your first producer.


Problem: The same job is processed twice

Symptoms: Duplicate emails sent, images resized multiple times, database records created twice.

Common causes:

  1. Consumer processing time exceeds the broker's acknowledgement timeout — broker redelivers to another consumer while the first is still working
  2. Consumer crashes after processing but before sending ACK — broker redelivers
  3. Multiple workers competing on an in-memory queue with no atomic pop

Diagnostic steps:

// Step 1: Log job IDs on receipt to detect duplicates
channel.consume("image-resize", async (msg) => {
  if (!msg) return;
  const job = JSON.parse(msg.content.toString());
  console.log(`[Worker] Received job: ${job.id}`); // Watch for repeated IDs in logs

  // Step 2: Check if your consumer is idempotent
  // Can this safely run twice with the same input?
  await resizeImage(job.imageKey); // ✅ Idempotent — same result either way
});

// Step 3: Check your ACK timeout vs processing time in RabbitMQ
// If processJob() takes 30s but consumer_timeout is 15s, the broker redelivers
// Set in rabbitmq.conf:
// consumer_timeout = 3600000 (1 hour, in milliseconds)

Solution: Make your consumers idempotent. Use an idempotency key (the job ID) to detect and skip duplicate processing.

Prevention: Design consumers to be idempotent from the start. At-least-once delivery is the standard — assume duplicates will happen and handle them gracefully.


Check Your Understanding

Quick Quiz

1. Your team is processing 500 password reset emails per day. The email sender is a Node.js service. You already run Redis. Which queue approach makes the most sense?

Show Answer

BullMQ. The volume is low-to-moderate, you're in Node.js, and Redis is already in your stack. A full broker like RabbitMQ would add operational overhead that isn't justified at this scale. BullMQ gives you persistence, retries, and monitoring out of the box.


2. What's wrong with this pattern?

channel.consume("resize-queue", async (msg) => {
  channel.ack(msg); // Line A
  await resizeImage(msg); // Line B
});
Show Answer

The ACK is sent before the work is done (Line A before Line B). If the process crashes during resizeImage(), the broker has already been told the job succeeded and will not redeliver it. The job is silently lost.

Correct pattern:

channel.consume("resize-queue", async (msg) => {
  await resizeImage(msg); // Do the work first
  channel.ack(msg); // Then confirm success
});

3. A user reports their profile photo from 3 days ago still hasn't updated. You're running a custom in-memory queue. What most likely happened?

Show Answer

A server restart (deployment, crash, or scaling event) cleared the in-memory queue. The job was silently lost — no error, no dead letter, no way to replay it. This is the core failure mode of in-memory queues for durable work.


Hands-On Challenge

The scenario: You're building a document export service. Users request PDF exports of their data. Exports take 10-30 seconds. Multiple worker processes handle exports.

Your task: Identify every failure mode in this implementation and describe the fix for each:

// export-queue.ts
const pendingExports: ExportJob[] = [];

export function queueExport(userId: string, documentId: string): void {
  pendingExports.push({ userId, documentId, requestedAt: new Date() });
}

// worker.ts
setInterval(async () => {
  const job = pendingExports.shift();
  if (!job) return;
  await generatePDF(job);
  await emailToUser(job.userId);
}, 500);
Show Solution

Failure 1: No persistence. pendingExports is in-memory; any restart loses all queued exports. Fix: use a durable broker (BullMQ/Redis, RabbitMQ) or a database queue.

Failure 2: Not safe for multiple workers. If two worker processes run this code, they each have their own pendingExports array, so jobs enqueued by the API process are invisible to worker processes. Even with a shared data store, shift() is not atomic across processes. Fix: use a broker that provides atomic message delivery.

Failure 3: No acknowledgement. shift() removes the job before processing, so a crash during generatePDF() loses the job. Fix: use broker ACK semantics — only confirm the job after emailToUser() completes successfully.

Failure 4: emailToUser() is probably not idempotent. If the job is redelivered and processed twice, the user gets two emails. Fix: add a deduplication check before sending (track sent exports in the DB by job ID).

Failure 5: No visibility. You cannot inspect what's pending, what's failed, or how deep the queue is. Fix: use a broker with a management UI (RabbitMQ management plugin, Bull Board for BullMQ).


Summary: Key Takeaways

  • Tight coupling is the root cause — when services call each other synchronously, one service's failure becomes every caller's failure. Queues decouple them.
  • A message queue is not an array — it provides persistence, atomic delivery, acknowledgement, back-pressure, and observability. An array provides none of these.
  • Delivery guarantees matter — at-most-once loses messages, at-least-once may duplicate them, exactly-once is expensive. Design idempotent consumers and use at-least-once.
  • Start with the right tool — BullMQ for Node.js with Redis already in the stack; RabbitMQ for multi-language, complex routing; Kafka for event streaming and replay; Inngest for managed durable workflows.
  • Don't defer durability — migrating from an in-memory queue to a real broker after you've built around it is painful. The overhead of starting with BullMQ is one npm install.

What's Next?

You now understand why message queues exist and what they guarantee. Before you touch RabbitMQ's API, you need to understand how queues actually work at the data structure level — because that's what determines the behavior you'll see in production.

The next step is understanding how queues work internally — linked lists, ring buffers, acknowledgement state machines, FIFO vs priority vs delay queues, and the push vs pull delivery models.

