0 Minutes of Downtime! Mastering Next.js Zero-Downtime Deployments with Atomic Swap

0 Minutes of Downtime! Mastering Next.js Zero-Downtime Deployments with Atomic Swaps

My previous Next.js deployment method frequently caused service interruptions of 2-5 minutes. This happened during the sequence of rm -rf .next, next build, and pm2 restart, resulting in 500/502 errors. It was incredibly inconvenient, and I was determined to find a solution.

Attempts and Pitfalls

My initial approach was to modify the build process. Instead of directly touching the .next directory, I tried creating the new build in a separate directory called .next.new. I used the command next build --distDir .next.new.

# Old method (causing issues)
rm -rf .next
next build
pm2 restart app_name

# Attempt 1: Using a new build directory
next build --distDir .next.new

After creating the new build, I was stumped on how to replace the existing .next directory. Simply renaming directories felt risky.

The Root Cause

The core of the problem was the service gap that occurred when the build process and server restart happened concurrently. With the .next directory completely gone while the new build was being placed and then the server starting up, errors were inevitable. PM2's default restart mechanism couldn't bridge this gap.

The Solution

This led me to adopt an Atomic Swap technique. Once the new build is complete, I rename the current .next directory to .next.old and then atomically swap .next.new to become the new .next.

# 1. Create the new build
next build --distDir .next.new

# 2. Backup the current build (rename)
mv .next .next.old

# 3. Replace with the new build (Atomic Swap)
mv .next.new .next

# 4. Sequential restart with PM2 cluster mode
pm2 reload app_name

I also adjusted my PM2 configuration. In ecosystem.config.js, I set up cluster mode with 2 instances and used pm2 reload to restart each instance sequentially.

// ecosystem.config.js
module.exports = {
  apps : [{
    name: "app_name",
    script: "node_modules/next/dist/bin/next",
    instances: 2, // Set to 2 instances
    exec_mode: "cluster", // Use cluster mode
    watch: false,
    env: {
      NODE_ENV: "production",
    },
    kill_timeout: 5000 // 5 second timeout
  }]
};

If the build fails, the .next directory remains unchanged, effectively keeping the previous version active. I also added a script to clean up the .next.old directory after a successful deployment.

The Results

Significantly reduced frequency and duration of 500/502 errors during deployment (nearly zero minutes of downtime).
Ensured stability through gradual deployments using pm2 reload.
Automatic rollback to the previous version without manual intervention in case of build failures.
Full deployment automation with a single command: redeploy.sh.

Key Takeaways — Avoiding the Same Pitfalls

[ ] Separate your build process from your server restart process.
[ ] Consider using the Atomic Swap technique for directory replacements. (The mv command is atomic.)
[ ] Leverage PM2's cluster mode and the pm2 reload command for gradual instance restarts.
[ ] Clearly define your rollback strategy for build failures. (In this case, maintaining the previous version was the solution.)
[ ] Add a script to clean up old build directories after a successful deployment.