Heap Cap Exceeded RAM on e2-medium 4GB — Recalculating PM2 Memory Allocation — Riel build logs

Discovery

I run three Next.js apps on a single GCE e2-medium (2 vCPU / 4GB RAM) instance under PM2 in cluster mode. While building operational monitoring, I checked the memory settings:

# ecosystem.config.js (Before changes)
riel_agent:  instances 2, --max-old-space-size=896  → 1792MB
riel_chat:   instances 2, --max-old-space-size=896  → 1792MB
riel_secret: instances 1, --max-old-space-size=512  →  512MB
─────────────────────────────────────────────────
Total heap cap                                       4096MB

The server's RAM is 3924MB. The total heap cap exceeds the RAM by 172MB. On top of that, PostgreSQL (~400MB) + gunicorn 4 workers (~600MB) + occasional Playwright (~300MB) are also running on the same server.
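The arithmetic above can be reproduced with a few lines of JavaScript. This is only a sketch: the `apps` array and the field name `maxOldSpaceMB` are my own illustration, not the actual structure of ecosystem.config.js.

```javascript
// Heap ceilings from the "before" config, in MB (values from the table above).
// The object shape and field names here are illustrative, not PM2's format.
const apps = [
  { name: 'riel_agent', instances: 2, maxOldSpaceMB: 896 },
  { name: 'riel_chat', instances: 2, maxOldSpaceMB: 896 },
  { name: 'riel_secret', instances: 1, maxOldSpaceMB: 512 },
];

// Worst case: every process grows to its heap ceiling at the same time.
function totalHeapCapMB(appList) {
  return appList.reduce(
    (sum, app) => sum + app.instances * app.maxOldSpaceMB,
    0
  );
}

const ramMB = 3924; // physical RAM reported by the server
const capMB = totalHeapCapMB(apps);
const overshootMB = capMB - ramMB;

console.log(`total heap cap ${capMB}MB, RAM ${ramMB}MB, overshoot ${overshootMB}MB`);
```

The point of summing instances × ceiling is that the per-process number (896MB) looks harmless on its own; only the total reveals the problem.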

Why It Didn't Crash Until Now

--max-old-space-size is a ceiling on V8's old-generation heap, not an upfront allocation. Actual usage was about 100MB per process, so the oversized ceiling never caused a problem in normal operation.

The danger lies in spikes. If memory leaks or large SSR requests hit multiple processes at once, each can grow toward its own ceiling (up to 896MB) → combined usage exceeds physical RAM → the Linux OOM killer forcibly terminates a process → service downtime. A total ceiling higher than physical RAM was a ticking time bomb.

Recalculation — Separating Node / Non-Node Budgets

First, let's subtract the non-Node usage to calculate the available capacity for Node:

Total RAM           3924MB
- PostgreSQL       ~400MB
- gunicorn 4 workers   ~600MB
- OS               ~400MB
- Playwright       ~300MB (intermittent)
─────────────────────────
Available for Node   ~2200MB

Now, redistribute the heap cap total within 2.2GB, reflecting the weight of each app:

# After changes
riel_agent:  2 × 512MB = 1024MB  (Blog SSR / Admin — heaviest)
riel_chat:   2 × 384MB =  768MB  (Chat — relatively lighter)
riel_secret: 1 × 384MB =  384MB
─────────────────────────────
Total                    2176MB  < 2200MB ✓
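As a sanity check, the budget and the new allocation can be computed in one place. This is a rough sketch using the estimates from the tables above; the variable names are mine, and the non-Node figures are approximations, not measurements.

```javascript
// Non-Node reservations in MB (rough estimates from the budget table above).
const totalRamMB = 3924;
const reservedMB = { postgres: 400, gunicorn: 600, os: 400, playwright: 300 };

// What is left for Node after subtracting everything else on the box.
const nodeBudgetMB =
  totalRamMB - Object.values(reservedMB).reduce((a, b) => a + b, 0);

// New heap ceilings per app (instances x ceiling), from the "after" table.
const after = [
  { name: 'riel_agent', instances: 2, maxOldSpaceMB: 512 },
  { name: 'riel_chat', instances: 2, maxOldSpaceMB: 384 },
  { name: 'riel_secret', instances: 1, maxOldSpaceMB: 384 },
];

const allocatedMB = after.reduce(
  (sum, app) => sum + app.instances * app.maxOldSpaceMB,
  0
);

// The post rounds the budget down to 2200MB, so check against that figure.
const roundedBudgetMB = 2200;
const fits = allocatedMB < roundedBudgetMB;

console.log(`budget ${nodeBudgetMB}MB, allocated ${allocatedMB}MB, fits: ${fits}`);
```

The exact subtraction gives 2224MB; rounding down to 2200MB leaves a small extra margin on top of the 24MB between the allocation and the rounded budget.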

I also set max_memory_restart (PM2's restart threshold) slightly above the heap cap. RSS is heap plus non-heap memory (buffers, native code, stacks), so a process with a 512MB heap cap gets a 640MB restart threshold. That way PM2 restarts a bloated process before the kernel OOM killer has to.

Trap — `reload` Doesn't Apply `node_args`

I ran pm2 reload, but --max-old-space-size didn't change. node_args are fixed when the process is first spawned, so a graceful reload did not pick up the new values.

# reload is not enough
pm2 reload ecosystem.config.js   # node_args not changed

Full restart is required:

pm2 delete ecosystem.config.js
pm2 start ecosystem.config.js
pm2 save   # Persists after reboot

You can verify the applied settings with pm2 describe under interpreter args:

$ pm2 describe riel_agent | grep 'interpreter args'
│ interpreter args  │ --max-old-space-size=512 │   ✓
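A second check is possible from inside the Node process itself. The sketch below uses `process.execArgv` and `v8.getHeapStatistics()`, both standard Node APIs; `parseMaxOldSpaceMB` is a helper name I made up. I would expect PM2's node_args to show up in `process.execArgv`, but that is an assumption worth confirming on your own setup, and the parser does not look at NODE_OPTIONS.

```javascript
const v8 = require('v8');

// Extract the --max-old-space-size value (in MB) from a list of Node flags.
// Node accepts both hyphens and underscores in V8 flag names.
function parseMaxOldSpaceMB(argv) {
  for (const arg of argv) {
    const match = /^--max[-_]old[-_]space[-_]size=(\d+)$/.exec(arg);
    if (match) return Number(match[1]);
  }
  return null; // flag not present in this list
}

// Flag as passed to this process (null if it was not passed on the command line).
const flagMB = parseMaxOldSpaceMB(process.execArgv);

// Limit V8 actually applies; it covers more than old space, so it is
// usually somewhat larger than the flag value rather than equal to it.
const limitMB = Math.round(v8.getHeapStatistics().heap_size_limit / 1024 / 1024);

console.log(`flag: ${flagMB ?? 'not set'}MB, V8 heap_size_limit: ${limitMB}MB`);
```

If the flag reads 512 but `heap_size_limit` still reflects the old ceiling (or the flag is missing entirely), the process was most likely not respawned with the new arguments.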

Lessons Learned

Total heap cap < physical RAM is an immutable rule. Don't just look at the ceiling for a single process; compare the total (instances × ceiling) against RAM. If a DB/backend shares the server, subtract its usage too.
Heap cap is a ceiling, not an allocation. Even if normal usage is low, if the ceiling exceeds RAM, spikes can cause OOM. "It's not crashing now, so it's fine" is the most dangerous judgment.
Changing node_args requires a full restart, not a reload. A graceful reload only respawns workers; it doesn't pass new arguments to the Node interpreter. Use `delete` + `start`, then `pm2 save`.
Set `max_memory_restart` slightly above the heap cap. This allows PM2 to restart cleanly before the OOM killer steps in. Remember that RSS is larger than the heap.