Backend · May 10, 2026

CPU at 70% with Low Traffic? My Story of Catching a Duplicate Scheduler in a 4-Worker Environment

📅 Written on 2026-05-10. A real trap encountered while operating Riel (aicoreutility.com).

The Symptom

I noticed a strange pattern while monitoring CPU usage on the admin page's operation monitoring tab. Even in the early morning hours, when there were almost no users, CPU usage was spiking above 70%.

I checked the logs.

00:01:23 [profile_analyzer] running for user_id=42
00:01:23 [profile_analyzer] running for user_id=42
00:01:23 [profile_analyzer] running for user_id=42
00:01:23 [profile_analyzer] running for user_id=42

The same task was logged 4 times in the same second. Each of the 4 gunicorn workers was running its own APScheduler instance.
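Spotting this by eye works for four lines, but a small script makes the check repeatable. The following is a minimal sketch, assuming the log lines are available as plain strings; `duplicate_runs` is a hypothetical helper name, not part of the Riel codebase.

```python
from collections import Counter


def duplicate_runs(log_lines):
    """Return {line: count} for log lines that appear more than once."""
    counts = Counter(line.strip() for line in log_lines if line.strip())
    return {line: n for line, n in counts.items() if n > 1}


# Sample input copied from the log excerpt above.
sample = ["00:01:23 [profile_analyzer] running for user_id=42"] * 4
print(duplicate_runs(sample))
# {'00:01:23 [profile_analyzer] running for user_id=42': 4}
```

A count equal to the number of workers is a strong hint that per-worker startup code is involved.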

Why Did This Happen?

The code that starts the scheduler in the FastAPI lifespan looks like this.

@asynccontextmanager
async def lifespan(app: FastAPI):
    scheduler.add_job(profile_analysis_job, "cron", hour=15)
    scheduler.start()
    yield

When gunicorn starts 4 workers, each worker process runs the lifespan, so 4 independent schedulers are created. The same job then fires 4 times every day at 15:00 UTC, which is midnight KST.
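The effect can be reproduced without gunicorn. This is a toy model only: `ToyScheduler`, `make_lifespan`, and `boot_workers` are made-up stand-ins, and the "workers" are simulated in one process rather than forked. It shows that running the same lifespan once per worker leaves one started scheduler per worker.

```python
import asyncio
from contextlib import asynccontextmanager


class ToyScheduler:
    """Stand-in for a real scheduler: it only records registered jobs."""

    def __init__(self):
        self.jobs = []
        self.running = False

    def add_job(self, func, trigger, **kwargs):
        self.jobs.append((func.__name__, trigger, kwargs))

    def start(self):
        self.running = True


def profile_analysis_job():
    pass


def make_lifespan(scheduler):
    @asynccontextmanager
    async def lifespan(app):
        scheduler.add_job(profile_analysis_job, "cron", hour=15)
        scheduler.start()
        yield

    return lifespan


async def boot_workers(n):
    """Each simulated worker owns its scheduler, as separate processes would."""
    schedulers = []
    for _ in range(n):
        scheduler = ToyScheduler()
        async with make_lifespan(scheduler)(app=None):
            schedulers.append(scheduler)
    return schedulers


running = [s for s in asyncio.run(boot_workers(4)) if s.running]
print(len(running), "schedulers will each fire profile_analysis_job")
# 4 schedulers will each fire profile_analysis_job
```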

Cost calculation: One profile_analysis run costs about ₩120. Running 4 times daily, that's ₩480 per day, or ₩14,400 over a 30-day month, of which ₩10,800 pays for the 3 redundant runs.
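The arithmetic, written out as a quick check. The 30-day month is an assumption, and `monthly_cost` is a hypothetical helper used only for this illustration.

```python
COST_PER_RUN_KRW = 120  # approximate cost of one profile_analysis run
WORKERS = 4             # each worker fires the job once per day
DAYS_PER_MONTH = 30     # assumption: a 30-day month


def monthly_cost(cost_per_run, runs_per_day, days=DAYS_PER_MONTH):
    """Total spend for a job that runs `runs_per_day` times a day."""
    return cost_per_run * runs_per_day * days


daily = COST_PER_RUN_KRW * WORKERS
total = monthly_cost(COST_PER_RUN_KRW, WORKERS)
wasted = monthly_cost(COST_PER_RUN_KRW, WORKERS - 1)  # the 3 redundant runs
print(f"daily ₩{daily:,}, monthly ₩{total:,}, wasted ₩{wasted:,}")
# daily ₩480, monthly ₩14,400, wasted ₩10,800
```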

Solution Candidates

  1. Reduce the number of workers to 1 — Sacrifices throughput. Rejected.
  2. Separate into a dedicated worker process — Requires adding a systemd unit. Increases operational complexity.
  3. Redis lock — Adds Redis dependency. Increases infrastructure burden.
  4. PostgreSQL advisory lock — Already using PG, so 0 new dependencies. Chosen.

PostgreSQL Advisory Lock

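PG's pg_try_advisory_lock(key) takes an advisory (cooperative) lock: PostgreSQL does not attach it to any table or row, so it only has meaning to applications that agree to check it, and no data is affected. For a given bigint key, only one session in the database can hold the lock at a time. The function returns true or false immediately instead of waiting. A session-level lock is released automatically when the session ends.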

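import os
from contextlib import asynccontextmanager

from fastapi import FastAPI

# scheduler, Database, logger, and profile_analysis_job are defined elsewhere in the app.

SCHEDULER_LOCK_KEY = 0x52494F4C  # ASCII "RIOL"

@asynccontextmanager
async def lifespan(app: FastAPI):
    pool = await Database.get_pool()

    # Hold one connection out of the pool for the worker's lifetime
    # (releasing it also releases the lock).
    lock_conn = await pool.acquire()
    got = await lock_conn.fetchval(
        "SELECT pg_try_advisory_lock($1)", SCHEDULER_LOCK_KEY
    )

    if got:
        # This worker won the lock, so it is the only one that runs the scheduler.
        scheduler.add_job(profile_analysis_job, "cron", hour=15)
        scheduler.start()
        logger.info(f"[Scheduler] this worker (pid={os.getpid()}) holds lock")
    else:
        # Another worker holds the lock: return the connection and serve API traffic only.
        await pool.release(lock_conn)
        logger.info(f"[Scheduler] worker (pid={os.getpid()}) skipped: another holds lock")

    yield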
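The same idea can be packaged so that shutdown is handled too. This is a sketch under assumptions, not the code running in Riel: `advisory_leader` is a hypothetical helper, and `pool` is assumed to be an asyncpg-style pool whose connections offer `fetchval` and `execute`.

```python
from contextlib import asynccontextmanager


@asynccontextmanager
async def advisory_leader(pool, key: int):
    """Yield True in the one process that wins the lock, False everywhere else."""
    conn = await pool.acquire()
    try:
        is_leader = await conn.fetchval("SELECT pg_try_advisory_lock($1)", key)
    except BaseException:
        await pool.release(conn)
        raise

    if not is_leader:
        # Losers do not need a dedicated connection, so hand it back right away.
        await pool.release(conn)
        conn = None

    try:
        yield bool(is_leader)
    finally:
        if conn is not None:
            # Unlock explicitly on shutdown, then return the connection.
            await conn.execute("SELECT pg_advisory_unlock($1)", key)
            await pool.release(conn)
```

Inside a lifespan this would read `async with advisory_leader(pool, SCHEDULER_LOCK_KEY) as is_leader:`, with the scheduler started only when `is_leader` is true and the `yield` placed inside the `async with` block so the connection stays held while the app runs.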

Key Takeaways

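  • You must use the `try_` variant. The regular pg_advisory_lock blocks until it acquires the lock, so the other 3 workers would stall in lifespan startup instead of serving requests.
  • Do not return the connection holding the lock to the pool. A session-level advisory lock survives commits and rollbacks, but it is gone as soon as the session is reset or closed, and a pool may do either with a returned connection (asyncpg, for example, runs pg_advisory_unlock_all() when a connection is released).
  • The lock key is either a single 64-bit signed integer (bigint) or a pair of 32-bit integers. Using a readable ASCII value makes debugging easier.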
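A readable key does not have to be typed as hex by hand. Here is a small sketch of deriving one from a short tag and turning it back into text; `key_from_tag` and `tag_from_key` are hypothetical helper names introduced for this illustration.

```python
def key_from_tag(tag: str) -> int:
    """Pack a short ASCII tag (1 to 8 characters) into a bigint-compatible key."""
    raw = tag.encode("ascii")
    if not 1 <= len(raw) <= 8:
        raise ValueError("tag must be 1 to 8 ASCII characters")
    # ASCII bytes are below 0x80, so the result always fits a signed 64-bit integer.
    return int.from_bytes(raw, "big")


def tag_from_key(key: int) -> str:
    """Inverse of key_from_tag, handy when reading keys back from the database."""
    length = max(1, (key.bit_length() + 7) // 8)
    return key.to_bytes(length, "big").decode("ascii")


print(hex(key_from_tag("RIOL")))  # 0x52494f4c
print(tag_from_key(0x52494F4C))   # RIOL
```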

Verification

After deployment, I checked directly in PG.

SELECT locktype, classid, objid, pid, mode, granted
FROM pg_locks
WHERE locktype = 'advisory';

 locktype | classid |   objid    |  pid  |     mode      | granted
----------+---------+------------+-------+---------------+---------
 advisory |       0 | 1380733260 | 12847 | ExclusiveLock | t
(1 row)

Only one worker held the lock. The other 3 workers were solely handling API traffic.
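The same check can be scripted, for example as a health check. For a single bigint key, pg_locks shows the high 32 bits in classid, the low 32 bits in objid, and objsubid = 1. The sketch below relies on that layout; `split_key` and `advisory_lock_holders` are hypothetical helpers, and `conn` is assumed to be an asyncpg-style connection with a `fetch` method.

```python
def split_key(key: int):
    """Split a signed 64-bit key into the (classid, objid) halves shown in pg_locks."""
    unsigned = key & 0xFFFFFFFFFFFFFFFF
    return unsigned >> 32, unsigned & 0xFFFFFFFF


# pg_locks covers every database on the server, so filter to the current one.
HOLDERS_SQL = """
SELECT pid
FROM pg_locks
WHERE locktype = 'advisory'
  AND database = (SELECT oid FROM pg_database WHERE datname = current_database())
  AND classid = $1 AND objid = $2 AND objsubid = 1
  AND granted
"""


async def advisory_lock_holders(conn, key: int):
    """Return the backend pids currently holding the advisory lock for `key`."""
    classid, objid = split_key(key)
    rows = await conn.fetch(HOLDERS_SQL, classid, objid)
    return [row["pid"] for row in rows]
```

Exactly one pid in the result means one worker owns the scheduler. An empty list means no worker does, which is worth alerting on.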

Results

 Metric                          | Before  | After
---------------------------------+---------+-----------
 profile_analysis executions/day | 4 times | 1 time
 Daily LLM cost                  | ₩480    | ₩120
 Early morning CPU spikes        | 70%+    | Below 20%

From ₩14,400/month to ₩3,600/month. A 75% saving.

Learnings

  • Even with gunicorn's --preload enabled, lifespan runs for each worker. You must assume lifespan code will be multiplied by the number of workers.
  • If you have code in lifespan that "must run only once," you need separate singleton guarantees.
  • PG advisory lock is a nearly free singleton tool: no new dependency, just one dedicated connection. If you're already using PG, there's little reason not to use it.

📌 A Comment from 2026

This pattern can be applied to scenarios beyond schedulers, such as "single worker cache warming" or "one worker sending Slack notifications." I've developed a habit of suspecting any side effects within the lifespan.