Disk Full: How `docker image prune` Saved My Solo VM from a Deploy Freeze
Running a full AI product, aicoreutility.com, on a single, small VM as a solo developer is a constant exercise in resource management and engineering scar tissue. Most of the time, things hum along, but sometimes the unglamorous reality of limited resources bites hard. One such incident recently brought my entire deployment pipeline to a grinding halt, and the fix was surprisingly simple, yet critical: `docker image prune`.
The context was a planned upgrade to my `riel_agent` service. Previously, deploying new versions involved a manual script, `deploy_agent.sh`. To improve reliability and reduce manual intervention, I decided to automate the build and deployment process. The goal was to have a `git push` to a specific branch trigger an automated build in the cloud, followed by a reconciliation process on the VM to pull and deploy the new container image.
This new architecture involved:
- A `git push` to the `riel_agent/**` path triggering a GitHub Action.
- The GitHub Action using Cloud Build (with Kaniko) to build a new container image. This build happened outside my VM, preventing Out-Of-Memory (OOM) errors during the build process itself.
- The newly built image being pushed to Artifact Registry.
- A systemd timer on my VM, running every 90 seconds, checking Artifact Registry for a new image digest.
- If a new digest was found, a script (`tools/container_deploy/reconcile_image.sh`) would pull the new image and perform a zero-downtime swap of the running container.
This setup, detailed in commit 8ab1eb1, was designed to be robust. The build was asynchronous and offloaded, and the VM-side reconciliation handled the deployment with rollback capabilities.
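To make the reconcile step concrete, here is a minimal sketch of the decision a single timer tick has to make: deploy only when the registry reports a digest that differs from the running one. The function names `needs_deploy` and `reconcile_once` are hypothetical, and how the two digests are fetched is deliberately left out; this is not the actual contents of `reconcile_image.sh`.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: the real reconcile_image.sh is not shown in this post.
# Both function names below are hypothetical.

# needs_deploy REMOTE_DIGEST RUNNING_DIGEST
# Succeeds (exit 0) when the registry reports a non-empty digest that
# differs from the digest of the image currently running.
needs_deploy() {
  local remote="$1" running="$2"
  [ -n "$remote" ] && [ "$remote" != "$running" ]
}

# reconcile_once REMOTE_DIGEST RUNNING_DIGEST
# Prints the action one timer tick would take.
reconcile_once() {
  if needs_deploy "$1" "$2"; then
    echo "deploy $1"
  else
    # Same digest, or the registry lookup returned nothing: leave things alone.
    echo "noop"
  fi
}
```

Treating an empty remote digest as "do nothing" means a failed registry lookup never triggers a swap.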
The first few automated deployments went off without a hitch. The GitHub trigger fired, Cloud Build completed successfully, Artifact Registry was updated, and the VM's reconcile script kicked in.
Then came the incident. During one of these automated deployments, the process stalled. The reconcile script, which was supposed to pull the new Docker image and swap the running container, just... stopped. It wasn't failing outright; it was stuck in a loop, retrying every 90 seconds. My AI product was effectively frozen in its current state, unable to update.
Panic set in. I checked the logs. The reconcile script was indeed stuck trying to pull the new image. But why? The network was fine, the permissions were correct, and the image existed in Artifact Registry. I SSH'd into the VM, intending to manually pull the image and force the deployment.
That's when I saw it. My VM's disk was at 100% usage. Completely full. The error from the `docker pull` command, buried in the logs of the failed reconcile attempts, made the cause plain: it couldn't extract the new image layers because there was no space left on the device.
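For anyone hitting the same wall, a quick way to confirm the diagnosis is to look at filesystem usage and then at Docker's own accounting. The sketch below wraps the two checks in hypothetical helper functions (`disk_usage_pct` and `report_disk` are names made up for this post); `df -P` and `docker system df` are the standard commands underneath.

```shell
#!/usr/bin/env bash
# Illustrative helpers; the function names are hypothetical.

# disk_usage_pct PATH
# Prints the used percentage (digits only) of the filesystem holding PATH.
# df -P forces the POSIX one-line-per-filesystem layout, so column 5 is capacity.
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# report_disk
# Shows root filesystem usage, then Docker's breakdown of the space used by
# images, containers, local volumes, and build cache.
report_disk() {
  echo "root filesystem: $(disk_usage_pct /)% used"
  docker system df
}
```

Calling `report_disk` on the VM shows how much of the used space Docker attributes to images and how much of that it considers reclaimable.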
The Root Cause: Docker Image Bloat
Each automated build and deployment was pulling a new version of the Docker image. My VM was running the latest, but the previous versions were still being stored locally. Over time, with multiple deployments, these old, unused Docker images had accumulated, silently consuming all available disk space on my small 30GB VM. The `docker pull` command needed space to download and extract the new image layers, and with the disk full, it couldn't proceed. Every reconcile attempt failed at the pull step, and the 90-second timer just kept retrying the same failed operation, thrashing the system.
The immediate fix was to manually clear some space. I ran `docker image prune -a -f` to remove all dangling and unused images. This freed up several gigabytes, allowing the `docker pull` to finally succeed and the deployment to complete. However, this was a manual, reactive fix. The underlying problem, uncontrolled image accumulation, would happen again.
The Real Fix: Automating Disk Cleanup
The lesson was clear: any automated deployment process involving Docker on a resource-constrained VM must include automated disk cleanup. The fix was to integrate Docker's image pruning directly into the deployment script. I modified the `reconcile_image.sh` script (commit 83d876d) to include:
- Before pulling: Run `docker image prune -f` to remove dangling images that might be taking up space unnecessarily.
- After a successful deployment: Run `docker image prune -a -f`. This command removes every image that is not referenced by at least one container, whether or not the image is tagged. That leaves only the image backing the running container, plus a previous version if a container kept for rollback still references it.
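The two pruning steps above could be wired around the pull and swap roughly as follows. This is a sketch under assumptions, not the contents of commit 83d876d: `deploy_with_prune` and `swap_container` are hypothetical names, and `swap_container` is only a placeholder for the zero-downtime swap logic.

```shell
#!/usr/bin/env bash
# Illustrative sketch; not the actual reconcile_image.sh.

# swap_container IMAGE_REF
# Hypothetical placeholder for the real zero-downtime swap.
swap_container() {
  echo "would swap the running container to $1"
}

# deploy_with_prune IMAGE_REF
# Prune dangling images, pull, swap, then prune everything unreferenced.
deploy_with_prune() {
  local image="$1"

  # Before pulling: remove dangling images. A failed prune is logged but
  # does not block the deploy.
  docker image prune -f || echo "warn: pre-pull prune failed" >&2

  # Stop here if the pull or the swap fails, so the aggressive prune below
  # only ever runs after a successful deployment.
  docker pull "$image" || return 1
  swap_container "$image" || return 1

  # After a successful deployment: remove every image that no container
  # references any more.
  docker image prune -a -f || echo "warn: post-deploy prune failed" >&2
}
```

Running the `-a` prune only after a successful swap is a deliberate choice in this sketch: if the new image fails to pull or start, the previous image is still on disk.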
This addition ensures that the disk space used by Docker images is actively managed. The goal is to keep that usage bounded, roughly by the size of the image in use plus a small buffer. On my VM, overall disk usage settled at around 81% after the change, leaving about 5.4GB free.
This incident was a stark reminder that even with sophisticated CI/CD pipelines and cloud builds, the edge (the single VM running the application) has its own unique set of challenges. Disk space management, often an afterthought, became a critical failure point. By adding automated pruning, I've turned a potential recurring disaster into a managed aspect of the deployment pipeline.
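One way to keep an eye on that bound is a small headroom check that a script can run before attempting a pull. This is an optional extra sketched for illustration, not something the commit above is stated to include; `free_kb` and `has_headroom` are hypothetical names, and the threshold is whatever the caller passes in.

```shell
#!/usr/bin/env bash
# Illustrative helpers; the function names are hypothetical.

# free_kb PATH
# Prints the available space, in 1024-byte blocks, on the filesystem holding PATH.
free_kb() {
  df -Pk "$1" | awk 'NR==2 { print $4 }'
}

# has_headroom PATH MIN_FREE_KB
# Succeeds when at least MIN_FREE_KB kilobytes are available under PATH.
has_headroom() {
  local avail
  avail="$(free_kb "$1")" || return 1
  [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}
```

For example, `has_headroom /var/lib/docker 2097152 || echo "less than 2GB free, skipping pull" >&2` would log a clear reason instead of letting the pull fail partway through; the 2GB figure is an arbitrary example, and `/var/lib/docker` is Docker's default data directory on Linux.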
...building aicoreutility.com in the open... aicoreutility.com