System Operation and Deployment Documentation Guide for Infra Developers
Ever been stuck trying to figure out deployment system operations and automation because the documentation was a mess? I definitely have. Especially when multiple systems are tangled together, it's incredibly tough to grasp and document each one's unique operating methods and deployment processes. This post is about how I navigated that chaos and documented the various system operations and deployments for Lab S2.
Attempts and Pitfalls
Initially, I started by gathering scattered information for each system into a single file named CLAUDE.md. This covered everything from devto pure logic verification to GitHub trigger auto-deployment, riel_agent push auto-deployment, poller auto-container deployment, Cloud Build containerization prep, GCE operational stabilization, and local Postgres DB backups. It was a lot to cover.
As I added and modified records for each topic, I ran into a few hurdles. For example, the poller auto-container deployment attempt didn't go as planned, and I had to roll it back. Figuring out what exactly needed to be adjusted took some time.
# What I Tried (Simplified)
- devto logic verification and expansion
- GitHub trigger auto-deployment: live/disk prune fix
- riel_agent push auto-deployment infrastructure documentation
- Poller auto-container deployment attempt (failed and rolled back)
- Cloud Build containerization preparation and proof
- GCE operational stabilization history
- Local Postgres DB backup records
While the act of adding and modifying content in CLAUDE.md itself was relatively straightforward, setting a standard for *how much* detail to include for each piece of information was challenging. Too much detail makes it hard to read, and too little can make it impossible to understand later.
The Root Cause
Ultimately, two standards were missing: how much to document, and how to organize it systematically. The depth each system needed varied with its characteristics, which made it difficult to cover everything consistently in a single document. It was also hard to keep the document in step with changes to system operations and deployments as they happened.
The Solution
The solution was to consolidate all of these processes into CLAUDE.md and keep them updated there. Each system gets its own section with a stated goal, the key content, and, where one exists, the result, written at the level of detail that system actually needs.
# CLAUDE.md Example (Excerpts)
## devto Pure Logic Verification and Expansion
- **Goal**: Verify devto's pure logic and prepare the foundation for expansion if needed.
- **Key Content**:
- [Verification Script]
```python
# Mock example of the verification logic
def validate_devto_logic(input_data):
    """Return (is_valid, message) for a candidate input payload."""
    if not isinstance(input_data, dict):
        return False, "Invalid input type"
    if "key" not in input_data or not isinstance(input_data["key"], str):
        return False, "Missing or invalid 'key'"
    # ... additional logic validation ...
    return True, "Validation successful"


# Example usage
test_data = {"key": "sample_key", "value": 123}
is_valid, message = validate_devto_logic(test_data)
print(f"Validation result: {is_valid}, Message: {message}")
```
- [Expansion Directions]: Adding new API endpoints, improving data processing methods, etc.
## GitHub Trigger Auto-Deployment: Live and Disk Prune Fix
- **Goal**: Stabilize the auto-deployment pipeline triggered by GitHub Push and clean up unnecessary disk space.
- **Key Content**:
- **Issue**: Deployment failure upon pushing to a specific branch, build agent disk space shortage.
- **Solution**:
- Modified `github_actions_workflow.yml`: Fixed deployment script errors.
```yaml
# github_actions_workflow.yml (partial)
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.x'
      - name: Deploy Application
        run: |
          ./scripts/deploy.sh production
          echo "Deployment successful!"
```
- Automated disk prune script: Included in CI/CD pipeline or added as a `cron` job.
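```bash
#!/bin/bash
# prune_disk.sh example
set -euo pipefail

echo "Cleaning up old Docker images..."
# -a removes all unused images, not only dangling ones; -f skips the prompt
docker image prune -af

echo "Cleaning up old build artifacts..."
# Delete files under /app/builds that were last modified more than 7 days ago
find /app/builds -type f -mtime +7 -delete
echo "Cleanup complete."
```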
- **Result**: Achieved 99% deployment success rate, reduced build agent disk usage by 30%.
## riel_agent Push Auto-Deployment Infrastructure
- **Goal**: Build and operate infrastructure for service push auto-deployment using riel_agent.
- **Key Content**:
- Architecture diagram (text-based or image link)
- Configuration file example (`riel_agent.conf`)
```ini
[agent]
listen_port = 8080
log_level = info
[deployment]
repository = git@github.com:your_org/your_repo.git
branch = main
deploy_script = /opt/riel/deploy.sh
```
- Deployment script (`/opt/riel/deploy.sh`)
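The loader that riel_agent itself uses is not shown here, so the following is only a sketch. Assuming the file is plain INI as in the example above, Python's standard `configparser` can confirm that the expected keys are present before a deployment runs. `REQUIRED_KEYS` and `find_missing_keys` are hypothetical names introduced for illustration, and treating every key as required is an assumption.

```python
import configparser

# Keys copied from the riel_agent.conf example; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = {
    "agent": ["listen_port", "log_level"],
    "deployment": ["repository", "branch", "deploy_script"],
}


def find_missing_keys(config_text: str) -> list[str]:
    """Return 'section.key' names that are absent from the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        for key in keys:
            # has_option returns False when the section itself is missing
            if not parser.has_option(section, key):
                missing.append(f"{section}.{key}")
    return missing


EXAMPLE_CONF = """
[agent]
listen_port = 8080
log_level = info
[deployment]
repository = git@github.com:your_org/your_repo.git
branch = main
deploy_script = /opt/riel/deploy.sh
"""

print(find_missing_keys(EXAMPLE_CONF))  # prints []
```

Running a check like this in the deployment pipeline would surface a missing key as a named error rather than a failed deployment.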
## Poller Auto-Container Deployment Attempt (Failed and Rolled Back)
- **Goal**: Attempt auto-deployment of the poller service as a container.
- **Attempted Method**: Periodic deployment attempt using Kubernetes CronJob.
- **Reason for Failure**: Container startup failure due to deployment configuration errors, leading to rollback.
- **Improvement Direction**: Re-evaluate deployment parameters, clarify rollback strategy.
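One way to make the rollback strategy explicit is to treat "deploy, check health, fall back" as a single step. The sketch below is not the procedure that was actually used for the poller. It assumes two hypothetical callables, `deploy` and `is_healthy`, which would wrap whatever the real cluster tooling provides.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[str], None],
    is_healthy: Callable[[], bool],
    new_version: str,
    previous_version: str,
) -> str:
    """Deploy new_version; fall back to previous_version on failure.

    Returns the version that is active when the function exits.
    """
    try:
        deploy(new_version)
        if is_healthy():
            return new_version
        print(f"{new_version} failed its health check; rolling back")
    except Exception as exc:  # e.g. the container failed to start
        print(f"deploying {new_version} raised {exc!r}; rolling back")
    deploy(previous_version)
    return previous_version


# Hypothetical usage with stand-in callables instead of a real cluster.
deployed = []
active = deploy_with_rollback(
    deploy=deployed.append,
    is_healthy=lambda: False,  # simulate a container that never becomes ready
    new_version="poller:v2",
    previous_version="poller:v1",
)
print(active, deployed)
```

Writing the fallback into the same function as the deployment means the rollback path is exercised by the same code every time, rather than being reconstructed by hand after a failure.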
## Cloud Build Containerization Preparation and Proof
- **Goal**: Prepare for building and managing container images using Cloud Build.
- **Key Content**:
- `cloudbuild.yaml` configuration example
```yaml
# cloudbuild.yaml
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/my-app:$COMMIT_SHA', '.']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/my-app:$COMMIT_SHA']
images:
  - 'gcr.io/$PROJECT_ID/my-app:$COMMIT_SHA'
```
- Results of container image build and push tests.
## GCE Operational Stabilization History
- **Goal**: Ensure operational stability of Google Compute Engine instances.
- **Key Activities**:
- Setting up and analyzing monitoring metrics (CPU, memory, network traffic).
- Recording patch applications and security updates.
- Documenting procedures for handling incidents.
## Local Postgres DB Backup
- **Goal**: Regular backups of the local development environment's Postgres database.
- **Backup Script**:
```bash
#!/bin/bash
set -euo pipefail

DB_USER="your_db_user"
DB_NAME="your_db_name"
BACKUP_DIR="/path/to/backups"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.backup"

mkdir -p "$BACKUP_DIR"

# Custom-format dump (-F c), including large objects (-b), with verbose output (-v)
pg_dump -U "$DB_USER" -d "$DB_NAME" -F c -b -v -f "$BACKUP_FILE"
echo "Database backup created at $BACKUP_FILE"

# Delete old backup files (e.g., older than 7 days); only *.backup files are touched
find "$BACKUP_DIR" -type f -name "*.backup" -mtime +7 -delete
echo "Old backups cleaned up."
```
- Backup frequency and retention policy.
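The retention rule can also be expressed against the timestamp embedded in each file name, which makes it easy to preview what would be deleted. This is a minimal sketch that assumes the `<db>_<YYYYmmdd_HHMMSS>.backup` naming produced by the script above. `backup_timestamp` and `expired_backups` are hypothetical helpers, and the rule differs slightly from `find -mtime`, which looks at file modification time rather than the name.

```python
from datetime import datetime, timedelta


def backup_timestamp(filename: str) -> datetime:
    """Parse the trailing YYYYmmdd_HHMMSS stamp from '<db>_<stamp>.backup'."""
    stem = filename.rsplit(".", 1)[0]  # drop the '.backup' extension
    stamp = "_".join(stem.split("_")[-2:])  # last two '_'-separated fields
    return datetime.strptime(stamp, "%Y%m%d_%H%M%S")


def expired_backups(filenames, now, keep_days=7):
    """Return the names whose embedded timestamp is older than keep_days."""
    cutoff = now - timedelta(days=keep_days)
    return sorted(name for name in filenames if backup_timestamp(name) < cutoff)


names = [
    "your_db_name_20240101_120000.backup",
    "your_db_name_20240109_120000.backup",
]
# With a 7-day window on 2024-01-10, only the 2024-01-01 backup is expired.
print(expired_backups(names, now=datetime(2024, 1, 10, 12, 0, 0)))
```

Because the function only returns names, it can serve as a dry run before any file is actually removed.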
Results
- The documentation for Lab S2's deployment systems and DB operations has been consolidated and updated into a single file, CLAUDE.md.
- The operating methods, deployment processes, and configuration information for each system are now systematically recorded, improving readability.
- New team members, and anyone who needs to understand an existing system, now have clear documentation to work from.
In Summary — To Avoid the Same Pitfalls
- [ ] Define the list of systems to document in advance. Clearly outlining which systems and what information to record is the first step.
- [ ] Determine the level of detail for documentation. Set a standard that keeps only the necessary information, neither too shallow nor too deep.
- [ ] Maintain a consistent format and structure. Using templates or establishing rules is recommended to avoid confusion, even when multiple people are working together.
- [ ] Plan for regular updates. Develop a habit of updating documentation immediately when system changes occur.
- [ ] Utilize version control. Tracking document changes using tools like Git is highly recommended.
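The "consistent format" item can be checked mechanically. As a minimal sketch, assuming each system section in CLAUDE.md starts with a `## ` heading and is expected to contain a `**Goal**` bullet (as in the excerpts above), the hypothetical helper below lists sections that lack one. It does not skip fenced code blocks, so a `## ` line inside a code fence would be miscounted as a heading.

```python
def sections_missing_goal(markdown_text: str) -> list[str]:
    """Return the titles of '## ' sections that contain no '**Goal**' line."""
    missing = []
    current = None  # title of the section being scanned
    has_goal = False
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            # Close out the previous section before starting a new one
            if current is not None and not has_goal:
                missing.append(current)
            current = line[3:].strip()
            has_goal = False
        elif "**Goal**" in line:
            has_goal = True
    if current is not None and not has_goal:
        missing.append(current)
    return missing


SAMPLE = """\
## Local Postgres DB Backup
- **Goal**: Regular backups of the local Postgres database.
## GCE Operational Stabilization History
- **Key Activities**: monitoring, patching
"""

print(sections_missing_goal(SAMPLE))
# prints ['GCE Operational Stabilization History']
```

A check like this could run in CI or as a pre-commit hook, so a section added without the agreed fields is flagged when it is written rather than discovered later.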