Improving Backend Error Handling: Building User-Friendly Screens, Auto-Recovery, and Information Collection Systems

The backend used to show a generic 'Application error' message that left users unsure what had happened or what to do next. The system also had no way to recover automatically or to gather information when an error occurred, which made operations difficult. In this post, I share how I addressed these problems and improved operational stability.

Attempts and Pitfalls

First, I replaced the terse 'Application error' message with a user-friendly screen. The goal was to tell users clearly what went wrong and what to do next.

<!-- Old Error Page (Example) -->
<h1>Application Error</h1>
<p>An unexpected error occurred. Please try again later.</p>

Next, I added functionality to automatically recover the system when an error occurred. This was to minimize service downtime caused by recurring errors. I also built a system to automatically collect relevant information when an error occurred. I believed this would help identify frequent error types and find root causes.

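# Auto-recovery logic on error (Conceptual Example)
import logging

logger = logging.getLogger(__name__)

def handle_error_and_recover(error_details):
    log_error(error_details)
    # Report success only if the recovery attempt actually succeeded
    if is_recoverable(error_details) and attempt_recovery():
        return "Recovered successfully"
    trigger_alert_to_ops(error_details)
    return "Error logged, manual intervention required"

def is_recoverable(error_details):
    # Determine recoverability based on specific error codes or patterns
    return error_details.get("code") in ["TEMP_UNAVAILABLE", "NETWORK_ISSUE"]

def attempt_recovery():
    # Attempt recovery like restarting the service, clearing cache, etc.
    logger.info("Attempting to restart service...")
    # Implement actual recovery logic here; return True only on success
    return True

def log_error(error_details):
    # Placeholder: record the error locally
    logger.error("Error occurred: %s", error_details)

def trigger_alert_to_ops(error_details):
    # Placeholder: page or notify the operations team
    logger.critical("Manual intervention required: %s", error_details)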

Initially, I just focused on making the error messages look better. However, simply creating user-friendly screens didn't solve the underlying issues. The system would still crash on errors, and it was hard to pinpoint the cause. Implementing the auto-recovery feature, in particular, led to unexpected exceptions, and I spent hours debugging.

// Log example when collecting error information
{
  "timestamp": "2026-06-11T10:30:00Z",
  "error_code": "DB_CONNECTION_FAILED",
  "message": "Failed to connect to database: timeout expired",
  "service_name": "user-service",
  "request_id": "abc123xyz789",
  "stack_trace": "...",
  "environment": "production"
}
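A record like the one above could be assembled in Python roughly as follows. This is a minimal sketch, not the code from the actual system: `build_error_record` and its parameters are hypothetical names, and the fields simply mirror the log example.

```python
import json
import traceback
from datetime import datetime, timezone

def build_error_record(exc, error_code, service_name, request_id, environment):
    """Build a JSON-serializable error record (hypothetical helper).

    The field names mirror the log example above; they are not a fixed schema.
    """
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "error_code": error_code,
        "message": str(exc),
        "service_name": service_name,
        "request_id": request_id,
        # Format the traceback from the exception object itself.
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "environment": environment,
    }

if __name__ == "__main__":
    try:
        raise TimeoutError("Failed to connect to database: timeout expired")
    except TimeoutError as exc:
        record = build_error_record(
            exc, "DB_CONNECTION_FAILED", "user-service", "abc123xyz789", "production"
        )
    print(json.dumps(record, indent=2))
```

Building the record in one place keeps every service reporting the same fields, which makes later aggregation simpler.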

Cause

The old 'Application error' message exposed technical details, which confused users unnecessarily. In addition, the system had no mechanism to recover from errors on its own, and because information about when and how errors occurred was not collected systematically, resolving problems took a long time.

Solution

I implemented user-friendly error screens that provided understandable messages instead of technical jargon, along with guidance on the next steps.

<!-- Improved Error Page (Example) -->
<h1>Sorry, a temporary issue has occurred.</h1>
<p>We apologize for the inconvenience. Please try again shortly, and it should work normally.</p>
<p>If the problem persists, please contact customer support.</p>
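On the server side, a page like this can be produced by mapping internal error codes to user-facing text, so that technical details never reach the browser. The sketch below is illustrative only: `USER_MESSAGES`, `render_error_page`, and the optional reference line are assumptions, not part of the original implementation.

```python
import html

# Hypothetical mapping from internal error codes to user-facing text.
USER_MESSAGES = {
    "TIMEOUT": "The request took too long. Please try again shortly.",
    "RESOURCE_EXHAUSTED": "The service is busy right now. Please try again shortly.",
}
DEFAULT_MESSAGE = "Sorry, a temporary issue has occurred. Please try again shortly."

def render_error_page(error_code, request_id=None):
    """Return an HTML fragment that contains no internal details."""
    message = USER_MESSAGES.get(error_code, DEFAULT_MESSAGE)
    parts = [
        f"<h1>{html.escape(message)}</h1>",
        "<p>If the problem persists, please contact customer support.</p>",
    ]
    if request_id:
        # A reference ID lets support find the matching log entry.
        parts.append(f"<p>Reference: {html.escape(request_id)}</p>")
    return "\n".join(parts)
```

Unknown codes fall back to the generic apology, so a new error type never leaks its raw message to users.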

I added recovery logic, such as automatically restarting the system or adjusting related configurations when an error occurred.

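# Improved error handling and recovery logic (Conceptual Example)
import os
import traceback

def robust_error_handler(exception):
    error_info = collect_error_details(exception)
    # log_error_to_central_system is defined in the logging example later in this post
    log_error_to_central_system(error_info)

    if is_service_degraded(error_info):
        attempt_auto_recovery(error_info)
    else:
        notify_operations_team(error_info)

    display_user_friendly_error_page()

def collect_error_details(exception):
    # Extract necessary info from the exception object (error code, message, stack trace, etc.)
    # Format the traceback from the exception itself so this also works outside an except block
    stack_trace = "".join(
        traceback.format_exception(type(exception), exception, exception.__traceback__)
    )
    return {
        "code": getattr(exception, "error_code", "UNKNOWN"),
        "message": str(exception),
        "stack_trace": stack_trace,
        "service": os.environ.get("SERVICE_NAME", "unknown-service"),
    }

def is_service_degraded(error_info):
    # Determine if recovery is needed based on specific error codes or frequency
    return error_info.get("code") in ["TIMEOUT", "RESOURCE_EXHAUSTED"]

def attempt_auto_recovery(error_info):
    print(f"Attempting auto-recovery for error: {error_info.get('code')}")
    # Actual recovery logic: restart service, reload config, etc.
    if error_info.get("code") == "TIMEOUT":
        print("Restarting dependent service...")
        # dependent_service.restart()

def notify_operations_team(error_info):
    # Placeholder: send an alert through your paging or chat tool
    print(f"Notifying operations team: {error_info.get('code')}")

def display_user_friendly_error_page():
    # Placeholder: render the improved error page shown above
    print("Rendering user-friendly error page...")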
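The comment in `is_service_degraded` mentions frequency, but the example only checks error codes. One possible way to add a frequency check is a sliding-window counter, sketched below. `ErrorRateTracker`, its default threshold of 5 errors, and the 60-second window are all assumptions chosen for illustration.

```python
import time
from collections import defaultdict, deque

class ErrorRateTracker:
    """Counts recent errors per code in a sliding time window (illustrative)."""

    def __init__(self, threshold=5, window_seconds=60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # code -> timestamps of recent errors

    def record(self, code, now=None):
        """Record one error; return True once the code reaches the threshold."""
        now = time.monotonic() if now is None else now
        events = self._events[code]
        events.append(now)
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold
```

A handler could call `tracker.record(error_info["code"])` and trigger recovery only when it returns True, so a single transient timeout does not restart anything.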

Finally, I built a feature to automatically collect and store information about when errors occurred, their types, and related request details in a central system. This has allowed me to analyze error patterns and proactively address issues.

# Logging error information to a central system (Example)
import requests

def log_error_to_central_system(error_info):
    central_logging_url = "http://your-central-logging-service.internal/log"
    try:
        # Set a timeout so a slow logging service cannot stall the error handler
        response = requests.post(central_logging_url, json=error_info, timeout=5)
        response.raise_for_status()  # Raise an exception for HTTP errors
        print("Error logged to central system successfully.")
    except requests.exceptions.RequestException as e:
        print(f"Failed to log error to central system: {e}")
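Once records are stored centrally, spotting frequent error types can start as a simple count. The sketch below assumes each stored record is a dict with the `service` and `code` keys produced by `collect_error_details`. `top_error_codes` is a hypothetical helper, and how the records are fetched from the central system is left out.

```python
from collections import Counter

def top_error_codes(records, n=3):
    """Return the n most frequent (service, code) pairs in stored records."""
    counts = Counter(
        (r.get("service", "unknown-service"), r.get("code", "UNKNOWN"))
        for r in records
    )
    return counts.most_common(n)

if __name__ == "__main__":
    sample = [
        {"service": "user-service", "code": "TIMEOUT"},
        {"service": "user-service", "code": "TIMEOUT"},
        {"service": "user-service", "code": "RESOURCE_EXHAUSTED"},
    ]
    for (service, code), count in top_error_codes(sample):
        print(f"{service} {code}: {count}")
```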

Results

User experience has significantly improved, reducing confusion when errors occur.
Service downtime has decreased thanks to the auto-recovery feature.
Problem resolution speed has improved due to systematic error information collection.

Summary — To Avoid the Same Pitfalls

[ ] Make error messages user-friendly, minimizing technical details.
[ ] Define and implement scenarios for automatic error recovery in advance.
[ ] Build a system to record detailed information about error occurrences (time, type, related info) and manage it centrally.
[ ] Thoroughly consider and test potential exceptions when implementing recovery logic.
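For the last item, one way to keep recovery code from raising into the error handler is to wrap it in a guard with bounded retries. This is a minimal sketch under stated assumptions: `run_recovery_safely`, the retry count, and the exponential backoff delays are illustrative choices, not what the original system used.

```python
import logging
import time

logger = logging.getLogger(__name__)

def run_recovery_safely(recover, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run a recovery callable with bounded retries; never raise to the caller.

    Returns True if any attempt succeeded, False otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            recover()
            return True
        except Exception:
            # Log with the traceback, then back off before trying again.
            logger.exception("Recovery attempt %d/%d failed", attempt, max_attempts)
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    return False
```

Passing `sleep` as a parameter lets tests replace the real delay with a stub, which makes the failure paths quick to exercise.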