Improving Backend Error Handling: Building User-Friendly Screens, Auto-Recovery, and Information Collection Systems
The previous generic 'Application error' message was confusing for users. Additionally, the lack of auto-recovery and information gathering capabilities during errors made operations difficult. In this post, I want to share my experience of solving these problems and improving operational stability.
Attempts and Pitfalls
First, I replaced the terse 'Application error' message with a user-friendly screen. The goal was to tell users clearly what went wrong and how to proceed.
<!-- Old Error Page (Example) -->
<h1>Application Error</h1>
<p>An unexpected error occurred. Please try again later.</p>
Next, I added functionality to automatically recover the system when an error occurred. This was to minimize service downtime caused by recurring errors. I also built a system to automatically collect relevant information when an error occurred. I believed this would help identify frequent error types and find root causes.
# Auto-recovery logic on error (Conceptual Example)
def handle_error_and_recover(error_details):
    log_error(error_details)
    if is_recoverable(error_details):
        attempt_recovery()
        return "Recovered successfully"
    else:
        trigger_alert_to_ops()
        return "Error logged, manual intervention required"


def is_recoverable(error_details):
    # Determine recoverability based on specific error codes or patterns
    return error_details.get("code") in ["TEMP_UNAVAILABLE", "NETWORK_ISSUE"]


def attempt_recovery():
    # Attempt recovery like restarting the service, clearing cache, etc.
    print("Attempting to restart service...")
    # Implement actual recovery logic here


def log_error(error_details):
    # Placeholder: write the error to the application log
    print(f"Error: {error_details}")


def trigger_alert_to_ops():
    # Placeholder: page or message the operations team
    print("Alerting operations team...")
Initially, I just focused on making the error messages look better. However, simply creating user-friendly screens didn't solve the underlying issues. The system would still crash on errors, and it was hard to pinpoint the cause. Implementing the auto-recovery feature, in particular, led to unexpected exceptions, and I spent hours debugging.
// Log example when collecting error information
{
  "timestamp": "2026-06-11T10:30:00Z",
  "error_code": "DB_CONNECTION_FAILED",
  "message": "Failed to connect to database: timeout expired",
  "service_name": "user-service",
  "request_id": "abc123xyz789",
  "stack_trace": "...",
  "environment": "production"
}
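A record with the fields shown in the log example could be assembled from a caught exception roughly as follows. This is a minimal sketch, not the code from my service: `build_error_record` and `DatabaseConnectionError` are hypothetical names, and the `error_code` attribute on the exception is an assumption about how application errors are defined.

```python
import json
import traceback
from datetime import datetime, timezone


def build_error_record(exc, service_name, request_id, environment):
    """Turn an exception into a JSON-serializable record (hypothetical helper)."""
    return {
        # UTC timestamp in ISO 8601 form with a trailing "Z"
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Assumption: application exceptions carry an `error_code` attribute
        "error_code": getattr(exc, "error_code", "UNKNOWN"),
        "message": str(exc),
        "service_name": service_name,
        "request_id": request_id,
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "environment": environment,
    }


class DatabaseConnectionError(Exception):
    """Hypothetical application exception used only for this example."""
    error_code = "DB_CONNECTION_FAILED"


try:
    raise DatabaseConnectionError("Failed to connect to database: timeout expired")
except DatabaseConnectionError as exc:
    record = build_error_record(exc, "user-service", "abc123xyz789", "production")

print(json.dumps(record, indent=2))
```

Passing the request ID in explicitly keeps the helper independent of any particular web framework; in practice it would come from whatever request context the service already has.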
Cause
The old 'Application error' message read like technical jargon and gave users no guidance, causing unnecessary confusion. Furthermore, the system had no mechanism to recover from errors on its own, and because information about when and how errors occurred was not collected systematically, problem resolution took a long time.
Solution
I implemented user-friendly error screens that provided understandable messages instead of technical jargon, along with guidance on the next steps.
<!-- Improved Error Page (Example) -->
<h1>Sorry, a temporary issue has occurred.</h1>
<p>We apologize for the inconvenience. Please try again shortly, and it should work normally.</p>
<p>If the problem persists, please contact customer support.</p>
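One way to keep technical details off the page is to map internal error codes to user-facing wording on the server before rendering. The sketch below is illustrative only: `USER_MESSAGES`, `render_error_page`, and the "Reference ID" line are hypothetical additions, not part of the page above.

```python
from html import escape

# Hypothetical mapping from internal error codes to user-facing wording.
USER_MESSAGES = {
    "TIMEOUT": "The request took longer than expected. Please try again shortly.",
    "RESOURCE_EXHAUSTED": "We are receiving a lot of requests right now. "
                          "Please try again in a few minutes.",
}
DEFAULT_MESSAGE = "We apologize for the inconvenience. Please try again shortly."


def render_error_page(error_code, request_id=None):
    """Build the HTML body of the error page without exposing internal details."""
    message = USER_MESSAGES.get(error_code, DEFAULT_MESSAGE)
    lines = [
        "<h1>Sorry, a temporary issue has occurred.</h1>",
        f"<p>{escape(message)}</p>",
        "<p>If the problem persists, please contact customer support.</p>",
    ]
    if request_id is not None:
        # A reference ID lets support find the matching log entry.
        lines.append(f"<p>Reference ID: {escape(str(request_id))}</p>")
    return "\n".join(lines)


print(render_error_page("TIMEOUT", request_id="abc123xyz789"))
```

The internal code never appears in the output; the only identifier a user sees is the reference ID, which support staff can match against the collected error records.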
I added recovery logic, such as automatically restarting the system or adjusting related configurations when an error occurred.
# Improved error handling and recovery logic (Conceptual Example)
import os
import traceback


def robust_error_handler(exception):
    error_info = collect_error_details(exception)
    # log_error_to_central_system is defined in the central logging example;
    # notify_operations_team and display_user_friendly_error_page are
    # application-specific and not shown here.
    log_error_to_central_system(error_info)
    if is_service_degraded(error_info):
        attempt_auto_recovery(error_info)
    else:
        notify_operations_team(error_info)
    display_user_friendly_error_page()


def collect_error_details(exception):
    # Extract necessary info from the exception object (error code, message, stack trace, etc.)
    return {
        "code": getattr(exception, "error_code", "UNKNOWN"),
        "message": str(exception),
        # Format from the exception itself so this also works outside an except block
        "stack_trace": "".join(
            traceback.format_exception(type(exception), exception, exception.__traceback__)
        ),
        "service": os.environ.get("SERVICE_NAME", "unknown-service"),
    }


def is_service_degraded(error_info):
    # Determine if recovery is needed based on specific error codes or frequency
    return error_info.get("code") in ["TIMEOUT", "RESOURCE_EXHAUSTED"]


def attempt_auto_recovery(error_info):
    print(f"Attempting auto-recovery for error: {error_info.get('code')}")
    # Actual recovery logic: restart service, reload config, etc.
    if error_info.get("code") == "TIMEOUT":
        print("Restarting dependent service...")
        # dependent_service.restart()
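The comment in `is_service_degraded` mentions frequency, but the example only checks error codes. A frequency check could look something like the sliding-window counter below. `ErrorRateTracker`, its threshold, and its window length are hypothetical and would need tuning for a real service.

```python
import time
from collections import deque


class ErrorRateTracker:
    """Counts errors inside a sliding time window (hypothetical helper)."""

    def __init__(self, threshold=5, window_seconds=60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now=None):
        """Record one error; return True if the window holds `threshold` or more."""
        now = time.monotonic() if now is None else now
        self._timestamps.append(now)
        # Drop entries that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


tracker = ErrorRateTracker(threshold=3, window_seconds=10.0)
print(tracker.record(now=0.0))   # False: 1 error in the window
print(tracker.record(now=1.0))   # False: 2 errors in the window
print(tracker.record(now=2.0))   # True: 3 errors within 10 seconds
print(tracker.record(now=30.0))  # False: the earlier errors have expired
```

Triggering recovery only after several errors in a short window avoids restarting a service over a single isolated timeout. The explicit `now` argument exists so the behavior can be tested without waiting on a real clock.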
Finally, I built a feature to automatically collect and store information about when errors occurred, their types, and related request details in a central system. This has allowed me to analyze error patterns and proactively address issues.
# Logging error information to a central system (Example)
import requests


def log_error_to_central_system(error_info):
    central_logging_url = "http://your-central-logging-service.internal/log"
    try:
        # requests has no default timeout, so set one to avoid hanging the error handler
        response = requests.post(central_logging_url, json=error_info, timeout=5)
        response.raise_for_status()  # Raise an exception for HTTP errors
        print("Error logged to central system successfully.")
    except requests.exceptions.RequestException as e:
        # A logging failure should never mask the original error
        print(f"Failed to log error to central system: {e}")
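Once records are stored centrally, spotting patterns can start with something as simple as counting how often each service and error code pair occurs. The sketch below assumes records use the `service_name` and `error_code` fields from the log example; `summarize_errors` is a hypothetical helper, and the sample records are made up for illustration.

```python
from collections import Counter


def summarize_errors(records, top_n=3):
    """Return the most frequent (service_name, error_code) pairs with their counts."""
    counts = Counter(
        (r.get("service_name", "unknown-service"), r.get("error_code", "UNKNOWN"))
        for r in records
    )
    return counts.most_common(top_n)


# Made-up sample records, for illustration only.
sample_records = [
    {"service_name": "user-service", "error_code": "DB_CONNECTION_FAILED"},
    {"service_name": "user-service", "error_code": "DB_CONNECTION_FAILED"},
    {"service_name": "user-service", "error_code": "TIMEOUT"},
    {"service_name": "user-service", "error_code": "DB_CONNECTION_FAILED"},
]

for (service, code), count in summarize_errors(sample_records):
    print(f"{service} {code}: {count}")
```

A real central system would likely run this kind of aggregation in its own query language, but the idea is the same: group by service and code, then look at the top of the list first.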
Results
- User experience has significantly improved, reducing confusion when errors occur.
- Service downtime has decreased thanks to the auto-recovery feature.
- Problem resolution speed has improved due to systematic error information collection.
Summary: To Avoid the Same Pitfalls
- [ ] Make error messages user-friendly, minimizing technical details.
- [ ] Define and implement scenarios for automatic error recovery in advance.
- [ ] Build a system to record detailed information about error occurrences (time, type, related info) and manage it centrally.
- [ ] Thoroughly consider and test potential exceptions when implementing recovery logic.
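
For the last item, the failure mode I ran into was the recovery step itself raising. One way to contain that is to wrap the recovery call so it can never propagate an exception and gives up after a fixed number of attempts. This is a minimal sketch: `run_recovery_safely`, `flaky_restart`, and the attempt limit are hypothetical, not the exact code from my service.

```python
import logging

logger = logging.getLogger("recovery")


def run_recovery_safely(recover, max_attempts=3):
    """Call `recover` up to `max_attempts` times and never let it raise.

    Returns True if one attempt completed without an exception, False otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            recover()
            return True
        except Exception:
            # Log with the traceback and move on to the next attempt.
            logger.exception("Recovery attempt %d of %d failed", attempt, max_attempts)
    return False


# Example: a recovery action that fails twice before succeeding.
calls = {"count": 0}


def flaky_restart():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("dependent service not ready")


recovered = run_recovery_safely(flaky_restart)
print(recovered, calls["count"])
```

When the function returns False, the caller can fall back to alerting the operations team instead of retrying forever. Testing the wrapper with a deliberately failing `recover` function is a cheap way to exercise the exception paths before they happen in production.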