Root Cause Analysis (RCA)
Post-incident documentation and learning
Root Cause Analysis (RCA) reports help you document what happened during an incident, why it happened, and how you'll prevent it from happening again. ATStatus provides a structured RCA editor so you can communicate transparently with your users.
What is Root Cause Analysis?
Root Cause Analysis is a systematic process for identifying the underlying causes of an incident. A well-written RCA:
- Documents the full incident timeline
- Identifies what went wrong and why
- Shows the impact to users
- Details remediation actions taken
- Outlines steps to prevent recurrence
- Builds user trust through transparency
Major cloud providers (AWS, Google Cloud, Azure) publish detailed RCAs for significant outages. Following this practice demonstrates maturity and builds credibility with your users.
When to Write an RCA
Consider creating an RCA for:
| Criteria | Recommendation |
|---|---|
| Major outage (>1 hour) | Always — users expect explanation |
| Data loss or security incident | Always — mandatory for compliance |
| Repeated incidents | Always — shows you're addressing the pattern |
| High user impact | Recommended — even for short outages |
| Minor issues (<15 min) | Optional — brief summary may suffice |
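The table above can be read as a small decision rule. The following is a minimal sketch of that rule, not part of ATStatus: `IncidentFacts`, its field names, and `rca_recommendation` are hypothetical, and the handling of outages between 15 and 60 minutes with low impact is an assumption, since the table does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Hypothetical inputs; these field names are not part of ATStatus."""
    duration_minutes: float
    data_loss_or_security: bool = False
    repeated: bool = False
    high_user_impact: bool = False


def rca_recommendation(facts: IncidentFacts) -> str:
    """Map an incident onto the recommendation column of the table."""
    # Any "Always" row wins over the other rows.
    if (
        facts.duration_minutes > 60
        or facts.data_loss_or_security
        or facts.repeated
    ):
        return "Always"
    if facts.high_user_impact:
        return "Recommended"
    # Assumption: the table does not cover 15-60 minute outages with low
    # impact, so this sketch treats them like minor issues.
    return "Optional"
```

For example, `rca_recommendation(IncidentFacts(duration_minutes=10, high_user_impact=True))` yields `"Recommended"`, matching the "even for short outages" row.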
Creating an RCA
Step 1: Access the RCA Editor
- Navigate to Admin → Incidents
- Find the resolved incident
- Click the incident to open details
- Click Add Root Cause Analysis or Edit RCA
Step 2: Fill in RCA Sections
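The RCA editor provides a structured format with the following sections:
- Summary: a short description of the incident
- Timeline: the full incident timeline
- Root Cause: what went wrong and why
- Impact: how users were affected
- Resolution: the remediation actions taken
- Prevention: steps to prevent recurrence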
Step 3: Review and Publish
- Preview the RCA as users will see it
- Have a colleague review for clarity and accuracy
- Click Save RCA
- The RCA then appears on the public incident page, subject to its visibility setting
RCA Visibility Options
Control who can see your RCA:
| Setting | Description |
|---|---|
| Public | Visible to all status page visitors |
| Authenticated Only | Only visible to logged-in users (if auth enabled) |
| Internal Only | Only visible to admin users (for internal tracking) |
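The three settings can be thought of as a simple access check. The sketch below only illustrates the table and is not the ATStatus implementation: the `Visibility` enum, the `can_view_rca` function, and its parameters are hypothetical, and it assumes that admin users also count as logged-in users.

```python
from enum import Enum


class Visibility(Enum):
    """Hypothetical names for the three settings in the table."""
    PUBLIC = "Public"
    AUTHENTICATED_ONLY = "Authenticated Only"
    INTERNAL_ONLY = "Internal Only"


def can_view_rca(
    visibility: Visibility,
    logged_in: bool = False,
    is_admin: bool = False,
) -> bool:
    """Return True if a viewer with these attributes may see the RCA."""
    if visibility is Visibility.PUBLIC:
        return True
    if visibility is Visibility.AUTHENTICATED_ONLY:
        # Assumption: admins count as logged-in users.
        return logged_in or is_admin
    # INTERNAL_ONLY: admin users only.
    return is_admin
```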
Example RCA
Here's an example of a well-written RCA:
Root Cause Analysis: API Outage - January 15, 2024
Summary
A database connection pool exhaustion issue caused 47 minutes of intermittent API errors between 14:23 and 15:10 UTC on January 15, 2024.
Timeline
- 14:23 UTC: Monitoring detected elevated API error rates
- 14:25 UTC: On-call engineer alerted and investigation began
- 14:35 UTC: Root cause identified as connection pool exhaustion
- 14:42 UTC: Emergency config change deployed, pool size increased
- 15:10 UTC: Error rates returned to normal, incident resolved
Root Cause
A scheduled batch job was updated to process records in parallel, but the new code did not release database connections properly. This exhausted the connection pool (configured at 10 connections), causing new API requests to fail with timeout errors.
Impact
Approximately 23% of API requests failed with 504 errors during the incident window. Web dashboard users experienced slow page loads. Mobile app users saw intermittent "connection failed" messages.
Resolution
We immediately increased the connection pool size from 10 to 50, then rolled back the batch job change and applied a fix so that parallel operations use connection pooling correctly.
Prevention
- Added a monitoring alert for connection pool utilization above 80%
- Implemented a code review checklist for database operations
- Added load testing in the staging environment for batch job changes
- Increased the default connection pool size across all services to 50
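To make the root cause in the example above more concrete, here is a minimal sketch of a fixed-size pool where parallel workers always hand their connection back. It is not the code from the incident: `ConnectionPool`, `process_record`, and the integer "connections" are hypothetical stand-ins, and the pool is a toy built on the standard library.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class ConnectionPool:
    """Toy fixed-size pool; the 'connections' are just integers."""

    def __init__(self, size: int) -> None:
        self.size = size
        self._idle: queue.Queue = queue.Queue()
        for conn_id in range(size):
            self._idle.put(conn_id)

    def utilization(self) -> float:
        """Fraction of connections currently checked out."""
        return (self.size - self._idle.qsize()) / self.size

    @contextmanager
    def connection(self, timeout: float = 1.0):
        # Raises queue.Empty if the pool stays exhausted for `timeout` seconds.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the work raised.
            self._idle.put(conn)


def process_record(pool: ConnectionPool, record: int) -> int:
    with pool.connection() as conn:
        return record * 2  # stand-in for a real query using `conn`


pool = ConnectionPool(size=10)
with ThreadPoolExecutor(max_workers=4) as workers:
    results = list(workers.map(lambda r: process_record(pool, r), range(100)))
```

Because every worker releases its connection in the `finally` clause, the pool is fully idle again once the batch finishes. A worker that took a connection without returning it would leave `pool.utilization()` permanently raised, which is the kind of signal the 80% utilization alert in the prevention list is meant to catch.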
Best Practices
Do
- Be honest and transparent
- Use specific technical details
- Include concrete prevention steps
- Write for a non-technical audience
- Publish within 48-72 hours
- Acknowledge user impact
Don't
- Blame individuals or teams
- Use vague descriptions
- Over-promise on prevention
- Include sensitive details (passwords, IPs)
- Make excuses or deflect
- Leave prevention steps empty
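A few of the practices above (no empty prevention steps, no sensitive details such as passwords or IPs) lend themselves to an automated pre-publish check. The following is a rough sketch under stated assumptions: `lint_rca` is a hypothetical helper, not an ATStatus feature, the section keys are borrowed from the API response shown under API Access, and the regular expressions are deliberately simple, so they will miss some cases and flag some harmless text.

```python
import re

# Section keys as they appear in the API response for an RCA.
REQUIRED_SECTIONS = (
    "summary", "timeline", "rootCause", "impact", "resolution", "prevention",
)

# Rough patterns; they will miss things and may flag harmless text.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SECRET_PATTERN = re.compile(
    r"\b(?:password|passwd|secret|api[_-]?key)\s*[:=]", re.IGNORECASE
)


def lint_rca(rca: dict) -> list:
    """Return a list of human-readable problems found in an RCA draft."""
    problems = []
    for section in REQUIRED_SECTIONS:
        text = rca.get(section) or ""
        if not text.strip():
            problems.append(f"{section}: section is empty")
        if IPV4_PATTERN.search(text):
            problems.append(f"{section}: looks like it contains an IP address")
        if SECRET_PATTERN.search(text):
            problems.append(f"{section}: looks like it contains a credential")
    return problems
```

An empty result means none of these checks fired. It does not replace the colleague review in Step 3, since tone, blame, and over-promising cannot be caught by pattern matching.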
API Access
RCAs can be accessed programmatically via the API:
```
# Get incident with RCA
GET /api/incidents/{id}

# Response includes RCA if present
{
  "id": "inc_abc123",
  "title": "API Outage",
  "status": "resolved",
  "rca": {
    "summary": "Database connection pool exhaustion...",
    "timeline": "...",
    "rootCause": "...",
    "impact": "...",
    "resolution": "...",
    "prevention": "..."
  }
}
```
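A client might consume this endpoint roughly as follows. This is a sketch using only the Python standard library: `fetch_incident` and `extract_rca` are hypothetical helper names, the base URL is a placeholder, and it assumes the endpoint can be read without authentication, which may not hold for your deployment.

```python
import json
import urllib.request
from typing import Optional

# RCA keys as shown in the example response.
RCA_FIELDS = (
    "summary", "timeline", "rootCause", "impact", "resolution", "prevention",
)


def fetch_incident(base_url: str, incident_id: str) -> dict:
    """GET /api/incidents/{id} and decode the JSON body.

    Assumption: no authentication is required; add request headers here
    if your deployment needs them.
    """
    url = f"{base_url.rstrip('/')}/api/incidents/{incident_id}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def extract_rca(incident: dict) -> Optional[dict]:
    """Return the RCA sections, or None when the incident has no RCA."""
    rca = incident.get("rca")
    if not rca:
        return None
    # Missing sections become empty strings so callers can rely on the keys.
    return {field: rca.get(field, "") for field in RCA_FIELDS}
```

Typical use would be `extract_rca(fetch_incident("https://status.example.com", "inc_abc123"))`, where the host name is a placeholder. Because the response only includes `rca` when one is present, callers should handle the `None` case.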