
Root Cause Analysis (RCA)

Post-incident documentation and learning

Root Cause Analysis (RCA) reports help you document what happened during an incident, why it happened, and how you'll prevent it from happening again. ATStatus provides a structured RCA editor that helps you communicate transparently with your users.

What is Root Cause Analysis?

Root Cause Analysis is a systematic process for identifying the underlying causes of an incident. A well-written RCA:

  • Documents the full incident timeline
  • Identifies what went wrong and why
  • Shows the impact to users
  • Details remediation actions taken
  • Outlines steps to prevent recurrence
  • Builds user trust through transparency

Professional Best Practice

Major cloud providers (AWS, Google Cloud, Azure) publish detailed RCAs for significant outages. Following this practice demonstrates maturity and builds credibility with your users.

When to Write an RCA

Consider creating an RCA for:

Criteria | Recommendation
Major outage (>1 hour) | Always — users expect explanation
Data loss or security incident | Always — mandatory for compliance
Repeated incidents | Always — shows you're addressing the pattern
High user impact | Recommended — even for short outages
Minor issues (<15 min) | Optional — brief summary may suffice
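
If you want to encode these criteria in your own tooling, for example to prompt responders after an incident is resolved, a minimal TypeScript sketch of the decision logic could look like the following. The function and field names are illustrative and not part of ATStatus:

// Illustrative helper that encodes the table above; not part of ATStatus.
type RcaRecommendation = "always" | "recommended" | "optional";

interface ResolvedIncident {
  durationMinutes: number;      // total outage duration
  dataLossOrSecurity: boolean;  // data loss or security incident
  isRepeated: boolean;          // same failure mode has occurred before
  highUserImpact: boolean;      // large share of users affected
}

function rcaRecommendation(incident: ResolvedIncident): RcaRecommendation {
  // Data loss, security incidents, and repeated incidents always warrant an RCA.
  if (incident.dataLossOrSecurity || incident.isRepeated) return "always";
  // Major outages (>1 hour) always warrant an RCA.
  if (incident.durationMinutes > 60) return "always";
  // High user impact: recommended even for short outages.
  if (incident.highUserImpact) return "recommended";
  // Minor issues (<15 min): a brief summary may suffice.
  return "optional";
}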

Creating an RCA

Step 1: Access the RCA Editor

  1. Navigate to Admin → Incidents
  2. Find the resolved incident
  3. Click the incident to open details
  4. Click Add Root Cause Analysis or Edit RCA

Step 2: Fill in RCA Sections

The RCA editor provides a structured format with the following sections:

Summary
A brief overview of what happened. This appears prominently on the incident page. Example: "Database connection pool exhaustion caused 47 minutes of API errors."
Timeline
Chronological sequence of events. Include when the issue started, when it was detected, key milestones during response, and when it was resolved.
Root Cause
The underlying technical or process failure that caused the incident. Be specific — "configuration error" is too vague, "expired TLS certificate on load balancer" is better.
Impact
What users experienced. Include affected services, error rates, duration, and number of users impacted if known. "23% of API requests failed for 47 minutes."
Resolution
How the issue was fixed. Include immediate actions taken and any temporary measures. "Increased connection pool size from 10 to 50 and restarted affected services."
Prevention
Long-term actions to prevent recurrence. Include monitoring improvements, process changes, or infrastructure updates. "Added alerting for connection pool utilization above 80%."
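
These six sections map onto the rca object returned by the incidents API (see API Access below). A minimal TypeScript sketch of that shape, assuming each field is a plain string, could be:

// Sketch of the RCA shape; field names follow the /api/incidents/{id}
// response shown under API Access. String types are an assumption.
interface Rca {
  summary: string;     // brief overview of what happened
  timeline: string;    // chronological sequence of events
  rootCause: string;   // underlying technical or process failure
  impact: string;      // what users experienced
  resolution: string;  // how the issue was fixed
  prevention: string;  // long-term actions to prevent recurrence
}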

Step 3: Review and Publish

  1. Preview the RCA as users will see it
  2. Have a colleague review for clarity and accuracy
  3. Click Save RCA
  4. The RCA will appear on the public incident page

RCA Visibility Options

Control who can see your RCA:

Setting | Description
Public | Visible to all status page visitors
Authenticated Only | Only visible to logged-in users (if auth enabled)
Internal Only | Only visible to admin users (for internal tracking)
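
Client code that renders RCAs can gate display on this setting. The sketch below assumes a hypothetical visibility field whose values mirror the table above; the actual field name in ATStatus may differ:

// Hypothetical sketch: the "visibility" values mirror the settings above,
// but the field name itself is an assumption, not a documented API field.
type RcaVisibility = "public" | "authenticated" | "internal";

interface Viewer {
  loggedIn: boolean;
  isAdmin: boolean;
}

function canViewRca(visibility: RcaVisibility, viewer: Viewer): boolean {
  if (visibility === "public") return true;                   // all status page visitors
  if (visibility === "authenticated") return viewer.loggedIn; // logged-in users only
  return viewer.isAdmin;                                       // "internal": admin users only
}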

Example RCA

Here's an example of a well-written RCA:

Root Cause Analysis: API Outage - January 15, 2024

Summary:

A database connection pool exhaustion issue caused 47 minutes of intermittent API errors between 14:23 and 15:10 UTC on January 15, 2024.

Timeline:
  • 14:23 UTC — Monitoring detected elevated API error rates
  • 14:25 UTC — On-call engineer alerted and investigation began
  • 14:35 UTC — Root cause identified as connection pool exhaustion
  • 14:42 UTC — Emergency config change deployed, pool size increased
  • 15:10 UTC — Error rates returned to normal, incident resolved
Root Cause:

A scheduled batch job was updated to process records in parallel without properly releasing database connections. This exhausted the connection pool (configured at 10 connections), causing new API requests to fail with timeout errors.

Impact:

Approximately 23% of API requests failed with 504 errors during the incident window. Web dashboard users experienced slow page loads. Mobile app users saw intermittent "connection failed" messages.

Resolution:

Immediately increased connection pool size from 10 to 50. Rolled back the batch job change and applied a fix to properly use connection pooling in parallel operations.

Prevention:
  • Added monitoring alert for connection pool utilization above 80%
  • Implemented code review checklist for database operations
  • Added load testing for batch job changes to staging environment
  • Increased default connection pool size across all services to 50

Best Practices

✓ Do
  • Be honest and transparent
  • Use specific technical details
  • Include concrete prevention steps
  • Write for a non-technical audience
  • Publish within 48-72 hours
  • Acknowledge user impact

✗ Don't
  • Blame individuals or teams
  • Use vague descriptions
  • Over-promise on prevention
  • Include sensitive details (passwords, IPs)
  • Make excuses or deflect
  • Leave prevention steps empty

API Access

RCAs can be accessed programmatically via the API:

# Get incident with RCA
GET /api/incidents/{id}

# Response includes RCA if present
{
  "id": "inc_abc123",
  "title": "API Outage",
  "status": "resolved",
  "rca": {
    "summary": "Database connection pool exhaustion...",
    "timeline": "...",
    "rootCause": "...",
    "impact": "...",
    "resolution": "...",
    "prevention": "..."
  }
}
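
For example, a small TypeScript client could fetch an incident and read its RCA summary if one is present. The base URL and the authentication header are assumptions about your deployment:

// Minimal sketch: fetch an incident and read its RCA, if any.
// BASE_URL and the Authorization header are deployment-specific placeholders.
const BASE_URL = "https://status.example.com";

async function getRcaSummary(incidentId: string): Promise<string | null> {
  const res = await fetch(`${BASE_URL}/api/incidents/${incidentId}`, {
    headers: { Authorization: "Bearer <token>" }, // only if your instance requires auth
  });
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  const incident = await res.json();
  return incident.rca ? incident.rca.summary : null;
}

// Usage:
// getRcaSummary("inc_abc123").then((summary) => console.log(summary ?? "No RCA yet"));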