
Root Cause Analysis (RCA)

Post-incident documentation and learning

Root Cause Analysis (RCA) reports help you document what happened during an incident, why it happened, and how you'll prevent it from happening again. ATStatus provides a structured RCA editor that helps you communicate transparently with your users.

What is Root Cause Analysis?

Root Cause Analysis is a systematic process for identifying the underlying causes of an incident. A well-written RCA:

  • Documents the full incident timeline
  • Identifies what went wrong and why
  • Shows the impact to users
  • Details remediation actions taken
  • Outlines steps to prevent recurrence
  • Builds user trust through transparency

Professional Best Practice

Major cloud providers (AWS, Google Cloud, Azure) publish detailed RCAs for significant outages. Following this practice demonstrates maturity and builds credibility with your users.

When to Write an RCA

Consider creating an RCA for:

Criteria | Recommendation
Major outage (>1 hour) | Always — users expect explanation
Data loss or security incident | Always — mandatory for compliance
Repeated incidents | Always — shows you're addressing the pattern
High user impact | Recommended — even for short outages
Minor issues (<15 min) | Optional — brief summary may suffice
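
If you want to encode these criteria in your own tooling, for example to prompt responders after an incident is resolved, a minimal TypeScript sketch of the decision logic could look like the following. The function and field names are illustrative and not part of ATStatus:

// Illustrative helper that encodes the table above; not part of ATStatus.
type RcaRecommendation = "always" | "recommended" | "optional";

interface ResolvedIncident {
  durationMinutes: number;      // total outage duration
  dataLossOrSecurity: boolean;  // data loss or security incident
  isRepeated: boolean;          // same failure mode has occurred before
  highUserImpact: boolean;      // large share of users affected
}

function rcaRecommendation(incident: ResolvedIncident): RcaRecommendation {
  // Data loss, security incidents, and repeated incidents always warrant an RCA.
  if (incident.dataLossOrSecurity || incident.isRepeated) return "always";
  // Major outages (>1 hour) always warrant an RCA.
  if (incident.durationMinutes > 60) return "always";
  // High user impact: recommended even for short outages.
  if (incident.highUserImpact) return "recommended";
  // Minor issues (<15 min): a brief summary may suffice.
  return "optional";
}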

Creating an RCA

Step 1: Access the RCA Editor

  1. Navigate to Admin → Incidents
  2. Find the resolved incident
  3. Click the incident to open details
  4. Click Add Root Cause Analysis or Edit RCA

Step 2: Fill in RCA Sections

The RCA editor provides a structured format with the following sections:

Summary
A brief overview of what happened. This appears prominently on the incident page. Example: "Database connection pool exhaustion caused 47 minutes of API errors."
Timeline
Chronological sequence of events. Include when the issue started, when it was detected, key milestones during response, and when it was resolved.
Root Cause
The underlying technical or process failure that caused the incident. Be specific — "configuration error" is too vague, "expired TLS certificate on load balancer" is better.
Impact
What users experienced. Include affected services, error rates, duration, and number of users impacted if known. "23% of API requests failed for 47 minutes."
Resolution
How the issue was fixed. Include immediate actions taken and any temporary measures. "Increased connection pool size from 10 to 50 and restarted affected services."
Prevention
Long-term actions to prevent recurrence. Include monitoring improvements, process changes, or infrastructure updates. "Added alerting for connection pool utilization above 80%."
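
These six sections map onto the rca object returned by the incidents API (see API Access below). A minimal TypeScript sketch of that shape, assuming each field is a plain string, could be:

// Sketch of the RCA shape; field names follow the /api/incidents/{id}
// response shown under API Access. String types are an assumption.
interface Rca {
  summary: string;     // brief overview of what happened
  timeline: string;    // chronological sequence of events
  rootCause: string;   // underlying technical or process failure
  impact: string;      // what users experienced
  resolution: string;  // how the issue was fixed
  prevention: string;  // long-term actions to prevent recurrence
}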

Step 3: Review and Publish

  1. Preview the RCA as users will see it
  2. Have a colleague review for clarity and accuracy
  3. Click Save RCA
  4. The RCA will appear on the public incident page

RCA Visibility Options

Control who can see your RCA:

Setting | Description
Public | Visible to all status page visitors
Authenticated Only | Only visible to logged-in users (if auth enabled)
Internal Only | Only visible to admin users (for internal tracking)
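
Client code that renders RCAs can gate display on this setting. The sketch below assumes a hypothetical visibility field whose values mirror the table above; the actual field name in ATStatus may differ:

// Hypothetical sketch: the "visibility" values mirror the settings above,
// but the field name itself is an assumption, not a documented API field.
type RcaVisibility = "public" | "authenticated" | "internal";

interface Viewer {
  loggedIn: boolean;
  isAdmin: boolean;
}

function canViewRca(visibility: RcaVisibility, viewer: Viewer): boolean {
  if (visibility === "public") return true;                   // all status page visitors
  if (visibility === "authenticated") return viewer.loggedIn; // logged-in users only
  return viewer.isAdmin;                                       // "internal": admin users only
}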

Example RCA

Here's an example of a well-written RCA:

Root Cause Analysis: API Outage - January 15, 2024

Summary:

A database connection pool exhaustion issue caused 47 minutes of intermittent API errors between 14:23 and 15:10 UTC on January 15, 2024.

Timeline:
  • 14:23 UTC — Monitoring detected elevated API error rates
  • 14:25 UTC — On-call engineer alerted and investigation began
  • 14:35 UTC — Root cause identified as connection pool exhaustion
  • 14:42 UTC — Emergency config change deployed, pool size increased
  • 15:10 UTC — Error rates returned to normal, incident resolved
Root Cause:

A scheduled batch job was updated to process records in parallel without properly releasing database connections. This exhausted the connection pool (configured at 10 connections), causing new API requests to fail with timeout errors.

Impact:

Approximately 23% of API requests failed with 504 errors during the incident window. Web dashboard users experienced slow page loads. Mobile app users saw intermittent "connection failed" messages.

Resolution:

Immediately increased connection pool size from 10 to 50. Rolled back the batch job change and applied a fix to properly use connection pooling in parallel operations.

Prevention:
  • Added monitoring alert for connection pool utilization above 80%
  • Implemented code review checklist for database operations
  • Added load testing for batch job changes to staging environment
  • Increased default connection pool size across all services to 50

Best Practices

✓ Do
  • Be honest and transparent
  • Use specific technical details
  • Include concrete prevention steps
  • Write for a non-technical audience
  • Publish within 48-72 hours
  • Acknowledge user impact

✗ Don't
  • Blame individuals or teams
  • Use vague descriptions
  • Over-promise on prevention
  • Include sensitive details (passwords, IPs)
  • Make excuses or deflect
  • Leave prevention steps empty

API Access

RCAs can be accessed programmatically via the API:

# Get incident with RCA
GET /api/incidents/{id}

# Response includes RCA if present
{
  "id": "inc_abc123",
  "title": "API Outage",
  "status": "resolved",
  "rca": {
    "summary": "Database connection pool exhaustion...",
    "timeline": "...",
    "rootCause": "...",
    "impact": "...",
    "resolution": "...",
    "prevention": "..."
  }
}
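
For example, a small TypeScript client could fetch an incident and read its RCA summary if one is present. The base URL and the authentication header are assumptions about your deployment:

// Minimal sketch: fetch an incident and read its RCA, if any.
// BASE_URL and the Authorization header are deployment-specific placeholders.
const BASE_URL = "https://status.example.com";

async function getRcaSummary(incidentId: string): Promise<string | null> {
  const res = await fetch(`${BASE_URL}/api/incidents/${incidentId}`, {
    headers: { Authorization: "Bearer <token>" }, // only if your instance requires auth
  });
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  const incident = await res.json();
  return incident.rca ? incident.rca.summary : null;
}

// Usage:
// getRcaSummary("inc_abc123").then((summary) => console.log(summary ?? "No RCA yet"));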