502 Errors After a Major Headless CMS Update
How a major CMS upgrade caused intermittent 502s, and how CloudWatch triage plus SQL execution plan tuning resolved it.
TL;DR
- Right after a major CMS update, the API started returning intermittent 502s.
- CloudWatch showed unstable DatabaseConnections, a spike in selectAttempt, and sustained high CPU, pointing to the DB.
- The schema became more normalized, the table count and JOINs exploded, and EXPLAIN showed full table scans everywhere.
- Rebuilt indexes, refined JOIN conditions and selected columns, and removed unnecessary JOINs.
- 502s disappeared and CPU dropped; some endpoints ended up faster than before the upgrade.
System Overview (High Level)
- Headless CMS (schema managed as code, auto-generated)
- MySQL (RDS)
- Applications use the CMS via API
- Monitoring via CloudWatch
A very typical setup.
What Changed
I ran a major update of the headless CMS.
We knew there were breaking changes, but the schema definition still generated correctly, and migrations completed successfully.
At that point I expected some performance drop, but nothing dramatic.
What Happened
After deployment, these symptoms appeared:
- API intermittently returned 502
- Some requests were extremely slow
- Hard to reproduce consistently
CloudWatch revealed several suspicious signals.
CloudWatch Signals
Unstable Connections
DatabaseConnections had been steady before, but started fluctuating with request volume after the update. In addition:
- selectAttempt (number of SELECTs) nearly doubled
- DB load clearly increased
I started to suspect connection pool behavior or query volume.
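To cross-check those CloudWatch numbers on the database side, MySQL's own status counters are handy. A minimal sketch, assuming you can open a session on the RDS instance (these are standard MySQL status variables, not CMS-specific):

```sql
-- Current open connections: compare against CloudWatch DatabaseConnections
SHOW STATUS LIKE 'Threads_connected';

-- Cumulative count of SELECT statements since server start;
-- sample it twice and diff to estimate SELECTs per second
SHOW GLOBAL STATUS LIKE 'Com_select';

-- Connection ceiling, to judge how close the fluctuation gets to the limit
SHOW VARIABLES LIKE 'max_connections';
```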
CPU Stuck High
CPUUtilization hovered around 70% consistently, rather than just spiking. That suggested:
- not a single heavy query
- something constantly keeping the DB busy
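To confirm what was constantly keeping the DB busy rather than a single heavy query, the statement digest summary in performance_schema is a good place to look. A sketch, assuming performance_schema is enabled on the instance:

```sql
-- Top statements by total time spent, aggregated by normalized query text.
-- A "constantly busy" pattern shows up as many moderately expensive digests
-- rather than one big outlier.
SELECT
  DIGEST_TEXT,
  COUNT_STAR            AS exec_count,
  SUM_TIMER_WAIT / 1e12 AS total_time_sec,  -- timer values are in picoseconds
  SUM_ROWS_EXAMINED     AS rows_examined
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```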
Narrowing Down the Cause
Reviewing the update changelog showed one key change.
Heavier Normalization
After the update:
- Table count more than doubled
- Many new join tables were added
Result:
- JOINs exploded for reads
- SQL per request became much more complex
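To make that concrete with a made-up example (the table names below are hypothetical, not the CMS's actual schema): a read that used to hit one content table now has to walk several generated join tables.

```sql
-- Before the update (hypothetical): one denormalized content table
SELECT id, title, body, category, author_name
FROM articles
WHERE slug = 'hello-world';

-- After the update (hypothetical): the same read spread across join tables
SELECT a.id, a.title, a.body, c.name AS category, u.display_name AS author_name
FROM articles a
JOIN article_categories ac ON ac.article_id = a.id
JOIN categories c          ON c.id = ac.category_id
JOIN article_authors aa    ON aa.article_id = a.id
JOIN users u               ON u.id = aa.user_id
WHERE a.slug = 'hello-world';
```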
It felt like the SQL layer was the real culprit, so I inspected the queries issued by the app.
EXPLAIN and Despair
Running EXPLAIN on the problematic queries showed full table scans everywhere.
- Indexes were not used
- Many JOIN targets
- Non-trivial row counts in the joined tables
That explained the CPU burn.
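For reference, the check itself is simple (the query and table names here are illustrative; the real queries came from the CMS). In MySQL's EXPLAIN output, type: ALL together with key: NULL means a full table scan, and a large rows estimate on a JOIN target is exactly the kind of CPU burn described above.

```sql
-- Illustrative example of the kind of query I was inspecting
EXPLAIN
SELECT a.id, a.title, c.name
FROM articles a
JOIN article_categories ac ON ac.article_id = a.id
JOIN categories c          ON c.id = ac.category_id
WHERE a.slug = 'hello-world';

-- Per output row, the columns to watch:
--   type = ALL  -> full table scan on that table
--   key  = NULL -> no index was chosen
--   rows        -> estimated rows read; big numbers on JOINed tables add up fast
```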
What I Did
Then came the slow tuning work:
- Rebuild indexes
- Reorder JOINs and conditions
- Limit selected columns to the minimum
- Remove unnecessary JOINs
I repeated EXPLAIN -> fix -> recheck over and over.
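A condensed sketch of one round of that loop (index and column names are hypothetical; the point is the pattern, not the exact DDL):

```sql
-- 1. Add indexes that match the JOIN and WHERE columns actually used
CREATE INDEX idx_articles_slug          ON articles (slug);
CREATE INDEX idx_article_categories_aid ON article_categories (article_id);

-- 2. Trim the read: only the columns the endpoint returns, with the author
--    JOINs removed because this endpoint never used that data
SELECT a.id, a.title, c.name AS category
FROM articles a
JOIN article_categories ac ON ac.article_id = a.id
JOIN categories c          ON c.id = ac.category_id
WHERE a.slug = 'hello-world';

-- 3. Re-run EXPLAIN and confirm type has moved from ALL to ref/eq_ref
--    before moving on to the next query
```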
Results
- 502s were resolved
- CPU utilization dropped significantly
- Some endpoints became faster than before the upgrade
It was painful, but it reinforced a simple truth: if you face the query plan head-on, things get better.
Lessons
Two things stood out the most.
Always Check the Execution Plan
- Even with ORM/CMS, SQL always runs underneath
- "It works" is not enough
- Full scans are almost always bad
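If the sys schema is available (it ships with MySQL 5.7 and later), there is also a view that surfaces full-scan statements without hunting for them one by one. A sketch, assuming sys is installed on the instance:

```sql
-- Statements that performed full table scans, most frequently executed first
SELECT query, db, exec_count, no_index_used_count, rows_examined
FROM sys.statements_with_full_table_scans
ORDER BY exec_count DESC
LIMIT 10;
```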
CloudWatch Is an Excellent Triage Tool
- CPU
- Connections
- Query counts
With these, you can quickly judge whether the issue is app, DB, or schema related.
Closing Thoughts
Major updates bring welcome improvements, but the internal structure can change a lot.
Especially:
- deeper normalization
- more JOINs
- connection management changes
These are areas to scrutinize after upgrades.
It was a tough incident, but learning to read execution plans and appreciate CloudWatch made it worthwhile. I hope this helps anyone who gets stuck in a similar swamp.