502 Errors After a Major Headless CMS Update
How a major CMS upgrade caused intermittent 502s, and how CloudWatch triage plus SQL execution plan tuning resolved it.
TL;DR
- Right after a major CMS update, the API started returning intermittent 502s.
- CloudWatch showed unstable DatabaseConnections, a spike in selectAttempt, and sustained high CPU, pointing to the DB.
- The schema became more normalized, the table count and JOINs exploded, and EXPLAIN showed full table scans everywhere.
- Rebuilt indexes, refined JOIN conditions and selected columns, and removed unnecessary JOINs.
- 502s disappeared and CPU dropped; some endpoints ended up faster than before the upgrade.
System Overview (High Level)
- Headless CMS (schema managed as code, auto-generated)
- MySQL (RDS)
- Applications use the CMS via API
- Monitoring via CloudWatch
A very typical setup.
What Changed
I ran a major update of the headless CMS.
We knew there were breaking changes, but the schema definition still generated correctly, and migrations completed successfully.
At that point I expected some performance drop, but nothing dramatic.
What Happened
After deployment, these symptoms appeared:
- API intermittently returned 502
- Some requests were extremely slow
- Hard to reproduce consistently
CloudWatch revealed several suspicious signals.
CloudWatch Signals
Unstable Connections
DatabaseConnections had been steady before, but started fluctuating with request volume after the update. In addition:
- selectAttempt (number of SELECTs) nearly doubled
- DB load clearly increased
I started to suspect connection pool behavior or query volume.
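To cross-check those CloudWatch numbers on the database side, MySQL's own status counters are handy. A minimal sketch, assuming you can open a session on the RDS instance (these are standard MySQL status variables, not CMS-specific):

```sql
-- Current open connections: compare against CloudWatch DatabaseConnections
SHOW STATUS LIKE 'Threads_connected';

-- Cumulative count of SELECT statements since server start;
-- sample it twice and diff to estimate SELECTs per second
SHOW GLOBAL STATUS LIKE 'Com_select';

-- Connection ceiling, to judge how close the fluctuation gets to the limit
SHOW VARIABLES LIKE 'max_connections';
```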
CPU Stuck High
CPUUtilization hovered around 70% consistently, rather than just spiking. That suggested:
- not a single heavy query
- something constantly keeping the DB busy
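To confirm what was constantly keeping the DB busy rather than a single heavy query, the statement digest summary in performance_schema is a good place to look. A sketch, assuming performance_schema is enabled on the instance:

```sql
-- Top statements by total time spent, aggregated by normalized query text.
-- A "constantly busy" pattern shows up as many moderately expensive digests
-- rather than one big outlier.
SELECT
  DIGEST_TEXT,
  COUNT_STAR            AS exec_count,
  SUM_TIMER_WAIT / 1e12 AS total_time_sec,  -- timer values are in picoseconds
  SUM_ROWS_EXAMINED     AS rows_examined
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```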
Narrowing Down the Cause
Reviewing the update changelog showed one key change.
Heavier Normalization
After the update:
- Table count more than doubled
- Many new join tables were added
Result:
- JOINs exploded for reads
- SQL per request became much more complex
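To make that concrete with a made-up example (the table names below are hypothetical, not the CMS's actual schema): a read that used to hit one content table now has to walk several generated join tables.

```sql
-- Before the update (hypothetical): one denormalized content table
SELECT id, title, body, category, author_name
FROM articles
WHERE slug = 'hello-world';

-- After the update (hypothetical): the same read spread across join tables
SELECT a.id, a.title, a.body, c.name AS category, u.display_name AS author_name
FROM articles a
JOIN article_categories ac ON ac.article_id = a.id
JOIN categories c          ON c.id = ac.category_id
JOIN article_authors aa    ON aa.article_id = a.id
JOIN users u               ON u.id = aa.user_id
WHERE a.slug = 'hello-world';
```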
It felt like the SQL layer was the real culprit, so I inspected the queries issued by the app.
EXPLAIN and Despair
Running EXPLAIN on the problematic queries showed full table scans everywhere.
- Indexes were not used
- Many JOIN targets
- Non-trivial row counts in the joined tables
That explained the CPU burn.
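For reference, the check itself is simple (the query and table names here are illustrative; the real queries came from the CMS). In MySQL's EXPLAIN output, type: ALL together with key: NULL means a full table scan, and a large rows estimate on a JOIN target is exactly the kind of CPU burn described above.

```sql
-- Illustrative example of the kind of query I was inspecting
EXPLAIN
SELECT a.id, a.title, c.name
FROM articles a
JOIN article_categories ac ON ac.article_id = a.id
JOIN categories c          ON c.id = ac.category_id
WHERE a.slug = 'hello-world';

-- Per output row, the columns to watch:
--   type = ALL  -> full table scan on that table
--   key  = NULL -> no index was chosen
--   rows        -> estimated rows read; big numbers on JOINed tables add up fast
```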
What I Did
Then came the slow tuning work:
- Rebuild indexes
- Reorder JOINs and conditions
- Limit selected columns to the minimum
- Remove unnecessary JOINs
I repeated EXPLAIN -> fix -> recheck over and over.
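A condensed sketch of one round of that loop (index and column names are hypothetical; the point is the pattern, not the exact DDL):

```sql
-- 1. Add indexes that match the JOIN and WHERE columns actually used
CREATE INDEX idx_articles_slug          ON articles (slug);
CREATE INDEX idx_article_categories_aid ON article_categories (article_id);

-- 2. Trim the read: only the columns the endpoint returns, with the author
--    JOINs removed because this endpoint never used that data
SELECT a.id, a.title, c.name AS category
FROM articles a
JOIN article_categories ac ON ac.article_id = a.id
JOIN categories c          ON c.id = ac.category_id
WHERE a.slug = 'hello-world';

-- 3. Re-run EXPLAIN and confirm type has moved from ALL to ref/eq_ref
--    before moving on to the next query
```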
Results
- 502s were resolved
- CPU utilization dropped significantly
- Some endpoints became faster than before the upgrade
It was painful, but it reinforced a simple truth: if you face the query plan head-on, things get better.
Lessons
Two things stood out the most.
Always Check the Execution Plan
- Even with ORM/CMS, SQL always runs underneath
- "It works" is not enough
- Full scans are almost always bad
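If the sys schema is available (it ships with MySQL 5.7 and later), there is also a view that surfaces full-scan statements without hunting for them one by one. A sketch, assuming sys is installed on the instance:

```sql
-- Statements that performed full table scans, most frequently executed first
SELECT query, db, exec_count, no_index_used_count, rows_examined
FROM sys.statements_with_full_table_scans
ORDER BY exec_count DESC
LIMIT 10;
```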
CloudWatch Is an Excellent Triage Tool
- CPU
- Connections
- Query counts
With these, you can quickly judge whether the issue is app, DB, or schema related.
Closing Thoughts
Major updates bring welcome improvements, but the internal structure can change a lot.
Especially:
- deeper normalization
- more JOINs
- connection management changes
These are areas to scrutinize after upgrades.
It was a tough incident, but learning to read execution plans and appreciate CloudWatch made it worthwhile. I hope this helps anyone who gets stuck in a similar swamp.