How Swiggy Scaled and Maintained Postgres (Without Breaking Production)

Every startup hits a point where growth outpaces the systems built in the early days. For Swiggy, that moment arrived when a single, overworked Postgres database began throttling product velocity and customer experience. What followed is a practical playbook any team can apply: clean up indexes, understand and control vacuuming, and use pg_repack for online, no-downtime maintenance.

Below is a field guide based on Swiggy’s journey: what they changed, why it mattered, and how you can replicate the results.

The Turning Point: From “It Works” to “It Hurts”
Swiggy’s legacy app ran on Postgres 9.3 on AWS RDS, mostly with default parameters. While the application evolved into multiple microservices, the database remained a multi-purpose workhorse with fuzzy ownership boundaries.

Symptoms piled up:

  • High CPU and IOPS thanks to inefficient queries and redundant data.

  • Auto-vacuum disabled, replaced by nightly VACUUM FULL during a downtime window.

  • Data duplication from early data models and overlapping responsibilities.

Downtime was increasingly risky. If the maintenance window slipped, customers felt it. Swiggy needed a plan that protected uptime.

Step 1: Audit and Fix Indexes

Step 2: Vacuuming 101 — Why It’s Non-Optional

Step 3: Why Manual VACUUM FULL Was a Trap

Step 4: Bring Back Auto-Vacuum—Safely

Step 5: pg_repack — Online Maintenance That Respects Uptime

Results: From Firefighting to Flow

A Pragmatic Playbook You Can Copy

Helpful Snippets

Learnings for Developers

Step 1: Audit and Fix Indexes

Indexes are performance multipliers—until they aren’t. Each index consumes CPU, memory, and write throughput. Over time, unused or misaligned indexes quietly tax the system.

What Swiggy found

  • Many indexes were unused due to changing query patterns.

  • Others became obsolete as business logic evolved.

What they did

  • Dropped unused indexes after observing usage for a representative period.

  • Added partial indexes—indexing only the rows that queries actually filter on.

Impact

  • ~15% CPU reduction.

  • Lower storage growth and reduced write amplification.

How to find unused indexes

SQL

-- Observe over days/weeks before deciding to drop anything
SELECT
  schemaname, relname AS table_name, indexrelname AS index_name,
  idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, indexrelname;
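Once an index has shown zero scans over a representative window, it can be dropped without blocking concurrent reads or writes. A sketch (the index name is illustrative):

SQL

-- Drops the index without taking a blocking lock.
-- Note: cannot run inside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_legacy_status;

Keep the original CREATE INDEX statement in version control so the drop is easily reversible if a query regresses.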

When to consider partial indexes

SQL

-- Example: only index active orders to speed up common reads
CREATE INDEX CONCURRENTLY idx_orders_active
  ON orders (restaurant_id, created_at)
  WHERE status = 'active';

Step 2: Vacuuming 101 — Why It’s Non-Optional

Postgres uses MVCC (Multi-Version Concurrency Control). Updates don’t overwrite rows; they write new versions and mark old ones invisible to future transactions. Those dead tuples bloat tables and indexes.

Key ideas

  • Each row tracks creation/deletion via xmin and xmax.

  • Auto-vacuum cleans dead tuples and freezes old transaction IDs, preventing catastrophic XID wraparound.

  • Without vacuuming, bloat rises, queries slow, and wraparound can make data “disappear” to new transactions.
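You can watch MVCC at work directly by inspecting the system columns; a quick sketch (table name illustrative):

SQL

-- xmin = transaction that created this row version;
-- xmax = transaction that deleted or superseded it (0 if still live)
SELECT xmin, xmax, id FROM orders LIMIT 5;

-- After an UPDATE, re-running the query shows a new xmin for the row:
-- a fresh version was written, and the old one became a dead tuple
-- that only vacuum can reclaim.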

Measure bloat pressure

SQL

SELECT
  relname AS table_name,
  n_live_tup,
  n_dead_tup,
  round( (n_dead_tup::numeric / NULLIF(n_live_tup, 0)) * 100, 2 ) AS dead_pct
FROM pg_stat_all_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY dead_pct DESC NULLS LAST
LIMIT 20;

Step 3: Why Manual VACUUM FULL Was a Trap

Swiggy disabled auto-vacuum and relied on nightly VACUUM FULL. While it reclaimed space, it also:

  • Exclusively locked tables.

  • Rewrote entire relations.

  • Was hard to predict and often overran the window.

  • Increased ops toil and risk.

As traffic grew, this tactic couldn’t keep up.
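The contrast with plain VACUUM is worth spelling out; a sketch (table name illustrative):

SQL

-- Plain VACUUM: marks dead tuples reusable, takes only a lightweight lock,
-- and runs concurrently with reads and writes
VACUUM orders;

-- VACUUM FULL: rewrites the entire table into a new file and holds an
-- ACCESS EXCLUSIVE lock, blocking all reads and writes for the duration
VACUUM FULL orders;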

Step 4: Bring Back Auto-Vacuum—Safely

Before re-enabling auto-vacuum, Swiggy migrated to Postgres 9.6 with AWS DMS for zero downtime. That unlocked better controls and safer experimentation.

Instrument before you flip the switch
Track:

  • When auto-vacuum runs, table, duration, and tuples processed.

  • Dead/live tuple ratios, table and index sizes.

  • Pending XIDs and relfrozenxid age.

  • Infra metrics: CPU, IOPS, throughput, queue depth.
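Most of these signals are available from the statistics views; a starting point:

SQL

-- When did autovacuum last touch each table, and how often does it run?
SELECT relname, last_autovacuum, autovacuum_count, n_dead_tup
FROM pg_stat_user_tables
ORDER BY last_autovacuum NULLS FIRST;

-- How far is each table from XID wraparound? (age of relfrozenxid)
SELECT c.relname, age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;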

Throttle auto-vacuum’s bite

Config (postgresql.conf)

# Example: let it work, but don’t starve the app
autovacuum_vacuum_cost_delay = 10ms
autovacuum_vacuum_cost_limit = 2000
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 20

Step 5: pg_repack — Online Maintenance That Respects Uptime

Enter pg_repack: a utility that rebuilds tables and indexes online by creating a new copy and swapping it in—without a long, blocking lock.

What pg_repack gives you

  • Removes bloat from tables and indexes.

  • Re-orders data to match clustered indexes.

  • Avoids exclusive locks on the table for most of the run.

  • Comparable efficiency to CLUSTER, but suitable for production.

Runbook

  1. Rehearse on staging with realistic data, traffic, and extensions installed.

  2. Ensure disk headroom (temporary copy of the largest table/index).

  3. Schedule off-peak, announce internally, and have a rollback plan.

  4. Repack the worst offenders first; iterate over the rest.

Typical commands (adjust for your environment and permissions):

Shell

# Repack a single large table
pg_repack --dbname=app_db --table=public.orders --jobs=4

# Repack a schema
pg_repack --dbname=app_db --schema=public --jobs=4

# Estimate impact first (check table sizes and bloat ratios);
# pg_repack also offers --dry-run to print what would be repacked


Results: From Firefighting to Flow

By combining index cleanup, controlled auto-vacuum, and pg_repack, Swiggy:

  • Reduced CPU and lowered IOPS during normal operation.

  • Slowed storage growth by eliminating unnecessary indexes and bloat.

  • Cut operational toil, no longer betting on a fragile nightly window.

  • Gained a repeatable maintenance routine that scales with traffic.

A Pragmatic Playbook You Can Copy

1) Start with visibility

  • Enable pg_stat_statements, collect slow queries, size tables/indexes over time.

  • Track n_dead_tup, autovacuum runs, relfrozenxid age, and IO saturation.

2) Clean up indexes

  • Drop unused ones; merge overlaps; avoid low-cardinality multi-column indexes.

  • Add partial or covering indexes for the hottest predicates.
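For the hottest read paths, a covering index can let queries be answered entirely from the index via an index-only scan. A sketch, assuming Postgres 11+ for the INCLUDE clause (column and index names are illustrative):

SQL

-- INCLUDE stores extra columns in the index leaf pages so common lookups
-- never have to visit the heap
CREATE INDEX CONCURRENTLY idx_orders_customer_created
  ON orders (customer_id, created_at)
  INCLUDE (status, total_amount);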

3) Treat auto-vacuum as essential

  • Never disable globally. Tune cost settings and scale up gradually.

  • For “hot” tables, lower autovacuum_vacuum_scale_factor and/or autovacuum_vacuum_threshold so vacuum runs sooner.

  • Confirm progress: vacuum time vs. growth rate.

4) Use pg_repack for online reorganization

  • Repack the biggest, hottest tables first; then the largest indexes.

  • Bake repack calls into operational runbooks; consider periodic automation.

5) Engineer your safety net

  • Always have rollback steps: disable a setting, pause a job, fail back.

  • Alert on queue depth, IOPS, p95 latency, and dead-tuple ratios.

Helpful Snippets

Find biggest tables and their index overhead

SQL

SELECT
  n.nspname AS schema,
  c.relname AS table_name,
  pg_size_pretty(pg_total_relation_size(c.oid)) AS total,
  pg_size_pretty(pg_relation_size(c.oid)) AS table_size,
  pg_size_pretty(pg_total_relation_size(c.oid) - pg_relation_size(c.oid)) AS indexes_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 20;

Spot tables likely to need repack

SQL

SELECT
  relname AS table_name,
  n_live_tup,
  n_dead_tup,
  round(100 * n_dead_tup::numeric / NULLIF(n_live_tup, 0), 2) AS dead_pct
FROM pg_stat_all_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
  AND n_live_tup > 100000
ORDER BY dead_pct DESC
LIMIT 20;

Per-table auto-vacuum overrides (fine-tune hot tables)

SQL

ALTER TABLE public.orders
  SET (autovacuum_vacuum_scale_factor = 0.05, autovacuum_analyze_scale_factor = 0.05);

-- Note: as a per-table storage parameter, the cost delay is a bare number
-- of milliseconds; unit suffixes like "ms" are not accepted here
ALTER TABLE public.events
  SET (autovacuum_vacuum_cost_limit = 4000, autovacuum_vacuum_cost_delay = 5);

Learnings for Developers

  • Audit indexes regularly; adapt them to today’s queries, not yesterday’s.

  • Internalize MVCC. Dead tuples = silent performance tax; vacuum prevents catastrophe.

  • Auto-vacuum is mandatory. Tune gently, roll out gradually, and monitor everything.

  • Use pg_repack to remove bloat online; schedule smartly and ensure disk headroom.

  • Build a repeatable runbook with clear rollback steps and success metrics.
