Royal Glow Docs

Operations

Runbook for running, monitoring, deploying, and recovering the Royal Glow platform.

This is the on-call runbook. It covers how to deploy, monitor, respond to incidents, and recover from failures. When something breaks, jump to Incident Response and Common Issues & Fixes.

Deployment

Normal Deployment Flow

Work flows in one direction: dev → test → pprd → prod.

Merge PR to dev

CI runs lint + typecheck + unit tests.

Merge dev → test

CI adds integration tests + Playwright + Lighthouse.

Merge test → pprd

CI adds k6 load test + OWASP ZAP.

Merge pprd → prod

Requires manual approval, then auto-deploys.
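
The one-way promotion order can be encoded as a small helper, for example to stop a CI job from accepting a merge that skips an environment. This is a minimal sketch: the function names next_env and is_valid_promotion are hypothetical and are not part of the existing pipeline.

```shell
# Hypothetical helper: print the environment that follows the given one
# in the dev -> test -> pprd -> prod chain.
# Returns 1 for prod (nothing follows it) and for unknown names.
next_env() {
  case "$1" in
    dev)  echo "test" ;;
    test) echo "pprd" ;;
    pprd) echo "prod" ;;
    *)    return 1 ;;
  esac
}

# Succeeds only if merging branch $1 into branch $2 moves exactly one
# step forward along the chain.
is_valid_promotion() {
  local expected
  expected=$(next_env "$1") || return 1
  [ "$expected" = "$2" ]
}
```

For example, `is_valid_promotion test pprd` succeeds, while `is_valid_promotion dev prod` fails. Hotfix branches bypass this chain by design (see Emergency Hotfix), so a real check would need an explicit exception for them.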

Emergency Hotfix

For critical production bugs that can't wait for the full pipeline:

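# Create hotfix branch from the latest prod
git checkout prod
git pull origin prod
git checkout -b hotfix/booking-crash

# Make the fix, commit, push
git push origin hotfix/booking-crash

# Create PR directly to prod (requires manual approval)
# After merge, backport to dev (fetch first so the merged fix is included):
git fetch origin
git checkout dev
git merge origin/prod
# Push dev, or open a PR if the branch is protected
git push origin dev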

Rollback

Customer + admin sites (rgss-web, rgss-admin) — Cloudflare Workers via the OpenNext adapter (@opennextjs/cloudflare):

  1. Run wrangler rollback for the affected Worker, or open Cloudflare → Workers & Pages → rgss-web → Deployments
  2. Find the last known-good deployment / version
  3. Promote it — takes effect in ~30 seconds
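
The CLI route in step 1 might look like the sketch below. It assumes wrangler is on the PATH and already authenticated, and that the --name and --message flags behave as in recent Wrangler releases (check `wrangler rollback --help` for the installed version). The run and rollback_worker helpers and the DRY_RUN switch are hypothetical.

```shell
# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Roll back one Worker. $1 is the Worker name (rgss-web or rgss-admin),
# $2 is an optional version ID. Without an ID, wrangler rolls back to the
# version uploaded before the latest one.
rollback_worker() {
  local worker="$1" version_id="${2:-}"

  # Show recent deployments so the target version can be confirmed first.
  run wrangler deployments list --name "$worker"

  if [ -n "$version_id" ]; then
    run wrangler rollback "$version_id" --name "$worker" --message "Rollback via runbook"
  else
    run wrangler rollback --name "$worker" --message "Rollback via runbook"
  fi
}
```

Running `DRY_RUN=1 rollback_worker rgss-web` prints the two commands without touching production, which is a cheap way to check the target before a real rollback.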

Payload CMS (rgss-cms, cms.theroyalglow.in):

  1. Go to Render → Service → Deploys
  2. Click the last known-good deploy → "Rollback"

Database (Neon): a rollback is not possible without data loss. If a migration caused data corruption, use Neon's point-in-time restore instead (see Point-in-Time Restore).

Monitoring

Daily Checks (2 minutes)

Status page

Check BetterStack status page — all monitors green?

Errors

Check Sentry — any new error issues overnight?

Jobs

Check BetterStack heartbeats — did all scheduled jobs run?

Weekly Checks (15 minutes)

Funnel

Review PostHog funnel — any drop-off increase in booking flow?

Review Sentry error trends — any new patterns?

Performance

Check Lighthouse CI scores on recent PRs — any performance regression?

Load

Review k6 load test results — any latency increase?

Alerts

| Alert | Source | Action |
|---|---|---|
| Site down | BetterStack | Check Cloudflare status, check Render logs |
| API health failing | BetterStack | Check Neon DB connection, check Render |
| Scheduled job missed | BetterStack heartbeat | Check QStash dashboard for delivery logs |
| New error spike | Sentry | Check error details, deploy fix or rollback |
| Slow response (P95 > 3s) | Sentry performance | Check DB query times, check Neon metrics |

Incident Response

Severity Levels

| Level | Definition | Response Time | Example |
|---|---|---|---|
| P0 | Site completely down | Immediate | Cloudflare outage, Neon DB unreachable |
| P1 | Core feature broken | < 1 hour | Booking creation failing, auth broken |
| P2 | Feature degraded | < 4 hours | Emails not sending, PDF generation failing |
| P3 | Minor issue | Next business day | Cosmetic bug, non-critical feature broken |

P0/P1 Response Steps

Acknowledge

Post in Slack #incidents that you're investigating.

Diagnose

Check BetterStack, Sentry, Cloudflare, Render, Neon dashboards.

Communicate

Update the BetterStack status page with an incident message.

Fix or rollback

Deploy a fix or roll back to the last known-good deployment.

Verify

Confirm Checkly synthetic checks pass.

Post-mortem

Document what happened, why, and how to prevent it.
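
The Acknowledge step can be scripted so it takes seconds during an incident. The sketch below assumes a Slack incoming webhook bound to #incidents whose URL is stored in SLACK_WEBHOOK_URL. That variable and both function names are assumptions, not existing tooling.

```shell
# Build the JSON body for a Slack incoming webhook.
# Escapes backslashes and double quotes; multi-line messages are not handled.
slack_payload() {
  local escaped
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text":"%s"}\n' "$escaped"
}

# Post an acknowledgement to the channel behind SLACK_WEBHOOK_URL (assumed).
# curl -f makes the function fail on an HTTP error response.
post_incident() {
  curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    --data "$(slack_payload "$1")" \
    "$SLACK_WEBHOOK_URL"
}
```

A typical call would be `post_incident "P1: booking creation failing, investigating"`.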

Common Issues & Fixes

Database Operations

Running Migrations

# Generate migration from schema changes
cd packages/db
bunx drizzle-kit generate

# Apply to dev branch
DATABASE_URL=$DATABASE_URL_DEV bunx drizzle-kit migrate

# Apply to test branch
DATABASE_URL=$DATABASE_URL_TEST bunx drizzle-kit migrate

# Apply to prod (requires manual confirmation)
DATABASE_URL=$DATABASE_URL_PROD bunx drizzle-kit migrate

Never run migrations directly on prod without testing on test + pprd first.
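
One way to make the manual confirmation for prod hard to skip is to wrap the migrate command in a guard that asks the operator to type the environment name. This is a sketch only: the helper names are hypothetical, and DATABASE_URL_PPRD is an assumed variable that does not appear in the commands above.

```shell
# Ask the operator to type "prod" before production is touched.
# The prompt goes to stderr so stdout stays clean.
confirm_prod_migration() {
  local answer
  printf 'Type "prod" to apply migrations to production: ' >&2
  read -r answer || return 1
  [ "$answer" = "prod" ]
}

# Apply pending migrations to one environment: dev, test, pprd or prod.
migrate_env() {
  local target="$1" url
  case "$target" in
    dev)  url="$DATABASE_URL_DEV" ;;
    test) url="$DATABASE_URL_TEST" ;;
    pprd) url="$DATABASE_URL_PPRD" ;;  # assumed variable name
    prod)
      confirm_prod_migration || { echo "Aborted." >&2; return 1; }
      url="$DATABASE_URL_PROD"
      ;;
    *)
      echo "Unknown environment: $target" >&2
      return 2
      ;;
  esac
  # Run in a subshell so the caller's working directory is unchanged.
  (cd packages/db && DATABASE_URL="$url" bunx drizzle-kit migrate)
}
```

With this in place, `migrate_env test` runs immediately, while `migrate_env prod` stops unless the operator types prod at the prompt.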

Neon Branch Management

# Create a new branch for a feature (via Neon console or API)
# Branches are instant copy-on-write — no data copying needed

# Reset pprd from prod (normally done by GitHub Actions nightly)
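# The JSON body uses double quotes so the shell expands $PROD_BRANCH_ID
curl -X POST "https://console.neon.tech/api/v2/projects/$NEON_PROJECT_ID/branches/$PPRD_BRANCH_ID/restore" \
  -H "Authorization: Bearer $NEON_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"source_branch_id\": \"$PROD_BRANCH_ID\"}"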

Point-in-Time Restore

If a bad migration or data corruption occurs on prod:

Open the branch

Go to Neon console → Project → Branches → prod.

Restore to a point in time

Click "Restore" → select a point in time before the incident.

Verify

This creates a new branch — verify data is correct.

Swap the connection

Point the connection string to the restored branch.

Re-apply migrations

Apply any migrations that happened after the restore point.
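
For the Verify step, a quick sanity check is to compare row counts between the current prod branch and the restored branch. The sketch below assumes psql is installed and that the restored branch is reachable through DATABASE_URL_RESTORE. The helper names and any table names passed in are hypothetical.

```shell
# Print the row count of one table on the branch behind the given URL.
# -A (unaligned) and -t (tuples only) make psql print just the number.
row_count() {
  psql "$1" -At -c "SELECT count(*) FROM \"$2\""
}

# Compare row counts for each listed table between two branches.
# Usage: compare_branches <current_url> <restored_url> <table>...
# Prints one line per table; returns 1 if a count cannot be read.
compare_branches() {
  local current_url="$1" restored_url="$2" table current restored
  shift 2
  for table in "$@"; do
    current=$(row_count "$current_url" "$table") || return 1
    restored=$(row_count "$restored_url" "$table") || return 1
    echo "$table: current=$current restored=$restored"
  done
}
```

A call such as `compare_branches "$DATABASE_URL_PROD" "$DATABASE_URL_RESTORE" bookings invoices` (table names are placeholders) shows how much data differs. The counts are expected to differ, because the restored branch reflects an earlier point in time. The check is meant to catch a restore point that is obviously wrong, not to prove correctness.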

Backup & Recovery

Weekly Backups

Every Sunday at 2 AM UTC, GitHub Actions:

  1. Runs pg_dump against Neon prod branch
  2. Uploads compressed dump to Cloudflare R2 (backups/weekly/)
  3. Retains 8 weeks of backups
  4. Pings BetterStack heartbeat on success
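
The four steps above could be scripted roughly as follows. This is a sketch rather than the actual GitHub Actions job: the function names and BETTERSTACK_HEARTBEAT_URL are assumptions, while the bucket name and R2 endpoint follow the restore commands in the next section.

```shell
RETAIN=8  # number of weekly backups to keep

# Object key for a backup taken on the given date (YYYY-MM-DD).
backup_key() {
  echo "backups/weekly/backup-$1.sql.gz"
}

# Read backup names on stdin, one per line, and print those that fall
# outside the newest $RETAIN. ISO dates in the name sort chronologically.
keys_to_prune() {
  sort -r | tail -n +"$((RETAIN + 1))"
}

weekly_backup() {
  local today file endpoint bucket name
  today=$(date -u +%Y-%m-%d)
  file="/tmp/backup-$today.sql"
  endpoint="https://$R2_ACCOUNT_ID.r2.cloudflarestorage.com"
  bucket="s3://theroyalglow-uploads"

  # 1. Dump the prod branch, then compress the dump.
  pg_dump --file "$file" "$DATABASE_URL_PROD" || return 1
  gzip -f "$file" || return 1

  # 2. Upload to R2 through its S3-compatible endpoint.
  aws s3 cp "$file.gz" "$bucket/$(backup_key "$today")" \
    --endpoint-url "$endpoint" || return 1

  # 3. Delete backups beyond the retention window.
  aws s3 ls "$bucket/backups/weekly/" --endpoint-url "$endpoint" \
    | awk 'NF >= 4 {print $4}' | keys_to_prune \
    | while read -r name; do
        aws s3 rm "$bucket/backups/weekly/$name" --endpoint-url "$endpoint"
      done

  # 4. Ping the heartbeat only after the dump and upload succeeded.
  curl -fsS "$BETTERSTACK_HEARTBEAT_URL" > /dev/null
}
```

Because the heartbeat is the last step, a failed dump or upload returns early and the missed ping surfaces as a "Scheduled job missed" alert.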

Restore from Backup

# Download backup from R2
aws s3 cp s3://theroyalglow-uploads/backups/weekly/backup-2026-06-01.sql.gz . \
  --endpoint-url https://$R2_ACCOUNT_ID.r2.cloudflarestorage.com

# Decompress
gunzip backup-2026-06-01.sql.gz

# Restore to a new Neon branch (never restore directly to prod)
psql $DATABASE_URL_RESTORE < backup-2026-06-01.sql
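
A slightly more defensive variant checks the archive before restoring and stops at the first SQL error instead of continuing with a half-applied dump. This is a sketch: restore_backup is a hypothetical helper, and ON_ERROR_STOP may be too strict if the dump references roles that do not exist on the target branch.

```shell
# Restore a gzip-compressed SQL dump into the branch behind
# DATABASE_URL_RESTORE. Never point this at prod.
restore_backup() {
  local archive="$1"

  # gzip -t verifies the archive without writing any output.
  gzip -t "$archive" || { echo "Corrupt archive: $archive" >&2; return 1; }

  # Stream the dump into psql. ON_ERROR_STOP aborts on the first failing
  # statement; --single-transaction rolls everything back in that case.
  gunzip -c "$archive" \
    | psql "$DATABASE_URL_RESTORE" -v ON_ERROR_STOP=1 --single-transaction
}
```

Usage: `restore_backup backup-2026-06-01.sql.gz`. Streaming through gunzip -c also leaves the compressed file in place, so the download does not have to be repeated if the restore fails.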

Scaling

The platform is designed to scale without code changes:

| Bottleneck | Solution |
|---|---|
| DB connection pool exhausted | Increase Neon compute size or connection limit |
| Cloudflare Workers CPU limit | Move heavy work to Render origin |
| Upstash Redis rate limit | Upgrade Upstash plan |
| Ably message limit | Upgrade Ably plan (6M/mo free is generous) |
| Resend email limit | Upgrade Resend plan |

Launch Checklist

Before going live at theroyalglow.in:

  • DNS: point theroyalglow.in → Cloudflare Workers (rgss-web)
  • DNS: point admin.theroyalglow.in → Cloudflare Workers (rgss-admin)
  • DNS: point cms.theroyalglow.in → Render (rgss-cms)
  • DNS: point docs.theroyalglow.in → Cloudflare Workers (docs)
  • DNS: point status.theroyalglow.in → BetterStack status page
  • SSL: verify all domains have valid certificates (Cloudflare handles this)
  • GMB: update website field from old URL to https://theroyalglow.in
  • GMB: set booking action link to https://theroyalglow.in/?book=1&utm_source=gmb
  • Google Search Console: verify ownership, submit sitemap
  • Sentry: verify source maps are uploading correctly
  • BetterStack: verify all monitors are active and alerting
  • Checkly: verify all synthetic checks are passing
  • Run Lighthouse CI on all key pages — all scores must pass
  • Test booking flow end-to-end on prod
  • Test Google OAuth sign-in on prod
  • Test invoice email delivery on prod
  • Seed production data (branch, categories, services, tiers)
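
The DNS and SSL items in the checklist above can be spot-checked from a terminal. The sketch below assumes dig and curl are available. The helper names are hypothetical, and the check only confirms that each name resolves and serves a certificate curl accepts, not that it points at the intended service.

```shell
# Domains from the launch checklist.
DOMAINS="theroyalglow.in admin.theroyalglow.in cms.theroyalglow.in docs.theroyalglow.in status.theroyalglow.in"

# Check one domain: it must resolve, and an HTTPS request must complete.
check_domain() {
  local domain="$1"

  # dig +short prints nothing when the name does not resolve.
  if [ -z "$(dig +short "$domain")" ]; then
    echo "FAIL dns $domain"
    return 1
  fi

  # curl exits non-zero on TLS errors such as an expired or mismatched
  # certificate. HTTP error statuses are not treated as failures here.
  if ! curl -sS --head --max-time 10 -o /dev/null "https://$domain"; then
    echo "FAIL tls $domain"
    return 1
  fi

  echo "OK $domain"
}

# Check every domain and return non-zero if any of them failed.
check_all_domains() {
  local failed=0 domain
  for domain in $DOMAINS; do
    check_domain "$domain" || failed=1
  done
  return "$failed"
}
```

Running `check_all_domains` prints one line per domain and exits non-zero if any check failed, so it can also gate a launch script.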