Operations
Runbook for running, monitoring, deploying, and recovering the Royal Glow platform.
This is the on-call runbook. It covers how to deploy, monitor, respond to incidents, and recover from failures. When something breaks, jump to Incident Response and Common Issues & Fixes.
Deployment
Normal Deployment Flow
Work flows in one direction: dev → test → pprd → prod.
Merge PR to dev
CI runs lint + typecheck + unit tests.
Merge dev → test
CI adds integration tests + Playwright + Lighthouse.
Merge test → pprd
CI adds k6 load test + OWASP ZAP.
Merge pprd → prod
Requires manual approval, then auto-deploys.
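The one-directional flow above can be checked mechanically, for example in a CI guard that rejects merges which skip a stage. The following is a minimal sketch, not part of the pipeline described here; next_env and is_valid_promotion are hypothetical helper names.

```shell
# Sketch: enforce the promotion order dev -> test -> pprd -> prod.
# next_env and is_valid_promotion are hypothetical helpers.

# Print the branch that follows the given one; fail if there is none.
next_env() {
  case "$1" in
    dev)  echo "test" ;;
    test) echo "pprd" ;;
    pprd) echo "prod" ;;
    *)    return 1 ;;
  esac
}

# Succeed only when merging $1 into $2 moves exactly one step forward.
is_valid_promotion() {
  promo_expected="$(next_env "$1")" || return 1
  [ "$promo_expected" = "$2" ]
}

# Usage (for example in a CI step):
#   is_valid_promotion "$SOURCE_BRANCH" "$TARGET_BRANCH" || exit 1
```

Hotfix branches merge straight into prod by design, so a guard like this would need an explicit exception for them.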
Emergency Hotfix
For critical production bugs that can't wait for the full pipeline:
# Create hotfix branch from prod
git checkout prod
git checkout -b hotfix/booking-crash
# Make the fix, commit, push
git push origin hotfix/booking-crash
# Create PR directly to prod (requires manual approval)
# After merge, backport to dev:
git checkout dev
git merge prod
Rollback
Customer + admin sites (rgss-web, rgss-admin) — Cloudflare Workers via the OpenNext adapter (@opennextjs/cloudflare):
- Run wrangler rollback for the affected Worker, or open Cloudflare → Workers & Pages → rgss-web → Deployments
- Find the last known-good deployment / version
- Promote it; the change takes effect in ~30 seconds
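The CLI path for the steps above might look like the sketch below. It assumes a recent Wrangler release in which wrangler rollback accepts a version ID together with --name and --message, and in which wrangler deployments list --name shows recent deployments; confirm both with --help before relying on them. rollback_worker is a hypothetical wrapper, and it only prints the command unless DRY_RUN=0 is set.

```shell
# Sketch: roll a Worker back to a known-good version with Wrangler.
# rollback_worker is a hypothetical wrapper. With DRY_RUN unset or 1 it
# prints the command instead of running it.
rollback_worker() {
  rb_worker="$1"    # e.g. rgss-web or rgss-admin
  rb_version="$2"   # version ID of the last known-good deployment
  rb_reason="$3"    # short message recorded with the rollback
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'wrangler rollback %s --name %s --message "%s"\n' \
      "$rb_version" "$rb_worker" "$rb_reason"
  else
    wrangler rollback "$rb_version" --name "$rb_worker" --message "$rb_reason"
  fi
}

# Usage:
#   wrangler deployments list --name rgss-web        # find the version ID
#   rollback_worker rgss-web <version-id> "booking crash after deploy"
#   DRY_RUN=0 rollback_worker rgss-web <version-id> "booking crash after deploy"
```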
Payload CMS (rgss-cms, cms.theroyalglow.in):
- Go to Render → Service → Deploys
- Click the last known-good deploy → "Rollback"
Database (Neon): rollback is not possible without data loss. Use Neon's point-in-time restore if a migration caused data corruption (see Point-in-Time Restore).
Monitoring
Daily Checks (2 minutes)
Status page
Check BetterStack status page — all monitors green?
Errors
Check Sentry — any new error issues overnight?
Jobs
Check BetterStack heartbeats — did all scheduled jobs run?
Weekly Checks (15 minutes)
Funnel
Review PostHog funnel — any drop-off increase in booking flow?
Error trends
Review Sentry error trends — any new patterns?
Performance
Check Lighthouse CI scores on recent PRs — any performance regression?
Load
Review k6 load test results — any latency increase?
Alerts
| Alert | Source | Action |
|---|---|---|
| Site down | BetterStack | Check Cloudflare status, check Render logs |
| API health failing | BetterStack | Check Neon DB connection, check Render |
| Scheduled job missed | BetterStack heartbeat | Check QStash dashboard for delivery logs |
| New error spike | Sentry | Check error details, deploy fix or rollback |
| Slow response (P95 > 3s) | Sentry performance | Check DB query times, check Neon metrics |
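When the slow-response alert fires, it can help to recompute P95 from raw response times (for example, exported from logs) and compare it with the 3-second threshold. The sketch below uses the nearest-rank definition of a percentile; Sentry may calculate P95 differently, so treat the result as a cross-check. p95 and exceeds_threshold are hypothetical helper names.

```shell
# p95: read one latency in seconds per line on stdin and print the
# nearest-rank 95th percentile (rank = ceil(0.95 * n)).
p95() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # integer ceiling of 0.95 * NR
      print v[rank]
    }'
}

# exceeds_threshold: succeed when $1 is above the limit ($2, default 3 seconds).
exceeds_threshold() {
  awk -v x="$1" -v limit="${2:-3}" 'BEGIN { exit !(x > limit) }'
}

# Usage (latencies.txt is a hypothetical file with one value per line):
#   value="$(p95 < latencies.txt)" && exceeds_threshold "$value" && echo "P95 too slow: ${value}s"
```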
Incident Response
Severity Levels
| Level | Definition | Response Time | Example |
|---|---|---|---|
| P0 | Site completely down | Immediate | Cloudflare outage, Neon DB unreachable |
| P1 | Core feature broken | < 1 hour | Booking creation failing, auth broken |
| P2 | Feature degraded | < 4 hours | Emails not sending, PDF generation failing |
| P3 | Minor issue | Next business day | Cosmetic bug, non-critical feature broken |
P0/P1 Response Steps
Acknowledge
Post in Slack #incidents that you're investigating.
Diagnose
Check BetterStack, Sentry, Cloudflare, Render, Neon dashboards.
Communicate
Update the BetterStack status page with an incident message.
Fix or rollback
Deploy a fix or roll back to the last known-good deployment.
Verify
Confirm Checkly synthetic checks pass.
Post-mortem
Document what happened, why, and how to prevent it.
Common Issues & Fixes
Database Operations
Running Migrations
# Generate migration from schema changes
cd packages/db
bunx drizzle-kit generate
# Apply to dev branch
DATABASE_URL=$DATABASE_URL_DEV bunx drizzle-kit migrate
# Apply to test branch
DATABASE_URL=$DATABASE_URL_TEST bunx drizzle-kit migrate
# Apply to prod (requires manual confirmation)
DATABASE_URL=$DATABASE_URL_PROD bunx drizzle-kit migrate
Never run migrations directly on prod without testing on test + pprd first.
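The manual confirmation for prod can be made explicit with a small wrapper around the commands above. This is a sketch under assumptions: migrate_env and confirm_prod are hypothetical names, and the wrapper covers only the three DATABASE_URL_* variables that appear in this section.

```shell
# Sketch: refuse to migrate prod unless the operator types "prod".
# migrate_env and confirm_prod are hypothetical helpers.

# $1 is what the operator typed; only the exact string "prod" is accepted.
confirm_prod() {
  [ "$1" = "prod" ]
}

migrate_env() {
  mig_target="$1"
  case "$mig_target" in
    dev)  mig_url="$DATABASE_URL_DEV" ;;
    test) mig_url="$DATABASE_URL_TEST" ;;
    prod)
      printf 'Type "prod" to confirm the production migration: '
      read -r mig_answer
      confirm_prod "$mig_answer" || { echo "Aborted." >&2; return 1; }
      mig_url="$DATABASE_URL_PROD" ;;
    *)
      echo "Unknown environment: $mig_target" >&2
      return 2 ;;
  esac
  # Same command as in the runbook, with the URL chosen above.
  DATABASE_URL="$mig_url" bunx drizzle-kit migrate
}

# Usage (from packages/db):
#   migrate_env test
#   migrate_env prod
```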
Neon Branch Management
# Create a new branch for a feature (via Neon console or API)
# Branches are instant copy-on-write — no data copying needed
# Reset pprd from prod (normally done by GitHub Actions nightly)
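# The JSON body uses double quotes so the shell expands $PROD_BRANCH_ID
curl -X POST "https://console.neon.tech/api/v2/projects/$NEON_PROJECT_ID/branches/$PPRD_BRANCH_ID/restore" \
-H "Authorization: Bearer $NEON_API_KEY" \
-H "Content-Type: application/json" \
-d "{\"source_branch_id\": \"$PROD_BRANCH_ID\"}"
Point-in-Time Restore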
If a bad migration or data corruption occurs on prod:
Open the branch
Go to Neon console → Project → Branches → prod.
Restore to a point in time
Click "Restore" → select a point in time before the incident.
Verify
This creates a new branch — verify data is correct.
Swap the connection
Point the connection string to the restored branch.
Re-apply migrations
Apply any migrations that happened after the restore point.
Backup & Recovery
Weekly Backups
Every Sunday at 2 AM UTC, GitHub Actions:
- Runs pg_dump against the Neon prod branch
- Uploads the compressed dump to Cloudflare R2 (backups/weekly/)
- Retains 8 weeks of backups
- Pings BetterStack heartbeat on success
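The 8-week retention rule can be expressed as a filter over backup file names. The sketch below assumes the names follow the backup-YYYY-MM-DD.sql.gz pattern used elsewhere in this runbook, so a plain reverse sort orders them newest first; backups_to_delete is a hypothetical helper, and the actual workflow may prune differently.

```shell
# backups_to_delete: read backup file names on stdin (one per line) and
# print every name that falls outside the newest N (default 8).
# Relies on the ISO date in the name sorting chronologically.
backups_to_delete() {
  keep_count="${1:-8}"
  LC_ALL=C sort -r | awk -v keep="$keep_count" 'NR > keep'
}

# Usage, assuming the names under backups/weekly/ are listed into names.txt
# (a hypothetical file):
#   backups_to_delete 8 < names.txt
# A dump with a matching name could be produced with:
#   pg_dump "$DATABASE_URL_PROD" | gzip > "backup-$(date -u +%F).sql.gz"
```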
Restore from Backup
# Download backup from R2
aws s3 cp s3://theroyalglow-uploads/backups/weekly/backup-2026-06-01.sql.gz . \
--endpoint-url https://$R2_ACCOUNT_ID.r2.cloudflarestorage.com
# Decompress
gunzip backup-2026-06-01.sql.gz
# Restore to a new Neon branch (never restore directly to prod)
psql "$DATABASE_URL_RESTORE" < backup-2026-06-01.sql
Scaling
The platform is designed to scale without code changes:
| Bottleneck | Solution |
|---|---|
| DB connection pool exhausted | Increase Neon compute size or connection limit |
| Cloudflare Workers CPU limit | Move heavy work to Render origin |
| Upstash Redis rate limit | Upgrade Upstash plan |
| Ably message limit | Upgrade Ably plan (6M/mo free is generous) |
| Resend email limit | Upgrade Resend plan |
Launch Checklist
Before going live at theroyalglow.in:
- DNS: point theroyalglow.in → Cloudflare Workers (rgss-web)
- DNS: point admin.theroyalglow.in → Cloudflare Workers (rgss-admin)
- DNS: point cms.theroyalglow.in → Render (rgss-cms)
- DNS: point docs.theroyalglow.in → Cloudflare Workers (docs)
- DNS: point status.theroyalglow.in → BetterStack status page
- SSL: verify all domains have valid certificates (Cloudflare handles this)
- GMB: update website field from old URL to https://theroyalglow.in
- GMB: set booking action link to https://theroyalglow.in/?book=1&utm_source=gmb
- Google Search Console: verify ownership, submit sitemap
- Sentry: verify source maps are uploading correctly
- BetterStack: verify all monitors are active and alerting
- Checkly: verify all synthetic checks are passing
- Run Lighthouse CI on all key pages — all scores must pass
- Test booking flow end-to-end on prod
- Test Google OAuth sign-in on prod
- Test invoice email delivery on prod
- Seed production data (branch, categories, services, tiers)
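The DNS and SSL items in the checklist above can be spot-checked from a terminal. The sketch below is one possible approach, not an existing script: launch_hosts, check_host, and check_all are hypothetical names, and the check treats any HTTP 4xx/5xx response as a failure, which may be too strict for hosts that require login.

```shell
# launch_hosts: the five hostnames from the launch checklist.
launch_hosts() {
  printf '%s\n' \
    theroyalglow.in \
    admin.theroyalglow.in \
    cms.theroyalglow.in \
    docs.theroyalglow.in \
    status.theroyalglow.in
}

# check_host: succeed when the host answers over HTTPS with a certificate
# curl trusts. curl exits non-zero on DNS failures, TLS errors, and
# (because of -f) HTTP 4xx/5xx responses.
check_host() {
  curl -fsS -o /dev/null --max-time 10 "https://$1/"
}

# check_all: report every host and return non-zero if any check failed.
check_all() {
  any_failed=0
  for host in $(launch_hosts); do
    if check_host "$host"; then
      echo "ok   $host"
    else
      echo "FAIL $host"
      any_failed=1
    fi
  done
  return "$any_failed"
}

# Usage:
#   check_all
```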