PagerPal Operator Guide

This guide is for the person running PagerPal day to day. It focuses on the current v1 single-node appliance model: one FastAPI process, one database, and in-process background workers.

Access model

PagerPal uses two kinds of credentials, User Accounts and alert source API keys, across four surfaces:

| Surface | Credential | Purpose |
| --- | --- | --- |
| Web UI | Login-enabled User Account session | Operator access to dashboards, incidents, teams, schedules, policies, alert sources, and settings. |
| Management API | Real User Account Basic auth (email + password) | CRUD/read access according to role. |
| Alert ingestion webhooks | Alert source API key | Allows monitoring systems to create/update/resolve incidents. |
| /health | None | Public health endpoint for local checks or load balancers. |
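
As an illustration of the Management API credential, the sketch below builds a standard HTTP Basic Authorization header from a User Account email and password using only the Python standard library. The helper names and the placeholder credentials are assumptions for illustration; the path /api/v1/system/jobs is the worker status endpoint mentioned later in this guide.

```python
import base64
import urllib.request


def basic_auth_header(email: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{email}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def management_request(base_url: str, path: str, email: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated Management API request."""
    request = urllib.request.Request(base_url.rstrip("/") + path)
    request.add_header("Authorization", basic_auth_header(email, password))
    return request


if __name__ == "__main__":
    # Placeholder credentials; substitute a real User Account.
    req = management_request(
        "http://127.0.0.1:8000", "/api/v1/system/jobs",
        "admin@example.com", "change-me",
    )
    print(req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=5)
```

Basic auth only encodes the password, it does not encrypt it, so this should travel over HTTPS outside of local testing.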

Roles:

  • admin: manage responders, teams, schedules, escalation policies, alert sources, and account details.
  • responder: read operational state and run incident actions such as acknowledge, resolve, reopen, manual escalation, and notification retry.
  • viewer: read-only access.
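
One way to picture the three roles is as an ordered hierarchy where each action needs a minimum role. This is a minimal sketch of that idea, not PagerPal's actual permission code; the action category names (read, incident_action, configure) are hypothetical.

```python
# Roles from this guide, ordered from least to most privileged.
ROLE_RANK = {"viewer": 0, "responder": 1, "admin": 2}

# Hypothetical action categories; PagerPal's internal names may differ.
MINIMUM_ROLE = {
    "read": "viewer",                # dashboards, incidents, worker status
    "incident_action": "responder",  # acknowledge, resolve, reopen, escalate, retry
    "configure": "admin",            # teams, schedules, policies, alert sources
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if the role meets the minimum role for the action."""
    return ROLE_RANK[role] >= ROLE_RANK[MINIMUM_ROLE[action]]


if __name__ == "__main__":
    for role in ROLE_RANK:
        allowed = [a for a in MINIMUM_ROLE if is_allowed(role, a)]
        print(role, allowed)
```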

Set a strong SECRET_KEY, enable secure session cookies in production, run migrations, then create the first admin account with scripts/create_admin.py or the first-run /login bootstrap screen.

Safe local/demo mode

Use safe mode when testing UI workflows or capturing screenshots. It disables real outbound notifications and the background scheduler jobs:

NOTIFICATION_SENDING_ENABLED=false \
NOTIFICATION_RETRY_WORKER_ENABLED=false \
ESCALATION_WORKER_ENABLED=false \
uvicorn app.main:app --host 127.0.0.1 --port 8000

Open the UI:

http://127.0.0.1:8000/dashboard

If this is a fresh database, /login prompts for the first admin account. Existing deployments should sign in with a login-enabled User Account.

Production readiness checklist

Before using PagerPal for real paging:

  • Copy .env.example to .env and replace every placeholder.
  • Set a strong SECRET_KEY.
  • Set SESSION_COOKIE_SECURE=true for HTTPS production.
  • Set PAGERPAL_BASE_URL to the HTTPS URL operators use.
  • Set ALLOWED_ORIGINS to explicit production origins.
  • Configure at least one notification provider: INFOBIP_* for SMS/WhatsApp or SMTP_* for email.
  • Keep NOTIFICATION_SENDING_ENABLED=false until a controlled test alert is verified.
  • Run python -m alembic upgrade head.
  • Create at least one admin account.
  • Confirm GET /health returns 200.
  • Confirm /settings reports expected notification provider and worker health.
  • Confirm exactly one PagerPal app process has retry/escalation workers enabled.
  • Put TLS in front of the app before exposing it beyond the instance/private network.
  • Configure database/EBS backups.
  • Use /alert-sources to send one controlled test alert and confirm incident creation, notification logs, acknowledgement, and resolution.
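
Some of the configuration items in the checklist above can be checked mechanically before startup. The sketch below is one possible preflight helper, not part of PagerPal. It assumes booleans are written as the string "true", that ALLOWED_ORIGINS is a comma-separated list, and that a 32-character minimum is a reasonable bar for SECRET_KEY; none of these are confirmed by this guide. Provider settings are detected only by the INFOBIP_ and SMTP_ prefixes.

```python
import os
from typing import Mapping


def preflight_problems(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems found in the environment settings."""
    problems = []
    # Length threshold is an assumption, not a PagerPal requirement.
    if len(env.get("SECRET_KEY", "")) < 32:
        problems.append("SECRET_KEY is missing or shorter than 32 characters")
    if env.get("SESSION_COOKIE_SECURE", "").strip().lower() != "true":
        problems.append("SESSION_COOKIE_SECURE is not true")
    if not env.get("PAGERPAL_BASE_URL", "").startswith("https://"):
        problems.append("PAGERPAL_BASE_URL is not an HTTPS URL")
    # Comma-separated format is an assumption.
    origins = [o.strip() for o in env.get("ALLOWED_ORIGINS", "").split(",") if o.strip()]
    if not origins or "*" in origins:
        problems.append("ALLOWED_ORIGINS is empty or contains a wildcard")
    has_provider = any(
        key.startswith(("INFOBIP_", "SMTP_")) and value
        for key, value in env.items()
    )
    if not has_provider:
        problems.append("no INFOBIP_* or SMTP_* provider settings found")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems(os.environ):
        print("WARN:", problem)
```

A check like this covers only static settings. The remaining items (migrations, /health, worker health, and the controlled test alert) still need to be verified against the running app.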

Day-to-day workflow

1. Check live state

Use /dashboard first. It is the operational landing page for:

  • active triggered/acknowledged incidents,
  • worker/config warnings,
  • severity/status scanning,
  • quick links into incident detail pages.

2. Triage an incident

Open /incidents/{id} and review:

  • incident severity and status,
  • assigned/on-call user,
  • timeline events,
  • notification delivery logs,
  • any clipped provider error messages.

3. Acknowledge or resolve

Incident action forms require an explicit "Acting as" responder selection. This keeps workflows working for responders who have no login, and lets an admin or responder record the correct incident actor.

Use:

  • Acknowledge when someone has taken ownership.
  • Resolve when the incident is actually fixed.
  • Reopen when a resolved incident is active again.
  • Manual escalate when the currently assigned responder should be bypassed.
  • Retry notification for failed/exhausted notification logs after the underlying config/problem is fixed.

4. Confirm recovery webhooks

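A Grafana webhook with state: ok or a CloudWatch webhook with NewStateValue: OK can auto-resolve the matching open incident when both the alert source and the external ID match. Repeated recovery webhooks should be idempotent: a second recovery for an already-resolved incident should change nothing.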
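
The matching and idempotency behaviour can be sketched with an in-memory store keyed by alert source and external ID. This is an illustration under assumptions, not PagerPal's implementation: the function names and the dictionary store are hypothetical, and how the external ID is extracted from each provider's payload is left to the caller.

```python
# Statuses that count as "open" in this guide.
OPEN_STATUSES = {"triggered", "acknowledged"}


def is_recovery(payload: dict) -> bool:
    """Detect the two recovery signals named in this guide."""
    return payload.get("state") == "ok" or payload.get("NewStateValue") == "OK"


def apply_recovery(incidents: dict, source_id: str, external_id: str) -> bool:
    """Resolve the matching open incident; return True only if state changed."""
    key = (source_id, external_id)
    if incidents.get(key) in OPEN_STATUSES:
        incidents[key] = "resolved"
        return True
    # Unknown or already-resolved incident: nothing to do (idempotent).
    return False


if __name__ == "__main__":
    store = {("grafana-prod", "disk-full-42"): "triggered"}
    payload = {"state": "ok"}
    if is_recovery(payload):
        print(apply_recovery(store, "grafana-prod", "disk-full-42"))  # first delivery
        print(apply_recovery(store, "grafana-prod", "disk-full-42"))  # repeat delivery
```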

Escalation policies

Escalation level numbers are unique within a policy. Use level 1 for the first page target, level 2 for the next target, and so on; PagerPal rejects duplicate level numbers because they make escalation order ambiguous.
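
The uniqueness rule can be checked before submitting a policy. A minimal sketch, assuming a policy's levels are available as a plain list of integers (the function name is hypothetical):

```python
from collections import Counter


def duplicate_levels(levels: list[int]) -> list[int]:
    """Return the level numbers that appear more than once, sorted."""
    counts = Counter(levels)
    return sorted(level for level, count in counts.items() if count > 1)


if __name__ == "__main__":
    print(duplicate_levels([1, 2, 3]))     # no duplicates
    print(duplicate_levels([1, 2, 2, 3]))  # level 2 is ambiguous
```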

Direct user assignees must be active users. PagerPal keeps existing inactive assignees visible for audit/history, but new escalation configuration should point at an active user or an active team coverage path.

Schedule overrides

Use schedule overrides for temporary coverage changes that intentionally overlap existing on-call windows. Normal schedule entries are rejected when they overlap another entry on the same schedule; override entries are allowed to overlap and take precedence when PagerPal resolves the current on-call responder.
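
The overlap and precedence rules can be sketched as follows. This is an illustration, not PagerPal's resolver. It assumes half-open time windows (start inclusive, end exclusive), that normal entries are checked only against other normal entries, and that when several overrides are active the one with the latest start wins; the guide does not specify any of these details.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Entry:
    user: str
    start: datetime
    end: datetime
    is_override: bool = False


def overlaps(a: Entry, b: Entry) -> bool:
    """Half-open interval overlap test."""
    return a.start < b.end and b.start < a.end


def can_add(existing: list[Entry], new: Entry) -> bool:
    """Overrides may overlap; normal entries may not overlap other normal entries."""
    if new.is_override:
        return True
    return not any(overlaps(new, e) for e in existing if not e.is_override)


def current_on_call(entries: list[Entry], now: datetime) -> Optional[str]:
    """Pick the active entry, preferring overrides over normal entries."""
    active = [e for e in entries if e.start <= now < e.end]
    if not active:
        return None
    # Tie-break by latest start is an assumption.
    return max(active, key=lambda e: (e.is_override, e.start)).user


if __name__ == "__main__":
    week = Entry("alice", datetime(2024, 1, 1), datetime(2024, 1, 8))
    cover = Entry("carol", datetime(2024, 1, 2), datetime(2024, 1, 3), is_override=True)
    print(current_on_call([week, cover], datetime(2024, 1, 2, 12)))
```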

User deactivation

Deactivating a user leaves historical incidents, team memberships, schedule entries, and escalation assignee records intact. It also records a timeline warning on any open incident assigned to that user and removes the user from future on-call resolution. PagerPal rejects inactive users for new team memberships, schedule entries, and direct escalation assignees, so after deactivating someone, add or adjust coverage with an active responder.

Alert source key handling

Alert source API keys are credentials. Treat them like passwords.

Rules:

  • Use header-based credentials where possible: X-API-Key: <alert-source-key>.
  • Avoid query-string credentials because they can appear in logs, proxies, and shell history.
  • PagerPal shows a full alert source key only immediately after creation or regeneration.
  • PagerPal masks alert source keys in default list/get responses.
  • If a key is lost, regenerate it and update the monitoring system.
  • If a key may have leaked, regenerate it immediately.
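
The rules above can be illustrated with a short standard-library sketch: one helper that sends the key in the X-API-Key header rather than the query string, and one that masks a key for display. Both helpers are hypothetical. The webhook URL is a placeholder because this guide does not list the ingestion paths, and the masking format here is not necessarily the one PagerPal uses in its list/get responses.

```python
import json
import urllib.request


def mask_key(key: str, visible: int = 4) -> str:
    """Hide all but the last few characters of a key for display."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


def webhook_request(url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a webhook POST with a header-based key."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/json")
    request.add_header("X-API-Key", api_key)  # never appended to the URL
    return request


if __name__ == "__main__":
    key = "example-key-not-real-0000"
    # Placeholder URL; use the webhook URL shown for your alert source.
    req = webhook_request("https://pagerpal.example.com/webhook-path", key, {"state": "ok"})
    print(req.full_url, mask_key(key))
```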

Worker model

PagerPal v1 uses APScheduler inside the FastAPI process for:

  • notification retries,
  • automatic escalation.

Run exactly one app process with workers enabled. Do not run multiple Uvicorn workers, multiple containers, or multiple EC2 instances with workers enabled until a singleton scheduler/locking design exists.
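
If a deployment does run more than one process (for example, a second instance for UI only), the single-worker rule can be checked against each process's environment. The sketch below is a hypothetical helper, not part of PagerPal. It assumes the worker flags are the two *_WORKER_ENABLED variables shown in the safe-mode command, that they are written as "true" or "false", and that an unset flag counts as disabled; the real default is not stated in this guide, so set both flags explicitly.

```python
from typing import Mapping

WORKER_FLAGS = ("NOTIFICATION_RETRY_WORKER_ENABLED", "ESCALATION_WORKER_ENABLED")


def _enabled(env: Mapping[str, str], flag: str) -> bool:
    # Treating an unset flag as disabled is an assumption.
    return env.get(flag, "false").strip().lower() == "true"


def processes_with_workers(process_envs: Mapping[str, Mapping[str, str]]) -> list[str]:
    """Names of processes that have at least one worker flag enabled."""
    return sorted(
        name for name, env in process_envs.items()
        if any(_enabled(env, flag) for flag in WORKER_FLAGS)
    )


def has_single_worker_process(process_envs: Mapping[str, Mapping[str, str]]) -> bool:
    """True when exactly one process runs the background workers."""
    return len(processes_with_workers(process_envs)) == 1


if __name__ == "__main__":
    envs = {
        "app-main": {"NOTIFICATION_RETRY_WORKER_ENABLED": "true", "ESCALATION_WORKER_ENABLED": "true"},
        "app-ui": {"NOTIFICATION_RETRY_WORKER_ENABLED": "false", "ESCALATION_WORKER_ENABLED": "false"},
    }
    print(processes_with_workers(envs), has_single_worker_process(envs))
```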

Check worker health from:

/settings
/api/v1/system/jobs

The API endpoint requires a real User Account. Viewers may read worker status; configuration changes require admin access.

Notification modes

| Mode | Expected use |
| --- | --- |
| NOTIFICATION_SENDING_ENABLED=false | Safe UI/testing mode. Delivery attempts are logged without real provider sends. |
| NOTIFICATION_SENDING_ENABLED=true | Real paging mode. Requires valid Infobip SMS/WhatsApp config, SMTP email config, or both. |

When real sending is enabled, verify /settings before sending test alerts. Provider validation should surface missing required settings without printing secret values.

Troubleshooting

UI redirects to /login

Sign in with a login-enabled User Account. If no admin exists, complete first-run bootstrap on /login or run python scripts/create_admin.py --email <admin-email>.

API returns 401 Unauthorized

Check that the client is using a real User Account email and password. The old shared UI_USERNAME / UI_PASSWORD process credentials are not accepted for management routes.

API or UI action returns 403 Forbidden

Check the account role. Admin-only configuration writes require admin; incident actions require responder or admin; viewer accounts are read-only.

Alert webhook returns 401 or 403

Check that the webhook includes an active alert source API key. Prefer:

X-API-Key: <alert-source-key>

403 can also mean the alert source exists but is inactive.

Notifications are not sent

Check:

  1. NOTIFICATION_SENDING_ENABLED=true.
  2. /settings notification provider status.
  3. Notification logs on the incident detail page.
  4. Whether the notification is already marked exhausted and needs manual retry after the config is fixed.

Escalations are not happening

Check:

  1. ESCALATION_WORKER_ENABLED=true.
  2. /settings or /api/v1/system/jobs worker status.
  3. The incident is still triggered; acknowledged/resolved incidents should not continue escalating.
  4. The escalation policy level delays and recipient configuration.
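
When working through the checks above, it can help to reason about whether an escalation is even due yet. The sketch below is a simplified model, not PagerPal's scheduler logic. It assumes each level's delay is expressed in minutes and measured from the time the previous level was paged; confirm how your policy's delays are actually interpreted before relying on it.

```python
from datetime import datetime, timedelta


def escalation_due(status: str, last_paged_at: datetime,
                   delay_minutes: int, now: datetime) -> bool:
    """Return True if a still-triggered incident has waited out the level delay."""
    if status != "triggered":
        # Acknowledged or resolved incidents should not keep escalating.
        return False
    return now >= last_paged_at + timedelta(minutes=delay_minutes)


if __name__ == "__main__":
    paged = datetime(2024, 1, 1, 12, 0)
    print(escalation_due("triggered", paged, 15, datetime(2024, 1, 1, 12, 10)))
    print(escalation_due("triggered", paged, 15, datetime(2024, 1, 1, 12, 15)))
    print(escalation_due("acknowledged", paged, 15, datetime(2024, 1, 1, 13, 0)))
```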

Current v1 limitations

  • Single-node/single-process design only.
  • No SSO/SAML yet; local email/password accounts are the current auth model.
  • UI action attribution still depends on the visible "Acting as" selector so that responder-only identities (responders without a login) remain valid.
  • Background workers stop if the app process stops.
  • SQLite is fine for local/demo use; PostgreSQL is recommended for durable AWS hosting.
  • Scaling beyond one process needs scheduler coordination first.