PagerPal Operator Guide#
This guide is for the person running PagerPal day to day. It focuses on the current v1 single-node appliance model: one FastAPI process, one database, and in-process background workers.
Access model#
PagerPal has two separate access mechanisms:
| Surface | Credential | Purpose |
|---|---|---|
| Web UI | Login-enabled User Account session | Operator access to dashboards, incidents, teams, schedules, policies, alert sources, and settings. |
| Management API | Real User Account Basic auth (email + password) | CRUD/read access according to role. |
| Alert ingestion webhooks | Alert source API key | Allows monitoring systems to create/update/resolve incidents. |
| /health | None | Public health endpoint for local checks or load balancers. |
Roles:
- admin: manage responders, teams, schedules, escalation policies, alert sources, and account details.
- responder: read operational state and run incident actions such as acknowledge, resolve, reopen, manual escalation, and notification retry.
- viewer: read-only access.
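The role rules above can be read as a small permission matrix. The following is a minimal sketch of that reading, not PagerPal's actual authorization code; the action group names (`read`, `incident_action`, `config_write`) are hypothetical labels invented for this illustration.

```python
from enum import Enum


class Role(str, Enum):
    ADMIN = "admin"
    RESPONDER = "responder"
    VIEWER = "viewer"


# Hypothetical action groups, named only for this sketch.
READ = "read"                        # dashboards, incidents, worker status
INCIDENT_ACTION = "incident_action"  # acknowledge, resolve, reopen, escalate, retry
CONFIG_WRITE = "config_write"        # teams, schedules, policies, alert sources

# Each role maps to the set of action groups it may perform.
ALLOWED = {
    Role.ADMIN: {READ, INCIDENT_ACTION, CONFIG_WRITE},
    Role.RESPONDER: {READ, INCIDENT_ACTION},
    Role.VIEWER: {READ},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True when the role may perform the given action group."""
    return action in ALLOWED[role]
```

Under this reading, a 403 on a configuration write from a responder account is expected behavior rather than a fault.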
Set a strong SECRET_KEY, enable secure session cookies in production, run migrations, then create the first admin account with scripts/create_admin.py or the first-run /login bootstrap screen.
Safe local/demo mode#
Use safe mode when testing UI workflows or screenshots. It disables real outbound notifications and scheduler jobs:
NOTIFICATION_SENDING_ENABLED=false \
NOTIFICATION_RETRY_WORKER_ENABLED=false \
ESCALATION_WORKER_ENABLED=false \
uvicorn app.main:app --host 127.0.0.1 --port 8000
Open the UI:
http://127.0.0.1:8000/dashboard
If this is a fresh database, /login prompts for the first admin account. Existing deployments should sign in with a login-enabled User Account.
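If safe mode is launched from a script rather than a shell, the same three flags can be applied to the child process environment. This is a sketch under the assumption that PagerPal reads these flags from environment variables exactly as in the shell command above; the helper names are hypothetical.

```python
import os

# The three flags from the safe-mode command in this section.
SAFE_MODE_FLAGS = {
    "NOTIFICATION_SENDING_ENABLED": "false",
    "NOTIFICATION_RETRY_WORKER_ENABLED": "false",
    "ESCALATION_WORKER_ENABLED": "false",
}


def safe_mode_env(base=None):
    """Return a copy of the environment with all safe-mode flags forced off."""
    env = dict(os.environ if base is None else base)
    env.update(SAFE_MODE_FLAGS)
    return env


def safe_mode_command():
    """Return the Uvicorn command from this guide as an argument list."""
    return ["uvicorn", "app.main:app", "--host", "127.0.0.1", "--port", "8000"]
```

A launcher could then call `subprocess.run(safe_mode_command(), env=safe_mode_env(), check=True)`. Forcing the flags in the returned copy means a stray `NOTIFICATION_SENDING_ENABLED=true` in the parent shell cannot leak into a demo run.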
Production readiness checklist#
Before using PagerPal for real paging:
- Copy .env.example to .env and replace every placeholder.
- Set a strong SECRET_KEY.
- Set SESSION_COOKIE_SECURE=true for HTTPS production.
- Set PAGERPAL_BASE_URL to the HTTPS URL operators use.
- Set ALLOWED_ORIGINS to explicit production origins.
- Configure at least one notification provider: INFOBIP_* for SMS/WhatsApp or SMTP_* for email.
- Keep NOTIFICATION_SENDING_ENABLED=false until a controlled test alert is verified.
- Run python -m alembic upgrade head.
- Create at least one admin account.
- Confirm GET /health returns 200.
- Confirm /settings reports expected notification provider and worker health.
- Confirm exactly one PagerPal app process has retry/escalation workers enabled.
- Put TLS in front of the app before exposing it beyond the instance/private network.
- Configure database/EBS backups.
- Use /alert-sources to send one controlled test alert and confirm incident creation, notification logs, acknowledgement, and resolution.
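Several of the configuration items in the checklist can be checked mechanically before the first real page is sent. The sketch below is one possible pre-flight check, not a tool that ships with PagerPal. It assumes ALLOWED_ORIGINS is a comma-separated list and treats 32 characters as the minimum SECRET_KEY length; both are assumptions made for this example.

```python
def preflight_problems(env: dict) -> list:
    """Return human-readable problems found in a settings mapping."""
    problems = []

    # Assumption: 32 characters is treated as the minimum acceptable length.
    if len(env.get("SECRET_KEY", "")) < 32:
        problems.append("SECRET_KEY is missing or shorter than 32 characters")

    if env.get("SESSION_COOKIE_SECURE", "").lower() != "true":
        problems.append("SESSION_COOKIE_SECURE is not true")

    if not env.get("PAGERPAL_BASE_URL", "").startswith("https://"):
        problems.append("PAGERPAL_BASE_URL is not an HTTPS URL")

    # Assumption: origins are comma-separated; a wildcard is not explicit.
    origins = [o.strip() for o in env.get("ALLOWED_ORIGINS", "").split(",") if o.strip()]
    if not origins or "*" in origins:
        problems.append("ALLOWED_ORIGINS is empty or contains a wildcard")

    # At least one provider family must have some non-empty setting.
    has_infobip = any(k.startswith("INFOBIP_") and v for k, v in env.items())
    has_smtp = any(k.startswith("SMTP_") and v for k, v in env.items())
    if not (has_infobip or has_smtp):
        problems.append("no INFOBIP_* or SMTP_* settings are configured")

    return problems
```

Calling `preflight_problems(dict(os.environ))` in a deploy script and refusing to continue on a non-empty result covers the static items; the migration, health, and test-alert steps still need to be run by hand.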
Day-to-day workflow#
1. Check live state#
Use /dashboard first. It is the operational landing page for:
- active triggered/acknowledged incidents,
- worker/config warnings,
- severity/status scanning,
- quick links into incident detail pages.
2. Triage an incident#
Open /incidents/{id} and review:
- incident severity and status,
- assigned/on-call user,
- timeline events,
- notification delivery logs,
- any clipped provider error messages.
3. Acknowledge or resolve#
Incident action forms require an explicit Acting as responder selection. This preserves responder-without-login workflows and lets an admin or responder record the correct incident actor.
Use:
- Acknowledge when someone has taken ownership.
- Resolve when the incident is actually fixed.
- Reopen when a resolved incident is active again.
- Manual escalate when the currently assigned responder should be bypassed.
- Retry notification for failed/exhausted notification logs after the underlying config/problem is fixed.
4. Confirm recovery webhooks#
Grafana state: ok and CloudWatch NewStateValue: OK can auto-resolve matching open incidents when the alert source and external ID match. Repeated recovery webhooks should be idempotent.
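The recovery rule and its idempotency can be sketched in a few lines. This is an illustration of the behavior described above, not PagerPal's ingestion code: only the two payload fields named in this section are inspected, and the in-memory `incidents` dictionary keyed by `(source_id, external_id)` is a stand-in for the real database.

```python
def is_recovery(payload: dict) -> bool:
    """Detect a recovery using the two fields named in this guide."""
    if payload.get("state") == "ok":           # Grafana
        return True
    if payload.get("NewStateValue") == "OK":   # CloudWatch
        return True
    return False


def apply_recovery(incidents: dict, source_id: str, external_id: str) -> bool:
    """Resolve the matching open incident; return True only if state changed.

    `incidents` maps (source_id, external_id) to a dict with a "status" key.
    A repeated recovery finds the incident already resolved and does nothing,
    which is what makes the operation idempotent.
    """
    incident = incidents.get((source_id, external_id))
    if incident is None or incident["status"] == "resolved":
        return False
    incident["status"] = "resolved"
    return True
```

The return value lets a caller record a timeline event only on the first recovery, so duplicate webhooks do not create duplicate history.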
Escalation policies#
Escalation level numbers are unique within a policy. Use level 1 for the first page target, level 2 for the next target, and so on; PagerPal rejects duplicate level numbers because they make escalation order ambiguous.
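When preparing a policy outside the UI, for example in a seeding script, the uniqueness rule can be checked before submitting. The helper below is a hypothetical sketch of that check and does not reflect how PagerPal validates levels internally.

```python
from collections import Counter


def duplicate_levels(levels: list) -> list:
    """Return the level numbers that appear more than once, sorted."""
    counts = Counter(levels)
    return sorted(level for level, n in counts.items() if n > 1)


def validate_levels(levels: list) -> None:
    """Raise ValueError when any level number is repeated."""
    dupes = duplicate_levels(levels)
    if dupes:
        raise ValueError(f"duplicate escalation levels: {dupes}")
```

For instance, `validate_levels([1, 2, 2, 3])` raises because level 2 would have two competing targets.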
Direct user assignees must be active users. PagerPal keeps existing inactive assignees visible for audit/history, but new escalation configuration should point at an active user or an active team coverage path.
Schedule overrides#
Use schedule overrides for temporary coverage changes that intentionally overlap existing on-call windows. Normal schedule entries are rejected when they overlap another entry on the same schedule; override entries are allowed to overlap and take precedence when PagerPal resolves the current on-call responder.
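The overlap rule and override precedence can be modeled as follows. This is a simplified sketch, not PagerPal's scheduler: the `Entry` type is hypothetical, end times are treated as exclusive, normal entries are checked only against other normal entries, and when several overrides are active the first one wins. All four choices are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Entry:
    user: str
    start: datetime
    end: datetime            # exclusive (assumption)
    is_override: bool = False


def overlaps(a: Entry, b: Entry) -> bool:
    """Two half-open intervals overlap when each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def can_add(existing: list, new: Entry) -> bool:
    """Overrides may always overlap; normal entries may not overlap normal ones."""
    if new.is_override:
        return True
    return not any(overlaps(new, e) for e in existing if not e.is_override)


def current_on_call(entries: list, now: datetime) -> Optional[str]:
    """Pick the active override if there is one, else the active normal entry."""
    active = [e for e in entries if e.start <= now < e.end]
    overrides = [e for e in active if e.is_override]
    chosen = overrides or active
    return chosen[0].user if chosen else None
```

With a full-day normal entry and a six-hour override inside it, the override's user is on call during those six hours and the normal entry's user is on call for the rest of the day.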
User deactivation#
Deactivating a user leaves historical incidents, team memberships, schedule entries, and escalation assignee records intact, records a timeline warning on any open incident assigned to that user, and removes the user from future current on-call resolution. PagerPal rejects inactive users for new team memberships, schedule entries, and direct escalation assignees, so after deactivation add or adjust coverage with an active responder.
Alert source key handling#
Alert source API keys are credentials. Treat them like passwords.
Rules:
- Use header-based credentials where possible: X-API-Key: <alert-source-key>.
- Avoid query-string credentials because they can appear in logs, proxies, and shell history.
- PagerPal shows a full alert source key only immediately after creation or regeneration.
- PagerPal masks alert source keys in default list/get responses.
- If a key is lost, regenerate it and update the monitoring system.
- If a key may have leaked, regenerate it immediately.
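The same habits are worth applying in any script or log that touches these keys. The sketch below shows one way to mask a key before printing it and one way to generate a high-entropy replacement value. The masking format and key length are assumptions for illustration and are not necessarily what PagerPal produces.

```python
import secrets


def mask_key(key: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters of a key."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


def new_key() -> str:
    """Generate a URL-safe random token from 32 bytes of entropy."""
    return secrets.token_urlsafe(32)
```

Logging `mask_key(key)` instead of the key itself keeps enough of the value to tell two keys apart without exposing a usable credential.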
Worker model#
PagerPal v1 uses APScheduler inside the FastAPI process for:
- notification retries,
- automatic escalation.
Run exactly one app process with workers enabled. Do not run multiple Uvicorn workers, multiple containers, or multiple EC2 instances with workers enabled until a singleton scheduler/locking design exists.
Check worker health from:
- /settings
- /api/v1/system/jobs
The API endpoint requires a real User Account. Viewers may read worker status; configuration changes require admin access.
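A scripted check of the jobs endpoint needs only HTTP Basic auth with a User Account's email and password. The following sketch uses the Python standard library and assumes the local address used elsewhere in this guide. The response format is not documented here, so the body is returned as raw text rather than parsed.

```python
import base64
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumption: local instance from this guide


def basic_auth_header(email: str, password: str) -> dict:
    """Build the HTTP Basic Authorization header for a User Account."""
    token = base64.b64encode(f"{email}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def build_jobs_request(email: str, password: str) -> urllib.request.Request:
    """Prepare, but do not send, a GET to the worker status endpoint."""
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/system/jobs",
        headers=basic_auth_header(email, password),
        method="GET",
    )


def fetch_jobs(email: str, password: str, timeout: float = 5.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_jobs_request(email, password), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

`urlopen` raises `urllib.error.HTTPError` on 401 and 403, so a monitoring wrapper can distinguish bad credentials or an insufficient role from a stopped process, which surfaces as a connection error instead.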
Notification modes#
| Mode | Expected use |
|---|---|
| NOTIFICATION_SENDING_ENABLED=false | Safe UI/testing mode. Delivery attempts are logged without real provider sends. |
| NOTIFICATION_SENDING_ENABLED=true | Real paging mode. Requires valid Infobip SMS/WhatsApp config, SMTP email config, or both. |
When real sending is enabled, verify /settings before sending test alerts. Provider validation should surface missing required settings without printing secret values.
Troubleshooting#
UI redirects to /login#
Sign in with a login-enabled User Account. If no admin exists, complete first-run bootstrap on /login or run python scripts/create_admin.py --email <admin-email>.
API returns 401 Unauthorized#
Check that the client is using a real User Account email and password. The old shared UI_USERNAME / UI_PASSWORD process credentials are not accepted for management routes.
API or UI action returns 403 Forbidden#
Check the account role. Admin-only configuration writes require admin; incident actions require responder or admin; viewer accounts are read-only.
Alert webhook returns 401 or 403#
Check that the webhook includes an active alert source API key. Prefer:
X-API-Key: <alert-source-key>
403 can also mean the alert source exists but is inactive.
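To rule out the monitoring system as the cause, it can help to send a test webhook by hand with the key in the header. The sketch below builds such a request with the standard library. The webhook URL is left as a parameter because this guide does not list the ingestion paths, and the JSON payload shape depends on the alert source type, so both are supplied by the caller.

```python
import json
import urllib.request


def build_alert_request(url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare, but do not send, a JSON POST carrying the key in X-API-Key."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,  # header-based, never in the query string
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
```

If a hand-built request with a freshly regenerated key still returns 403, the inactive alert source explanation becomes the more likely one.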
Notifications are not sent#
Check:
- NOTIFICATION_SENDING_ENABLED=true.
- /settings notification provider status.
- Notification logs on the incident detail page.
- Whether the notification is already marked exhausted and needs manual retry after the config is fixed.
Escalations are not happening#
Check:
- ESCALATION_WORKER_ENABLED=true.
- /settings or /api/v1/system/jobs worker status.
- The incident is still triggered; acknowledged/resolved incidents should not continue escalating.
- The escalation policy level delays and recipient configuration.
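When reasoning about whether an escalation should already have fired, the status and delay checks can be combined into one condition. This sketch is a mental model only, not the worker's real logic; it assumes level delays are expressed in minutes and measured from the time of the previous notification step, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def escalation_due(status: str, last_step_at: datetime,
                   delay_minutes: int, now: datetime) -> bool:
    """Return True when a still-triggered incident has waited out its delay."""
    # Acknowledged or resolved incidents never escalate further.
    if status != "triggered":
        return False
    # Assumption: the delay is in minutes, counted from the previous step.
    return now >= last_step_at + timedelta(minutes=delay_minutes)
```

If this condition holds for an incident and no escalation appears in its timeline, the worker status is the next thing to inspect.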
Current v1 limitations#
- Single-node/single-process design only.
- No SSO/SAML yet; local email/password accounts are the current auth model.
- UI action attribution still depends on the visible Acting as selector so responder-only identities can remain valid.
- Background workers stop if the app process stops.
- SQLite is fine for local/demo use; PostgreSQL is recommended for durable AWS hosting.
- Scaling beyond one process needs scheduler coordination first.