project-pal-e-platform — updated 2026-04-04
Vision
The infrastructure pillar of a DORA Elite AI Enterprise. In the three-pillar model (platform=DevOps/SRE, docs=product, agency=process+enforcement), pal-e-platform proves the DORA numbers — Deployment Frequency and MTTR. A developer adds one entry to var.services, pushes code to Forgejo, and gets: a namespace, CI pipeline, container registry project, GitOps deployment, TLS ingress, monitoring, log aggregation, and alerting. The Terraform is the control plane. The platform is the product.
Three repos, three control planes, one system. pal-e-platform (Terraform + Salt) provisions the foundation: k3s cluster, Tailscale networking, Forgejo, Woodpecker CI, Harbor, MinIO, CNPG Postgres, Keycloak, and the full monitoring + validation stack. pal-e-services (Terraform) onboards services via ArgoCD and a for_each automation pattern. pal-e-deployments (Kustomize + ArgoCD) defines how applications deploy via GitOps overlays — the source ArgoCD reads for all 6 services. Three control planes manage three layers: Terraform manages what exists in the cluster (Helm releases, namespaces, RBAC). GitOps/ArgoCD manages how applications are delivered (kustomize overlays, image tags, auto-sync). SaltStack manages the host (k3s, nftables firewall, packages, GPG-encrypted pillar). Everything self-hosted. No external cloud dependencies except Tailscale for networking.
Operating thesis: This platform proves that one human architect + AI agent orchestration can build and operate infrastructure that traditionally requires a 50-person engineering organization. Three control planes: Terraform manages everything inside the cluster. GitOps manages application delivery. SaltStack manages everything on the host. A seven-pillar validation framework (observability, SLO governance, policy, security, progressive delivery, load testing, chaos engineering) proves it all works — not through architecture documents, but through measured, repeatable evidence. DORA is the proof.
DORA thesis: Platform hardening IS DORA enablement. Every phase in the hardening plan directly improves one or more DORA metrics — observability reduces MTTR and Change Failure Rate, CI hardening increases Deployment Frequency and reduces Lead Time, Kustomize patterns make deploys repeatable, network security and env isolation reduce blast radius. The virtuous cycle: platform maturity → developers trust production → they ship more often → DORA metrics improve → which validates the platform investment. DORA is two systems measured as one: Observability (SRE — production health) + Kanban (DevEx — value throughput via pal-e-docs boards). The platform provides the observability. Pal-e-docs provides the Kanban. DORA proves both work. This is what makes it an elite AI enterprise — not just that AI agents write the code, but that the system they operate within is measured, observable, and continuously improving.
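The "one entry in var.services" contract can be sketched in HCL. This is an illustrative sketch only — the field names, module path, and optional attributes are assumptions, not the actual pal-e-services schema:

```hcl
# Illustrative sketch — field names and module layout are assumptions,
# not the actual pal-e-services interface.
variable "services" {
  type = map(object({
    port        = number
    needs_db    = optional(bool, false)
    funnel_host = optional(string)
  }))
}

# One map entry fans out into the full per-service bundle via for_each:
# namespace, Harbor project, Woodpecker pipeline, ArgoCD Application,
# ServiceMonitor, Tailscale funnel, and (optionally) a CNPG database.
module "service" {
  source   = "./modules/service"
  for_each = var.services

  name        = each.key
  port        = each.value.port
  needs_db    = each.value.needs_db
  funnel_host = each.value.funnel_host
}
```

Adding a service is then a one-line diff to the map; `tofu plan` shows the whole bundle it will create.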
User Stories
Who uses the platform, what they need, and how we measure success. pal-e-platform serves one primary role: the Superuser who deploys and operates infrastructure for all projects.
| Role | Story | Success Metric | story:X key |
|---|---|---|---|
| Superuser (Lucas) | I can deploy infrastructure changes via tofu plan/apply and see them succeed in Woodpecker CI without manual intervention. | Pipeline success rate >95%. Zero manual kubectl interventions for routine deploys. | story:superuser-deploy |
| Superuser (Lucas) | I can observe the health of all services via Grafana dashboards. When something breaks, I see it before users report it. | MTTR <30min for infrastructure incidents. Alert-to-awareness <5min. | story:superuser-observe |
| Superuser (Lucas) | I can recover from failures using documented SOPs. Every failure mode has a runbook. | All failure modes covered by recovery SOPs. Zero novel failure responses (every response follows an SOP). | story:superuser-recover |
| Superuser (Lucas) | I can onboard a new service to the platform (Forgejo repo, Woodpecker CI, k3s deployment, Tailscale funnel) following a documented procedure. | Service onboarding follows service-onboarding-sop. New service deploys in <1 day. | story:superuser-onboard-service |
| Superuser (Lucas) | I can SSH into the platform from any device (phone, laptop, tablet) using any standard SSH client without browser-based approval gates. | SSH from Termius/any client succeeds on first attempt. Zero browser redirects in the SSH flow. | story:superuser-remote-access |
Architecture
Domain Model
graph LR
subgraph control["Control Planes"]
TF_P["pal-e-platform\n(OpenTofu)"]
TF_S["pal-e-services\n(OpenTofu)"]
SALT["SaltStack"]
end
subgraph platform_resources["Platform Resources"]
HR[Helm Release]
NS[Namespace]
HP[Harbor Project]
SM[ServiceMonitor]
FUNNEL[Tailscale Funnel]
KEYCLOAK[Keycloak IdP]
OLLAMA[Ollama + GPU]
DORA[DORA Exporter]
BLACKBOX[Blackbox Exporter]
end
subgraph service_resources["Per-Service Bundle"]
SVC["Service\n(var.services entry)"]
PIPE[Woodpecker Pipeline]
ARGO_APP[ArgoCD Application]
CNPG_DB[Postgres DB]
OVERLAY["Kustomize Overlay\n(pal-e-deployments)"]
end
subgraph host_resources["Host Resources"]
K3S[k3s Cluster]
FW[nftables Firewall]
PKG[Packages]
PILLAR[GPG-encrypted Pillar]
end
TF_P -->|deploys| HR
TF_S -->|creates per| SVC
SVC --- NS & HP & PIPE & ARGO_APP & SM & FUNNEL
SVC -.->|optional| CNPG_DB
SVC -.->|kustomize overlay| OVERLAY
SALT -->|manages| K3S & FW & PKG & PILLAR
Data Flow
graph LR
subgraph deploy_flow["Deployment Pipeline"]
DEV[Developer] -->|push| FORGEJO[Forgejo]
FORGEJO -->|webhook| WP[Woodpecker CI]
WP -->|test + build via kaniko| HARBOR[Harbor]
HARBOR -->|poll tags| IU[Image Updater]
IU -->|write .argocd-source| DEPLOY[pal-e-deployments\nkustomize overlays]
DEPLOY -->|detect change| ARGO[ArgoCD]
ARGO -->|sync| K8S[k8s Pod]
end
subgraph observe_flow["Observability Pipeline"]
K8S -->|scrape metrics| PROM[Prometheus\n15d retention]
K8S -->|container logs| PROMTAIL[Promtail]
PROMTAIL --> LOKI[Loki\n7d retention]
PROM --> GRAFANA[Grafana]
LOKI --> GRAFANA
PROM -->|alert rules| AM[Alertmanager]
AM -->|notify| TG[Telegram]
BLACKBOX[Blackbox Exporter\n13 probes] -->|probe_success| PROM
DORA[DORA Exporter\n726 metrics] -->|scrape| PROM
end
subgraph infra_flow["Infrastructure Changes"]
PR[PR to main] -->|tofu plan| REVIEW[Plan Output]
REVIEW -->|merge| APPLY[tofu apply]
APPLY -->|update| CLUSTER[k8s Resources]
end
Deployment
graph TD
subgraph host["Arch Linux · 12 cores · 125GB RAM · 1.8TB NVMe"]
SALT["SaltStack\n27 states · GPG pillar · nftables"]
subgraph k3s["k3s Cluster"]
subgraph tf_platform["pal-e-platform (Terraform)"]
monitoring["monitoring\nPrometheus · Grafana · Loki\nPromtail · Alertmanager\nBlackbox Exporter · DORA Exporter"]
forgejo["forgejo\nForgejo git server"]
woodpecker["woodpecker\nCI server + agent\nCNPG Postgres"]
harbor["harbor\nCore · Registry · Nginx\nDB · Redis · Trivy"]
minio["minio\nObject storage"]
cnpg_sys["cnpg-system\nPostgres operator"]
ollama["ollama\nOllama + NVIDIA GPU"]
keycloak["keycloak\nKeycloak IdP (OIDC)"]
tailscale["tailscale\nOperator + funnels"]
end
subgraph tf_services["pal-e-services (Terraform)"]
argocd["argocd\nArgoCD + Image Updater"]
apps["per-service namespaces\npal-e-docs · basketball-api\npal-e-app · westsidekingsandqueens\nplatform-validation"]
end
postgres["postgres\npal-e-postgres (CNPG managed)"]
tofu_state["tofu-state\nTF state secrets"]
end
end
tailscale -.->|TLS funnel| forgejo & harbor & minio & monitoring & woodpecker & keycloak
Validation Pipeline (Target State — Phases 16-23)
graph LR
subgraph tier1["Tier 1 — Foundation"]
SLOTH["Sloth\nSLO YAML → Recording Rules"] -->|generate| RULES["PrometheusRules\nMulti-window burn rate"]
OTEL["OTel Collector"] -->|traces| TEMPO["Tempo\nTrace backend"]
TEMPO --> GRAFANA_T1[Grafana]
RULES --> PROM["Prometheus"]
end
subgraph tier2["Tier 2 — Hardening"]
subgraph security["Security Pipeline"]
COSIGN["Cosign\nCI Signing"] -->|signed image| HARBOR[Harbor]
RENOVATE["Renovate\nDep PRs"] -->|update PRs| FORGEJO[Forgejo]
HARBOR -->|admission| KYVERNO["Kyverno\nPolicy Admission"]
KYVERNO -->|admit/reject| K8S[k8s API]
K8S -->|runtime| FALCO["Falco\neBPF DaemonSet"]
ZAP["OWASP ZAP\nWeekly CronJob"] -->|scan| FUNNELS[Tailscale Funnels]
end
subgraph delivery["Progressive Delivery"]
MERGE[Merge] -->|image update| ROLLOUT["Argo Rollouts\nCanary 20%"]
ROLLOUT -->|query| SLO_CHECK{"SLO burn rate\n< threshold?"}
SLO_CHECK -->|yes| PROMOTE[Promote 100%]
SLO_CHECK -->|no| ROLLBACK[Auto-Rollback]
end
end
subgraph tier3["Tier 3 — Advanced Validation"]
K6["k6 Operator\nLoad Profiles"] -->|test| SERVICES[Service Endpoints]
LITMUS["LitmusChaos\nExperiment Library"] -->|inject fault| CLUSTER[k8s Resources]
end
subgraph glass["Single Pane of Glass"]
PROM_MAIN["Prometheus"]
GRAFANA_MAIN["Grafana\nOperations Dashboard"]
AM["Alertmanager → Telegram"]
end
KYVERNO -->|metrics| PROM_MAIN
FALCO -->|events| PROM_MAIN
ZAP -->|results| PROM_MAIN
K6 -->|remote write| PROM_MAIN
LITMUS -->|exporter| PROM_MAIN
ROLLOUT -->|metrics| PROM_MAIN
PROM -->|federate| PROM_MAIN
PROM_MAIN --> GRAFANA_MAIN
PROM_MAIN -->|alert rules| AM
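The Sloth step in Tier 1 compiles a small SLO spec into multi-window burn-rate recording and alert rules. A minimal sketch in Sloth's prometheus/v1 format — the service, metric names, and objective are hypothetical examples, not the platform's actual SLOs:

```yaml
# Illustrative Sloth spec — service name, metrics, and objective are
# assumptions; Sloth generates the PrometheusRules from this.
version: "prometheus/v1"
service: "example-api"
slos:
  - name: "requests-availability"
    objective: 99.5
    description: "99.5% of requests succeed (non-5xx)."
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="example-api",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="example-api"}[{{.window}}]))
    alerting:
      name: ExampleApiHighErrorRate
      page_alert:
        labels:
          severity: critical
```

One spec yields the recording rules, the fast/slow burn-rate windows, and the paging thresholds — the same rules the Argo Rollouts canary gate queries.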
Plan
Active: plan-pal-e-platform — Platform Hardening
Harden from working dev cluster to production-grade, seven-pillar validated system. 23 phases across three tiers: Tier 1 Foundation — observability (1-5, 14-15), SLO governance/Sloth (16), distributed tracing/OTel (17), operations dashboard (18). Tier 2 Hardening — network security (8), policy-as-code/Kyverno (19), security deepening/Renovate+Falco+Cosign+ZAP (20a-d), progressive delivery/Argo Rollouts (21). Tier 3 Advanced Validation — load testing/k6 (22), chaos engineering/LitmusChaos (23, capstone). 16/23 main phases COMPLETED. Phase 17a (Woodpecker Secrets) in-progress. Every validation tool feeds Prometheus/Grafana — single pane of glass.
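Policy-as-code (Phase 19) means admission rules expressed as Kubernetes resources that Kyverno enforces at the API server. A minimal sketch of the shape involved — this particular policy is illustrative, not one the plan prescribes:

```yaml
# Illustrative Kyverno ClusterPolicy — the specific rule is an example,
# not a policy the hardening plan mandates.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pinned-image-tag
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```

Because policies are plain manifests, they flow through the same GitOps path as everything else, and Kyverno's admission metrics land in Prometheus for the single pane of glass.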
Completed plans:
| Plan | Completed | Summary |
|---|---|---|
| plan-2026-02-26-tf-modularize-postgres | 2026-03-13 | SQLite to Postgres migration + CNPG operator deployment |
| plan-2026-02-25-platform-observability | 2026-03-13 | 5 phases reparented into plan-pal-e-platform |
| plan-2026-02-26-salt-host-management | 2026-02-28 | SaltStack: host audit, bootstrap, codify 27 states, GPG pillar, nftables |
| plan-2026-02-24-minio-object-storage | 2026-02-25 | MinIO standalone deployment |
| plan-2026-03-01-dora-metrics-dashboard | 2026-03-02 | DORA framework + metrics foundation |
Board
board-pal-e-platform — Pal E Platform Board. Continuous kanban. 26 items (1 plan, 22 phases, 3 issues). Columns: Backlog → In Progress → Done. Auto-syncs plan phases via sync_board. Forgejo issues auto-sync via sync-issues.
Status
- Platform stable and operational — all core infrastructure deployed and running. Seven-pillar validation framework scoped (Phases 16-23).
- k3s cluster with Tailscale funnels for ingress/TLS (no cert-manager, no Traefik)
- Forgejo, Woodpecker CI (Postgres-backed via CNPG), Harbor, MinIO, Keycloak all operational
- CNPG Postgres operator deployed — WAL archiving to MinIO, daily base backups, PITR verified
- Monitoring stack: Prometheus (15d retention), Grafana (3 custom dashboards + kube-prometheus defaults), Loki (7d retention), Promtail, Alertmanager (Telegram), Blackbox Exporter (13 probe targets)
- DORA measurement pipeline LIVE — exporter producing 726 metrics. Platform Overall: High-Elite. 262 PRs merged, 11.4/day, p50 lead time 10 min.
- Ollama + NVIDIA device plugin deployed (GPU workloads, Qwen3-Embedding-4B)
- Keycloak IdP LIVE — OIDC chain: Keycloak → basketball-api JWKS → westside-app Auth.js. 50 users, role-based access (admin/coach/player).
- 6 services onboarded via pal-e-services: pal-e-docs, basketball-api, pal-e-app, westsidekingsandqueens, platform-validation, gcal-scheduler
- Woodpecker CI automated — plan-on-PR + apply-on-merge for pal-e-platform. Merge = deploy.
- Kustomize migration COMPLETE — all 6 services on centralized overlays in pal-e-deployments. ArgoCD reads from pal-e-deployments.
- Network security COMPLETE (Phase 8) — three-layer defense: NetworkPolicies (15 namespaces), Tailscale ACLs (role-scoped), nftables host firewall (Salt-managed).
- Alert tuning COMPLETE (Phase 16-alert) — 5 PRs across 4 repos. Alerts reduced from 19 to stale-only.
- Resource usage: 12 cores / 125GB RAM / 1.8TB NVMe. Cluster uses ~11% CPU, ~9% RAM. Massive headroom for validation tooling.
- Salt plan COMPLETE — host fully codified as 27 Salt states, GPG-encrypted pillar, nftables firewall applied.
- Source of truth: Forgejo — migrated from GitHub 2026-02-27. GitHub is historical only.
- 16 of 23 main phases completed + subphases — see plan-pal-e-platform. Three-tier framework: Tier 1 (SLO, OTel, Dashboard) → Tier 2 (Kyverno, Security 20a-d, Rollouts) → Tier 3 (k6 Load, LitmusChaos Capstone). Phase 17a (Woodpecker Secrets) in-progress.
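The Phase 8 NetworkPolicy layer follows the standard default-deny-plus-allowlist shape. A sketch for one hypothetical namespace — the namespace and labels here are assumptions, not the platform's actual policies:

```yaml
# Illustrative default-deny + scrape allowlist — namespace name is an
# assumption; the real policies cover 15 namespaces.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: example-api
spec:
  podSelector: {}        # all pods in the namespace
  policyTypes: [Ingress] # no ingress rules → all ingress denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: example-api
spec:
  podSelector: {}
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
```

Per-namespace allowlists like the second policy punch the minimal holes (scraping, service-to-service calls) through the first.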
Milestones
| Date | Milestone | Impact |
|---|---|---|
| 2026-03-14 | Woodpecker Postgres Migration + DORA Pipeline Complete | 5 PRs, 2 phases completed (5+13), DORA measurement pipeline reliable. Infra DF/LT moved from Medium→High. 726 metrics across 28 repos. Grafana · Alertmanager · Woodpecker |
| 2026-03-14 | Platform Hardening: 8/13 phases complete | Phases 1-6, 10, 13 COMPLETED. Observability stack: 26 Grafana dashboards, 31 alert rule groups, 19 ServiceMonitors, 3 PodMonitors. Alert noise floor: 23→3. |
| 2026-03-02 | Platform bootstrap complete | k3s + Tailscale + Forgejo + Woodpecker + Harbor + MinIO + kube-prometheus-stack + Loki + CNPG + ArgoCD all deployed via OpenTofu. Salt codifies host. |
Repos
| Repo | Platform | Role | Status |
|---|---|---|---|
| pal-e-platform | Forgejo | OpenTofu IaC + SaltStack for base platform | active |
| pal-e-services | Forgejo | OpenTofu IaC for service onboarding | active |
| pal-e-deployments | Forgejo | Kustomize bases + per-service overlays | active |
| minio-sdk | Forgejo | Pure Python S3 SDK with custom Signature V4 signing | active |
| minio-playground | Forgejo | Mobile-first vanilla HTML/CSS/JS file browser prototype | active |
| gmail-sdk | Forgejo | Gmail API SDK — OAuth auth, token lifecycle, email operations | active |
| gmail-mcp | Forgejo | MCP server for Gmail — wraps gmail-sdk for Claude Code | active |
Infrastructure
| Component | Details |
|---|---|
| Host | Arch Linux · 12 cores · 125GB RAM · 1.8TB NVMe · NVIDIA GPU |
| Cluster | k3s single-node · Tailscale funnels for ingress/TLS |
| Control Plane 1 | Terraform (OpenTofu) — pal-e-platform deploys Helm charts, pal-e-services onboards services |
| Control Plane 2 | GitOps — ArgoCD reads kustomize overlays from pal-e-deployments. Image Updater polls Harbor tags. |
| Control Plane 3 | SaltStack — 27 states, GPG-encrypted pillar, nftables firewall, k3s lifecycle |
| CI | Woodpecker CI (Postgres-backed via CNPG). Plan-on-PR, apply-on-merge for Terraform repos. Test+build+push for app repos. |
| Container Registry | Harbor — Trivy scanning, robot accounts per service, SBOM storage (future: Cosign signatures) |
| Object Storage | MinIO — CNPG WAL archives, Loki chunks, Tempo traces (future) |
| Identity | Keycloak — OIDC provider. Realms: westside-basketball, mcd-tracker. JWKS validation in app APIs. |
| Monitoring | Prometheus (15d) · Grafana (26 dashboards) · Loki (7d) · Promtail · Alertmanager → Telegram · Blackbox (13 probes) · DORA Exporter (726 metrics) |
| Secrets | Salt GPG pillar (21 secrets) + SOPS/Age in kustomize overlays (6 app secrets). Two paths per sop-secrets-management. |
| GPU | Ollama + NVIDIA device plugin — Qwen3-Embedding-4B for pal-e-docs semantic search |
| Services onboarded | 6: pal-e-docs, basketball-api, pal-e-app, westsidekingsandqueens, platform-validation, gcal-scheduler |
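Plan-on-PR / apply-on-merge maps naturally onto Woodpecker's event filters. A minimal sketch, assuming the official OpenTofu container image (backend credentials, tags, and plan-output posting omitted):

```yaml
# Illustrative Terraform-repo pipeline — image tag and backend/secret
# wiring are assumptions, not the actual Woodpecker config.
steps:
  plan:
    image: ghcr.io/opentofu/opentofu:1.8
    commands:
      - tofu init
      - tofu plan -no-color   # reviewers read this in the PR checks
    when:
      - event: pull_request
  apply:
    image: ghcr.io/opentofu/opentofu:1.8
    commands:
      - tofu init
      - tofu apply -auto-approve
    when:
      - event: push
        branch: main          # merge = deploy
```

The review gate lives in the PR (humans approve the plan), so the apply step can run unattended on merge.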
Inbox
Untriaged TODOs awaiting scoping into plan-pal-e-platform. See convention-todo-lifecycle.
All 7 platform TODOs are now parked under plan phases. No unparented items in the inbox.