project-page active pal-e-platform
project-pal-e-platform updated 2026-04-04

pal-e-platform

Vision

The infrastructure pillar of a DORA Elite AI Enterprise. In the three-pillar model (platform=DevOps/SRE, docs=product, agency=process+enforcement), pal-e-platform proves the DORA numbers — Deployment Frequency and MTTR. A developer adds one entry to var.services, pushes code to Forgejo, and gets: a namespace, CI pipeline, container registry project, GitOps deployment, TLS ingress, monitoring, log aggregation, and alerting. The Terraform is the control plane. The platform is the product.
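The one-entry onboarding flow above can be sketched as a var.services entry. The field names below are illustrative assumptions — the authoritative schema lives in pal-e-services:

```hcl
# Sketch of a var.services entry. Field names are hypothetical,
# not the actual pal-e-services schema.
services = {
  "example-api" = {
    namespace       = "example-api"            # k8s namespace to create
    harbor_project  = "example-api"            # registry project in Harbor
    funnel_hostname = "example-api"            # Tailscale funnel for TLS ingress
    needs_postgres  = true                     # provision a CNPG database
    overlay_path    = "overlays/example-api"   # kustomize overlay in pal-e-deployments
  }
}
```

A for_each over this map fans out into the per-service bundle: namespace, Harbor project, CI pipeline, ArgoCD Application, ServiceMonitor, and funnel.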

Three repos, three control planes, one system. pal-e-platform (Terraform + Salt) provisions the foundation: k3s cluster, Tailscale networking, Forgejo, Woodpecker CI, Harbor, MinIO, CNPG Postgres, Keycloak, and the full monitoring + validation stack. pal-e-services (Terraform) onboards services via ArgoCD and a for_each automation pattern. pal-e-deployments (Kustomize + ArgoCD) defines how applications deploy via GitOps overlays — the source ArgoCD reads for all 6 services. Three control planes manage three layers: Terraform manages what exists in the cluster (Helm releases, namespaces, RBAC). GitOps/ArgoCD manages how applications are delivered (kustomize overlays, image tags, auto-sync). SaltStack manages the host (k3s, nftables firewall, packages, GPG-encrypted pillar). Everything self-hosted. No external cloud dependencies except Tailscale for networking.
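The GitOps layer's per-service overlays could look like the following minimal kustomization; the paths and image reference are hypothetical, not the repo's actual layout:

```yaml
# Hypothetical overlays/example-api/kustomization.yaml in pal-e-deployments.
# Paths and the image name are illustrative assumptions.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: example-api
resources:
  - ../../base/example-api
images:
  - name: harbor.example.ts.net/example-api/example-api
    newTag: v1.4.2   # tag ArgoCD deploys; Image Updater tracks newer Harbor tags
```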

Operating thesis: This platform proves that one human architect + AI agent orchestration can build and operate infrastructure that traditionally requires a 50-person engineering organization. Three control planes: Terraform manages everything inside the cluster. GitOps manages application delivery. SaltStack manages everything on the host. A seven-pillar validation framework (observability, SLO governance, policy, security, progressive delivery, load testing, chaos engineering) proves it all works — not through architecture documents, but through measured, repeatable evidence. DORA is the proof.

DORA thesis: Platform hardening IS DORA enablement. Every phase in the hardening plan directly improves one or more DORA metrics — observability reduces MTTR and Change Failure Rate, CI hardening increases Deployment Frequency and reduces Lead Time, Kustomize patterns make deploys repeatable, network security and env isolation reduce blast radius. The virtuous cycle: platform maturity → developers trust production → they ship more often → DORA metrics improve → which validates the platform investment. DORA is two systems measured as one: Observability (SRE — production health) + Kanban (DevEx — value throughput via pal-e-docs boards). The platform provides the observability. Pal-e-docs provides the Kanban. DORA proves both work. This is what makes it an elite AI enterprise — not just that AI agents write the code, but that the system they operate within is measured, observable, and continuously improving.

User Stories

Who uses the platform, what they need, and how we measure success. pal-e-platform serves one primary role: the Superuser who deploys and operates infrastructure for all projects.

| Role | Story | Success Metric | Key |
| --- | --- | --- | --- |
| Superuser (Lucas) | I can deploy infrastructure changes via tofu plan/apply and see them succeed in Woodpecker CI without manual intervention. | Pipeline success rate >95%. Zero manual kubectl interventions for routine deploys. | story:superuser-deploy |
| Superuser (Lucas) | I can observe the health of all services via Grafana dashboards. When something breaks, I see it before users report it. | MTTR <30min for infrastructure incidents. Alert-to-awareness <5min. | story:superuser-observe |
| Superuser (Lucas) | I can recover from failures using documented SOPs. Every failure mode has a runbook. | All failure modes covered by recovery SOPs. Zero novel failure responses (every response follows an SOP). | story:superuser-recover |
| Superuser (Lucas) | I can onboard a new service to the platform (Forgejo repo, Woodpecker CI, k3s deployment, Tailscale funnel) following a documented procedure. | Service onboarding follows service-onboarding-sop. New service deploys in <1 day. | story:superuser-onboard-service |
| Superuser (Lucas) | I can SSH into the platform from any device (phone, laptop, tablet) using any standard SSH client without browser-based approval gates. | SSH from Termius/any client succeeds on first attempt. Zero browser redirects in the SSH flow. | story:superuser-remote-access |

Architecture

Domain Model

graph LR
    subgraph control["Control Planes"]
        TF_P["pal-e-platform\n(OpenTofu)"]
        TF_S["pal-e-services\n(OpenTofu)"]
        SALT["SaltStack"]
    end

    subgraph platform_resources["Platform Resources"]
        HR[Helm Release]
        NS[Namespace]
        HP[Harbor Project]
        SM[ServiceMonitor]
        FUNNEL[Tailscale Funnel]
        KEYCLOAK[Keycloak IdP]
        OLLAMA[Ollama + GPU]
        DORA[DORA Exporter]
        BLACKBOX[Blackbox Exporter]
    end

    subgraph service_resources["Per-Service Bundle"]
        SVC["Service\n(var.services entry)"]
        PIPE[Woodpecker Pipeline]
        ARGO_APP[ArgoCD Application]
        CNPG_DB[Postgres DB]
        OVERLAY["Kustomize Overlay\n(pal-e-deployments)"]
    end

    subgraph host_resources["Host Resources"]
        K3S[k3s Cluster]
        FW[nftables Firewall]
        PKG[Packages]
        PILLAR[GPG-encrypted Pillar]
    end

    TF_P -->|deploys| HR
    TF_S -->|creates per| SVC
    SVC --- NS & HP & PIPE & ARGO_APP & SM & FUNNEL
    SVC -.->|optional| CNPG_DB
    SVC -.->|kustomize overlay| OVERLAY
    SALT -->|manages| K3S & FW & PKG & PILLAR

Data Flow

graph LR
    subgraph deploy_flow["Deployment Pipeline"]
        DEV[Developer] -->|push| FORGEJO[Forgejo]
        FORGEJO -->|webhook| WP[Woodpecker CI]
        WP -->|test + build via kaniko| HARBOR[Harbor]
        HARBOR -->|poll tags| IU[Image Updater]
        IU -->|write .argocd-source| DEPLOY[pal-e-deployments\nkustomize overlays]
        DEPLOY -->|detect change| ARGO[ArgoCD]
        ARGO -->|sync| K8S[k8s Pod]
    end

    subgraph observe_flow["Observability Pipeline"]
        K8S -->|scrape metrics| PROM[Prometheus\n15d retention]
        K8S -->|container logs| PROMTAIL[Promtail]
        PROMTAIL --> LOKI[Loki\n7d retention]
        PROM --> GRAFANA[Grafana]
        LOKI --> GRAFANA
        PROM -->|alert rules| AM[Alertmanager]
        AM -->|notify| TG[Telegram]
        BLACKBOX[Blackbox Exporter\n13 probes] -->|probe_success| PROM
        DORA[DORA Exporter\n726 metrics] -->|scrape| PROM
    end

    subgraph infra_flow["Infrastructure Changes"]
        PR[PR to main] -->|tofu plan| REVIEW[Plan Output]
        REVIEW -->|merge| APPLY[tofu apply]
        APPLY -->|update| CLUSTER[k8s Resources]
    end

Deployment

graph TD
    subgraph host["Arch Linux · 12 cores · 125GB RAM · 1.8TB NVMe"]
        SALT["SaltStack\n27 states · GPG pillar · nftables"]
        subgraph k3s["k3s Cluster"]
            subgraph tf_platform["pal-e-platform (Terraform)"]
                monitoring["monitoring\nPrometheus · Grafana · Loki\nPromtail · Alertmanager\nBlackbox Exporter · DORA Exporter"]
                forgejo["forgejo\nForgejo git server"]
                woodpecker["woodpecker\nCI server + agent\nCNPG Postgres"]
                harbor["harbor\nCore · Registry · Nginx\nDB · Redis · Trivy"]
                minio["minio\nObject storage"]
                cnpg_sys["cnpg-system\nPostgres operator"]
                ollama["ollama\nOllama + NVIDIA GPU"]
                keycloak["keycloak\nKeycloak IdP (OIDC)"]
                tailscale["tailscale\nOperator + funnels"]
            end
            subgraph tf_services["pal-e-services (Terraform)"]
                argocd["argocd\nArgoCD + Image Updater"]
                apps["per-service namespaces\npal-e-docs · basketball-api\npal-e-app · westsidekingsandqueens\nplatform-validation"]
            end
            postgres["postgres\npal-e-postgres (CNPG managed)"]
            tofu_state["tofu-state\nTF state secrets"]
        end
    end

    tailscale -.->|TLS funnel| forgejo & harbor & minio & monitoring & woodpecker & keycloak

Validation Pipeline (Target State — Phases 16-23)

graph LR
    subgraph tier1["Tier 1 — Foundation"]
        SLOTH["Sloth\nSLO YAML → Recording Rules"] -->|generate| RULES["PrometheusRules\nMulti-window burn rate"]
        OTEL["OTel Collector"] -->|traces| TEMPO["Tempo\nTrace backend"]
        TEMPO --> GRAFANA_T1[Grafana]
        RULES --> PROM["Prometheus"]
    end

    subgraph tier2["Tier 2 — Hardening"]
        subgraph security["Security Pipeline"]
            COSIGN["Cosign\nCI Signing"] -->|signed image| HARBOR[Harbor]
            RENOVATE["Renovate\nDep PRs"] -->|update PRs| FORGEJO[Forgejo]
            HARBOR -->|admission| KYVERNO["Kyverno\nPolicy Admission"]
            KYVERNO -->|admit/reject| K8S[k8s API]
            K8S -->|runtime| FALCO["Falco\neBPF DaemonSet"]
            ZAP["OWASP ZAP\nWeekly CronJob"] -->|scan| FUNNELS[Tailscale Funnels]
        end
        subgraph delivery["Progressive Delivery"]
            MERGE[Merge] -->|image update| ROLLOUT["Argo Rollouts\nCanary 20%"]
            ROLLOUT -->|query| SLO_CHECK{"SLO burn rate\n< threshold?"}
            SLO_CHECK -->|yes| PROMOTE[Promote 100%]
            SLO_CHECK -->|no| ROLLBACK[Auto-Rollback]
        end
    end

    subgraph tier3["Tier 3 — Advanced Validation"]
        K6["k6 Operator\nLoad Profiles"] -->|test| SERVICES[Service Endpoints]
        LITMUS["LitmusChaos\nExperiment Library"] -->|inject fault| CLUSTER[k8s Resources]
    end

    subgraph glass["Single Pane of Glass"]
        PROM_MAIN["Prometheus"]
        GRAFANA_MAIN["Grafana\nOperations Dashboard"]
        AM["Alertmanager → Telegram"]
    end

    KYVERNO -->|metrics| PROM_MAIN
    FALCO -->|events| PROM_MAIN
    ZAP -->|results| PROM_MAIN
    K6 -->|remote write| PROM_MAIN
    LITMUS -->|exporter| PROM_MAIN
    ROLLOUT -->|metrics| PROM_MAIN
    PROM -->|federate| PROM_MAIN
    PROM_MAIN --> GRAFANA_MAIN
    PROM_MAIN -->|alert rules| AM
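For Tier 1, a Sloth SLO definition that generates multi-window burn-rate recording rules could look roughly like this; the service name, metric names, and 99.5% objective are hypothetical, not the platform's actual SLOs:

```yaml
# Hypothetical PrometheusServiceLevel for Sloth. Service, metrics, and
# objective are illustrative assumptions.
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: example-api
  namespace: monitoring
spec:
  service: example-api
  slos:
    - name: requests-availability
      objective: 99.5
      description: 99.5% of HTTP requests succeed.
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{service="example-api",code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total{service="example-api"}[{{.window}}]))
      alerting:
        name: ExampleApiAvailability
        pageAlert:
          labels:
            severity: critical
```

The generated burn-rate rules are also what the Argo Rollouts canary gate would query before promoting to 100%.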

Plan

Active: plan-pal-e-platform — Platform Hardening

Harden from a working dev cluster to a production-grade, seven-pillar validated system. 23 phases across three tiers: Tier 1 Foundation — observability (1-5, 14-15), SLO governance/Sloth (16), distributed tracing/OTel (17), operations dashboard (18). Tier 2 Hardening — network security (8), policy-as-code/Kyverno (19), security deepening/Renovate+Falco+Cosign+ZAP (20a-d), progressive delivery/Argo Rollouts (21). Tier 3 Advanced Validation — load testing/k6 (22), chaos engineering/LitmusChaos (23, capstone). 16/23 main phases COMPLETED. Phase 17a (Woodpecker Secrets) in-progress. Every validation tool feeds Prometheus/Grafana — single pane of glass.

Completed plans:

| Plan | Completed | Summary |
| --- | --- | --- |
| plan-2026-02-26-tf-modularize-postgres | 2026-03-13 | SQLite to Postgres migration + CNPG operator deployment |
| plan-2026-02-25-platform-observability | 2026-03-13 | 5 phases reparented into plan-pal-e-platform |
| plan-2026-02-26-salt-host-management | 2026-02-28 | SaltStack: host audit, bootstrap, codify 27 states, GPG pillar, nftables |
| plan-2026-02-24-minio-object-storage | 2026-02-25 | MinIO standalone deployment |
| plan-2026-03-01-dora-metrics-dashboard | 2026-03-02 | DORA framework + metrics foundation |

Board

board-pal-e-platform — Pal E Platform Board. Continuous kanban. 26 items (1 plan, 22 phases, 3 issues). Columns: Backlog → In Progress → Done. Auto-syncs plan phases via sync_board. Forgejo issues auto-sync via sync-issues.

Status

  • Platform stable and operational — all core infrastructure deployed and running. Seven-pillar validation framework scoped (Phases 16-23).
  • k3s cluster with Tailscale funnels for ingress/TLS (no cert-manager, no Traefik)
  • Forgejo, Woodpecker CI (Postgres-backed via CNPG), Harbor, MinIO, Keycloak all operational
  • CNPG Postgres operator deployed — WAL archiving to MinIO, daily base backups, PITR verified
  • Monitoring stack: Prometheus (15d retention), Grafana (3 custom dashboards + kube-prometheus defaults), Loki (7d retention), Promtail, Alertmanager (Telegram), Blackbox Exporter (13 probe targets)
  • DORA measurement pipeline LIVE — exporter producing 726 metrics. Platform Overall: High-Elite. 262 PRs merged, 11.4/day, p50 lead time 10 min.
  • Ollama + NVIDIA device plugin deployed (GPU workloads, Qwen3-Embedding-4B)
  • Keycloak IdP LIVE — OIDC chain: Keycloak → basketball-api JWKS → westside-app Auth.js. 50 users, role-based access (admin/coach/player).
  • 6 services onboarded via pal-e-services: pal-e-docs, basketball-api, pal-e-app, westsidekingsandqueens, platform-validation, gcal-scheduler
  • Woodpecker CI automated — plan-on-PR + apply-on-merge for pal-e-platform. Merge = deploy.
  • Kustomize migration COMPLETE — all 6 services on centralized overlays in pal-e-deployments. ArgoCD reads from pal-e-deployments.
  • Network security COMPLETE (Phase 8) — three-layer defense: NetworkPolicies (15 namespaces), Tailscale ACLs (role-scoped), nftables host firewall (Salt-managed).
  • Alert tuning COMPLETE (Phase 16-alert) — 5 PRs across 4 repos. Alerts reduced from 19 to stale-only.
  • Resource usage: 12 cores / 125GB RAM / 1.8TB NVMe. Cluster uses ~11% CPU, ~9% RAM. Massive headroom for validation tooling.
  • Salt plan COMPLETE — host fully codified as 27 Salt states, GPG-encrypted pillar, nftables firewall applied.
  • Source of truth: Forgejo — migrated from GitHub 2026-02-27. GitHub is historical only.
  • 16 of 23 main phases completed + subphases — see plan-pal-e-platform. Three-tier framework: Tier 1 (SLO, OTel, Dashboard) → Tier 2 (Kyverno, Security 20a-d, Rollouts) → Tier 3 (k6 Load, LitmusChaos Capstone). Phase 17a (Woodpecker Secrets) in-progress.

Milestones

| Date | Milestone | Impact |
| --- | --- | --- |
| 2026-03-14 | Woodpecker Postgres Migration + DORA Pipeline Complete | 5 PRs, 2 phases completed (5+13), DORA measurement pipeline reliable. Infra DF/LT moved from Medium→High. 726 metrics across 28 repos. Grafana · Alertmanager · Woodpecker |
| 2026-03-14 | Platform Hardening: 8/13 phases complete | Phases 1-6, 10, 13 COMPLETED. Observability stack: 26 Grafana dashboards, 31 alert rule groups, 19 ServiceMonitors, 3 PodMonitors. Alert noise floor: 23→3. |
| 2026-03-02 | Platform bootstrap complete | k3s + Tailscale + Forgejo + Woodpecker + Harbor + MinIO + kube-prometheus-stack + Loki + CNPG + ArgoCD all deployed via OpenTofu. Salt codifies host. |

Repos

| Repo | Platform | Role | Status |
| --- | --- | --- | --- |
| pal-e-platform | Forgejo | OpenTofu IaC + SaltStack for base platform | active |
| pal-e-services | Forgejo | OpenTofu IaC for service onboarding | active |
| pal-e-deployments | Forgejo | Kustomize bases + per-service overlays | active |
| minio-sdk | Forgejo | Pure Python S3 SDK with custom Signature V4 signing | active |
| minio-playground | Forgejo | Mobile-first vanilla HTML/CSS/JS file browser prototype | active |
| gmail-sdk | Forgejo | Gmail API SDK — OAuth auth, token lifecycle, email operations | active |
| gmail-mcp | Forgejo | MCP server for Gmail — wraps gmail-sdk for Claude Code | active |

Infrastructure

| Component | Details |
| --- | --- |
| Host | Arch Linux · 12 cores · 125GB RAM · 1.8TB NVMe · NVIDIA GPU |
| Cluster | k3s single-node · Tailscale funnels for ingress/TLS |
| Control Plane 1 | Terraform (OpenTofu) — pal-e-platform deploys Helm charts, pal-e-services onboards services |
| Control Plane 2 | GitOps — ArgoCD reads kustomize overlays from pal-e-deployments. Image Updater polls Harbor tags. |
| Control Plane 3 | SaltStack — 27 states, GPG-encrypted pillar, nftables firewall, k3s lifecycle |
| CI | Woodpecker CI (Postgres-backed via CNPG). Plan-on-PR, apply-on-merge for Terraform repos. Test+build+push for app repos. |
| Container Registry | Harbor — Trivy scanning, robot accounts per service, SBOM storage (future: Cosign signatures) |
| Object Storage | MinIO — CNPG WAL archives, Loki chunks, Tempo traces (future) |
| Identity | Keycloak — OIDC provider. Realms: westside-basketball, mcd-tracker. JWKS validation in app APIs. |
| Monitoring | Prometheus (15d) · Grafana (26 dashboards) · Loki (7d) · Promtail · Alertmanager → Telegram · Blackbox (13 probes) · DORA Exporter (726 metrics) |
| Secrets | Salt GPG pillar (21 secrets) + SOPS/Age in kustomize overlays (6 app secrets). Two paths per sop-secrets-management. |
| GPU | Ollama + NVIDIA device plugin — Qwen3-Embedding-4B for pal-e-docs semantic search |
| Services onboarded | 6: pal-e-docs, basketball-api, pal-e-app, westsidekingsandqueens, platform-validation, gcal-scheduler |
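As one sketch of the Salt-managed host layer, a state for the nftables firewall might look like the following; the state IDs, file paths, and template source are assumptions, not the repo's actual states:

```yaml
# Hypothetical states/nftables.sls — package, templated config, and a
# service that reloads on config change. Paths are illustrative.
install_nftables:
  pkg.installed:
    - name: nftables

/etc/nftables.conf:
  file.managed:
    - source: salt://nftables/files/nftables.conf.jinja
    - template: jinja
    - mode: '0600'

nftables_service:
  service.running:
    - name: nftables
    - enable: True
    - watch:
      - file: /etc/nftables.conf
```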

Inbox

Untriaged TODOs awaiting scoping into plan-pal-e-platform. See convention-todo-lifecycle.

All 7 platform TODOs are now parked under plan phases. No unparented items in the inbox.