Project: pal-e-pac (updated 2026-03-15)
Vision
The development experience that lives forever. A LangGraph-based CLI agent framework that replicates the DORA Elite AI Enterprise operating model using local models (Ollama + Qwen) as the sustainable foundation — so Lucas can continue developing with Betty Sue, Dev, QA, and Dottie regardless of Claude Code availability, cost, or policy. 2-pac lives forever.
Thesis: Benchmark-driven development. Define competence as 29 measurable test cases across 8 categories (tool selection, parameter correctness, safety compliance, instruction following, code generation, PR review, doc operations, multi-step reasoning). Every phase is validated against these benchmarks. “Which model handles the Dev prompt best?” becomes a measurable question, not a vibe.
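The benchmark suite maps naturally onto a promptfoo config. A minimal sketch, assuming illustrative prompt files, model tags, and a single tool-selection case (the real suite has 29 cases across 8 categories):

```yaml
# Sketch only: prompt paths, model tags, and the test case are illustrative.
description: pal-e-pac competence benchmarks (tool selection excerpt)
prompts:
  - file://prompts/dev_agent.txt
providers:
  - ollama:chat:qwen3:4b
  - ollama:chat:qwen3:8b
tests:
  - description: "tool selection: list open issues"
    vars:
      task: "List the open issues in pal-e-pac"
    assert:
      - type: contains
        value: forgejo_list_issues
```

Running the same tests against each provider is what turns "which model handles the Dev prompt best?" into a scored comparison.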
User Stories
Who uses this project, what they need, and how we measure success. These stories drive the phased delivery and prompt evaluation priorities.
| # | Role | Story | Success Metric |
|---|---|---|---|
| 1 | Developer | I run pac on CLI and get the same experience — MCP servers loaded, hooks enforced, CLAUDE.md read, Betty Sue personality active | pac starts Goose session with forgejo-mcp + pal-e-docs-mcp connected, ~/.claude/CLAUDE.md parsed, personality prompt applied |
| 2 | Developer | I can spawn a dev agent that reads a Forgejo issue and submits a working PR using a local model | End-to-end: Forgejo issue → local Qwen model → code changes → PR submitted → passes ruff hook. On a real repo, not a toy example |
| 3 | Developer | I can spawn a QA agent that reviews a PR with structured findings using a local model | QA reads PR diff via forgejo-mcp, posts structured review comment with VERDICT line, triggers label hook |
| 4 | Developer | I know which local model handles each agent role best | Promptfoo evaluation suite with scored results per role (coordinator, dev, QA, doc). Results published as reference note in pal-e-docs |
| 5 | Developer | My enforcement hooks work regardless of which agent framework runs | spawn gate, ruff check, PR template check, Closes #N check — all functional under Goose or ported equivalent |
| 6 | Developer | I have graceful degradation — best available model, not all-or-nothing | Config supports: Claude API (when funded) → Qwen-8B (local GPU) → Qwen-4B (fallback, always fits) |
| 7 | Strategic | The system has zero vendor lock-in in the enforcement/SOP layer | Every prompt in claude-custom has a tested local-model variant. No Claude-specific API calls in hooks or MCP servers |
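The graceful-degradation tiers in story 6 can be sketched as a simple ordered fallback. Tier names follow the document; the availability probes are illustrative stand-ins for real checks (API key funded, VRAM free, etc.):

```python
# Minimal sketch of multi-tier backend selection (story 6).
# Probes are illustrative; the real config would check API funding and GPU state.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Backend:
    name: str
    available: Callable[[], bool]

def pick_backend(tiers: list[Backend]) -> str:
    """Walk the tiers in order; the last tier is the always-available floor."""
    for tier in tiers[:-1]:
        if tier.available():
            return tier.name
    return tiers[-1].name

TIERS = [
    Backend("claude-api", lambda: False),  # when funded
    Backend("qwen3-8b", lambda: False),    # when the embedding model is paused
    Backend("qwen3-4b", lambda: True),     # always fits in 8GB VRAM
]
```

Because the floor tier is returned unconditionally, the system degrades to the best available model instead of failing outright.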
Plan
Active plan: plan-pal-e-pac — Sovereign Development Experience
Completed plans: none (new project)
Board
Board: board-pal-e-pac
Status
Phase 3c COMPLETED (2026-03-14): MCP bridge wired (PR #8). Two hotfixes pushed to main, covering: the MultiServerMCPClient API change (no async context manager), tool_name_prefix=True for server-prefixed tool names, async graph invocation, and permissions.yaml patterns updated to single underscore.
Phase 3 BLOCKER (2026-03-14): forgejo-sdk token auth is not published. The source at ~/forgejo-sdk/ has token support, but the Forgejo PyPI registry still serves the old, password-only version, and the MCP subprocess pulls that old SDK via uv run. Fix: push forgejo-sdk to trigger the CI publish. Until then, all MCP calls return 401.
Qwen3.5 baseline progress (full pipeline proven): 51 tools loaded from 2 MCP servers, Qwen3.5-4B generates structured tool calls (correctly picks forgejo_list_issues), safety blocks writes (create_api_token blocked), audit logs written. Qwen3-4B still can't generate structured tool calls (text-only). Qwen3.5 is a massive upgrade for tool use.
Previous: Phase 3b COMPLETED — LangGraph StateGraph (PR #6). Phase 2 COMPLETED — SafetyLayer (PR #2). Phase 1 COMPLETED — Goose smoke test.
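"Generates structured tool calls" means the model emits a machine-parseable tool-call payload rather than prose. A sketch of the shape, with extraction logic; field names follow Ollama's chat API tool-calling format, but the arguments here are assumptions, not captured output:

```python
# Illustrative tool-call payload and extractor; owner/repo values are made up.
import json

raw = json.dumps({
    "message": {
        "role": "assistant",
        "tool_calls": [
            {"function": {"name": "forgejo_list_issues",
                          "arguments": {"owner": "lucas", "repo": "pal-e-pac"}}}
        ],
    }
})

def extract_tool_calls(payload: str) -> list[str]:
    """Return the tool names the model asked for; [] for text-only replies."""
    msg = json.loads(payload).get("message", {})
    return [c["function"]["name"] for c in msg.get("tool_calls", [])]
```

A text-only model like Qwen3-4B produces a `message` with only `content`, so the extractor returns an empty list — which is exactly the failure mode the baseline measured.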
Milestones
None yet.
Architecture
Three views of the system:
- Domain Model — Agent Framework, Model Backend, Prompt Library, MCP Connectors, Evaluation Suite
- Data Flow — User → CLI → LangGraph StateGraph → Ollama → Qwen → tool calls → MCP → Forgejo/pal-e-docs
- Deployment — Local machine (LangGraph + Ollama) ↔ k8s cluster (MCP servers, Forgejo, pal-e-docs, Harbor)
Cross-project dependencies:
| Project | What pal-e-pac reads/uses | Direction |
|---|---|---|
| pal-e-agency | SOPs, conventions, agent definitions, hooks (claude-custom repo) | Reads. Ports prompts for local-model compatibility. Does NOT modify agency process. |
| pal-e-platform | Ollama deployment (Phase 6a), GPU resources (GTX 1070, 8GB VRAM) | Uses. Ollama already deployed. No new platform work needed. |
| pal-e-docs | MCP server (pal-e-docs-mcp), knowledge base | Uses. Connects MCP server via langchain-mcp-adapters. No pal-e-docs changes needed. |
Key architectural decisions:
| Decision | Rationale |
|---|---|
| LangGraph replaces Goose (Phase 3 pivot) | Phase 3 baselines proved Goose's generic orchestration can't compensate for Qwen3-4B's weaknesses (scored 28/100). LangGraph gives us deterministic state machines where the graph controls flow and the model fills slots. |
| CLI-first, pac command (typer + rich + prompt_toolkit) | Same UX as claude. Python CLI with interactive REPL mode. uv for packaging. No IDE dependency. |
| LangGraph over LangChain chains | We need a state machine, not a chain. The model doesn't decide the path — we do. Route → select → model → correct → safety → call → respond. Deterministic graph with model filling slots. |
| langchain-mcp-adapters for MCP bridge | Our MCP servers are already built (forgejo-mcp, pal-e-docs-mcp, woodpecker-mcp). The adapter converts MCP tools to LangChain tools. Zero MCP server changes needed. |
| Read ~/.claude for compatibility | CLAUDE.md, hooks, agent configs already exist and are maintained. Don't duplicate — read the same source of truth. |
| GTX 1070 8GB is the hard constraint | No hardware upgrades. Design for what we have. Qwen3-4B always fits, Qwen3-8B fits when embedding model paused. |
| Benchmark-driven development (Phase 1 finding) | Define competence benchmarks BEFORE porting agents. TDD for AI. Smoke test proved 4B models need explicit guidance — benchmarks quantify the gap and track progress. |
| Safety guardrails before capability (Phase 1 finding) | Qwen-4B called create_api_token unprompted. Weaker models are MORE dangerous with unrestricted MCP access. Permission system is a prerequisite. |
| Read-only testing until dev cluster | Only one cluster (production). No write operations through untested models against production MCP servers. |
| Promptfoo for systematic evaluation | YAML configs, CI-able, compares models on identical prompts. 29 test cases, 8 categories, proven baseline data. |
| NOT a fifth pillar | pal-e-pac is a project that consumes all three pillars (Platform, Docs, Agency). It doesn't define organizational structure — it ensures the operating model survives. |
| Graceful degradation over hard cutover | Claude when available → Qwen-8B when not → Qwen-4B as floor. Multi-tier, not all-or-nothing. |
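The "LangGraph over LangChain chains" and "safety before capability" decisions can be sketched together as a plain-Python pipeline. Node bodies and the blocklist are illustrative stand-ins; the real implementation is a LangGraph StateGraph with the Phase 2 SafetyLayer:

```python
# Sketch: the graph fixes the node order; the model only fills the tool-call slot.
BLOCKED_TOOLS = {"create_api_token"}  # safety guardrails before capability

def model(state: dict) -> dict:
    # Stand-in for the Ollama/Qwen call: proposes a structured tool call.
    state["tool_call"] = {"name": state["proposed_tool"], "args": {}}
    return state

def safety(state: dict) -> dict:
    # Stand-in for the SafetyLayer: block writes no matter what the model asked.
    call = state["tool_call"]
    if call and call["name"] in BLOCKED_TOOLS:
        state["blocked"] = True
        state["tool_call"] = None
    return state

def run(state: dict) -> dict:
    # Deterministic flow: the code decides the path, not the model.
    for node in (model, safety):
        state = node(state)
    return state
```

The point of the shape: even when a weak model proposes a dangerous call (as Qwen-4B did with create_api_token), the graph's safety node intercepts it before any MCP call is made.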
Repos
| Repo | Platform | Role | Status |
|---|---|---|---|
| pal-e-pac | Forgejo | Goose extensions + pac CLI wrapper + benchmarks (extension-first, NOT a fork) | Created 2026-03-14. Issue #1 open (Phase 2: SafetyLayer). |
| claude-custom | Forgejo | Hooks/configs to port (read-only from pac's perspective) | Exists (owned by pal-e-agency) |
Inbox
All work scoped into plan phases. Query: list_notes(project="pal-e-pac", note_type="todo", status="open")
| Slug | Summary | Discovered |
|---|---|---|