
The LLM Council: Multi-Agent Orchestration for Technical Planning

Running competing AI agents in parallel and having them review each other's work through a skeptical council — a pattern for better technical plans.

Matt Dennis

Our Salesforce team at Maven uses JIRA for ticket tracking and sf CLI for org operations. Planning tickets — reading the description, researching the codebase, querying production data, writing up a solution doc — is the kind of work that takes 30-90 minutes of context-switching. It’s also the kind of work an AI agent is surprisingly good at, if you give it the right tools and constraints.


But one agent isn’t enough. Different models catch different things. Opus finds the obscure validation rule; Sonnet catches the deployment ordering issue; Kimi notices the field history tracking gap; GPT-5.2 Codex spots the test class that would break. No single run is complete. So instead of trusting one agent’s output, I built a system that runs multiple agents in parallel and then has a council verify their work against the actual codebase.


The full source is available here: plan-ticket.sh, orchestrate.sh, council.sh.


The Architecture

The system has three layers:

  1. Plan generation — A script that turns a JIRA ticket into a structured solution plan using one AI agent
  2. Orchestration — A script that runs multiple plan generators in parallel across different models
  3. The Council — A script that feeds all competing plans to a skeptical verifier agent that fact-checks claims against the codebase

flowchart TD
    Start([orchestrate.sh<br/>Ticket ID]) --> Menu[Select Models<br/>Interactive Menu]
    Menu --> Parallel[Run Plans<br/>In Parallel]

    Parallel --> P1[Claude Opus]
    Parallel --> P2[Claude Sonnet]
    Parallel --> P3[Kimi K2.5]
    Parallel --> P4[GPT-5.2 Codex]
    Parallel --> P5[Cursor Agent]

    P1 --> Plan1[Plan A]
    P2 --> Plan2[Plan B]
    P3 --> Plan3[Plan C]
    P4 --> Plan4[Plan D]
    P5 --> Plan5[Plan E]

    Plan1 --> Wait[Wait for<br/>All Plans]
    Plan2 --> Wait
    Plan3 --> Wait
    Plan4 --> Wait
    Plan5 --> Wait

    Wait --> Summary[Summary<br/>Success/Failure]
    Summary --> Council{council.sh<br/>Skeptical Verifier}

    Council --> Verify[Fact-Check Claims<br/>Against Codebase]
    Verify --> Report[Council Report<br/>Verified/Refuted/Gaps]

    style Start fill:#e1f5ff
    style Parallel fill:#e8f5e9
    style Council fill:#f3e5f5
    style Report fill:#fff9c4

The whole thing is bash scripts and file-based coordination. No frameworks, no message queues — just background processes writing to docs/plans/drafts/ and a council script that reads them all back.
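The core pattern is small enough to sketch. This is not the real script — the `plan` function, model names, and paths are illustrative stand-ins — but it shows the shape: background jobs writing to a drafts directory, a single `wait`, then reading everything back.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the coordination pattern, not the real plan-ticket.sh.
set -euo pipefail
mkdir -p docs/plans/drafts

plan() {   # stand-in for a real agent invocation
  echo "plan from $1" > "docs/plans/drafts/demo-$1.md"
}

pids=()
for model in opus sonnet kimi; do
  plan "$model" &        # each agent runs as an independent background job
  pids+=("$!")
done
wait "${pids[@]}"        # block until every job exits

cat docs/plans/drafts/demo-*.md   # the council reads the files back afterward
```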


Plan Generation

Each agent gets the same inputs:

  • SKILL.md — A 300-line instruction set covering how to fetch tickets, research the codebase, query production data, and structure a plan
  • references/examples.md — Template and example plans matching the team’s format
  • references/learnings.md — Accumulated org knowledge from previous runs
  • Headless overrides — Instructions to run non-interactively (no stopping for clarification, no posting to JIRA)

The plan script assembles these into a mega-prompt, launches the agent with a strict tool allowlist, and streams the output:

claude -p "$PROMPT" \
  --allowedTools "$ALLOWED_TOOLS" \
  --model "$MODEL" \
  --output-format stream-json \
  --verbose | jq -r '...'

The tool allowlist is the safety mechanism. Every sf command is read-only — no deploys, no data creation, no apex execution. The agent can query production and describe objects, but it can’t change anything:

Bash(jira*),Bash(sf org list*),Bash(sf data query*),Bash(sf config get*),
Bash(sf org display*),Bash(sf sobject describe*),Bash(ls*),Bash(cat*),
Bash(mkdir*),Read,Write,Edit,Glob,Grep,Task

The script supports three agent backends — Claude Code, OpenCode, and Cursor’s agent mode — via a dispatch pattern. OpenCode is the bridge to non-Anthropic models: Kimi K2.5, GPT-5.2 Codex, Big Pickle, or anything else you wire up. All backends get the same prompt and produce plans in the same format. The model name gets embedded in the output filename so you can tell them apart.
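The dispatch itself is just a `case` statement on the backend name. The sketch below uses echo placeholders rather than real CLI invocations — only the `claude -p` form appears in this post, so the other backends' actual flags are not shown — but the shape is the same:

```shell
# Dispatch sketch: one entry point, one prompt, backend chosen by $AGENT.
# The echoed commands are placeholders, not the real CLIs' flag sets.
run_agent() {
  local prompt="$1"
  case "${AGENT:-claude}" in
    claude)       echo "would run: claude -p <prompt> --model ${MODEL:-opus}" ;;
    opencode)     echo "would run: opencode backend with model ${MODEL:?}" ;;
    cursor-agent) echo "would run: cursor-agent backend" ;;
    *)            echo "unknown AGENT: $AGENT" >&2; return 1 ;;
  esac
}

AGENT=opencode MODEL=opencode/kimi-k2.5 run_agent "plan SFDC-1091"
```

Because every backend goes through the same function with the same prompt, adding a new model is one more case branch, not a new script.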


A typical run takes 3-8 minutes. The agent fetches the JIRA ticket, checks the learnings file, searches the codebase for relevant objects/flows/triggers, queries production data, and writes a structured plan to docs/plans/drafts/.


Orchestration

The orchestrator’s job is simple: run multiple plans in parallel, wait for all of them, then invoke the council.


You give it a ticket ID and it shows an interactive menu:

Select models to run (toggle with number, 'd' when done):

[ ] 1. Claude Opus
[X] 2. Claude Sonnet
[ ] 3. OpenCode Big Pickle
[ ] 4. OpenCode Kimi K2.5
[ ] 5. OpenCode GPT-5.2 Codex
[ ] 6. OpenCode Custom Model
[ ] 7. Cursor Agent
[d] Done

Select models, hit ‘d’, and the script launches all selected plans as background processes. Each writes to its own log file. The orchestrator polls every 5 seconds and shows completion status:

Running 4 plan(s) in parallel...

  ✓ Claude Sonnet completed
  ✓ OpenCode Kimi K2.5 completed
  ✓ OpenCode GPT-5.2 Codex completed
  ✓ Claude Opus completed

All plans completed.

Total time is roughly equal to the slowest agent, not the sum of all agents. Four 5-minute plans finish in 5 minutes, not 20.
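The 5-second poll can be as simple as counting marker files. The sketch below assumes each background plan touches a `.done` file when it finishes — that marker convention is my assumption, not necessarily how the real orchestrator signals completion:

```shell
# Polling sketch: loop until $expected ".done" markers appear in $dir.
# The marker-file convention is an assumption for illustration.
poll_until_done() {
  local expected="$1" dir="$2" finished
  while :; do
    finished=$(ls "$dir"/*.done 2>/dev/null | wc -l)
    echo "  $finished/$expected plan(s) completed"
    [ "$finished" -ge "$expected" ] && break
    sleep "${POLL_INTERVAL:-5}"
  done
}

mkdir -p /tmp/poll-demo && touch /tmp/poll-demo/{a,b}.done
POLL_INTERVAL=1 poll_until_done 2 /tmp/poll-demo
```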


When all plans finish, the orchestrator automatically invokes the council.


The Council

This is the interesting part, and where the system earns its keep.


The council script finds all plans matching a ticket prefix, combines them into a megaprompt, and launches an Opus agent with a specific persona: a skeptical verifier that has full read access to the codebase and is instructed to use it.
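Assembling the megaprompt is a loop over the drafts directory. A sketch — the persona line is abbreviated and a demo draft is created so it runs standalone; the real prompt is far longer:

```shell
# Megaprompt assembly sketch (persona text and demo draft are illustrative).
mkdir -p docs/plans/drafts
echo "Proposed fix: rename picklist value." > docs/plans/drafts/SFDC-1100-demo.md

TICKET="SFDC-1100"
PROMPT="You are a skeptical verifier. Fact-check the plans below against the codebase."
for plan in docs/plans/drafts/"$TICKET"-*.md; do
  # label each plan with its filename so the council can cite it
  PROMPT+=$'\n\n=== '"$(basename "$plan")"$' ===\n'"$(cat "$plan")"
done
printf '%s\n' "$PROMPT"
```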


The earlier version of this system had a fundamental limitation — the council only compared plans against each other. It could spot contradictions and gaps, but if every plan made the same wrong assumption, the council would nod along. Consensus is not correctness.


The current council prompt fixes this. The agent has Read, Grep, and Glob access, and the prompt explicitly tells it to spot-check factual claims against the code:

## Your Stance: Skeptical Verifier
- Treat every factual claim as unverified until YOU confirm it
  against the codebase
- When plans cite specific files, line numbers, field names, or
  behaviors — spot-check them with Read/Grep/Glob
- When all plans agree on something, ask: "Could they all be wrong?"
- When plans disagree, determine who's right by checking the source
  of truth (the code)
- Weight evidence quality: a plan that shows its work is more
  trustworthy than one that asserts without proof

There’s also a constraint that matters more than it sounds: the council runs on Opus, and one of the plans it’s reviewing was probably written by Opus. Without explicit guardrails, the council will re-solve the problem from scratch — it already knows what it would do. So the prompt is blunt about this:

Do NOT re-solve the problem — you are the judge, not a contestant.
One of these plans may already be yours. Evaluate, don't compete.

The council report has seven sections:


1. Summary Table

What each plan proposes, in a comparison table:

| Plan | Model | Key Differentiator |
| --- | --- | --- |
| opencode-8a3f8c | opencode | Concise; minimal risk analysis |
| opus-e7a3b1 | opus | Most detailed; covers Apex, Flow, reports |
| cursor-b4e9d1 | cursor | Good detail; explicit migration strategy |
| kimi-f2a7c9 | kimi | Strong data analysis; novel edge case |

2. Fact-Check & Correctness

This is the highest-value section. The council reads the actual files and classifies claims:

  • Verified — “Plan A says PGCollectionTriggerHandler.cls line 77 checks !String.isBlank(rep.Status__c). Confirmed — read the file, that’s what it does.”
  • Refuted — “Plan C says there’s a validation rule on Status__c. Checked — no validation rules exist on PG_Report__c.”
  • Unverifiable — “All plans reference a Google Doc with approved values. Can’t fetch it.”
  • Consensus Traps — “All four plans assume the picklist values in the JIRA description are authoritative. None verified this against the Google Doc.”

3. Pros and Cons

For each plan, what it does well and what it misses, scored by evidence quality. From a real council report on SFDC-1100:

opus-e7a3b1 (Evidence: Strong) — Most thorough analysis. Explicitly checked Apex (PGCollectionTriggerHandler.cls lines 77, 104) and Flow (PG_Report_After_Insert_Update). Correctly identifies that Salesforce doesn’t support in-place picklist rename.

opencode-8a3f8c (Evidence: Weak) — Says “Replace picklist values” without specifying deployment ordering. Doesn’t mention Apex or Flow verification at all — assumes no automation is affected without showing the work.


4. Contradictions

Where plans disagree, with the council’s determination of who’s right. Not just “these two disagree” — the council checks the code and resolves it:

Deployment ordering: opencode-8a3f8c says “migration before deployment.” The other three say add-new-value first, then migrate, then remove-old. Council verified: in a restricted picklist, you can’t migrate to a value that doesn’t exist. opencode-8a3f8c is wrong.


5. Gaps

What ALL plans miss. The council has an explicit checklist: validation rules, page layouts, record types, test classes with hardcoded values, reports, list views, deployment ordering, rollback plans, cross-object dependencies.


6. Recommendation

Which plan to start from, what to pull from others, and what’s still unresolved. The council isn’t allowed to generate a new plan — it points at existing material:

Start with opus-e7a3b1. Adopt cursor’s 5-step deployment ordering. Add the trackHistory=false finding from cursor/opencode. Before implementing: fetch the Google Doc, check for test classes referencing removed values, confirm whether this is a 1-deploy or 2-deploy process.


7. Confidence Assessment

Overall confidence level with the biggest remaining risk. Usually something like: “Medium confidence. Would be High if someone confirms the Google Doc matches the JIRA ticket values.”


The Learnings Flywheel

All agents share a learnings file — accumulated org knowledge from previous planning sessions:

  • Field behaviors (“Became_an_MQL_Date__c is set when lead qualifies as MQL, triggers round robin assignment”)
  • Automation maps (“LeadTrigger + LeadTriggerHandler — round robin, SLA calc, market category”)
  • Permission patterns (“RevOps bypass uses custom permissions: $Permission.RevOps_* in validation rules”)
  • Corrections (“‘Attempting’ is not a Lead Status value; it’s ‘Attempting Contact’”)

After ~30 tickets, the learnings file covers most of the org’s core objects and automation. New runs start with all that context loaded, so agents spend less time searching and more time reasoning. The --no-learnings flag strips this out for blind tests — useful for benchmarking how much the accumulated knowledge actually helps.
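The toggle itself is a one-line guard in the prompt assembly. A sketch — function and variable names here are my assumptions, not the real script's:

```shell
# Sketch of the --no-learnings toggle (names are assumptions, not the real script).
build_prompt() {
  local learnings_file="$1" prompt="(skill instructions)"
  if [ "${NO_LEARNINGS:-false}" != "true" ] && [ -f "$learnings_file" ]; then
    prompt+=$'\n\n## Org Learnings\n'"$(cat "$learnings_file")"
  fi
  printf '%s' "$prompt"
}

echo "'Attempting' is not a Lead Status value" > /tmp/learnings-demo.md
build_prompt /tmp/learnings-demo.md                     # normal run: learnings included
NO_LEARNINGS=true build_prompt /tmp/learnings-demo.md   # blind test: stripped out
```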


Usage

# Full orchestration: parallel plans + council
./scripts/plan-ticket/orchestrate.sh SFDC-1100

# Blind test (without learnings)
./scripts/plan-ticket/orchestrate.sh --no-learnings SFDC-1100

# Single plan with a specific model
./scripts/plan-ticket/plan-ticket.sh --opus SFDC-1091
./scripts/plan-ticket/plan-ticket.sh --sonnet SFDC-1091

# Different agent backends
AGENT=cursor-agent ./scripts/plan-ticket/plan-ticket.sh SFDC-1091
AGENT=opencode ./scripts/plan-ticket/plan-ticket.sh --model opencode/kimi-k2.5 SFDC-1091
AGENT=opencode ./scripts/plan-ticket/plan-ticket.sh --model opencode/gpt-5.2-codex SFDC-1091

# Council on existing plans (skip re-generating)
./scripts/plan-ticket/council.sh SFDC-1100

Why This Works

The single biggest insight from running this system is that models fail in different ways. Opus is thorough but sometimes over-engineers. Sonnet is fast but skips edge cases. Kimi finds things the others miss and misses things the others find. GPT-5.2 Codex is strong on code analysis but weak on Salesforce-specific deployment nuance. No single model consistently produces the best plan.


The council doesn’t make the plans better — it makes the gaps visible and the errors provable. When the council says “Plan C claims there’s a validation rule on this object — I checked, there isn’t,” that’s not opinion. That’s a fact-check with a citation. A human still reads the report and makes the call, but instead of mentally diffing four 100-line plans, they get a structured report that says “these three disagree about X, the code says Y, none of them considered Z, and the best starting point is Plan A with modifications.”


The coordination overhead is near zero. The agents don’t talk to each other during execution — there’s no shared state, no message passing, no complex framework. Each agent runs independently and writes a file. The council reads all the files afterward. It’s embarrassingly parallel, coordinated through the filesystem.


Limitations

The agents can’t read Google Docs, Confluence pages, or anything behind authentication. When tickets link to external requirements docs (which they often do), plans will have an “External references (not fetched)” note in Open Questions. The council will flag this as an unverifiable assumption, but it can’t resolve it.


The learnings file has no expiration or validation. If the org changes and nobody updates the learnings, agents will plan based on stale information. In practice the codebase research step catches most discrepancies, but it’s a known gap.


The council’s fact-checking is spot-checks, not exhaustive verification. It reads files and greps for patterns, but it’s not running the full test suite or deploying to a sandbox. A claim can pass the council’s check and still be wrong in a way that only shows up at deploy time.


The plans are first drafts. They’re good enough to start a conversation and often catch things a human would miss (obscure validation rules, related automation, edge case data), but they still need human review before becoming implementation specs. The council report is the thing that makes the human review fast — it’s a structured diff with citations, not raw material.


What’s Next

The obvious extension is closing the loop — having the council’s gaps feed back into a second round of targeted research. Right now the council identifies open questions but doesn’t answer them. A second pass where agents specifically investigate the council’s unresolved items would catch more issues.


Another direction is specialization. Instead of running the same prompt through different models, you could run different prompts — one agent focused on data impact analysis, another on deployment risk, another on automation side effects. The council would then synthesize specialist perspectives rather than comparing generic plans.


For now, the simple version works. Five agents, one council, 5-10 minutes of wall time, and a structured report that consistently catches things no single agent would find alone. The source is here: plan-ticket.sh, orchestrate.sh, council.sh.