Maestro
Architecture

Why we promoted LLM nodes to first-class Agents

When the same prompt lives in three places, you've got an abstraction problem. Here's how we lifted LLM-node configs out of score graphs and into a workspace-level catalog — and why the schema change was easier than the naming.

In v0.1, an LLM node in a Maestro score carried its own system prompt, model, and tool list as inline columns on the score_nodes row. If you wanted two scores to share a “draft cold-outreach opener” prompt, you copy-pasted it. If you wanted to tune the prompt later, you opened each LLM node individually.

That worked when there was one hero score with two LLM nodes. It started to feel wrong the second I cloned cold-leads-v2 to test a variant and ended up with two Draft opener nodes carrying byte-identical 200-line system prompts, one in each score, neither aware of the other.

In v0.2.0 we promoted LLM-node configs to agents — first-class workspace entities, listed at /agents, edited on a dedicated detail page. Each LLM node in each score references an agent by foreign key. One agent, many nodes. Edit the prompt once; everywhere that uses it picks the change up on the next run.

This post is the “why” — what the abstraction earned us, and what it cost.

The duplicate-prompt smell

The cold-leads-v2 graph has two LLM nodes:

  1. Shortlist: filters the scraped leads down to the SaaS-engineering titles worth contacting.
  2. Draft opener: writes the cold-outreach opening message.

Reply-triage-v2 has one:

  1. Classify intent: labels each inbound reply's intent as structured JSON.

Three LLM nodes total on the lab box. Three system prompts. About 600 lines of prompt-engineering across them.

The first time I cloned cold-leads-v2 to test a healthcare-ICP variant, I expected the obvious thing: clone the score graph, edit the cloned Shortlist’s prompt to filter for healthcare titles instead of SaaS engineering. What I got was two completely independent prompts that started identical but immediately drifted as I edited one and forgot to mirror the change to the other. A week later, “fix the JSON-output instruction in the classifier” meant remembering which scores referenced which classifier and editing each one.

Three LLM nodes is enough to feel the smell. Five is enough that the smell becomes a real cost. Ten and you’re shipping bugs because you tuned one prompt and forgot another.

Two cheap reads

The path of least resistance was a dedupe-via-convention story: “operators agree to copy a prompt only once and reference it via comment.” That’s not a story — that’s a wish. Conventions don’t survive their first new contributor.

The second-cheapest read was a generic key-value store: a prompts table with a name and a body, referenced by score_nodes via name. Cheap to build but it ducks the question of what concept owns this prompt. The prompt is not the only attribute that travels with it — the model choice (Sonnet vs Haiku vs Opus) and the allowed-tools list belong to the same logical unit. A bare key-value store leaves those still scattered.

What the abstraction is actually for

The right model emerged once I named the thing. An agent in Maestro’s grammar is the reusable reasoning unit. Three attributes travel together because they are jointly the agent’s identity:

  1. The system prompt.
  2. The model (Sonnet, Haiku, or Opus; null means the workspace default).
  3. The allowed-tools list: the skill names the agent may call.

That’s it. No conversation state or history. No tool schemas (those come from the skills). No identity beyond a workspace-unique slug.
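Sketched as a record, the identity is small. The field names mirror the agents table; the class itself is illustrative, not Maestro code.

```python
# Illustrative sketch of the agent's three-attribute identity.
# Field names mirror the agents table; this class is not Maestro code.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgentIdentity:
    slug: str                                  # workspace-unique handle
    system_prompt: str                         # the reusable instructions
    model: Optional[str] = None                # None = workspace default
    allowed_tools: Optional[List[str]] = None  # skill names, not tool schemas
```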

This isn’t a new invention. It’s the unit that frameworks like LangChain or AutoGen call “the agent.” Maestro’s twist is that the agent is the inner unit, not the outer one — the score graph is the outer unit, and agents are reasoning steps inside scores. Most multi-agent frameworks stack it the other way (agents at the top, with sub-tasks underneath). Inverted, you get the cost story Maestro wants: deterministic nodes do the bulk of the work, agents fire only where reasoning genuinely earns its keep.

The schema lift

Phase 5d’s migration (0012) was structurally trivial:

CREATE TABLE agents (
  id text PRIMARY KEY,
  workspace_id text NOT NULL REFERENCES workspaces(id),
  name text NOT NULL,
  slug text NOT NULL,
  description text,
  model text,                 -- nullable; null = workspace default
  system_prompt text NOT NULL,
  allowed_tools jsonb,         -- string[] of skill names
  is_template boolean DEFAULT false,
  source_agent_id text,        -- when cloned, original's id
  version integer DEFAULT 1,
  created_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now(),
  UNIQUE (workspace_id, slug)
);

ALTER TABLE score_nodes
  ADD COLUMN agent_id text REFERENCES agents(id) ON DELETE RESTRICT;

Plus a data-migration pass that walked every existing score_nodes row of kind='llm', created an agents row holding the inline prompt + model + tools, and pointed the node’s new agent_id at it.
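That pass is simple enough to sketch. A hypothetical version, with the row shapes assumed from the schema above and the deterministic ag_<original_node_id> ids the post describes:

```python
# Hypothetical sketch of 0012's data-migration pass: walk every
# kind='llm' score_nodes row, mint an agents row from its inline
# columns, and return the node -> agent_id mapping for backfill.
# Row shapes and the slug rule are assumptions.

def backfill_agents(llm_nodes):
    """llm_nodes: dicts with id, workspace_id, name, system_prompt,
    model, allowed_tools (the inline columns being promoted)."""
    agents, node_to_agent = [], {}
    for node in llm_nodes:
        agent_id = f"ag_{node['id']}"  # deterministic id, per the post
        agents.append({
            "id": agent_id,
            "workspace_id": node["workspace_id"],
            "name": node["name"],
            "slug": node["name"].lower().replace(" ", "-"),
            "system_prompt": node["system_prompt"],
            "model": node.get("model"),          # null = workspace default
            "allowed_tools": node.get("allowed_tools"),
        })
        node_to_agent[node["id"]] = agent_id     # fills score_nodes.agent_id
    return agents, node_to_agent
```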

The migration ran on the lab box in one tick. The three seeded LLM nodes became three seeded agents (Shortlist, Draft opener, Classify intent) with deterministic ids ag_<original_node_id>. The orchestrator continued reading the inline columns at runtime — the agent_id became the new authoring surface, not the new runtime path. Phase 6 will switch the loader to JOIN agents and drop the inline columns; until then they stay as a fallback so the migration is rollback-safe.

What got harder than expected

Two things, and only one of them was the schema.

The interesting hard thing: edit-impact visibility. Once two scores reference the same agent, editing the agent’s prompt affects both their next runs. That’s the whole point — but it’s also a foot-gun. An operator tuning a prompt for one score can silently break behavior in another score that shares the agent.

The fix is mostly UI. The agent detail page renders a “Used by” panel listing every score that references the agent, with per-score node counts. When the operator clicks Save with a dirty form and usedByScoreCount > 0, a warning surfaces: “Used by N scores. Saving will apply to every score’s next run.” It’s the simplest possible affordance — but it’s the one that turns the abstraction from a foot-gun into a feature.

The DB column did one piece of it for free: agent_id references agents with ON DELETE RESTRICT. Try to delete an agent that’s still in use and the database refuses. The API translates this into a structured 409 with the list of referencing scores so the frontend can render “still in use by 3 scores” with click-throughs.
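A hedged sketch of that translation; the payload field names and the function shape are assumptions, not Maestro's actual API.

```python
# Sketch: turning the ON DELETE RESTRICT refusal into a structured 409.
# In practice the referencing scores would be queried after Postgres
# raises the FK violation; here the list is passed in directly.
# Field names and the response shape are assumptions.

def delete_agent_response(agent_id, referencing_scores):
    """referencing_scores: summaries of scores still pointing at the agent."""
    if referencing_scores:
        # Structured conflict body so the frontend can render
        # "still in use by N scores" with click-throughs.
        return 409, {
            "error": "agent_in_use",
            "agent_id": agent_id,
            "used_by": [{"score_id": s["id"], "name": s["name"]}
                        for s in referencing_scores],
        }
    return 204, None  # nothing references it; the delete proceeds
```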

The annoying hard thing: naming. The legacy table named agents already existed — but it actually held cron-attached score deployments (what we now call “score runners”). The pre-Phase-5 codebase carried this misnomer for a year. Promoting LLM-node configs to first-class meant claiming the agents name, which meant renaming the legacy table to score_runners first, which meant a 7-file sweep across the Python runtime, the API routes, and the web app’s URLs.

The schema rename was a 50-line migration. The follow-on bug discovery — a single missed Postgres trigger function that still referenced NEW.agent_id and threw on every run insert — surfaced two days later. We caught it via an issue-tracked audit pass; the fix was migration 0013, four lines of CREATE OR REPLACE FUNCTION. Naming is hard.

Cloning is now self-contained

A nice side effect: cloning a score template now produces a fully independent workspace copy. The clone endpoint runs in one transaction:

  1. Insert the new score row with source_score_id pointing at the original.
  2. For every LLM node in the source’s graph, clone the agent it references (with a -copy slug suffix). Map source agent id → new agent id.
  3. Clone every score node with the remapped agent reference.
  4. Clone every edge with remapped node ids.
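The remapping in steps 2 through 4 can be sketched as one pass; the id formats and in-memory shapes here are assumptions, not Maestro's actual code.

```python
# Sketch of the clone transaction's remapping (steps 2-4).
# Id formats and the dict shapes are assumptions.
import uuid


def clone_score_graph(nodes, edges):
    """nodes: [{'id', 'agent_id'}]; edges: [{'src', 'dst'}]."""
    agent_map, node_map = {}, {}
    # Step 2: one fresh agent per distinct source agent, so nodes that
    # shared an agent in the source still share its clone.
    for n in nodes:
        src_agent = n.get("agent_id")
        if src_agent and src_agent not in agent_map:
            agent_map[src_agent] = f"ag_{uuid.uuid4().hex[:8]}"
    # Step 3: clone every node, remapping its agent reference.
    for n in nodes:
        node_map[n["id"]] = f"nd_{uuid.uuid4().hex[:8]}"
    new_nodes = [{"id": node_map[n["id"]],
                  "agent_id": agent_map.get(n.get("agent_id"))}
                 for n in nodes]
    # Step 4: clone every edge with remapped node ids.
    new_edges = [{"src": node_map[e["src"]], "dst": node_map[e["dst"]]}
                 for e in edges]
    return new_nodes, new_edges
```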

The clone has its own agents. Edit the clone’s Draft opener (copy) and the original’s Draft opener is unchanged. Re-clone the template later and you get fresh copies of everything. Operators can experiment without fear of breaking the working hero score.

For the rare case where two scores actually should share an agent — say, a workspace where five different cold-outreach scores all want the same shortlist filter — the Composer’s LLM-node side panel surfaces an explicit “Use shared agent” picker. Pick the agent, save. The shared usage shows up on the agent’s detail page so the operator sees the cross-score impact before editing.

When to promote

The general lesson is older than this codebase. When the same configuration travels with the same lifecycle to multiple sites in the system, it wants to be a first-class thing. The signs:

  1. You’ve copy-pasted the configuration twice. Once is fine. Twice is the smell.
  2. Edits to one site need to propagate to others manually. That’s a class of bug.
  3. The configuration has more than one attribute and they all change together. A single string can stay denormalized; a struct with three coupled fields is starting to look like a noun.
  4. You’re naming the duplicates with parenthesized suffixes (Draft opener, Draft opener (healthcare), Draft opener — old version). Those parens are telling you the entity wants a real identity.

For Maestro the signs were all there by the second cloned score. The promotion took half a day and headed off a week’s worth of paper-cuts that hadn’t yet shipped.

What’s next

Phase 6 makes the agent the runtime path, not just the authoring surface. The orchestrator’s load_score() will JOIN agents and the inline system_prompt / model / allowed_tools columns drop from score_nodes. That’s a one-day rewrite that we deferred until the new shape proved itself in production.
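Assuming the loader reads each node with its agent joined in, the Phase 6 query might take roughly this shape. The column names come from the 0012 schema above; load_score()'s internals are an assumption.

```python
# Speculative shape of Phase 6's loader query. Column names come from
# the 0012 schema; everything else is an assumption. LEFT JOIN keeps
# deterministic nodes (agent_id IS NULL) in the result with NULL
# prompt/model/tools columns.
LOAD_SCORE_SQL = """
SELECT n.id, n.kind,
       a.system_prompt, a.model, a.allowed_tools
FROM score_nodes n
LEFT JOIN agents a ON a.id = n.agent_id
WHERE n.score_id = %s
"""
```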

Beyond Phase 6 is the question of whether agents should themselves be composable — an agent that can call other agents as tools, not just deterministic skills. The schema supports it (allowed_tools is a JSON array; we could add agent slugs alongside skill names). Whether the operator experience supports it is a different question. We’ll wait for a real ask.

For now: agents are a first-class thing, scores reference them, the Composer’s editor surfaces the shared usage, and the lab box’s three seeded agents drive every LLM call across the hero score. Three nouns where there used to be one. Worth the rename.

