Maestro

Skills and tools

The word “skill” does double duty in Maestro and the overload is worth getting straight:

  • Skill (package) — the unit of distribution. Everything under skills/catalog/<name>/ is “the foo skill” or “the foo skill package.” gmail, apollo, pipeline, notify, compose — all of these are skill packages, regardless of what their operations look like internally.
  • Tool vs. skill (operation kind) — the conceptual category each operation falls into. A tool is deterministic (send a Gmail message, write a row to the contacts table). A skill operation in this sense is LLM-backed and non-deterministic (draft an opener, classify a reply’s intent).

So a skill package can contain a mix of tool operations and LLM-backed operations. Most of our packages are pure tool — gmail, apollo, pipeline, notify. The compose package is pure LLM-backed. None of our v1 packages mix the two within one package, but the architecture supports it (a hypothetical gmail-with-summarize could).

Knowing which kind an operation is matters for three reasons: cost forecasting, debugging, and trust. When something goes wrong in a run, the first question is “was the failure in a tool or an LLM-backed op?” — they fail differently and you fix them differently.

Anatomy of a skill package

A skill package contains:

  • A manifest (manifest.yaml) declaring the package’s name, version, required secrets, and concurrency limits.
  • One or more operations — methods on a class decorated with @skill(...). Each operation method is decorated with @operation. Each operation has a kind (tool by default; llm when LLM-backed).
  • Optional icon and description for the catalog UI.

The Maestro UI surfaces all operations in the Tools catalog and surfaces the packages they belong to in the Skills catalog. When a run executes, the timeline pill is colored by kind so you can see at a glance which steps were LLM-driven.

A note on the kind annotation. The @operation(kind="tool" | "llm") annotation will drive the run timeline’s pill colors. Today the kind is captured in operation descriptions and is informational only; the colored-pill UI lands in an upcoming release.

Why one package, two pill colors

Bundling the deterministic and LLM-backed operations of a domain together (e.g. Gmail’s reads + sends could live alongside an LLM-backed Gmail summarizer) keeps the credentials in one place and the related code colocated. Splitting them by kind at the operation level keeps the cost/trust distinction visible in every run.

Where skills live

skills/
├── sdk/                     # Decorators, registry, secret resolution
│   └── src/maestro_skills/
└── catalog/                 # Shipping skill packages
    ├── http/                # Built-in HTTP escape hatch
    ├── web-research/        # Tavily-backed search and extract
    └── gmail/               # OAuth + read/send/label

The SDK is published to the runtime as a Python package; the catalog directory is scanned at boot for manifest.yaml files. Adding a skill is as simple as dropping a directory under skills/catalog/.
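The boot-time scan can be sketched in a few lines. This is a hypothetical reconstruction, not the SDK's real scanner; it assumes each package carries a manifest.yaml at its root, matching the anatomy described above.

```python
from pathlib import Path

def discover_skill_packages(catalog_root: Path) -> list[str]:
    """Hypothetical sketch of the boot-time catalog scan:
    every immediate subdirectory with a manifest at its root
    is treated as one skill package."""
    return sorted(
        manifest.parent.name
        for manifest in catalog_root.glob("*/manifest.yaml")
    )
```

Directories without a manifest (the SDK itself, stray checkouts) are simply skipped, which is what makes "adding a skill is dropping a directory" safe.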

Anatomy of an operation

from maestro_skills import skill, operation
from pydantic import BaseModel

class ListInboxIn(BaseModel):
    query: str = ""
    max_results: int = 25

@skill(name="gmail", version="0.1.0")
class Gmail:
    @operation(id="list_inbox", kind="tool")
    async def list_inbox(self, input: ListInboxIn) -> list[dict]:
        """Return recent threads matching `query`."""
        ...

Three things to notice:

  1. JSON Schema is generated from the Pydantic input model. The agent doesn’t see Python types — it sees the JSON Schema and decides what to pass.
  2. kind="tool" means the timeline renders a deterministic-color pill. Use kind="llm" for operations that call a model.
  3. Secrets are not parameters. They are resolved from the workspace’s vault by the SDK’s Secrets class — the LLM never sees them and can’t accidentally leak them in a tool call.
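Point 1 is easy to see directly. Assuming Pydantic v2, the schema the agent receives is just the model's generated JSON Schema:

```python
from pydantic import BaseModel

class ListInboxIn(BaseModel):
    query: str = ""
    max_results: int = 25

# This dict, not the Python class, is what the agent sees
# and reasons about when deciding what to pass.
schema = ListInboxIn.model_json_schema()
```

Defaults survive the trip, so the agent can omit `query` entirely and the operation still receives a well-formed input.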

The full author guide lives at Skills overview.

Why this matters at runtime

When a run step is recorded, it carries a kind field (llm, skill_op, or decision). The dashboard renders these as differently-colored pills:

  • LLM steps are the cost ledger. If a run cost more than expected, the pill colors tell you whether the volume was in cheap tool calls or expensive model calls.
  • Tool steps are the determinism ledger. If a run produced a bad result, the pill colors tell you whether the failure was an API quirk (deterministic) or a model hallucination (non-deterministic).

The split also gives the run timeline a “shape.” A healthy cold-leads run looks like: tool (find leads) → tool (enrich) → llm (draft openers) → tool (queue sends). Anything wildly different from that shape is worth investigating.
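The shape check above can be sketched with the standard library alone. run_shape is a hypothetical helper, and real step records carry more fields than kind; this only shows the idea of reducing a timeline to its kind sequence plus a per-kind tally.

```python
from collections import Counter

def run_shape(steps: list[dict]) -> tuple[list[str], Counter]:
    """Hypothetical sketch: reduce a run's recorded steps to
    (ordered kind sequence, per-kind counts) for eyeballing
    against the expected shape of a healthy run."""
    kinds = [step["kind"] for step in steps]
    return kinds, Counter(kinds)
```

Comparing the tally of a suspicious run against a known-good run of the same agent is a quick first triage step before reading individual steps.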

What about MCP, function calling, “tools”?

Anthropic’s API has a tools parameter that lets the model call functions. Maestro uses it. The tools Maestro passes to the model are the JSON Schemas of the operations the agent has access to — both deterministic and LLM-backed alike, since from the model’s perspective they’re all just callable functions.
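A sketch of that mapping, assuming each operation exposes an id, a docstring, and its generated JSON Schema (the operation dict's field names are illustrative; the output entries use the name / description / input_schema shape Anthropic's tools parameter expects):

```python
def to_anthropic_tools(operations: list[dict]) -> list[dict]:
    """Hypothetical sketch: every operation, deterministic or
    LLM-backed alike, becomes one entry in the `tools` list
    passed to the model."""
    return [
        {
            "name": op["id"],
            "description": op["doc"],
            "input_schema": op["schema"],
        }
        for op in operations
    ]
```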

The conceptual split (tool vs. skill) is Maestro’s abstraction for cost and trust. The wire-protocol detail (Anthropic tools) is orthogonal to that split. Both happen to be called “tools” in their respective contexts; context disambiguates which one is meant.