Evals

Evals are structured assessments of agent performance, output quality, and baseline integrity across the matic system. They run during the mandatory Assessment state in an agent's lifecycle and produce scored records that feed directly into staffing decisions, capability profile updates, and regression gates. This section explains how evals are defined, executed, scored, and used, including the five eval types matic supports: automated, human-scored, outcome-driven, coverage, and regression.

Eval Schema

Eval Schema defines the structure and fields of an eval, including type, scope, trigger conditions, and expected output format.
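As a rough illustration of the fields listed above, an eval definition could be modeled as follows. This is a minimal sketch, not matic's actual schema: the class names `EvalType` and `EvalDefinition`, the field names, and the `validate` helper are all assumptions made for this example. Only the five eval types come from the overview.

```python
from dataclasses import dataclass
from enum import Enum


class EvalType(Enum):
    # The five eval types named in the overview.
    AUTOMATED = "automated"
    HUMAN_SCORED = "human-scored"
    OUTCOME_DRIVEN = "outcome-driven"
    COVERAGE = "coverage"
    REGRESSION = "regression"


@dataclass(frozen=True)
class EvalDefinition:
    # Hypothetical field names; the real schema may differ.
    name: str
    type: EvalType
    scope: str                      # e.g. "agent", "work_item", "baseline"
    triggers: tuple[str, ...] = ()  # conditions under which the eval runs
    output_format: str = "json"     # expected output format


def validate(defn: EvalDefinition) -> list[str]:
    """Return a list of problems; an empty list means the definition is usable."""
    problems = []
    if not defn.name.strip():
        problems.append("name must not be empty")
    if not defn.scope.strip():
        problems.append("scope must not be empty")
    if not defn.triggers:
        problems.append("at least one trigger condition is required")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a definition in a single pass.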

Eval Runners

Eval Runners explains how evals are executed at runtime, including the automated eval pipeline and its integration with the agent lifecycle.

Eval Targets

Eval Targets describes what evals measure: agent capability during Assessment, work item output at delivery, baseline adherence post-delivery, and charter alignment during onboarding.

Scoring Rubrics

Scoring Rubrics covers how eval results are scored, including coverage dimensions such as floor-adherence and ceiling-adherence, regression pass/fail criteria, and human-scored narrative feedback.

Probe Evals

Probe Evals documents lightweight, side-effect-free evaluation of agent probes, which relies only on data reads and pattern-matching heuristics and never invokes GenAI.
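The idea of a probe eval that only reads data and matches patterns can be sketched as below. Everything here is hypothetical: the probe record shape (`id`, `output`), the `HEURISTICS` table, and the `evaluate_probe` function are illustrative assumptions, not matic's API.

```python
import re
from types import MappingProxyType

# Hypothetical heuristics: label -> regex applied to the probe's text output.
HEURISTICS = {
    "mentions_error": re.compile(r"\b(error|exception|traceback)\b", re.IGNORECASE),
    "empty_answer": re.compile(r"^\s*$"),
}


def evaluate_probe(probe: dict) -> dict:
    """Score a probe record using only reads and regex matches (no model call)."""
    # A read-only view guards against accidental writes to the probe record.
    record = MappingProxyType(probe)
    text = record.get("output", "")
    flags = sorted(label for label, rx in HEURISTICS.items() if rx.search(text))
    return {"probe_id": record.get("id"), "flags": flags, "passed": not flags}
```

Because the function neither mutates its input nor calls out to any service, it can be run repeatedly on the same record with the same result.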

Baseline Management

Baseline Management explains how baselines are recorded, versioned, and referenced by work items through baselines_at_risk to protect existing desired state from regressions.
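One way to picture versioned baselines and the `baselines_at_risk` reference is the sketch below. Only the name `baselines_at_risk` comes from this page. `BaselineRegistry`, `WorkItem`, `resolve_at_risk`, and the idea of storing desired state as a string are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BaselineRegistry:
    # Maps baseline id -> recorded desired states (index 0 is version 1).
    versions: dict[str, list[str]] = field(default_factory=dict)

    def record(self, baseline_id: str, desired_state: str) -> int:
        """Append a new version and return its 1-based version number."""
        history = self.versions.setdefault(baseline_id, [])
        history.append(desired_state)
        return len(history)

    def latest(self, baseline_id: str) -> tuple[int, str]:
        """Return (version number, desired state) for the newest version."""
        history = self.versions[baseline_id]
        return len(history), history[-1]


@dataclass
class WorkItem:
    title: str
    # Ids of baselines this work item might regress.
    baselines_at_risk: list[str] = field(default_factory=list)


def resolve_at_risk(item: WorkItem, registry: BaselineRegistry) -> dict[str, tuple[int, str]]:
    """Look up the latest recorded version of every baseline the item puts at risk."""
    return {bid: registry.latest(bid) for bid in item.baselines_at_risk}
```

Keeping old versions in an append-only history means a work item can be checked against the exact desired state that was current when it was delivered.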

Regression Checks

Regression Checks covers the automated test suites run against touched baselines post-delivery, where a baseline violation is treated as a hard delivery block rather than a retryable error.
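The distinction between a hard delivery block and a retryable error can be sketched as follows. The exception names, the `run_regression_checks` function, and the retry limit are hypothetical. The only behavior taken from this page is that a baseline violation blocks delivery instead of being retried.

```python
from typing import Callable, Mapping


class RetryableError(Exception):
    """Transient failure (hypothetical), e.g. the test runner timed out."""


class DeliveryBlocked(Exception):
    """A touched baseline was violated; delivery must not proceed."""


def run_regression_checks(
    checks: Mapping[str, Callable[[], bool]],
    max_attempts: int = 3,
) -> list[str]:
    """Run one check per touched baseline and return the ids that passed.

    A check returning False is a baseline violation and raises DeliveryBlocked
    immediately. Only RetryableError is retried, up to max_attempts.
    """
    passed = []
    for baseline_id, check in checks.items():
        for attempt in range(1, max_attempts + 1):
            try:
                ok = check()
            except RetryableError:
                if attempt == max_attempts:
                    raise  # transient failures exhausted; surface the error
                continue
            if not ok:
                # Never retried: a violation is a hard block.
                raise DeliveryBlocked(f"baseline violated: {baseline_id}")
            passed.append(baseline_id)
            break
    return passed
```

In this sketch a flaky runner gets another attempt, while a check that completes and reports a violation stops the run on its first failure.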
