Auditability and evaluation of agentic workflows

Two interconnected initiatives to transform StackAI's observability and quality story: org-wide analytics, and continuous evaluation as a deployment gate.

Company

StackAI

Product

Enterprise AI workflow platform

Role

Product Designer

Year

2026

Impact

Usage grew 5.5× with 3× deeper user engagement. Runs drill-downs drove 22,000+ views in six months.

Discovery

Personas

Executive Sponsors

Prove AI adoption + ROI across teams

Admins

Quickly monitor production health

Builders

Debug, iterate, and improve workflows

Internal Teams (StackAI)

Understand platform usage

Insights Analytics

Split across two surfaces with no org-wide view. Run logs were buried under charts and lacked caller context, and timezone bugs appeared across the UI and exports.

Insights Evaluator

Barely used because of manual test uploads and deployment friction. Competitors had moved to continuous, trace-native evaluation.

DESIGN PROCESS

Benchmark evaluator

Conducted competitive research (Laminar, Langfuse, Arize Phoenix), mapped the full eval journey, and iterated from lo-fi proposal → interactive prototype → stakeholder testing → engineering handoff.

Benchmark analytics

Mapped the opportunity space in FigJam, ran multiple design iterations across global and project-level views, and validated with stakeholders before handing off specs to engineering.

Evaluator user flow

Mapped the end-to-end journey from sandbox testing through production, covering build, publish, monitor, and continuous evaluation with human-in-the-loop gates.

User journeys

Documented how builders run workflows, debug failures, and loop back through analytics to audit runs and fix issues.

Evaluator user flow from sandbox to production

Analytics concepts

Sketched run-level detail views, trace timelines, and error clustering patterns for the analytics experience.

Ideation with PM + Eng

Ran collaborative design sessions exploring multiple UI directions, annotating trade-offs across node editors, data tables, and evaluation workflows before converging on a shared concept.

SOLUTION

Improved analytics

Org-wide analytics with caller context, feedback filtering, and a simplified layout that puts debugging first.

Analytics view à la Temporal

Time-series metrics and workflow health across your org: success rates, latency percentiles, and error trends at a glance.
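As a rough illustration of the metrics named above, here is a minimal sketch of how a success rate and latency percentiles could be derived from a batch of runs. The `Run` record and its field names are hypothetical and do not reflect StackAI's actual schema; the percentile method (inclusive interpolation from Python's standard library) is also an assumption.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Run:
    # Hypothetical run record; field names are illustrative only.
    workflow_id: str
    succeeded: bool
    latency_ms: float


def summarize(runs: list[Run]) -> dict[str, float]:
    """Return success rate plus p50/p95 latency for a batch of runs."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to compute percentiles")
    latencies = [r.latency_ms for r in runs]
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {
        "success_rate": sum(r.succeeded for r in runs) / len(runs),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
    }
```

Grouping the same summary by time bucket would give the time-series view; that step is omitted here to keep the sketch short.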

Org-wide analytics view inspired by Temporal workflow metrics

Agent evaluator

A run can be viewed in analytics, sent to an evaluator, and used to build a test dataset: one shared data model by design.

INSIGHTS

Two sides of the same coin

Analytics and evaluation are interconnected. A run can be viewed in analytics, sent to an evaluator, and used to build a test dataset; the two systems share a data model by design.
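One way to picture the shared data model is a single run record that all three paths consume without conversion. The sketch below is an assumption made for illustration: `Run`, `EvalDataset`, `analytics_row`, and `evaluate` are hypothetical names, and the substring check stands in for whatever scoring a real evaluator applies.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Run:
    # Hypothetical shared record; every consumer below reads the same type.
    run_id: str
    workflow_id: str
    inputs: str
    output: str


@dataclass
class EvalDataset:
    # A test dataset is just a named collection of existing runs.
    name: str
    cases: list[Run] = field(default_factory=list)

    def add(self, run: Run) -> None:
        self.cases.append(run)


def analytics_row(run: Run) -> dict[str, str]:
    """Project a run into the columns an analytics table might show."""
    return {"run_id": run.run_id, "workflow_id": run.workflow_id}


def evaluate(run: Run, expected_substring: str) -> bool:
    """Toy evaluator: passes when the output contains the expected text."""
    return expected_substring in run.output
```

Because the dataset stores runs directly, promoting a production run to a test case is an append rather than an export and re-upload.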

Adoption ≠ debugging

Executive sponsors and builders need separate views of the same data, not just more filters.

Friction kills good ideas

The evaluator existed but went unused because deploying it required infrastructure work. The new model makes signals zero-configuration.

ADLC makes evaluation strategic

When evals are deployment gates, defining "good" becomes a required step before shipping, not a retrospective.
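To make the gate idea concrete, here is a minimal sketch assuming a workflow can be called as a function and that "good" is expressed as a predicate plus a pass-rate threshold. The names `pass_rate` and `can_deploy` and the 0.9 default are hypothetical, not StackAI's implementation.

```python
from typing import Callable, Sequence

# Each case pairs an input with the output the team has agreed is "good".
Case = tuple[str, str]


def pass_rate(
    cases: Sequence[Case],
    workflow: Callable[[str], str],
    is_good: Callable[[str, str], bool],
) -> float:
    """Fraction of cases whose actual output satisfies the predicate."""
    if not cases:
        # With no cases, "good" has not been defined, so the gate cannot run.
        raise ValueError("a gate needs at least one test case")
    passed = sum(is_good(workflow(inp), expected) for inp, expected in cases)
    return passed / len(cases)


def can_deploy(
    cases: Sequence[Case],
    workflow: Callable[[str], str],
    is_good: Callable[[str, str], bool],
    threshold: float = 0.9,  # illustrative default, not a recommendation
) -> bool:
    """Allow deployment only when the pass rate meets the threshold."""
    return pass_rate(cases, workflow, is_good) >= threshold
```

The empty-dataset error reflects the point above: the gate cannot run until someone has written down what a good result looks like.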