Monday.com Achieves 8.7x Faster AI Agent Testing with LangSmith Integration

Rebeca Moen
Feb 18, 2026 08:39

Monday Service reveals an eval-driven development framework that cut AI agent testing from 162 seconds to 18 seconds using LangSmith and parallel processing.

Monday.com’s enterprise service division has slashed AI agent evaluation time by 8.7x after implementing a code-first testing framework built on LangSmith, cutting feedback loops from 162 seconds to just 18 seconds per test cycle.

The technical deep-dive, published February 18, 2026, details how the monday Service team embedded evaluation protocols into their AI development process from day one rather than treating quality checks as an afterthought.

Why This Matters for Enterprise AI

Monday Service builds AI agents that handle customer support tickets across IT, HR, and legal departments. These agents use a LangGraph-based ReAct architecture—essentially AI that reasons through problems step by step before acting. The catch? Each reasoning step depends on the previous one, so a small error early in the chain can cascade into completely wrong outputs.

“A minor deviation in a prompt or a tool-call result can cascade into a significantly different—and potentially incorrect—outcome,” the team explained. Traditional post-deployment testing wasn’t catching these issues fast enough.
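
To make that cascade risk concrete, here is a minimal sketch of a ReAct-style loop in TypeScript. It is not monday Service's actual LangGraph code; callModel and runTool are hypothetical stand-ins for the LLM call and the tool executor. Because every step is conditioned on the full history, one bad thought or tool observation skews everything that follows.

```typescript
// Minimal ReAct-style loop (illustrative only, not monday Service's agent code).
// `callModel` asks the LLM for the next step; `runTool` executes a tool call.
type Step = {
  thought: string;
  action?: { tool: string; input: string };
  finalAnswer?: string;
};

async function reactLoop(
  ticket: string,
  callModel: (history: Step[]) => Promise<Step>,
  runTool: (tool: string, input: string) => Promise<string>,
  maxSteps = 8,
): Promise<string> {
  // Seed the history with the ticket; every later step is conditioned on it.
  const history: Step[] = [{ thought: `Ticket: ${ticket}` }];
  for (let i = 0; i < maxSteps; i++) {
    // Each new step sees the full history, so an early error in a thought or
    // tool observation propagates into every step that follows.
    const step = await callModel(history);
    if (step.finalAnswer) return step.finalAnswer;
    if (step.action) {
      const observation = await runTool(step.action.tool, step.action.input);
      history.push({ ...step, thought: `${step.thought}\nObservation: ${observation}` });
    } else {
      history.push(step);
    }
  }
  throw new Error(`No final answer after ${maxSteps} steps`);
}
```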

The Technical Stack

The framework runs on two parallel tracks. Offline evaluations function like unit tests, running agents against curated datasets to verify core logic before code ships. Online evaluations monitor production traffic in real-time, scoring entire conversation threads rather than individual responses.
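
As a rough illustration of the offline track, a unit-test-style run against a curated LangSmith dataset might look like the sketch below. The dataset name, agent entry point, and scoring rule are assumptions made for illustration, and the exact evaluate() options can vary by SDK version.

```typescript
// Sketch of an offline evaluation run with the LangSmith JS SDK's evaluate()
// helper. Dataset name, agent entry point, and the scoring heuristic are
// illustrative assumptions, not monday Service's real setup.
import { evaluate } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";

// Hypothetical agent entry point under test.
async function runSupportAgent(inputs: { ticket: string }): Promise<{ answer: string }> {
  // ...invoke the LangGraph agent here...
  return { answer: `Resolved: ${inputs.ticket}` };
}

// Simple code-based check: does the answer contain the expected resolution?
function correctResolution(run: Run, example?: Example) {
  const got = (run.outputs as { answer?: string } | undefined)?.answer ?? "";
  const want = (example?.outputs as { answer?: string } | undefined)?.answer ?? "";
  return { key: "correct_resolution", score: want !== "" && got.includes(want) ? 1 : 0 };
}

await evaluate(runSupportAgent, {
  data: "support-tickets-smoke",    // assumed dataset name in LangSmith
  evaluators: [correctResolution],
  experimentPrefix: "offline-unit-style",
  maxConcurrency: 10,               // score examples concurrently
});
```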

The speed gains came from parallelizing test execution. By distributing workloads across multiple CPU cores while firing off LLM evaluation calls concurrently, the team eliminated the bottleneck that had forced developers to choose between thorough testing and shipping velocity.
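
A stripped-down version of that two-level pattern might look like the following sketch: shard the ticket set across CPU cores with Node's worker_threads, and let each worker issue its LLM-judge calls concurrently. The worker file and the judgeWithLLM helper are hypothetical, not monday's actual code.

```typescript
// Sketch of the two-level speedup: shard the ticket set across CPU cores with
// worker_threads, and let each worker fire its LLM-judge calls concurrently.
// File names and the judgeWithLLM helper are illustrative, not monday's code.
import { Worker } from "node:worker_threads";
import os from "node:os";

function chunk<T>(items: T[], parts: number): T[][] {
  const out: T[][] = Array.from({ length: parts }, () => [] as T[]);
  items.forEach((item, i) => out[i % parts].push(item));
  return out;
}

// Run one shard of tickets in its own worker thread.
function runShard(tickets: string[]): Promise<number[]> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(new URL("./eval-worker.js", import.meta.url), {
      workerData: { tickets },
    });
    worker.once("message", resolve);
    worker.once("error", reject);
  });
}

const tickets = Array.from({ length: 20 }, (_, i) => `ticket-${i}`);
const shards = chunk(tickets, os.cpus().length).filter((s) => s.length > 0);
const scores = (await Promise.all(shards.map(runShard))).flat();
console.log(`Scored ${scores.length} tickets across ${shards.length} workers`);

// Inside eval-worker.js, each shard scores its tickets with concurrent LLM calls:
//   const scores = await Promise.all(workerData.tickets.map((t) => judgeWithLLM(t)));
//   parentPort.postMessage(scores);
```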

Benchmarks on a MacBook Pro M3 showed sequential testing took 162 seconds for 20 test tickets. Concurrent-only execution dropped that to 39 seconds. Full parallel plus concurrent processing? 18.6 seconds.

Evaluations as Production Code

Perhaps more significant than the speed improvements: monday Service now treats its AI judges like any other production code. Evaluation logic lives in TypeScript files, goes through PR reviews, and deploys via CI/CD pipelines.

A custom CLI command—yarn eval deploy—synchronizes evaluation definitions with LangSmith’s platform automatically. When engineers merge a PR, the system pushes prompt definitions to LangSmith’s registry, reconciles local rules against production, and prunes orphaned evaluations.
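
The article does not publish the CLI's source, but a sync step of that shape could look roughly like the sketch below, assuming the LangSmith JS client's prompt-registry methods. The local definition format, the eval- naming convention, and the pruning rule are illustrative assumptions.

```typescript
// Sketch of an "evaluations as code" sync step, roughly what a command like
// yarn eval deploy could do: push local prompt definitions to LangSmith's
// prompt registry, then prune remote eval prompts with no local counterpart.
// The definition format, eval- naming convention, and pruning rule are
// assumptions, and the listing/deleting calls may differ by SDK version.
import { Client } from "langsmith";

type EvalDefinition = { name: string; prompt: object };

// Local source of truth, e.g. collected from TypeScript files in the repo.
const localEvals: EvalDefinition[] = [
  { name: "eval-support-resolution-judge", prompt: { template: "Score the resolution..." } },
];

const client = new Client(); // reads LANGSMITH_API_KEY from the environment

// 1. Push each local definition to the registry (creates or updates a commit).
for (const def of localEvals) {
  await client.pushPrompt(def.name, { object: def.prompt });
}

// 2. Reconcile: delete remote eval prompts that no longer exist locally.
const localNames = new Set(localEvals.map((d) => d.name));
for await (const remote of client.listPrompts()) {
  const name = remote.repo_handle; // field name assumed from the SDK's prompt schema
  if (name.startsWith("eval-") && !localNames.has(name)) {
    await client.deletePrompt(name);
  }
}
```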

This “evaluations as code” approach lets the team use AI coding assistants like Cursor and Claude Code to refine complex evaluation prompts directly in their IDE. They can also write tests for the judges themselves, verifying accuracy before those judges ever touch production traffic.
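
Testing a judge before it scores production traffic can be as simple as running it against labeled golden cases. The sketch below uses Vitest with a hypothetical resolutionJudge module; the cases and thresholds are assumptions for illustration.

```typescript
// Sketch of testing an LLM judge before it scores production traffic: run it
// against labeled golden cases and assert it agrees with the labels.
// resolutionJudge is a hypothetical judge module; the runner shown is Vitest.
import { describe, it, expect } from "vitest";
import { resolutionJudge } from "../evals/resolutionJudge"; // hypothetical module

describe("resolutionJudge", () => {
  it("passes a correct, complete resolution", async () => {
    const verdict = await resolutionJudge({
      ticket: "VPN keeps disconnecting on the office network",
      agentAnswer: "Reset the VPN profile, reinstalled the client, and confirmed a stable session.",
    });
    expect(verdict.score).toBeGreaterThanOrEqual(0.8);
  });

  it("flags an answer that ignores the ticket", async () => {
    const verdict = await resolutionJudge({
      ticket: "VPN keeps disconnecting on the office network",
      agentAnswer: "Please clear your browser cache and try again.",
    });
    expect(verdict.score).toBeLessThan(0.5);
  });
});
```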

What’s Next

The monday Service team expects this pattern—managing AI evaluations with the same rigor as infrastructure code—to become standard practice as enterprise AI matures. They’re betting the ecosystem will eventually produce standardized tooling similar to Terraform modules for infrastructure.

For teams building production AI agents, the takeaway is clear: slow evaluation loops force uncomfortable tradeoffs between testing depth and development speed. Solving that bottleneck early pays dividends throughout the product lifecycle.
