It's a surprisingly hard question to answer, and it gets even harder when you have both the human consumer and the AI agent to think about.
Accuracy matters. Did Mav say the right thing? Stay on track? So does the way it feels. Did the lead feel heard? Annoyed? Thankful? A conversation can be technically correct and still feel completely off. But how do you even begin to make these determinations?
We've been thinking hard about this for years as we've built Mav. To me personally (a conversation designer), conversation is at the heart of Mav. A big step forward in how we answer these questions is Mav Evals, a feature that instantly scores every single conversation Mav has with a lead across three main dimensions: Understanding, Relevance, and Empathy.
I chatted with Dan Rodriguez, Mav's Head of Engineering, who shared how it works, why those three dimensions matter, what it means for the future of AI conversations in insurance, and where we want to go next. To me, this is just the beginning of what feels possible.
"Evals" is short for evaluations. After every conversation Mav has with a lead, we automatically review how that conversation went. Think of it like a manager listening in on a call and filling out a scorecard — except it happens for every single conversation, instantly, and consistently. We look at things like: did Mav understand what the lead was asking? Were the responses relevant? Did the conversation feel empathetic and human? Each conversation gets scored on those dimensions, so we always have a clear picture of quality.
Before evals, the only way to know if conversations were going well was to manually read through them. That doesn't scale — you can't read thousands of conversations a day. Now we have an automatic quality check on every single interaction. If something starts going sideways, we catch it quickly instead of finding out weeks later when conversion numbers drop.
Generic tools don't understand the insurance world. A basic sentiment analysis might tell you a conversation was "positive" or "negative," but that doesn't really help. A lead could be perfectly polite and still have a terrible experience. We needed something that understands the context of what Mav is actually trying to do — qualify leads, collect the right information, and connect them with insurance agents. By building it in-house, we can evaluate the things that actually matter for your business.
Most tools treat a conversation like a single block of text and give you one score. Real conversations are messy — the lead might change their mind, ask something unexpected, or go off-topic. We evaluate the full back-and-forth: did Mav pick up on what the lead actually needed? Did it stay on track? Did it handle the conversation with the right tone? That nuance gets lost with simpler tools.
One of the biggest uses is evaluating new AI models. When a new model comes out from OpenAI, Anthropic, Google, or another provider, we can run the same conversations through it and compare the eval scores side by side — did understanding go up? Did empathy drop? It takes the guesswork out of those decisions. We also use it as an early warning system. If we see empathy scores start to dip across conversations, that tells us something has changed and we need to dig in before it becomes a bigger problem.
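A side-by-side comparison like that can be as simple as the sketch below: replay the same saved transcripts through each candidate model and average the scores per dimension. The names here (`compare_models`, `scorers`, `saved_transcripts`) are hypothetical placeholders, not part of Mav's tooling.

```python
from statistics import mean

DIMENSIONS = ("understanding", "relevance", "empathy")

def compare_models(transcripts, scorers):
    """Replay the same saved transcripts through each candidate model's
    scorer and average the eval scores per dimension, so models can be
    compared side by side before switching."""
    summary = {}
    for model_name, score in scorers.items():
        results = [score(t) for t in transcripts]
        summary[model_name] = {
            dim: round(mean(r[dim] for r in results), 2) for dim in DIMENSIONS
        }
    return summary

# Hypothetical usage: each scorer replays a transcript with one model and
# returns {"understanding": ..., "relevance": ..., "empathy": ...}.
# report = compare_models(saved_transcripts, {"current": score_current, "candidate": score_candidate})
# If report["candidate"]["empathy"] < report["current"]["empathy"], empathy dropped: dig in.
```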
We measure three things for every conversation: Understanding, Relevance, and Empathy. Understanding is about whether Mav correctly grasped what the lead was saying and what they needed. Relevance checks that the responses actually addressed those needs. And Empathy measures whether the conversation felt warm and human — not robotic. We chose these three because together they capture what makes a conversation actually good from the lead's perspective. You can have a technically accurate conversation that still feels cold, or a friendly one that completely misses the point. We wanted to catch both.
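One common way to produce scores like these is an LLM-as-judge rubric, where the full transcript plus a short definition of each dimension is handed to a model that returns structured scores. Nothing above says that is exactly how Mav implements it; the sketch below is a generic illustration of the pattern, and `ask_judge` is a stand-in for whatever model call you would use.

```python
import json

RUBRIC = """You are reviewing a conversation between an AI assistant and an
insurance lead. Score each dimension from 1 (poor) to 5 (excellent):
- understanding: did the assistant grasp what the lead was saying and needed?
- relevance: did the responses actually address those needs?
- empathy: did the conversation feel warm and human rather than robotic?
Reply with JSON only: {"understanding": n, "relevance": n, "empathy": n}"""

def score_transcript(transcript: str, ask_judge) -> dict:
    """Hand the full transcript plus the rubric to a judge model and parse
    the structured scores it returns. `ask_judge` is whatever function
    sends a prompt to your model of choice and returns its text reply."""
    reply = ask_judge(f"{RUBRIC}\n\nTranscript:\n{transcript}")
    return json.loads(reply)
```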
How did we measure conversation quality before this? Honestly, a lot of manual reading. Someone on the team would pull up conversations and skim through them looking for patterns — are leads dropping off? Are responses making sense? It worked when the volume was small, but it was slow and inconsistent. Two people could read the same conversation and walk away with different opinions about whether it went well.
With most software, you can check if the output is right or wrong — did the math add up, did the page load. Conversations aren't like that. There's rarely a single "right" answer. A good response depends on what the lead said, their tone, where they are in the process, and a dozen other things. Plus, a conversation might go perfectly for 10 messages and then go off the rails in message 11. You have to evaluate the whole thing in context, not just individual responses.
We want to close the loop. Right now evals tell us how conversations are going — the next step is using that information to automatically make them better. Imagine if Mav could learn from the patterns in high-scoring conversations and adjust in real time, approaching conversations in different ways depending on lead behavior. We're also looking at giving our clients more visibility into the eval data directly in the dashboard, so they can see trends across all leads and how those trends have changed over time.
The dream is personalization at scale. If we know certain types of conversations score higher with a certain approach, we can start tailoring how Mav talks to different leads. We could also surface insights for you — like "leads who ask about bundling tend to convert 20% better when Mav responds with X." Ultimately, evals aren't just a report card — they're the foundation for making every conversation smarter than the last.
We built it right into Mav's workflow. When a conversation wraps up, we automatically send the full transcript to be evaluated against our three criteria. The results get attached to that lead's activity feed, so you can see the eval right alongside the conversation itself. It all happens in the background — no extra steps for anyone.
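Mechanically, a background hook like that can look something like the sketch below: when a conversation reaches an outcome, evaluate the full transcript and attach the scorecard to the lead's record. The names here (`on_conversation_complete`, `activity_feed`, `evaluate`) are placeholders for illustration, not Mav's internal APIs.

```python
def on_conversation_complete(lead, conversation_id, transcript, evaluate):
    """Background hook that fires once a conversation reaches an outcome:
    score the full transcript, then attach the result to the lead's
    activity feed so it sits right next to the conversation itself."""
    scores = evaluate(transcript)  # e.g. a judge-style call like the one sketched earlier
    lead.activity_feed.append({
        "type": "conversation_eval",
        "conversation_id": conversation_id,
        "scores": scores,  # understanding / relevance / empathy
    })
    return scores
```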
The technical side was actually pretty straightforward — the harder part was figuring out what to measure. We went back and forth on the criteria a lot. It's easy to measure things that don't actually matter. Getting to Understanding, Relevance, and Empathy as the right dimensions took more thought than writing the code did.
Evals are available to Mav customers today: every conversation receives an eval instantly when an outcome is reached, and the results are visible right in Copilot.
I know we're only scratching the surface of what's possible when it comes to improving our conversations with data, and that's an exciting place to be. As Dan puts it, evals aren't a report card — they're the foundation for making every conversation smarter than the last. Hyper-personalization at scale, real-time adjustments, and insights that help you understand what's actually driving conversions. That's where we're headed, plus some big ideas that we'll save for the next update 👀.
If you're curious about what this can look like for your agency, we'd love to show you. Get in touch below.