Real-World LLM Challenges: A Deep Dive into Failure Modes, Evals, and Production Lessons - Part 1
Today, I'm sharing a goldmine of insights from a recent deep-dive session where a group of AI practitioners, including seasoned pros Hamel Husain and Shreya, tackled some of the nitty-gritty questions that come up when working with Large Language Models (LLMs). We covered everything from bootstrapping synthetic datasets and wrangling ambitious system prompts to defining effective rubrics and evaluating complex tool usage. If you're in the thick of LLM development, these are likely questions you've grappled with too. Let's dive into the Q&A – consider this your cheat sheet!
Note: This blog post has been synthesized using LLMs from my detailed discussion notes from a live training session. The insights and Q&A content are from the LLM Evaluations course by Parlance Labs, where I'm currently learning these methodologies. The course features industry experts Hamel Husain and Shreya sharing practical, battle-tested approaches to LLM evaluation and development.
Bootstrapping Synthetic Data & Managing Ambitious Prompts
An engineer, Ariel, opened with a fundamental challenge: how to get started with synthetic data and how to manage complex prompts.
Q: I'm trying to bootstrap my synthetic dataset. How can I leverage LLMs to effectively map out potential failure modes? It feels like there are so many systematic errors to anticipate. Also, my system prompt feels too ambitious with multiple output threads – how do I know what's too much?
This represents a classic "where to start?" problem. Here's the practical guidance from the discussion:
On Mapping Failure Modes with LLMs for Synthetic Data:
Start with Error Analysis: Before generating large volumes of synthetic data, Hamel emphasized conducting initial error analysis on your existing system or prototype. This builds intuition about where your system will likely fail.
Identify Key Dimensions: Consider "dimensions" – aspects of your input that might trigger failures. Examples include different user personas, specific topics, or communication styles. Hamel introduced "tuples" for specific values along these dimensions (persona = 'novice user', topic = 'complex financial advice').
Focus Your Efforts: Error analysis highlights which dimensions are most critical and generate the most failures. Focus initial synthetic data generation on these high-signal dimensions. You want to identify clear, obvious errors quickly to understand improvement areas.
Iterate: Start Focused, Then Branch Out: Once you understand primary failure modes and build product intuition, you can explore other dimensions more broadly. Don't attempt to solve everything from the beginning.
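To make the dimension/tuple idea concrete, here is a minimal sketch of seeding synthetic queries from sampled tuples. The dimensions, values, and prompt wording are illustrative assumptions, not the course's exact recipe; each generated instruction would then be sent to whatever LLM you use.

```python
import itertools
import random

# Hypothetical dimensions surfaced by error analysis; swap in your own.
DIMENSIONS = {
    "persona": ["novice user", "power user", "skeptical buyer"],
    "topic": ["pricing", "complex financial advice", "account setup"],
    "style": ["terse", "rambling", "multi-question"],
}

def sample_tuples(n: int, seed: int = 0) -> list[dict]:
    """Sample n dimension-value tuples from the full cross product."""
    random.seed(seed)
    combos = list(itertools.product(*DIMENSIONS.values()))
    picks = random.sample(combos, k=min(n, len(combos)))
    return [dict(zip(DIMENSIONS.keys(), combo)) for combo in picks]

def to_generation_prompt(t: dict) -> str:
    """Turn one tuple into an instruction for an LLM to write a realistic user query."""
    return (
        f"Write one realistic user message to a support assistant. "
        f"The user is a {t['persona']}, asking about {t['topic']}, "
        f"in a {t['style']} style. Return only the message."
    )

for t in sample_tuples(5):
    print(to_generation_prompt(t))  # feed each of these to your LLM of choice
```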
On Ambitious System Prompts & Multiple Outputs:
Don't Prematurely Optimize the Prompt Itself: Shreya advised against worrying excessively about system prompt length initially. Instead, focus on the outputs.
Let Error Analysis Guide You: If your prompt requests multiple distinct pieces of information or "output streams" (Ariel mentioned five aspects), error analysis reveals which streams the LLM struggles to generate reliably or accurately.
Analyze and Measure First: Before splitting prompts or re-architecting, understand the failure modes. How often do specific outputs fail? Where are the weaknesses?
Consider Splitting or Chaining as a Solution: If an LLM struggles with generating a comprehensive report with 5 distinct sections in one go, you might consider splitting it. This could mean multiple LLM calls (one for sections 1-2, another for 3-5) or chaining prompts. However, this is a solution after diagnosis.
Tuples for Modular Prompts: Hamel clarified that "tuples" (dimension-value pairs) are extremely useful for making prompts modular through string templating. Instead of a monolithic prompt trying to cover every possibility, you can dynamically insert relevant phrases or entire sections based on tuple values. This applies more to user prompts or content you're feeding in to guide generation, ensuring your system prompt doesn't become unmanageable.
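As a rough illustration of that string templating, the sketch below splices optional guidance snippets into the prompt based on tuple values. The snippet library and template are hypothetical placeholders; the point is that tuple-driven content gets inserted dynamically instead of bloating a monolithic system prompt.

```python
# Hypothetical snippet library keyed by (dimension, value) tuples; only the
# sections a given request needs are spliced into the prompt.
SNIPPETS = {
    ("persona", "novice user"): "Avoid jargon and define any technical terms you use.",
    ("persona", "power user"): "Be concise; skip basic explanations.",
    ("topic", "complex financial advice"): "Remind the user this is not professional financial advice.",
}

TEMPLATE = """{guidance}

User request:
{user_message}"""

def build_prompt(user_message: str, tuple_values: dict) -> str:
    guidance = "\n".join(
        SNIPPETS[(dim, val)]
        for dim, val in tuple_values.items()
        if (dim, val) in SNIPPETS
    )
    return TEMPLATE.format(guidance=guidance or "Answer helpfully.", user_message=user_message)

print(build_prompt(
    "How do index funds work?",
    {"persona": "novice user", "topic": "complex financial advice"},
))
```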
Defining and Using Rubrics for Evaluation
Tomás brought up essential questions about "rubrics," a term central to consistent evaluation.
Q: What exactly is a "rubric" in the context of LLM evaluation? Is it just a strict pass/fail guide for a defined failure mode? And what happens if we update a rubric but our old understanding or data is still based on the previous version?
Understanding Rubrics:
Detailed Definition of Failure (or Success): Shreya explained that a rubric is a very detailed definition of a specific failure mode (or, conversely, a success criterion).
Goal: Consistency: The primary purpose of a well-defined rubric is ensuring that anyone applying it to an LLM's output (a "trace") will arrive at the same conclusion (pass/fail, or a specific grade). This consistency is essential.
LLM-as-Judge: Hamel highlighted that a good rubric effectively becomes the prompt for an "LLM judge." You're essentially codifying your evaluation criteria so precisely that another LLM can use it to assess outputs.
Improving Rubrics: You improve a rubric by making it clearer and more robust. This involves:
- Clarifying language
- Adding concrete examples of "pass" cases and "fail" cases
- Eliminating ambiguity so different evaluators (human or LLM) are aligned
Handling Rubric Updates: If a rubric changes, it's essential that this new understanding is propagated. If your development set or evaluations are based on an outdated rubric, your measurements will be inconsistent and misleading. The goal is always for everyone (and every process) to use the most current, refined rubric. Course materials will be updated to include a formal definition.
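Because a good rubric can double as the judge prompt, here is a minimal sketch of that pattern. The rubric text is an invented example, and `call_llm` is a hypothetical stand-in for whatever model client you actually use.

```python
# An invented rubric written concretely enough to double as an LLM-judge prompt.
RUBRIC = """Failure mode: unsupported claims.
A response FAILS if it states a specific fact (number, date, policy) that is not
present in the provided context.
A response PASSES if every specific fact is traceable to the context, or if it
explicitly says the information is unavailable.

Example PASS: "The document doesn't list a refund window, so I can't confirm one."
Example FAIL: "Refunds are available within 30 days." (context never mentions 30 days)
"""

JUDGE_TEMPLATE = """{rubric}

Context:
{context}

Response to grade:
{response}

Answer with exactly one word: PASS or FAIL."""

def judge(context: str, response: str, call_llm) -> bool:
    """Return True if the LLM judge grades the response as PASS."""
    verdict = call_llm(JUDGE_TEMPLATE.format(rubric=RUBRIC, context=context, response=response))
    return verdict.strip().upper().startswith("PASS")
```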
Evaluating Tool Use and the Model Context Protocol (MCP)
Andrei explored the complexities of evaluating LLMs that use tools, especially tools exposed via the Model Context Protocol (MCP).
Q: How should we approach evaluations for tool use, particularly with MCP-based tools? It seems like a fractal problem – we might tweak the tool's description in the MCP, the agent's prompt, or the tool itself. What does a test harness for this look like?
Strategies for Evaluating Tool Usage:
Driven by Real Failures: Hamel stressed that any evaluation should be justified by an observed failure. If your error analysis shows the LLM is calling the wrong tool or failing to call a tool when it should, then you design an evaluation for it.
Reference-Based Evaluations for Tool Calling: Tool calling often has a deterministic expected outcome: for a given query, you expect a specific tool (or set of tools) to be invoked. This makes it a good candidate for reference-based evaluations:
- Test Harness: Create a dataset of input queries and the expected tool calls
- Assertion: Your test asserts whether the actual tool call matches the expected one
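A minimal sketch of such a reference-based harness, written as a pytest test: the `my_agent.run_agent` import is hypothetical (assume it returns the tool name and arguments the agent chose for a query), and the cases are invented.

```python
import pytest

from my_agent import run_agent  # hypothetical: returns (tool_name, tool_args) for a query

# Each case pairs an input query with the tool call we expect the agent to make.
CASES = [
    ("What's Maria Chen's phone number?", "contact_lookup", {"name": "Maria Chen"}),
    ("Schedule a call with Maria tomorrow at 3pm", "create_event", {"attendee": "Maria"}),
]

@pytest.mark.parametrize("query,expected_tool,expected_args", CASES)
def test_tool_selection(query, expected_tool, expected_args):
    tool_name, tool_args = run_agent(query)
    assert tool_name == expected_tool
    # Check only the arguments we care about; extra arguments are allowed.
    for key, value in expected_args.items():
        assert tool_args.get(key) == value
```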
When and What to Tweak (MCPs, Prompts):
Low-Hanging Fruit First: If you spot an obvious fix (poorly worded tool description in the MCP, vague system prompt guiding tool use), try fixing it directly. You might not always need a comprehensive evaluation for every minor tweak, Hamel suggested. Evaluations have cognitive overhead.
Iterate with a Playground: For trickier issues, before formal evaluations, use a prompt playground:
- Create a small dataset of queries designed to trigger the specific tool-use behavior you're debugging (various ways to ask for a "contact lookup")
- Use an LLM to help generate diverse and challenging "perturbations" of these queries
- Rapidly iterate on your prompts/tool descriptions with this small dataset
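A small sketch of generating those perturbations with an LLM: the seed queries and prompt wording are illustrative, and `call_llm` is a hypothetical stand-in for your model client.

```python
SEEDS = [
    "Can you look up the contact details for Maria Chen?",
    "I need Maria's email",
]

PERTURB_PROMPT = """Rewrite the user query below in {n} different ways.
Vary tone and verbosity, and add realistic typos or missing details.
Return one rewrite per line, nothing else.

Query: {query}"""

def perturb(query: str, call_llm, n: int = 5) -> list[str]:
    """Ask the LLM for n perturbed variants of a seed query."""
    raw = call_llm(PERTURB_PROMPT.format(n=n, query=query))
    return [line.strip() for line in raw.splitlines() if line.strip()]

# playground_set = [q for seed in SEEDS for q in perturb(seed, call_llm)]
```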
Formal Evaluations for Persistent/Critical Issues: If the problem is persistent, critical, or requires careful iteration, then a formal evaluation is warranted.
Shreya's Structured Approach:
- "Tool" as a Dimension: In your error analysis, make "tool used" (or "intended tool") a dimension
- Synthetic Data for Coverage: Generate synthetic queries that specifically aim to trigger all your available tools. This helps identify which tools the agent struggles with
- Automated Evaluator: Once you pinpoint problematic tools, design an automated evaluator for that specific tool-use scenario
- Experiment and Measure: With the evaluator in place, you can experiment: Does improving the tool description in the MCP work better, or is tweaking the agent's main prompt more effective? Measure the impact on your evaluation
Skylar Payne's Caution (via Hamel): A key point was flagged: many teams think a problem is obvious and skip evaluations, only to find their "obvious" solution didn't work. Calibrate your judgment carefully before deciding an evaluation isn't needed.
Unit vs. Integration Thinking (Andrei & Hamel):
- Andrei likened small, focused tool tests to "unit tests"
- These can build up to broader "integration tests" that evaluate a whole flow, including the agent and MCP, for business value
- Hamel clarified that even in complex, multi-turn conversations, you can still scope tests to see if the correct user intent (needs contact details) triggers the correct tool call
Handling Ambiguous User Prompts
Hong raised a practical issue: what if the problem isn't the LLM's processing, but the user's unclear query?
Q: During error analysis, what if we find that errors are caused by ambiguous end-user prompts, not necessarily a flaw in our system prompt or LLM logic? How do we handle this?
Addressing Ambiguity at the Product Level:
Error Analysis Beyond the AI: Hamel provided excellent guidance: error analysis shouldn't just look at the AI components. It should identify anything wrong with the product experience.
Ambiguity as a Product Problem: If "ambiguous user question" is a major category in your failure analysis, this signals a product issue.
Solution: Product Evolution: The fix often lies in changing the product. Your AI system should be designed to handle ambiguity gracefully, perhaps by:
- Asking clarifying follow-up questions. Just like a human would, the AI can prompt the user for more detail if their initial request is unclear. (Hamel mentioned DeepResearch as an example of a system that does this)
Bias in Labeling and Persona-Specific Rubrics
Shrirang brought up concerns about bias in data labeling and tailoring experiences for different user personas.
Q: You mentioned gathering labels and notes first to avoid bias. Could you elaborate? Also, we're building a customer service chatbot for different personas (technical buyer, initiative owner, VP). Should we use different rubrics for each?
On Labeling Bias and Persona-Driven Evaluation:
Human-First Labeling: Hamel's core point about bias was: humans should do the initial labeling, especially when defining failure modes or success criteria. Don't just throw your data at an LLM and ask it to categorize everything from scratch at the very beginning.
- Why? You need to instill your nuanced understanding and criteria into the process
- LLMs as Assistants (Later): An LLM can assist with labeling later, once it's well-aligned with your definitions and the UX of the labeling tool is designed to prevent the LLM's suggestions from overly biasing the human annotator
Persona-Specific Rubrics & Prompts: Absolutely, yes! This is a common and effective practice.
- Different Needs, Different Evaluations: If different user personas have distinct needs, success criteria, and expectations (technical buyer vs. VP), then it makes sense to:
- Potentially route them to different, tailored system prompts
- Evaluate the LLM's responses for each persona using different, persona-specific rubrics and evaluations
- LLM for Persona Classification: An LLM can often be used to classify the incoming user query or interaction to determine the likely persona, which then guides the subsequent prompting and evaluation strategy
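A rough sketch of that routing idea: classify the persona with an LLM, then select a persona-specific system prompt (and, by extension, rubric). The personas, prompts, and `call_llm` helper are all assumptions for illustration.

```python
PERSONAS = ["technical buyer", "initiative owner", "vp"]

CLASSIFY_PROMPT = """Classify the author of this message as one of: {personas}.
Answer with the label only.

Message: {message}"""

# Illustrative persona-specific system prompts; real ones would be far richer.
SYSTEM_PROMPTS = {
    "technical buyer": "Answer with implementation details, APIs, and limitations.",
    "initiative owner": "Focus on rollout plans, timelines, and team impact.",
    "vp": "Lead with business outcomes and cost; keep it brief.",
}

def route(message: str, call_llm) -> tuple[str, str]:
    """Return (persona, system_prompt) for an incoming message."""
    label = call_llm(
        CLASSIFY_PROMPT.format(personas=", ".join(PERSONAS), message=message)
    ).strip().lower()
    persona = label if label in SYSTEM_PROMPTS else "technical buyer"  # crude fallback
    return persona, SYSTEM_PROMPTS[persona]
```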
Crafting the First Rubric & The Role of Domain Expertise
Pastor Soto asked about the challenging first step: defining that initial rubric, especially when you're not the target user.
Q: How do we define a "good, helpful response" in our first rubric iteration, especially if we're not the domain expert (building an education bot but not being a student)? How do we avoid making it overwhelming or underwhelming?
Creating Your Initial Rubric Effectively:
Start with Examples (Shreya): For a first pass, examples are often more powerful than abstract definitions.
- Have a true domain expert (or the CEO, product owner, etc.) review a set of actual interaction traces
- Ask them to identify a few stellar examples of "good" responses and clear examples of "bad" responses
- Your initial rubric can then be:
- A rough, high-level description of success and failure
- These concrete examples
- This approach is less biased because you're not immediately making prescriptive claims like "it must include X" or "it must avoid Y"
- Refine this initial rubric as more people use it to label more ambiguous or challenging examples
The Essential Domain Expert (Hamel): Hamel strongly emphasized a point that's fundamental but sometimes overlooked:
- Are you the right person to judge? If you're building an educational tool for medical students, but you're not a medical student or medical expert, how can you accurately judge the quality or helpfulness of the content?
- Involve the Experts: The domain experts must be involved in defining and applying the rubrics. They are the arbiters of what "good" looks like in their specific field
- Upstream Failure: Not having the right people judge quality is a fundamental failure point that occurs even before you write the first line of a rubric
- Hamel referenced a guest lecture by Isaac from Anki Hub (an education startup for medical students), where medical students themselves are the ones annotating and validating content, precisely because they have the requisite domain knowledge
Evaluations for Search and Retrieval-Augmented Generation (RAG)
Rodo Yabut specifically asked about evaluations for search-centric applications, like knowledge graphs.
Q: I'm building a knowledge graph that uses many prompts for search. The "recipe bot" example for evaluations doesn't seem to fit. How do evaluations differ for search-heavy systems?
Evaluating Search and Retrieval Systems:
Core Principles Apply, but with Specialization: Many fundamental evaluation principles remain the same. However, search and retrieval have a rich history and specialized metrics.
Upcoming Course Content: Hamel and Shreya noted that the course will delve deeper into architecture-specific evaluations, including multiple lessons dedicated to retrieval metrics (coming up soon, some as early as next week).
Focus on Retrieval Quality: For any system using retrieval (like RAG):
- You need to evaluate whether the right documents/context are being retrieved in response to a query. Metrics like precision, recall, MRR (Mean Reciprocal Rank), and NDCG (Normalized Discounted Cumulative Gain) are common in search
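For reference, here are minimal implementations of a few of those retrieval metrics (reciprocal rank, recall@k, and binary-relevance NDCG@k), assuming you have the retrieved document IDs and a set of known-relevant IDs per query.

```python
import math

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / max(len(relevant), 1)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Binary relevance: each relevant doc contributes 1 / log2(rank + 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Example: the query's relevant doc is "d7"; the retriever returned d3, d7, d1.
print(reciprocal_rank(["d3", "d7", "d1"], {"d7"}))  # 0.5
print(ndcg_at_k(["d3", "d7", "d1"], {"d7"}, k=3))   # ~0.63
```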
Shreya's "Isolate and Conquer" Strategy for RAG:
- If Search is the Product: Evaluate it thoroughly as a search product using established search metrics
- If Search is a Tool for a Broader AI Product (like an agent using RAG):
- Evaluate Retrieval Separately: First, assess the performance of your retrieval component. Is it fetching relevant and accurate information?
- Evaluate Generation Separately: Assuming your retrieval is working well (or after you've improved it), then evaluate the LLM's generation based on the retrieved context and any subsequent agent steps
- Essentially, you're evaluating two distinct systems that both need to perform well for the overall application to succeed
Evaluations for Broadly Scoped Applications
Qasim raised a thought-provoking question about evaluating applications that are broadly scoped by design.
Q: The evaluation examples given (like a nurture bot or recipe bot) are narrowly scoped. What about products with a much broader scope, like a sales platform with bots that can sell many different things? The exact context will vary widely. Do we need evaluations for every single customer or product variation? That sounds very cumbersome.
Scoping Evaluations for Broad Applications:
Error Analysis is Still Your Compass (Shreya): Even for broad applications, error analysis is the key to defining manageable and meaningful evaluations.
- Analyze your production data (or diverse test data) to find common patterns of failure
- Try to cluster these failures into a manageable number of categories (5-6 major failure themes); these clusters don't need to be of equal size (see the clustering sketch after this list)
- These data-driven clusters become the basis for your (more narrowly defined) evaluations. They emerge from the breadth of your application, rather than you trying to predefine every possible scenario
- If these initial clusters still feel too broad, you might need to subdivide them further
- The idea is not to create evaluations based on arbitrary segments like "per customer" unless error analysis specifically points to customer-specific failure patterns
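One cheap way to take a first pass at those clusters is to vectorize your free-text error notes and cluster them, then name the themes by hand (or with an LLM's help). Below is a minimal sketch using TF-IDF and k-means from scikit-learn with invented notes; in practice you'd run this over far more notes and treat the clusters only as a starting point for manual review.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented open-coding notes from error analysis of traces.
notes = [
    "bot invented a discount that doesn't exist",
    "ignored the user's budget constraint",
    "called the wrong product-lookup tool",
    "quoted a price not in the catalog",
    "tool call missing required customer id",
    "response far too long for a simple question",
]

vectors = TfidfVectorizer().fit_transform(notes)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for cluster in sorted(set(labels)):
    print(f"Theme {cluster}:")
    for note, label in zip(notes, labels):
        if label == cluster:
            print("  -", note)
```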
The "Foundation Model" Threshold (Hamel): Hamel offered essential perspective:
- How Broad is "Broad"? If your application is so broadly scoped that it starts to resemble a general-purpose foundation model ("ChatGPT for all of sales," where you genuinely don't know what users will do), then you might be in a different territory
- Questioning the Premise: At that point, Hamel suggested it's worth a deeper conversation: "Why are you building something that open-ended as an application?"
- Foundation Model Evaluations: Evaluating true foundation models is a different, more complex discipline than evaluating specific applications, even broad ones
- Most applications, even if they cover a wide domain like "sales," will still have identifiable common interaction patterns, goals, and potential failure points that error analysis can uncover and that targeted evaluations can address
Ensuring Generalization for Broad Applications
This follow-up continued Qasim's challenge with a sales bot that needs to work across various domains (like restaurants, gyms, and now potentially shoes).
Q: (Qasim's follow-up) My sales bot, which includes tool calls, works well for restaurants and gyms. How do I ensure it will also perform well for a new domain like "selling shoes" without exhaustively creating specific evaluations for every new customer or product type? What does "simulation" mean in this context – an LLM acting as a user?
This is a fundamental question about generalization and the practicalities of scaling.
On Generalization and High-Fidelity Simulation:
Error Analysis for Generalization Gaps (Hamel): Your best friend, error analysis, will give you strong intuition on where your product is most likely to fail when encountering new domains. While there's no absolute guarantee it won't fail, the goal is to maximize the chance it will generalize.
- If error analysis reveals that new domains consistently throw it off ("selling shoes is vastly different from selling cars, and my bot can't cope"), then you know you have a generalization problem
The Need for "Good Simulation" (Hamel):
- Simply testing a user query in isolation might not be enough. You often need to simulate the entire application context. This means building a more complex test harness
- What does simulation entail? It could indeed involve having an LLM act as a user, engaging in multi-turn conversations. This is especially relevant for complex, interactive systems
- More broadly, "simulation" means deeply considering all the contextual differences between domains. Selling cars versus shoes involves different customer information, metadata, types of data being retrieved, and typical user journeys. Your simulation needs to reflect this
- A simple "academic exercise" of just swapping out a "domain" dimension ("shoes" vs. "cars") in a synthetic data generator might not be high-fidelity enough if it doesn't capture these deeper contextual differences. You need to try and replicate what actually happens when a real customer uses your application in that new domain
Proactive Synthetic Generation (Shreya):
- Don't wait for a new customer in the "shoes" domain to appear. Proactively include "shoes" (and perhaps 10 other diverse, hypothetical domains) as a dimension in your synthetic query generation process
- Generate traces across these varied domains, conduct error analysis on these synthetic traces, and specifically check failure rates for critical components like tool calling. This can give you an early warning system
(Hamel noted that deeply nuanced simulation strategies for very specific, complex scenarios might be best discussed in a more interactive setting, like a Discord channel, to flesh out all the details.)
Isolating and Fixing Complex Retrieval Issues
Wayde Gilliam brought up a common pain point: what to do when retrieval, a central component, is the main source of errors.
Q: If I have a system with a complex retrieval aspect, and my error analysis shows that many errors are due to the system not pulling in the right context, should I try to make changes and experiment holistically? Or is it better to treat retrieval as a "mini-project," get that working well first, and then look at the whole system?
Divide and Conquer for Retrieval Problems (Hamel):
Mini-Project for Retrieval: Absolutely! If you've identified that retrieval is specifically where things break down, treat it as a focused mini-project. Go off, dedicate effort to fixing retrieval, and optimize it in isolation.
Focus and Fix: Don't try to tweak everything else while retrieval is fundamentally broken. You know it needs fixing, so fix it.
Synthetic Data for Retrieval: There are specialized techniques for generating synthetic data specifically for training and evaluating retrieval systems. For instance, you can "invert" your dataset: take your existing documents and use an LLM to generate relevant questions for those documents. This creates a question-document pair dataset ideal for retrieval testing.
Once retrieval is working reliably, then you can integrate it back and evaluate the holistic system.
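Here is a minimal sketch of the dataset "inversion" Hamel described: generating questions from existing documents to build question-document pairs for retrieval testing. The prompt wording, `call_llm` helper, and corpus structure are illustrative assumptions.

```python
INVERT_PROMPT = """Read the document below and write {n} distinct questions that a
user could realistically ask and that this document answers.
Return one question per line.

Document:
{document}"""

def questions_for_doc(doc_id: str, text: str, call_llm, n: int = 3) -> list[tuple[str, str]]:
    """Return (question, expected_doc_id) pairs generated from one document."""
    raw = call_llm(INVERT_PROMPT.format(n=n, document=text[:4000]))
    return [(line.strip(), doc_id) for line in raw.splitlines() if line.strip()]

# eval_set = [pair for doc_id, text in corpus.items()
#             for pair in questions_for_doc(doc_id, text, call_llm)]
# Score retrieval against eval_set with metrics like MRR or NDCG@k,
# treating doc_id as the relevant document for its generated questions.
```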
Evaluating Complex Chains, Agents, and Retrieval at Scale
Akshobhya Dahal raised some excellent, multi-part questions about evaluating more sophisticated LLM architectures involving chains of calls, agentic patterns, and large-scale retrieval.
Q: Part 1 - For complex systems like chains (LangGraph) or agents using multiple tools and system prompts, how do you approach evaluation? Is it about doing error analysis on each "node" or component as a mini-project?
Evaluating Nodes in Chains and Agentic Systems:
Tool-Level and End-to-End Evaluations (Shreya):
- It's beneficial to design evaluations for every tool or distinct component
- Even if you conduct an end-to-end error analysis (using synthetically generated queries that require a sequence of tools), when failures occur, you'll inevitably need to drill down to the specific tool or component that failed
- So, in practice, you'll likely end up doing both:
- Tool-specific evaluations: Especially for tools that are common points of failure
- Evaluations for combinations of tools: Particularly if certain sequences or interactions between tools are problematic
- The exact order might not be critical, but both levels of analysis are valuable
Isolate and Test Nodes (Shreya & Hamel):
- Yes, for workflows with multiple LLM calls or state transitions (like in LangGraph), the approach is to divide the entire system into these nodes or components and evaluate them, if not individually, then at least in manageable sub-sections
- Hamel: The goal is to trigger a specific failure mode in the simplest, most isolated way possible. If a transition between agent states is failing, try to reduce that scenario to a minimal, reproducible test case. (He also mentioned that later in the course, they'd touch on analytical tools for agents, like transition matrices, that can help visualize and debug these flows.)
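Transition matrices are a later course topic, but as a rough illustration of the idea, the sketch below simply counts state-to-state transitions across logged agent traces so loops or unexpected jumps stand out. The trace format and state names are invented.

```python
from collections import Counter

# Invented agent traces: each is the ordered list of states the agent passed through.
traces = [
    ["gather_intent", "lookup_contact", "draft_reply", "done"],
    ["gather_intent", "draft_reply", "done"],
    ["gather_intent", "lookup_contact", "lookup_contact", "error"],
]

transitions = Counter(
    (src, dst)
    for trace in traces
    for src, dst in zip(trace, trace[1:])
)

for (src, dst), count in transitions.most_common():
    print(f"{src} -> {dst}: {count}")
```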
Q: Part 2 - I'm working with retrieval over a large number of documents (20,000 PDFs) and seeing it fail miserably, especially as we approach around 1,000 documents in an index. Any advice on evaluation metrics or strategies for this scale? (My current approach is to subdivide by domain, create smaller indexes, and use an LLM to route).
Strategies and Evaluations for Retrieval at Scale:
Isolate Retrieval First (Hamel): As with Wayde's question, if retrieval is the bottleneck, isolate it and evaluate it as a dedicated search/retrieval system. This topic will get significant coverage in the course, as retrieval is often an Achilles' heel. Use specific retrieval metrics to measure and optimize this component until it's satisfactory.
Shreya's Tips for Improving Large-Scale Retrieval:
- Partitioning/Clustering: Your approach of creating smaller, domain-specific indexes is a good one
- Document Augmentation: Enhance your documents with rich metadata. This could include:
- Generating summaries and embedding those summaries (can help with semantic similarity matching)
- Extracting and indexing tables of contents
- The general idea is to add as much intelligent metadata as possible during an upfront indexing step (see the sketch after this list)
- Agentic Retrieval: Make your retrieval process more dynamic. An LLM agent could:
- Review initial retrieved results
- Make decisions on whether to refine the query, retrieve more results (paginate), or search in a different way or different index
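A minimal sketch of that upfront augmentation step, as referenced above: attach a generated summary, a crude heading extraction, and a summary embedding to each document at index time. `call_llm` and `embed` are hypothetical stand-ins for your model client and embedding function.

```python
def augment_document(doc_id: str, text: str, call_llm, embed) -> dict:
    """Build an enriched index record for one document."""
    summary = call_llm(f"Summarize this document in 3 sentences:\n\n{text[:6000]}")
    # Crude heading extraction; assumes markdown-style documents.
    headings = [line.strip() for line in text.splitlines() if line.startswith("#")]
    return {
        "id": doc_id,
        "text": text,
        "summary": summary,
        "headings": headings,
        "summary_embedding": embed(summary),  # match queries against this as well as the raw text
    }
```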
Error Analysis on Retrieval Agents (Akshobhya & Shreya):
- Akshobhya: So, should I do error analysis on these specialized retrieval agents and align them to get specific results from particular indexes, ensuring they know what to expect from each domain-specific index?
- Shreya: Yes, precisely. And when evaluating, use specific retrieval metrics (like NDCG, which will be covered next week) to assess these components, rather than just a holistic pass/fail of the entire system.
Evaluations for Non-Text or Structured Outputs & Aligning with A/B Tests
Zee brought up interesting scenarios: evaluating systems where the final output isn't free-form text and how to connect offline evaluations with online A/B test results.
Q: Part 1 - How do evaluation practices differ when the product's output is more structured, like a list of recommended titles (as in a conversational search), rather than raw text? Traditional ML metrics might apply to the output, but the input is still high-dimensional natural language.
Evaluating Structured Outputs from LLM Systems:
Search Metrics for Search Systems (Shreya): If your conversational system outputs a list of titles as search results, evaluate it like a search system. Parse the output (the list of titles) and compute standard search/ranking metrics (NDCG@10, Precision@K). The generative AI aspect (understanding the conversational query) can be evaluated separately or as part of the input processing leading to the search.
Spectrum of Evaluations (Hamel):
- If the task is very narrow and maps well to traditional ML (a classification task), use established ML metrics
- If you're extracting specific, structured information (like book titles), aim for reference-based metrics if you have ground truth data
- The key is to be use-case specific. Always prefer the simplest, most deterministic evaluations possible, ideally those that don't require another LLM call if a simpler check suffices
Error Analysis for Input Parsing (Zee & Hamel):
- Zee: The LLM's role is parsing the complex, expressive natural language input, even if the output is structured. Error analysis on this input parsing stage still feels essential.
- Hamel: Emphatically agreed. "Error analysis is never not useful. It's probably like the most useful thing out of everything we're teaching, honestly."
Q: Part 2 - How do you incorporate A/B test results to assess the predictiveness of your offline evaluations, especially for LLM judges or ambiguous criteria? Do you correlate LLM judge scores with actual A/B test outcomes?
Connecting Offline Evaluations to Online Performance (A/B Tests):
Essential Alignment (Hamel): Yes, absolutely. While A/B testing is a vast topic in itself (involving experiment design, sample sizes, etc., and not covered in detail in this specific LLM course), it's vital to connect your offline evaluation metrics with real-world user signals and business outcomes.
Validate Your Evaluations: You want to check if your LLM-as-a-judge scores (and other offline evaluations) correlate with what you observe in A/B tests (conversion rates, user engagement, task success).
Downstream Signals: The goal is to ensure your offline metrics are good proxies for the ultimate downstream signals and business metrics you care about. If your offline evaluation says "this is great" but it tanks in an A/B test, your offline evaluation isn't predictive and needs rethinking.
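As a rough illustration of that check, you can correlate per-variant offline judge scores with the corresponding online outcomes, for example with a Spearman rank correlation. The numbers below are invented; the point is the shape of the check, not the values.

```python
from scipy.stats import spearmanr

# One entry per prompt variant / experiment arm (invented numbers):
# offline pass rate from the LLM judge vs. online task-success rate from the A/B test.
offline_pass_rate = [0.62, 0.71, 0.74, 0.80, 0.83]
online_success_rate = [0.18, 0.21, 0.20, 0.26, 0.27]

rho, p_value = spearmanr(offline_pass_rate, online_success_rate)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}")
# A weak or negative correlation suggests the offline judge isn't predictive
# and needs re-alignment before you trust it for decisions.
```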
Lean Startup vs. Pre-Launch Optimization: Finding the Balance
Aditya Kabra asked a classic product development question in the LLM context.
Q: Following a Lean Startup approach, we want to quickly build an MVP and get it to users. To what extent should we optimize our initial prompt (using synthetic data, evaluations, error analysis) before this first user validation step?
Balancing Speed with Quality (Hamel):
Ship if Acceptable: If it's genuinely acceptable within your specific use case and for your target users to release an early version without extensive pre-optimization, then by all means, do it.
Set Expectations: Crucially, set clear expectations with early users that they are interacting with a prototype or beta. This can mitigate negative experiences if the system isn't yet polished.
Real Data is Gold: Getting real user interaction data (traces) early on is incredibly valuable. You can then use this real-world data as the foundation for your error analysis, which in turn guides more targeted synthetic data generation and prompt refinement.
Judgment Call: Ultimately, it's a judgment call based on your product, market, risks, and resources. There's no universal "right" amount of pre-launch optimization. Some products can tolerate rougher edges initially; others can't.
Bootstrapping with Real Data: If you can get real user data safely and effectively, it's often preferred over purely synthetic bootstrapping.
Tactical Evaluation Questions: Proactiveness and Cadence of Alignment
Sebi Lozano had some sharp tactical questions on measuring bot behavior and the frequency of auto-evaluator maintenance.
Q: Part 1 - For measuring something like "proactiveness" in a Nurture Bot or a sales bot (it's not offering next steps or follow-ups when it should), would you evaluate this at the turn level or session level? And how do you avoid it being too proactive?
Evaluating "Proactiveness" and Bot Initiative:
Frame Beyond "Proactiveness" (Hamel): Hamel suggested reframing from just "proactiveness." For the Nurture Bot example discussed in class, the underlying issue might be: "Is the user's intent clear?" or "Has the bot gathered sufficient information to proceed effectively?" These could be evaluation criteria.
Avoid Rigid Rules: You probably don't want hard rules like "the bot must offer a follow-up N times." The goal is often dynamic, like effectively discovering and satisfying user intent.
Process for Addressing Lack of Initiative:
- Error Analysis: If analysis shows the bot fails to gather intent or guide the user, this is a failure mode
- Product/Prompt Fix First: Before complex evaluations, improve the product. Enable the AI to ask clarifying follow-up questions. Update the system prompt to specify what information is essential to gather or how to guide the user towards a clear objective
- Re-analyze: After these fixes, conduct more error analysis. Is there still a persistent problem with the bot not guiding the conversation effectively? If so, then iterate with more targeted evaluations
Session Level for Goal Accomplishment (Hamel, for sales bot): When the goal is something like "convert a user" in a sales context, evaluating at the session level is usually more appropriate. This is because the overall goal (a sale) can be accomplished in many different ways across multiple turns. Turn-level evaluations are better for more granular, tactical aspects ("did the bot correctly parse this specific piece of information?").
Q: Part 2 - The course materials recommend rerunning the alignment of auto-evaluators (like an LLM judge) every week. What's the reasoning behind "weekly" versus other cadences?
The "Magic Number" Principle for Cadence (Hamel):
Action-Oriented Numbers: Numbers like "100 traces" or "weekly" serve as practical, action-oriented guidelines. Hamel explained that decades of ML experience have shown that vague instructions ("look at enough data") often lead to inaction. Specific numbers get people started.
Starting Point, Not Dogma: "Weekly" is a good starting point to build a habit and ensure your auto-evaluators don't drift too far from your desired behavior.
Adjust Based on Stability (Judgment Call): If you consistently rerun alignment and find that your auto-evaluator is very stable (your product isn't changing rapidly, the LLM judge's performance isn't degrading), you can use your judgment to relax the frequency (bi-weekly or monthly). It's a dynamic process.
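One concrete way to run that periodic alignment check is to sample recent traces, label them by hand, and measure agreement with the LLM judge, for example with Cohen's kappa. A minimal sketch with invented labels:

```python
from sklearn.metrics import cohen_kappa_score

# Invented labels for the same sampled traces: one set from a human, one from the judge.
human_labels = ["pass", "fail", "pass", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "fail", "pass", "fail", "fail", "fail", "pass", "pass"]

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Judge/human agreement (Cohen's kappa): {kappa:.2f}")
# If agreement drops below your threshold, revisit the rubric and judge prompt
# before trusting that week's automated evaluation numbers.
```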
The Solo Developer as Domain Expert (For Now)
Peter Cardwell, a solo developer, asked about navigating the domain expert role temporarily.
Q: As a solo developer, I'm currently wearing the "domain expert" hat while building the initial product and the evaluation machinery. I plan to onboard a real domain expert later. What should I anticipate when they come in and potentially want to change rubrics, error codes, etc., midstream? How can I set myself up for success?
Navigating the Interim Domain Expert Role (Hamel):
Do Your Best with What You Have: This is the core principle. If you're the closest thing to a domain expert available right now, then you make the calls.
The "Best Available" Expert: Hamel used the Anki Hub (medical flashcards) example. While medical students annotating cards might not be seasoned professors, they are the most relevant and best available experts for the specific problem of creating useful flashcards for students. The goal is always to get the best domain expert you practically can for your specific situation.
Learn and Apply the Process: Even if you're not the ultimate domain expert, still go through all the recommended processes: look at your data, analyze traces, try to define rubrics based on your current best understanding, and build out your evaluation infrastructure. Your understanding of the domain will grow.
Be Prepared for Change: When a true domain expert comes on board, expect them to bring new insights that may lead to changes in your rubrics, error definitions, and priorities. This is a good thing – it means your product is getting more refined.
Foundation for the Expert: The work you've done (your initial rubrics, data analysis, evaluation setup) provides a valuable foundation and starting point for the incoming expert. They won't be starting from absolute zero. You've explored the problem space, which can accelerate their onboarding.
Managing Parallel Development and the Evolving Nature of Evaluations
Stefan Drakulich brought up the very real-world scenario of building complex systems with many moving parts, often developed in parallel.
Q: When new products are built quickly, the architecture and the prompt pipeline often get developed in parallel. The prompt pipeline's effectiveness is usually dependent on the overall system architecture (the "piping"). How do you handle evaluating this when things are being built simultaneously, and the model's understanding of the task might shift as new agents or steps are added to the system?
This is a challenge many of us face in fast-paced development environments.
Strategies for Evaluating Systems Built in Parallel (Hamel):
Component-Level Evaluation: If you have multiple teams or individuals building interlocking pieces (different agents that will eventually work together):
- Encourage each team to evaluate their own component/agent as thoroughly as possible using the standard techniques we've discussed (error analysis, targeted synthetic data, specific evaluations). (Hamel mentioned that more specific analytical tools for agents would be covered later in the course)
Integration Testing (Carefully Timed):
- As components mature, conduct integration tests. This means looking at session-level error analysis when these pieces come together
- Essential Judgment Call: Don't rush into extensive integration testing too prematurely. You don't want to waste valuable time and effort doing detailed error analysis on a combined system if the individual components are clearly not ready or fundamentally flawed
- It's perfectly acceptable to wait until individual pieces reach a satisfactory level of stability and performance before investing heavily in larger-scale integration tests
Continuous Evaluation – Not a One-Off:
- Stefan asked whether the idea was to test that a pipeline generally functions and then treat that architecture as somewhat fixed for a round of evaluation
- Hamel's Response: Evaluation is not a one-time activity. Your product will change, user behavior will evolve, the underlying AI models will update, and your understanding of the problem space will deepen. You must constantly revisit and rerun your evaluations
The Flywheel Effect:
- The first time you set up a comprehensive evaluation process, it's a significant amount of work
- However, once established, a "flywheel" effect kicks in. You'll build automation (like LLM-as-a-judge systems, automated data pipelines for evaluations). You'll develop tools and scripts to assist with labeling or identifying patterns
- With experience and repeated cycles ("reps"), the process becomes much more efficient. You'll get better at spotting areas for improvement and even start building your own custom tooling to streamline your specific evaluation needs
Can an AI Become the Domain Expert?
Q: Following up on the discussion about a solo developer initially acting as the domain expert: Can we actually make an AI become a domain expert by iteratively refining its prompts and "co-working" with it? How feasible is that as a goal?
The AI as the Embodiment of Expertise (Hamel):
That is the Goal: Hamel's response was direct and insightful: Yes, the process of making an AI become a domain expert for a specific task is precisely what building an AI product is all about.
Task-Specific Expertise: Whether you're building a developer assistant, a sales support tool, a medical diagnosis aid, or any other AI application, your fundamental task is to imbue that AI with the necessary knowledge, reasoning capabilities, and "expertise" to perform its designated function effectively within its domain.
The prompts, the data, the architecture, and the evaluation loops are all means to this end: creating an AI that acts as a proficient expert for the job you've designed it to do.
Comprehensive Key Takeaways
This extensive discussion has covered substantial ground in LLM development and evaluation. Here's a consolidated list of the most fundamental takeaways for any AI engineer working with LLMs:
1. Error Analysis is Your Essential Guide: It's the starting point and ongoing guide for synthetic data strategy, prompt engineering, evaluation design, identifying generalization gaps, and even informing product-level decisions.
2. Start Focused, Systematically Expand: Don't try to solve everything simultaneously. Identify core problems or components first, stabilize them, and then broaden your scope. This applies to features, evaluations, and data generation.
3. Rubrics Demand Clarity, Consistency, and Expertise: Define them with meticulous detail, use concrete examples of good and bad outputs, and ensure genuine domain experts are involved in their creation and application.
4. Evaluate Complex Systems via Divide and Conquer: For tool use, chained prompts, or agentic architectures, isolate and evaluate individual components or nodes rigorously before (or alongside) assessing the integrated end-to-end system.
5. Distinguish Product-Level Gaps from AI Flaws: Sometimes, issues like ambiguous user input or missing conversational capabilities require changes to the product's design or flow, not just tweaks to the LLM.
6. Strive for High-Fidelity Simulation for Generalization: To ensure your LLM application performs reliably across diverse domains or novel scenarios, your testing and synthetic data must accurately reflect the true complexity and context of those situations.
7. Align Offline Evaluations with Online Reality: Continuously validate that your offline evaluation metrics (including LLM-as-a-judge outputs) are predictive of real-world user behavior and key business outcomes observed in A/B tests or production monitoring.
8. Human Expertise is Irreplaceable (But Be Pragmatic): Always aim to involve the best possible domain expertise in defining "good" and evaluating quality. If you're the interim expert (a solo developer), do your best, document your assumptions, and be prepared to iterate when more specialized expertise becomes available.
9. Evaluation is a Continuous, Iterative Process: It's not a one-time setup. As your product, users, and the underlying AI models evolve, your evaluation suite must also evolve. Embrace the flywheel effect as automation and experience make this more efficient over time.
10. The Goal IS an "AI Expert": Developing an AI product is fundamentally about creating a system that embodies the necessary expertise to perform its specific task effectively and reliably.
11. Be Practical with Guidelines and Cadences: Recommendations like "100 traces" or "weekly alignment checks" are practical starting points to build good habits. Use your judgment and adapt these based on your system's stability, the pace of change, and your team's observations.
12. Component-Level Evaluation Before Integration: Whether dealing with retrieval systems, tool-using agents, or complex chains, evaluate individual components thoroughly before attempting comprehensive end-to-end testing.
13. Match Evaluation Approach to Output Type: Use search metrics for search results, ML metrics for classification tasks, and reference-based metrics for structured extraction. Always prefer the simplest evaluation method that provides reliable signals.
14. Proactive Simulation for Generalization: Don't wait for new domains or use cases to appear. Include diverse scenarios in your synthetic data generation to identify potential generalization failures early.
15. Balance Speed with Quality in MVPs: Real user data is often more valuable than extensive pre-launch optimization, but set clear expectations with early users about prototype status.
Building with Large Language Models is a journey filled with both immense potential and unique challenges. By adopting a principled, data-driven, and iterative approach to development and evaluation, we can build truly remarkable and reliable AI systems. The comprehensive guidance provided here represents battle-tested approaches from practitioners who have navigated these challenges in production environments. The field continues to evolve rapidly, but these fundamental principles provide a solid foundation for building robust, scalable LLM applications.