Key Takeaway
By the end of this blueprint you will have an automated LLM evaluation framework with versioned test datasets, LLM-as-judge scoring with calibrated rubrics, deterministic assertion checks, regression detection against baselines, and a CI/CD gate that blocks prompt and model changes that fail quality thresholds.
Prerequisites
- An LLM application with at least one prompt-based feature to evaluate
- Python 3.11+ with pytest for the test harness
- An LLM API key for judge evaluations (separate from the system under test)
- At least 50 representative test cases for your application domain
- A CI system (GitHub Actions, GitLab CI, etc.) for automated evaluation runs
Why Traditional Tests Fail for LLMs
Traditional unit tests assert exact equality: `assertEqual(output, expected)`. LLM outputs are non-deterministic — the same prompt produces different wording every time, so you cannot assert exact matches. Instead, you need tests that evaluate along dimensions: is the response factually accurate? Does it follow the format instructions? Is it safe and appropriate? Does it use the provided context rather than hallucinating? Each dimension requires its own evaluation method, ranging from simple regex checks to LLM-as-judge scoring.
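As a minimal sketch of the difference (the function names, responses, and banned phrases here are illustrative, not part of any framework), deterministic dimension checks accept any output that satisfies the property, where an exact-equality assertion would fail on harmless rewording:

```python
import json


def check_format_json(response: str) -> bool:
    """Deterministic format check: does the response parse as JSON?"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def check_must_not_contain(response: str, banned: list[str]) -> bool:
    """Safety-style check: none of the banned phrases appear (case-insensitive)."""
    lowered = response.lower()
    return not any(phrase.lower() in lowered for phrase in banned)


# Two differently worded answers both pass the same dimension checks,
# which assertEqual(output, expected) never could.
a = '{"answer": "Paris", "confidence": 0.9}'
b = '{"answer": "The capital of France is Paris"}'
assert check_format_json(a) and check_format_json(b)
assert check_must_not_contain(a, ["I cannot answer"])
```

Accuracy and relevance need an LLM judge, but format and safety dimensions like these stay cheap, fast, and fully deterministic.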
Evaluation Dataset Design
Your evaluation dataset is the foundation of your testing framework. Each test case specifies an input (the user request and any context), the expected behavior (not the exact output, but what a good output should contain or avoid), and metadata (difficulty, category, source). Start with 50-100 test cases covering your most important scenarios, and grow the dataset over time by adding cases for every bug you find in production. Version the dataset alongside your prompts so you can track how quality changes over time.
"""Evaluation dataset schema and management."""
from __future__ import annotations
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Literal
@dataclass
class EvalCase:
"""A single evaluation test case."""
id: str
category: str # e.g., "factual", "safety", "format", "reasoning"
difficulty: Literal["easy", "medium", "hard"]
# Input
user_message: str
system_prompt: str | None = None
context: str | None = None # RAG context, if applicable
# Expected behavior (not exact output)
must_contain: list[str] = field(default_factory=list)
must_not_contain: list[str] = field(default_factory=list)
expected_format: str | None = None # "json", "markdown", "bullet-list"
reference_answer: str | None = None # For similarity scoring
# Scoring dimensions to evaluate
dimensions: list[str] = field(
default_factory=lambda: ["accuracy", "relevance", "safety"]
)
@dataclass
class EvalDataset:
"""Versioned evaluation dataset."""
name: str
version: str
cases: list[EvalCase]
@classmethod
def load(cls, path: Path) -> "EvalDataset":
"""Load dataset from a JSONL file."""
with open(path) as f:
metadata = json.loads(f.readline())
cases = [EvalCase(**json.loads(line)) for line in f]
return cls(
name=metadata["name"],
version=metadata["version"],
cases=cases,
)
def filter_by_category(self, category: str) -> list[EvalCase]:
return [c for c in self.cases if c.category == category]Unlock the full Knowledge Base