
Evaluating LLM Output Is Not a Metrics Problem — It Is a Philosophy Problem

Most teams building LLM-powered applications underestimate evaluation until they have shipped something to production and discovered that “it looked good in testing” is not a methodology. LLM evaluation is hard in ways that traditional software testing is not: there is no ground truth for subjective tasks, outputs are probabilistic and variable, and the failure modes are qualitative rather than binary. This guide covers the evaluation frameworks, metrics, and tools that teams actually use in production — plus the conceptual framework for thinking about evaluation problems that have no clean automated solution.

Why LLM Evaluation Is Genuinely Hard

Traditional software tests have a clear structure: given input X, expect output Y. Deterministic, binary, automatable. LLM evaluation breaks every assumption of that model.

Consider a customer support bot that should answer questions about your product accurately and helpfully. How do you test it? “Accurately” requires knowing the correct answer to compare against — but for open-ended questions, there may be multiple valid answers, none of which exactly match your reference. “Helpfully” is a subjective quality assessment. And the same prompt sent to the same model twice may produce meaningfully different outputs.

This is not a problem you solve with better tooling. It is a problem you manage with the right combination of automated metrics, human judgment, and production monitoring — recognizing that no single approach is sufficient.

The Four Evaluation Paradigms

1. Reference-Based Metrics

When you have a ground truth — a set of questions with known correct answers — reference-based metrics compare model output to the reference answer.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures n-gram overlap between the generated text and a reference. ROUGE-1 compares unigrams, ROUGE-2 bigrams, ROUGE-L longest common subsequence. Originally developed for summarization evaluation.

BLEU (Bilingual Evaluation Understudy): Similar n-gram overlap metric, originally for machine translation. Measures precision (how much of the output appears in the reference) rather than recall.

BERTScore: Uses contextual embeddings from BERT to measure semantic similarity between output and reference, rather than surface-level token overlap. Better at capturing paraphrases and semantically equivalent outputs that differ in phrasing.

from bert_score import score as bert_score
from rouge_score import rouge_scorer

# Example: evaluate summarization output
references = [
    "The ACME protocol automates TLS certificate issuance using challenge-response verification.",
]
candidates = [
    "ACME is a protocol that handles automatic certificate management by verifying domain ownership.",
]

# ROUGE scores
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(references[0], candidates[0])
print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-2 F1: {scores['rouge2'].fmeasure:.3f}")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")

# BERTScore — captures semantic similarity beyond token overlap
P, R, F1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean():.3f}")

Reference-based metrics have a fundamental limitation: they require high-quality reference answers, and they measure similarity to those references, not correctness or quality. An output that is factually correct but phrased differently from the reference will score poorly. Use them for tasks where you have high-quality references and the output space is constrained (summarization, translation, factual Q&A).
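The gap is easy to see with a toy unigram-overlap F1 (a stdlib-only stand-in for ROUGE-1, not the real scorer): a factually correct paraphrase scores near zero simply because it shares few surface tokens with the reference.

```python
def unigram_f1(candidate: str, reference: str) -> float:
    """Toy ROUGE-1-style F1: overlap of lowercased word sets."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The server returns HTTP 429 when the rate limit is exceeded."
paraphrase = "Clients that exceed their quota receive a 429 status code."
copy = "The server returns HTTP 429 when the rate limit is exceeded."

print(unigram_f1(copy, reference))        # 1.0: an exact copy scores perfectly
print(unigram_f1(paraphrase, reference))  # near zero, despite being factually correct
```

Both candidates are correct; only the one that repeats the reference verbatim scores well.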

2. LLM-as-Judge

The most widely used paradigm for qualitative evaluation today uses a capable LLM (typically GPT-4o or Claude Sonnet) to judge the output of another LLM. The judge receives a rubric and scores outputs against it.

import json

import anthropic

client = anthropic.Anthropic()

def evaluate_with_llm_judge(
    question: str,
    model_output: str,
    criteria: list[str]
) -> dict:
    """
    Use Claude as a judge to evaluate LLM output quality.
    Returns scores and reasoning for each criterion.
    """
    criteria_text = "\n".join(f"{i+1}. {c}" for i, c in enumerate(criteria))

    prompt = f"""You are evaluating the quality of an AI assistant's response.

Question asked: {question}

Response to evaluate:
{model_output}

Evaluate the response against these criteria:
{criteria_text}

For each criterion, provide:
- Score: 1 (poor), 2 (acceptable), 3 (good), 4 (excellent)
- Brief reasoning (1-2 sentences)

Respond in JSON format:
{{
  "scores": {{
    "criterion_name": {{"score": X, "reasoning": "..."}}
  }},
  "overall_score": X,
  "overall_assessment": "..."
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    return json.loads(message.content[0].text)

# Usage example
result = evaluate_with_llm_judge(
    question="How do I implement rate limiting in a REST API?",
    model_output="...",
    criteria=[
        "Technical accuracy — is the information correct?",
        "Completeness — does it cover the key approaches?",
        "Code quality — are code examples correct and idiomatic?",
        "Clarity — is the explanation easy to follow?"
    ]
)
print(f"Overall score: {result['overall_score']}/4")

LLM-as-judge correlates well with human judgment for many tasks (LMSYS’s research shows 80%+ agreement with human preference ratings), but it has well-documented biases: position bias (preferring the first option when comparing), verbosity bias (preferring longer outputs regardless of quality), and self-enhancement bias (a model tends to prefer its own outputs when acting as judge).

Mitigate these with positional swap testing (run each comparison twice with the candidates swapped and flag disagreements), a judge model different from the model under evaluation, and calibration of the judge against human labels on a representative sample.
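The positional swap test is a few lines of code. The sketch below is an illustration, not a library API: it assumes a pairwise `judge` callable returning "first", "second", or "tie", and only accepts a verdict that survives swapping the candidates.

```python
def positionally_robust_compare(judge, question, output_a, output_b):
    """
    Run a pairwise judge twice with the candidates in each order and
    accept the verdict only if it survives the swap.
    `judge(question, first, second)` must return "first", "second", or "tie".
    """
    forward = judge(question, output_a, output_b)   # "first" here means A
    backward = judge(question, output_b, output_a)  # "first" here means B
    flip = {"first": "second", "second": "first", "tie": "tie"}
    if flip[backward] == forward:
        winner = {"first": "A", "second": "B", "tie": "tie"}[forward]
        return {"winner": winner, "consistent": True}
    # The verdict flipped with position: flag for human review instead of trusting it.
    return {"winner": None, "consistent": False}

# A purely position-biased judge is caught immediately:
biased = lambda q, x, y: "first"
print(positionally_robust_compare(biased, "q", "answer A", "answer B"))
# {'winner': None, 'consistent': False}
```

In practice `judge` would wrap an API call to the judge model; any deterministic stub works for testing the harness itself.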

3. Human Evaluation

For high-stakes tasks, human evaluation remains the ground truth. The question is not whether to use human evaluation but how to make it efficient and consistent.

Key principles for reliable human evaluation:

Blind evaluation: Evaluators should not know which model or prompt generated the output being scored. Knowing the source introduces bias even in well-intentioned evaluators.

Clear rubrics with examples: “Is this response helpful?” is not a rubric. “Rate the response on helpfulness: 1 = does not address the question, 2 = partially addresses the question, 3 = addresses the question but with gaps or errors, 4 = fully and accurately addresses the question” — with a worked example at each level — produces consistent scores.

Inter-annotator agreement: Have multiple evaluators score a random sample and measure agreement (Cohen’s Kappa for categorical ratings). Low agreement signals that your rubric is ambiguous, not that your evaluators are unreliable.
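Cohen's Kappa is simple enough to compute without a stats package. A minimal sketch for two annotators assigning categorical labels (the example scores below are invented):

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Two annotators scoring 8 responses on the 1-4 helpfulness rubric:
a = [4, 3, 3, 2, 4, 1, 3, 2]
b = [4, 3, 2, 2, 4, 1, 3, 3]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # kappa = 0.65
```

As a rough rule of thumb, values above ~0.6 indicate a workable rubric; lower values mean the rubric needs tightening before scores can be trusted.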

4. Task-Specific Automated Metrics

For structured tasks, build custom automated evaluations that test specific properties:

import ast
import subprocess
import tempfile
import os

def evaluate_code_output(
    generated_code: str,
    test_cases: list[dict]
) -> dict:
    """
    Evaluate generated code by actually running it against test cases.
    Much more reliable than textual similarity for code evaluation.
    """
    results = {
        "syntax_valid": False,
        "tests_passed": 0,
        "tests_total": len(test_cases),
        "errors": []
    }
    
    # Check syntax validity
    try:
        ast.parse(generated_code)
        results["syntax_valid"] = True
    except SyntaxError as e:
        results["errors"].append(f"Syntax error: {e}")
        return results
    
    # Run against test cases
    for i, test_case in enumerate(test_cases):
        test_code = f"""
{generated_code}

# Test case {i+1}
result = {test_case['call']}
expected = {repr(test_case['expected'])}
assert result == expected, f"Expected {{expected}}, got {{result}}"
print("PASS")
"""
        with tempfile.NamedTemporaryFile(
            mode='w', suffix='.py', delete=False
        ) as f:
            f.write(test_code)

        # Run the file after it is closed (required on Windows, harmless elsewhere)
        try:
            proc = subprocess.run(
                ["python3", f.name],
                capture_output=True, text=True, timeout=5
            )
            if "PASS" in proc.stdout:
                results["tests_passed"] += 1
            else:
                results["errors"].append(
                    f"Test {i+1} failed: {proc.stderr[:200]}"
                )
        except subprocess.TimeoutExpired:
            results["errors"].append(f"Test {i+1} timed out after 5s")
        finally:
            os.unlink(f.name)
    
    results["pass_rate"] = (
        results["tests_passed"] / results["tests_total"] if test_cases else 0.0
    )
    return results

Evaluation Frameworks

RAGAS: RAG-Specific Evaluation

For Retrieval-Augmented Generation systems, RAGAS provides a framework measuring four dimensions: faithfulness (does the answer only use information from the retrieved context?), answer relevance (how relevant is the answer to the question?), context recall (does the retrieved context contain the information needed?), and context precision (is the retrieved context relevant?).

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["What is mTLS?", "How does ACME work?"],
    "answer": ["mTLS requires both client and server to authenticate...", "ACME uses challenge-response verification..."],
    "contexts": [
        ["mTLS or mutual TLS is a protocol where both parties authenticate..."],
        ["The ACME protocol automates certificate issuance by verifying..."]
    ],
    "ground_truth": ["Mutual TLS authenticates both client and server...", "ACME automates TLS certificate issuance..."]
}

dataset = Dataset.from_dict(eval_data)
result = evaluate(dataset, metrics=[
    faithfulness, answer_relevancy, context_recall, context_precision
])
print(result)

PromptFoo: Systematic Prompt Testing

PromptFoo is a CLI tool for testing prompts systematically across models and datasets. It integrates into CI pipelines and can run assertions against model outputs automatically.

# promptfooconfig.yaml
prompts:
  - file://prompts/code-reviewer.txt

providers:
  - id: ollama:qwen2.5-coder:7b
  - id: openai:gpt-4o-mini

tests:
  - description: "Should identify SQL injection"
    vars:
      code: |
        query = f"SELECT * FROM users WHERE id = {user_input}"
        cursor.execute(query)
    assert:
      - type: contains
        value: "SQL injection"
      - type: llm-rubric
        value: "The response identifies the SQL injection vulnerability and provides a parameterized query fix"

  - description: "Should not flag safe code"
    vars:
      code: |
        cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
    assert:
      - type: not-contains
        value: "SQL injection"

# Run evaluation
promptfoo eval

# Compare results across providers
promptfoo view

Production Monitoring: Evaluation Does Not End at Deployment

Model behavior in production differs from model behavior in your evaluation set — sometimes dramatically. Users ask questions you did not anticipate, in formats you did not test, and the model’s response quality drifts as you update prompts, change models, or the underlying model is updated by the provider.

Production monitoring for LLM applications requires:

  • Logging all inputs and outputs: This is your ground truth for post-hoc analysis. Store every request and response with timestamps, model version, and prompt template version.
  • Sampling for human review: Review a random 1–5% of production outputs weekly. This is how you catch quality degradation before users do.
  • User feedback signals: Thumbs up/down, explicit corrections, follow-up clarification requests — these are weak but real quality signals at scale.
  • Automated regression tests on each deployment: Before changing a prompt template or updating to a new model version, run your full evaluation suite and require it to meet a minimum quality threshold.
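A minimal sketch of the first two practices, using JSON Lines as the log format (the field names and sampling rate here are illustrative choices, not a standard):

```python
import json
import random
import time

def log_interaction(log_path, *, prompt, response, model, template_version):
    """Append one structured record per request: the raw material for post-hoc analysis."""
    record = {
        "ts": time.time(),
        "model": model,
        "template_version": template_version,
        "prompt": prompt,
        "response": response,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

def sample_for_review(log_path, rate=0.02, seed=None):
    """Draw a random ~rate fraction of logged interactions for weekly human review."""
    rng = random.Random(seed)
    with open(log_path) as f:
        return [json.loads(line) for line in f if rng.random() < rate]
```

In a real system the log would go to a database or observability platform rather than a local file, but the principle is the same: one structured record per request, sampled at a fixed rate for human eyes.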

The Uncomfortable Truth About LLM Evaluation

There is no metric combination that definitively tells you your LLM application is working well. Reference-based metrics miss semantically correct paraphrases. LLM judges have biases. Human evaluation is expensive and slow. Task-specific metrics only cover what you thought to test.

The teams building reliable LLM applications in production use all of these approaches in combination — and they maintain healthy skepticism about each one. They invest heavily in logging so they can learn from production data, they have clear quality thresholds that block deployment when violated, and they treat evaluation as an ongoing practice rather than a pre-launch checklist.

The goal is not perfect evaluation. It is good-enough evaluation that catches regressions before users do and generates the feedback loop needed to improve the system over time.

Key Takeaways

  • No single evaluation metric is sufficient. Use reference-based metrics for structured tasks with ground truth, LLM-as-judge for qualitative assessment, and task-specific automated tests for code and structured outputs.
  • LLM-as-judge correlates well with human judgment but has documented biases — mitigate with positional swap tests and judge calibration against human labels.
  • RAGAS provides standardized evaluation dimensions for RAG systems (faithfulness, relevance, recall, precision) that are difficult to measure manually.
  • PromptFoo integrates LLM evaluation into CI pipelines with declarative test configuration and multi-model comparison.
  • Production monitoring — logging, sampling, user feedback — is not optional. Evaluation sets do not capture the full distribution of production inputs.

By Michael Sun

Founder and Editor-in-Chief of NovVista. Software engineer with hands-on experience in cloud infrastructure, full-stack development, and DevOps. Writes about AI tools, developer workflows, server architecture, and the practical side of technology. Based in China.
