
The Most Underinvested Skill in Software Engineering

Every developer knows how to write tests. Far fewer know how to test their tests — to verify that the tests actually catch the bugs they’re supposed to catch. Property-based testing, mutation testing, and contract testing are three approaches that expose gaps in conventional test suites that no amount of additional unit tests will fix. This guide covers all three with practical examples, explaining not just how to use them but when each earns its keep.

The Limits of Example-Based Testing

Standard unit tests are example-based: you choose specific inputs, run the function, and assert expected outputs. The fundamental limitation is that you, as the author, choose the examples. Your test suite reflects your mental model of how the code works — and your mental model is what produced the bug in the first place.

This is why you can have 90% code coverage and still ship bugs. Coverage measures whether your code was executed, not whether it was executed correctly. The inputs you didn’t think to test are usually the ones that fail in production.
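A minimal sketch of that gap, using a hypothetical fee function (names invented for illustration): both tests below execute every line, so a coverage tool reports 100%, yet the premium-pricing bug survives.

```python
def apply_fee_cents(amount_cents, is_premium):
    # Bug: premium accounts were supposed to pay no fee at all,
    # but this merely charges a reduced 1% instead of 5%
    fee = amount_cents // 100 if is_premium else amount_cents // 20
    return amount_cents + fee

def test_standard_fee():
    assert apply_fee_cents(10_000, is_premium=False) == 10_500

def test_premium_branch_runs():
    # Executes the premium branch, but the assertion is too weak to catch the bug
    assert apply_fee_cents(10_000, is_premium=True) > 0

test_standard_fee()
test_premium_branch_runs()
# Line coverage: 100%. The bug ships anyway.
```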

Property-Based Testing: Let the Computer Find Your Edge Cases

Property-based testing inverts the example-based model. Instead of specifying exact inputs and outputs, you specify properties — invariants that should hold true for all valid inputs. The framework generates hundreds or thousands of random inputs to try to violate your property.

In Python, Hypothesis is the de facto standard library for this:

from hypothesis import given, strategies as st

# Testing a simple example: a function that sorts a list
def my_sort(lst):
    # Deliberately buggy: keeps only one copy of the pivot, dropping duplicates
    if len(lst) <= 1:
        return lst
    pivot = lst[len(lst) // 2]
    left = [x for x in lst if x < pivot]
    middle = [pivot]  # Bug: should be [x for x in lst if x == pivot]
    right = [x for x in lst if x > pivot]
    return my_sort(left) + middle + my_sort(right)

# Property 1: output length equals input length
@given(st.lists(st.integers()))
def test_sort_length_preserved(lst):
    result = my_sort(lst)
    assert len(result) == len(lst)

# Property 2: output is sorted
@given(st.lists(st.integers()))
def test_sort_result_is_sorted(lst):
    result = my_sort(lst)
    for i in range(len(result) - 1):
        assert result[i] <= result[i + 1]

# Property 3: output contains the same elements
@given(st.lists(st.integers()))
def test_sort_same_elements(lst):
    result = my_sort(lst)
    assert sorted(result) == sorted(lst)  # Both sorted == same multiset

# Property 4: idempotent — sorting twice gives same result as once
@given(st.lists(st.integers()))
def test_sort_idempotent(lst):
    assert my_sort(my_sort(lst)) == my_sort(lst)

Hypothesis will automatically try empty lists, single-element lists, lists with duplicates, lists with negative numbers, and lists with extreme values. When it finds a failing case, it shrinks the input to the minimal failing example:

Falsifying example: test_sort_length_preserved(lst=[0, 0])
Traceback:
  AssertionError: assert len(result) == len(lst)  (1 != 2)
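When a failure is found, Hypothesis also records it in a local example database so it is retried on later runs, and the `@example` decorator lets you pin a regression input permanently. A sketch, using Python's built-in `sorted` as a stand-in for a corrected implementation:

```python
from hypothesis import given, example, strategies as st

def fixed_sort(lst):
    # Stand-in for the corrected sort (keeps duplicates)
    return sorted(lst)

@given(st.lists(st.integers()))
@example([0, 0])  # Pin a previously shrunk failing input as a permanent regression case
def test_sort_length_preserved(lst):
    assert len(fixed_sort(lst)) == len(lst)
```

Pinned examples run on every machine, not just the one whose example database happened to see the failure.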

Property-Based Testing for Data Parsing

Property-based testing shines especially brightly on parser and serializer round-trips. Any data format that can be serialized and deserialized should satisfy deserialize(serialize(x)) == x.

import json
from hypothesis import given, strategies as st

# Define a strategy for JSON-serializable data
json_strategy = st.recursive(
    st.one_of(
        st.none(),
        st.booleans(),
        st.integers(min_value=-(2**53), max_value=2**53),
        st.floats(allow_nan=False, allow_infinity=False),
        st.text()
    ),
    lambda children: st.one_of(
        st.lists(children),
        st.dictionaries(st.text(), children)
    ),
    max_leaves=20
)

@given(json_strategy)
def test_json_roundtrip(data):
    """Verify JSON serialization/deserialization is lossless"""
    assert json.loads(json.dumps(data)) == data

# Test your custom serializer (assumes a User dataclass with
# matching to_dict()/from_dict() methods defined elsewhere)
@given(st.builds(User, id=st.integers(min_value=1),
                 email=st.emails(),
                 name=st.text(min_size=1, max_size=100)))
def test_user_serialization_roundtrip(user):
    serialized = user.to_dict()
    deserialized = User.from_dict(serialized)
    assert deserialized == user

Mutation Testing: Verify Your Tests Actually Catch Bugs

Mutation testing answers the question: “If I introduce a bug into this code, will my tests fail?” The framework automatically creates hundreds of mutants — modified versions of your code with small changes (flipping a > to >=, changing a + to -, deleting a line) — and checks whether your tests catch each mutation. Mutations that survive (tests don’t fail) indicate gaps in your test suite.
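The principle is visible without any tooling. A hand-rolled sketch with a hypothetical is_adult function and one manually written mutant:

```python
def is_adult(age):
    return age >= 18          # Original

def is_adult_mutant(age):
    return age > 18           # Mutant: >= flipped to >

def weak_suite(fn):
    # Only tests inputs far from the boundary
    assert fn(30) is True
    assert fn(5) is False

def strong_suite(fn):
    weak_suite(fn)
    assert fn(18) is True     # Boundary check: this is what kills the mutant

weak_suite(is_adult)
weak_suite(is_adult_mutant)   # Passes too: the weak suite cannot tell them apart
strong_suite(is_adult)        # Passes
# strong_suite(is_adult_mutant) would raise AssertionError: mutant killed
```

A mutation testing framework simply automates this: generate thousands of such mutants, run the suite against each, and report the survivors.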

In Python, mutmut is the most practical tool:

# Install and run mutmut
pip install mutmut
mutmut run --paths-to-mutate src/

# Check results
mutmut results

# Show surviving mutants (the dangerous ones)
mutmut show 42  # Show mutant #42

# Example surviving mutant output:
# --- src/pricing.py
# +++ src/pricing.py
# @@ -14,7 +14,7 @@
#  def calculate_discount(price, quantity):
# -    if quantity >= 10:
# +    if quantity > 10:
#          return price * 0.9
#      return price

This surviving mutant tells you: you don't have a test that verifies the discount applies when quantity is exactly 10. That's a real gap: off-by-one errors on boundary conditions are notoriously common in production.

# After mutation testing, you add the missing test:
def test_discount_applied_at_exactly_ten_units():
    assert calculate_discount(100, 10) == 90  # 10% discount

def test_discount_not_applied_at_nine_units():
    assert calculate_discount(100, 9) == 100  # No discount

def test_discount_boundary_conditions():
    assert calculate_discount(100, 9) == 100
    assert calculate_discount(100, 10) == 90   # Boundary: 10 exactly
    assert calculate_discount(100, 11) == 90   # Above threshold

Mutation Score and What It Means

Mutation score = killed mutants / total mutants. A score below 80% typically indicates significantly undertested code. Above 90% is good; 95%+ is excellent for critical path code.
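The arithmetic, sketched with made-up numbers:

```python
def mutation_score(killed, total):
    """Fraction of generated mutants that the test suite killed."""
    return killed / total

# Hypothetical run: 190 mutants generated, 171 killed, 19 survived
score = mutation_score(171, 190)
print(f"mutation score: {score:.0%}")  # -> mutation score: 90%
```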

Running mutation testing on every commit is expensive — it’s O(mutants * test_suite_time). Practical approach: run it in CI on the critical modules (payment logic, auth, data validation) only, not the entire codebase.

# setup.cfg (mutmut reads its configuration from the [mutmut] section here)
[mutmut]
paths_to_mutate=src/payments/,src/auth/,src/validation/
runner=python -m pytest src/tests/ -x --timeout=10

Contract Testing: Preventing Integration Failures Between Services

Contract testing verifies that a service consumer and its provider agree on their API contract — without requiring both services to be running simultaneously. It’s particularly valuable in microservice architectures where integration tests are slow and brittle.

Pact is the standard framework for consumer-driven contract testing:

# Consumer test (order-service consuming user-service API)
import pytest
from pact import Consumer, Like, Provider

@pytest.fixture
def pact():
    pact = Consumer('order-service').has_pact_with(Provider('user-service'))
    pact.start_service()
    yield pact
    pact.stop_service()

def test_get_user_for_order_creation(pact):
    # Define what the consumer expects
    (pact
        .given('user 4821 exists and is active')
        .upon_receiving('a request for user 4821')
        .with_request(
            method='GET',
            path='/v1/users/4821',
            headers={'Authorization': 'Bearer valid-token'}
        )
        .will_respond_with(
            status=200,
            headers={'Content-Type': 'application/json'},
            body={
                'id': 4821,
                'email': 'alice@example.com',
                'status': 'active',
                'subscription_tier': Like('pro')  # Any string is acceptable here
            }
        ))

    # Run the actual consumer code against the Pact mock; exiting the
    # `with pact:` block verifies that the expected request was made
    # (assumes user_service_client points at the mock's base URL, pact.uri)
    with pact:
        result = user_service_client.get_user(4821)

    assert result.id == 4821
    assert result.status == 'active'

# Provider test (user-service verifying it satisfies the contract)
from pact import Verifier

def test_user_service_satisfies_order_service_contract():
    verifier = Verifier(
        provider='user-service',
        provider_base_url='http://localhost:8001'
    )

    output, _ = verifier.verify_pacts(
        './pacts/order-service-user-service.json',
        provider_states_setup_url='http://localhost:8001/pact/provider-states'
    )

    assert output == 0, "Provider failed to satisfy consumer contracts"

The key insight: the consumer defines what it needs from the provider. The contract is published (to a Pact Broker or the file system). The provider’s CI runs the contract tests against its actual implementation. Both sides can deploy independently as long as contracts are satisfied.

Choosing the Right Advanced Testing Tool

These three approaches complement rather than replace each other:

  • Property-based testing: Best for pure functions, parsers, serializers, algorithms, and anything where you can describe invariants more easily than enumerate examples. Biggest ROI when you’re writing new code from scratch.
  • Mutation testing: Best as an audit tool — run it on existing code to find coverage gaps in critical paths. Not worth running on every commit; weekly or per-release is sufficient.
  • Contract testing: Best for multi-team microservice environments where integration tests are slow or unreliable. The ROI is highest when deployment of services is independent.

The common thread: all three approaches are ways of making testing more systematic and less dependent on the individual developer’s imagination. Your bugs live in the cases you didn’t think of. These tools help you find those cases before your users do.

By Michael Sun

Founder and Editor-in-Chief of NovVista. Software engineer with hands-on experience in cloud infrastructure, full-stack development, and DevOps. Writes about AI tools, developer workflows, server architecture, and the practical side of technology. Based in China.
