
Sunday at 3 AM Is When the Pipeline Decides to Teach You a Lesson

I have been paged on more Sundays than I care to count, and the root cause is almost never a hardware failure or a code bug. It is the data pipeline. Specifically, it is an ETL job that worked perfectly Monday through Friday and then quietly broke because the data it expected on Sunday looks nothing like the data it gets the rest of the week.

Production ETL pipelines fail on Sundays (and holidays, and month-end boundaries, and daylight saving transitions) because they are built on implicit assumptions about data that are only true most of the time. The most dangerous assumption is that the future will look like the recent past. Data pipelines are where that assumption goes to die.

The Seven Ways Pipelines Break on Weekends

1. Volume Drops Trigger False Alarms (or Mask Real Ones)

Most monitoring for data pipelines includes row count checks: if today’s batch has significantly fewer rows than yesterday’s, fire an alert. On Sundays, transaction volumes legitimately drop 40-70% for many B2B applications. The alert fires every Sunday. After a few weeks, the on-call engineer silences it. Then one Sunday, the pipeline actually breaks and nobody notices because the alert was already muted.

# The naive approach: static threshold
def check_row_count(table, expected_min=10000):
    count = db.query(f"SELECT COUNT(*) FROM {table} WHERE date = CURRENT_DATE")
    if count < expected_min:
        alert(f"Low row count in {table}: {count} < {expected_min}")

# The better approach: day-of-week aware thresholds
def check_row_count_smart(table):
    current_day = datetime.now().strftime("%A")
    
    # Calculate expected range from same day-of-week in past 4 weeks
    stats = db.query("""
        SELECT 
            AVG(row_count) as avg_count,
            COALESCE(STDDEV(row_count), 0) as std_count  -- STDDEV is NULL with fewer than 2 runs
        FROM pipeline_metrics 
        WHERE table_name = %s 
          AND day_of_week = %s 
          AND run_date > CURRENT_DATE - INTERVAL '28 days'
    """, [table, current_day])
    
    lower_bound = max(0, stats.avg_count - 2 * stats.std_count)
    upper_bound = stats.avg_count + 2 * stats.std_count
    
    count = get_today_count(table)
    if count < lower_bound or count > upper_bound:
        alert(
            f"Anomalous row count in {table}: {count}. "
            f"Expected {lower_bound:.0f}-{upper_bound:.0f} for {current_day}."
        )
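The day-aware check assumes a pipeline_metrics table that already holds per-day row counts, so every run has to record its own count for future comparisons. A minimal sketch of both halves (the anomaly_bounds and record_run helpers and the exact column names are illustrative, not from any particular library):

```python
from datetime import date

def anomaly_bounds(avg_count: float, std_count: float, z: float = 2.0):
    """The band used above: mean +/- z standard deviations,
    clamped at zero because a row count cannot be negative."""
    lower = max(0.0, avg_count - z * std_count)
    upper = avg_count + z * std_count
    return lower, upper

def record_run(db, table: str, row_count: int, run_date: date):
    """Append today's count so future same-day-of-week checks have history.
    Assumes a pipeline_metrics(table_name, run_date, day_of_week, row_count)
    table, as implied by the query above."""
    db.execute(
        "INSERT INTO pipeline_metrics (table_name, run_date, day_of_week, row_count) "
        "VALUES (%s, %s, %s, %s)",
        [table, run_date, run_date.strftime("%A"), row_count],
    )
```

Without the recording half, the smart check silently degrades: after 28 days with no metrics, the AVG and STDDEV come back empty and every comparison is against nothing.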

2. Source System Maintenance Windows

Many upstream data sources schedule maintenance on weekends. Your pipeline tries to connect at its usual time, gets a connection refused or a timeout, and either fails loudly or — worse — succeeds with an empty or partial dataset.

# A robust extraction step should handle maintenance windows
import tenacity
import httpx

@tenacity.retry(
    stop=tenacity.stop_after_attempt(5),
    wait=tenacity.wait_exponential(multiplier=1, min=60, max=3600),
    retry=tenacity.retry_if_exception_type((httpx.ConnectError, httpx.ReadTimeout)),
    before_sleep=lambda retry_state: logger.warning(
        "source_unavailable",
        attempt=retry_state.attempt_number,
        wait=retry_state.next_action.sleep,
        source="billing-api",
    ),
)
async def extract_billing_data(date: str):
    # Use a context-managed client so the connection is closed even on failure
    async with httpx.AsyncClient(timeout=300) as client:
        response = await client.get(
            f"https://billing-api.internal/export?date={date}",
        )
    
    if response.status_code == 503:
        raise httpx.ConnectError("Source in maintenance mode")
    
    data = response.json()
    
    # Critical: validate that we got a complete dataset
    if data.get("is_partial"):
        raise PartialDataError(
            f"Source returned partial data for {date}. "
            f"Expected ~{data.get('expected_count', 'unknown')} records, "
            f"got {len(data['records'])}. Likely maintenance window."
        )
    
    return data

3. Timezone and Daylight Saving Transitions

Twice a year, the clocks change. And twice a year, ETL pipelines that assume 24 hours in a day break. A pipeline that processes data in daily batches from midnight to midnight will either process 23 or 25 hours of data during a DST transition, leading to undercounting or duplicate processing.

# Bad: naive date arithmetic
from datetime import datetime, timedelta

def get_date_range(date):
    start = datetime(date.year, date.month, date.day)
    end = start + timedelta(days=1)  # a local day is NOT always 24 real hours!
    return start, end

# Good: timezone-aware date boundaries
from zoneinfo import ZoneInfo

def get_date_range_safe(date, tz_name="America/New_York"):
    tz = ZoneInfo(tz_name)
    start = datetime(date.year, date.month, date.day, tzinfo=tz)
    end = datetime(date.year, date.month, date.day, tzinfo=tz) + timedelta(days=1)
    
    # Careful: subtracting these two aware datetimes directly still gives
    # exactly 24 hours, because Python ignores a shared tzinfo and compares
    # wall-clock values. Converted to UTC, the span comes out right:
    # spring forward = 23 hours, fall back = 25 hours.

    # Convert to UTC for database queries
    return start.astimezone(ZoneInfo("UTC")), end.astimezone(ZoneInfo("UTC"))

# Even better: store and process everything in UTC
# and only convert to local time at the presentation layer
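The 23- and 25-hour day lengths are easy to verify directly. This standalone sketch (real_span is a hypothetical helper; the 2024 US transition dates are used purely for illustration) measures the true UTC span of one local calendar day:

```python
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")
UTC = ZoneInfo("UTC")

def real_span(local_date: date) -> timedelta:
    """Elapsed real time covered by one local calendar day.
    Converting to UTC before subtracting avoids Python's wall-clock
    arithmetic for aware datetimes that share a tzinfo."""
    start = datetime(local_date.year, local_date.month, local_date.day, tzinfo=NY)
    end = start + timedelta(days=1)  # wall-clock "+1 day"
    return end.astimezone(UTC) - start.astimezone(UTC)

# 2024-03-10 (spring forward) spans 23 hours; 2024-11-03 (fall back) spans 25
```

A pipeline that windows its queries on the UTC-converted boundaries automatically picks up the short and long days.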

4. Late-Arriving Data

Weekend data often arrives late. Batch systems that process “yesterday’s data” at 6 AM assume the data is complete by then. On weekdays, this is usually true. On Sundays, data from Saturday night might not land until noon because the source system’s batch export runs on a weekend schedule.

# Watermark-based completeness check
async def wait_for_data_completeness(
    source: str,
    date: str,
    expected_watermark: str = "23:59",
    timeout_hours: int = 6,
    check_interval_minutes: int = 15,
):
    """Wait until the source data includes events up to the expected watermark."""
    start_time = datetime.now()
    timeout = timedelta(hours=timeout_hours)
    
    while datetime.now() - start_time < timeout:
        latest_event = await db.fetchval(
            "SELECT MAX(event_timestamp) FROM raw_events "
            "WHERE source = $1 AND event_date = $2",
            source, date,
        )
        
        if latest_event and latest_event.strftime("%H:%M") >= expected_watermark:
            logger.info("data_complete", source=source, date=date, 
                       latest_event=str(latest_event))
            return True
        
        logger.info("waiting_for_data", source=source, date=date,
                    latest_event=str(latest_event), expected=expected_watermark)
        await asyncio.sleep(check_interval_minutes * 60)
    
    raise DataIncompleteError(
        f"Data for {source}/{date} not complete after {timeout_hours} hours. "
        f"Latest event: {latest_event}. Expected watermark: {expected_watermark}."
    )

5. Schema Changes That Land on Fridays

Someone deploys a schema change on Friday afternoon (against all common sense). The application is fine because it handles the new schema. But the ETL pipeline, which reads from a replica or an event stream, does not know about the new column or the renamed field. It runs fine Friday night and Saturday because the new field is nullable and the pipeline ignores extra fields. On Sunday, the first record with a required new field arrives, and the pipeline chokes.

# Defensive schema handling in transformation step
import pandas as pd

def transform_orders(raw_df: pd.DataFrame) -> pd.DataFrame:
    # Define expected schema with defaults
    expected_columns = {
        "order_id": {"type": "str", "required": True},
        "user_id": {"type": "str", "required": True},
        "amount": {"type": "float", "required": True},
        "currency": {"type": "str", "default": "USD"},
        "discount_code": {"type": "str", "default": None},
        "fulfillment_type": {"type": "str", "default": "standard"},
    }
    
    # Check for missing required columns
    missing_required = [
        col for col, spec in expected_columns.items()
        if spec.get("required") and col not in raw_df.columns
    ]
    if missing_required:
        raise SchemaError(
            f"Missing required columns: {missing_required}. "
            f"Available columns: {list(raw_df.columns)}. "
            f"This may indicate an upstream schema change."
        )
    
    # Add missing optional columns with defaults
    for col, spec in expected_columns.items():
        if col not in raw_df.columns and "default" in spec:
            raw_df[col] = spec["default"]
            logger.warning("schema_drift", column=col, action="added_default",
                          default=spec["default"])
    
    # Warn about unexpected new columns
    known_columns = set(expected_columns.keys())
    unknown_columns = set(raw_df.columns) - known_columns
    if unknown_columns:
        logger.warning("schema_drift", new_columns=list(unknown_columns),
                      action="ignored")
    
    return raw_df[list(expected_columns.keys())]

6. Resource Contention from Backup Jobs

Many organizations schedule database backups, index rebuilds, and VACUUM operations on weekends. These maintenance tasks compete for I/O and CPU with ETL pipelines that run on the same schedule. The pipeline that takes 20 minutes on a Tuesday takes 3 hours on a Sunday because the database is also running a full backup.

# Check for competing operations before running heavy queries
async def check_database_load(max_active_queries: int = 20):
    active = await db.fetchval("""
        SELECT COUNT(*) FROM pg_stat_activity 
        WHERE state = 'active' 
          AND query NOT LIKE '%pg_stat_activity%'
          AND backend_type = 'client backend'
    """)
    
    if active > max_active_queries:
        logger.warning("high_database_load", active_queries=active,
                      threshold=max_active_queries)
        return False
    return True

# Also check for long-running maintenance operations
async def check_for_maintenance():
    maintenance = await db.fetch("""
        SELECT pid, query, state, now() - query_start as duration
        FROM pg_stat_activity
        WHERE query ILIKE '%vacuum%' 
           OR query ILIKE '%reindex%'
           OR query ILIKE '%pg_dump%'
        ORDER BY duration DESC
    """)
    
    if maintenance:
        logger.info("maintenance_detected", 
                   operations=[{
                       "pid": m["pid"], 
                       "type": m["query"][:50],
                       "duration": str(m["duration"])
                   } for m in maintenance])
    return maintenance
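These checks only help if the scheduler acts on them. A deferral sketch under stated assumptions: should_defer and run_when_quiet are hypothetical glue, the wait parameters are arbitrary, and the check functions are passed in as arguments so this composes with check_database_load and check_for_maintenance above.

```python
import asyncio
from datetime import datetime, timedelta

def should_defer(load_ok: bool, maintenance_ops: list) -> bool:
    """Defer the heavy ETL step if the database is overloaded or a
    backup/vacuum/reindex is still running."""
    return (not load_ok) or bool(maintenance_ops)

async def run_when_quiet(run_step, check_load, check_maint,
                         max_wait_hours: int = 4, poll_seconds: int = 600):
    """Poll the load checks and run the step once the database is quiet."""
    deadline = datetime.now() + timedelta(hours=max_wait_hours)
    while datetime.now() < deadline:
        if not should_defer(await check_load(), await check_maint()):
            return await run_step()
        await asyncio.sleep(poll_seconds)
    # Past the deadline: run anyway rather than miss the batch entirely.
    # That is a policy choice, and one worth alerting on.
    return await run_step()
```

The deadline matters: without it, a Sunday backup that overruns would defer the pipeline forever, trading one failure mode for another.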

7. The “No Data Is Valid Data” Problem

On some Sundays, there legitimately is no data. A B2B SaaS product might have zero transactions on a Sunday. The pipeline needs to distinguish between “no data because it is Sunday” and “no data because the extraction broke.” This is harder than it sounds.

# Track extraction metadata to distinguish empty from broken
@dataclass
class ExtractionResult:
    source: str
    date: str
    record_count: int
    extraction_started: datetime
    extraction_completed: datetime
    source_responded: bool
    source_reported_complete: bool
    
    @property
    def is_empty_but_valid(self) -> bool:
        return (
            self.record_count == 0 
            and self.source_responded 
            and self.source_reported_complete
        )
    
    @property
    def is_suspicious_empty(self) -> bool:
        return (
            self.record_count == 0 
            and (not self.source_responded or not self.source_reported_complete)
        )

async def extract_and_validate(source: str, date: str):
    result = await extract_data(source, date)
    
    if result.is_suspicious_empty:
        alert(
            f"Extraction from {source} for {date} returned 0 records "
            f"but source did not confirm completeness. "
            f"Source responded: {result.source_responded}. "
            f"Investigate before marking this run as successful."
        )
    elif result.is_empty_but_valid:
        logger.info("empty_extraction_valid", source=source, date=date,
                    note="Source confirmed no data for this period")

Building Resilient Pipelines: The Checklist

  1. Use day-of-week-aware thresholds for volume checks and anomaly detection.
  2. Handle maintenance windows with retries and exponential backoff.
  3. Store and process timestamps in UTC. Convert to local time only at the presentation layer.
  4. Implement watermark-based completeness checks instead of assuming data is ready at a fixed time.
  5. Validate schemas defensively with known-column checking and default handling.
  6. Monitor resource contention and avoid running heavy ETL during backup windows.
  7. Distinguish between valid empty data and extraction failures using source metadata.
  8. Make pipelines idempotent. You should be able to rerun any pipeline for any date without duplicating data.
  9. Test with weekend data. Your CI pipeline should include test fixtures that simulate Sunday volumes, DST transitions, and late-arriving data.
  10. Keep a pipeline calendar that maps known irregular events (holidays, DST, month-end, quarter-end) to expected pipeline behavior changes.

The Operational Maturity Model

Level 1 (Ad hoc): no monitoring, manual fixes. Sunday behavior: nobody knows it broke until Monday.
Level 2 (Reactive): basic alerts, manual remediation. Sunday behavior: the on-call gets paged and reruns manually.
Level 3 (Defensive): retries, validation, day-aware thresholds. Sunday behavior: most issues self-heal; real failures page.
Level 4 (Proactive): anomaly detection, completeness checks, auto-remediation. Sunday behavior: the pipeline adjusts to weekend patterns automatically.

Most teams are at level 2. Getting to level 3 usually takes about a week of focused engineering work. Getting to level 4 requires ongoing investment, but it pays for itself in reduced on-call burden and improved data quality.

Data pipelines break on Sundays because engineers build them on Tuesdays. The data on Tuesday is clean, complete, and arrives on time. Sunday data is messy, partial, late, and sometimes entirely absent. If your pipeline cannot handle Sunday gracefully, it is not a production pipeline — it is a Tuesday pipeline that happens to run every day. Build for Sunday, and every other day takes care of itself.

By Michael Sun

Founder and Editor-in-Chief of NovVista. Software engineer with hands-on experience in cloud infrastructure, full-stack development, and DevOps. Writes about AI tools, developer workflows, server architecture, and the practical side of technology. Based in China.
