The Certificate That Expires at 3 AM: A Practical Guide to SSL/TLS Automation
SSL/TLS certificate expiry is one of the most embarrassing and most preventable production outages. LinkedIn, Spotify, Microsoft Teams, and countless smaller organizations have all experienced certificate-related outages — not because the certificate management problem is technically hard, but because manual processes eventually fail. This guide covers the full certificate lifecycle: how ACME and Let’s Encrypt work under the hood, how to automate issuance and renewal, how to manage certificates across multiple servers, and how to set up monitoring that catches problems before users do.
How ACME Works: The Protocol Behind Let’s Encrypt
Let’s Encrypt issues free, publicly-trusted TLS certificates using the ACME protocol (Automatic Certificate Management Environment, RFC 8555). Understanding ACME’s mechanics helps you debug failures and choose the right challenge type for your infrastructure.
The ACME flow has four steps:
- Account registration: Your ACME client generates a keypair and registers with the ACME server (Let’s Encrypt’s Boulder CA). The public key identifies your account.
- Order creation: You request a certificate for one or more domain names. The ACME server creates an order and returns a set of authorization challenges — proofs you must complete to demonstrate control of each domain.
- Challenge completion: You complete one of the offered challenge types (HTTP-01, DNS-01, or TLS-ALPN-01). The ACME server verifies your completion and marks the authorization as valid.
- Certificate issuance: You submit a Certificate Signing Request (CSR) containing your domain names and public key. The CA signs it and returns the certificate chain.
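To make the steps above concrete, the sketch below models the status values an ACME order moves through, as defined in RFC 8555 (pending, ready, processing, valid, invalid), and maps each status to what the client does next. The names OrderStatus, advance, and next_client_action are illustrative only and do not come from any ACME client library.

```python
from enum import Enum


class OrderStatus(Enum):
    PENDING = "pending"        # order created, authorizations still outstanding
    READY = "ready"            # all authorizations valid, CSR may be submitted
    PROCESSING = "processing"  # CSR submitted, CA is issuing the certificate
    VALID = "valid"            # certificate issued and ready to download
    INVALID = "invalid"        # a challenge or the finalization failed


# Forward transitions for an order object (RFC 8555, section 7.1.6).
TRANSITIONS = {
    OrderStatus.PENDING: {OrderStatus.READY, OrderStatus.INVALID},
    OrderStatus.READY: {OrderStatus.PROCESSING, OrderStatus.INVALID},
    OrderStatus.PROCESSING: {OrderStatus.VALID, OrderStatus.INVALID},
    OrderStatus.VALID: set(),
    OrderStatus.INVALID: set(),
}

# What the client does next in each state (hypothetical wording).
NEXT_ACTION = {
    OrderStatus.PENDING: "complete a challenge for each authorization",
    OrderStatus.READY: "finalize the order by submitting the CSR",
    OrderStatus.PROCESSING: "poll the order until issuance finishes",
    OrderStatus.VALID: "download the certificate chain",
    OrderStatus.INVALID: "inspect the error and create a new order",
}


def advance(current: OrderStatus, new: OrderStatus) -> OrderStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new


def next_client_action(status: OrderStatus) -> str:
    """Look up the client-side step that corresponds to an order status."""
    return NEXT_ACTION[status]
```

When a renewal fails, logging the order status at each poll usually shows which of the four steps broke: an order stuck in pending points at challenge validation, while a failure after ready points at the CSR.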
Challenge Types and When to Use Each
HTTP-01: The ACME server expects to find a specific file at http://yourdomain.com/.well-known/acme-challenge/{token}. Your server must be publicly reachable on port 80. This is the simplest challenge for web servers, but it fails for internal services, wildcard certificates, and servers behind strict firewalls.
DNS-01: You create a TXT record at _acme-challenge.yourdomain.com with a specific value, and the ACME server verifies it via DNS lookup. It is the only challenge type Let’s Encrypt accepts for wildcard certificates, and it also works for internal services that are not reachable from the internet. The tradeoff is that it requires API access to your DNS provider, a meaningful security consideration since DNS credentials carry a significant blast radius.
TLS-ALPN-01: Less commonly used, this challenge works entirely over TLS on port 443. Useful when you cannot modify DNS records and cannot serve HTTP traffic, but you can accept TLS connections.
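The "specific file" for HTTP-01 and the "specific value" for DNS-01 are both derived from the challenge token and the account key. The sketch below follows the derivation in RFC 8555 (key authorization) and RFC 7638 (JWK thumbprint) using only the standard library. The function names and the sample key values in the usage comment are made up for illustration; a real client takes the token from the ACME server and the JWK from its own account key.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, the encoding ACME and JOSE use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638: SHA-256 over the required JWK members, sorted, no whitespace."""
    required = {"EC": ("crv", "kty", "x", "y"), "RSA": ("e", "kty", "n")}[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in required},
        sort_keys=True,
        separators=(",", ":"),
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, jwk: dict) -> str:
    """The token joined to the account key thumbprint with a dot."""
    return f"{token}.{jwk_thumbprint(jwk)}"


def http01_path(token: str) -> str:
    """URL path where the HTTP-01 response (the key authorization) is served."""
    return f"/.well-known/acme-challenge/{token}"


def dns01_txt_value(token: str, jwk: dict) -> str:
    """TXT record value for DNS-01: base64url(SHA-256(key authorization))."""
    digest = hashlib.sha256(key_authorization(token, jwk).encode("ascii")).digest()
    return b64url(digest)


# Usage with placeholder values (not a real key or token):
# jwk = {"kty": "EC", "crv": "P-256", "x": "placeholder-x", "y": "placeholder-y"}
# print(http01_path("sample-token"), key_authorization("sample-token", jwk))
# print(dns01_txt_value("sample-token", jwk))
```

Because both values depend on the account key, a challenge response prepared by one ACME account will not validate for another, which matters when several servers share a domain.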
Certbot: The Reference Implementation
Certbot is the EFF’s ACME client and the most widely documented option. For a standalone Nginx or Apache server, it handles the full certificate lifecycle.
# Install certbot with Nginx plugin on Ubuntu/Debian
apt install certbot python3-certbot-nginx
# Obtain and install certificate (Nginx plugin modifies nginx.conf automatically)
certbot --nginx -d example.com -d www.example.com
# Dry run to test renewal without actually renewing
certbot renew --dry-run
# Certbot installs a systemd timer for automatic renewal
systemctl status certbot.timer
# certbot.timer runs twice daily and renews certs expiring within 30 days
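# Manual renewal with a deploy hook, which runs only after a certificate is actually renewed
certbot renew \
    --deploy-hook "systemctl reload nginx"
# In standalone mode, where Certbot binds port 80 itself, stop and restart the web server around the run instead
certbot renew \
    --pre-hook "systemctl stop nginx" \
    --post-hook "systemctl start nginx"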
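The renewal rule described in the comments above (check twice a day, renew once fewer than 30 days remain) can be sketched in a few lines. This is an illustration of the rule as stated, not Certbot's own code, and the names should_renew and attempts_before_expiry are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RENEW_WINDOW = timedelta(days=30)     # renew when less than this remains
CHECK_INTERVAL = timedelta(hours=12)  # the timer fires twice daily


def should_renew(not_after: datetime, now: datetime,
                 window: timedelta = RENEW_WINDOW) -> bool:
    """True when the certificate expires in strictly less than `window`."""
    return not_after - now < window


def attempts_before_expiry(window: timedelta = RENEW_WINDOW,
                           interval: timedelta = CHECK_INTERVAL) -> int:
    """How many scheduled runs fall inside the renewal window."""
    return window // interval


# Usage with an assumed expiry date:
# now = datetime.now(timezone.utc)
# print(should_renew(datetime(2030, 1, 1, tzinfo=timezone.utc), now))
# print(attempts_before_expiry())
```

With these defaults there are 60 scheduled attempts between the start of the renewal window and expiry, which is why a single failed run is harmless and why a certificate that does expire usually indicates that renewal has been failing for weeks.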
For DNS-01 challenges with Certbot, you need a DNS plugin matching your provider. Certbot maintains plugins for most major DNS providers:
# Wildcard certificate via DNS-01 with Cloudflare
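# Install the plugin the same way Certbot itself was installed (apt here; use pip only for a pip-installed Certbot)
apt install python3-certbot-dns-cloudflare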
# Create credentials file (restrict permissions carefully)
cat > /etc/letsencrypt/cloudflare.ini << 'EOF'
dns_cloudflare_api_token = your_api_token_here
EOF
chmod 600 /etc/letsencrypt/cloudflare.ini
# Obtain wildcard certificate
certbot certonly \
    --dns-cloudflare \
    --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
    -d "*.example.com" \
    -d example.com \
    --agree-tos \
    --email admin@example.com
Caddy: Automatic HTTPS Without Configuration
Caddy takes a different approach: HTTPS is automatic by default. Every domain in your Caddyfile gets a certificate from Let's Encrypt (or ZeroSSL) without any explicit certificate configuration. Caddy handles issuance, storage, renewal, and reload automatically.
# Caddyfile: HTTPS is automatic for all these sites
example.com {
    reverse_proxy localhost:8080
}
api.example.com {
    reverse_proxy localhost:3000
    # Rate limiting via a third-party Caddy plugin (requires a custom build that includes it)
    rate_limit {
        zone dynamic {
            key {remote_host}
            events 100
            window 1m
        }
    }
}
# Internal service with a certificate from Caddy's local CA (for non-public domains)
internal.example.com {
    tls internal
    reverse_proxy localhost:9090
}
Caddy stores certificates in its data directory (/var/lib/caddy/.local/share/caddy when it runs as the packaged systemd service) and renews them automatically, about 30 days before expiry for a 90-day certificate. For multi-server deployments, Caddy instances can share certificate storage, either on a shared filesystem or through a storage plugin for a backend such as Redis, which prevents each server from independently requesting certificates for the same domains.
cert-manager: Certificate Automation in Kubernetes
cert-manager is the de facto standard for TLS certificate management in Kubernetes. It introduces Issuer/ClusterIssuer resources to represent certificate authorities and Certificate resources to request specific certificates.
# Install cert-manager via Helm
# (crds.enabled applies to cert-manager v1.15 and later; older chart versions use installCRDs=true)
helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager \
    --create-namespace \
    --set crds.enabled=true
# ClusterIssuer using Let's Encrypt production with DNS-01 challenge (Cloudflare)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
        # Apply this solver only to example.com and subdomains
        selector:
          dnsZones:
            - "example.com"
---
# Request a wildcard certificate
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: production
spec:
  secretName: wildcard-example-com-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - "*.example.com"
    - "example.com"
  # Renew 30 days before expiry
  renewBefore: 720h
cert-manager also integrates with Ingress and Gateway API resources. Add the annotation and cert-manager handles the rest:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: production
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
    - hosts:
        - api.example.com
      secretName: api-example-com-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
Certificate Monitoring: Catching Expiry Before Users Do
Automated renewal handles the happy path. Monitoring handles everything else: renewal failures, certificates that someone installed manually outside your automation, and certificates on third-party services you do not control.
#!/bin/bash
# Check certificate expiry for a list of domains
DOMAINS=(
    "example.com"
    "api.example.com"
    "admin.example.com"
)
WARNING_DAYS=30
CRITICAL_DAYS=7
for domain in "${DOMAINS[@]}"; do
    # Fetch the served certificate and extract its notAfter date
    expiry=$(echo | openssl s_client -servername "$domain" \
        -connect "$domain:443" 2>/dev/null | \
        openssl x509 -noout -enddate 2>/dev/null | \
        cut -d= -f2)
    if [ -z "$expiry" ]; then
        echo "CRITICAL: Cannot connect to $domain"
        continue
    fi
    # Try GNU date first, then fall back to BSD/macOS date
    expiry_epoch=$(date -d "$expiry" +%s 2>/dev/null || \
        date -jf "%b %d %T %Y %Z" "$expiry" +%s)
    now_epoch=$(date +%s)
    days_remaining=$(( (expiry_epoch - now_epoch) / 86400 ))
    if [ "$days_remaining" -lt "$CRITICAL_DAYS" ]; then
        echo "CRITICAL: $domain expires in $days_remaining days ($expiry)"
    elif [ "$days_remaining" -lt "$WARNING_DAYS" ]; then
        echo "WARNING: $domain expires in $days_remaining days ($expiry)"
    else
        echo "OK: $domain expires in $days_remaining days"
    fi
done
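The same check can be written without shelling out to openssl, which makes the threshold logic easier to unit test. The following is a minimal sketch using Python's standard ssl module. The function names classify, days_until, and check are illustrative, and the thresholds mirror WARNING_DAYS and CRITICAL_DAYS from the script above.

```python
import socket
import ssl
import time


def classify(days_remaining: int, warning_days: int = 30,
             critical_days: int = 7) -> str:
    """Map days until expiry to the same severity levels as the shell script."""
    if days_remaining < critical_days:
        return "CRITICAL"
    if days_remaining < warning_days:
        return "WARNING"
    return "OK"


def days_until(not_after: str, now: float) -> int:
    """Whole days between `now` (epoch seconds) and a notAfter string
    in the format getpeercert() returns, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    return int((ssl.cert_time_to_seconds(not_after) - now) // 86400)


def check(domain: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to the domain, read the served certificate, and report status."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((domain, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=domain) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except OSError as exc:  # includes ssl.SSLError and connection failures
        return f"CRITICAL: cannot verify {domain}: {exc}"
    days = days_until(not_after, time.time())
    return f"{classify(days)}: {domain} expires in {days} days ({not_after})"


# Usage: print(check("example.com"))
```

One difference from the openssl pipeline: the default SSL context validates the chain and hostname, so a certificate that has already expired or does not match the domain fails the handshake and is reported as CRITICAL rather than as a negative day count.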
For production monitoring, use dedicated tools rather than cron-based scripts. Checkly and UptimeRobot both offer certificate expiry monitoring with Slack/PagerDuty integration. Prometheus with the blackbox_exporter can monitor certificate expiry as a metric:
# prometheus/blackbox.yml: TLS probe configuration
modules:
  https_cert_check:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: ip4
# Grafana alert rule: fire when certificate expires within 14 days
# Metric: probe_ssl_earliest_cert_expiry - time()
# Condition: < 14 * 24 * 60 * 60 (seconds)
Multi-Server Certificate Distribution
When you run multiple web servers behind a load balancer, you need a strategy for distributing certificates. Three approaches are common:
Terminate TLS at the load balancer: The cleanest approach. AWS ACM, Cloudflare, or a dedicated load balancer handles the certificate. Your backend servers receive plain HTTP. Certificates are managed in one place. The downside is that traffic between the load balancer and backends is unencrypted — acceptable for VPC-internal traffic, problematic for compliance-sensitive environments.
Shared filesystem mount: Let Certbot run on one node, store certificates on a shared NFS/EFS mount, configure all web servers to read from that path. Simple but creates a single point of failure in the NFS mount.
cert-manager with Kubernetes Secrets replication: If your servers run in Kubernetes, cert-manager writes certificates to Secrets, and a secret-syncing tool such as the external-secrets-operator can replicate them across namespaces or clusters.
The Operational Checklist
- Inventory every certificate your organization uses, including those on third-party services, internal services, and client certificates
- Set up monitoring with alerts at 30 days, 14 days, and 7 days before expiry — three separate alert thresholds, escalating severity
- Test renewal in staging before relying on automation in production: run certbot renew --dry-run, or use cert-manager's test (staging) issuer
- Store ACME account private keys in a secrets manager (Vault, AWS Secrets Manager), not on the filesystem
- Document the manual renewal procedure for every automated process — automation fails, and someone needs to know what to do at 2 AM
- Use certificate transparency log monitoring (crt.sh or Facebook's CT monitor) to detect unauthorized certificates issued for your domains
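For the certificate transparency item in the checklist above, a small script can flag certificates issued by an authority you do not expect. The sketch below assumes crt.sh's JSON output (the output=json query parameter) and its issuer_name field. Both are assumptions about a third-party service and may change, and the function names and sample issuer strings are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def crtsh_url(domain: str) -> str:
    """Build a crt.sh query for the domain and all of its subdomains."""
    return "https://crt.sh/?" + urlencode({"q": f"%.{domain}", "output": "json"})


def fetch_entries(domain: str, timeout: float = 30.0) -> list:
    """Download logged certificates for the domain (performs a network call)."""
    with urlopen(crtsh_url(domain), timeout=timeout) as response:
        return json.load(response)


def unexpected_certificates(entries: list, allowed_issuers: list) -> list:
    """Return entries whose issuer matches none of the allowed substrings."""
    return [
        entry for entry in entries
        if not any(allowed in entry.get("issuer_name", "")
                   for allowed in allowed_issuers)
    ]


# Usage (network access required):
# entries = fetch_entries("example.com")
# for entry in unexpected_certificates(entries, ["Let's Encrypt"]):
#     print(entry.get("issuer_name"), entry.get("name_value"))
```

Matching on a substring of the issuer name is deliberately loose, since intermediate names rotate; a stricter check would compare against a maintained list of issuer names or key identifiers.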
Key Takeaways
- ACME's DNS-01 challenge is required for wildcard certificates and is the practical choice for services that are not publicly reachable. Use it with an API-scoped DNS provider token, not your root account credentials.
- Caddy automates HTTPS entirely, making it ideal for new deployments where simplicity outweighs customization needs.
- cert-manager is the standard for Kubernetes certificate management. Integrate with Ingress annotations for the lowest-friction workflow.
- Monitoring must cover certificates outside your automation. A certificate manually installed three years ago on a forgotten subdomain will not renew itself.
- Always have a documented manual renewal procedure. Automation reduces the frequency of manual intervention, not the need to understand how.
