The Problem With Running Processes That Need to Stay Up

Every production server eventually faces the same question: how do you keep a long-running process alive, restart it cleanly after a crash, capture its output, and integrate it into the system lifecycle? The wrong answer is a nohup ./app & in a tmux session. The right answer depends on your stack, your team, and how much control you actually need.

Linux process management is not a single tool problem. systemd is the init system on virtually every modern Linux distribution and handles service management at the OS level. Supervisor is a lightweight Python daemon that manages arbitrary processes with minimal configuration overhead. PM2 is a Node.js-native process manager that adds cluster mode, zero-downtime reloads, and built-in log rotation on top of the basics. Each occupies a distinct position, and each has failure modes that will bite you if you use it outside its strengths.

This guide covers real configuration patterns for all three, a direct feature comparison, and a decision framework based on what you are actually running.


systemd: The OS-Level Standard

systemd is the process supervisor that ships with every major Linux distribution since 2015. If you are running Ubuntu 16.04 or later, Debian 8+, CentOS 7+, or any modern derivative, you already have systemd. Using it for your application processes is the most native approach to Linux process management available.

Writing a Unit File

A systemd service is defined in a unit file, typically placed in /etc/systemd/system/ for system-managed services or ~/.config/systemd/user/ for user-scoped services. Here is a minimal but production-ready example for a Python API:

[Unit]
Description=MyApp Python API
After=network.target postgresql.service
Wants=postgresql.service
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
Type=simple
User=appuser
Group=appuser
WorkingDirectory=/opt/myapp
EnvironmentFile=/opt/myapp/.env
ExecStart=/opt/myapp/venv/bin/python app.py
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

The Type directive is where most beginners make mistakes. Type=simple tells systemd that the process started by ExecStart is the main process. systemd tracks it directly. Use this for processes that do not fork. Type=forking is for traditional Unix daemons that call fork() and exit the parent. If you set Type=simple on a forking process, systemd will consider the service started when it launches the parent, then mark it failed when the parent exits, even if the child is running fine. If you are unsure, run strace -f -e trace=%process yourprocess and watch for fork/clone calls at startup.

Type=notify is worth knowing: processes that use sd_notify() from the systemd library can signal readiness explicitly. This avoids the race condition where systemd considers a service ready before it has bound to its port.
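For services that cannot link against libsystemd, the notify protocol itself is simple: a datagram containing READY=1 sent to the Unix socket named in the NOTIFY_SOCKET environment variable. A minimal Python sketch of that wire protocol (the function name notify_ready is ours, not a systemd API):

```python
import os
import socket

def notify_ready():
    """Send READY=1 to systemd's notify socket (for Type=notify services).

    Returns False when not running under systemd (NOTIFY_SOCKET unset).
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # "@" prefix means the abstract socket namespace (leading NUL byte)
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"READY=1", addr)
    return True
```

Call notify_ready() only after the application has actually bound its port and is able to serve traffic; that is the whole point of Type=notify.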

Restart Policies

The Restart= directive controls when systemd will attempt a restart:

  • no — never restart (default, appropriate only for one-shot tasks)
  • on-failure — restart only on non-zero exit codes or signals; does not restart on clean exits or SIGTERM
  • always — restart regardless of exit status; use with caution on services that can exit cleanly
  • on-abnormal — restart on unclean signals, core dumps, watchdog timeouts, or operation timeouts; never triggered by an exit code, clean or otherwise

Pair these with StartLimitBurst and StartLimitIntervalSec (which belong in the [Unit] section on current systemd, though older versions also accepted them under [Service]) to prevent infinite restart loops on a consistently broken application. The example above allows three restarts within sixty seconds before systemd gives up and marks the unit as failed. You will need systemctl reset-failed myapp to clear that state.
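The rate-limiting behavior is easiest to reason about as a sliding window over recent start times. This Python sketch models the logic (our illustration of the semantics, not systemd's actual implementation):

```python
from collections import deque

def should_restart(history, now, interval_sec=60, burst=3):
    """Illustrative model of systemd start rate limiting.

    Allow a restart only if fewer than `burst` starts happened
    in the last `interval_sec` seconds; otherwise the unit would
    enter the failed state until reset-failed clears it.
    """
    # Drop start timestamps that have aged out of the window
    while history and now - history[0] > interval_sec:
        history.popleft()
    if len(history) >= burst:
        return False  # limit hit: systemd stops trying
    history.append(now)
    return True
```

With the example's values, three crashes inside a minute exhaust the budget; a crash after a quiet hour restarts normally because the window has drained.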

Reading Logs With journalctl

Because stdout and stderr are captured by the journal when StandardOutput=journal, you get structured log access without configuring a separate log file:

# Follow live output
journalctl -u myapp -f

# Last 100 lines
journalctl -u myapp -n 100

# Since last boot
journalctl -u myapp -b

# Between timestamps
journalctl -u myapp --since "2026-03-20 08:00:00" --until "2026-03-20 09:00:00"

# Output in JSON for log shipping
journalctl -u myapp -o json

One common pitfall: with the default Storage=auto, the journal persists to /var/log/journal/ only if that directory already exists; otherwise logs live in volatile memory under /run/log/journal/ and vanish on reboot. If logs disappear after a reboot, either create /var/log/journal/ or set Storage=persistent in /etc/systemd/journald.conf.
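A hedged example of forcing persistent storage with bounded disk usage; the size and retention values here are placeholders to tune for your environment:

```ini
; /etc/systemd/journald.conf (excerpt)
[Journal]
Storage=persistent
SystemMaxUse=1G
MaxRetentionSec=1month
```

Restart the journal daemon with systemctl restart systemd-journald for the change to take effect.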

When systemd Falls Short

systemd requires root privileges (or sudo) to manage system-level units. On shared hosting environments or containers where you lack root, this is a hard blocker. Unit file syntax is also verbose — managing fifty microservices each with a slightly different environment file becomes tedious fast. And if you need per-process log rotation with different retention policies, you are reaching for logrotate as a separate tool. Finally, systemd has no concept of process groups with shared configuration; each service is an independent unit file.


Supervisor: Pragmatic Process Control Without Root

Supervisor (supervisord) is a Python-based process manager that has been a staple of Django and Flask deployments for over a decade. It does not replace the init system — it runs as a daemon itself — but it provides a simpler interface for managing a collection of related processes under a single configuration umbrella.

The supervisord.conf Structure

Install with pip install supervisor or apt install supervisor. The main configuration file at /etc/supervisor/supervisord.conf handles the daemon itself; individual programs go in /etc/supervisor/conf.d/:

; /etc/supervisor/conf.d/myapp.conf

[program:myapp-web]
command=/opt/myapp/venv/bin/gunicorn -w 4 -b 0.0.0.0:8000 wsgi:app
directory=/opt/myapp
user=appuser
autostart=true
autorestart=true
startsecs=5
startretries=3
stopwaitsecs=30
stdout_logfile=/var/log/myapp/web.out
stdout_logfile_maxbytes=50MB
stdout_logfile_backups=5
stderr_logfile=/var/log/myapp/web.err
stderr_logfile_maxbytes=50MB
stderr_logfile_backups=5
environment=FLASK_ENV="production",DATABASE_URL="postgresql://..."

[program:myapp-worker]
command=/opt/myapp/venv/bin/celery -A tasks worker --loglevel=info
directory=/opt/myapp
user=appuser
autostart=true
autorestart=true
startsecs=10
stdout_logfile=/var/log/myapp/worker.out
stderr_logfile=/var/log/myapp/worker.err

The startsecs parameter defines the window during which a process must stay up before Supervisor considers it successfully started. If the process exits before that window closes, it counts as a failed start. Set this to a value longer than your application’s startup time to avoid false failure counts.

Process Groups

Supervisor supports grouping related programs so you can manage them together:

[group:myapp]
programs=myapp-web,myapp-worker
priority=999

With a group defined, supervisorctl stop myapp:* stops all programs in the group with a single command. This is genuinely useful for deployments: stop the whole group, update code, restart the whole group. No need to remember every individual process name.

Managing With supervisorctl

# Reload configuration without restarting running processes
supervisorctl reread
supervisorctl update

# Status of all processes
supervisorctl status

# Restart a single program
supervisorctl restart myapp-web

# Tail logs interactively
supervisorctl tail -f myapp-web stdout

One practical pitfall: supervisorctl reread only reads new configuration; it does not apply changes to existing programs. You need supervisorctl update after a reread to start newly added programs or stop removed ones. This trips up almost everyone the first time.

Supervisor’s built-in log rotation via stdout_logfile_maxbytes and stdout_logfile_backups is simple but limited. It does not compress rotated files and does not support time-based rotation. For anything beyond basic size-based rotation, you still need logrotate.
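If you hand rotation to logrotate instead, disable Supervisor's own rotation (stdout_logfile_maxbytes=0) and use copytruncate so Supervisor's open file handles stay valid across rotations. A sketch, assuming the log paths from the example above:

```
# /etc/logrotate.d/myapp (assumed paths from the Supervisor config above)
/var/log/myapp/*.out /var/log/myapp/*.err {
    daily
    rotate 14
    compress
    delaycompress
    copytruncate
    missingok
    notifempty
}
```

copytruncate loses any lines written during the copy window; if that matters, the alternative is signaling the process to reopen its log files, which Supervisor does not support natively.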


PM2: Process Management for Node.js Workloads

PM2 is built specifically for Node.js, though it can run any executable. Its killer feature is cluster mode, which forks your Node.js application across all available CPU cores using the Node.js cluster module — with built-in load balancing and zero-downtime reloads.

The ecosystem.config.js File

The declarative configuration approach in PM2 centers on ecosystem.config.js:

module.exports = {
  apps: [
    {
      name: 'api-server',
      script: './src/server.js',
      instances: 'max',          // fork one per CPU core
      exec_mode: 'cluster',
      watch: false,
      max_memory_restart: '500M',
      env: {
        NODE_ENV: 'production',
        PORT: 3000
      },
      log_date_format: 'YYYY-MM-DD HH:mm:ss Z',
      error_file: '/var/log/myapp/api-error.log',
      out_file: '/var/log/myapp/api-out.log',
      merge_logs: true,
      restart_delay: 4000,
      exp_backoff_restart_delay: 100
    },
    {
      name: 'background-worker',
      script: './src/worker.js',
      instances: 2,
      exec_mode: 'fork',
      cron_restart: '0 2 * * *',  // restart nightly at 2am
      max_memory_restart: '300M'
    }
  ]
}

The instances: 'max' with exec_mode: 'cluster' combination is where PM2 earns its place in Node.js production stacks. A reload with pm2 reload api-server cycles through workers one at a time, keeping the application available throughout the process. Compare this to a systemd systemctl restart, which sends SIGTERM to all processes simultaneously and leaves a gap in service.

Cluster Mode Trade-offs

Cluster mode requires your application to be stateless. Any in-memory session storage, in-process caching, or shared mutable state breaks immediately across workers. If you are using sticky sessions, those have to be handled at the load balancer level, not in the application. This is not a PM2 limitation — it is the standard constraint of horizontal scaling — but PM2 makes it easy to hit this wall without realizing it.

The exp_backoff_restart_delay setting enables exponential backoff on restarts. Starting at 100ms, each subsequent restart doubles the delay up to a maximum of 15 seconds. This prevents crash-loop flooding on a broken deployment from hammering your database or downstream services.
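The delay schedule described above is a doubling series with a cap. A quick Python model of that schedule (illustrative only; PM2's internal increment strategy may differ in detail):

```python
def backoff_delays(initial_ms=100, cap_ms=15000, restarts=10):
    """Model of exponential restart backoff: double each time, capped."""
    delay = initial_ms
    schedule = []
    for _ in range(restarts):
        schedule.append(delay)
        delay = min(delay * 2, cap_ms)
    return schedule
```

Ten consecutive crashes under this model produce delays of 100, 200, 400, 800, 1600, 3200, 6400, 12800 ms, then hold at the 15-second cap, so a crash loop settles into roughly four restart attempts per minute instead of hundreds.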

Log Rotation

PM2’s log rotation is handled through a separate module:

pm2 install pm2-logrotate
pm2 set pm2-logrotate:max_size 50M
pm2 set pm2-logrotate:retain 7
pm2 set pm2-logrotate:compress true
pm2 set pm2-logrotate:rotateInterval '0 0 * * *'

Unlike Supervisor’s built-in rotation, pm2-logrotate supports compression and both size-based and time-based rotation. The configuration is stored in PM2’s own data store, which means it survives restarts but is not tracked in your version control unless you export it explicitly.

Startup Integration

PM2 bridges to systemd for boot persistence:

pm2 start ecosystem.config.js
pm2 save
pm2 startup systemd

The pm2 startup systemd command generates a systemd unit that runs PM2 itself as a system service. PM2 then manages your Node processes as children. This is a useful hybrid approach, though it adds a layer of indirection that complicates debugging when something breaks at boot.


Direct Comparison: systemd vs Supervisor vs PM2

Feature                  | systemd                    | Supervisor                 | PM2
Requires root            | Yes (system units)         | No (runs as any user)      | No
Config format            | INI-style unit files       | INI-style .conf files      | JavaScript / JSON / YAML
Cluster / multi-instance | Manual (templates)         | Manual (multiple programs) | Native (cluster mode)
Zero-downtime reload     | Requires custom logic      | No                         | Yes (reload command)
Log management           | journald (structured)      | File-based (size rotation) | File-based + pm2-logrotate
Process groups           | Targets (indirect)         | Native groups              | Apps array in config
Memory restart           | No native support          | No native support          | max_memory_restart
Language affinity        | Language-agnostic          | Language-agnostic          | Node.js optimized
Container suitability    | Poor (init conflict)       | Good                       | Good
Monitoring integration   | systemd-exporter, journald | HTTP status endpoint       | pm2 monit, Keymetrics

When to Use Which Tool

Choose systemd When

  • You are deploying a service that needs to start at boot before any user logs in (databases, network daemons, infrastructure services)
  • You want deep integration with the OS service lifecycle, including dependency ordering via After= and Requires= directives
  • You need structured, searchable logs that integrate with your existing journald or log shipping pipeline
  • Your team is comfortable with Linux administration and already manages other system services this way
  • The application is written in any language and you want a single, consistent management interface across your entire server

Choose Supervisor When

  • You are on a shared server, VPS with restricted permissions, or an environment where you cannot install systemd units
  • You have a Python web application (Django, Flask, FastAPI) with associated background workers that need to be managed as a logical unit
  • You want simpler configuration than systemd without sacrificing the ability to run as a non-root user
  • You need to manage a heterogeneous group of processes — a web server, a queue worker, and a scheduler — under one config file and restart them together during deployments

Choose PM2 When

  • Your workload is Node.js and you want to use all available CPU cores without managing a reverse proxy and multiple port assignments manually
  • Zero-downtime reloads are a hard requirement and you cannot afford even a brief gap during deployments
  • You want memory-based auto-restart as a safety valve against memory leaks in long-running Node processes
  • Your team prefers JavaScript-native configuration over INI files

Monitoring and Restart Strategies in Practice

Restart policies alone do not constitute a monitoring strategy. A process that restarts every five seconds is not healthy — it is broken and papering over the failure. Complement your process manager with actual health checks.

For systemd, the watchdog mechanism provides an application-level health check. Set WatchdogSec=30 in your unit file and have your application call sd_notify(0, "WATCHDOG=1") periodically. If the application fails to check in, systemd kills and restarts it. This catches deadlocks and stuck event loops that do not cause a process exit.

For applications behind Supervisor or PM2, a common pattern is a lightweight health check sidecar — a small script or service that hits the application’s /health endpoint and calls supervisorctl restart or pm2 restart if it fails. This is crude but effective for applications that run but stop responding.
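A minimal version of such a sidecar in Python; the URL and the restart command are assumptions to adapt, and you would run this from cron or a small loop:

```python
import subprocess
import urllib.request

def check_health(url, timeout=5):
    """Return True if the health endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Connection refused, timeout, DNS failure, non-2xx: all unhealthy
        return False

def restart_if_unhealthy(url, restart_cmd):
    """Kick the process manager when the health check fails."""
    if not check_health(url):
        subprocess.run(restart_cmd, check=False)

# Hypothetical usage:
# restart_if_unhealthy("http://127.0.0.1:8000/health",
#                      ["supervisorctl", "restart", "myapp-web"])
```

Keep the timeout short and consider requiring two consecutive failures before restarting, so a single slow request does not bounce a healthy process.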

PM2’s max_memory_restart is not a substitute for fixing memory leaks, but it is a practical safety net. Node.js applications with even minor leaks will accumulate memory over days or weeks of uptime. Setting a restart threshold of 500MB–1GB catches this before it triggers the OOM killer, which is far more disruptive.

Common Pitfalls Across All Three

  • Environment variable inheritance: All three tools isolate the environment from the interactive shell. Variables set in ~/.bashrc or ~/.profile are not available. Always declare environment variables explicitly in EnvironmentFile= (systemd), environment= (Supervisor), or env: (PM2).
  • Working directory: Relative paths in your application that work fine when run interactively will break under a process manager if WorkingDirectory / directory is not set explicitly. Always use absolute paths or set the working directory.
  • Signal handling: Make sure your application handles SIGTERM gracefully. systemd sends SIGTERM by default and waits TimeoutStopSec (default 90s) before sending SIGKILL. Supervisor uses SIGTERM then SIGKILL after stopwaitsecs. PM2’s reload starts replacement workers first, then sends SIGINT to the old ones. An application that ignores termination signals will always be killed hard, interrupting in-flight requests.
  • Log buffering: Many runtimes buffer stdout when it is not connected to a terminal. Python block-buffers stdout under a process manager unless you set PYTHONUNBUFFERED=1 or run python -u. Node.js writes to stdout unbuffered by default, but libraries may not. Missing logs under a process manager is often a buffering issue, not a configuration error.

Practical Decision Making for Your Stack

The choice between these tools is not purely technical — it also reflects operational complexity. systemd requires Linux administration knowledge and root access. Supervisor requires Python on the host and a comfort level with its quirky reload workflow. PM2 adds Node.js as a server-side dependency even for non-Node applications if you want its full feature set.

For a typical production server running a single application stack, the most maintainable approach is to use one tool consistently rather than mixing all three. A common architecture that works well: systemd manages the Supervisor daemon itself (and any system-level dependencies like PostgreSQL), while Supervisor manages the application processes. This keeps OS-level concerns in systemd and application-level process management in Supervisor, without requiring root for day-to-day restarts.

For Node.js-heavy shops deploying to dedicated servers, PM2 with a systemd unit at the base layer is the standard approach. The PM2 ecosystem.config.js file lives in version control, changes are reviewed like code, and the systemd unit at the bottom provides boot persistence without complexity.

The worst outcome is not choosing any of these — it is running production processes in screen sessions, using & backgrounding, or relying on manual intervention to restart a crashed service at 3am. Any of these three tools solves that problem. The right choice depends on what you are already running, who is maintaining it, and how much operational surface area you want to carry.

Start with systemd if you have root access and are running a standard Linux server. Move to Supervisor when you need simpler group management or lack root. Add PM2 when your stack is Node.js and you need cluster mode or zero-downtime reloads. Every choice here is reversible, and the configuration examples above give you a working starting point for all three.

By Michael Sun

Founder and Editor-in-Chief of NovVista. Software engineer with hands-on experience in cloud infrastructure, full-stack development, and DevOps. Writes about AI tools, developer workflows, server architecture, and the practical side of technology. Based in China.
