Slack alerts when cron fails — minimal setup

Cron jobs are the unsung heroes of many systems. They handle everything from daily backups and data synchronization to report generation and cache invalidation. But because they run silently in the background, often with their output redirected to /dev/null, they're notoriously difficult to monitor effectively. A cron job that silently fails or, worse, stops running altogether, can lead to stale data, broken features, or missed critical operations without you even knowing until it's too late.

This article will show you how to set up robust Slack alerts for your cron jobs using a heartbeat monitoring approach, focusing on minimal setup and practical considerations. We'll ditch complex log parsing for a simpler, more reliable method that tells you not just when your job failed, but crucially, when it didn't run at all.

The Problem: Silent Cron Failures

Imagine you have a critical cron job that archives old database records every night. It's been running fine for months. Then, one day, an underlying dependency changes, or a disk fills up, or the script itself gets a subtle bug.

What happens?

* The script might exit with an error code, but if you're not checking /var/log/syslog or parsing its output, you won't know.
* The script might hang indefinitely, consuming resources and never completing.
* The cron daemon itself might fail to start the job due to a misconfiguration or system issue.
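
A typical "fire and forget" crontab entry makes all of these invisible. The entry below is only a sketch (the script name is a placeholder for the archival job described above): everything the job prints, including its error messages, is discarded.

```
# Runs nightly at 03:00; stdout and stderr are both thrown away,
# so a failing or hanging job never surfaces anywhere.
0 3 * * * /usr/local/bin/archive_old_records.sh > /dev/null 2>&1
```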

In all these scenarios, your system continues to operate, but with a growing problem brewing silently in the background. Days or weeks later, you discover your database is bloated, your backups are missing, or your reports are based on outdated information. This "silent failure" is a common and dangerous blind spot in many operational setups.

Traditional monitoring often focuses on application health or server metrics. While important, these don't directly tell you if a specific, scheduled task actually completed. Log parsing can help, but it adds complexity, requires robust log aggregation, and still doesn't tell you if the job never even started.

The Heartbeat Approach to Cron Monitoring

The solution lies in a simple, proactive approach: heartbeat monitoring. Instead of waiting for an error to manifest and then trying to detect it, your cron job actively reports its status to a monitoring service. If the service doesn't receive this "heartbeat" within an expected timeframe, it assumes something is wrong and alerts you.

Here's how it works:

1. You define a monitor for each cron job in a service like Heartfly. This monitor has an expected interval (e.g., "every 24 hours") and a grace period (e.g., "allow 15 minutes late").
2. Heartfly provides you with unique "heartbeat" URLs for this monitor. These are simple HTTP endpoints.
3. Your cron job makes an HTTP GET request to these URLs at key points:
   * When it starts
   * When it successfully finishes
   * When it fails (optionally)
4. Heartfly keeps track of these pings. If a "success" ping doesn't arrive within the expected interval + grace period, it triggers an alert (e.g., to Slack). If a "start" ping arrives but no "success" ping follows, it can alert that the job is running late or hung.
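
For a job that doesn't need the start/fail distinction, step 3 can be a single curl call appended to the existing crontab entry, with no changes to the script itself. The entry below is a sketch: the schedule and script path are placeholders, and YOUR_MONITOR_ID stands in for the ID Heartfly assigns to the monitor.

```
# Ping the monitor's success URL only if the job itself exits 0.
0 3 * * * /usr/local/bin/archive_old_records.sh && curl -fsS -m 10 "https://cron2.91-99-176-101.nip.io/ping/YOUR_MONITOR_ID/success" > /dev/null
```

If the job fails or never runs, no ping is sent, the interval plus grace period expires, and the Slack alert fires. That covers the "never even started" case that log parsing misses.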

This method is robust because it covers every common failure mode:

* The job didn't start at all.
* The job started but crashed immediately.
* The job started but hung indefinitely.
* The job finished, but with an error (if you send a "fail" ping).

Setting Up a Basic Heartbeat with curl (Example 1)

Let's say you have a nightly shell script, backup.sh, that archives some files and uploads them to S3. You want to be alerted if this backup doesn't complete successfully.

First, you'd create a monitor in Heartfly. Let's assume it provides you with three URLs:

* https://cron2.91-99-176-101.nip.io/ping/YOUR_MONITOR_ID/start
* https://cron2.91-99-176-101.nip.io/ping/YOUR_MONITOR_ID/success
* https://cron2.91-99-176-101.nip.io/ping/YOUR_MONITOR_ID/fail
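
Before wiring these into the script, it's worth confirming they're reachable from the machine that will run the job. A quick manual check (substituting your real monitor ID) looks like this:

```bash
# Simulate one run by hand: a start ping followed by a success ping.
curl -fsS -m 10 "https://cron2.91-99-176-101.nip.io/ping/YOUR_MONITOR_ID/start"
curl -fsS -m 10 "https://cron2.91-99-176-101.nip.io/ping/YOUR_MONITOR_ID/success"
```

If both requests succeed, any alert you get later points at the job itself rather than at network or firewall issues between the host and Heartfly.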

Now, integrate these into your backup.sh script:

```bash
#!/bin/bash
set -euo pipefail

# --- Configuration ---
S3_BUCKET="s3://your-backup-bucket"
BACKUP_DIR="/var/lib/data"
BACKUP_FILENAME="data_backup_$(date +%Y%m%d%H%M%S).tar.gz"

# --- Heartfly URLs (replace with your actual URLs) ---
HEARTFLY_START_URL="https://cron2.91-99-176-101.nip.io/ping/YOUR_MONITOR_ID/start"
HEARTFLY_SUCCESS_URL="https://cron2.91-99-176-101.nip.io/ping/YOUR_MONITOR_ID/success"
HEARTFLY_FAIL_URL="https://cron2.91-99-176-101.nip.io/ping/YOUR_MONITOR_ID/fail"

# Function to send pings. A failed ping must never break the backup itself,
# so all output is discarded, failures are ignored, and the call is time-limited.
send_ping() {
  # -f: return a non-zero exit code on HTTP errors (4xx/5xx)
  # -sS: silent mode (no progress meter), but still report transport errors
  # -m 10: give up if the request takes longer than 10 seconds
  # || true: ignore ping failures so 'set -e' doesn't abort the script
  curl -fsS -m 10 "$1" &>/dev/null || true
}

# Trap function to send a 'fail' ping if the script exits with a non-zero status
# This executes before the script actually exits
trap 'exit_code=$?; if [ $exit_code -ne 0 ]; then send_ping "$HEARTFLY_FAIL_URL?exit_code=$exit_code"; fi' EXIT

echo "Starting backup at $(date)"
send_ping "$HEARTFLY_START_URL"

# --- Main backup logic ---
tar -czf "/tmp/$BACKUP_FILENAME" "$BACKUP_DIR"
echo "Backup created: /tmp/$BACKUP_FILENAME"

aws s3 cp "/tmp/$BACKUP_FILENAME" "$S3_BUCKET/$BACKUP_FILENAME"
echo "Backup uploaded to S3."

rm "/tmp/$BACKUP_FILENAME"
echo "Local backup file removed."

echo "Backup completed successfully at $(date)"
send_ping "$HEARTFLY_SUCCESS_URL"

Explanation: * `set -eu