How Watchdog Timer Detects a System Crash: 5 Powerful Ways to Improve System Reliability

On: September 30, 2025

Discover how a watchdog timer detects a system crash in embedded systems. Learn its working principle, benefits, and C++ implementation.

When we talk about reliability in embedded systems, one term that always comes up is the Watchdog Timer (WDT). A watchdog timer is a crucial hardware or software component that helps detect system crashes, hangs, or unexpected behavior in microcontrollers and operating systems.

In this article, we’ll explore how a watchdog timer detects system crashes, why it’s important, and how it keeps embedded devices running smoothly.


What is a Watchdog Timer?

A watchdog timer is a special timer built into most microcontrollers and processors. Its job is simple:

  • It counts down from a predefined value.
  • The running program must regularly reset, or “kick,” the watchdog before it reaches zero.
  • If the program fails to do so, the watchdog assumes the system has crashed and automatically resets the device.

This makes the watchdog timer a self-recovery mechanism that improves system stability.

How Does a Watchdog Timer Detect a System Crash?

The watchdog timer doesn’t actually “see” a crash. Instead, it detects problems by monitoring the absence of activity. Let’s break it down step by step:

  1. Normal Operation
    The application code runs as expected. After completing critical tasks, it sends a signal to refresh the watchdog.
  2. Watchdog Refresh (Petting the Dog)
    Each refresh tells the watchdog: “The system is alive and working fine.”
  3. System Crash or Hang
    If the system enters an infinite loop, deadlock, or freeze, the refresh signal never reaches the watchdog.
  4. Timeout Occurs
    Once the countdown reaches zero without being reset, the watchdog flags a system failure.
  5. Automatic Reset
    The watchdog responds by resetting the system or triggering an interrupt, restoring normal operation.

In short: If the watchdog isn’t fed in time, it assumes the system is dead.
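
The five steps above can be modelled in a few lines of C++. This is a minimal software model of a countdown watchdog, not hardware code: the class name CountdownWatchdog, its tick()/kick() methods, and the simulate() helper are hypothetical names chosen for illustration.

```cpp
#include <cstdint>

// Minimal model of a countdown watchdog (illustrative; names are hypothetical).
class CountdownWatchdog {
public:
    explicit CountdownWatchdog(std::uint32_t reload)
        : reload_(reload), counter_(reload) {}

    // Step 2: the application refreshes the watchdog, restarting the countdown.
    void kick() { if (!expired_) counter_ = reload_; }

    // Called once per timer tick (in hardware this is driven by a clock).
    // Steps 4-5: when the counter reaches zero, the timeout is latched.
    void tick() {
        if (expired_) return;
        if (counter_ > 0) --counter_;
        if (counter_ == 0) expired_ = true;
    }

    bool expired() const { return expired_; }
    std::uint32_t remaining() const { return counter_; }

private:
    std::uint32_t reload_;
    std::uint32_t counter_;
    bool expired_ = false;
};

// Runs up to `ticks` timer ticks. The application kicks every `kick_period`
// ticks, but only during the first `healthy_ticks` ticks (after that it "hangs").
// Returns the tick at which the watchdog expired, or -1 if it never did.
int simulate(std::uint32_t reload, int ticks, int kick_period, int healthy_ticks) {
    CountdownWatchdog wdt(reload);
    for (int t = 1; t <= ticks; ++t) {
        if (t <= healthy_ticks && t % kick_period == 0) wdt.kick();
        wdt.tick();
        if (wdt.expired()) return t;
    }
    return -1;
}
```

With a reload value of 5 ticks and a kick every 2 ticks, the model never expires. If the kicks stop after tick 10, the counter runs down and the timeout is latched at tick 14.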

Why Watchdog Timers Are Essential

Watchdog timers play a critical role in:

  • Detecting CPU hangs – when the processor stops executing instructions.
  • Identifying infinite loops – when software is stuck repeating the same code.
  • Catching deadlocks – when processes block each other and nothing moves forward.
  • Limiting the effects of memory corruption – when a corrupted pointer or stack sends execution off its normal path, so the refresh is never reached.

In industries like automotive, aerospace, IoT devices, and medical electronics, watchdog timers are standard practice, and safety-critical designs typically require them to ensure system reliability and safety.

Real-World Example

Imagine a smart home IoT device controlling lights and sensors. If the software crashes and stops responding, the watchdog detects the missing refresh signal and resets the device automatically. This ensures that your smart device doesn’t stay frozen — it recovers by itself.

The following C++ examples show different ways to implement or use a watchdog:

  • A portable software watchdog (pure C++ for apps & tests)
  • A Linux hardware watchdog example using /dev/watchdog (real device reset)
  • A microcontroller (STM32) IWDG example (embedded/HAL-style pseudo-code)

Portable software watchdog (pure C++)

Use this when you don’t have hardware WDT or for testing. A watchdog thread expects a periodic keepalive from the monitored worker; if not received within a timeout, it performs a user-provided recovery action (e.g., restart task, log, exit).

// file: software_watchdog.cpp
// Build: g++ -std=c++17 -pthread software_watchdog.cpp -o software_watchdog

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

class SoftwareWatchdog {
public:
    SoftwareWatchdog(std::chrono::milliseconds timeout, std::function<void()> on_timeout)
        : timeout_(timeout), on_timeout_(on_timeout), running_(false) {}

    ~SoftwareWatchdog() { stop(); }

    void start() {
        running_ = true;
        watchdog_thread_ = std::thread([this]() { this->watcher_loop(); });
    }

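    void stop() {
        {
            // Change the flag under the mutex so the watcher cannot miss the wakeup
            std::lock_guard<std::mutex> lk(mutex_);
            running_ = false;
        }
        cv_.notify_all();
        if (watchdog_thread_.joinable()) watchdog_thread_.join();
    }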

    // Call this from monitored code to "kick" the watchdog
    void kick() {
        std::lock_guard<std::mutex> lk(mutex_);
        last_kick_ = std::chrono::steady_clock::now();
        cv_.notify_all();
    }

private:
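    void watcher_loop() {
        std::unique_lock<std::mutex> lk(mutex_);
        last_kick_ = std::chrono::steady_clock::now();

        while (running_) {
            // Sleep until the deadline derived from the most recent kick.
            // A kick, a stop() call, or a spurious wakeup ends the wait early;
            // in each case the loop re-checks running_ and recomputes the deadline.
            const auto deadline = last_kick_ + timeout_;
            if (cv_.wait_until(lk, deadline) == std::cv_status::timeout
                && running_
                && std::chrono::steady_clock::now() - last_kick_ >= timeout_) {
                // timed out => no kick received in timeout_ period
                running_ = false; // stop further checks by default
                lk.unlock();      // run the callback without holding the mutex
                try { on_timeout_(); } catch (...) {}
                return;
            }
            // else we were kicked or stopped; loop and re-evaluate
        }
    }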

    std::chrono::milliseconds timeout_;
    std::function<void()> on_timeout_;
    std::thread watchdog_thread_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::chrono::steady_clock::time_point last_kick_;
    std::atomic<bool> running_;
};

// Demo: a worker that occasionally hangs
int main() {
    SoftwareWatchdog wdt(2000ms, []() {
        std::cerr << "[WDT] Timeout! Performing recovery action (exit)\n";
        // Recovery action: we could restart threads, restart service, or exit.
        std::exit(EXIT_FAILURE);
    });

    wdt.start();

    std::thread worker([&wdt]() {
        for (int i = 0; i < 10; ++i) {
            std::this_thread::sleep_for(500ms);
            wdt.kick(); // normal operation: kick every 500ms
            std::cout << "Worker: tick " << i << "\n";
        }

        std::cout << "Worker: simulating hang now (no more kicks)\n";
        std::this_thread::sleep_for(10s); // hang longer than watchdog timeout
    });

    worker.join();
    wdt.stop();
    return 0;
}

Notes

  • This is NOT a hardware reset — it’s a software-only mechanism. Useful for services to self-monitor and attempt graceful recovery.
  • Replace on_timeout_() with your actual recovery logic.

Linux hardware watchdog (/dev/watchdog) — C++ with POSIX

This talks to a kernel watchdog driver. If you stop kicking it, the device will reset the whole machine (hardware reset). Use with caution (run on a VM or test board).

// file: linux_watchdog.cpp
// Build: g++ -std=c++17 linux_watchdog.cpp -o linux_watchdog
// Run as root: sudo ./linux_watchdog

#include <chrono>
#include <cstdio>
#include <iostream>
#include <thread>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main() {
    const char *dev = "/dev/watchdog";
    int fd = open(dev, O_RDWR);
    if (fd < 0) {
        std::perror("open /dev/watchdog");
        return 1;
    }

    // Optional: query or set timeout (uses linux/watchdog.h)
    int timeout = 10; // seconds
    if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout) < 0) {
        std::perror("WDIOC_SETTIMEOUT");
        // Not fatal — continue with kernel default
    } else {
        std::cout << "Watchdog timeout set to " << timeout << " seconds\n";
    }

    // Keepalive loop — write or ioctl to keep alive periodically
    for (int i = 0; i < 30; ++i) {
        // Send keepalive
        int dummy = 0;
        if (ioctl(fd, WDIOC_KEEPALIVE, &dummy) < 0) {
            std::perror("WDIOC_KEEPALIVE");
            close(fd);
            return 1;
        }
        std::cout << "Kicked watchdog (" << i << ")\n";
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    std::cout << "Stopping kicks — system will reset after timeout if this device is real\n";

    // On drivers that support "magic close", closing the file descriptor without
    // first writing 'V' leaves the watchdog armed, so the machine resets once the
    // timeout expires. Writing 'V' just before close() disarms it, unless the
    // kernel was built with CONFIG_WATCHDOG_NOWAYOUT.
    // Uncomment the following block to disarm the watchdog on exit.
    /*
    if (write(fd, "V", 1) != 1) {
        std::perror("write V to /dev/watchdog");
    } else {
        std::cout << "Watchdog disarmed with magic 'V'\n";
    }
    */
    close(fd);
    return 0;
}

Important warnings

  • Running this on a real system can reboot the machine. Test on a dev board or VM.
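  • Opening /dev/watchdog normally requires root (or equivalent permissions on the device node). Behavior on close is driver-dependent: drivers that support the “magic close” feature keep the watchdog armed unless the character V is written just before closing, and kernels built with CONFIG_WATCHDOG_NOWAYOUT cannot disarm it at all.
  • Use the WDIOC_SETTIMEOUT, WDIOC_GETTIMEOUT, and WDIOC_KEEPALIVE ioctl calls, and make sure the <linux/watchdog.h> header is available (typical on Linux).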
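
Because behavior is driver-dependent, it can help to ask the driver what it supports before relying on it. The sketch below uses the WDIOC_GETSUPPORT and WDIOC_GETTIMEOUT ioctls and the WDIOF_* capability flags from <linux/watchdog.h>; the helper names describe_options and query_watchdog are hypothetical. Keep in mind that simply opening /dev/watchdog arms the watchdog, so query_watchdog expects a descriptor your program is already kicking.

```cpp
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <cstdint>
#include <string>

// Pure helper: turns WDIOF_* capability bits into a readable summary.
std::string describe_options(std::uint32_t options) {
    std::string out;
    auto add = [&out](const char* name) {
        if (!out.empty()) out += ", ";
        out += name;
    };
    if (options & WDIOF_SETTIMEOUT)    add("settimeout");
    if (options & WDIOF_MAGICCLOSE)    add("magicclose");
    if (options & WDIOF_KEEPALIVEPING) add("keepaliveping");
    return out.empty() ? std::string("none") : out;
}

// Fills `info` and `timeout_s` from an already-open watchdog descriptor.
// Returns false if either ioctl fails (for example, on an invalid fd).
bool query_watchdog(int fd, watchdog_info& info, int& timeout_s) {
    if (ioctl(fd, WDIOC_GETSUPPORT, &info) < 0) return false;
    if (ioctl(fd, WDIOC_GETTIMEOUT, &timeout_s) < 0) return false;
    return true;
}
```

After a successful query, describe_options(info.options) tells you whether writing V on close is expected to disarm the device, and timeout_s holds the timeout the driver is actually using.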

STM32 microcontroller — IWDG (HAL-style) — C++ flavored embedded code

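The independent watchdog (IWDG) on STM32 microcontrollers runs from its own low-speed internal oscillator (LSI), separate from the main system clock, so it keeps counting and resets the MCU if it is not refreshed, even when the CPU or main clock has stalled. Below is HAL-style pseudo-code (real STM32 projects typically mix C and C++).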

// Pseudocode: stm32_iwdg_example.cpp
// This is HAL-style; adapt to your STM32CubeMX generated project.

#include "stm32f4xx_hal.h"

// Global IWDG handle (usually in C generated by CubeMX)
IWDG_HandleTypeDef hiwdg;

void Watchdog_Init() {
    // Example values — configure according to datasheet
    hiwdg.Instance = IWDG;
    hiwdg.Init.Prescaler = IWDG_PRESCALER_64;
    // Timeout is roughly prescaler * (Reload + 1) / f_LSI: about 8 s here,
    // assuming a nominal 32 kHz LSI. Check the reference manual for your part.
    hiwdg.Init.Reload = 4095;
    if (HAL_IWDG_Init(&hiwdg) != HAL_OK) {
        // Initialization Error
        Error_Handler();
    }
}

void Watchdog_Refresh() {
    // Call this regularly before timeout
    HAL_IWDG_Refresh(&hiwdg);
}

int main() {
    HAL_Init();
    SystemClock_Config();

    Watchdog_Init();

    while (1) {
        // Normal application work
        do_some_task();

        // Kick the watchdog periodically (must be within IWDG timeout)
        Watchdog_Refresh();

        HAL_Delay(100); // milliseconds
    }
}

Notes

  • Once started, IWDG typically cannot be stopped until reset (design for safety). Choose prescaler & reload to get desired timeout.
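  • Call HAL_IWDG_Refresh() from the main loop or a dedicated watchdog task. Avoid refreshing from a free-running timer ISR, which can keep firing while the main code is hung.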
  • For window watchdog (WWDG), refreshing must occur within a window (not too early/late) — check your MCU docs.
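
Choosing the prescaler and reload value comes down to simple arithmetic. The helpers below are a sketch based on the usual IWDG relation, timeout = prescaler × (reload + 1) / f_LSI. The function names are hypothetical, and the 32 kHz LSI frequency is a nominal assumption (the real oscillator varies between parts and with temperature), so treat the results as estimates and confirm them against your reference manual.

```cpp
#include <cstdint>

// Estimated IWDG timeout in milliseconds.
//   prescaler : divider value (4, 8, 16, 32, 64, 128 or 256)
//   reload    : 12-bit reload value (0..4095)
//   lsi_hz    : LSI clock frequency in Hz (assumed nominal 32 kHz by default)
constexpr std::uint32_t iwdg_timeout_ms(std::uint32_t prescaler,
                                        std::uint32_t reload,
                                        std::uint32_t lsi_hz = 32000) {
    // A 64-bit intermediate keeps the multiplication safe for any input.
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(prescaler) * (reload + 1u) * 1000u) / lsi_hz);
}

// Smallest reload value whose timeout is at least `timeout_ms`, or -1 if the
// request does not fit in the 12-bit reload register for this prescaler.
constexpr int iwdg_reload_for(std::uint32_t prescaler,
                              std::uint32_t timeout_ms,
                              std::uint32_t lsi_hz = 32000) {
    // ticks = ceil(timeout_ms * lsi_hz / (1000 * prescaler))
    const std::uint64_t num = static_cast<std::uint64_t>(timeout_ms) * lsi_hz;
    const std::uint64_t den = 1000ull * prescaler;
    const std::uint64_t ticks = (num + den - 1) / den;
    if (ticks == 0) return 0;
    if (ticks > 4096) return -1;
    return static_cast<int>(ticks - 1);
}
```

For the values used in Watchdog_Init() (prescaler 64, reload 4095), iwdg_timeout_ms returns 8192, or roughly 8.2 seconds at the nominal LSI frequency. Going the other way, a 1-second timeout with prescaler 64 needs a reload value of 499.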

How to design a Watchdog service in C++ for a multi-threaded application

Designing a robust watchdog service in C++ for a multi-threaded application means building a small, thread-safe supervisor that monitors heartbeats (keepalives) from important threads or components and triggers configurable recovery actions when one or more components fail. This section covers the architecture, a code example, and the main design considerations.

Summary (what you’ll get)

  • Clear architecture for a multi-threaded C++ watchdog service
  • Thread-safe API for components to register and send heartbeats
  • Example implementation (C++17) you can reuse or adapt
  • Best practices: timeouts, recovery actions, logging, testing, and integration

High-level design

  1. Central Watchdog Manager
    • Single object responsible for tracking registered “clients” (threads, tasks, or subsystems).
    • Runs a dedicated monitoring thread that checks last-heartbeat timestamps.
  2. Client registration & heartbeat API
    • Each critical component registers with a unique ID and a desired timeout.
    • Components call kick() / heartbeat() periodically.
  3. Recovery strategy
    • On timeout, the watchdog can: log the failure, restart only that component, perform a graceful shutdown, or escalate to a full process/system restart.
    • Recovery actions are user-supplied callbacks to keep the watchdog generic.
  4. Thread safety & low overhead
    • Use std::mutex/std::shared_mutex and atomic types to guard state.
    • Keep monitoring loop low-overhead and sleep with condition variables.
  5. Observability
    • Expose metrics (timeouts, last-kick times), logging, and optionally health endpoints (for services).

C++ Example (compact, production-style)

// file: multi_thread_watchdog.cpp
// Build: g++ -std=c++17 -pthread multi_thread_watchdog.cpp -o mwdt

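#include <atomic>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <thread>
#include <vector>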

using namespace std::chrono_literals;
using Clock = std::chrono::steady_clock;

struct WatchdogClient {
    std::string id;
    std::chrono::milliseconds timeout;
    Clock::time_point last_beat;
    std::function<void(const std::string&)> on_timeout; // user recovery callback
};

class WatchdogService {
public:
    WatchdogService(std::chrono::milliseconds poll_interval = 500ms)
        : poll_interval_(poll_interval), running_(false) {}

    ~WatchdogService() { stop(); }

    // Start monitoring thread
    void start() {
        bool expected = false;
        if (!running_.compare_exchange_strong(expected, true)) return;
        monitor_thread_ = std::thread(&WatchdogService::monitor_loop, this);
    }

    // Stop monitoring thread
    void stop() {
        running_ = false;
        cv_.notify_all();
        if (monitor_thread_.joinable()) monitor_thread_.join();
    }

    // Register a client. If already exists, updates timeout and callback.
    void register_client(const std::string& id,
                         std::chrono::milliseconds timeout,
                         std::function<void(const std::string&)> on_timeout = nullptr) {
        std::unique_lock lock(mutex_);
        WatchdogClient c{ id, timeout, Clock::now(), on_timeout };
        clients_[id] = std::move(c);
        cv_.notify_all();
    }

    // Unregister a client when it is shutting down
    void unregister_client(const std::string& id) {
        std::unique_lock lock(mutex_);
        clients_.erase(id);
    }

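    // Called by the client to send heartbeat.
    // The shared lock only protects the map structure; this assumes each
    // client ID is heartbeated from a single thread.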
    void heartbeat(const std::string& id) {
        std::shared_lock lock(mutex_);
        auto it = clients_.find(id);
        if (it != clients_.end()) {
            it->second.last_beat = Clock::now();
        } else {
            // Optional: log unknown client
        }
    }

private:
    void monitor_loop() {
        while (running_) {
            auto now = Clock::now();
            std::vector<std::pair<std::string, std::function<void(const std::string&)>>> to_handle;

            {   // Locked region for safe iteration
                std::unique_lock lock(mutex_);
                for (auto &kv : clients_) {
                    const auto &id = kv.first;
                    auto &client = kv.second;
                    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(now - client.last_beat);
                    if (elapsed >= client.timeout) {
                        // Capture for handling outside lock
                        to_handle.emplace_back(id, client.on_timeout);
                        // update last_beat to avoid repeated triggers until client resets or is removed
                        client.last_beat = now;
                    }
                }
            }

            // Execute user callbacks outside lock to avoid deadlocks
            for (auto &p : to_handle) {
                const std::string &id = p.first;
                auto &cb = p.second;
                std::cerr << "[Watchdog] Timeout for client: " << id << "\n";
                if (cb) {
                    try { cb(id); }
                    catch (const std::exception &e) {
                        std::cerr << "[Watchdog] callback exception: " << e.what() << "\n";
                    } catch (...) {
                        std::cerr << "[Watchdog] callback unknown exception\n";
                    }
                }
            }

            // Sleep until the next poll; stop() wakes the thread early via cv_
            std::unique_lock lock_cv(cv_mutex_);
            cv_.wait_for(lock_cv, poll_interval_, [&](){ return !running_.load(); });
        }
    }

    std::map<std::string, WatchdogClient> clients_;
    std::shared_mutex mutex_;         // protect clients_
    std::chrono::milliseconds poll_interval_;
    std::thread monitor_thread_;
    std::atomic<bool> running_;
    std::condition_variable_any cv_;
    std::mutex cv_mutex_;
};

// ------------------------------------
// Example usage
// ------------------------------------
void restart_component(const std::string &id) {
    // example recovery action
    std::cerr << "[Recovery] Restarting component: " << id << "\n";
    // Insert restart logic: signal thread, spawn helper, or set a flag for supervisor
}

int main() {
    WatchdogService wdt(300ms);
    wdt.start();

    wdt.register_client("workerA", 1000ms, restart_component);
    wdt.register_client("workerB", 1500ms, restart_component);

    // simulate worker A periodically heartbeating
    std::thread workerA([&wdt]() {
        for (int i = 0; i < 5; ++i) {
            std::this_thread::sleep_for(300ms);
            wdt.heartbeat("workerA");
            std::cout << "workerA heartbeat\n";
        }
        // simulate a hang (no more heartbeats) -> watchdog will trigger
        std::this_thread::sleep_for(3s);
    });

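    // simulate worker B healthy until asked to stop
    std::atomic<bool> keep_running{true};
    std::thread workerB([&wdt, &keep_running]() {
        while (keep_running) {
            std::this_thread::sleep_for(500ms);
            wdt.heartbeat("workerB");
        }
    });

    workerA.join();
    // For demo, stop after some time
    std::this_thread::sleep_for(5s);
    keep_running = false;
    workerB.join(); // a joinable std::thread must be joined before it is destroyed
    wdt.stop();
    return 0;
}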

Design considerations & interview talking points

  • Choosing timeouts: Select per-component timeouts based on worst-case execution time plus margin. Avoid too-short timeouts that false-trigger and too-long that delay recovery.
  • Where to place heartbeat calls: Place heartbeat() in code paths that only run during healthy operation (not in trivial periodic timers), e.g., after finishing important work or inside a watchdog-specific health-check thread.
  • Avoid single-point-of-failure: Don’t let one misbehaving component hide others by always heartbeating on their behalf. Use per-client timeouts and independent checks.
  • Recovery actions: Keep callbacks simple and non-blocking. Prefer signaling a supervisor thread to perform restarts to avoid doing heavy work inside the monitor.
  • System vs software watchdog: Software watchdog handles app-level failures; for kernel or total system failure, use platform/hardware watchdog (e.g., /dev/watchdog on Linux or MCU IWDG). Combine both for high reliability.
  • Thread-safety & performance: Prefer std::shared_mutex for many readers and few writers; minimize lock time. Monitor thread should do minimal work and call user handlers outside locks.
  • Observability: Log watchdog events, expose metrics (Prometheus / stats), and save last-fail reason to persistent storage for post-mortem analysis.
  • Testing: Unit-test registration and timeout paths. Simulate long-running tasks and delayed heartbeats. Use chaos testing to validate recovery logic.
  • Security: Sanitize client IDs and callbacks if loaded dynamically. Limit what recovery callbacks can do if running in restricted environments.
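
The advice to keep recovery callbacks simple and non-blocking can be sketched as a small hand-off queue: the watchdog callback only enqueues the failed client's ID, and a separate supervisor thread does the slow restart work. RecoveryQueue and its method names are hypothetical and are not part of the WatchdogService listing; the push() method is written to match the on_timeout callback signature.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Thread-safe hand-off queue between the watchdog monitor and a supervisor
// thread (illustrative sketch; class and method names are hypothetical).
class RecoveryQueue {
public:
    // Called from the watchdog callback: cheap, never waits on restart work.
    void push(const std::string& client_id) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push_back(client_id);
        }
        cv_.notify_one();
    }

    // Called by the supervisor thread: blocks until a request arrives or
    // close() is called. Returns std::nullopt once closed and drained.
    std::optional<std::string> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return closed_ || !pending_.empty(); });
        if (pending_.empty()) return std::nullopt;
        std::string id = std::move(pending_.front());
        pending_.pop_front();
        return id;
    }

    // Wakes the supervisor so it can exit its loop during shutdown.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::string> pending_;
    bool closed_ = false;
};
```

A callback such as [&queue](const std::string& id) { queue.push(id); } can then be passed to register_client, while the supervisor thread loops on pop() and restarts whatever it receives. This keeps the monitor loop responsive even when a restart takes seconds.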

Quick comparison & best-practices

  • Software WDT (C++ thread) — easy, portable, good for apps/services but cannot recover from kernel/OS crash.
  • Linux /dev/watchdog — uses kernel/hardware-backed driver, can reset the entire machine. Use for high reliability.
  • MCU IWDG — independent hardware watchdog, best for deeply embedded systems; survives CPU locks.
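
One common way to combine these layers is to refresh the hardware watchdog only when every monitored task has checked in since the last refresh. The sketch below assumes a fixed set of at most 32 tasks identified by index. HealthGate, check_in and all_checked_in_and_reset are hypothetical names, and the actual hardware kick (for example HAL_IWDG_Refresh or a WDIOC_KEEPALIVE ioctl) is left to the caller.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Tracks up to 32 tasks as bits in one atomic word (illustrative sketch).
class HealthGate {
public:
    explicit HealthGate(std::size_t task_count)
        : required_(task_count >= 32 ? 0xFFFFFFFFu
                                     : ((1u << task_count) - 1u)) {}

    // Each task calls this with its own index after finishing useful work.
    // Precondition: task_index is smaller than the task_count given above.
    void check_in(std::size_t task_index) {
        seen_.fetch_or(1u << task_index);
    }

    // Called by the single thread that owns the hardware watchdog.
    // Returns true (and clears the flags) only if every task has checked in
    // since the previous successful call. Otherwise the flags are left as they
    // are and the caller should skip the kick, letting the hardware expire.
    bool all_checked_in_and_reset() {
        std::uint32_t expected = required_;
        return seen_.compare_exchange_strong(expected, 0u);
    }

private:
    const std::uint32_t required_;
    std::atomic<std::uint32_t> seen_{0};
};
```

The thread that owns the hardware watchdog calls all_checked_in_and_reset() once per period and kicks the hardware only when it returns true. A single hung task then stops the kicks, and the hardware watchdog resets the system even though the kicking thread itself is still running.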

Design tips

  • Put the kick() as close as possible to a place that only runs during healthy operation (not just a trivial periodic timer), or use multiple health checks (task heartbeat, resource check).
  • Keep the watchdog timeout long enough to allow legitimate long tasks, but short enough to catch faults quickly.
  • Log last alive state to non-volatile storage, so on reboot you can analyze cause of reset.
  • On systems with safety requirements, combine software and hardware watchdogs.
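
Logging the last alive state can be as simple as overwriting a small record at each checkpoint and reading it back at startup. The sketch below uses an ordinary file as a stand-in for non-volatile storage (on an MCU this would more likely be a backup register or a reserved flash area); the function names and the path argument are hypothetical.

```cpp
#include <fstream>
#include <string>

// Overwrites the breadcrumb file with the name of the stage just reached.
// Returns false if the file could not be written.
bool record_checkpoint(const std::string& path, const std::string& stage) {
    std::ofstream out(path, std::ios::trunc);
    if (!out) return false;
    out << stage << '\n';
    out.flush();  // hands the data to the OS; a real device may also need fsync
    return static_cast<bool>(out);
}

// Reads the breadcrumb left by the previous run; empty string if none exists.
std::string last_checkpoint(const std::string& path) {
    std::ifstream in(path);
    std::string stage;
    if (in) std::getline(in, stage);
    return stage;
}
```

Reading last_checkpoint() early at boot, before the first record_checkpoint() call, shows which stage the previous run had reached when the watchdog fired. On flash-backed storage, limit how often the checkpoint is written to avoid unnecessary wear.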

Final Thoughts

A watchdog timer detects system crashes by monitoring missed refresh signals. It acts like a guardian for embedded systems, making sure the device can recover from unexpected errors without human intervention.

If you’re building reliable embedded software, always integrate a watchdog timer. It could be the difference between a system that fails silently and one that self-heals in real time.
