Node Problem Detector is a Kubernetes-native agent that monitors node health and reports hardware failures, kernel issues, and other node-level problems to the cluster control plane. It detects problems by scanning kernel ring buffer messages for error patterns, running user-defined health check scripts, and collecting system metrics from CPU, memory, disk, and network interfaces.
The agent distinguishes between permanent and temporary problems by mapping plugin failures to either persistent node conditions visible in kubectl describe node or one-time node events. It supports running multiple custom plugins concurrently with configurable concurrency limits, and can execute repair commands on unhealthy node components after a cooldown period. The detector also monitors systemd service health by scanning journald logs for repeated start/stop patterns, and checks for read-only filesystems or filesystem corruption.
Beyond problem detection, the agent collects system health metrics from /proc files on Linux and gopsutil on Windows, exposing them as Prometheus endpoints for external scraping. It includes built-in monitoring for CPU load averages, memory usage, disk I/O operations, network interface statistics, and host uptime. The project provides container images for both Linux and Windows deployments, and creates a dedicated Kubernetes service account with cluster role bindings for node problem detection permissions.