Node Problem Detector | Awesome Repository

Node Problem Detector is a Kubernetes-native agent that monitors node health and reports hardware failures, kernel issues, and other node-level problems to the cluster control plane. It detects problems by scanning kernel ring buffer messages for error patterns, running user-defined health check scripts, and collecting CPU, memory, disk, and network interface metrics.
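
The log-scanning step can be pictured as a list of regular-expression rules applied to each kernel message. The following is a minimal Python sketch under stated assumptions, not the project's actual implementation: the Rule class, the rule names, the patterns, and the sample log lines are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    kind: str     # "temporary" (one-time event) or "permanent" (condition)
    reason: str   # short machine-readable label for the problem
    pattern: str  # regular expression searched for in each log line


# Hypothetical rules, for illustration only.
RULES = [
    Rule("temporary", "OOMKilling", r"Killed process \d+ \(.+\)"),
    Rule("permanent", "FilesystemIsReadOnly", r"Remounting filesystem read-only"),
]


def match_line(line, rules=RULES):
    """Return the first rule whose pattern occurs in the line, or None."""
    for rule in rules:
        if re.search(rule.pattern, line):
            return rule
    return None


def scan(lines, rules=RULES):
    """Yield (rule, line) for every log line that matches a rule."""
    for line in lines:
        rule = match_line(line, rules)
        if rule is not None:
            yield rule, line


if __name__ == "__main__":
    sample = [
        "Out of memory: Killed process 1234 (stress) total-vm:1024kB",
        "EXT4-fs (sda1): Remounting filesystem read-only",
        "usb 1-1: new high-speed USB device number 2",
    ]
    for rule, line in scan(sample):
        print(rule.kind, rule.reason, "<-", line)
```

Keeping the rules as data rather than code means new failure signatures can be added without touching the scanning loop.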

The agent distinguishes between permanent and temporary problems: plugin failures map either to persistent node conditions, visible in kubectl describe node, or to one-time node events. It can run multiple custom plugins in parallel up to a configurable concurrency limit, and can execute repair commands on unhealthy node components after a cooldown period. The detector also monitors systemd service health by scanning journald logs for repeated start/stop patterns, and checks for read-only filesystems and filesystem corruption.
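
The split between conditions and events can be sketched in Python as follows. This is an illustration under assumptions rather than the agent's own code: the exit-code convention (0 healthy, 1 problem, anything else unknown), the function names, and the report dictionaries are hypothetical.

```python
import subprocess
import sys

OK, NON_OK, UNKNOWN = "OK", "NonOK", "Unknown"


def classify(exit_code):
    """Assumed convention: 0 is healthy, 1 is a problem, anything else is unknown."""
    if exit_code == 0:
        return OK
    if exit_code == 1:
        return NON_OK
    return UNKNOWN


def run_check(argv, timeout_s=5.0):
    """Run one check command and return (status, first line of its stdout)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return UNKNOWN, "check timed out"
    lines = proc.stdout.strip().splitlines()
    return classify(proc.returncode), (lines[0] if lines else "")


def to_report(rule_type, status, message):
    """Map a check result to a hypothetical report, or None if nothing is reported."""
    if status == UNKNOWN:
        return None  # do not change anything on an inconclusive result
    if rule_type == "permanent":
        # A condition is persistent state, so it is set on failure and cleared on success.
        return {"kind": "condition", "problem": status == NON_OK, "message": message}
    if status == NON_OK:
        # A temporary problem only produces a one-time event.
        return {"kind": "event", "message": message}
    return None


if __name__ == "__main__":
    # Stand-in for a user-defined script: prints a message and exits with 1.
    check = [sys.executable, "-c", "import sys; print('disk full'); sys.exit(1)"]
    status, message = run_check(check)
    print(to_report("permanent", status, message))
    print(to_report("temporary", status, message))
```

The timeout guards against a hung script blocking the monitor; in this sketch a timed-out check is treated as unknown rather than as a failure.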

Beyond problem detection, the agent collects system health metrics, reading /proc files on Linux and using the gopsutil library on Windows, and exposes them on a Prometheus endpoint for external scraping. It includes built-in monitoring for CPU load averages, memory usage, disk I/O operations, network interface statistics, and host uptime. The project provides container images for both Linux and Windows deployments, and creates a dedicated Kubernetes service account with cluster role bindings that grant the permissions needed for node problem detection.
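
To make the Linux collection path concrete, the sketch below parses text in the format of /proc/loadavg and /proc/meminfo and renders it in the Prometheus text exposition format. It is a hedged illustration only: the metric names (demo_cpu_load_average, demo_memory_bytes) and helper functions are hypothetical and are not the names the agent exports, and the sample values are made up.

```python
def parse_loadavg(text):
    """Parse /proc/loadavg-style text into 1-, 5-, and 15-minute load averages."""
    one, five, fifteen = (float(field) for field in text.split()[:3])
    return {"1m": one, "5m": five, "15m": fifteen}


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}, converting kB to bytes."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        values[name.strip()] = value
    return values


def render_gauge(name, help_text, samples):
    """Render one gauge family; samples is a list of (labels dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample text stands in for the real files so the sketch runs on any platform.
    load = parse_loadavg("0.52 0.41 0.30 1/234 5678\n")
    mem = parse_meminfo("MemTotal:       16384 kB\nMemFree:         4096 kB\n")
    print(render_gauge("demo_cpu_load_average", "Load average by window.",
                       [({"window": w}, v) for w, v in load.items()]), end="")
    print(render_gauge("demo_memory_bytes", "Memory by state.",
                       [({"state": k}, v) for k, v in mem.items()]), end="")
```

Separating parsing from rendering is what would let a different backend, such as gopsutil on Windows, feed the same exporter.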

Features

Node Condition Reporters - An agent that monitors node health and reports hardware, kernel, and systemd issues as Kubernetes conditions or events.

Kubernetes Node Failure Detections - Detects hardware failures, kernel issues, and other node-level problems in a Kubernetes cluster and reports them as node conditions or events.

Kernel Log Pattern Matchers - Scans kernel ring buffer messages with regex patterns to detect hardware and software faults.

Plugin-Based - Invokes user-defined scripts on a schedule and reports their output as node conditions or events.

Node Condition Definitions - Maps plugin failures to persistent node conditions that other cluster components can act upon.

Permanent Node Failure Detections - Identifies unrecoverable kernel problems and marks nodes with permanent conditions for cluster action.

Hardware Failure Detectors - Monitors kernel logs and system files to identify hardware malfunctions and reports them to the cluster.

Dual Condition and Event Reporting - Distinguishes between permanent node conditions and temporary node events for problem reporting.

Problem Detection Plugins - Runs user-defined scripts as plugins that report node problems via exit codes and structured output.

Systemd Service Monitoring - Scans journald logs for repeated start/stop patterns and reports conditions when restarts exceed thresholds.

Kernel Log Exporters - Scans kernel log files for patterns indicating issues like OOM kills and reports them as node conditions or events.

Custom Health Check Executors - Executes user-defined scripts or commands on each node and reports their exit status as node conditions.

Health Check Interval Configuration - Invokes user-defined health checks at fixed intervals with configurable timeouts and cooldown periods.

Custom Plugin Health Monitors - Runs user-defined scripts or commands on Kubernetes nodes on a schedule as custom health checks and reports their results.

Kernel Error Pattern Reporters - Scans kernel logs for known error patterns and surfaces them as node conditions or events.

Kernel Ring Buffer Streams - Scans kernel ring buffer messages in real time to detect hardware or software faults.

Kernel Problem Reporters - Parses kernel ring buffer messages to detect software or driver issues and surfaces them to the cluster.

System Metrics Collection - Collects CPU, memory, disk, and network metrics from system files for node health monitoring.

Kubernetes Node Metrics - Collects CPU, memory, disk, and network metrics from Kubernetes nodes and exposes them for Prometheus scraping.

Node Health Tracking - Periodically checks node component health and reports conditions when components become unhealthy.

Prometheus System Metrics Exporters - Exposes CPU, memory, disk, and network metrics in Prometheus format through a dedicated HTTP endpoint.

Custom Script Exit-Code Monitors - Executes arbitrary monitor scripts and interprets exit codes and output to detect node problems.

Node Event Emitters - Generates one-time Kubernetes node events from plugin failures without persisting conditions.

Filesystem Corruption Detections - Checks for filesystem corruption or read-only mounts and alerts the cluster to storage issues.

Plugin Concurrency Limiters - Provides configurable concurrency limits for parallel execution of custom monitoring plugins.

Node Health Metric Collectors - A daemon that collects CPU, memory, disk, and network metrics and exposes them for Prometheus scraping.

Temporary Node Anomaly Detections - Emits temporary events for transient node issues like OOM kills without altering persistent conditions.

CPU Performance Monitors - Collects CPU load averages, usage time, and process statistics at configurable intervals.

Cross-Platform System Metrics Collectors - Collects system metrics across Linux and Windows using platform-specific backends.

Disk and Swap Monitoring - Collects disk usage, I/O operations, queue length, and weighted I/O for block devices.

Network Interface Statistics Collectors - Collects cumulative transmit and receive packet, error, and drop counts per network interface.

Node Component Repair Tools - Runs repair commands on unhealthy node components after a cooldown period to restore health.

Component Health Verifiers - Periodically verifies the health of node components like systemd services and reports conditions when they become unhealthy.

Node-Level - Simulates node-level failures like filesystem errors or process hangs to verify that the cluster detects and reports problems correctly.

System Memory Monitors - Collects memory usage, page cache, anonymous, dirty, and unevictable memory statistics.

Kubernetes Windows Node Monitors - Collects system health metrics and detects problems on Windows Kubernetes nodes using platform-appropriate tools.

Windows Node Metric Collectors - Collects CPU, memory, disk, and uptime metrics on Windows nodes using the gopsutil library instead of /proc files.
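
As a closing illustration of the Systemd Service Monitoring entry above, the idea of reporting a condition once restarts exceed a threshold can be sketched as a sliding-window counter. This is an assumption-laden Python sketch and not the project's implementation: the RestartTracker class, the threshold, and the window length are hypothetical, and the timestamps are supplied directly instead of being parsed from journald.

```python
from collections import deque


class RestartTracker:
    """Flags a service that starts too often inside a sliding time window."""

    def __init__(self, threshold, window_s):
        self.threshold = threshold  # starts needed to flag the service
        self.window_s = window_s    # window length in seconds
        self._starts = deque()      # timestamps of recent starts, oldest first

    def observe(self, timestamp_s):
        """Record one start; return True once the threshold is reached."""
        self._starts.append(timestamp_s)
        # Drop starts that have fallen out of the window.
        while self._starts and timestamp_s - self._starts[0] > self.window_s:
            self._starts.popleft()
        return len(self._starts) >= self.threshold


if __name__ == "__main__":
    tracker = RestartTracker(threshold=3, window_s=60)
    for ts in (0, 10, 20, 200):
        print(ts, tracker.observe(ts))
```

Because old timestamps are discarded as the window moves, a service that restarted several times long ago is not flagged again by a single new start.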