Watchdog Overview
The maximum threshold settings for the watchdog resources are:
- CPU: Above 25% usage for over 9 consecutive seconds.
- Memory: When 350MB is reached (LevelBlue Agent default setting).
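The two thresholds above can be read as a simple rule over periodic resource samples. The following is a minimal sketch of that rule, not the watchdog's actual implementation: it assumes one sample per second, and the `Sample` type and `first_violation` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Thresholds taken from the list above.
CPU_LIMIT_PERCENT = 25.0
CPU_SUSTAINED_SECONDS = 9
MEMORY_LIMIT_MB = 350.0


@dataclass
class Sample:
    """One hypothetical per-second reading of the worker process."""
    cpu_percent: float
    memory_mb: float


def first_violation(samples: Iterable[Sample]) -> Optional[str]:
    """Return "memory" or "cpu" for the first limit hit, else None.

    Assumes exactly one sample per second, so "over 9 consecutive
    seconds" means more than 9 consecutive samples above the CPU limit.
    """
    consecutive_high_cpu = 0
    for sample in samples:
        if sample.memory_mb >= MEMORY_LIMIT_MB:
            return "memory"
        if sample.cpu_percent > CPU_LIMIT_PERCENT:
            consecutive_high_cpu += 1
            if consecutive_high_cpu > CPU_SUSTAINED_SECONDS:
                return "cpu"
        else:
            # A single sample at or below the limit resets the streak.
            consecutive_high_cpu = 0
    return None
```

Under these assumptions, a short CPU spike does not count as a violation; only a sustained run above 25% does, while the memory limit triggers as soon as it is reached.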
Watchdog Threshold Limits and Errors
When a watchdog limit is reached, the osquery watchdog respawns the child worker process. After the restart, osquery checks the previously active queries to see which ones did not finish normally. One or more of these queries may have caused the limit to be exceeded, so the unfinished queries are denylisted from the scheduler for 24 hours.
If the osquery processes exceed their allocated resources, the watchdog may respawn the process without emitting any error message. A good indicator that this has happened is the timestamps of the files in the logs subdirectory: a high number of files with timestamps that are close together suggests that the watchdog has been killing processes for exceeding their resource limits.
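The timestamp check described above could be automated along the following lines. This is a rough sketch under stated assumptions: the functions `find_bursts` and `log_file_mtimes` are hypothetical helpers, and the 60-second gap and 5-file minimum are arbitrary illustrative values, not thresholds documented for the agent.

```python
from pathlib import Path
from typing import Iterable, List


def find_bursts(
    mtimes: Iterable[float],
    max_gap_seconds: float = 60.0,  # assumed value, tune for your environment
    min_files: int = 5,             # assumed value, tune for your environment
) -> List[List[float]]:
    """Group timestamps into runs where neighbours are close together.

    A run ends when the gap to the next timestamp exceeds
    max_gap_seconds. Only runs with at least min_files entries are kept.
    """
    bursts: List[List[float]] = []
    current: List[float] = []
    for timestamp in sorted(mtimes):
        if current and timestamp - current[-1] > max_gap_seconds:
            if len(current) >= min_files:
                bursts.append(current)
            current = []
        current.append(timestamp)
    if len(current) >= min_files:
        bursts.append(current)
    return bursts


def log_file_mtimes(log_dir: str) -> List[float]:
    """Collect modification times (seconds since epoch) of files in log_dir."""
    return [p.stat().st_mtime for p in Path(log_dir).iterdir() if p.is_file()]
```

Calling `find_bursts(log_file_mtimes(log_dir))` with the agent's logs subdirectory as `log_dir` returns the groups of closely spaced files. A non-empty result is only a hint that repeated respawns may have occurred, not proof.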
Scheduled Query Failure Messages
The watchdog enforces limits on the worker process to protect systems from CPU-expensive and memory-intensive queries. If the watchdog observes limit violations, it will display an error similar to the following:
Scheduled query may have failed: <<...>>
This line is created when a child worker starts and finds what osquery calls a “dirty bit” toggled for the currently executing query. A similar line may appear if a child worker process is stopped abruptly and a query does not finish.
Lines that indicate the worker exceeded one of the watchdog limits include the following:
osqueryd worker (1234) system performance limits exceeded
osqueryd worker (5678) memory limits exceeded: 442494
The process identifier (PID) of the offending child worker is included in parentheses.
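When reviewing many log files, it can help to extract the PID and the type of limit from these lines programmatically. The sketch below assumes the lines match the two examples shown above exactly. The `parse_limit_line` function and its result keys are hypothetical, and the meaning of the trailing number on the memory line is not stated here, so it is returned uninterpreted.

```python
import re
from typing import Optional

# Pattern modelled only on the two example lines above. Real log
# lines may carry extra prefixes, which search() tolerates.
LIMIT_LINE = re.compile(
    r"osqueryd worker \((?P<pid>\d+)\) "
    r"(?P<kind>system performance|memory) limits exceeded"
    r"(?:: (?P<value>\d+))?"
)


def parse_limit_line(line: str) -> Optional[dict]:
    """Return the PID, limit kind, and optional reported value, or None."""
    match = LIMIT_LINE.search(line)
    if match is None:
        return None
    value = match.group("value")
    return {
        "pid": int(match.group("pid")),
        "kind": match.group("kind"),
        # Units of this number are not specified in this document.
        "value": int(value) if value is not None else None,
    }
```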
If the child worker finds itself in a recurring error state, or if the watchdog continues to stop the worker, additional lines like the following are created:
osqueryd worker respawning too quickly: 1 times
The watchdog implements an exponential backoff when respawning child workers, and the offending query is denylisted from running for 24 hours.
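To illustrate the two mechanisms just described, here is a small sketch of an exponential backoff delay and a 24-hour denylist check. It shows the general technique only: the base delay, the cap, and the function names `respawn_delay` and `is_denylisted` are assumptions for illustration, since the actual values used by the osquery watchdog are not given here.

```python
from datetime import datetime, timedelta

# The 24-hour duration comes from the text above.
DENYLIST_DURATION = timedelta(hours=24)


def respawn_delay(
    attempt: int,
    base_seconds: float = 1.0,   # assumed base delay, not from osquery
    max_seconds: float = 300.0,  # assumed cap, not from osquery
) -> float:
    """Delay before respawn number `attempt` (0-based), doubling each time."""
    return min(base_seconds * (2 ** attempt), max_seconds)


def is_denylisted(denylisted_at: datetime, now: datetime) -> bool:
    """True while fewer than 24 hours have passed since denylisting."""
    return now - denylisted_at < DENYLIST_DURATION
```

With these assumed defaults, successive respawns would wait 1, 2, 4, 8 seconds and so on until the cap is reached, which keeps a repeatedly failing worker from consuming resources in a tight restart loop.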
The osquery watchdog applies only to the worker process. It is enabled by default and can be disabled with a control flag (--disable_watchdog in osquery).
See the official osquery documentation on query failures with the watchdog for more information on osquery errors and debugging options.