How to detect sustained high CPU usage over a specific time frame with Watcher
Imagine you run numerous nodes in a production environment and want to know which of them shows high CPU consumption over a certain period.
Monitoring this manually is impractical, so you need an alert by email or Slack that makes you aware of the problem right away and lets you act on it immediately.
Creating a watcher for CPU utilization is a good way to keep an eye on it. You might wonder whether the Stack Management dashboard already makes this information easy to track. It does, but with a limitation: as soon as CPU utilization crosses the threshold (say, 80%), you get a notification right away, even if it is only a brief spike. For our use case, we want an alert only when CPU consumption stays excessive for at least a predetermined period.
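To accomplish this, you can use the following query logic in the watcher.
The value 0.8 in the query represents 80%. The query keeps only documents from the last 5 minutes whose CPU usage is at or above that threshold, groups them per host into 10-second buckets, and retains only the hosts with at least 30 non-empty buckets. Since 5 minutes divided by 10 seconds is 30, a host passes only if it was above the threshold for the whole window. This assumes each host reports CPU metrics at least once every 10 seconds.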
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "system.cpu.total.norm.pct": {
              "gte": 0.8
            }
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "hostName": {
      "terms": {
        "field": "host.name"
      },
      "aggs": {
        "docsOverTimeFrame": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "10s"
          },
          "aggs": {
            "histogram_doc_count": {
              "bucket_selector": {
                "buckets_path": {
                  "the_doc_count": "_count"
                },
                "script": "params.the_doc_count > 0"
              }
            }
          }
        },
        "min_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "docsOverTimeFrame._bucket_count"
            },
            "script": {
              "source": "params.count >= 30"
            }
          }
        }
      }
    }
  }
}
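
To see how the bucket count in the query separates a sustained load from a short spike, the same logic can be mimicked locally. The following is a minimal Python sketch, not Elasticsearch code: the function name hosts_with_sustained_cpu, the (host, epoch_seconds, cpu_pct) tuple shape, and the sample hosts node-a and node-b are assumptions made for illustration only.

```python
from collections import defaultdict

THRESHOLD = 0.8        # mirrors "gte": 0.8 in the first range filter
WINDOW_SECONDS = 300   # mirrors the now-5m .. now range on @timestamp
BUCKET_SECONDS = 10    # mirrors "fixed_interval": "10s"
MIN_BUCKETS = 30       # mirrors "params.count >= 30"


def hosts_with_sustained_cpu(samples, now):
    """Return hosts that have MIN_BUCKETS non-empty 10-second buckets.

    samples: iterable of (host, epoch_seconds, cpu_pct) tuples.
    This input shape is hypothetical and chosen only for the sketch.
    """
    buckets_per_host = defaultdict(set)
    for host, ts, cpu in samples:
        in_window = now - WINDOW_SECONDS <= ts <= now
        if in_window and cpu >= THRESHOLD:
            # Like the date_histogram: place the document in its 10-second
            # bucket. Using a set keeps only non-empty buckets, which is
            # what the inner bucket_selector (the_doc_count > 0) does.
            buckets_per_host[host].add(ts // BUCKET_SECONDS)
    # Like min_bucket_selector: keep hosts with enough non-empty buckets.
    return sorted(
        host
        for host, buckets in buckets_per_host.items()
        if len(buckets) >= MIN_BUCKETS
    )


now = 1_700_000_000
# node-a reports 90% CPU every 10 seconds; node-b spikes only once.
samples = [("node-a", now - 10 * i, 0.9) for i in range(30)]
samples += [("node-b", now - 10 * i, 0.95 if i == 3 else 0.4) for i in range(30)]
print(hosts_with_sustained_cpu(samples, now))  # ['node-a']
```

Only node-a is returned: it fills all 30 buckets, while the single spike on node-b fills just one. If you change the time window or the interval in the query, the bucket threshold has to change with them (window length divided by interval).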