Notes on application monitoring self-service
To get an application monitored in a given namespace, the namespace must have the correct label applied, and the namespace needs either a PodMonitor or a ServiceMonitor resource set up that points to the service or pod that exports metrics.
This way, the metrics will be scraped into the configured Prometheus and correctly labeled.
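As a sketch of the namespace label, assuming the expected label is the same monitoring-key: cpe used on the monitoring resources below (the actual key and value are defined by the cluster's Prometheus configuration, so check there first):

apiVersion: v1
kind: Namespace
metadata:
  name: bodhi
  labels:
    monitoring-key: cpe   # assumed label; verify against the cluster's Prometheus config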
As an example, let's look at the ServiceMonitor for Bodhi:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    monitoring-key: cpe
  name: bodhi-service
  namespace: bodhi
spec:
  endpoints:
  - path: /metrics
  selector:
    matchLabels:
      service: web
In this example we only target the service with the label service: web, but the entire matching machinery is at our disposal; see the Matcher documentation.
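For instance, a selector could use matchExpressions instead of matchLabels to target several services at once; a minimal sketch, with illustrative label values:

  selector:
    matchExpressions:
    - key: service
      operator: In     # match any service whose label value is in the list
      values:
      - web
      - api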
To manage alerting, you can create an alerting rule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    monitoring-key: cpe
  name: bodhi-rules
spec:
  groups:
  - name: general.rules
    rules:
    - alert: DeadMansSwitch
      annotations:
        description: >-
          This is a DeadMansSwitch meant to ensure that the entire Alerting
          pipeline is functional.
        summary: Alerting DeadMansSwitch
      expr: vector(1)
      labels:
        severity: none
This would create an alert that always fires, to serve as a check that alerting works. You should be able to see it in Alertmanager.
To have an alert that actually does something, set expr to something other than vector(1). For example, to alert when the rate of 500 responses of a service exceeds 5 per second over the past 10 minutes:
sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5
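In a PrometheusRule, such an expression is typically paired with a for duration, so the alert only fires once the condition has held for a while instead of flapping on a single spike. A minimal rule sketch; the alert name and the 10m duration are illustrative:

    - alert: Bodhi500RateHigh   # hypothetical alert name
      expr: sum(rate(pyramid_request_count{job="bodhi-web", status="500"}[10m])) > 5
      for: 10m                  # fire only if the condition holds for 10 minutes
      labels:
        severity: high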
The alerts themselves would then be routed for further processing and notification according to the rules in Alertmanager; this configuration is not changeable from the developers' namespaces.
Managing and acknowledging alerts can be done in Alertmanager in a rudimentary fashion.
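For example, an alert can be acknowledged by silencing it from the command line with amtool, assuming it is pointed at the Alertmanager API (the URL below is illustrative):

amtool silence add alertname=DeadMansSwitch --duration=2h --comment="acknowledged, investigating" --alertmanager.url=http://alertmanager.example.com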
Notes on instrumenting the application
Prometheus expects the applications it scrapes metrics from to be services with a /metrics endpoint exposed, serving metrics in the correct format.
There are libraries that help with this for many different languages, confusingly called client libraries, even though they usually export metrics via an HTTP server endpoint: https://prometheus.io/docs/instrumenting/clientlibs/
As part of the proof of concept we have instrumented the Bodhi application to collect data through the prometheus_client Python library: https://github.com/fedora-infra/bodhi/pull/4079
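As a minimal sketch of what such instrumentation looks like with prometheus_client (the metric name, port, and simulated request handling are illustrative, not taken from Bodhi):

from prometheus_client import Counter, start_http_server
import random
import time

REQUEST_COUNT = Counter(
    "myapp_request_count",   # hypothetical metric name
    "Total requests handled",
    ["status"],              # label to distinguish response status codes
)

def handle_request():
    # Simulate a request and record its outcome under a status label.
    status = "500" if random.random() < 0.05 else "200"
    REQUEST_COUNT.labels(status=status).inc()

if __name__ == "__main__":
    # Serves metrics in the Prometheus text format, conventionally
    # scraped at http://localhost:8000/metrics
    start_http_server(8000)
    while True:
        handle_request()
        time.sleep(1)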
Notes on alerting
To be notified of alerts, you need to be subscribed to receivers that have been configured in Alertmanager.
The configuration of the rules you want to alert on can be done in the namespace of your application. For example:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    monitoring-key: cpe
  name: prometheus-application-monitoring-rules
spec:
  groups:
  - name: general.rules
    rules:
    - alert: AlertBodhi500Status
      annotations:
        summary: Alerting on too many server errors
      expr: (100*sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]", status="500"}[20m]))/sum(rate(pyramid_request_count{namespace="bodhi", path_info_pattern=~".*[^healthz]"}[20m])))>1
      labels:
        severity: high
This would alert if more than 1% of responses have a 500 status code.
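Once applied, you can check that the rule was picked up in your namespace, assuming the standard Prometheus operator CRD names:

oc get prometheusrules -n bodhi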