View on TensorFlow.org | Run in Google Colab | View source on GitHub | Download notebook |
Overview
This tutorial loads CoreDNS metrics from a Prometheus server into a tf.data.Dataset
, then uses tf.keras
for training and inference.
CoreDNS is a DNS server with a focus on service discovery, and is widely deployed as a part of the Kubernetes cluster. For that reason it is often closely monitoring by devops operations.
This tutorial is an example that could be used by devops looking for automation in their operations through machine learning.
Setup and usage
Install required tensorflow-io package, and restart runtime
import os
try:
%tensorflow_version 2.x
except Exception:
pass
TensorFlow 2.x selected.
pip install tensorflow-io
from datetime import datetime
import tensorflow as tf
import tensorflow_io as tfio
Install and setup CoreDNS and Prometheus
For demo purposes, a CoreDNS server locally with port 9053
open to receive DNS queries and port 9153
(defult) open to expose metrics for scraping. The following is a basic Corefile configuration for CoreDNS and is available to download:
.:9053 {
prometheus
whoami
}
More details about installation could be found on CoreDNS's documentation.
curl -s -OL https://github.com/coredns/coredns/releases/download/v1.6.7/coredns_1.6.7_linux_amd64.tgz
tar -xzf coredns_1.6.7_linux_amd64.tgz
curl -s -OL https://raw.githubusercontent.com/tensorflow/io/master/docs/tutorials/prometheus/Corefile
cat Corefile
.:9053 { prometheus whoami }
# Run `./coredns` as a background process.
# IPython doesn't recognize `&` in inline bash cells.
get_ipython().system_raw('./coredns &')
The next step is to setup Prometheus server and use Prometheus to scrape CoreDNS metrics that are exposed on port 9153
from above. The prometheus.yml
file for configuration is also available for download:
curl -s -OL https://github.com/prometheus/prometheus/releases/download/v2.15.2/prometheus-2.15.2.linux-amd64.tar.gz
tar -xzf prometheus-2.15.2.linux-amd64.tar.gz --strip-components=1
curl -s -OL https://raw.githubusercontent.com/tensorflow/io/master/docs/tutorials/prometheus/prometheus.yml
cat prometheus.yml
global: scrape_interval: 1s evaluation_interval: 1s alerting: alertmanagers: - static_configs: - targets: rule_files: scrape_configs: - job_name: 'prometheus' static_configs: - targets: ['localhost:9090'] - job_name: "coredns" static_configs: - targets: ['localhost:9153']
# Run `./prometheus` as a background process.
# IPython doesn't recognize `&` in inline bash cells.
get_ipython().system_raw('./prometheus &')
In order to show some activity, dig
command could be used to generate a few DNS queries against the CoreDNS server that has been setup:
sudo apt-get install -y -qq dnsutils
dig @127.0.0.1 -p 9053 demo1.example.org
; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @127.0.0.1 -p 9053 demo1.example.org ; (1 server found) ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53868 ;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 3 ;; WARNING: recursion requested but not available ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 4096 ; COOKIE: 855234f1adcb7a28 (echoed) ;; QUESTION SECTION: ;demo1.example.org. IN A ;; ADDITIONAL SECTION: demo1.example.org. 0 IN A 127.0.0.1 _udp.demo1.example.org. 0 IN SRV 0 0 45361 . ;; Query time: 0 msec ;; SERVER: 127.0.0.1#9053(127.0.0.1) ;; WHEN: Tue Mar 03 22:35:20 UTC 2020 ;; MSG SIZE rcvd: 132
dig @127.0.0.1 -p 9053 demo2.example.org
; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> @127.0.0.1 -p 9053 demo2.example.org ; (1 server found) ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53163 ;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 3 ;; WARNING: recursion requested but not available ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 4096 ; COOKIE: f18b2ba23e13446d (echoed) ;; QUESTION SECTION: ;demo2.example.org. IN A ;; ADDITIONAL SECTION: demo2.example.org. 0 IN A 127.0.0.1 _udp.demo2.example.org. 0 IN SRV 0 0 42194 . ;; Query time: 0 msec ;; SERVER: 127.0.0.1#9053(127.0.0.1) ;; WHEN: Tue Mar 03 22:35:21 UTC 2020 ;; MSG SIZE rcvd: 132
Now a CoreDNS server whose metrics are scraped by a Prometheus server and ready to be consumed by TensorFlow.
Create Dataset for CoreDNS metrics and use it in TensorFlow
Create a Dataset for CoreDNS metrics that is available from PostgreSQL server, could be done with tfio.experimental.IODataset.from_prometheus
. At the minimium two arguments are needed. query
is passed to Prometheus server to select the metrics and length
is the period you want to load into Dataset.
You can start with "coredns_dns_request_count_total"
and "5"
(secs) to create the Dataset below. Since earlier in the tutorial two DNS queries were sent, it is expected that the metrics for "coredns_dns_request_count_total"
will be "2.0"
at the end of the time series:
dataset = tfio.experimental.IODataset.from_prometheus(
"coredns_dns_request_count_total", 5, endpoint="http://localhost:9090")
print("Dataset Spec:\n{}\n".format(dataset.element_spec))
print("CoreDNS Time Series:")
for (time, value) in dataset:
# time is milli second, convert to data time:
time = datetime.fromtimestamp(time // 1000)
print("{}: {}".format(time, value['coredns']['localhost:9153']['coredns_dns_request_count_total']))
Dataset Spec: (TensorSpec(shape=(), dtype=tf.int64, name=None), {'coredns': {'localhost:9153': {'coredns_dns_request_count_total': TensorSpec(shape=(), dtype=tf.float64, name=None)} } }) CoreDNS Time Series: 2020-03-03 22:35:17: 2.0 2020-03-03 22:35:18: 2.0 2020-03-03 22:35:19: 2.0 2020-03-03 22:35:20: 2.0 2020-03-03 22:35:21: 2.0
Further looking into the spec of the Dataset:
(
TensorSpec(shape=(), dtype=tf.int64, name=None),
{
'coredns': {
'localhost:9153': {
'coredns_dns_request_count_total': TensorSpec(shape=(), dtype=tf.float64, name=None)
}
}
}
)
It is obvious that the dataset consists of a (time, values)
tuple where the values
field is a python dict expanded into:
"job_name": {
"instance_name": {
"metric_name": value,
},
}
In the above example, 'coredns'
is the job name, 'localhost:9153'
is the instance name, and 'coredns_dns_request_count_total'
is the metric name. Note that depending on the Prometheus query used, it is possible that multiple jobs/instances/metrics could be returned. This is also the reason why python dict has been used in the structure of the Dataset.
Take another query "go_memstats_gc_sys_bytes"
as an example. Since both CoreDNS and Prometheus are written in Golang, "go_memstats_gc_sys_bytes"
metric is available for both "coredns"
job and "prometheus"
job:
dataset = tfio.experimental.IODataset.from_prometheus(
"go_memstats_gc_sys_bytes", 5, endpoint="http://localhost:9090")
print("Time Series CoreDNS/Prometheus Comparision:")
for (time, value) in dataset:
# time is milli second, convert to data time:
time = datetime.fromtimestamp(time // 1000)
print("{}: {}/{}".format(
time,
value['coredns']['localhost:9153']['go_memstats_gc_sys_bytes'],
value['prometheus']['localhost:9090']['go_memstats_gc_sys_bytes']))
Time Series CoreDNS/Prometheus Comparision: 2020-03-03 22:35:17: 2385920.0/2775040.0 2020-03-03 22:35:18: 2385920.0/2775040.0 2020-03-03 22:35:19: 2385920.0/2775040.0 2020-03-03 22:35:20: 2385920.0/2775040.0 2020-03-03 22:35:21: 2385920.0/2775040.0
The created Dataset
is ready to be passed to tf.keras
directly for either training or inference purposes now.
Use Dataset for model training
With metrics Dataset created, it is possible to directly pass the Dataset to tf.keras
for model training or inference.
For demo purposes, this tutorial will just use a very simple LSTM model with 1 feature and 2 steps as input:
n_steps, n_features = 2, 1
simple_lstm_model = tf.keras.models.Sequential([
tf.keras.layers.LSTM(8, input_shape=(n_steps, n_features)),
tf.keras.layers.Dense(1)
])
simple_lstm_model.compile(optimizer='adam', loss='mae')
The dataset to be used is the value of 'go_memstats_sys_bytes' for CoreDNS with 10 samples. However, since a sliding window of window=n_steps
and shift=1
are formed, additional samples are needed (for any two consecute elements, the first is taken as x
and the second is taken as y
for training). The total is 10 + n_steps - 1 + 1 = 12
seconds.
The data value is also scaled to [0, 1]
.
n_samples = 10
dataset = tfio.experimental.IODataset.from_prometheus(
"go_memstats_sys_bytes", n_samples + n_steps - 1 + 1, endpoint="http://localhost:9090")
# take go_memstats_gc_sys_bytes from coredns job
dataset = dataset.map(lambda _, v: v['coredns']['localhost:9153']['go_memstats_sys_bytes'])
# find the max value and scale the value to [0, 1]
v_max = dataset.reduce(tf.constant(0.0, tf.float64), tf.math.maximum)
dataset = dataset.map(lambda v: (v / v_max))
# expand the dimension by 1 to fit n_features=1
dataset = dataset.map(lambda v: tf.expand_dims(v, -1))
# take a sliding window
dataset = dataset.window(n_steps, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda d: d.batch(n_steps))
# the first value is x and the next value is y, only take 10 samples
x = dataset.take(n_samples)
y = dataset.skip(1).take(n_samples)
dataset = tf.data.Dataset.zip((x, y))
# pass the final dataset to model.fit for training
simple_lstm_model.fit(dataset.batch(1).repeat(10), epochs=5, steps_per_epoch=10)
Train for 10 steps Epoch 1/5 10/10 [==============================] - 2s 150ms/step - loss: 0.8484 Epoch 2/5 10/10 [==============================] - 0s 10ms/step - loss: 0.7808 Epoch 3/5 10/10 [==============================] - 0s 10ms/step - loss: 0.7102 Epoch 4/5 10/10 [==============================] - 0s 11ms/step - loss: 0.6359 Epoch 5/5 10/10 [==============================] - 0s 11ms/step - loss: 0.5572 <tensorflow.python.keras.callbacks.History at 0x7f1758f3da90>
The trained model above is not very useful in reality, as the CoreDNS server that has been setup in this tutorial does not have any workload. However, this is a working pipeline that could be used to load metrics from true production servers. The model could then be improved to solve the real-world problem of devops automation.