Monitoring a batch job with Prometheus

Question:

There is a Python batch job that pushes huge file(s) to a shared location. Once the file(s) are pushed, a couple of tests are run against them.
I’m trying to collect some metrics about the batch job and am planning to use Node exporter with the metrics or labels below.

file_push_status (success or failure)
first_test_status (Pass or Fail)
second_test_status (Pass or Fail)
first_test_time_taken (How long)
second_test_time_taken (How long)

I have gone through the Prometheus documentation, but it is not clear to me whether a Summary or a Histogram should be used here. I understand that Prometheus doesn’t support Booleans (the first three cases), so how should those be handled?

If needed, I will attach the existing batch job code. Thank you.

Asked By: Chel MS


Answers:

For a small number of files, you don’t need histograms.

Make all three metrics gauges, with 0/1 values standing in for the Boolean statuses.

Something like:

# HELP file_push_success A metric with 0/1 value showing result of file push job. 0 - failure.
# TYPE file_push_success gauge
file_push_success{file="filename.txt"} 1 

# HELP file_push_test_success A metric with 0/1 value showing result of corresponding test after file being pushed. 0 - failure.
# TYPE file_push_test_success gauge
file_push_test_success{file="filename.txt", test="1"} 1
file_push_test_success{file="filename.txt", test="2"} 0

# HELP file_push_test_duration_seconds Duration in seconds of the corresponding test run after the file was pushed.
# TYPE file_push_test_duration_seconds gauge
file_push_test_duration_seconds{file="filename.txt", test="1"} 5
file_push_test_duration_seconds{file="filename.txt", test="2"} 13

Here I grouped related metrics into one metric with different labels. This is easier to maintain (for example, when you decide to add new tests) and is generally advised by the Prometheus documentation.
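Since the question mentions Node exporter, one way to expose output like the above is its textfile collector, which reads files ending in .prom from the directory given by --collector.textfile.directory. Below is a minimal sketch of how the batch job might write such a file at the end of a run, using only the standard library. The function names (escape_label, render_metrics, write_atomically), the file name batch_job.prom, and the shape of the tests argument are assumptions made for illustration, not part of the original job.

```python
import os
import tempfile


def escape_label(value):
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(filename, push_ok, tests):
    """Render the three gauges in the Prometheus text exposition format.

    tests maps a test label to a (passed, duration_seconds) tuple
    (a hypothetical structure chosen for this sketch).
    """
    name = escape_label(filename)
    lines = [
        "# HELP file_push_success 1 if the file push succeeded, 0 if it failed.",
        "# TYPE file_push_success gauge",
        f'file_push_success{{file="{name}"}} {int(push_ok)}',
        "# HELP file_push_test_success 1 if the test passed, 0 if it failed.",
        "# TYPE file_push_test_success gauge",
    ]
    for test, (passed, _) in tests.items():
        label = escape_label(test)
        lines.append(
            f'file_push_test_success{{file="{name}",test="{label}"}} {int(passed)}'
        )
    lines += [
        "# HELP file_push_test_duration_seconds Duration of the test in seconds.",
        "# TYPE file_push_test_duration_seconds gauge",
    ]
    for test, (_, seconds) in tests.items():
        label = escape_label(test)
        lines.append(
            f'file_push_test_duration_seconds{{file="{name}",test="{label}"}} {seconds}'
        )
    return "\n".join(lines) + "\n"


def write_atomically(text, path):
    """Write to a temporary file, then rename it into place.

    The rename keeps the collector from reading a half-written file; the
    temporary name does not end in .prom, so it is ignored meanwhile.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file as owner-only; make it readable by the exporter.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)


if __name__ == "__main__":
    # A temporary directory stands in for the real textfile directory here.
    target = os.path.join(tempfile.mkdtemp(), "batch_job.prom")
    results = {"1": (True, 5), "2": (False, 13)}
    write_atomically(render_metrics("filename.txt", True, results), target)
    print(open(target).read())
```

In a real setup, the target path would point into the directory configured for the textfile collector. Adding a third test then only means adding one more entry to the results mapping, with no new metric names.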

Answered By: markalex