Is there a liveness probe in Kubernetes that can catch when a python container freezes?

Question:

I have a Python program that runs an infinite loop; however, every once in a while the code freezes. No errors are raised, and there is no other message that would alert me that something is wrong. I was wondering if Kubernetes has a liveness probe that could catch when the code freezes so it can kill and restart the container.

My idea is to have the Python code write a log entry every time it completes the loop. A liveness probe could then check the log file every 30 seconds or so to see if it has been updated. If the file has not been updated within the allotted time, the program is assumed to have frozen, and the container is killed and restarted.

I am currently using the following python code to test with:

# Libraries
import logging
import random as r
from time import sleep

# Global variables
FREEZE_TIME = 60  # seconds to sleep when simulating a freeze


def main():
    '''Starts an infinite loop that has a 10% chance of
    freezing on each iteration.'''
    # Create .log file to hold logged info.
    logging.basicConfig(filename="freeze.log", level=logging.INFO)

    # Start infinite loop
    while True:
        freeze = r.randint(1, 10)  # 10% chance of freezing.
        sleep(2)
        logging.info('Running infinite loop...')
        print("Running infinite loop...")

        # Simulate a freeze.
        if freeze == 1:
            print(f"Simulating freeze for {FREEZE_TIME} sec.")
            sleep(FREEZE_TIME)


# Start code with main()
if __name__ == "__main__":
    main()

If anyone could tell me how to implement this log idea, or if there is a better way to do this, I would be most grateful! I am currently using Kubernetes on Docker Desktop for Windows 10, if this makes a difference. Also, I am fairly new to this, so if you could keep your answers to a "Kubernetes for dummies" level I would appreciate it.

Asked By: boblerbob

||

Answers:

A common approach to liveness probes in Kubernetes is to access an HTTP endpoint (if the application has one). Kubernetes checks whether the response status code falls into the 200-399 range (success) or not (failure). Running an HTTP server is not mandatory, as you can run a command or sequence of commands instead. In that case the health status is based on the exit code (0 is OK, anything else is a failure).
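For the HTTP variant, a minimal sketch could look like the following. It is not taken from the question's program: it assumes the loop calls a hypothetical `beat()` function once per iteration, and the names `state`, `MAX_SILENCE`, `HealthHandler`, `start_health_server` and the port 8080 are all made up for illustration. The endpoint answers 200 while the loop keeps beating and 503 once it has been silent for too long.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_SILENCE = 30.0  # hypothetical threshold in seconds; tune to the loop

# Shared state: monotonic time of the last completed loop iteration.
state = {"last_beat": time.monotonic()}


def beat():
    """Record that the main loop has just completed an iteration."""
    state["last_beat"] = time.monotonic()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Any GET path gets the same answer: 200 if the loop beat
        # recently, 503 if it has been silent longer than MAX_SILENCE.
        healthy = time.monotonic() - state["last_beat"] <= MAX_SILENCE
        self.send_response(200 if healthy else 503)
        self.end_headers()
        self.wfile.write(b"ok\n" if healthy else b"stale\n")

    def log_message(self, format, *args):
        pass  # keep probe requests out of the container's output


def start_health_server(port=8080):
    """Serve the health endpoint from a daemon thread; return the server."""
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The main program would call `start_health_server()` once before the loop and `beat()` at the end of each iteration, and the probe would then use `httpGet` with the matching `port` instead of `exec`. If the whole process hangs so badly that the server thread cannot answer, the probe request times out, which Kubernetes also counts as a failure.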

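Given the nature of your script and your log idea, I would write another Python script that reads the last line of the log and parses its timestamp. Note that logging.basicConfig does not write timestamps by default, so the log format needs to include one (for example, by passing a format that contains %(asctime)s). Then, if the difference between the current time and the timestamp is greater than [insert reasonable amount], exit(1); otherwise exit(0).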
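A rough sketch of such a check is shown below. It rests on assumptions that are not in the question's code: the logger is configured with `format="%(asctime)s %(message)s"`, so every line starts with the default `asctime` layout, and the 30-second threshold and the helper names `last_timestamp`, `is_healthy` and `main` are hypothetical.

```python
from datetime import datetime, timedelta

LOG_FILE = "freeze.log"          # log file written by the main program
MAX_AGE = timedelta(seconds=30)  # hypothetical threshold; tune to the loop

# Default layout of %(asctime)s, e.g. "2024-01-31 12:00:00,123".
ASCTIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
ASCTIME_LENGTH = 23


def last_timestamp(path):
    """Return the timestamp of the last non-empty log line, or None."""
    try:
        with open(path, encoding="utf-8") as log:
            lines = [line for line in log if line.strip()]
    except OSError:
        return None  # no readable log file (yet)
    if not lines:
        return None
    try:
        return datetime.strptime(lines[-1][:ASCTIME_LENGTH], ASCTIME_FORMAT)
    except ValueError:
        return None  # last line does not start with a timestamp


def is_healthy(path=LOG_FILE, max_age=MAX_AGE, now=None):
    """True if the last log entry is at most max_age old."""
    stamp = last_timestamp(path)
    if stamp is None:
        return False
    now = now or datetime.now()
    return now - stamp <= max_age


def main():
    """Return the exit code the probe should see: 0 healthy, 1 frozen."""
    return 0 if is_healthy() else 1
```

Saved as check_health.py with `import sys` at the top and `sys.exit(main())` as its last line, the script gives the probe the exit code it needs. Reading the whole file on every probe is fine for a test like this, but a long-running program would want log rotation or a check that only reads the end of the file.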

Once the health-check script is ready, you can enable it like this:

spec:
  containers:
  - name: my-app  # container names must be DNS labels, so no underscores
    image: my_image
    livenessProbe:
      exec:
        command:  # the command to run
        - python3
        - check_health.py
      initialDelaySeconds: 5  # wait 5 sec after start for the log to appear
      periodSeconds: 5  # run every 5 seconds

The documentation has a detailed explanation with some great examples.

Answered By: anemyte