Find out which batch jobs are causing peak loads on database server
Question:
I have datasets that look like this:
[['Timestamp', 'CPU%', 'IO', 'Job1', 'Job2', 'Job3'], ['2022-08-06 10:31:59.233', '10', '90', 1, 0, 0], ['2022-08-06 10:32:19.235', '30', '40', 1, 4, 2]]
It is a Pandas DataFrame with columns for Timestamp, CPU% utilization, and IO. Additionally, the columns Job1, Job2, Job3 give how many batch jobs of each kind were running at the given timestamp.
For instance, according to the sample data above, at timestamp 2022-08-06 10:31:59.233 CPU% utilization was only 10% while IOPS was 90. Only one job was active at that time, and it was of type Job1.
In reality, I have many job types (70 or more), and at any given time at most 10 jobs can be active.
Now, I want to understand which job types are causing CPU and IO spikes. What would be the best plots in Seaborn or Pandas to understand this?
We could compute the correlation of each Job variable against CPU and IO and plot them, but that is cumbersome: we would have to make sense of all 70+ correlations.
Any ideas on how to do this more simply?
Best Regards
Answers:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    'Timestamp': ['2022-08-06 10:31:59.233', '2022-08-06 10:32:19.235', '2022-08-06 10:32:39.235'],
    'CPU%': [10, 30, 40],
    'IO': [90, 40, 50],
    'Job1': [1, 1, 1],
    'Job2': [0, 4, 2],
    'Job3': [0, 2, 0]
})
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# melt CPU% and IO into one long column, then melt the job counts as well
df_plot = df.melt(value_vars=['CPU%', 'IO'], var_name='CPU_IO', value_name='CI_value', id_vars=['Timestamp', 'Job1', 'Job2', 'Job3'])
df_plot = df_plot.melt(value_vars=['Job1', 'Job2', 'Job3'], value_name='Job_count', var_name='Job', id_vars=['Timestamp', 'CPU_IO', 'CI_value'])
###
print(df_plot)
###
Timestamp CPU_IO CI_value Job Job_count
0 2022-08-06 10:31:59.233 CPU% 10 Job1 1
1 2022-08-06 10:32:19.235 CPU% 30 Job1 1
2 2022-08-06 10:32:39.235 CPU% 40 Job1 1
3 2022-08-06 10:31:59.233 IO 90 Job1 1
4 2022-08-06 10:32:19.235 IO 40 Job1 1
5 2022-08-06 10:32:39.235 IO 50 Job1 1
6 2022-08-06 10:31:59.233 CPU% 10 Job2 0
7 2022-08-06 10:32:19.235 CPU% 30 Job2 4
8 2022-08-06 10:32:39.235 CPU% 40 Job2 2
9 2022-08-06 10:31:59.233 IO 90 Job2 0
10 2022-08-06 10:32:19.235 IO 40 Job2 4
11 2022-08-06 10:32:39.235 IO 50 Job2 2
12 2022-08-06 10:31:59.233 CPU% 10 Job3 0
13 2022-08-06 10:32:19.235 CPU% 30 Job3 2
14 2022-08-06 10:32:39.235 CPU% 40 Job3 0
15 2022-08-06 10:31:59.233 IO 90 Job3 0
16 2022-08-06 10:32:19.235 IO 40 Job3 2
17 2022-08-06 10:32:39.235 IO 50 Job3 0
fig, ax1 = plt.subplots(figsize=(12, 6))
# point plot for CPU% and IO on the left axis
# (the marker/markersize/linewidth keywords need seaborn >= 0.13)
sns.pointplot(x='Timestamp', y='CI_value', hue='CPU_IO', data=df_plot, ax=ax1,
              marker='o', markersize=5, linewidth=1, palette='Set1')
ax2 = ax1.twinx()
# bar plot for the job counts on the right axis
sns.barplot(x='Timestamp', y='Job_count', hue='Job', data=df_plot, ax=ax2, alpha=0.5, palette='Set2')
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.show()
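Since eyeballing 70+ individual correlations is what the question wants to avoid, one compact alternative is to compute every job-vs-load correlation in a single `corrwith` call and rank the results, so only the strongest candidates need a look. The sketch below uses synthetic stand-in data (the Job2/Job5/Job3 effects and the six job columns are invented for illustration, not from the question):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real frame: 6 job-count columns plus CPU% and IO.
# Job2 is wired to drive CPU%, Job3 to drive IO (hypothetical effects).
rng = np.random.default_rng(0)
n = 200
jobs = pd.DataFrame(
    rng.integers(0, 5, size=(n, 6)),
    columns=[f'Job{i}' for i in range(1, 7)],
)
load = pd.DataFrame({
    'CPU%': 10 * jobs['Job2'] + 3 * jobs['Job5'] + rng.normal(0, 2, n),
    'IO': 8 * jobs['Job3'] + rng.normal(0, 2, n),
})
df = pd.concat([load, jobs], axis=1)

job_cols = [c for c in df.columns if c.startswith('Job')]
# One correlation per job type against each load metric, all at once.
corr = df[job_cols].corrwith(df['CPU%']).to_frame('CPU%')
corr['IO'] = df[job_cols].corrwith(df['IO'])
# Rank by absolute CPU% correlation so the likely culprits come first.
top = corr.reindex(corr['CPU%'].abs().sort_values(ascending=False).index)
print(top.head())
```

With 70+ real job types, `sns.heatmap(corr)` on this two-column frame gives a single readable overview instead of 70 separate plots.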
Plot via Plotly
With Plotly, we can plot directly from your df. Since you have plenty of job types, a stacked bar chart is easier to read than an unstacked one.
df
###
Timestamp CPU% IO Job1 Job2 Job3
0 2022-08-06 10:31:59.233 10 90 1 0 0
1 2022-08-06 10:32:19.235 30 40 1 4 2
2 2022-08-06 10:32:39.235 40 50 1 2 0
import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig_go = make_subplots(specs=[[{"secondary_y": True}]])
# add scatter traces for CPU% and IO on the primary y-axis
for col in df.columns.values[1:3]:
    fig_go.add_trace(go.Scatter(x=df['Timestamp'], y=df[col], name=col), secondary_y=False)
# add stacked bar traces for Job1, Job2, Job3 on the secondary y-axis
for col in df.columns.values[3:]:
    fig_go.add_trace(go.Bar(x=df['Timestamp'], y=df[col], name=col, opacity=0.5), secondary_y=True)
fig_go.update_layout(title='CPU and IO vs. Job1, Job2, Job3', barmode='stack')
fig_go.update_yaxes(range=[0, 100], secondary_y=False)
fig_go.update_yaxes(range=[0, 30], secondary_y=True)
fig_go.show()
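Beyond plotting, a simple tabular check along the same lines is to label the top-quartile CPU% samples as "peak" and compare the average job counts during peaks against the rest; the job types whose counts jump the most during peaks are the likely culprits. This is only a sketch on toy data (the values below are made up, and the 0.75 quantile threshold is an arbitrary choice):

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
df = pd.DataFrame({
    'CPU%': [10, 30, 40, 90, 85, 20],
    'IO':   [90, 40, 50, 95, 80, 30],
    'Job1': [1, 1, 1, 0, 0, 1],
    'Job2': [0, 4, 2, 6, 5, 0],
    'Job3': [0, 2, 0, 1, 3, 0],
})

job_cols = ['Job1', 'Job2', 'Job3']
# Mark samples in the top CPU% quartile as "peak".
peak = df['CPU%'] >= df['CPU%'].quantile(0.75)
# Mean job count during peaks minus mean during the rest;
# large positive gaps point at the job types driving the spikes.
diff = (df.loc[peak, job_cols].mean()
        - df.loc[~peak, job_cols].mean()).sort_values(ascending=False)
print(diff)
```

On real data the same two lines work unchanged for `IO`, and `diff.head(10)` keeps the output manageable with 70+ job types.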