Find out which batch jobs are causing peak loads on database server
Question:
I have datasets that look like this:
[['Timestamp', 'CPU%', 'IO', 'Job1', 'Job2', 'Job3'], ['2022-08-06 10:31:59.233', '10', '90', 1, 0, 0], ['2022-08-06 10:32:19.235', '30', '40', 1, 4, 2]]
It is a Pandas DataFrame with columns for Timestamp, CPU% utilization, and IO. Additionally, the columns Job1, Job2, Job3 give how many batch jobs of each kind were running at the given timestamp.
For instance, according to the sample data above, at timestamp 2022-08-06 10:31:59.233 CPU% utilization was only 10% while IOPS was 90. Only one job was active at that time, and it was of type Job1.
In reality, I have many job types (70 or more), and at any given time at most 10 jobs can be active.
Now, I want to understand which job types are causing CPU and IO spikes. What would be the best plots in Seaborn or Pandas to understand this?
We could compute the correlation of each Job variable against CPU and IO and plot them, but that is cumbersome: we would have to make sense of all 70+ correlations.
Any ideas on how to do this more simply?
Best Regards
Answers:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    'Timestamp': ['2022-08-06 10:31:59.233', '2022-08-06 10:32:19.235', '2022-08-06 10:32:39.235'],
    'CPU%': [10, 30, 40],
    'IO': [90, 40, 50],
    'Job1': [1, 1, 1],
    'Job2': [0, 4, 2],
    'Job3': [0, 2, 0]
})
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# melt CPU% and IO into one long column, then melt the job counts as well
df_plot = df.melt(value_vars=['CPU%', 'IO'], var_name='CPU_IO', value_name='CI_value', id_vars=['Timestamp', 'Job1', 'Job2', 'Job3'])
df_plot = df_plot.melt(value_vars=['Job1', 'Job2', 'Job3'], value_name='Job_count', var_name='Job', id_vars=['Timestamp', 'CPU_IO', 'CI_value'])
###
print(df_plot)
###
Timestamp CPU_IO CI_value Job Job_count
0 2022-08-06 10:31:59.233 CPU% 10 Job1 1
1 2022-08-06 10:32:19.235 CPU% 30 Job1 1
2 2022-08-06 10:32:39.235 CPU% 40 Job1 1
3 2022-08-06 10:31:59.233 IO 90 Job1 1
4 2022-08-06 10:32:19.235 IO 40 Job1 1
5 2022-08-06 10:32:39.235 IO 50 Job1 1
6 2022-08-06 10:31:59.233 CPU% 10 Job2 0
7 2022-08-06 10:32:19.235 CPU% 30 Job2 4
8 2022-08-06 10:32:39.235 CPU% 40 Job2 2
9 2022-08-06 10:31:59.233 IO 90 Job2 0
10 2022-08-06 10:32:19.235 IO 40 Job2 4
11 2022-08-06 10:32:39.235 IO 50 Job2 2
12 2022-08-06 10:31:59.233 CPU% 10 Job3 0
13 2022-08-06 10:32:19.235 CPU% 30 Job3 2
14 2022-08-06 10:32:39.235 CPU% 40 Job3 0
15 2022-08-06 10:31:59.233 IO 90 Job3 0
16 2022-08-06 10:32:19.235 IO 40 Job3 2
17 2022-08-06 10:32:39.235 IO 50 Job3 0
fig, ax1 = plt.subplots(figsize=(12, 6))
# point plot for CPU% and IO on the left axis
# (the marker/markersize/linewidth keywords need seaborn >= 0.13)
sns.pointplot(x='Timestamp', y='CI_value', hue='CPU_IO', data=df_plot, ax=ax1,
              marker='o', markersize=5, linewidth=1, palette='Set1')
ax2 = ax1.twinx()
# bar plot for the job counts on the right axis
sns.barplot(x='Timestamp', y='Job_count', hue='Job', data=df_plot, ax=ax2, alpha=0.5, palette='Set2')
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.show()
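Since eyeballing 70+ individual correlations is what the question wants to avoid, one compact alternative is to compute every job-vs-load correlation in a single `corrwith` call and rank the results, so only the strongest candidates need a look. The sketch below uses synthetic stand-in data (the Job2/Job5/Job3 effects and the six job columns are invented for illustration, not from the question):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real frame: 6 job-count columns plus CPU% and IO.
# Job2 is wired to drive CPU%, Job3 to drive IO (hypothetical effects).
rng = np.random.default_rng(0)
n = 200
jobs = pd.DataFrame(
    rng.integers(0, 5, size=(n, 6)),
    columns=[f'Job{i}' for i in range(1, 7)],
)
load = pd.DataFrame({
    'CPU%': 10 * jobs['Job2'] + 3 * jobs['Job5'] + rng.normal(0, 2, n),
    'IO': 8 * jobs['Job3'] + rng.normal(0, 2, n),
})
df = pd.concat([load, jobs], axis=1)

job_cols = [c for c in df.columns if c.startswith('Job')]
# One correlation per job type against each load metric, all at once.
corr = df[job_cols].corrwith(df['CPU%']).to_frame('CPU%')
corr['IO'] = df[job_cols].corrwith(df['IO'])
# Rank by absolute CPU% correlation so the likely culprits come first.
top = corr.reindex(corr['CPU%'].abs().sort_values(ascending=False).index)
print(top.head())
```

With 70+ real job types, `sns.heatmap(corr)` on this two-column frame gives a single readable overview instead of 70 separate plots.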
Plot via Plotly
With Plotly, we can plot directly from your df. Since you have plenty of job types, a stacked bar chart is easier to read than an unstacked one.
df
###
Timestamp CPU% IO Job1 Job2 Job3
0 2022-08-06 10:31:59.233 10 90 1 0 0
1 2022-08-06 10:32:19.235 30 40 1 4 2
2 2022-08-06 10:32:39.235 40 50 1 2 0
import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig_go = make_subplots(specs=[[{"secondary_y": True}]])
# add scatter traces for CPU% and IO on the primary y-axis
for col in df.columns.values[1:3]:
    fig_go.add_trace(go.Scatter(x=df['Timestamp'], y=df[col], name=col), secondary_y=False)
# add stacked bar traces for Job1, Job2, Job3 on the secondary y-axis
for col in df.columns.values[3:]:
    fig_go.add_trace(go.Bar(x=df['Timestamp'], y=df[col], name=col, opacity=0.5), secondary_y=True)
fig_go.update_layout(title='CPU and IO vs. Job1, Job2, Job3', barmode='stack')
fig_go.update_yaxes(range=[0, 100], secondary_y=False)
fig_go.update_yaxes(range=[0, 30], secondary_y=True)
fig_go.show()
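Beyond plotting, a simple tabular check along the same lines is to label the top-quartile CPU% samples as "peak" and compare the average job counts during peaks against the rest; the job types whose counts jump the most during peaks are the likely culprits. This is only a sketch on toy data (the values below are made up, and the 0.75 quantile threshold is an arbitrary choice):

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
df = pd.DataFrame({
    'CPU%': [10, 30, 40, 90, 85, 20],
    'IO':   [90, 40, 50, 95, 80, 30],
    'Job1': [1, 1, 1, 0, 0, 1],
    'Job2': [0, 4, 2, 6, 5, 0],
    'Job3': [0, 2, 0, 1, 3, 0],
})

job_cols = ['Job1', 'Job2', 'Job3']
# Mark samples in the top CPU% quartile as "peak".
peak = df['CPU%'] >= df['CPU%'].quantile(0.75)
# Mean job count during peaks minus mean during the rest;
# large positive gaps point at the job types driving the spikes.
diff = (df.loc[peak, job_cols].mean()
        - df.loc[~peak, job_cols].mean()).sort_values(ascending=False)
print(diff)
```

On real data the same two lines work unchanged for `IO`, and `diff.head(10)` keeps the output manageable with 70+ job types.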