How to make a loop for multiple scatterplots in Python?
Question:
I am trying to automate the plotting procedure of a large dataframe matrix. The goal is to plot each column against another column. Each column represents a variable. See also the image below.
For example: sex vs age, sex vs BMI, sex vs smoke, sex vs type, and so on.
For the sake of clarity, I have simplified the problem to the image below:
enter image description here
Initially, I tried to plot each combination by hand, but this is a rather time-consuming exercise and not what I want.
I also tried this (not working):
variables = ["Sex", "Age", "BMI"]
for variable in variables:
    plt.scatter(df.variable, df.variable)
    plt.xlabel('variable')
    plt.ylabel('variable')
    plt.title('variable vs. variable')
    plt.show()
Any help is welcome!
PS: If it would be a simple exercise to incorporate a linear regression on each combination of variables as well, that would also be appreciated.
Greetings,
Nadia
Answers:
What you coded plots each column against itself. What you described is a nested loop. A simple upgrade is
col_choice = ["Sex", "Age", "BMI"]
for pos, axis1 in enumerate(col_choice):   # Pick a first col
    for axis2 in col_choice[pos+1:]:       # Pick a later col (plain slice: only the name is needed)
        plt.scatter(df.loc[:, axis1], df.loc[:, axis2])
        plt.xlabel(axis1)
        plt.ylabel(axis2)
        plt.show()
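I think df.loc[:, axis1] generates a Series acceptable to scatter.
Does that help? If you want to be more "Pythonic", then look into itertools to generate your column choices. Note that itertools.combinations yields the same unordered pairs as the nested loop, whereas itertools.product would also pair each column with itself and give both orderings.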
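As a quick check of the itertools suggestion, here is a minimal sketch that compares what combinations and product yield for the three column names from the question. The variable names pairs and all_pairs are only for illustration.

```python
from itertools import combinations, product

col_choice = ["Sex", "Age", "BMI"]

# combinations: each unordered pair once, never a column with itself
pairs = list(combinations(col_choice, 2))
print(pairs)

# product: every ordered pair, including self-pairs such as ("Sex", "Sex")
all_pairs = list(product(col_choice, repeat=2))
print(len(all_pairs))
```

For plotting each variable against each other variable once, the list in pairs is probably what you want.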
You could do something like this:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Create dummy dataframe, or load your own with pd.read_csv()
columns = ["sex", "age", "BMI", "smoke", "type"]
data = pd.DataFrame(np.array([[1,0,0,1,0], [23,16,94,18,24], [32, 26, 28, 23, 19], [0,1,1,1,0], [1,2,2,2,1]]).T, columns=columns)
x_col = "sex"
y_columns = ["age", "BMI", "smoke"]
for y_col in y_columns:
    fig, ax = plt.subplots()  # new figure and axes for each plot
    ax.scatter(data[x_col], data[y_col])
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title("{} vs {}".format(x_col, y_col))
    plt.show()
Basically, if you have your dataset saved as a .csv file, you can load it with pandas using pd.read_csv(), use the column names as keys to access the corresponding columns, and iterate on that (here I created a dummy dataframe just for the sake of it).
Regarding the linear regression part, you should check out the scikit-learn library. It has a lot of models for many different tasks such as regression, classification, and clustering.
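If you only need a straight-line fit per pair, a lighter option than scikit-learn is numpy.polyfit with degree 1. The sketch below swaps that in (it is not the scikit-learn route suggested above) and reuses the dummy dataframe from this answer. The helper name fit_line and the dict fits are hypothetical, chosen just for this example.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Same dummy data as in the answer above; replace with your own DataFrame.
columns = ["sex", "age", "BMI", "smoke", "type"]
data = pd.DataFrame(
    np.array([[1, 0, 0, 1, 0], [23, 16, 94, 18, 24], [32, 26, 28, 23, 19],
              [0, 1, 1, 1, 0], [1, 2, 2, 2, 1]]).T,
    columns=columns,
)


def fit_line(df, x_col, y_col):
    """Least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    slope, intercept = np.polyfit(df[x_col], df[y_col], deg=1)
    return slope, intercept


# Fit one line per unordered pair of the chosen columns.
fits = {}
for x_col, y_col in combinations(["age", "BMI", "smoke"], 2):
    slope, intercept = fit_line(data, x_col, y_col)
    fits[(x_col, y_col)] = (slope, intercept)
    print(f"{y_col} = {slope:.3f} * {x_col} + {intercept:.3f}")
```

To draw the fitted line on top of a scatter plot, you could call ax.plot(x, slope * x + intercept) on the same axes, where x is the column used on the horizontal axis.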
You could use combinations from itertools. This way you will get an iterator with tuples of the combinations.
from itertools import combinations
print(list(combinations(df.columns, 2)))
The code you need would look like this:
from itertools import combinations
for col1, col2 in combinations(df.columns, 2):  # <----- every unordered pair of columns
    plt.scatter(df[col1], df[col2])
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.show()
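With many columns, one window per pair gets unwieldy. A possible variation, sketched here under the assumption that a single grid of panels is acceptable, puts every pair on its own subplot. The helper plot_pairs and its ncols parameter are hypothetical, and the small dataframe is dummy data for illustration only.

```python
from itertools import combinations

import matplotlib.pyplot as plt
import pandas as pd


def plot_pairs(df, ncols=3):
    """Scatter every unordered column pair on one grid (hypothetical helper)."""
    pairs = list(combinations(df.columns, 2))
    nrows = -(-len(pairs) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols,
                             figsize=(4 * ncols, 3 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, (col1, col2) in zip(flat_axes, pairs):
        ax.scatter(df[col1], df[col2])
        ax.set_xlabel(col1)
        ax.set_ylabel(col2)
        ax.set_title(f"{col1} vs {col2}")
    for ax in flat_axes[len(pairs):]:
        ax.set_visible(False)  # hide panels that have no pair to show
    fig.tight_layout()
    return fig


# Dummy data for illustration only; use your own DataFrame instead.
df = pd.DataFrame({
    "sex": [1, 0, 0, 1, 0],
    "age": [23, 16, 94, 18, 24],
    "BMI": [32, 26, 28, 23, 19],
    "smoke": [0, 1, 1, 1, 0],
})
fig = plot_pairs(df)
# Call plt.show() to display the grid, or fig.savefig("pairs.png") to save it.
```

Four columns give six pairs, so the default of three panels per row fills a 2 x 3 grid.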