How to make a loop for multiple scatterplots in Python?

Question:

I am trying to automate the plotting procedure for a large dataframe. The goal is to plot each column against every other column. Each column represents a variable. See also the image below.

For example: sex vs age, sex vs BMI, sex vs smoke, sex vs type, and so on.

For the sake of clarity, I have simplified the problem to the image below:
enter image description here

Initially, I tried to plot each combination by hand, but this is a rather time-consuming exercise and not what I want.

I also tried this (not working):

variables = ["Sex", "Age", "BMI"]
for variable in variables:
    plt.scatter(df.variable, df.variable)
    plt.xlabel('variable')
    plt.ylabel('variable')
    plt.title('variable vs. variable')
    plt.show()

Any help is welcome!

PS: If it would be simple to incorporate a linear regression for each combination of variables as well, that would also be appreciated.

Greetings,

Nadia

Asked By: Nadia Merquez


Answers:

What you coded plots each column against itself. What you described is a nested loop. A simple upgrade is

col_choice = ["Sex", "Age", "BMI"]

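for pos, axis1 in enumerate(col_choice):   # Pick a first col
    for axis2 in col_choice[pos+1:]:   # Pick a later col (no enumerate here, we only need the name)
        plt.scatter(df.loc[:, axis1], df.loc[:, axis2])
        plt.xlabel(axis1)
        plt.ylabel(axis2)
        plt.show()   # One figure per pair

Each df.loc[:, col] selection returns a Series, which scatter accepts.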

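Does that help? If you want to be more “Pythonic”, then look into itertools.combinations to generate your column pairs (itertools.product would also give you each column paired with itself, and every pair in both orders).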
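To make the difference between the itertools helpers concrete, here is a small sketch using the same col_choice list. It only prints the pairs and does no plotting; the variable names pairs and all_pairs are just for illustration.

```python
from itertools import combinations, product

col_choice = ["Sex", "Age", "BMI"]

# combinations: each unordered pair exactly once, never a column with itself.
# This matches what the nested loop above produces.
pairs = list(combinations(col_choice, 2))
print(pairs)

# product: every ordered pair, including ("Sex", "Sex") and both
# ("Sex", "Age") and ("Age", "Sex").
all_pairs = list(product(col_choice, repeat=2))
print(len(all_pairs))
```

For "each variable against each other variable, once", combinations is usually the closer fit.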

Answered By: Prune

You could do something like this:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Create dummy dataframe, or load your own with pd.read_csv()

columns = ["sex", "age", "BMI", "smoke", "type"]
data = pd.DataFrame(np.array([[1,0,0,1,0], [23,16,94,18,24], [32, 26, 28, 23, 19], [0,1,1,1,0], [1,2,2,2,1]]).T, columns=columns)


x_col = "sex"
y_columns = ["age", "BMI", "smoke"]


for y_col in y_columns:

    # Create a fresh figure and axes for each pair
    fig, ax = plt.subplots()
    ax.scatter(data[x_col], data[y_col])
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title("{} vs {}".format(x_col, y_col))

    plt.show()

Basically, if your dataset is saved as a .csv file, you can load it with pandas using pd.read_csv(), use the column names as keys to access the corresponding columns, and iterate over those (here I created a dummy dataframe just for illustration).

Regarding the linear regression part, you should check out the scikit-learn library. It has many models for tasks such as regression, classification, and clustering.
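If all you need is a straight line through each scatterplot, a full modelling library may not be necessary. The following is a minimal sketch that uses numpy.polyfit with degree 1 for an ordinary least-squares line. The helper name plot_with_fit is hypothetical, and the data is an illustrative subset of the dummy dataframe, not real measurements.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative dummy data (assumption: your real data has numeric columns)
data = pd.DataFrame({
    "sex": [1, 0, 0, 1, 0],
    "age": [23, 16, 94, 18, 24],
    "BMI": [32, 26, 28, 23, 19],
})


def plot_with_fit(df, x_col, y_col):
    """Scatter y_col against x_col and overlay a least-squares line.

    Hypothetical helper; returns the figure plus the fitted slope and intercept.
    """
    x = df[x_col].to_numpy(dtype=float)
    y = df[y_col].to_numpy(dtype=float)

    # A degree-1 polynomial fit returns the coefficients as (slope, intercept)
    slope, intercept = np.polyfit(x, y, 1)

    fig, ax = plt.subplots()
    ax.scatter(x, y, label="data")

    # Draw the fitted line across the observed x range
    x_line = np.array([x.min(), x.max()])
    ax.plot(x_line, slope * x_line + intercept, color="red", label="linear fit")

    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title("{} vs {}".format(x_col, y_col))
    ax.legend()
    return fig, slope, intercept


fig, slope, intercept = plot_with_fit(data, "age", "BMI")
```

Call plt.show() afterwards to display the figure. For a categorical column such as sex, a fitted line is of limited use, so you may want to restrict this to pairs of continuous variables.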

Answered By: neko

You could use combinations from itertools. This gives you an iterator of tuples, one for each pair of columns.

from itertools import combinations


print(list(combinations(df.columns, 2)))

The code you need would look like this:

from itertools import combinations


for col1, col2 in combinations(df.columns, 2): # <-----
    plt.scatter(df[col1], df[col2])
    plt.show()
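With many columns, one window per pair gets unwieldy. As a possible variation, the sketch below places all pairs in a single grid of labelled subplots. The dataframe contents and the choice of three panels per row are assumptions for illustration only.

```python
from itertools import combinations
import math

import matplotlib.pyplot as plt
import pandas as pd

# Illustrative dummy data; replace with your own dataframe
df = pd.DataFrame({
    "sex": [1, 0, 0, 1, 0],
    "age": [23, 16, 94, 18, 24],
    "BMI": [32, 26, 28, 23, 19],
    "smoke": [0, 1, 1, 1, 0],
})

pairs = list(combinations(df.columns, 2))

# Grid layout: three panels per row (arbitrary choice)
n_cols = 3
n_rows = math.ceil(len(pairs) / n_cols)
fig, axes = plt.subplots(n_rows, n_cols,
                         figsize=(4 * n_cols, 3 * n_rows),
                         squeeze=False)  # always get a 2D array of axes
axes_flat = axes.ravel()

# One labelled scatterplot per column pair
for ax, (col1, col2) in zip(axes_flat, pairs):
    ax.scatter(df[col1], df[col2])
    ax.set_xlabel(col1)
    ax.set_ylabel(col2)
    ax.set_title("{} vs {}".format(col1, col2))

# Hide any unused panels in the last row
for ax in axes_flat[len(pairs):]:
    ax.set_visible(False)

fig.tight_layout()
```

Call plt.show() once at the end to display the whole grid.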

Answered By: Erick Hernández