How to loop through corresponding columns in a csv

Question:

I’m trying to write a Python script that reads all the .csv files in a folder. Every .csv file contains 94 columns. I would like to loop through the files and headers so that the script takes the first column of every file and plots a single histogram containing the data from all of those first columns, then plots another single histogram containing only the data from the 2nd columns, then another for the 3rd columns, and so on. In total it should produce 94 histograms.

I currently have code which loops a bit differently: it goes to the first file, then plots a histogram for each header in that file, then moves on to the next file, plots a histogram for each header in that file etc. Below is part of the code that does that.

dfs = []
for iteration, file in enumerate(files):
    _dfs = pd.read_csv(file)
    dfs.append(_dfs)
    print('Data is', round(100*((iteration+1)/len(files)), 0), '% loaded') #Prints how much data has been loaded so far.


'''-----------------------------------
Plotting Graphs
--------------------------------------
'''
for i in range(len(dfs)): #loops through files
    for k in dfs[i]: #loops through column headers
        plt.hist(dfs[i][k], 25)
        plt.title(files[i][22:]) #uses filename as title
        plt.xlabel(dfs[i][k].name) #uses column header for x-label
        plt.ylabel('Frequency Density')
        plt.show()

files is simply a list containing all the names of the files, and dfs holds the corresponding DataFrames. How can I alter my script to achieve what I described at the beginning?

Asked By: probablysid


Answers:

If I understand you correctly, you can change the second for loop to iterate over dfs[i].columns explicitly and use the enumerate function to keep track of the current column number. You can then use that column number to create a separate histogram for each column.

for i in range(len(dfs)): #loops through files
    for j, col in enumerate(dfs[i].columns): #loops through columns
        plt.hist(dfs[i][col], 25)
        plt.title(files[i][22:]) #uses filename as title
        plt.xlabel(col) #uses column header for x-label
        plt.ylabel('Frequency Density')
        plt.show()

I hope it helps!

Answered By: MotaBtw

This produces 94 histograms, each of which represents a per-column aggregation of the data from all DataFrames.

#######################
### Plotting Graphs ###
#######################

for i in range(94):
    data = [] # store all i'th column data across all dfs
    for df in dfs:
        data.extend(list(df.iloc[:,i])) # i'th column
    
    plt.hist(data, bins=25)
    plt.title(dfs[0].iloc[:,i].name) # get name of column from 1st df
    plt.xlabel(dfs[0].iloc[:,i].name) # get name of column from 1st df
    plt.ylabel('Frequency Density')
    plt.show()
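
The pooling step in the loop above can also be pulled out into a small helper, so that the column count is not hard-coded. This is only a sketch: the function name pool_by_position is made up for illustration, and it assumes that every file has its columns in the same order and that the headers in the first file are unique.

```python
import pandas as pd


def pool_by_position(dfs):
    """Return {header: pooled values}, one entry per column position.

    Hypothetical helper. Assumes all DataFrames share the same column
    order and that the first DataFrame has unique headers.
    """
    # Use the smallest column count so .iloc never goes out of range.
    n_cols = min(df.shape[1] for df in dfs)
    pooled = {}
    for i in range(n_cols):
        # Stack the i'th column of every DataFrame into one Series.
        column = pd.concat([df.iloc[:, i] for df in dfs], ignore_index=True)
        # Drop missing values and key the result by the first file's header.
        pooled[dfs[0].columns[i]] = column.dropna().to_numpy()
    return pooled
```

Each value in the returned dictionary can then be passed to plt.hist(values, bins=25) in place of the data list, with the key used for the title and x-label.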

Answered By: Joynul Islam

You reference pd. Is that pandas? If so, then let pandas do the work for you! Check out this tutorial that shows you how to plot columns in a pandas DataFrame. That tutorial assumes that you know the column headers, which you may not know, so look at this other tutorial that will show you how to get the column headers out of your DataFrame.
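
As a rough illustration of letting pandas do the work, the sketch below concatenates all the DataFrames, reads the headers from the columns attribute, and draws one histogram per column with Series.plot.hist. The function name plot_pooled_histograms and its show parameter are invented for this example, and it assumes every file uses identical headers so that pd.concat lines the columns up by name.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_pooled_histograms(dfs, bins=25, show=True):
    """Draw one histogram per column, pooling rows from every DataFrame.

    Hypothetical helper. Assumes all DataFrames have the same headers;
    non-numeric columns are skipped.
    """
    # Stack the files on top of each other; columns are aligned by name.
    combined = pd.concat(dfs, ignore_index=True)
    # Keep only numeric columns, since a histogram needs numbers.
    numeric = combined.select_dtypes(include="number")
    axes = []
    for col in numeric.columns:  # the headers come from the data itself
        fig, ax = plt.subplots()
        numeric[col].plot.hist(bins=bins, ax=ax)
        ax.set_title(col)
        ax.set_xlabel(col)
        ax.set_ylabel("Frequency")
        axes.append(ax)
        if show:
            plt.show()
    return axes
```

Calling plot_pooled_histograms(dfs) with the list built in the question would give one figure per header, 94 in total if all 94 columns are numeric.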

Answered By: paul