Python – Finding average of a column in a CSV given a value in another column (data from a specific year in a file with multiple years)?

Question:

The CSV files used in this code are air quality sensor data files. They record particle concentrations each hour, in some cases over multiple years. There are about 100 CSV files that I am using. I have already figured out how to look through each file and average a variable regardless of the year, but I am having trouble finding the averages for only the year 2020.

The goal of the code is to find the average number of hours each sensor is running in the year 2020.

# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv

# Read in table summarizing key variables about each Purple Air station around Pittsburgh
summary_table = pd.read_csv('Pittsburgh Hourly Averaged PM Data.csv')

# Subset the table to include only stations to be used in analysis
summary_table = summary_table[summary_table['Y/N'] == 'Y']

# Number of stations
print('Initial number of stations: ', len(summary_table))

num_hr = []

# Loop through all rows in the summary data table.  For each row, find filename
# of the station corresponding to the row and read in that station data.

hours_utc = ['00','01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16','17','18','19','20','21','22','23']
for i in summary_table.index:
    station_data = pd.read_csv('Hourly_Averages/Excel_Data/' + summary_table.at[i,'Filename'] + '.csv')
    if station_data['year'] == 2020:
    # num_hr.append(station_data['PM2.5_CF1_ug/m3'].mean())
    station_data = station_data[station_data['hr'] == h]
print(num_hr)

with open('average_hr.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(num_hr)

An example of the CSVs used by the code (the full CSVs are thousands of rows long, and I don't know a way to put a full file in the question):

,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,created_at,PM1.0_CF1_ug/m3,PM2.5_CF1_ug/m3,PM10.0_CF1_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_ATM_ug/m3,hr,year,month,date,season
0,0,0,0,0,2020-12-23 17:00:00 UTC,0,0.04,0.12,7.5,-39.45,71,14.85,0.04,17,2020,12,12/23/20,Winter
1,1,1,1,1,2020-12-23 18:00:00 UTC,172.9,393.94,489.19,47.41,-36.93,76.34,14.72,261.9,18,2020,12,12/23/20,Winter
2,2,2,2,2,2020-12-23 19:00:00 UTC,77.59,144.78,161.67,101,-37.7,76.17,15.61,95.94,19,2020,12,12/23/20,Winter
3,3,3,3,3,2021-01-07 19:00:00 UTC,103.61,236.47,298.67,28.04,-60.39,76,14.61,157.63,19,2021,1,1/7/21,Winter
4,4,4,4,4,2021-01-07 20:00:00 UTC,11.18,21.12,23.04,64,-59.55,78.91,13.36,19.77,20,2021,1,1/7/21,Winter
5,5,5,5,5,2021-01-13 18:00:00 UTC,59.77,96.07,102.51,13.26,-49.52,73.78,29.48,65.32,18,2021,1,1/13/21,Winter

Reference photo of the same CSV

One row with attribute names

FYI, I am fairly new to coding and to using CSV files. There may be a simple answer to my question, but after looking over many sites I am still stuck. I appreciate any help you may have.

Asked By: Jesse Rutt


Answers:

Imagine this is your table (image not available; it has the columns Sensor_1, Sensor_2, Sensor_3 and Year, with 14 rows in total, 4 of them from 2021).

Here is the idea for computing something on one column based on a condition on another column:

import pandas as pd

# Tell pandas to read only these columns
fields = ['Sensor_1', 'Sensor_2', 'Sensor_3', 'Year']
df = pd.read_excel('myData.xlsx', usecols=fields)

for x in df:
    if x != 'Year':
        # where() turns the non-2020 rows into NaN, and sum() skips them.
        # 14 is the total number of rows in the sample table.
        sensor = df[x].where(df['Year'] == 2020).sum() / 14
        print(sensor)


The result is:

10.785714285714286  # sensor_1 avg
4.357142857142857   # sensor_2 avg
2.892857142857143   # sensor_3 avg

More detail:

After reading the code, you may wonder whether there is a function that gives you the average directly. There is, and it is called mean(). Note that mean() ignores the rows that where(df['Year'] == 2020) masks out, so it divides by the number of 2020 rows only. In my sample that is sum() / 10, because 4 of the 14 rows are from 2021. If you want the average over the 2020 rows alone, mean() gives exactly that; the sum() / 14 in the code above divides by the total row count instead, which is a different quantity.
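To make the difference concrete, here is a minimal sketch on a small made-up table (the values and the 3-plus-2 row split are hypothetical, not the sample table from this answer). It compares mean() on the masked column, sum() divided by the total row count, and a plain boolean filter:

```python
import pandas as pd

# Hypothetical stand-in table: 3 rows from 2020 and 2 rows from 2021
df = pd.DataFrame({
    "Sensor_1": [10.0, 20.0, 30.0, 100.0, 200.0],
    "Year": [2020, 2020, 2020, 2021, 2021],
})

# where() keeps the 2020 values and replaces the 2021 values with NaN
masked = df["Sensor_1"].where(df["Year"] == 2020)

# mean() skips NaN, so this divides by the 3 rows from 2020
mean_2020 = masked.mean()

# sum() also skips NaN, but here we divide by all 5 rows
sum_over_all_rows = masked.sum() / len(df)

# Filtering the rows first gives the same value as mean_2020
filtered_mean = df.loc[df["Year"] == 2020, "Sensor_1"].mean()

print(mean_2020, sum_over_all_rows, filtered_mean)
```

The first and third numbers agree (the average over the 2020 rows), while the second is smaller because the 2021 rows still count in the denominator.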

That is all you need: just replace the attribute names in the code above with your own. I hope it helps.
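Applied to the column names in the question, the same idea might look like the sketch below. The helper name summarize_2020 is made up for illustration, the small table is copied from the example rows in the question, and it assumes that each row in a station file represents one hour in which the sensor was running:

```python
import pandas as pd

def summarize_2020(station_data, value_col="PM2.5_CF1_ug/m3"):
    """Return (number of 2020 rows, mean of value_col over those rows).

    Hypothetical helper; assumes one row per hour of sensor operation.
    """
    in_2020 = station_data[station_data["year"] == 2020]
    return len(in_2020), in_2020[value_col].mean()

# Rows taken from the example CSV in the question (only the needed columns)
station_data = pd.DataFrame({
    "PM2.5_CF1_ug/m3": [0.04, 393.94, 144.78, 236.47, 21.12, 96.07],
    "hr": [17, 18, 19, 19, 20, 18],
    "year": [2020, 2020, 2020, 2021, 2021, 2021],
})

hours_2020, pm25_2020 = summarize_2020(station_data)
print(hours_2020, pm25_2020)
```

Inside the loop over summary_table.index, you could call this on each station_data and append hours_2020 to num_hr; the average of num_hr is then the average number of recorded hours per sensor in 2020. A station with no 2020 rows returns 0 hours and NaN for the mean.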

Answered By: two.X