Creating sub columns in Pandas Dataframes for Summary Statistics
Question:
I am working with water quality data for both surface water locations and groundwater well locations. I would like to create a summary statistics table for all three of my parameters (pH, Temp, Salinity), grouped by the location the samples were taken from (surface water vs. groundwater), as shown below:
     |      'Surface Water'      |       'Groundwater'       |
_____|___________________________|___________________________|
     | min | max | mean | std    | min | max | mean | std    |
'pH' |
The way I set up my Excel Sheet for data collection includes the following columns: Date, Monitoring ID (Either Surface Water or Groundwater), pH, Temp, and Salinity.
How can I tell Python to do this? I am familiar with the groupby and describe() functions, but I don’t know how to organize the output the way that I want. Any help would be appreciated!
I have tried using the groupby function for each descriptive stat, for example:
mean = df.groupby('Monitoring ID')[['pH', 'SAL (ppt)', 'Temperature (°C)', 'DO (mg/L)']].mean()
min = df.groupby('Monitoring ID')[['pH', 'SAL (ppt)', 'Temperature (°C)', 'DO (mg/L)']].min()
and so on, but I don’t know how to combine it all into one nice table.
Answers:
You can use groupby and describe, as you suggest, then stack and transpose:
metrics = ['count', 'mean', 'std', 'min', 'max']
out = df.groupby('Monitoring ID').describe().stack().T.loc[:, (slice(None), metrics)]
>>> out
Monitoring ID    Groundwater                                      Surface Water
                       count       mean       std   min    max         count       mean       std   min    max
pH                     159.0   6.979182  0.587316  6.00   7.98         141.0   6.991135  0.564097  6.00   7.99
SAL (ppt)              159.0   1.976226  0.577557  1.02   2.99         141.0   1.917589  0.576650  1.01   2.99
Temperature (°C)       159.0  13.466101  4.805317  4.13  21.78         141.0  13.099645  4.989240  4.03  21.61
DO (mg/L)              159.0   1.984277  0.609071  1.00   2.99         141.0   1.939433  0.577651  1.00   2.96
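For anyone who wants to try this without the original spreadsheet, here is a minimal sketch on made-up data. The sample values and the names `params`, `groups` and `wide` are my own assumptions. It also swaps the stack/transpose step for a transpose followed by `unstack` and an explicit `reindex`, so the row and column order is spelled out rather than left to how a given pandas version orders the reshaped labels.

```python
import pandas as pd

# Hypothetical sample data: three readings per location.
df = pd.DataFrame({
    "Monitoring ID": ["Surface Water"] * 3 + ["Groundwater"] * 3,
    "pH": [7.1, 7.2, 7.5, 7.8, 7.6, 7.4],
    "Temp": [10, 12, 9, 15, 13, 14],
    "Salinity": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

params = ["pH", "Temp", "Salinity"]               # rows of the final table
metrics = ["count", "mean", "std", "min", "max"]  # statistics to keep
groups = ["Groundwater", "Surface Water"]         # top-level column order

# describe() returns one row per location and (parameter, statistic) columns.
desc = df.groupby("Monitoring ID")[params].describe()

# Transpose so (parameter, statistic) is the row index, then move the
# statistic level into the columns, giving (location, statistic) columns.
wide = desc.T.unstack(level=1)

# Keep only the wanted statistics, in an explicit order.
columns = pd.MultiIndex.from_product([groups, metrics])
out = wide.reindex(index=params, columns=columns)

print(out)
```

Swapping the order of `groups` puts Surface Water first, as in the layout sketched in the question.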
You can use agg along with groupby:
import pandas as pd

# Sample data
data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-01', '2022-01-02', '2022-01-03'],
        'Monitoring ID': ['Surface Water', 'Surface Water', 'Surface Water', 'Groundwater', 'Groundwater', 'Groundwater'],
        'pH': [7.1, 7.2, 7.5, 7.8, 7.6, 7.4],
        'Temp': [10, 12, 9, 15, 13, 14],
        'Salinity': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}
df = pd.DataFrame(data)

# Group by 'Monitoring ID' and calculate summary statistics
summary_stats = df.groupby('Monitoring ID').agg({'pH': ['min', 'max', 'mean', 'std'],
                                                 'Temp': ['min', 'max', 'mean', 'std'],
                                                 'Salinity': ['min', 'max', 'mean', 'std']})

# Flatten the two-level column index into single names such as 'pH_min'
summary_stats.columns = ['_'.join(col).strip() for col in summary_stats.columns.values]

# Summary table: one row per location, one column per parameter/statistic pair
print(summary_stats)
Pardon me, I’m still working out how to show the output of the code here, but I hope this helps.
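If the two-level header from the question is the goal, one possible alternative to flattening the names is to aggregate each location separately and join the pieces side by side with `pd.concat`. This is only a sketch under my own assumptions: it filters by location instead of using groupby, and the names `stats`, `params`, `order` and `table` are illustrative.

```python
import pandas as pd

# Same shape of sample data as in the answer above (values are made up).
df = pd.DataFrame({
    "Monitoring ID": ["Surface Water"] * 3 + ["Groundwater"] * 3,
    "pH": [7.1, 7.2, 7.5, 7.8, 7.6, 7.4],
    "Temp": [10, 12, 9, 15, 13, 14],
    "Salinity": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

stats = ["min", "max", "mean", "std"]
params = ["pH", "Temp", "Salinity"]
order = ["Surface Water", "Groundwater"]  # left-to-right order of the header

# For each location: aggregate (rows = statistics, columns = parameters),
# then transpose so parameters become the rows.
per_location = [
    df.loc[df["Monitoring ID"] == name, params].agg(stats).T
    for name in order
]

# Join side by side; `keys` becomes the top level of the column header.
table = pd.concat(per_location, axis=1, keys=order)

print(table)
```

Because the column order comes from the `order` and `stats` lists, rearranging those lists is enough to change the layout.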