Pandas – Compute z-score for all columns

Question:

I have a dataframe containing a single column of IDs and all other columns are numerical values for which I want to compute z-scores. Here’s a subsection of it:

ID      Age    BMI    Risk Factor
PT 6    48     19.3    4
PT 8    43     20.9    NaN
PT 2    39     18.1    3
PT 9    41     19.5    NaN

Some of my columns contain NaN values which I do not want to include into the z-score calculations so I intend to use a solution offered to this question: how to zscore normalize pandas column with nans?

df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)

I’m interested in applying this solution to all of my columns except the ID column to produce a new dataframe which I can save as an Excel file using

df2.to_excel("Z-Scores.xlsx")

So basically; how can I compute z-scores for each column (ignoring NaN values) and push everything into a new dataframe?

SIDENOTE: there is a concept in pandas called "indexing" which intimidates me because I do not understand it well. If indexing is a crucial part of solving this problem, please dumb down your explanation of indexing.

Asked By: Slavatron

||

Answers:

Build a list from the columns and remove the column you don’t want to calculate the Z score for:

In [66]:
cols = list(df.columns)
cols.remove('ID')
df[cols]

Out[66]:
   Age  BMI  Risk  Factor
0    6   48  19.3       4
1    8   43  20.9     NaN
2    2   39  18.1       3
3    9   41  19.5     NaN
In [68]:
# now iterate over the remaining columns and create a new zscore column
for col in cols:
    col_zscore = col + '_zscore'
    df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
df
Out[68]:
   ID  Age  BMI  Risk  Factor  Age_zscore  BMI_zscore  Risk_zscore  
0  PT    6   48  19.3       4   -0.093250    1.569614    -0.150946   
1  PT    8   43  20.9     NaN    0.652753    0.074744     1.459148   
2  PT    2   39  18.1       3   -1.585258   -1.121153    -1.358517   
3  PT    9   41  19.5     NaN    1.025755   -0.523205     0.050315   

   Factor_zscore  
0              1  
1            NaN  
2             -1  
3            NaN  
Answered By: EdChum

The almost one-liner solution:

df2 = (df.ix[:,1:] - df.ix[:,1:].mean()) / df.ix[:,1:].std()
df2['ID'] = df['ID']
Answered By: Josh Chartier

Using Scipy’s zscore function:

df = pd.DataFrame(np.random.randint(100, 200, size=(5, 3)), columns=['A', 'B', 'C'])
df

|    |   A |   B |   C |
|---:|----:|----:|----:|
|  0 | 163 | 163 | 159 |
|  1 | 120 | 153 | 181 |
|  2 | 130 | 199 | 108 |
|  3 | 108 | 188 | 157 |
|  4 | 109 | 171 | 119 |

from scipy.stats import zscore
df.apply(zscore)

|    |         A |         B |         C |
|---:|----------:|----------:|----------:|
|  0 |  1.83447  | -0.708023 |  0.523362 |
|  1 | -0.297482 | -1.30804  |  1.3342   |
|  2 |  0.198321 |  1.45205  | -1.35632  |
|  3 | -0.892446 |  0.792025 |  0.449649 |
|  4 | -0.842866 | -0.228007 | -0.950897 |

If not all the columns of your data frame are numeric, then you can apply the Z-score function only to the numeric columns using the select_dtypes function:

# Note that `select_dtypes` returns a data frame. We are selecting only the columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols].apply(zscore)

|    |         A |         B |         C |
|---:|----------:|----------:|----------:|
|  0 |  1.83447  | -0.708023 |  0.523362 |
|  1 | -0.297482 | -1.30804  |  1.3342   |
|  2 |  0.198321 |  1.45205  | -1.35632  |
|  3 | -0.892446 |  0.792025 |  0.449649 |
|  4 | -0.842866 | -0.228007 | -0.950897 |
Answered By: Manuel

If you want to calculate the zscore for all of the columns, you can just use the following:

df_zscore = (df - df.mean())/df.std()
Answered By: Joe Bathelt

When we are dealing with time-series, calculating z-scores (or anomalies – not the same thing, but you can adapt this code easily) is a bit more complicated. For example, you have 10 years of temperature data measured weekly. To calculate z-scores for the whole time-series, you have to know the means and standard deviations for each day of the year. So, let’s get started:

Assume you have a pandas DataFrame. First of all, you need a DateTime index. If you don’t have it yet, but luckily you do have a column with dates, just make it as your index. Pandas will try to guess the date format. The goal here is to have DateTimeIndex. You can check it out by trying:

type(df.index)

If you don’t have one, let’s make it.

df.index = pd.DatetimeIndex(df[datecolumn])
df = df.drop(datecolumn,axis=1)

Next step is to calculate mean and standard deviation for each group of days. For this, we use the groupby method.

mean = pd.groupby(df,by=[df.index.dayofyear]).aggregate(np.nanmean)
std = pd.groupby(df,by=[df.index.dayofyear]).aggregate(np.nanstd)

Finally, we loop through all the dates, performing the calculation (value – mean)/stddev; however, as mentioned, for time-series this is not so straightforward.

df2 = df.copy() #keep a copy for future comparisons 
for y in np.unique(df.index.year):
    for d in np.unique(df.index.dayofyear):
        df2[(df.index.year==y) & (df.index.dayofyear==d)] = (df[(df.index.year==y) & (df.index.dayofyear==d)]- mean.ix[d])/std.ix[d]
        df2.index.name = 'date' #this is just to look nicer

df2 #this is your z-score dataset.

The logic inside the for loops is: for a given year we have to match each dayofyear to its mean and stdev. We run this for all the years in your time-series.

Answered By: Deninhos

Here’s other way of getting Zscore using custom function:

In [6]: import pandas as pd; import numpy as np

In [7]: np.random.seed(0) # Fixes the random seed

In [8]: df = pd.DataFrame(np.random.randn(5,3), columns=["randomA", "randomB","randomC"])

In [9]: df # watch output of dataframe
Out[9]:
    randomA   randomB   randomC
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

## Create custom function to compute Zscore 
In [10]: def z_score(df):
   ....:         df.columns = [x + "_zscore" for x in df.columns.tolist()]
   ....:         return ((df - df.mean())/df.std(ddof=0))
   ....:

## make sure you filter or select columns of interest before passing dataframe to function
In [11]: z_score(df) # compute Zscore
Out[11]:
   randomA_zscore  randomB_zscore  randomC_zscore
0        0.798350       -0.106335        0.731041
1        1.505002        1.939828       -1.577295
2       -0.407899       -0.875374       -0.545799
3       -1.207392       -0.463464        1.292230
4       -0.688061       -0.494655        0.099824

Result reproduced using scipy.stats zscore

In [12]: from scipy.stats import zscore

In [13]: df.apply(zscore) # (Credit: Manuel)
Out[13]:
    randomA   randomB   randomC
0  0.798350 -0.106335  0.731041
1  1.505002  1.939828 -1.577295
2 -0.407899 -0.875374 -0.545799
3 -1.207392 -0.463464  1.292230
4 -0.688061 -0.494655  0.099824
Answered By: Surya

for Z score, we can stick to documentation instead of using ‘apply’ function

from scipy.stats import zscore
df_zscore = zscore(cols as array, axis=1)
Answered By: ibozkurt79

To calculate a z-score for an entire column quickly, do as follows:

from scipy.stats import zscore
import pandas as pd

df = pd.DataFrame({'num_1': [1,2,3,4,5,6,7,8,9,3,4,6,5,7,3,2,9]})
df['num_1_zscore'] = zscore(df['num_1'])

display(df)
Answered By: BGG16

Another way is to call StandardScaler() from scikit-learn. Simply instantiate StandardScaler and call fit_transform using the relevant columns as input. The result is a numpy array which you can assign back to the dataframe as new columns (or work on the array itself etc.).

from sklearn.preprocessing import StandardScaler

cols = ['col1', 'col2']
new_cols = [f"{c}_zscore" for c in cols]

sc = StandardScaler()
df[new_cols] = sc.fit_transform(df[cols])
Answered By: cottontail
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.