quantile normalization on pandas dataframe

Question:

Simply put, how do I apply quantile normalization to a large pandas DataFrame (roughly 2,000,000 rows) in Python?

PS: I know there is a package named rpy2 that can run R in a subprocess and use R's quantile normalization. But the truth is that R does not compute the correct result when I use the data set below:

5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
8.535579139044583634e-05,5.128625938538547123e-06,1.635991820040899643e-05,6.291814349531259308e-05,3.006704952043056075e-05,6.881341586355676286e-06
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
2.845193046348194770e-05,1.538587781561563968e-05,2.944785276073619561e-05,4.194542899687506431e-05,6.013409904086112150e-05,1.032201237953351358e-05

Edit:

What I want:

Given the data shown above, how do I apply quantile normalization following the steps described in https://en.wikipedia.org/wiki/Quantile_normalization?

I found a piece of Python code that claims to compute quantile normalization:

import rpy2.robjects as robjects
import numpy as np
from rpy2.robjects.packages import importr

preprocessCore = importr('preprocessCore')

# flatten the matrix column-wise, build an R matrix, and quantile normalize it
matrix = [[1, 2, 3, 4, 5], [1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
v = robjects.FloatVector([element for col in matrix for element in col])
m = robjects.r['matrix'](v, ncol=len(matrix), byrow=False)
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array(Rnormalized_matrix)

The code works fine with the sample data it ships with, but when I test it with the data given above the result is wrong.

Since rpy2 only provides an interface to run R from Python, I tested it again in R directly, and the result was still wrong. So I suspect the problem is with the R method itself.

Asked By: Shawn. L


Answers:

OK, I implemented the method myself, and it is reasonably efficient.

In hindsight the logic is fairly simple, but I decided to post it here for anyone who feels as confused as I was when I could not find working code through a search.

The code is on GitHub: Quantile Normalize
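
For readers who do not want to follow the link, here is a minimal sketch of the standard mean-based procedure (not necessarily identical to the code in the repository; the function name is only illustrative):

import numpy as np
import pandas as pd

def quantile_normalize_mean(df):
    """Sketch of mean-based quantile normalization; ties are broken by order of appearance."""
    # sort each column independently and average across the columns at each rank
    rank_means = pd.DataFrame(np.sort(df.to_numpy(), axis=0)).mean(axis=1).to_numpy()
    # 0-based within-column ranks
    ranks = df.rank(method='first').astype(int) - 1
    # map every value to the mean for its rank
    return df.apply(lambda col: pd.Series(rank_means[ranks[col.name].to_numpy()], index=col.index))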

Answered By: Shawn. L

Using the example dataset from the Wikipedia article:

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

df
Out: 
   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8

For each rank, the mean value can be calculated with the following:

rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()

rank_mean
Out: 
1    2.000000
2    3.000000
3    4.666667
4    5.666667
dtype: float64

Then the resulting Series, rank_mean, can be used as a mapping for the ranks to get the normalized results:

df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
Out: 
         C1        C2        C3
A  5.666667  4.666667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  4.666667  4.666667
D  4.666667  3.000000  5.666667
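
For a large frame such as the one in the question, these two steps can be bundled into a small helper (a sketch; the function name is only illustrative):

def quantile_normalize(df):
    # mean of the observed values at each rank, then map every value back via its rank
    rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
    return df.rank(method='min').stack().astype(int).map(rank_mean).unstack()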
Answered By: ayhan

It may be more robust to use the median of each row rather than the mean (based on code from Shawn. L):

import numpy as np
import pandas as pd

def quantileNormalize(df_input):
    df = df_input.copy()
    # compute the reference distribution: sort each column independently (NaNs first)
    dic = {}
    for col in df:
        dic[col] = df[col].sort_values(na_position='first').values
    sorted_df = pd.DataFrame(dic)
    # rank = sorted_df.mean(axis=1).tolist()
    rank = sorted_df.median(axis=1).tolist()
    # map each value back through its percentile rank
    for col in df:
        # compute percentile rank in [0, 1] for each score in the column
        t = df[col].rank(pct=True, method='max').values
        # replace each value with the quantile normalized score:
        # the percentile of `rank` at that percentile rank
        df[col] = [np.nanpercentile(rank, i * 100) if not np.isnan(i) else np.nan for i in t]
    return df
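
A usage sketch on the Wikipedia example DataFrame from the previous answer:

# assumes df is the Wikipedia example DataFrame defined in the previous answer
normalized = quantileNormalize(df)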
Answered By: xspensiv

The code below gives results identical to preprocessCore::normalize.quantiles.use.target, and I find it simpler and clearer than the solutions above. Performance should also hold up for very large arrays.

import numpy as np

def quantile_normalize_using_target(x, target):
    """
    Both `x` and `target` are numpy arrays of equal lengths.
    """

    target_sorted = np.sort(target)

    return target_sorted[x.argsort().argsort()]

Once you have a pandas.DataFrame, it is easy to apply:

quantile_normalize_using_target(df[0].to_numpy(),
                                df[1].to_numpy())

(In the example above, the first column is normalized to the second one, used as the reference distribution.)
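
To quantile normalize all columns of a DataFrame df against a common reference, one possible extension (a sketch; here the target is taken as the mean of the independently sorted columns, the classic choice) is:

import numpy as np
import pandas as pd

# Sketch: use the mean of the sorted columns as the shared target distribution
# and normalize every column of df against it.
target = np.sort(df.to_numpy(), axis=0).mean(axis=1)
normalized = df.apply(
    lambda col: pd.Series(quantile_normalize_using_target(col.to_numpy(), target),
                          index=col.index))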

Answered By: deeenes

I am new to pandas and late to the question, but I think this answer might also be of use. It builds on the great answer from @ayhan:

import pandas as pd

def quantile_normalize(dataframe, cols, pandas=pd):

    # copy the dataframe and keep only the columns with numerical values
    df = dataframe.copy().filter(items=cols)

    # columns from the original dataframe not specified in cols
    non_numeric = dataframe.filter(items=[col for col in dataframe if col not in cols])

    # quantile normalize the numeric columns (same approach as in ayhan's answer)
    rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
    norm = df.rank(method='min').stack().astype(int).map(rank_mean).unstack()

    # merge the untouched columns back in
    result = pandas.concat([norm, non_numeric], axis=1)
    return result

The main difference is that this version is closer to some real-world applications. Often you just have a matrix of numerical data, in which case the original answer is sufficient.

Sometimes you have text-based columns as well. This version lets you specify the columns cols that hold numerical data and runs quantile normalization on those columns only. At the end it merges the non-numeric (or not-to-be-normalized) columns from your original data frame back in.

For example, if you added some ‘metadata’ (the char column) to the wiki example:

df = pd.DataFrame({
    'rep1': [5, 2, 3, 4],
    'rep2': [4, 1, 4, 2],
    'rep3': [3, 4, 6, 8],
    'char': ['gene_a', 'gene_b', 'gene_c', 'gene_d']
}, index = ['a', 'b', 'c', 'd'])

you can then call

quantile_normalize(df, ['rep1', 'rep2', 'rep3'])

to get

    rep1        rep2        rep3        char
a   5.666667    4.666667    2.000000    gene_a
b   2.000000    2.000000    3.000000    gene_b
c   3.000000    4.666667    4.666667    gene_c
d   4.666667    3.000000    5.666667    gene_d
Answered By: SumNeuron

One thing worth noticing is that both ayhan's and Shawn's code use the smaller rank mean for ties, whereas the R package preprocessCore's normalize.quantiles() uses the mean of the rank means for ties.

Using the above example:

> df

   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8

> normalize.quantiles(as.matrix(df))

         C1        C2        C3
A  5.666667  5.166667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  5.166667  4.666667
D  4.666667  3.000000  5.666667
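
The same tie-averaging behaviour can be reproduced in pandas by first mapping ranks computed with method='first' and then averaging the mapped values over tied entries within each column (a sketch; the variable names are only illustrative):

import pandas as pd

# Sketch: map ranks (ties broken by order of appearance) to rank means,
# then average the mapped values over tied entries in each column.
ranks_first = df.rank(method='first').stack().astype(int)
rank_mean = df.stack().groupby(ranks_first).mean()
mapped = ranks_first.map(rank_mean).unstack()
result = mapped.copy()
for col in df:
    result[col] = mapped[col].groupby(df[col]).transform('mean')

For the example above this gives 5.166667 for the two tied entries in C2, matching the normalize.quantiles() output.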
Answered By: msg

As pointed out by @msg, none of the solutions here take ties into account. I made a Python package called qnorm that handles ties and correctly recreates the Wikipedia quantile normalization example:

import pandas as pd
import qnorm

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

print(qnorm.quantile_normalize(df))
         C1        C2        C3
A  5.666667  5.166667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  5.166667  4.666667
D  4.666667  3.000000  5.666667

Installation can be done with either pip or conda:

pip install qnorm

or

conda config --add channels conda-forge
conda install qnorm
Answered By: Maarten-vd-Sande

This is a minor adjustment, but I imagine many have noticed the subtle ‘flaw’ in @ayhan’s answer.

I have made a small adjustment that gets the ‘correct’ answer without having to resort to any external libraries for such a simple function.

The only adjustment necessary is the [Add interpolated values] section.

import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

def quant_norm(df):
    ranks = (df.rank(method="first")
              .stack())
    rank_mean = (df.stack()
                   .groupby(ranks)
                   .mean())
    # Add interpolated values in between ranks
    finer_ranks = ((rank_mean.index+0.5).to_list() +
                    rank_mean.index.to_list())
    rank_mean = rank_mean.reindex(finer_ranks).sort_index().interpolate()
    return (df.rank(method='average')
              .stack()
              .map(rank_mean)
              .unstack())
quant_norm(df)

Out[122]: 
         C1        C2        C3
A  5.666667  5.166667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  5.166667  4.666667
D  4.666667  3.000000  5.666667
Answered By: chase

Note that scikit-learn offers a quantile transformation function in its preprocessing module:

import pandas as pd
import sklearn.preprocessing

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
sklearn.preprocessing.quantile_transform(df)
array([[1.  , 1.  , 0.  ],
       [0.  , 0.  , 0.33],
       [0.33, 1.  , 0.67],
       [0.67, 0.33, 1.  ]])

It is also possible to adjust the data to a normal distribution instead of a uniform distribution:

sklearn.preprocessing.quantile_transform(df, output_distribution="normal")
array([[ 5.2 ,  5.2 , -5.2 ],
       [-5.2 , -5.2 , -0.43],
       [-0.43,  5.2 ,  0.43],
       [ 0.43, -0.43,  5.2 ]])
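
If you want the result back as a DataFrame with the original labels, a small wrapper helps (a sketch; n_quantiles=len(df) simply avoids the warning scikit-learn emits when there are fewer samples than the default number of quantiles):

import pandas as pd
import sklearn.preprocessing

# Sketch: keep the original index and column labels after transforming.
normalized = pd.DataFrame(
    sklearn.preprocessing.quantile_transform(df, n_quantiles=len(df)),
    index=df.index,
    columns=df.columns,
)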
Answered By: Gregor Sturm