How can I efficiently get [many] quartiles?

Question:

I need to encode numerical values by ranges: low: 0, medium: 1, high: 2 , very high: 3. I’m doing it for quartiles. I have the following code:

import pandas as pd
import numpy as np

def fun(df):
    table = df.copy() # pandas dataframe
    N = int(table.shape[0])
    for header in list(table.columns):
        q1 = np.percentile(table[header], 25)
        q2 = np.percentile(table[header], 50)
        q3 = np.percentile(table[header], 75)
        for k in range(0, N):
            if( table[header][k] < q1 ):
                table[header][k] = int(0)
            elif( (table[header][k] >= q1) & (table[header][k] < q2)):
                table[header][k] = int(1)
            elif( (table[header][k] >= q2) & (table[header][k] < q3)):
                table[header][k] = int(2)
            else:
                table[header][k] = int(3)
        pass
    table = table.astype(int)
    return table

Proof

df = pd.DataFrame( {
        'A': [30, 28, 32, 25, 25, 25, 22, 24, 35, 40],
        'B': [25, 30, 27, 40, 42, 40, 50, 45, 30, 25],
        'C': [25.5, 30.1, 27.3, 40.77, 25.1, 25.34, 22.11, 23.81, 33.66, 38.56],
    }, columns = [ 'A', 'B', 'C' ] )

Result:

A  B  C
2  0  1
2  1  2
3  0  2
1  2  3
1  3  0
1  2  1
0  3  0
0  3  0
3  1  3
3  0  3

Any way to do the same, efficiently?

Asked By: JonAnthrax

||

Answers:

You could use a combination of np.digitize and pd.rank

In [569]: np.digitize(df.rank(pct=True), bins=[.25, .5, .75], right=True)
Out[569]:
array([[2, 0, 1],
       [2, 1, 2],
       [3, 1, 2],
       [1, 2, 3],
       [1, 3, 1],
       [1, 2, 1],
       [0, 3, 0],
       [0, 3, 0],
       [3, 1, 3],
       [3, 0, 3]], dtype=int64)

Details

In [570]: df.rank(pct=True)
Out[570]:
     A     B    C
0  0.7  0.15  0.5
1  0.6  0.45  0.7
2  0.8  0.30  0.6
3  0.4  0.65  1.0
4  0.4  0.80  0.3
5  0.4  0.65  0.4
6  0.1  1.00  0.1
7  0.2  0.90  0.2
8  0.9  0.45  0.8
9  1.0  0.15  0.9

In [571]: pd.DataFrame(np.digitize(df.rank(pct=True), bins=[.25, .5, .75], right=True),
                       columns=df.columns)
Out[571]:
   A  B  C
0  2  0  1
1  2  1  2
2  3  1  2
3  1  2  3
4  1  3  1
5  1  2  1
6  0  3  0
7  0  3  0
8  3  1  3
9  3  0  3
Answered By: Zero