Applying pandas qcut bins to new data

Question:

I am using pandas qcut to split some data into 20 bins as part of data prep for training of a binary classification model like so:

data['VAR_BIN'] = pd.qcut(cc_data[var], 20, labels=False)

My question is: how can I apply the same binning logic derived from the qcut statement above to a new set of data, say for model validation purposes? Is there an easy way to do this?

Thanks

Asked By: GRN


Answers:

You can do it by passing retbins=True.

Consider the following DataFrame:

import pandas as pd
import numpy as np
prng = np.random.RandomState(0)
df = pd.DataFrame(prng.randn(100, 2), columns = ["A", "B"])

pd.qcut(df["A"], 20, retbins=True, labels=False) returns a tuple whose second element is the bins. So you can do:

ser, bins = pd.qcut(df["A"], 20, retbins=True, labels=False)

Because labels=False is passed, ser holds the integer bin index for each row, and bins holds the break points (21 edges for 20 bins). Now you can pass bins to pd.cut to apply the same grouping to the other column:

pd.cut(df["B"], bins=bins, labels=False, include_lowest=True)
Out[38]: 
0     13
1     19
2      3
3      9
4     13
5     17
...
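The same idea applied to a train/validation split can be sketched as follows. This is a minimal illustration, not part of the original answer: the frames train and valid and the column name value are made up for the example, and the data is random.

```python
import numpy as np
import pandas as pd

# Hypothetical training and validation frames (random data for illustration)
rng = np.random.RandomState(0)
train = pd.DataFrame({"value": rng.randn(1000)})
valid = pd.DataFrame({"value": rng.randn(200)})

# Learn 20 quantile bins on the training data and keep the edges
train["bin"], bins = pd.qcut(train["value"], 20, labels=False, retbins=True)

# Reuse the training edges on the validation data.
# include_lowest=True keeps a value equal to the lowest edge in bin 0.
valid["bin"] = pd.cut(valid["value"], bins=bins, labels=False, include_lowest=True)

# Sanity check: re-cutting the training data with the saved edges
# should give back the bin indices that qcut assigned.
recut = pd.cut(train["value"], bins=bins, labels=False, include_lowest=True)
```

Validation values that fall outside the training range come back as NaN with this approach, so valid["bin"] may contain missing values.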
Answered By: ayhan

User @Karen said:

By using this logic, I am getting Na values in my validation set. Is there some way to solve it?

If this is happening to you, it most likely means that the validation set has values below the smallest (or above the greatest) value in the training data. Those values fall outside every bin edge, so pd.cut assigns them NaN instead of a bin.
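A small sketch of the problem, using made-up numbers (the series train and valid are hypothetical and only serve to show where the NaN values come from):

```python
import pandas as pd

# Toy training values; the quartile edges span from 1.0 to 8.0
train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
# 0.5 is below the training minimum and 9.0 is above the training maximum
valid = pd.Series([0.5, 2.5, 9.0])

_, bins = pd.qcut(train, 4, labels=False, retbins=True)
codes = pd.cut(valid, bins=bins, labels=False, include_lowest=True)

# Only the middle value lies inside the training range and gets a bin;
# the other two are NaN.
print(codes)
```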

You can solve this problem by extending the range of the training data:

# Make smallest value arbitrarily smaller
train.loc[train['value'].eq(train['value'].min()), 'value'] = train['value'].min() - 100
# Make greatest value arbitrarily greater
train.loc[train['value'].eq(train['value'].max()), 'value'] = train['value'].max() + 100
# Make bins from training data
s, b = pd.qcut(train['value'], 20, retbins=True)
# Cut validation data
test['bin'] = pd.cut(test['value'], b)
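An alternative that leaves the training values untouched is to widen the outermost bin edges instead of the data. The sketch below is not from the original answer: it swaps the first and last edges returned by qcut for -inf and +inf, and the frames train and valid with their value column are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 0.5 and 9.0 lie outside the training range
train = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]})
valid = pd.DataFrame({"value": [0.5, 2.5, 9.0]})

# Learn quantile edges from the training data
_, bins = pd.qcut(train["value"], 4, retbins=True)

# Open up the outer edges so every finite value lands in some bin
open_bins = bins.copy()
open_bins[0] = -np.inf
open_bins[-1] = np.inf

# Out-of-range values now fall into the first or last bin instead of NaN
valid["bin"] = pd.cut(valid["value"], bins=open_bins, labels=False)
```

The interior edges are unchanged, so values inside the training range get the same bin as before. Whether lumping extreme values into the outer bins is appropriate depends on the model.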
Answered By: Arturo Sbr