Converting a pandas Interval into a string (and back again)

Question:

I’m relatively new to Python and am trying to get some data prepped to train a RandomForest. For various reasons, we want the data to be discrete, so there are a few continuous variables that need to be discretized. I found qcut in pandas, which seems to do what I want – I can set a number of bins, and it will discretize the variable into that many bins, trying to keep the counts in each bin even.

However, the output of pandas.qcut is a list of Intervals, and the RandomForest classifier in scikit-learn needs a string. I found that I can convert an interval into a string by using .astype(str). Here’s a quick example of what I’m doing:

import pandas as pd
from random import sample

vals = sample(range(0,100), 100)
cuts = pd.qcut(vals, q=5)
str_cuts = pd.qcut(vals, q=5).astype(str)

and then str_cuts is one of the variables passed into a random forest.

However, the intent of this system is to train a RandomForest, save it to a file, and then allow someone to load it at a later date and get a classification for a new test instance, that is not available at training time. And because the classifier was trained on discretized data, the new test instance will need to be discretized before it can be used. So what I want to be able to do is read in a new instance, apply the already-established discretization scheme to it, convert it to a string, and run it through the random forest. However, I’m getting hung up on the best way to ‘apply the discretization scheme’.

Is there an easy way to handle this? I assume there’s no straight-forward way to convert a string back into an Interval. I can get the list of all Interval values from the discretization (ex: cuts.unique()) and apply that at test-time, but that would require saving/loading a discretization dictionary alongside the random forest, which seems clunky, and I worry about running into issues trying to recreate a categorical variable (coming mostly from R, which is extremely particular about the format of categorical variables). Or is there another way around this that I’m not seeing?

Asked By: Amanda

||

Answers:

While it may not be the cleanest-looking method, converting a string back into an interval is indeed possible:

import pandas as pd
str_intervals = [i.replace("(","").replace("]", "").split(", ") for i in str_cuts]
original_cuts = [pd.Interval(float(i), float(j)) for i, j in str_intervals]
Answered By: Ted

Use the labelsargument in qcut and use pandas Categorical.

Either of those can help you create categories instead of interval for your variable. Then, you can use a form of encoding, for example Label Encoding or Ordinal Encoding to convert the categories (the factors if you’re used to R) to numerical values which the Forest will be able to use.

Then the process goes :

cutting => categoricals => encoding

and you don’t need to do it by hand anymore.

Lastly, some gradient boosted trees libraries have support for categorical variables though it’s not a silver bullet and will depend on your goal. See catboost and lightgbm.

Answered By: Nathan Furnal

For future searchers, there are benefits to using transformers from scikit-learn instead of pandas. In this case, KBinsDiscretizer is the scikit equivalent of qcut.
It can be used in a pipeline, which will handle applying the previously-learned discretization to unseen data without the need for storing the discretization dictionary separately or round trip string conversion. Here’s an example:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer


pipeline = make_pipeline(KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile'), 
                         RandomForestClassifier())
X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

If you really need to convert back and forth between pandas IntervalIndex and string, you’ll probably need to do some parsing as described in this answer: https://stackoverflow.com/a/65296110/3945991 and either use FunctionTransformer or write your own Transformer for pipeline integration.

Answered By: marcusaurelius