Reconstruct a categorical variable from dummies in pandas


pd.get_dummies converts a categorical variable into dummy variables. Although it’s trivial to reconstruct the categorical variable, is there a preferred/quick way to do it?

Asked By: themiurgo



In [46]: s = pd.Series(list('aaabbbccddefgh')).astype('category')

In [47]: s
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

In [48]: df = pd.get_dummies(s)

In [49]: df
    a  b  c  d  e  f  g  h
0   1  0  0  0  0  0  0  0
1   1  0  0  0  0  0  0  0
2   1  0  0  0  0  0  0  0
3   0  1  0  0  0  0  0  0
4   0  1  0  0  0  0  0  0
5   0  1  0  0  0  0  0  0
6   0  0  1  0  0  0  0  0
7   0  0  1  0  0  0  0  0
8   0  0  0  1  0  0  0  0
9   0  0  0  1  0  0  0  0
10  0  0  0  0  1  0  0  0
11  0  0  0  0  0  1  0  0
12  0  0  0  0  0  0  1  0
13  0  0  0  0  0  0  0  1

In [50]: x = df.stack()

# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

So I think we need a function to ‘do’ this, as it seems to be a natural operation. Maybe get_categories().
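
A function along those lines could look something like the following. This is only a sketch: the name undummify is made up here (it is not a pandas API), and it assumes every row of the dummy frame has exactly one nonzero entry.

```python
import pandas as pd


def undummify(dummies: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: rebuild a categorical Series from a one-hot frame."""
    # Refuse frames where a row has zero or several nonzero entries,
    # because the stack-based lookup below would then misalign with the index.
    if not (dummies != 0).sum(axis=1).eq(1).all():
        raise ValueError("expected exactly one nonzero entry per row")
    stacked = dummies.stack()
    # Level 1 of the stacked index holds the column labels.
    labels = stacked[stacked != 0].index.get_level_values(1)
    return pd.Series(
        pd.Categorical(labels, categories=dummies.columns),
        index=dummies.index,
    )


s = pd.Series(list("aaabbbccddefgh"))
restored = undummify(pd.get_dummies(s, dtype=int))
print(restored.tolist() == s.tolist())
```

Passing the column labels as categories keeps categories that never occur in the data, which a bare pd.Categorical(labels) would drop.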

Answered By: Jeff

It’s been a few years, so this may well not have been in the pandas toolkit back when this question was originally asked, but this approach seems a little easier to me. idxmax will return the index corresponding to the largest element (i.e. the one with a 1). We do axis=1 because we want the column name where the 1 occurs.

EDIT: I didn’t bother making it categorical instead of just a string, but you can do that the same way as @Jeff did by wrapping it with pd.Categorical (and pd.Series, if desired).

In [1]: import pandas as pd

In [2]: s = pd.Series(['a', 'b', 'a', 'c'])

In [3]: s
0    a
1    b
2    a
3    c
dtype: object

In [4]: dummies = pd.get_dummies(s)

In [5]: dummies
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1

In [6]: s2 = dummies.idxmax(axis=1)

In [7]: s2
0    a
1    b
2    a
3    c
dtype: object

In [8]: (s2 == s).all()
Out[8]: True

EDIT in response to @piRSquared’s comment:
This solution does indeed assume there’s one 1 per row. I think this is usually the format one has. pd.get_dummies can return rows that are all 0 if you have drop_first=True or if there are NaN values and dummy_na=False (default) (any cases I’m missing?). A row of all zeros will be treated as if it was an instance of the variable named in the first column (e.g. a in the example above).
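
To make the all-zeros pitfall concrete, here is a small sketch. The masking step with where is my own addition rather than part of the idxmax approach itself, and dtype=int is only there to keep the 0/1 display used in this thread.

```python
import pandas as pd

s = pd.Series(["a", "b", None, "c"])

# dummy_na defaults to False, so the missing value becomes an all-zero row.
dummies = pd.get_dummies(s, dtype=int)

# idxmax picks the first column for the all-zero row, silently yielding "a".
naive = dummies.idxmax(axis=1)

# Keep a label only where the row actually contains a 1; other rows become NaN.
masked = naive.where(dummies.any(axis=1))

print(naive.tolist())
print(masked.isna().tolist())
```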

If drop_first=True, you have no way to know from the dummies dataframe alone what the name of the “first” variable was, so that operation isn’t invertible unless you keep extra information around; I’d recommend leaving drop_first=False (default).

Since dummy_na=False is the default, this could certainly cause problems. Please set dummy_na=True when you call pd.get_dummies if you want to use this solution to invert the “dummification” and your data contains any NaNs. Setting dummy_na=True will always add a “nan” column, even if that column is all 0s, so you probably don’t want to set this unless you actually have NaNs. A nice approach might be to set dummies = pd.get_dummies(series, dummy_na=series.isnull().any()). What’s also nice is that the idxmax solution will correctly regenerate your NaNs (not just a string that says “nan”).
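
A minimal round trip along those lines, as a sketch. The sample data is made up; bool(...) and dtype=int are small precautions on my part rather than requirements.

```python
import pandas as pd

s = pd.Series(["a", "b", None, "a"])

# Only add the NaN column when the data actually contains missing values.
dummies = pd.get_dummies(s, dummy_na=bool(s.isnull().any()), dtype=int)

# The extra column is labelled with a real NaN, so idxmax hands back NaN for that row.
restored = dummies.idxmax(axis=1)

print(restored.isna().tolist())
```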

It’s also worth mentioning that setting drop_first=True and dummy_na=False means that NaNs become indistinguishable from an instance of the first variable, so this should be strongly discouraged if your dataset may contain any NaN values.

Answered By: Nathan

This is quite a late answer, but since you ask for a quick way to do it, I assume you’re looking for the most performant strategy. On a large dataframe (for instance 10000 rows), you can get a very significant speed boost by using np.where instead of idxmax or get_level_values, and get the same result. The idea is to index the column names where the dummy dataframe is not 0:


Using the same sample data as @Nathan:

>>> dummies
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1

s2 = pd.Series(dummies.columns[np.where(dummies!=0)[1]])

>>> s2
0    a
1    b
2    a
3    c
dtype: object


On a small dummy dataframe, you won’t see much difference in performance. However, here is a test of different strategies for solving this problem on a large series:

import numpy as np
import pandas as pd

s = pd.Series(np.random.choice(['a','b','c'], 10000))

dummies = pd.get_dummies(s)

def np_method(dummies=dummies):
    return pd.Series(dummies.columns[np.where(dummies!=0)[1]])

def idx_max_method(dummies=dummies):
    return dummies.idxmax(axis=1)

def get_level_values_method(dummies=dummies):
    x = dummies.stack()
    return pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))

def dot_method(dummies=dummies):
    return dummies.dot(dummies.columns)

import timeit

# Time each method, 1000 iterations each:

>>> timeit.timeit(np_method, number=1000)

>>> timeit.timeit(idx_max_method, number=1000)

>>> timeit.timeit(get_level_values_method, number=1000)

>>> timeit.timeit(dot_method, number=1000)

The np.where method is about 4 times faster than the get_level_values method and 11.5 times faster than the idxmax method! It also beats (but only by a little) the .dot() method outlined in this answer to a similar question.

They all return the same result:

>>> (get_level_values_method() == np_method()).all()
>>> (idx_max_method() == np_method()).all()
Answered By: sacuL


Using @Jeff’s setup

s = pd.Series(list('aaabbbccddefgh')).astype('category')
df = pd.get_dummies(s)

If columns are strings

and there is only one 1 per row
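
A sketch of one approach that fits these two assumptions, assuming the matrix-product (.dot()) trick mentioned in the benchmark above is what is meant: multiplying each 0/1 entry by its column name and summing along the row leaves just the name of the column holding the 1.

```python
import pandas as pd

s = pd.Series(list("aaabbbccddefgh")).astype("category")
df = pd.get_dummies(s, dtype=int)  # dtype=int keeps the 0/1 display used in this thread

# 1 * "a" == "a" and 0 * "a" == "", so each row "sums" to its single label.
restored = df.dot(df.columns)

print(restored.tolist())
```

With more than one 1 per row the labels would simply be concatenated, which is why the one-per-row assumption matters here.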

0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: object


Again! Assuming only one 1 per row

i, j = np.where(df)
pd.Series(df.columns[j], i)

0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a, b, c, d, e, f, g, h]


Not assuming one 1 per row

i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j])))

0   0    a
1   0    a
2   0    a
3   1    b
4   1    b
5   1    b
6   2    c
7   2    c
8   3    d
9   3    d
10  4    e
11  5    f
12  6    g
13  7    h
dtype: object


Where we don’t assume one 1 per row and we drop the index

i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j]))).reset_index(-1, drop=True)

0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: object
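
For a frame that really does have several 1s in a row, the same np.where idea can be sketched like this. The toy frame and the groupby step that collects labels per row are my own additions for illustration.

```python
import numpy as np
import pandas as pd

# Toy multi-hot frame: rows 0 and 2 each carry two labels.
df = pd.DataFrame({"a": [1, 0, 1], "b": [1, 1, 0], "c": [0, 0, 1]})

# i holds row positions and j column positions of every nonzero entry.
i, j = np.where(df)
labels = pd.Series(df.columns[j], index=df.index[i])

# Collect all labels belonging to the same row into one list.
per_row = labels.groupby(level=0).agg(list)

print(per_row.tolist())
```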
Answered By: piRSquared

Converting dat["classification"] to one-hot encodings and back:

import pandas as pd

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

dat["labels"]= le.fit_transform(dat["classification"])

Y= pd.get_dummies(dat["labels"])


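# For each row, the position of the 1 is the integer label
tru = []
for i in range(0, len(Y)):
    tru.append(Y.iloc[i].to_numpy().argmax())

tru = le.inverse_transform(tru)

## Identical check!
(tru == dat["classification"]).all()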
Answered By: TBhavnani

If you’re categorizing the rows in your dataframe based on some row-wise mutually exclusive boolean conditions (these are the "dummy" variables) which don’t form a partition (i.e. some rows are all 0 because of, for example, some missing data), it may be better to initialize a pd.Categorical filled with np.nan and then explicitly set the category of each subset. An example follows.

0. Data setup:


import numpy as np
import pandas as pd

student_names = list('abcdefghi')
marks = np.random.randint(0, 100, len(student_names)).astype(float)
passes = marks >= 50
marks[[1, 5]] = np.nan  # artificially introduce NAs

students = pd.DataFrame({'mark': marks, 'pass': passes}, index=student_names)
>>> students
   mark   pass
a  51.0   True
b   NaN   True
c  14.0  False
d  71.0   True
e  60.0   True
f   NaN  False
g  82.0   True
h  86.0   True
i  74.0   True

1. Compute the value of the relevant boolean conditions:

failed = ~students['pass']
barely_passed = students['pass'] & (students['mark'] < 60)
well_passed = students['pass'] & (students['mark'] >= 60)
>>> pd.DataFrame({'f': failed, 'b': barely_passed, 'p': well_passed}).astype(int)
   b  f  p
a  1  0  0
b  0  0  0
c  0  1  0
d  0  0  1
e  0  0  1
f  0  1  0
g  0  0  1
h  0  0  1
i  0  0  1

As you can see row b has False for all three categories (since the mark is NaN and pass is True).

2. Generate the categorical series:

cat = pd.Series(
    pd.Categorical([np.nan] * len(students), categories=["failed", "barely passed", "well passed"]),
    index=students.index
)
cat[failed] = "failed"
cat[barely_passed] = "barely passed"
cat[well_passed] = "well passed"
>>> cat
a    barely passed
b              NaN
c           failed
d      well passed
e      well passed
f           failed
g      well passed
h      well passed
i      well passed

As you can see, a NaN was kept where none of the categories applied.

This approach is as performant as using np.where but allows for the flexibility of possible NaNs.

Answered By: Anakhand

Another option is the from_dummies function, available since pandas 1.5.0. Here is a reproducible example:

import pandas as pd
s = pd.Series(['a', 'b', 'a', 'c'])
df = pd.get_dummies(s)

   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1

Using from_dummies:

pd.from_dummies(df)

0  a
1  b
2  a
3  c
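
If the dummies were created with drop_first=True, from_dummies can still invert them as long as you supply the dropped category yourself through its default_category parameter. A short sketch, assuming pandas 1.5.0 or later and that you know the dropped category was "a":

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c"])

# drop_first=True removes the "a" column, leaving only b and c.
reduced = pd.get_dummies(s, drop_first=True, dtype=int)

# Rows that are all zeros are assigned the category given here.
restored = pd.from_dummies(reduced, default_category="a")

print(restored.iloc[:, 0].tolist())
```

The result is a one-column DataFrame rather than a Series, hence the iloc[:, 0].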
Answered By: Quinten