Reconstruct a categorical variable from dummies in pandas
Question:
pd.get_dummies allows you to convert a categorical variable into dummy variables. Reconstructing the categorical variable from those dummies is conceptually trivial, but is there a preferred/quick way to do it?
Answers:
In [46]: s = Series(list('aaabbbccddefgh')).astype('category')
In [47]: s
Out[47]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
In [48]: df = pd.get_dummies(s)
In [49]: df
Out[49]:
a b c d e f g h
0 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0
5 0 1 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0
7 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0
9 0 0 0 1 0 0 0 0
10 0 0 0 0 1 0 0 0
11 0 0 0 0 0 1 0 0
12 0 0 0 0 0 0 1 0
13 0 0 0 0 0 0 0 1
In [50]: x = df.stack()
# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
So I think we need a function to ‘do’ this, as it seems to be a natural operation. Maybe get_categories(); see here.
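A minimal sketch of what such a helper might look like, following the stack/get_level_values approach above. The name get_categories is hypothetical (taken from the suggestion), and it assumes exactly one 1 per row:

import pandas as pd

def get_categories(dummies):
    # Hypothetical helper: invert a dummy matrix back to a categorical Series.
    # Assumes exactly one 1 per row, so the stacked nonzero entries line up
    # one-to-one with the original rows.
    x = dummies.stack()
    return pd.Series(
        pd.Categorical(x[x != 0].index.get_level_values(1)),
        index=dummies.index,
    )

# Usage: get_categories(pd.get_dummies(s)) round-trips the series above.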
It’s been a few years, so this may well not have been in the pandas toolkit back when this question was originally asked, but this approach seems a little easier to me. idxmax will return the index corresponding to the largest element (i.e. the one with a 1). We do axis=1 because we want the column name where the 1 occurs.
EDIT: I didn’t bother making it categorical instead of just a string, but you can do that the same way as @Jeff did, by wrapping it with pd.Categorical (and pd.Series, if desired).
In [1]: import pandas as pd
In [2]: s = pd.Series(['a', 'b', 'a', 'c'])
In [3]: s
Out[3]:
0 a
1 b
2 a
3 c
dtype: object
In [4]: dummies = pd.get_dummies(s)
In [5]: dummies
Out[5]:
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
In [6]: s2 = dummies.idxmax(axis=1)
In [7]: s2
Out[7]:
0 a
1 b
2 a
3 c
dtype: object
In [8]: (s2 == s).all()
Out[8]: True
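The categorical wrapping mentioned in the EDIT above might look like this (a sketch; passing the dummy columns as the categories keeps unused levels):

s3 = pd.Series(pd.Categorical(dummies.idxmax(axis=1), categories=dummies.columns))
s3.dtype  # category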
EDIT in response to @piRSquared’s comment:
This solution does indeed assume there’s one 1 per row. I think this is usually the format one has. pd.get_dummies can return rows that are all 0 if you have drop_first=True, or if there are NaN values and dummy_na=False (the default; any cases I’m missing?). A row of all zeros will be treated as if it were an instance of the variable named in the first column (e.g. a in the example above).
If drop_first=True, you have no way to know from the dummies dataframe alone what the name of the “first” variable was, so that operation isn’t invertible unless you keep extra information around; I’d recommend leaving drop_first=False (the default).
Since dummy_na=False is the default, this could certainly cause problems. Please set dummy_na=True when you call pd.get_dummies if you want to use this solution to invert the “dummification” and your data contains any NaNs. Setting dummy_na=True will always add a “nan” column, even if that column is all 0s, so you probably don’t want to set it unless you actually have NaNs. A nice approach might be dummies = pd.get_dummies(series, dummy_na=series.isnull().any()). What’s also nice is that the idxmax solution will correctly regenerate your NaNs (not just a string that says “nan”).
It’s also worth mentioning that setting drop_first=True and dummy_na=False means that NaNs become indistinguishable from an instance of the first variable, so this should be strongly discouraged if your dataset may contain any NaN values.
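A short sketch of the NaN round trip described above (exact behavior may vary slightly across pandas versions):

import numpy as np
import pandas as pd

series = pd.Series(['a', 'b', np.nan, 'c'])
# Only add the NaN column when the data actually contains NaNs:
dummies = pd.get_dummies(series, dummy_na=series.isnull().any())
restored = dummies.idxmax(axis=1)  # the NaN row maps back to NaN, not “nan”
restored.equals(series)  # True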
This is quite a late answer, but since you ask for a quick way to do it, I assume you’re looking for the most performant strategy. On a large dataframe (for instance 10,000 rows), you can get a very significant speed boost by using np.where instead of idxmax or get_level_values, and get the same result. The idea is to index the column names where the dummy dataframe is not 0:
Method:
Using the same sample data as @Nathan:
>>> dummies
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
s2 = pd.Series(dummies.columns[np.where(dummies!=0)[1]])
>>> s2
0 a
1 b
2 a
3 c
dtype: object
Benchmark:
On a small dummy dataframe, you won’t see much difference in performance. However, testing different strategies for solving this problem on a large series:
import numpy as np
import pandas as pd

s = pd.Series(np.random.choice(['a','b','c'], 10000))
dummies = pd.get_dummies(s)
def np_method(dummies=dummies):
return pd.Series(dummies.columns[np.where(dummies!=0)[1]])
def idx_max_method(dummies=dummies):
return dummies.idxmax(axis=1)
def get_level_values_method(dummies=dummies):
x = dummies.stack()
return pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
def dot_method(dummies=dummies):
return dummies.dot(dummies.columns)
import timeit
# Time each method, 1000 iterations each:
>>> timeit.timeit(np_method, number=1000)
1.0491090340074152
>>> timeit.timeit(idx_max_method, number=1000)
12.119140846014488
>>> timeit.timeit(get_level_values_method, number=1000)
4.109266621991992
>>> timeit.timeit(dot_method, number=1000)
1.6741622970002936
The np.where method is about 4 times faster than the get_level_values method and about 11.5 times faster than the idxmax method! It also beats (but only by a little) the .dot() method outlined in this answer to a similar question.
They all return the same result:
>>> (get_level_values_method() == np_method()).all()
True
>>> (idx_max_method() == np_method()).all()
True
Setup
Using @Jeff’s setup:
s = Series(list('aaabbbccddefgh')).astype('category')
df = pd.get_dummies(s)
If columns are strings and there is only one 1 per row
df.dot(df.columns)
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: object
numpy.where
Again! Assuming only one 1 per row
i, j = np.where(df)
pd.Series(df.columns[j], i)
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a, b, c, d, e, f, g, h]
numpy.where
Not assuming one 1 per row
i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j])))
0 0 a
1 0 a
2 0 a
3 1 b
4 1 b
5 1 b
6 2 c
7 2 c
8 3 d
9 3 d
10 4 e
11 5 f
12 6 g
13 7 h
dtype: object
numpy.where
Where we don’t assume one 1 per row, and we drop the index
i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j]))).reset_index(-1, drop=True)
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: object
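To see why the dict-based variant matters, here is a hypothetical frame with an all-zero row and a row containing two 1s (neither occurs in @Jeff’s data above):

df2 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 0, 1]})
i, j = np.where(df2)
pd.Series(dict(zip(zip(i, j), df2.columns[j])))
# 0  0    a
# 2  0    a
#    1    b
# dtype: object
# Row 1 (all zeros) is simply absent; row 2 appears once per 1.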
Converting dat["classification"] to one-hot encodings and back!!
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# dat is assumed to be an existing DataFrame with a "classification" column
le = LabelEncoder()
dat["labels"] = le.fit_transform(dat["classification"])
Y = pd.get_dummies(dat["labels"])
tru = []
for i in range(0, len(Y)):
    tru.append(np.argmax(Y.iloc[i]))
tru = le.inverse_transform(tru)
## Identical check!
(tru == dat["classification"]).value_counts()
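The Python loop above can be replaced by a vectorized sketch, assuming (as pd.get_dummies guarantees) that the dummy columns come out in sorted label order, so column position equals the encoded label:

tru = le.inverse_transform(Y.values.argmax(axis=1))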
If you’re categorizing the rows in your dataframe based on some row-wise mutually exclusive boolean conditions (these are the “dummy” variables) which don’t form a partition (i.e. some rows are all 0 because of, for example, missing data), it may be better to initialize a pd.Categorical full of np.nan and then explicitly set the category of each subset. An example follows.
0. Data setup:
import numpy as np
import pandas as pd

np.random.seed(42)
student_names = list('abcdefghi')
marks = np.random.randint(0, 100, len(student_names)).astype(float)
passes = marks >= 50
marks[[1, 5]] = np.nan # artificially introduce NAs
students = pd.DataFrame({'mark': marks, 'pass': passes}, index=student_names)
>>> students
mark pass
a 51.0 True
b NaN True
c 14.0 False
d 71.0 True
e 60.0 True
f NaN False
g 82.0 True
h 86.0 True
i 74.0 True
1. Compute the value of the relevant boolean conditions:
failed = ~students['pass']
barely_passed = students['pass'] & (students['mark'] < 60)
well_passed = students['pass'] & (students['mark'] >= 60)
>>> pd.DataFrame({'f': failed, 'b': barely_passed, 'p': well_passed}).astype(int)
b f p
a 1 0 0
b 0 0 0
c 0 1 0
d 0 0 1
e 0 0 1
f 0 1 0
g 0 0 1
h 0 0 1
i 0 0 1
As you can see, row b has False for all three categories (since the mark is NaN and pass is True).
2. Generate the categorical series:
cat = pd.Series(
pd.Categorical([np.nan] * len(students), categories=["failed", "barely passed", "well passed"]),
index=students.index
)
cat[failed] = "failed"
cat[barely_passed] = "barely passed"
cat[well_passed] = "well passed"
>>> cat
a barely passed
b NaN
c failed
d well passed
e well passed
f failed
g well passed
h well passed
i well passed
As you can see, a NaN was kept where none of the categories applied. This approach is as performant as using np.where, but allows for the flexibility of possible NaNs.
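An equivalent sketch using np.select instead of the three boolean assignments (an alternative formulation, not part of the original answer); rows matching no condition fall through to the None default and end up as NaN in the categorical:

cat2 = pd.Series(
    pd.Categorical(
        np.select(
            [failed, barely_passed, well_passed],
            ["failed", "barely passed", "well passed"],
            default=None,
        ),
        categories=["failed", "barely passed", "well passed"],
    ),
    index=students.index,
)
cat2.equals(cat)  # True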
Another option is the from_dummies function, available since pandas version 1.5.0. Here is a reproducible example:
import pandas as pd
s = pd.Series(['a', 'b', 'a', 'c'])
df = pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
Using from_dummies:
pd.from_dummies(df)
0 a
1 b
2 a
3 c
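from_dummies can also invert dummies created with drop_first=True, provided you tell it what the dropped category was via default_category (here assuming we know it was 'a'):

df_dropped = pd.get_dummies(s, drop_first=True)
pd.from_dummies(df_dropped, default_category='a')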