How to create dummies for certain columns with pandas.get_dummies()
Question:
df = pd.DataFrame({'A': ['x', 'y', 'x'], 'B': ['z', 'u', 'z'],
'C': ['1', '2', '3'],
'D':['j', 'l', 'j']})
I only want dummies for columns A and D, not for column B. If I use pd.get_dummies(df), all columns are turned into dummies.
I want the final result to contain all of the columns, which means columns B and C still exist, like 'A_x','A_y','B','C','D_j','D_l'.
Answers:
Just select the two columns you want to pass to .get_dummies() (the resulting column names indicate the source column and the value represented as a binary variable), and pd.concat() the original columns you want unchanged:
pd.concat([pd.get_dummies(df[['A', 'D']]), df[['B', 'C']]], axis=1)
A_x A_y D_j D_l B C
0 1.0 0.0 1.0 0.0 z 1
1 0.0 1.0 0.0 1.0 u 2
2 1.0 0.0 1.0 0.0 z 3
It can be done without concatenation, by passing the columns parameter (and optionally prefix) to get_dummies():
In [294]: pd.get_dummies(df, prefix=['A', 'D'], columns=['A', 'D'])
Out[294]:
B C A_x A_y D_j D_l
0 z 1 1.0 0.0 1.0 0.0
1 u 2 0.0 1.0 0.0 1.0
2 z 3 1.0 0.0 1.0 0.0
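To get exactly the column layout asked for in the question, the result of the columns= approach can be reordered afterwards. This is only a minimal sketch: passing dtype=int is an assumption added here to force 0/1 output (the default output dtype differs between pandas versions), and the explicit reordering step is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'x'], 'B': ['z', 'u', 'z'],
                   'C': ['1', '2', '3'],
                   'D': ['j', 'l', 'j']})

# Only A and D are encoded; B and C pass through unchanged.
# dtype=int is an assumption: it forces integer 0/1 dummy columns.
result = pd.get_dummies(df, columns=['A', 'D'], dtype=int)

# Untouched columns come first, so reorder to the layout from the question.
result = result[['A_x', 'A_y', 'B', 'C', 'D_j', 'D_l']]
print(result)
```

Omitting prefix works here because get_dummies() falls back to the source column names as prefixes.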
Adding to the answers above: if you have a big dataset with lots of attributes and don't want to specify by hand all of the columns to encode, you can use a set difference:
# suppose len(df.columns) == 50
non_dummy_cols = ['A','B','C']
# Takes all 47 other columns
dummy_cols = list(set(df.columns) - set(non_dummy_cols))
df = pd.get_dummies(df, columns=dummy_cols)
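One thing to be aware of with a plain set is that it has no defined order, so the order of the dummy columns in the result may vary. A possible variation, sketched here on a small made-up frame (the data and the name encoded are illustrative, not from the answer), is to use Index.difference(), which returns the remaining column labels sorted by default:

```python
import pandas as pd

# Small hypothetical frame standing in for the 50-column dataset.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6],
                   'D': ['p', 'q'], 'E': ['r', 'r']})

non_dummy_cols = ['A', 'B', 'C']

# Index.difference drops the excluded labels and sorts what is left,
# so the column order of the result is reproducible.
dummy_cols = df.columns.difference(non_dummy_cols)

# dtype=int is an assumption to force 0/1 output.
encoded = pd.get_dummies(df, columns=dummy_cols, dtype=int)
print(encoded.columns.tolist())
```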
- The other answers are great for the specific example in the OP.
- This answer is for cases where there may be many columns, and it's too cumbersome to type out all the column names.
- This is a non-exhaustive set of ways to pass many different columns to get_dummies while excluding some columns.
- Using the built-in filter() function on df.columns is also an option.
- When columns=None, pd.get_dummies only converts columns with object, string, or category dtype.
  - Another potential option is to give only the columns to be transformed the object dtype, and make sure the columns that shouldn't be transformed are not object dtype.
- Using set(), as shown in this answer, is yet another option.
import pandas as pd
import string # for data
import numpy as np
# create test data
np.random.seed(15)
df = pd.DataFrame(np.random.randint(1, 4, size=(5, 10)), columns=list(string.ascii_uppercase[:10]))
# display(df)
A B C D E F G H I J
0 1 2 1 2 1 1 2 3 2 2
1 2 1 3 3 1 2 2 1 2 1
2 2 3 1 3 2 2 1 2 3 3
3 3 2 1 2 3 2 3 1 3 1
4 1 1 1 3 3 1 2 1 2 1
Option 1
- If the excluded columns are fewer than the included columns, specify the columns to exclude, and then use a list comprehension to remove them from the list passed to the columns= parameter.
# columns not to transform
not_cols = ['C', 'G']
# get dummies
df_dummies = pd.get_dummies(data=df, columns=[col for col in df.columns if col not in not_cols])
C G A_1 A_2 A_3 B_1 B_2 B_3 D_2 D_3 E_1 E_2 E_3 F_1 F_2 H_1 H_2 H_3 I_2 I_3 J_1 J_2 J_3
0 1 2 1 0 0 0 1 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0
1 3 2 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0
2 1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 0 0 1
3 1 3 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 1 0 0
4 1 2 1 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0
Option 2
- If the columns to exclude are at the beginning or end, slice df.columns.
df_dummies = pd.get_dummies(data=df, columns=df.columns[2:])
A B C_1 C_3 D_2 D_3 E_1 E_2 E_3 F_1 F_2 G_1 G_2 G_3 H_1 H_2 H_3 I_2 I_3 J_1 J_2 J_3
0 1 2 1 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0
1 2 1 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0
2 2 3 1 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 1
3 3 2 1 0 1 0 0 0 1 0 1 0 0 1 1 0 0 0 1 1 0 0
4 1 1 1 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0
Option 3
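- Specify slices and then concat the excluded columns to the dummies.
  - Uses pd.concat, similar to this answer, but with more columns.
  - np.r_ translates slice objects to concatenation along the first axis, so several slices combine into one array of column positions.
  - .astype(object) is needed because, with columns=None, integer columns would not be converted.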
slices = np.r_[slice(0, 2), slice(3, 6), slice(7, 10)]
excluded = [2, 6]
df_dummies = pd.concat([df.iloc[:, excluded], pd.get_dummies(data=df.iloc[:, slices].astype(object))], axis=1)
C G A_1 A_2 A_3 B_1 B_2 B_3 D_2 D_3 E_1 E_2 E_3 F_1 F_2 H_1 H_2 H_3 I_2 I_3 J_1 J_2 J_3
0 1 2 1 0 0 0 1 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0
1 3 2 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0
2 1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 1 0 0 1
3 1 3 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 1 0 0
4 1 2 1 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0
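The filter() and dtype-based options mentioned in the bullets near the top of this answer are not shown in Options 1 to 3. A minimal sketch of both, reusing the same test data and the not_cols list from Option 1, might look like the following; the names via_filter, as_object and via_dtype are illustrative, and dtype=int is an assumption to force 0/1 output.

```python
import string

import numpy as np
import pandas as pd

# Same test data as above.
np.random.seed(15)
df = pd.DataFrame(np.random.randint(1, 4, size=(5, 10)),
                  columns=list(string.ascii_uppercase[:10]))

not_cols = ['C', 'G']

# Built-in filter(): keep every column name that is not excluded.
dummy_cols = list(filter(lambda col: col not in not_cols, df.columns))
via_filter = pd.get_dummies(df, columns=dummy_cols, dtype=int)

# dtype route: cast only the columns to encode to object, leave the
# others numeric, and let columns=None pick up the object columns.
as_object = df.astype({col: object for col in dummy_cols})
via_dtype = pd.get_dummies(as_object, dtype=int)

print(via_filter.columns.tolist() == via_dtype.columns.tolist())
```

Both routes leave C and G untouched; which one reads better probably depends on whether the columns are easier to describe by name or by dtype.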