Creating dummy variables in pandas for python

Question

I’m trying to create a series of dummy variables from a categorical variable using pandas in python. I’ve come across the get_dummies function, but whenever I try to call it I receive an error that the name is not defined.

Any thoughts or other ways to create the dummy variables would be appreciated.

EDIT: Since others seem to be coming across this, the get_dummies function in pandas now works perfectly fine. This means the following should work:

import pandas as pd

dummies = pd.get_dummies(df['Category'])

See http://blog.yhathq.com/posts/logistic-regression-and-python.html for further information.
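
As a rough illustration of that call, here is a minimal sketch on a toy DataFrame. The data is made up; only the 'Category' column name comes from the snippet above, and dtype=int is an optional choice to get 0/1 values rather than booleans.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({'Category': ['a', 'b', 'a', 'c']})

# One indicator column per distinct value of 'Category'.
dummies = pd.get_dummies(df['Category'], dtype=int)
print(dummies)
```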

Asked By: user1074057

Answer 1

It’s hard to infer what you’re looking for from the question, but my best guess is as follows.

If we assume you have a DataFrame where some column is ‘Category’ and contains integers (or otherwise unique identifiers) for categories, then we can do the following.

Call the DataFrame dfrm, and assume that for each row, dfrm['Category'] is some value in the set of integers from 1 to N. Then,

for elem in dfrm['Category'].unique():
    dfrm[str(elem)] = dfrm['Category'] == elem

Now there will be a new indicator column for each category that is True/False depending on whether the data in that row are in that category.

If you want to control the category names, you could make a dictionary, such as

cat_names = {1:'Some_Treatment', 2:'Full_Treatment', 3:'Control'}
for elem in dfrm['Category'].unique():
    dfrm[cat_names[elem]] = dfrm['Category'] == elem

to result in having columns with specified names, rather than just string conversion of the category values. In fact, for some types, str() may not produce anything useful for you.
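
If pd.get_dummies is available, the same named columns can probably be had without an explicit loop. This is only a sketch on made-up data: it maps the codes to labels first and then builds the indicators, assuming every code appears in cat_names.

```python
import pandas as pd

# Hypothetical toy data; the category codes 1..3 follow the description above.
dfrm = pd.DataFrame({'Category': [1, 2, 3, 1]})
cat_names = {1: 'Some_Treatment', 2: 'Full_Treatment', 3: 'Control'}

# Translate codes to labels, then build one indicator column per label.
indicators = pd.get_dummies(dfrm['Category'].map(cat_names))
dfrm = pd.concat([dfrm, indicators], axis=1)
print(dfrm)
```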

Answered By: ely

Answer 2

I needed an answer to this question today (7/25/2013), so I wrote the function below. I’ve tested it on some toy examples; hopefully you get some mileage out of it.

def categorize_dict(x, y=0):
    # x: a sequence of strings or numbers
    # y: boolean that specifies whether to return the category names along with the dict
    # (default: no)
    cats = list(set(x))
    n = len(cats)
    m = len(x)
    outs = {}
    for i in cats:
        outs[i] = [0]*m
    for i in range(len(x)):
        outs[x[i]][i] = 1
    if y:
        return outs,cats
    return outs
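
To show how the resulting dict might be used, here is a short sketch. The function is restated in compact form (without the optional second return value) so the snippet runs on its own; the input list is made up.

```python
import pandas as pd

def categorize_dict(x):
    # Compact restatement of the function above: one 0/1 list per category.
    outs = {cat: [0] * len(x) for cat in set(x)}
    for i, value in enumerate(x):
        outs[value][i] = 1
    return outs

outs = categorize_dict(['red', 'green', 'red'])

# A dict of equal-length lists converts directly into a DataFrame of dummies.
dummy_df = pd.DataFrame(outs)
print(dummy_df)
```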

Answered By: ThomasRoderick

Answer 3

When I think of dummy variables I think of using them in the context of OLS regression, and I would do something like this:

import numpy as np
import pandas as pd
import statsmodels.api as sm

my_data = np.array([[5, 'a', 1],
                    [3, 'b', 3],
                    [1, 'b', 2],
                    [3, 'a', 1],
                    [4, 'b', 2],
                    [7, 'c', 1],
                    [7, 'c', 1]])                


df = pd.DataFrame(data=my_data, columns=['y', 'dummy', 'x'])
just_dummies = pd.get_dummies(df['dummy'])

step_1 = pd.concat([df, just_dummies], axis=1)      
step_1.drop(['dummy', 'c'], inplace=True, axis=1)
# to run the regression we want to get rid of the strings 'a', 'b', 'c' (obviously)
# and we want to get rid of one dummy variable to avoid the dummy variable trap
# arbitrarily chose "c", coefficients on "a" and "b" would show effect of "a" and "b"
# relative to "c"
step_1 = step_1.astype(int)  # np.int was removed in NumPy 1.24; the builtin int works

result = sm.OLS(step_1['y'], sm.add_constant(step_1[['x', 'a', 'b']])).fit()
print(result.summary())
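
As a possible shortcut for the dummy variable trap, get_dummies has a drop_first option. The sketch below uses the same toy values, built with proper dtypes. Note that drop_first omits the first level ('a'), so the baseline differs from the code above, which drops 'c'.

```python
import pandas as pd

# Same toy values as above, built from a dict so that y and x stay numeric.
df = pd.DataFrame({'y': [5, 3, 1, 3, 4, 7, 7],
                   'dummy': ['a', 'b', 'b', 'a', 'b', 'c', 'c'],
                   'x': [1, 3, 2, 1, 2, 1, 1]})

# drop_first=True leaves out the first level ('a'), which becomes the baseline.
design = pd.get_dummies(df, columns=['dummy'], drop_first=True, dtype=int)
print(design)
```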

Answered By: Akavall

Answer 4

I created a dummy variable for every state using this code.

def create_dummy_column(series, f):
    return series.apply(f)

for el in df.area_title.unique():
    col_name = el.split()[0] + "_dummy"
    f = lambda x: int(x==el)
    df[col_name] = create_dummy_column(df.area_title, f)
df.head()

More generally, I would just use .apply and pass it an anonymous function with the condition that defines your category.
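
A small sketch of that idea, with made-up values in area_title and hypothetical column names. Any condition that returns True/False can define the dummy, not only an exact match.

```python
import pandas as pd

# Hypothetical toy data; only the column name area_title comes from the code above.
df = pd.DataFrame({'area_title': ['Texas', 'Ohio', 'New York', 'Ohio']})

# Dummy defined by an exact match.
df['Ohio_dummy'] = df['area_title'].apply(lambda x: int(x == 'Ohio'))

# Dummy defined by an arbitrary condition.
df['starts_with_N_dummy'] = df['area_title'].apply(lambda x: int(x.startswith('N')))
print(df)
```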

(Thank you to @prpl.mnky.dshwshr for the .unique() insight)

Answered By: userFog

Answer 5

Based on the official documentation:

dummies = pd.get_dummies(df['Category']).rename(columns=lambda x: 'Category_' + str(x))
df = pd.concat([df, dummies], axis=1)
df = df.drop(['Category'], axis=1)

There is also a nice post in the FastML blog.

Answered By: beyondfloatingpoint

Answer 6

In my case, dmatrices from patsy solved the problem. The function is designed to build the dependent- and independent-variable matrices from a DataFrame using an R-style formula string, but it can also be used to generate dummy features from categorical features. All you need to do is drop the ‘Intercept’ column, which dmatrices adds automatically regardless of your original DataFrame.

import pandas as pd
from patsy import dmatrices

df_original = pd.DataFrame({
   'A': ['red', 'green', 'red', 'green'],
   'B': ['car', 'car', 'truck', 'truck'],
   'C': [10,11,12,13],
   'D': ['alice', 'bob', 'charlie', 'alice']},
   index=[0, 1, 2, 3])

_, df_dummyfied = dmatrices('A ~ A + B + C + D', data=df_original, return_type='dataframe')
df_dummyfied = df_dummyfied.drop('Intercept', axis=1)

df_dummyfied.columns    
Index([u'A[T.red]', u'B[T.truck]', u'D[T.bob]', u'D[T.charlie]', u'C'], dtype='object')

df_dummyfied
   A[T.red]  B[T.truck]  D[T.bob]  D[T.charlie]     C
0       1.0         0.0       0.0           0.0  10.0
1       0.0         0.0       1.0           0.0  11.0
2       1.0         1.0       0.0           1.0  12.0
3       0.0         1.0       0.0           0.0  13.0

Answered By: Erdem KAYA

Answer 7

The following code returns a DataFrame with the ‘Category’ column replaced by dummy columns:

df_with_dummies = pd.get_dummies(df, prefix='Category', columns=['Category'])
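
A minimal sketch of how the column names come out, on made-up data. The separator between prefix and value is controlled by prefix_sep, which defaults to '_', so a prefix that already ends in an underscore would yield a double underscore.

```python
import pandas as pd

# Hypothetical toy data; 'value' is an extra column added for illustration.
df = pd.DataFrame({'Category': ['a', 'b', 'a'], 'value': [1, 2, 3]})

# Names are built as prefix + prefix_sep + category value.
df_with_dummies = pd.get_dummies(df, prefix='Category', prefix_sep='_',
                                 columns=['Category'])
print(list(df_with_dummies.columns))
```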

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

Answered By: Spas

Answer 8

You can create dummy variables to handle the categorical data:

# Creating dummy variables for categorical datatypes
trainDfDummies = pd.get_dummies(trainDf, columns=['Col1', 'Col2', 'Col3', 'Col4'])

The result, trainDfDummies, no longer contains the original columns; the dummy-variable columns are appended at the end. trainDf itself is left unchanged.

Column names are created automatically by appending each value to the original column name.
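
For a concrete picture of the naming and column order, here is a sketch with two made-up categorical columns and one numeric column (the values and the 'num' column are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data.
trainDf = pd.DataFrame({'Col1': ['x', 'y'],
                        'Col2': ['p', 'q'],
                        'num': [1.5, 2.5]})

# Untouched columns come first, followed by the dummies for each listed column.
trainDfDummies = pd.get_dummies(trainDf, columns=['Col1', 'Col2'])
print(list(trainDfDummies.columns))
```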

Answered By: rzskhr

Answer 9

Handling categorical features

scikit-learn expects all features to be numeric. So how do we include a categorical feature in our model?

Ordered categories: transform them to sensible numeric values (example: small=1, medium=2, large=3)
Unordered categories: use dummy encoding (0/1)

What are the categorical features in our dataset?

Ordered categories: weather (already encoded with sensible numeric values)
Unordered categories: season (needs dummy encoding), holiday (already dummy encoded), workingday (already dummy encoded)

For season, we can’t simply leave the encoding as 1 = spring, 2 = summer, 3 = fall, and 4 = winter, because that would imply an ordered relationship. Instead, we create multiple dummy variables:

# A utility function to create dummy variables
def create_dummies(df, colname):
    col_dummies = pd.get_dummies(df[colname], prefix=colname)
    # drop the first dummy column so it serves as the baseline category
    col_dummies.drop(col_dummies.columns[0], axis=1, inplace=True)
    df = pd.concat([df, col_dummies], axis=1)
    df.drop(colname, axis=1, inplace=True)
    return df
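
The two cases described above can be sketched side by side. The 'size' column and its values are hypothetical, added only to show the ordered case; 'season' uses the integer codes mentioned above.

```python
import pandas as pd

# Hypothetical toy data: 'size' is ordered, 'season' is unordered.
df = pd.DataFrame({'size': ['small', 'large', 'medium'],
                   'season': [1, 2, 4]})

# Ordered category: map each label to a sensible number.
size_order = {'small': 1, 'medium': 2, 'large': 3}
df['size'] = df['size'].map(size_order)

# Unordered category: dummy-encode, leaving out the first level as the baseline.
season_dummies = pd.get_dummies(df['season'], prefix='season', dtype=int).iloc[:, 1:]
df = pd.concat([df.drop(columns='season'), season_dummies], axis=1)
print(df)
```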

Answered By: subodh agrawal

Answer 10

A very simple approach without get_dummies, using NumPy and pandas, if you have only a few categories.

Say I have a column named "State" with three categories, ‘New York’, ‘California’ and ‘Florida’, and we want a 0/1 column for each.

We can do it with the following simple code.

import numpy as np
import pandas as pd

dataset['NewYork_State'] = np.where(dataset['State']=='New York', 1, 0)
dataset['California_State'] = np.where(dataset['State']=='California', 1, 0)
dataset['Florida_State'] = np.where(dataset['State']=='Florida', 1, 0)

Above we create three new columns, "NewYork_State", "California_State" and "Florida_State", to store the values.

Drop the original column:

dataset.drop(columns=['State'], inplace=True)
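
With more than a handful of categories, the same np.where pattern could be put in a loop. This is a sketch on made-up data; the column-naming rule (spaces removed, '_State' appended) is an assumption chosen to match the names used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
dataset = pd.DataFrame({'State': ['New York', 'California', 'Florida', 'New York']})

# One 0/1 column per distinct state, in order of first appearance.
for state in dataset['State'].unique():
    col = state.replace(' ', '') + '_State'
    dataset[col] = np.where(dataset['State'] == state, 1, 0)

# Drop the original column once the indicators exist.
dataset = dataset.drop(columns=['State'])
print(dataset)
```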

Answered By: Raushan kumar

Answer 11

A simple and robust way to create dummies based on a column with your category values:

for category in list(df['category_column'].unique()):
    df[category] = list(map(lambda x: 1 if x==category else 0, df['category_column']))

But watch out when running an OLS regression: you will need to exclude one of the categories so you don’t fall into the dummy variable trap.

Answered By: Ramon

Answer 12

If you want to replace a list of variables with dummy features:

# create an empty list to store the dataframes
dataframes = []

# iterate over the list of categorical features
for feature in categoricalFeatures:

    # create a dataframe with dummy variables for the current feature
    df_feature = pd.get_dummies(df_raw[feature])

    # add the dataframe to the list
    dataframes.append(df_feature)

# concatenate the dataframes to create a single dataframe
df_dummies = pd.concat(dataframes, axis=1)
df_final = pd.concat([df_raw, df_dummies], axis=1).drop(columns=categoricalFeatures)

Answered By: Samira Eshghi
