Python Pandas – Changing some column types to categories
Question:
I have fed the following CSV file into IPython Notebook:
public = pd.read_csv("categories.csv")
public
I’ve also imported pandas as pd, numpy as np, and matplotlib.pyplot as plt. The following data types are present (the below is a summary – there are about 100 columns):
In [36]: public.dtypes
Out[37]: parks object
playgrounds object
sports object
roading object
resident int64
children int64
I want to change ‘parks’, ‘playgrounds’, ‘sports’ and ‘roading’ to categories (they contain likert-scale responses, though each column has a different response set – e.g. one has "strongly agree", "agree", etc., another has "very important", "important", etc.), leaving the remainder as int64.
I was able to create a separate dataframe – public1 – and change one of the columns to a category type using the following code:
public1 = {'parks': public.parks}
public1 = public1['parks'].astype('category')
However, when I tried to change a number of columns at once using this code, I was unsuccessful:
public1 = {'parks': public.parks,
'playgrounds': public.parks}
public1 = public1['parks', 'playgrounds'].astype('category')
Notwithstanding this, I don’t want to create a separate dataframe with just the categories columns. I would like them changed in the original dataframe.
I tried numerous ways to achieve this, then tried the code here: Change column type in pandas.
public[['parks', 'playgrounds', 'sports', 'roading']] = public[['parks', 'playgrounds', 'sports', 'roading']].astype('category')
and got the following error:
NotImplementedError: > 1 ndim Categorical are not supported at this time
Is there a way to change ‘parks’, ‘playgrounds’, ‘sports’, ‘roading’ to categories (so the likert scale responses can then be analysed), leaving ‘resident’ and ‘children’ (and the 94 other columns that are string, int + floats) untouched?
I am using Python 2.7.
Answers:
Sometimes, you just have to use a for-loop:
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
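Since likert scales are ordered, it may also be worth telling pandas the ordering when you convert, so comparisons and sorting respect it. A sketch using pd.CategoricalDtype (available in newer pandas versions; the category lists below are assumptions, not from the question):

```python
import pandas as pd

# Hypothetical likert orderings per column – adjust to your survey's actual wording
orders = {
    'parks': ['disagree', 'agree', 'strongly agree'],
    'sports': ['not important', 'important', 'very important'],
}

df = pd.DataFrame({
    'parks': ['agree', 'strongly agree', 'disagree'],
    'sports': ['important', 'not important', 'very important'],
})

# Convert each column to an ordered categorical with its own level order
for col, cats in orders.items():
    df[col] = df[col].astype(pd.CategoricalDtype(categories=cats, ordered=True))

print(df['parks'].min())  # ordering enables min/max and < comparisons
```

With ordered categoricals, expressions like `df['parks'] >= 'agree'` work per the declared order rather than alphabetically.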
As of pandas 0.19.0, the What’s New notes describe that read_csv supports parsing Categorical columns directly.
This answer applies only if you’re starting from read_csv; otherwise, I think unutbu’s answer is still best.
Example on 10,000 records:
import pandas as pd
import numpy as np
# Generate random data, four category-like columns, two int columns
N=10000
categories = pd.DataFrame({
'parks' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
'playgrounds' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
'sports' : np.random.choice(['important', 'very important', 'not important'], size=N),
'roading' : np.random.choice(['important', 'very important', 'not important'], size=N),
'resident' : np.random.choice([1, 2, 3], size=N),
'children' : np.random.choice([0, 1, 2, 3], size=N)
})
categories.to_csv('categories_large.csv', index=False)
<0.19.0 (or >=0.19.0 without specifying dtype)
pd.read_csv('categories_large.csv').dtypes # inspect default dtypes
children int64
parks object
playgrounds object
resident int64
roading object
sports object
dtype: object
>=0.19.0
For mixed dtypes, parsing as Categorical can be implemented by passing a dictionary dtype={'colname': 'category', ...} to read_csv.
pd.read_csv('categories_large.csv', dtype={'parks': 'category',
'playgrounds': 'category',
'sports': 'category',
'roading': 'category'}).dtypes
children int64
parks category
playgrounds category
resident int64
roading category
sports category
dtype: object
Performance
A slight speed-up (local jupyter notebook), as mentioned in the release notes.
# unutbu's answer
%%timeit
public = pd.read_csv('categories_large.csv')
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
10 loops, best of 3: 20.1 ms per loop
# parsed during read_csv
%%timeit
category_cols = {item: 'category' for item in ['parks', 'playgrounds', 'sports', 'roading']}
public = pd.read_csv('categories_large.csv', dtype=category_cols)
100 loops, best of 3: 14.3 ms per loop
You can use the pandas.DataFrame.apply method along with a lambda expression to solve this. In your example you could use
df[['parks', 'playgrounds', 'sports']].apply(lambda x: x.astype('category'))
I don’t know of a way to execute this inplace, so typically I’ll end up with something like this:
df[df.select_dtypes(['object']).columns] = df.select_dtypes(['object']).apply(lambda x: x.astype('category'))
Obviously you can replace .select_dtypes with explicit column names if you don’t want to select all of a certain datatype (although in your example it seems like you wanted all object types).
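As a small self-contained illustration of the select_dtypes pattern above (the column names are just placeholders standing in for the question's data):

```python
import pandas as pd

df = pd.DataFrame({'parks': ['agree', 'disagree'],
                   'sports': ['important', 'not important'],
                   'resident': [1, 2]})

# Select the object columns, convert them, and assign the result back
obj_cols = df.select_dtypes(['object']).columns
df[obj_cols] = df[obj_cols].apply(lambda s: s.astype('category'))

print(df.dtypes)  # parks and sports become category; resident stays int64
```

The assignment back into `df[obj_cols]` is what makes the change stick, since `.apply` itself returns a new DataFrame rather than modifying in place.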
I found that using a for loop works well:
for col in ['col_variable_name_1', 'col_variable_name_2', ...]:
    dataframe_name[col] = dataframe_name[col].astype(float)
Jupyter Notebook
In my case, I had a big DataFrame with many object columns that I wanted to convert to category.
Therefore, I selected the object columns, filled anything that was NA with 'Missing', and then saved the result back into the original DataFrame, as in:
# Convert object columns to categories
obj_df = df.select_dtypes(include=['object']).copy()
obj_df = obj_df.fillna('Missing')
for col in obj_df:
    obj_df[col] = obj_df[col].astype('category')
df[obj_df.columns] = obj_df[obj_df.columns]
df.head()
I hope this might be a helpful resource for later reference.
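The fillna step matters because NaN is not stored as a category level by default; filling first makes the missing marker an explicit level. A small sketch of the difference (made-up data):

```python
import pandas as pd
import numpy as np

s = pd.Series(['agree', np.nan, 'disagree'])

# Converting directly: NaN stays NaN and is not a category level
direct = s.astype('category')
print(direct.cat.categories.tolist())   # ['agree', 'disagree']

# Filling first: 'Missing' becomes an explicit level
filled = s.fillna('Missing').astype('category')
print(filled.cat.categories.tolist())   # ['Missing', 'agree', 'disagree']
```

Which behavior you want depends on the analysis: explicit 'Missing' levels show up in groupbys and value counts, while NaN values are silently dropped by many aggregations.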
No need for loops; pandas can do it directly now. Just pass a list of the columns you want to convert and pandas will convert them all.
cols = ['parks', 'playgrounds', 'sports', 'roading']
public[cols] = public[cols].astype('category')
df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['c', 'd', 'e']})
>> a b
>> 0 a c
>> 1 b d
>> 2 c e
df.dtypes
>> a object
>> b object
>> dtype: object
df[df.columns] = df[df.columns].astype('category')
df.dtypes
>> a category
>> b category
>> dtype: object
To make things easier. No apply. No map. No loop.
cols = data.select_dtypes(exclude='int').columns.to_list()
data[cols] = data[cols].astype('category')
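One caveat worth checking against your own data: exclude='int' only excludes integer columns, so float (and any other non-int) columns get converted to category as well. A sketch with made-up columns:

```python
import pandas as pd

data = pd.DataFrame({'parks': ['agree', 'disagree'],
                     'resident': [1, 2],
                     'score': [0.5, 0.7]})

# Everything except int columns is selected – including the float column
cols = data.select_dtypes(exclude='int').columns.to_list()
data[cols] = data[cols].astype('category')

print(data.dtypes)  # 'score' (float) also became category; 'resident' stayed int64
```

If you only want the string-like columns, select_dtypes(include='object') is the more targeted choice.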
Using a list comprehension (avoiding a loop), this converts all columns with dtype object to dtype category. I’ve used ‘df’ as the dataframe to be more generic.
df[[col for col in df.columns if df[col].dtypes == object]].astype('category', copy=False)
In case you’d like to avoid the copy=False argument for some reason (as the Python documentation tells us to be careful while using it), you may use the following line.
df[[col for col in df.columns if df[col].dtypes == object]] = df[[col for col in df.columns if df[col].dtypes == object]].astype('category')
This is my first answer on stack, so please be kind.