pandas: Convert string column to ordered Category?
Question:
I’m working with pandas for the first time. I have a column of survey responses, which can take the values ‘strongly agree’, ‘agree’, ‘disagree’, ‘strongly disagree’, and ‘neither’.
This is the output of describe() and value_counts() for the column:
count 4996
unique 5
top Agree
freq 1745
dtype: object
Agree 1745
Strongly agree 926
Strongly disagree 918
Disagree 793
Neither 614
dtype: int64
I want to do a linear regression on this question versus overall score. However, I have a feeling that I should convert the column into a Category variable first, given that it’s inherently ordered. Is this correct? If so, how should I do this?
I’ve tried this:
df['EasyToUseQuestionFactor'] = pd.Categorical(df.EasyToUseQuestion)
print(df.EasyToUseQuestionFactor)
This produces output that looks vaguely right, but it seems that the categories are in the wrong order. Is there a way that I can specify ordering? Do I even need to specify ordering?
This is the rest of my code right now:
import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv('./data/responses.csv')
lm1 = ols('OverallScore ~ EasyToUseQuestion', df).fit()
print(lm1.rsquared)
Answers:
Yes, you should convert it. Mapping each response to a numeric score on the Likert scale should do the trick:
likert_scale = {'strongly agree':2, 'agree':1, 'neither':0, 'disagree':-1, 'strongly disagree':-2}
# Lower-case the labels so they match the dictionary keys ('Agree' -> 'agree')
df['categorical_data'] = df.EasyToUseQuestion.str.lower().map(likert_scale)
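The labels in the question's value_counts() output are capitalised, while the dictionary keys above are lower-case, so it may help to normalise the text and check for labels that did not map. Below is a minimal sketch on made-up data; the "N/A" row and the likert_score column name are assumptions for illustration, not part of the original data.

```python
import pandas as pd

# Toy responses; the capitalisation mirrors the value_counts() output in the question.
# The "N/A" entry is a hypothetical unexpected label.
df = pd.DataFrame({
    "EasyToUseQuestion": ["Agree", "Strongly agree", "Neither", "Disagree", "N/A"],
})

likert_scale = {
    "strongly agree": 2,
    "agree": 1,
    "neither": 0,
    "disagree": -1,
    "strongly disagree": -2,
}

# Strip whitespace and lower-case the labels so they match the dictionary keys.
normalized = df["EasyToUseQuestion"].str.strip().str.lower()

# Series.map leaves labels that are missing from the dictionary as NaN
# instead of raising a KeyError.
df["likert_score"] = normalized.map(likert_scale)

# Rows whose label was not recognised; worth inspecting before fitting a model.
unmapped = df[df["likert_score"].isna()]
print(df)
print(unmapped)
```

Because of the NaN, the score column is stored as floats; unrecognised rows can then be dropped or corrected before the regression.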
There are two ways to do it nowadays. With a categorical dtype the column stays readable and uses less memory, and you can still order the values.
First, my preferred one:
df['grades'] = df['grades'].astype('category')
astype used to accept a categories argument, but it no longer does. So if you want your categories in a non-lexicographical order, or want extra categories that aren’t present in your data, you must use the solution below.
This recommendation is from the docs:
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])
cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
# Values not listed in categories ("a" here) become NaN
s_cat = s.astype(cat_type)
Extra tip: get all existing values from a column with df.colname.unique().
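Applied to the survey column from the question, the same approach might look like the sketch below. The order of the categories list is an assumption (lowest to highest agreement), and the sample responses are made up.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Made-up sample of responses, capitalised as in the question's output.
responses = pd.Series(
    ["Agree", "Strongly agree", "Neither", "Strongly disagree", "Disagree", "Agree"]
)

# Categories listed from lowest to highest; ordered=True makes that order meaningful.
likert_type = CategoricalDtype(
    categories=["Strongly disagree", "Disagree", "Neither", "Agree", "Strongly agree"],
    ordered=True,
)
ordered = responses.astype(likert_type)

# Integer position of each response within the declared order
# (0 = "Strongly disagree", 4 = "Strongly agree").
codes = ordered.cat.codes
print(codes.tolist())

# Ordered categoricals support comparisons against a category label and min/max.
at_least_agree = ordered >= "Agree"
print(at_least_agree.tolist())
print(ordered.min(), ordered.max())
```

The integer codes from .cat.codes can be used as a numeric predictor if you are willing to treat the scale as equally spaced; otherwise the categorical column itself can be passed to the model.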
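pandas.factorize() can obtain a numeric representation of an array. It is available both as a top-level function, pandas.factorize(), and as the methods Series.factorize() and Index.factorize(). Note that the codes follow the order in which values first appear in the data, not the order of the scale.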
import pandas as pd
df = pd.DataFrame({'answer' : ['strongly agree', 'strongly agree', 'agree', 'neither', 'disagree', 'strongly disagree']})
# df['category'] = pd.factorize(df['answer'])[0]
df['category'] = df['answer'].factorize()[0]
print(df)
#               answer  category
# 0     strongly agree         0
# 1     strongly agree         0
# 2              agree         1
# 3            neither         2
# 4           disagree         3
# 5  strongly disagree         4
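factorize returns two things, the integer codes and the unique labels, and the example above keeps only the first. A small sketch of using both follows; the sample data is made up and the variable names are only illustrative.

```python
import pandas as pd

answers = pd.Series(["strongly agree", "strongly agree", "agree", "neither", "agree"])

# factorize returns (codes, uniques); codes follow the order of first appearance.
codes, uniques = answers.factorize()
print(codes.tolist())
print(list(uniques))

# uniques.take(codes) rebuilds the original labels from the integer codes.
restored = uniques.take(codes)
print(list(restored))

# sort=True numbers the labels in sorted (alphabetical) order instead.
sorted_codes, sorted_uniques = answers.factorize(sort=True)
print(sorted_codes.tolist())
print(list(sorted_uniques))
```

Neither numbering is guaranteed to match the agreement scale, so for a regression the codes are only meaningful if the labels happen to appear (or sort) in scale order.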