Adding a column with one single categorical value to a pandas dataframe
Question:
I have a pandas.DataFrame df and would like to add a new column col with one single value "hello". I would like this column to be of dtype category with the single category "hello". I can do the following.
df["col"] = "hello"
df["col"] = df["col"].astype("category")
- Do I really need to write df["col"] three times in order to achieve this?
- After the first line I am worried that the intermediate dataframe df might take up a lot of space before the new column is converted to categorical. (The dataframe is rather large with millions of rows and the value "hello" is actually a much longer string.)
Are there any other straightforward, "short and snappy" ways of achieving this while avoiding the above issues?
An alternative solution is
df["col"] = pd.Categorical(itertools.repeat("hello", len(df)))
but it requires itertools and the use of len(df), and I am not sure how memory usage is under the hood.
Answers:
This solution surely solves the first point, not sure about the second:
df['col'] = pd.Categorical('hello' for _ in range(len(df)))
Essentially
- we first create a generator of 'hello' with length equal to the number of records in df
- then we pass it to pd.Categorical to make it a categorical column.
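As a runnable sketch of this idea, the snippet below builds the column from a generator expression and from the itertools.repeat form given in the question. Note that the generator has to iterate over range(len(df)), since an integer is not iterable. The three-row dataframe is only a stand-in for the real data.

```python
import itertools

import pandas as pd

# Small stand-in dataframe; the real one is assumed to be much larger.
df = pd.DataFrame({"a": [1, 2, 3]})

# Generator expression: iterate over range(len(df)), not len(df) itself.
from_generator = pd.Categorical("hello" for _ in range(len(df)))

# The form from the question, using itertools.repeat.
from_repeat = pd.Categorical(itertools.repeat("hello", len(df)))

df["col"] = from_generator
print(df["col"].dtype)
print(list(df["col"].cat.categories))
```

Both forms write df["col"] only once. As far as I can tell, pd.Categorical still collects the iterable into a list before encoding it, so neither form is guaranteed to avoid a full-length intermediate.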
We can explicitly build the Series of the correct size and type instead of implicitly doing so via __setitem__ and then converting:
df['col'] = pd.Series('hello', index=df.index, dtype='category')
Sample Program:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3]})
df['col'] = pd.Series('hello', index=df.index, dtype='category')
print(df)
print(df.dtypes)
print(df['col'].cat.categories)
   a    col
0  1  hello
1  2  hello
2  3  hello
a         int64
col    category
dtype: object
Index(['hello'], dtype='object')
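For the memory concern in the question, one further option that is not among the answers here is to build the categorical from integer codes with pd.Categorical.from_codes, so that the string is passed only once as the category. The following is a minimal sketch under that assumption; the int8 code dtype is my own choice for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in dataframe; the real one is assumed to be much larger.
df = pd.DataFrame({"a": [1, 2, 3]})

# Every row refers to category number 0; the codes array is one byte per row.
codes = np.zeros(len(df), dtype=np.int8)

# The string appears exactly once, as the single category.
df["col"] = pd.Categorical.from_codes(codes, categories=["hello"])

print(df.dtypes)
print(df["col"].cat.categories)
```

This still needs len(df), but it never creates a full-length column of strings on the way to the categorical.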
A simple way to do this would be to use df.assign to create your new variable, then change the dtype to category using df.astype along with a dictionary of dtypes for the specific columns.
df = df.assign(col="hello").astype({'col':'category'})
df.dtypes
A int64
col category
dtype: object
That way you don't have to construct a Series of length equal to the dataframe yourself; assign broadcasts the input string directly. Note that assign still materializes a full-length column before astype converts it, so this mainly addresses the repetition raised in the first point rather than the memory concern in the second.
This approach also scales to several columns: you can assign multiple variables at once, some of them computed by functions of existing columns, and then set their dtypes as required.
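To get a rough feel for the memory question, a sketch like the following compares a full-length object column with its categorical counterpart. The row count and the repeated string are made-up stand-ins for the large dataframe and the longer value described in the question.

```python
import pandas as pd

n = 100_000            # hypothetical row count
value = "hello" * 20   # stand-in for the much longer string

# Intermediate state: a full-length object column (dtype set explicitly).
as_object = pd.Series([value] * n, dtype=object)

# Final state: a single stored category plus small integer codes.
as_category = as_object.astype("category")

object_bytes = as_object.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

With deep=True, pandas adds up the size of every element even when all rows refer to the same string object, so the object figure is best read as an upper bound. The underlying array of references is still proportional to the number of rows.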
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4]})
df = (df.assign(col1='hello',                          # Define column by broadcasting a scalar
                col2=lambda x: x['A']**2,              # Define column based on existing columns
                col3=lambda x: x['col2'] / x['A'])     # Define column based on previously defined columns
        .astype({'col1': 'category',
                 'col2': 'float'}))
print(df)
print(df.dtypes)
   A   col1  col2  col3
0  1  hello   1.0   1.0
1  2  hello   4.0   2.0
2  3  hello   9.0   3.0
3  4  hello  16.0   4.0
A          int64
col1    category  #<-changed dtype
col2     float64  #<-changed dtype
col3     float64
dtype: object