Adding a column with one single categorical value to a pandas dataframe

Question:

I have a pandas.DataFrame df and would like to add a new column col with one single value "hello". I would like this column to be of dtype category with the single category "hello". I can do the following.

df["col"] = "hello"
df["col"] = df["col"].astype("category")
  1. Do I really need to write df["col"] three times in order to achieve this?
  2. After the first line I am worried that the intermediate dataframe df might take up a lot of space before the new column is converted to categorical. (The dataframe is rather large with millions of rows and the value "hello" is actually a much longer string.)

Are there any other straightforward, "short and snappy" ways of achieving this while avoiding the above issues?

An alternative solution is

df["col"] = pd.Categorical(itertools.repeat("hello", len(df)))

but it requires itertools and the use of len(df), and I am not sure how memory usage is under the hood.

Asked By: DustByte

Answers:

This solution handles the first point; I am not sure about the second:

df['col'] = pd.Categorical('hello' for _ in range(len(df)))

Essentially

  • we first create a generator that yields ‘hello’ once for each row of df
  • then we pass it to pd.Categorical to make it a categorical column.
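
On the second point, one rough way to check is to compare the shallow memory footprint of a plain column and a categorical one. The sketch below is only an illustration: the row count `n` and the placeholder string `long_value` are assumptions, and the columns are built as standalone Series rather than inside the asker's df.

```python
import pandas as pd

n = 100_000                 # assumed row count, for illustration only
long_value = "hello" * 20   # stand-in for the "much longer string"

# Plain column: one string entry per row
plain = pd.Series([long_value] * n)

# Categorical column built from a generator, as in the line above
cat = pd.Series(pd.Categorical(long_value for _ in range(n)))

# Shallow usage: bytes held by each column's own arrays (index excluded)
plain_bytes = plain.memory_usage(index=False)
cat_bytes = cat.memory_usage(index=False)

print(plain.dtype, plain_bytes)
print(cat.dtype, cat_bytes)
```

The exact numbers depend on the pandas version and its default string dtype, but the categorical column only needs one small integer code per row plus the single category. Note that pd.Categorical still has to consume the whole generator before it can build those codes.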
Answered By: Matteo Felici

We can explicitly build the Series of the correct size and type instead of implicitly doing so via __setitem__ then converting:

df['col'] = pd.Series('hello', index=df.index, dtype='category')

Sample Program:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

df['col'] = pd.Series('hello', index=df.index, dtype='category')

print(df)
print(df.dtypes)
print(df['col'].cat.categories)

Output:

   a    col
0  1  hello
1  2  hello
2  3  hello

a         int64
col    category
dtype: object

Index(['hello'], dtype='object')
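
If the goal is to avoid building any full-length column of strings at all, a swapped-in alternative to the Series constructor is `pd.Categorical.from_codes`, which takes integer codes plus a list of categories. A minimal sketch, assuming the same three-row example frame; the `codes` array and its `int8` dtype are choices made here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})  # example frame, as in the sample program

# Every row gets code 0, which refers to the first (and only) category
codes = np.zeros(len(df), dtype="int8")
df["col"] = pd.Categorical.from_codes(codes, categories=["hello"])

print(df["col"].dtype)
print(df["col"].cat.categories)
```

Here the only full-length array created is the array of integer codes; the string itself appears once, in the categories.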
Answered By: Henry Ecker

A simple way to do this is to create the new column with df.assign, then convert it to category with df.astype, passing a dictionary that maps the specific columns to their dtypes.

df = df.assign(col="hello").astype({'col':'category'})

df.dtypes

Output:

A         int64
col    category
dtype: object

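That way you don’t have to construct a Series yourself; the string is broadcast to every row. Note that assign returns a new DataFrame, and the broadcast column exists as a plain string column until astype converts it, so this does not address the intermediate-memory concern from the question.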
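
To make the behaviour of this one-liner concrete, the sketch below checks that assign leaves the original frame untouched and that the new column ends up categorical. The three-row frame and the name `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})  # small example frame

# assign returns a new DataFrame; astype then converts only the listed column
result = df.assign(col="hello").astype({"col": "category"})

print("col" in df.columns)             # the original frame has no new column
print(result["col"].dtype)             # dtype of the new column
print(result["col"].cat.categories)    # its single category
```

Reassigning the result to df, as in the line above, is what makes the change visible on the original name.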


This approach also extends to several columns at once. assign accepts multiple new columns, including ones computed by functions of existing or previously assigned columns, and astype can then set whichever dtypes you need.

df = pd.DataFrame({'A':[1,2,3,4]})

df = (df.assign(col1 = 'hello',                    #Define column based on series or broadcasting
                col2 = lambda x:x['A']**2,         #Define column based on existing columns
                col3 = lambda x:x['col2']/x['A'])  #Define column based on previously defined columns
        .astype({'col1':'category',
                 'col2':'float'}))

print(df)
print(df.dtypes)

Output:

   A   col1  col2  col3
0  1  hello   1.0   1.0
1  2  hello   4.0   2.0
2  3  hello   9.0   3.0
3  4  hello  16.0   4.0


A          int64
col1    category  #<-changed dtype
col2     float64  #<-changed dtype
col3     float64
dtype: object
Answered By: Akshay Sehgal