Retaining categorical dtype upon dataframe concatenation

Question:

I have two dataframes with identical column names and dtypes, similar to the following:

A             object
B             category
C             category

The categories are not identical in each of the dataframes.

When I concatenate them the usual way, pandas outputs:

A             object
B             object
C             object

Which is the expected behaviour as per the documentation.

However, I want to keep the categorical dtype and take the union of the categories, so I tried union_categoricals on the columns that are categorical in both dataframes. cdf and df are my two dataframes.

for column in df:
    if df[column].dtype.name == "category" and cdf[column].dtype.name == "category":
        print (column)
        union_categoricals([cdf[column], df[column]], ignore_order=True)

cdf = pd.concat([cdf,df])

This is still not providing me with a categorical output.

Asked By: tom

||

Answers:

I don’t think this is completely obvious from the documentation, but you could do something like the following. Here’s some sample data:

import pandas as pd

df1 = pd.DataFrame({'x': pd.Categorical(['dog', 'cat'])})
df2 = pd.DataFrame({'x': pd.Categorical(['cat', 'rat'])})

Use union_categoricals to get consistent categories across the dataframes. Try df1.x.cat.codes if you need to convince yourself that this works.

from pandas.api.types import union_categoricals

uc = union_categoricals([df1.x,df2.x])
df1.x = pd.Categorical( df1.x, categories=uc.categories )
df2.x = pd.Categorical( df2.x, categories=uc.categories )
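To see why the alignment matters, here is a small sketch that compares the integer codes before and after the step above. It rebuilds the sample frames so it runs on its own; the names before1, before2, a1 and a2 are only for illustration.

```python
import pandas as pd
from pandas.api.types import union_categoricals

df1 = pd.DataFrame({'x': pd.Categorical(['dog', 'cat'])})
df2 = pd.DataFrame({'x': pd.Categorical(['cat', 'rat'])})

# Before aligning, each column has its own category list, so the same
# integer code can stand for different labels in the two frames.
before1 = df1.x.cat.codes.tolist()  # codes relative to ['cat', 'dog']
before2 = df2.x.cat.codes.tolist()  # codes relative to ['cat', 'rat']

uc = union_categoricals([df1.x, df2.x])
a1 = pd.Categorical(df1.x, categories=uc.categories)
a2 = pd.Categorical(df2.x, categories=uc.categories)

# After aligning, both share one category list, so codes are comparable.
print(before1, before2)                    # [1, 0] [0, 1]
print(a1.codes.tolist(), a2.codes.tolist())  # [1, 0] [0, 2]
```

Before the alignment, code 1 means 'dog' in df1 but 'rat' in df2, which is why pandas cannot simply stack the two columns as one categorical.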

Concatenate and verify the dtype is categorical.

df3 = pd.concat([df1,df2])

df3.x.dtypes
category

As @C8H10N4O2 suggests, you could also just coerce from objects back to categoricals after concatenating. Honestly, for smaller datasets I think that’s the best way to do it just because it’s simpler. But for larger dataframes, using union_categoricals should be much more memory efficient.
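The simpler route could look like the following sketch: concatenate first, accept the temporary non-categorical column, then cast it back. The ignore_index=True argument and the variable intermediate_dtype are my additions for illustration, not part of the original suggestion.

```python
import pandas as pd

df1 = pd.DataFrame({'x': pd.Categorical(['dog', 'cat'])})
df2 = pd.DataFrame({'x': pd.Categorical(['cat', 'rat'])})

# Mismatched categories make concat fall back to a non-categorical dtype
# (object in the pandas versions discussed in this thread).
df3 = pd.concat([df1, df2], ignore_index=True)
intermediate_dtype = df3['x'].dtype

# Cast the column back to category afterwards.
df3['x'] = df3['x'].astype('category')
print(intermediate_dtype, df3['x'].dtype)
```

The cost is that the whole column briefly exists in its non-categorical form, which is the memory overhead mentioned above.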

Answered By: JohnE

JohnE’s answer is helpful, but in pandas 0.19.2, union_categoricals can only be imported as follows:

from pandas.types.concat import union_categoricals
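If the same script has to run on both old and new installations, one option is an import fallback. This is only a sketch: it assumes the two module paths quoted in this thread, and I have not checked which release first exposed the function under pandas.api.types.

```python
try:
    # Public location used in the other answers
    from pandas.api.types import union_categoricals
except ImportError:
    # Older location reported above for pandas 0.19.2
    from pandas.types.concat import union_categoricals
```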

Answered By: kai

To complement JohnE’s answer, here’s a function that does the job: it converts every column that is categorical in all input dataframes to the union of its categories:

def concatenate(dfs):
    """Concatenate while preserving categorical columns.

    NB: We change the categories in-place for the input dataframes"""
    from pandas.api.types import union_categoricals
    import pandas as pd
    # Iterate on categorical columns common to all dfs
    for col in set.intersection(
        *[
            set(df.select_dtypes(include='category').columns)
            for df in dfs
        ]
    ):
        # Generate the union category across dfs for this column
        uc = union_categoricals([df[col] for df in dfs])
        # Change to union category for all dataframes
        for df in dfs:
            df[col] = pd.Categorical(df[col].values, categories=uc.categories)
    return pd.concat(dfs)

Note that the categories are changed in place on the dataframes in the input list:

df1=pd.DataFrame({'a': [1, 2],
                  'x':pd.Categorical(['dog','cat']),
                  'y': pd.Categorical(['banana', 'bread'])})
df2=pd.DataFrame({'x':pd.Categorical(['rat']),
                  'y': pd.Categorical(['apple'])})

concatenate([df1, df2]).dtypes
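If mutating the inputs is not acceptable, a variant that works on copies might look like this. It is a sketch, not part of the original function: the name concatenate_copy is hypothetical, and the deep copies trade extra memory for leaving the caller's dataframes untouched.

```python
import pandas as pd
from pandas.api.types import union_categoricals


def concatenate_copy(dfs):
    """Concatenate while preserving categorical columns, without
    modifying the input dataframes (hypothetical variant)."""
    # Work on copies so the caller's dataframes keep their categories
    dfs = [df.copy() for df in dfs]
    # Columns that are categorical in every dataframe
    common = set.intersection(
        *[set(df.select_dtypes(include='category').columns) for df in dfs]
    )
    for col in common:
        uc = union_categoricals([df[col] for df in dfs])
        for df in dfs:
            df[col] = pd.Categorical(df[col].values, categories=uc.categories)
    return pd.concat(dfs)


df1 = pd.DataFrame({'a': [1, 2],
                    'x': pd.Categorical(['dog', 'cat']),
                    'y': pd.Categorical(['banana', 'bread'])})
df2 = pd.DataFrame({'x': pd.Categorical(['rat']),
                    'y': pd.Categorical(['apple'])})

result = concatenate_copy([df1, df2])
print(result.dtypes)
```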
Answered By: Tom Bug

All other answers use union_categoricals to get the combined categories of both dataframes. Because union_categoricals also concatenates the values of both series, and that result is then discarded in favor of the subsequent pd.concat, these answers add significant overhead.

It’s also possible to just create a union over the categories:

for col in (
    # intersection of columns that are categorical in both dataframes
    df1.select_dtypes(include="category").columns.intersection(
        df2.select_dtypes(include="category").columns
    )
):
    # union of the categories in both dataframes' columns
    # (Index.union / Index.intersection are used instead of | and &,
    # which no longer act as set operations on an Index in pandas 2.0+)
    all_cats = df1[col].cat.categories.union(df2[col].cat.categories)
    df1[col] = df1[col].cat.set_categories(all_cats)
    df2[col] = df2[col].cat.set_categories(all_cats)

I’ve tested this with unordered categories only. union_categoricals also covers ordering, for which it might be better suited.
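For ordered categoricals, here is a small sketch of how union_categoricals behaves when the categories differ. The sample values and the names a, b, u and raised are made up for illustration.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(['low', 'high'], categories=['low', 'high'], ordered=True)
b = pd.Categorical(['mid'], categories=['mid'], ordered=True)

try:
    union_categoricals([a, b])
    raised = False
except TypeError:
    # Ordered categoricals with differing categories are rejected by default
    raised = True

# ignore_order=True allows the union but returns an unordered result
u = union_categoricals([a, b], ignore_order=True)
print(raised, u.ordered, list(u.categories))
```

So for ordered data with differing categories, the ordering is either enforced with an error or dropped. A meaningful combined order still has to be supplied by hand.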

Using the example from JohnE’s answer:

>>> df1=pd.DataFrame({'x':pd.Categorical(['dog','cat'])})
... df1.x.dtype
CategoricalDtype(categories=['cat', 'dog'], ordered=False)

>>> df2=pd.DataFrame({'x':pd.Categorical(['cat','rat'])})
... df2.x.dtype
CategoricalDtype(categories=['cat', 'rat'], ordered=False)

>>> for col in (
...     # intersection of columns that are categorical in both dataframes
...     df1.select_dtypes(include="category").columns.intersection(
...         df2.select_dtypes(include="category").columns
...     )
... ):
...     # union of the categories in both dataframes' columns
...     all_cats = df1[col].cat.categories.union(df2[col].cat.categories)
...     df1[col] = df1[col].cat.set_categories(all_cats)
...     df2[col] = df2[col].cat.set_categories(all_cats)

>>> df1.x.dtype
CategoricalDtype(categories=['cat', 'dog', 'rat'], ordered=False)

>>> df2.x.dtype
CategoricalDtype(categories=['cat', 'dog', 'rat'], ordered=False)

>>> df3 = pd.concat([df1, df2])
... df3.x.dtype
CategoricalDtype(categories=['cat', 'dog', 'rat'], ordered=False)
Answered By: Valentin Kuhn