How can I automatically detect if a column is categorical?

Question:

I want to detect which columns of a pandas DataFrame are categorical. I can get each column's dtype, but I’m struggling to work out which columns are categories.

import pandas as pd

titanic_df = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv')

#ID datatype

def idDataTypes(inputDataFrame):
    columnTypesDict = {} 
    import numpy as np
    import numbers
    import pandas as pd
    from pandas.api.types import is_string_dtype
    from pandas.api.types import is_numeric_dtype

    for columns in inputDataFrame.columns.values:
        #print(columns)
        #try to convert to number. If it doesn't work it will convert to another type
        try:
            inputDataFrame[columns] = pd.to_numeric(inputDataFrame[columns], errors='ignore').apply(lambda x: x + 1 if isinstance(x, numbers.Number) else x) 
        except:
            print(columns, " cannot convert.")
        #print(inputDataFrame[columns].dtype)

        #create dictionary with the label
        if is_numeric_dtype(inputDataFrame[columns]): #products[columns].dtype == np.float64:
            columnTypesDict[columns] = "numeric"
        elif is_string_dtype(inputDataFrame[columns]): # products[columns].dtype == np.object:
            columnTypesDict[columns] = "string"
            #print(is_string_dtype(products[columns]))
        else:
            print("something else", inputDataFrame[columns].dtype)

    #category 
    cols = inputDataFrame.columns
    num_cols = inputDataFrame._get_numeric_data().columns
    #num_cols
    proposedCategory = list(set(cols) - set(num_cols))
    for value in proposedCategory:
        columnTypesDict[value] = "category"

    return(columnTypesDict)

idDataTypes(titanic_df)

The results I’m getting are not what I expect:

{'pclass': 'numeric',
 'survived': 'numeric',
 'name': 'category',
 'sex': 'category',
 'age': 'numeric',
 'sibsp': 'numeric',
 'parch': 'numeric',
 'ticket': 'category',
 'fare': 'numeric',
 'cabin': 'category',
 'embarked': 'category',
 'boat': 'category',
 'body': 'numeric',
 'home.dest': 'category'}

pclass should be a category and name should not be.

I’m not sure how to assess if something is a category or not. Any ideas?

Asked By: Lostsoul


Answers:

Here’s the bug in your code:

proposedCategory = list(set(cols) - set(num_cols))

Every column that is not numeric gets labeled as a category, which is why name ends up as one and pclass does not.
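To see the effect on a small scale, here is a minimal sketch with a made-up three-column frame (the column names and values are illustrative, not taken from the Titanic file). It uses the public select_dtypes in place of the private _get_numeric_data, which is a substitution on my part:

```python
import pandas as pd

# Hypothetical toy frame: a numeric class code, a free-text column, a label column.
toy_df = pd.DataFrame({
    "pclass": [1, 2, 3, 1],
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "sex": ["f", "m", "m", "f"],
})

# Same logic as the line above: everything that is not numeric is "proposed".
num_cols = toy_df.select_dtypes(include="number").columns
proposed_category = sorted(set(toy_df.columns) - set(num_cols))
print(proposed_category)
```

Both name and sex come out as proposed categories, while pclass is excluded purely because it is stored as a number.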


There is no single right way to do this, since whether a column is categorical is best decided manually, with knowledge of the data the column contains. If you want to do it automatically, one heuristic is to count the number of unique values in the column. If there are relatively few unique values compared to the number of rows, the column is likely categorical.

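#category
for name, column in inputDataFrame.items():  # iteritems() was removed in pandas 2.0
    unique_count = column.unique().shape[0]
    total_count = column.shape[0]
    # few distinct values relative to the row count suggests a category
    if unique_count / total_count < 0.05:
        columnTypesDict[name] = 'category'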

The 5% threshold is arbitrary. No column will be identified as categorical if your dataframe has 20 rows or fewer, because even a single unique value gives a ratio of at least 0.05. For best results, you will have to adjust that ratio for small and large dataframes.
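If you want to experiment with the threshold, the same heuristic can be wrapped in a function. This is only a sketch: the function name guess_categorical, the max_ratio parameter, and the demo data are made up for illustration.

```python
import pandas as pd

def guess_categorical(df, max_ratio=0.05):
    """Return the names of columns whose unique/total ratio is below max_ratio."""
    n_rows = len(df)
    if n_rows == 0:
        return []  # nothing to measure on an empty frame
    guessed = []
    for name, column in df.items():
        # nunique() ignores missing values by default
        if column.nunique() / n_rows < max_ratio:
            guessed.append(name)
    return guessed

# Hypothetical demo frame with 100 rows.
demo = pd.DataFrame({
    "pclass": [1, 2, 3, 1] * 25,               # 3 distinct values
    "ticket": [f"T{i}" for i in range(100)],   # all distinct
    "fare": [i * 0.5 for i in range(100)],     # all distinct
})

print(guess_categorical(demo))
print(guess_categorical(demo.head(10)))
print(guess_categorical(demo.head(10), max_ratio=0.5))
```

With 100 rows only pclass falls under 5%. On the first 10 rows nothing does, which is the small-dataframe problem described above, and loosening max_ratio to 0.5 picks pclass up again.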

Answered By: Code Different

One quick (and lazy) workaround I’ve found is to use the pandas .corr() method to filter out the numerical columns for you. In my experience, when .corr() is applied to the entire dataframe, the pairwise correlations it returns cover only the numerical columns. Hence you can do a linear search over the columns of your original dataframe and treat any column that is not in the dataframe returned by .corr() as categorical. This might not be 100% effective, but it does the job most of the time.

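# numeric_only=True is needed on pandas 2.0+, where corr() no longer drops non-numeric columns silently
corr_df = df.corr(numeric_only=True) #returns a dataframe
num_cols = corr_df.columns
cat_cols = [col for col in df.columns if col not in num_cols]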

PS: This might be a bit time- and memory-intensive if the dataset contains a lot of columns.
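If the cost of computing correlations is a concern, the same split can probably be obtained from the dtypes alone. The sketch below swaps .corr() for select_dtypes, which is a different technique from the one above, and the small frame is made up for illustration:

```python
import pandas as pd

# Hypothetical frame standing in for the real dataset.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "sex": ["m", "f", "f"],
    "survived": [0, 1, 1],
})

# Pick numeric columns by dtype, without computing any correlations.
num_cols = df.select_dtypes(include="number").columns
cat_cols = [col for col in df.columns if col not in num_cols]
print(cat_cols)
```

Like the .corr() version, this only separates numeric from non-numeric columns, so a numerically coded category such as survived is still counted as numeric.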

Answered By: Ayan