Python – Filtering conditions efficiently

Question:

I have this data:

df = pd.DataFrame({'date': {0: '2012-03-02',  1: '2012-04-01',  2: '2012-04-12',  3: '2012-04-14',  4: '2012-04-21',  5: '2012-05-12',  6: '2012-06-23',  7: '2012-06-25',  8: '2012-06-26',  9: '2012-06-27'},
                   'type': {0: 'Holiday',  1: 'Other',  2: 'Event',  3: 'Holiday',  4: 'Event',  5: 'Holiday',  6: 'Other',  7: 'Holiday',  8: 'Holiday',  9: 'Event'},
                   'locale': {0: 'Local',  1: 'Regional',  2: 'National',  3: 'Local',  4: 'National',  5: 'National',  6: 'National',  7: 'Regional',  8: 'Regional',  9: 'Local'}})

And this function:

def preprocess(data_input):
    data = data_input.copy()
    
    conditions = [
        (data['type'] ==  'Holiday') & (data.locale == 'National'),
        (data['type'] ==  'Holiday') & (data.locale == 'Regional'),
        (data['type'] ==  'Holiday') & (data.locale == 'Local')
        ]
    values = [3,2,1]

    data.loc[:,'Holiday'] = np.select(conditions, values, default=0)
    
    
    conditions = [
        (data['type'] ==  'Event') & (data.locale == 'National'),
        (data['type'] ==  'Event') & (data.locale == 'Regional'),
        (data['type'] ==  'Event') & (data.locale == 'Local')
        ]
    values = [3,2,1]

    data.loc[:,'Events'] = np.select(conditions, values, default=0)

    return data

When I run this function on df I get:

preprocess(df)

         date     type    locale  Holiday  Events
0  2012-03-02  Holiday     Local        1       0
1  2012-04-01    Other  Regional        0       0
2  2012-04-12    Event  National        0       3
3  2012-04-14  Holiday     Local        1       0
4  2012-04-21    Event  National        0       3
5  2012-05-12  Holiday  National        3       0
6  2012-06-23    Other  National        0       0
7  2012-06-25  Holiday  Regional        2       0
8  2012-06-26  Holiday  Regional        2       0
9  2012-06-27    Event     Local        0       1

Which is the expected result. However, I’m wondering about efficiency and best practices. I feel that both condition blocks take too many lines of code and could be written more elegantly and efficiently, but I can’t figure out how. I tried np.where(), but I struggled when I realized I would have to filter the data frame in each argument.

Any suggestions?

Asked By: Chris


Answers:

Overall I think your code looks fine: it’s readable and easy to understand. In terms of efficiency, you’re using boolean masks for your conditions and not committing any pandas performance anti-pattern that I can see.

One major-ish pandas issue is the redundant call to copy(). This isn’t cheap: it creates a deep copy of your whole dataframe. It’s also not necessary; you can return the modified df rather than making a fresh copy every time. This is standard style in pandas, and it’s why pandas operations usually look like df = df.do_a_thing(): the new value of df is returned by do_a_thing() rather than modified in place. (The inplace flag exists for many methods, but it isn’t the default for a reason.)
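For illustration, a minimal sketch of that pattern (add_flag is a hypothetical helper, operating on the question’s df):

def add_flag(df):
    # Modify the frame that was passed in and return it; no defensive copy
    df['is_holiday'] = df['type'].eq('Holiday')
    return df

df = add_flag(df)  # the caller rebinds the name, standard pandas style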

One nitpick: be explicit about categorical variables. Your values variable is a categorical that’s derived conditionally from the dataframe, yet you treat it as a series of integers.

In terms of general software engineering, there is one clear area of improvement: your preprocess method duplicates code for a single purpose. I have refactored it so that preprocess takes the type_column (e.g. ‘Holiday’ or ‘Event’) as a parameter and applies the transformation from there. Now if you find an issue with your preprocess logic, or want to change it, you only need to do so in one place rather than two.

df = pd.DataFrame({'date': {0: '2012-03-02',  1: '2012-04-01',  2: '2012-04-12',  3: '2012-04-14',  4: '2012-04-21',  5: '2012-05-12',  6: '2012-06-23',  7: '2012-06-25',  8: '2012-06-26',  9: '2012-06-27'},
                   'type': {0: 'Holiday',  1: 'Other',  2: 'Event',  3: 'Holiday',  4: 'Event',  5: 'Holiday',  6: 'Other',  7: 'Holiday',  8: 'Holiday',  9: 'Event'},
                   'locale': {0: 'Local',  1: 'Regional',  2: 'National',  3: 'Local',  4: 'National',  5: 'National',  6: 'National',  7: 'Regional',  8: 'Regional',  9: 'Local'}})


def preprocess(data, type_column):
    conditions = [
        (data['type'] == type_column) & (data.locale == 'National'),
        (data['type'] == type_column) & (data.locale == 'Regional'),
        (data['type'] == type_column) & (data.locale == 'Local')
    ]
    # These values are categorical variables, (i.e. it's dependent on the combination of two columns).
    # Better to be explicit about what type they are rather than leaving them as ints.
    values = pd.Categorical([3, 2, 1])

    data.loc[:, type_column] = np.select(conditions, values, default=0)

    return data


df = preprocess(df, 'Holiday')
df = preprocess(df, 'Event')
print(df)
Answered By: Cold Fish

You can try Series.map with Series.where:

# Note: the mapped values are strings here; use 3, 2, 1 if you want numeric columns
locale_map = {'National': '3',
              'Regional': '2',
              'Local': '1'}

df['Holiday'] = df['locale'].map(locale_map).where(df['type'].eq('Holiday'), 0)
df['Events'] = df['locale'].map(locale_map).where(df['type'].eq('Event'), 0)
print(df)

         date     type    locale Holiday Events
0  2012-03-02  Holiday     Local       1      0
1  2012-04-01    Other  Regional       0      0
2  2012-04-12    Event  National       0      3
3  2012-04-14  Holiday     Local       1      0
4  2012-04-21    Event  National       0      3
5  2012-05-12  Holiday  National       3      0
6  2012-06-23    Other  National       0      0
7  2012-06-25  Holiday  Regional       2      0
8  2012-06-26  Holiday  Regional       2      0
9  2012-06-27    Event     Local       0      1
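
If more type/column pairs show up later, the duplication can be avoided with a loop; a sketch reusing the same locale_map:

for col, kind in [('Holiday', 'Holiday'), ('Events', 'Event')]:
    df[col] = df['locale'].map(locale_map).where(df['type'].eq(kind), 0)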
Answered By: Ynjxsjmh

A two-part answer: how to apply the transformation, and how to improve memory use with the right pandas dtype.

Data transform:

A one-liner using assign and a dict for mapping the values. This approach also avoids SettingWithCopyWarning.

r_map = {
        "National": "3",
        "Regional": "2",
        "Local": "1",
    }


df = df.assign(
    Holiday=df.locale.where(df.type == "Holiday", 0).replace(r_map),
    Events=df.locale.where(df.type == "Event", 0).replace(r_map),
)
         date     type    locale Holiday Events
0  2012-03-02  Holiday     Local       1      0
1  2012-04-01    Other  Regional       0      0
2  2012-04-12    Event  National       0      3
3  2012-04-14  Holiday     Local       1      0
4  2012-04-21    Event  National       0      3
5  2012-05-12  Holiday  National       3      0
6  2012-06-23    Other  National       0      0
7  2012-06-25  Holiday  Regional       2      0
8  2012-06-26  Holiday  Regional       2      0
9  2012-06-27    Event     Local       0      1

Memory Efficiency:

Using an appropriate pandas dtype for columns of repeated strings can save a large amount of memory. Pandas defaults to rather high precision and generality when inferring dtypes for a newly created dataframe:

s_float = pd.Series(0.001).dtype
> dtype('float64')

Same with textual data in columns:

s_text = pd.Series("Thy").dtype
> dtype('O')
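
For comparison, a dtype can also be requested explicitly at construction time; a small illustration:

pd.Series([0.001], dtype="float32").dtype   # dtype('float32')
pd.Series(["Thy"], dtype="category").dtype  # CategoricalDtype(categories=['Thy'], ordered=False)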

The object dtype is a very general container for pretty much anything we like and isn’t very optimized. In particular, for the example in the OP’s post, where several columns hold mostly repeated values, we can gain a lot from pandas’ category dtype. In the example below, based on the data from the OP’s post, the date, type, and locale columns take roughly 8x the memory as object compared to category.

# Build a 99,999-row frame, then fill each column with random draws
# (with replacement) from the original frame's values
df_large = pd.DataFrame(df.iloc[[0]]).reindex(range(99999), method="ffill")

for col in df_large.columns:
    df_large[col] = df[col].sample(n=len(df_large), replace=True).to_numpy()
print("Without category: ")
print(df_large.memory_usage().iloc[1:])
print(df_large.dtypes)

for col in df_large.columns:
    if isinstance(df[col].iloc[0], str):
        df_large[col] = df_large[col].astype("category")

print("       ")
print("With category: ")
print(df_large.memory_usage().iloc[1:])
print(df_large.dtypes)

Without category: 
date      799992
type      799992
locale    799992
dtype: int64
date      object
type      object
locale    object
dtype: object
       
With category: 
date      100379
type      100131
locale    100131
dtype: int64
date      category
type      category
locale    category
dtype: object
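
One caveat worth adding: by default memory_usage() counts only the 8-byte pointers for object columns, not the strings they point to, so the real savings from category are even larger than the numbers above suggest. Pass deep=True to measure the full footprint:

print(df_large.memory_usage(deep=True).iloc[1:])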
Answered By: ivanp