feature crossing in pandas

Question:

I have two columns in a pandas DataFrame:

col_A     col_B
 0         1
 0         0
 0         1
 0         1
 1         0
 1         0
 1         1

I want to create a new column for each combination of values of col_A and col_B, similar to get_dummies(), except that each new column encodes a combination of columns rather than a single column.

Example output: this column is 1 where col_A is 0 and col_B is 1, and 0 otherwise:

col_A_0_col_B_1

   1
   0
   1
   1
   0
   0
   0

I am currently using iterrows() to iterate through every row, check the values, and then set the new column.

Is there a shorter, more idiomatic pandas approach to achieve this?

Asked By: data_person


Answers:

First create your column and assign it a default value, e.g. 0 for False:

df['col_A_0_col_B_1'] = 0

Then use loc to select the rows where col_A == 0 and col_B == 1, and assign 1 to the new column:
df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1

Answered By: Pureluck

If I understood correctly, you could do something like this:

import pandas as pd
data = [[0, 1],
        [0, 0],
        [0, 1],
        [0, 1],
        [1, 0],
        [1, 0],
        [1, 1]]

df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
print(df)

Output

   col_A  col_B  col_A_0_col_B_1
0      0      1                1
1      0      0                0
2      0      1                1
3      0      1                1
4      1      0                0
5      1      0                0
6      1      1                0

Or, as an alternative:

df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
print(df)
Answered By: Dani Mesejo

You can use np.where:

import numpy as np
df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)
Answered By: Sociopath

Convert chained boolean masks to integers:

df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)

For better performance:

df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
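The same mask can be built for every value pair rather than just (0, 1). A minimal sketch, assuming the sample data from the question; the loop and the column naming pattern are illustrative and not part of the answer itself:

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({'col_A': [0, 0, 0, 0, 1, 1, 1],
                   'col_B': [1, 0, 1, 1, 0, 0, 1]})

# One indicator column per (col_A, col_B) value pair
pairs = product(sorted(df['col_A'].unique()), sorted(df['col_B'].unique()))
for a, b in pairs:
    name = f'col_A_{a}_col_B_{b}'
    df[name] = ((df['col_A'] == a) & (df['col_B'] == b)).astype(int)

print(df)
```

Each row ends up with exactly one of the new columns set to 1.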

Performance (depends on the number of rows and on the distribution of 0 and 1 values):

np.random.seed(343)
#10k rows
df = pd.DataFrame(np.random.choice([0,1], size=(10000, 2)), columns=['col_A','col_B'])
#print (df)

In [92]: %%timeit
    ...: df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
    ...: 
870 µs ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [93]: %%timeit
    ...: df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
    ...: 
201 µs ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [94]: %%timeit
    ...: df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
    ...: 
833 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [95]: %%timeit
    ...: df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)
    ...: 
956 µs ± 242 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [96]: %%timeit
    ...: df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
    ...: 
1.61 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [97]: %%timeit
    ...: df['col_A_0_col_B_1'] = 0
    ...: df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1
    ...: 
3.07 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
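The %%timeit cells above need IPython. A rough sketch for reproducing part of the comparison in a plain script with the standard timeit module; the helper names mask_series and mask_values are hypothetical, and absolute numbers will differ by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(343)
df = pd.DataFrame(np.random.choice([0, 1], size=(10000, 2)),
                  columns=['col_A', 'col_B'])


def mask_series(df):
    # Comparison on pandas Series
    return ((df['col_A'] == 0) & (df['col_B'] == 1)).astype(int)


def mask_values(df):
    # Comparison on the underlying NumPy arrays
    return ((df['col_A'].values == 0) & (df['col_B'].values == 1)).astype(int)


for func in (mask_series, mask_values):
    seconds = timeit.timeit(lambda: func(df), number=100)
    print(f'{func.__name__}: {seconds / 100 * 1e6:.1f} µs per call')
```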
Answered By: jezrael

You can use the ~ operator (NOT) together with &, relying on the columns holding only 1 and 0 as true and false.

df['col_A_0_col_B_1'] = ~df['col_A'] & df['col_B']
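On integer columns ~ is a bitwise NOT, so ~0 is -1 and ~1 is -2; ANDing with a 0/1 column keeps only the lowest bit, which is why the result is still 0 or 1. A small sketch, assuming the columns hold only 0 and 1, that compares this shortcut with the explicit comparison:

```python
import pandas as pd

df = pd.DataFrame({'col_A': [0, 0, 0, 0, 1, 1, 1],
                   'col_B': [1, 0, 1, 1, 0, 0, 1]})

# Bitwise NOT on integers: ~0 == -1, ~1 == -2
bitwise = ~df['col_A'] & df['col_B']

# Explicit comparison, which also works for values other than 0 and 1
explicit = ((df['col_A'] == 0) & (df['col_B'] == 1)).astype(int)

print(bitwise.tolist())
print(explicit.tolist())
```

For columns with more than two categories the explicit comparison is the safer choice, since the bitwise shortcut only lines up with the intended logic for 0/1 data.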
Answered By: T Burgis

I was looking for something in pandas similar to the TensorFlow "crossed_column" used in the Google introduction to ML course and couldn't find one. The function below adds one-hot encoded feature crosses to a dataframe. The selected columns must already be ordinal encoded / factorized.

from itertools import product

import numpy as np
import pandas as pd


def cross_category_features(
    df: pd.DataFrame,
    cross: list[str],
    remove_originals: bool = True
) -> pd.DataFrame:
    """
    Add one-hot encoded feature crosses to df based on the columns in cross.
    The columns must have already been factorized / ordinal encoded.

    :param df: The data to add feature crosses to
    :param cross: The columns to cross. Columns must be int categorical 0 to n-1
    :param remove_originals: If True, remove the original columns from the data

    :return: The data with the feature crosses added
    """
    def set_hot_index(row):
        # Mixed-radix index of this row's value combination
        hot_index = int((row[cross] * offsets).sum())
        # Positional assignment: the cross columns follow the original ones
        row.iloc[hot_index + org_col_len] = 1
        return row

    org_col_len = df.shape[1]
    str_values = [[col + str(val) for val in sorted(df[col].unique())]
                  for col in cross]
    cross_names = ["_".join(x) for x in product(*str_values)]

    # Reuse df's index so that concat aligns rows correctly
    cross_features = pd.DataFrame(
        data=np.zeros((df.shape[0], len(cross_names))),
        columns=cross_names,
        index=df.index,
        dtype="int64")
    df = pd.concat([df, cross_features], axis=1)

    max_vals = df[cross].max(axis=0) + 1
    offsets = [int(np.prod(max_vals.iloc[i+1:])) for i in range(len(max_vals))]
    # apply returns a new frame, so the result must be assigned back
    df = df.apply(set_hot_index, axis=1)

    if remove_originals:
        df = df.drop(columns=cross)

    return df
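For comparison, a vectorized sketch of the same idea that avoids the row-wise apply: build a combined string key per row and pass it to pd.get_dummies. This is an alternative technique, not the function above. The helper name cross_with_get_dummies and its naming pattern are hypothetical, and only combinations that actually occur in the data get a column:

```python
import pandas as pd


def cross_with_get_dummies(df: pd.DataFrame, cross: list) -> pd.DataFrame:
    """Return one-hot columns for each observed combination of `cross`."""
    # Build a key such as "col_A_0_col_B_1" for every row
    key = None
    for col in cross:
        part = col + '_' + df[col].astype(str)
        key = part if key is None else key + '_' + part
    # One indicator column per distinct key
    return pd.get_dummies(key, dtype=int)


df = pd.DataFrame({'col_A': [0, 0, 0, 0, 1, 1, 1],
                   'col_B': [1, 0, 1, 1, 0, 0, 1]})
crossed = cross_with_get_dummies(df, ['col_A', 'col_B'])
print(pd.concat([df, crossed], axis=1))
```

Unlike the function above, this does not require the columns to be encoded as 0 to n-1, since the key is built from the string form of each value.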

Answered By: Tom Fuller