Ignoring an invalid filter among multiple filters on a DataFrame

Question:

Problem Statement:

I have a DataFrame that has to be filtered with multiple conditions.

Each condition is optional: if the user enters an invalid value for a certain condition, that condition can be skipped completely, falling back to the DataFrame without that specific condition.

While I can implement this quite easily with multiple if-conditions, modifying the DataFrame sequentially, I am looking for something more elegant and scalable (as the number of input parameters grows), preferably using built-in pandas functionality.

Reproducible Example

Dummy dataframe –

import pandas as pd

df = pd.DataFrame({'One':['a','a','a','b'],
                   'Two':['x','y','y','y'],
                   'Three':['l','m','m','l']})

print(df)
  One Two Three
0   a   x     l
1   a   y     m
2   a   y     m
3   b   y     l

Let’s say that invalid values are the values that don’t belong to the respective column. So, for column ‘One’, all values other than ‘a’ and ‘b’ are invalid. If the user inputs ‘a’, then I should be able to filter the DataFrame with df[df['One']=='a']; however, if the user inputs any invalid value, no such filter should be applied, and the original DataFrame df is returned.

My attempt (with multiple parameters):

def valid_filtering(df, inp):
    if inp[0] in df['One'].values:
        df = df[df['One']==inp[0]]

    if inp[1] in df['Two'].values:
        df = df[df['Two']==inp[1]]

    if inp[2] in df['Three'].values:
        df = df[df['Three']==inp[2]]
        
    return df

With all valid inputs –

inp = ['a','y','m']             #<- all filters valid so df is filtered before returning
print(valid_filtering(df, inp))
  One Two Three
1   a   y     m
2   a   y     m

With a few invalid inputs –

inp = ['a','NA','NA']           #<- only first filter is valid, so other 2 filters are ignored
print(valid_filtering(df, inp))
  One Two Three
0   a   x     l
1   a   y     m
2   a   y     m

P.S. Additional question – is there a way to get DataFrame indexing to behave as –

df[df['One']=='valid'] -> returns filtered df

df[df['One']=='invalid'] -> returns original df

Because this would help me rewrite my filtering –

df[(df['One']=='valid') & (df['Two']=='invalid') & (df['Three']=='valid')] -> Filtered by col One and Three
Asked By: Akshay Sehgal


Answers:

Here is one way: build a Boolean DataFrame that marks where each value of inp matches its column. Then use any down each column to find the columns with at least one True, and all across those selected columns to keep the rows that match on every one of them.

def valid_filtering(df, inp):
    # check where the inp values are equal to the values in df
    # (inp is repeated on every row so both frames have the same shape)
    m = (df==pd.DataFrame(data=[inp]*len(df), index=df.index, columns=df.columns))
    # select the columns with at least one True
    cols = m.columns[m.any()]
    # select the rows that are all True among the wanted columns
    rows = m[cols].all(axis=1)
    # return df with the selected rows
    return df.loc[rows]

Note that if you have fewer filters than columns in your original df, you can use a dictionary instead. In the example below, the column Three is ignored because its mask is all False.

d = {'One': 'a', 'Two': 'y'}
m = (df==pd.DataFrame(d, index=df.index).reindex(columns=df.columns))
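For completeness, here is a minimal sketch of how the dictionary variant could be wrapped into a function. The name valid_filtering_dict and its filters parameter are made up for this illustration; the steps are the same any/all steps as above.

```python
import pandas as pd

def valid_filtering_dict(df, filters):
    """Hypothetical helper: `filters` maps a column name to the wanted value."""
    # Broadcast each wanted value down its column; columns without a filter become NaN
    wanted = pd.DataFrame(filters, index=df.index).reindex(columns=df.columns)
    m = df == wanted
    # Keep only the columns where the wanted value occurs at least once
    cols = m.columns[m.any()]
    # A row is kept when it matches on every one of those columns
    return df.loc[m[cols].all(axis=1)]

df = pd.DataFrame({'One': ['a', 'a', 'a', 'b'],
                   'Two': ['x', 'y', 'y', 'y'],
                   'Three': ['l', 'm', 'm', 'l']})

both_valid = valid_filtering_dict(df, {'One': 'a', 'Two': 'y'})
one_invalid = valid_filtering_dict(df, {'One': 'a', 'Two': 'NA'})
print(both_valid)    # filtered on One and Two
print(one_invalid)   # filtered on One only, 'NA' is ignored
```

Columns that are missing from the dictionary and columns with an invalid value are treated the same way: their mask is all False, so they never enter the final all.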
Answered By: Ben.T

The key is that if a column returns all False (~b.any(), i.e. an invalid filter), the mask for that column is set to True so that all of its values are accepted:

import numpy as np

mask = df.eq(inp).apply(lambda b: np.where(~b.any(), True, b))
out = df.loc[mask.all(axis="columns")]

Case 1: inp = ['a','y','m'] (with all valid inputs)

>>> out
  One Two Three
1   a   y     m
2   a   y     m

Case 2: inp = ['a','NA','NA'] (with a few invalid inputs)

>>> out
  One Two Three
0   a   x     l
1   a   y     m
2   a   y     m

Case 3: inp = ['NA','NA','NA'] (with no valid inputs)

>>> out
  One Two Three
0   a   x     l
1   a   y     m
2   a   y     m
3   b   y     l

Case 4: inp = ['b','x','m'] (with all valid inputs but no results)

>>> out
Empty DataFrame
Columns: [One, Two, Three]
Index: []

Of course, you can increase input parameters:

df["Four"] = ['i','j','k','k']
inp = ['a','NA','m','k']
>>> out
  One Two Three Four
2   a   y     m    k
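To re-run the cases above in one go, the mask can be wrapped in a small function. This is only a sketch: the name filter_ignoring_invalid is an assumption for this example, and np.where is called with b.any() as the condition (instead of negating it), which expresses the same idea.

```python
import numpy as np
import pandas as pd

def filter_ignoring_invalid(df, inp):
    """Hypothetical wrapper: one value in `inp` per column of `df`."""
    # Column-wise: keep the comparison if it has any match, otherwise accept everything
    mask = df.eq(inp).apply(lambda b: np.where(b.any(), b, True))
    return df.loc[mask.all(axis="columns")]

df = pd.DataFrame({'One': ['a', 'a', 'a', 'b'],
                   'Two': ['x', 'y', 'y', 'y'],
                   'Three': ['l', 'm', 'm', 'l']})

cases = [['a', 'y', 'm'], ['a', 'NA', 'NA'], ['NA', 'NA', 'NA'], ['b', 'x', 'm']]
results = {tuple(inp): list(filter_ignoring_invalid(df, inp).index) for inp in cases}
for inp, kept in results.items():
    print(inp, '->', kept)
```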
Answered By: Corralien

Another way with list comprehension:

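def valid_filtering(df, inp):
    # keep one boolean mask per column whose input value occurs in that column
    series = [df[column] == value
        for column, value in zip(df.columns, inp) if (df[column] == value).any()]
    # apply the masks one after another; .loc aligns each mask on the index
    for s in series:
        df = df.loc[s]
    return df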

Output of print(valid_filtering(df, ['a','NA','NA'])):

  One Two Three
0   a   x     l
1   a   y     m
2   a   y     m
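One detail worth checking: here validity is decided once against the original DataFrame, whereas the function in the question checks each value against the already filtered DataFrame. The sketch below shows where the two differ; the names sequential and upfront are made up for this comparison.

```python
import pandas as pd

df = pd.DataFrame({'One': ['a', 'a', 'a', 'b'],
                   'Two': ['x', 'y', 'y', 'y'],
                   'Three': ['l', 'm', 'm', 'l']})

def sequential(df, inp):
    # Validity is checked against the frame as filtered so far (as in the question)
    for column, value in zip(df.columns, inp):
        if value in df[column].values:
            df = df[df[column] == value]
    return df

def upfront(df, inp):
    # Validity is checked once, against the original frame
    masks = [df[c] == v for c, v in zip(df.columns, inp) if (df[c] == v).any()]
    for mask in masks:
        df = df.loc[mask]
    return df

inp = ['b', 'x', 'm']          # every value exists somewhere, but never in the same row
seq_out = sequential(df, inp)
up_out = upfront(df, inp)
print(seq_out)   # 'x' and 'm' are no longer present after filtering on 'b', so they are skipped
print(up_out)    # all three filters apply, so nothing is left
```

Which behaviour is wanted depends on whether "invalid" means "not in the column at all" or "not in what is left after the earlier filters".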


Answered By: Confused Learner

An updated solution inspired by the code and logic provided by Corralien and Ben.T:

df.loc[(df.eq(inp) | ~df.eq(inp).any(axis=0)).all(axis=1)]
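The same logic can also be written per column, which comes close to the indexing behaviour asked about in the P.S. of the question. This is a sketch only: eq_or_all is a hypothetical helper, not a pandas function.

```python
import pandas as pd

def eq_or_all(col, value):
    """Hypothetical helper: equality mask, or an all-True mask if `value` never occurs."""
    m = col.eq(value)
    # When m has no True at all, ~m is True everywhere, i.e. "no filter"
    return m if m.any() else ~m

df = pd.DataFrame({'One': ['a', 'a', 'a', 'b'],
                   'Two': ['x', 'y', 'y', 'y'],
                   'Three': ['l', 'm', 'm', 'l']})

# Filtered by columns One and Three; the invalid value for Two is ignored
out = df[eq_or_all(df['One'], 'a') & eq_or_all(df['Two'], 'NA') & eq_or_all(df['Three'], 'm')]
print(out)
```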

This answer was posted as an edit to the question Ignoring an invalid filter among multiple filters on a DataFrame by the OP Akshay Sehgal under CC BY-SA 4.0.

Answered By: vvvvv