pandas select rows based on values in dictionary

Question:

I am trying to select rows from a pandas DataFrame based on a variable number of columns and values. For a fixed column and value, one can do this:

df = pd.DataFrame([{'name' : 'ruben','age' : 25},{'name' : 'henk', 'age' : 26},{'name' : 'gijs', 'age' : 20}])

column_name = 'name'
column_value = 'ruben'

rows = df[df[column_name] == column_value]

However, I want to do this for a variable number of column-value pairs, for example from a dictionary:

df = pd.DataFrame([{'name' : 'ruben','age' : 25},{'name' : 'henk', 'age' : 26},{'name' : 'gijs', 'age' : 20}])

column_value_pairs = {'name' : 'ruben','age' : 25}
rows = df[???]

This should return all rows where the name is ruben and the age is 25, which is equivalent to:

rows = df[(df['name'] == 'ruben') & (df['age'] == 25)]

but with the columns and values taken from the dictionary.

Asked By: user3053216


Answers:

Would just iterating over your dict work?

# Narrow df one pair at a time; note that this overwrites df with the filtered rows.
for key, value in column_value_pairs.items():
    df = df.loc[df[key] == value]
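
If you would rather keep the original df intact, the same loop can build a boolean mask instead. This is only a minimal sketch: it assumes the age in the dictionary is the integer 25 (so it can match the integer column), and the name mask is my own.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'ruben', 'age': 25},
    {'name': 'henk', 'age': 26},
    {'name': 'gijs', 'age': 20},
])

# Assumption: age is the integer 25, matching the dtype of the age column.
column_value_pairs = {'name': 'ruben', 'age': 25}

# Start from an all-True mask and AND in one condition per pair.
mask = pd.Series(True, index=df.index)
for key, value in column_value_pairs.items():
    mask &= df[key] == value

rows = df[mask]  # df itself is left unchanged
print(rows)
```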
Answered By: Tom S

You can generate a query string and then pass it to df.query.

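# Join the conditions with ' and ' directly, so nothing has to be stripped afterwards.
# (str.rstrip('and ') strips a set of characters, not a suffix, so it is easy to misuse.)
query_string = ' and '.join(
    f'({key} == "{val}")' for key, val in column_value_pairs.items()
)

# astype(str) turns every column into strings so the quoted values compare equal;
# the returned rows contain strings as well.
df.astype(str).query(query_string)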
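
If you want to keep the original dtypes, one possible variation is to format each value with repr(), so strings are quoted and numbers are not. This is a sketch under two assumptions: the column names are valid Python identifiers, and the values are plain Python strings or numbers whose repr is a valid literal.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'ruben', 'age': 25},
    {'name': 'henk', 'age': 26},
    {'name': 'gijs', 'age': 20},
])

# Assumption: values have the same types as the columns (age is an int).
column_value_pairs = {'name': 'ruben', 'age': 25}

# {val!r} yields 'ruben' (quoted) for strings and 25 (bare) for ints.
query_string = ' and '.join(
    f'({key} == {val!r})' for key, val in column_value_pairs.items()
)

rows = df.query(query_string)  # no astype(str), so dtypes are preserved
print(query_string)
print(rows)
```

Do not build the string this way from untrusted input, since the text is evaluated as an expression.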
Answered By: Nk03

Since you say you "want to do this for a variable amount of column-value pairs", this example covers the general case.

You can put any dictionaries of column-value pairs into ldict.

ldict can contain:

  • dictionaries that use different sets of columns
  • one or many dictionaries

This makes it possible to build complex requests that combine several dictionaries, each covering different columns.

import pandas as pd
df = pd.DataFrame([{'name' : 'ruben','age' : 25,'height' : 160}, 
                   {'name' : 'henk', 'age' : 26,'height' : 180},
                   {'name' : 'gijs', 'age' : 20,'height' : 175}])

ldict = [{'name' : 'ruben','age' : 25}, {'name' : 'gijs','age' : 20, 'height' : 175}]

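def djoin(d, req=''):
    # AND together one condition per column-value pair of a single dictionary.
    # {1!r} quotes strings and leaves numbers bare, so age is compared with 25, not "25".
    return req + ' & '.join('{0} == {1!r}'.format(k, v) for k, v in d.items())

# OR together the conditions of all dictionaries.
result = df.query(' | '.join(map(djoin, ldict)))

# request: name == 'ruben' & age == 25 | name == 'gijs' & age == 20 & height == 175
print(result)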
    name  age  height
0  ruben   25     160
2   gijs   20     175
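
For comparison, the same "AND within a dictionary, OR between dictionaries" logic can probably be written without a query string by combining boolean masks. This is a rough sketch; match_all is a hypothetical helper name, and it assumes ldict and every dictionary in it are non-empty.

```python
from functools import reduce
import operator

import pandas as pd

df = pd.DataFrame([
    {'name': 'ruben', 'age': 25, 'height': 160},
    {'name': 'henk', 'age': 26, 'height': 180},
    {'name': 'gijs', 'age': 20, 'height': 175},
])

ldict = [{'name': 'ruben', 'age': 25},
         {'name': 'gijs', 'age': 20, 'height': 175}]

def match_all(frame, d):
    # Hypothetical helper: AND of one equality mask per column-value pair.
    return reduce(operator.and_, (frame[k] == v for k, v in d.items()))

# OR of the per-dictionary masks.
mask = reduce(operator.or_, (match_all(df, d) for d in ldict))

result = df[mask]
print(result)
```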

Answered By: Laurent B.

This is not as general as I would like, because the expression has to be adapted to the fields of the DataFrame. However, it filters the DataFrame in a single expression, which makes it fast, and it allows sel_dict to omit fields that are not important. Because there is no loop and there are no external function calls, it will probably be the fastest.

sel_dict = column_value_pairs

df[(('name'   not in sel_dict) or (sel_dict['name']   == df['name'])) &
   (('age'    not in sel_dict) or (sel_dict['age']    == df['age'])) &
   (('height' not in sel_dict) or (sel_dict['height'] == df['height']))]
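
A possibly more general form, which does not have to be edited per field, compares only the columns named in the dictionary against a Series built from it. This is a sketch rather than a benchmarked replacement; it assumes sel_dict is non-empty, that all its keys are columns of df, and that the values have the same types as the columns.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'ruben', 'age': 25, 'height': 160},
    {'name': 'henk', 'age': 26, 'height': 180},
    {'name': 'gijs', 'age': 20, 'height': 175},
])

# Assumption: height is omitted on purpose, so it is not checked.
sel_dict = {'name': 'ruben', 'age': 25}

# Select the columns in dictionary order, compare each with its wanted value,
# and keep the rows where every selected column matches.
mask = (df[list(sel_dict)] == pd.Series(sel_dict)).all(axis=1)

rows = df[mask]
print(rows)
```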

I would like to find a trick similar to the boolean expression above for selecting matching dicts in a list-of-dicts structure:

def compare_dicts_by_fields(d1, d2, fields=None):
    # Compare two dicts on the given fields (a list of keys or a single str).
    # If fields is not provided, the keys of d2 are used.
    
    if fields is None:
        fields = d2.keys()
    elif isinstance(fields, str):
        fields = [fields]
    for field in fields:
        v1 = d1[field]
        v2 = d2[field]
        
        if v1 == v2:
            continue
            
        return False
        
    return True
    
    
def search_lod_by_dict(lod, dpat, fields=None):
    # return list of indices of lod which match elements provided in dpat.
    # compare only fields listed or use all fields of second dict.
    # fields is list or str.

    return [i for i, d in enumerate(lod) if compare_dicts_by_fields(d, dpat, fields)]
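
To show how this is meant to be called, here is a self-contained usage sketch. It uses a compact variant of the search function with the comparison inlined via all(); the sample data and the patterns are made up for illustration.

```python
def search_lod_by_dict(lod, dpat, fields=None):
    # Compact variant: return indices of dicts in lod matching dpat on fields.
    if fields is None:
        fields = list(dpat)
    elif isinstance(fields, str):
        fields = [fields]
    return [i for i, d in enumerate(lod)
            if all(d[f] == dpat[f] for f in fields)]

# Made-up sample data for illustration.
lod = [
    {'name': 'ruben', 'age': 25, 'height': 160},
    {'name': 'henk', 'age': 26, 'height': 180},
    {'name': 'gijs', 'age': 20, 'height': 175},
]

# Match on every key of the pattern.
by_pattern = search_lod_by_dict(lod, {'name': 'gijs', 'age': 20})

# Match on a single field only; the other keys of the pattern are ignored.
by_age = search_lod_by_dict(lod, {'name': 'nobody', 'age': 26}, fields='age')

print(by_pattern, by_age)
```

Like the original, this raises KeyError if a requested field is missing from one of the dicts.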

A similar function could be called instead of the repetitive boolean expression I suggested, but I doubt it can be faster.

Answered By: Ray Lutz