How do you filter pandas DataFrames by multiple columns?
Question:
To filter a DataFrame (df) by a single column, for example data containing males and females, we might write:
males = df[df['Gender'] == 'Male']
Question 1: What if the data spanned multiple years and I only wanted to see males for 2014?
In other languages I might do something like:
if A = "Male" and if B = "2014" then
(except that I want the result to be a subset of the original DataFrame, stored in a new DataFrame object)
Question 2: How do I do this in a loop, creating a DataFrame object for each unique combination of year and gender (i.e. a df for each of 2013-Male, 2013-Female, 2014-Male, and 2014-Female)?
for y in year:
    for g in gender:
        df = .....
Answers:
Use the & operator, and don’t forget to wrap each sub-condition in ():
males = df[(df['Gender'] == 'Male') & (df['Year'] == 2014)]
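As a minimal runnable sketch of the combined mask, assuming a toy DataFrame with Gender and Year columns (the data is invented for illustration): each comparison produces a boolean Series, and & combines them element-wise. The parentheses are needed because & binds more tightly than == in Python.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Year": [2013, 2013, 2014, 2014],
})

# Each comparison yields a boolean Series aligned with df's index.
is_male = df["Gender"] == "Male"
is_2014 = df["Year"] == 2014

# & combines the two masks element-wise; only rows where both are True remain.
males_2014 = df[is_male & is_2014]
print(males_2014)
```

Naming the masks first also sidesteps the precedence problem, since no comparison shares an expression with &.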
To store your DataFrames in a dict using a for loop:
from collections import defaultdict
dic = {}
for g in ['Male', 'Female']:
    dic[g] = defaultdict(dict)
    for y in [2013, 2014]:
        # store the DataFrames in a dict of dicts
        dic[g][y] = df[(df['Gender'] == g) & (df['Year'] == y)]
EDIT:
A demo for your getDF:
def getDF(dic, gender, year):
    return dic[gender][year]
print(getDF(dic, 'Male', 2014))
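A different technique from the nested loop above, offered only as a sketch: groupby can produce one sub-DataFrame per (gender, year) combination that actually occurs in the data. The column names follow the question; the Score column and the sample rows are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Year": [2013, 2013, 2014, 2014, 2014],
    "Score": [1, 2, 3, 4, 5],
})

# Iterating a groupby over two columns yields ((gender, year), sub_frame) pairs.
frames = {key: group for key, group in df.groupby(["Gender", "Year"])}

print(frames[("Male", 2014)])
```

Unlike the explicit loops, this only creates entries for combinations present in the data, so looking up a missing pair raises a KeyError instead of returning an empty DataFrame.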
For more general boolean functions that you would like to use as a filter and that depend on more than one column, you can use:
df = df[df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)]
where f is a function that is applied to every pair of elements (x1, x2) from col_1 and col_2 and returns True or False depending on any condition you want on (x1, x2).
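A small sketch of this row-wise pattern, assuming a hypothetical f that keeps rows where the first value exceeds the second (both f and the sample data are made up for illustration):

```python
import pandas as pd

def f(x1, x2):
    # Hypothetical condition: keep rows where col_1 is greater than col_2.
    return x1 > x2

# Toy data, invented for illustration only.
df = pd.DataFrame({"col_1": [5, 1, 7], "col_2": [3, 4, 7]})

# apply(..., axis=1) calls the lambda once per row; *row unpacks the two values.
mask = df[["col_1", "col_2"]].apply(lambda row: f(*row), axis=1)
filtered = df[mask]
print(filtered)
```

Row-wise apply runs Python code for every row, so when the condition can be written with vectorized comparisons (here df['col_1'] > df['col_2']), that form is usually much faster.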
Starting with pandas 0.13, you can use query(), which can be the most efficient way on large DataFrames.
df.query('Gender == "Male" & Year == 2014')  # assumes Year is stored as integers
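When the target values live in Python variables, query() can reference them with the @ prefix rather than having them pasted into the string. A minimal sketch, assuming a toy DataFrame whose Year column holds integers (the variable names gender and year are chosen here for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Year": [2014, 2014, 2013],
})

gender = "Male"
year = 2014

# @name looks up a local Python variable, so no manual quoting is needed.
result = df.query("Gender == @gender and Year == @year")
print(result)
```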
You can filter by multiple columns (more than two) by using np.logical_and.reduce in place of & (or np.logical_or.reduce in place of |).
Here’s an example function that does the job, if you provide target values for multiple fields. You can adapt it for different types of filtering and whatnot:
import numpy as np

def filter_df(df, filter_values):
    """Filter df by matching targets for multiple columns.

    Args:
        df (pd.DataFrame): dataframe
        filter_values (None or dict): Dictionary of the form:
            `{<field>: <target_values_list>}`
            used to filter columns data.
    """
    # None and an empty dict both mean "no filtering".
    if not filter_values:
        return df
    return df[
        np.logical_and.reduce([
            df[column].isin(target_values)
            for column, target_values in filter_values.items()
        ])
    ]
Usage:
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4]})
# keeps the rows where a is in [1, 2, 3] AND b is in [1, 2, 4], i.e. rows 0 and 1
filter_df(df, {
    'a': [1, 2, 3],
    'b': [1, 2, 4]
})
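For the "or" case mentioned earlier, np.logical_or.reduce combines the same kind of mask list but keeps rows that match any of the conditions. A brief sketch on the same toy DataFrame (the chosen target values are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 3, 4]})

# One boolean mask per condition.
masks = [df["a"].isin([1]), df["b"].isin([4])]

# reduce() folds the list with element-wise OR, giving one boolean array.
either = df[np.logical_or.reduce(masks)]
print(either)
```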
In case somebody wonders which way of filtering is faster (the accepted answer or the one from @redreamality):
import pandas as pd
import numpy as np
length = 100_000
df = pd.DataFrame()
df['Year'] = np.random.randint(1950, 2019, size=length)
df['Gender'] = np.random.choice(['Male', 'Female'], length)
%timeit df.query('Gender=="Male" & Year=="2014" ')
%timeit df[(df['Gender']=='Male') & (df['Year']==2014)]
Results for 100,000 rows:
6.67 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.54 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Results for 10,000,000 rows:
326 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
472 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So the results depend on the data and its size. On my laptop, query() becomes faster beyond about 500k rows. Further, the string comparison in Year=="2014" adds unnecessary overhead (Year==2014 is faster).
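%timeit only works inside IPython or Jupyter. A rough sketch of the same comparison as a plain script, using the standard timeit module; the helper names with_query and with_mask and the repeat count are choices made here, and the timings will differ from machine to machine:

```python
import timeit

import numpy as np
import pandas as pd

length = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Year": rng.integers(1950, 2019, size=length),
    "Gender": rng.choice(["Male", "Female"], size=length),
})

def with_query():
    # Integer comparison on Year, as recommended above.
    return df.query('Gender == "Male" and Year == 2014')

def with_mask():
    return df[(df["Gender"] == "Male") & (df["Year"] == 2014)]

runs = 20
for fn in (with_query, with_mask):
    seconds = timeit.timeit(fn, number=runs) / runs
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Changing length lets you look for the crossover point on your own hardware.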
You can create your own filter function using query in pandas. The function below filters df by all of the kwargs parameters. Don’t forget to add some validation (of the kwargs) when adapting the filter function to your own df.
def filter(df, **kwargs):
    query_list = []
    for key in kwargs.keys():
        # each value is embedded as a quoted string, so this matches string columns
        query_list.append(f'{key}=="{kwargs[key]}"')
    query = ' & '.join(query_list)
    return df.query(query)
Since you are looking for rows that meet a condition where Column_A=’Value_A’ and Column_B=’Value_B’, you can use loc:
df = df.loc[df['Column_A'].eq('Value_A') & df['Column_B'].eq('Value_B')]
See the pandas documentation for loc for full details.
After a few years I came back to this question and can propose another solution; it’s especially good when you have lots of filters. We can create several filtering masks and then combine them:
>>> df = pd.DataFrame({'gender': ['Male', 'Female', 'Male'],
...                    'married': [True, False, False]})
>>> gender_mask = df['gender'] == 'Male'
>>> married_mask = df['married']
>>> filtered_df = df.loc[gender_mask & married_mask]
>>> filtered_df
  gender  married
0   Male     True
Maybe it’s not the shortest solution, but it’s readable and could be a great help to organize the code.
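When the number of masks grows, they can be collected in a list and folded together instead of being chained by hand. A sketch using functools.reduce with operator.and_; the age column and the specific conditions are invented for illustration:

```python
from functools import reduce
import operator

import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "married": [True, False, False],
    "age": [40, 35, 28],
})

# One named condition per entry; add or remove entries as needed.
masks = [
    df["gender"] == "Male",
    df["age"] < 45,
    ~df["married"],  # ~ negates a boolean mask
]

# operator.and_ is the function form of &, applied pairwise across the list.
combined = reduce(operator.and_, masks)
filtered_df = df.loc[combined]
print(filtered_df)
```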
An improvement on Alex’s answer:
def df_filter(df, **kwargs):
    query_list = []
    for key, value in kwargs.items():
        if value is not None:
            # @kwargs[...] lets query() read the value directly, whatever its type
            query_list.append(f"{key}==@kwargs['{str(key)}']")
    if not query_list:
        return df  # nothing to filter on; query('') would raise an error
    query = ' & '.join(query_list)
    return df.query(query)
This skips None values, so it can be incorporated directly into functions where some arguments default to None.
Also, the previous version would not work if a value was not a string; this one works with arguments of any type.
My DataFrame has 25 columns, and I want to keep the freedom to choose any kind of filter in the future (number of parameters, conditions).
I use this:
def flex_query(params):
    res = load_dataframe()
    if not isinstance(params, list):
        return None
    for el in params:
        # el is [column, operator, value]
        res = res.query(f"{el[0]} {el[1]} {el[2]}")
    return res
And call it like this:
res = flex_query([['DATE','==', '"2022-09-26"'],['LEVEL','>=',2], ['PERCENT','>',10.2]])
Here ‘DATE’, ‘LEVEL’, and ‘PERCENT’ are column names.
As you can see, this is a very flexible query method that accepts several parameters and different types of conditions. It lets me compare int, float, and string values all in one place.
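Since load_dataframe() is not shown, here is a self-contained sketch of the same idea that takes the DataFrame as an argument instead. The name flex_query_df and the sample data are assumptions made for illustration, not part of the answer above:

```python
import pandas as pd

def flex_query_df(df, params):
    """Apply each [column, operator, value] condition in turn (hypothetical variant)."""
    result = df
    for column, op, value in params:
        # String values must arrive already quoted, e.g. '"2022-09-26"'.
        result = result.query(f"{column} {op} {value}")
    return result

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "DATE": ["2022-09-26", "2022-09-26", "2022-09-27"],
    "LEVEL": [1, 3, 5],
    "PERCENT": [50.0, 12.5, 99.0],
})

res = flex_query_df(df, [
    ["DATE", "==", '"2022-09-26"'],
    ["LEVEL", ">=", 2],
    ["PERCENT", ">", 10.2],
])
print(res)
```

Because the conditions are pasted into a query string, this pattern is best kept to trusted input.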