How to filter a dataframe multiple times in a loop (multiple conditions and one-to-many dataframe results)?

Question:

I have a dataframe, and a list with some columns of that dataframe. I need to take all distinct values of those columns, store them, and make a separate dataframe for each combination of those distinct values in the original dataframe. Then, export each of those dataframes to an Excel file (no problem with that). For example:

[example table image]

That table would be converted to a dataframe, and let's suppose the list of columns is ['OS', 'Work']. In the end, I'll have a dictionary with each column as a key and the set of its distinct values as the value for that key, as follows:

data = {'OS': {'IOS', 'Linux', 'Windows'}, 'Work': {'Developer', 'CEO', 'Administrator', 'Engineer'}}
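Building that dictionary is not the problem; for reference, it comes from something like the following (a minimal sketch with hypothetical data standing in for the real dataframe):

import pandas as pd

# hypothetical frame standing in for the real dataframe
df = pd.DataFrame({'OS': ['IOS', 'Linux', 'IOS'],
                   'Work': ['Developer', 'CEO', 'CEO']})
columns = ['OS', 'Work']

# one set of distinct values per column
data = {col: set(df[col].unique()) for col in columns}
# {'OS': {'IOS', 'Linux'}, 'Work': {'Developer', 'CEO'}}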

Now comes the problem (and the code block I’ll show). I need to filter the dataframe according to combinations of those values, for example:

Dataframe 1) IOS + Developer —> Will contain only the rows that have IOS in the OS column and Developer in the Work column

Dataframe 2) IOS + CEO —> Will contain only the rows that have IOS in the OS column and CEO in the Work column

It is important to note that I have no idea which columns or which dataframe will be provided; it could be any number of columns, with any number of distinct values, and the algorithm should work in all cases.
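For a single, known combination the filter itself is easy to write by chaining boolean masks (a minimal sketch, assuming the dataframe is called df as above):

# rows for one hard-coded combination: IOS in 'OS' and Developer in 'Work'
df_ios_developer = df[(df['OS'] == 'IOS') & (df['Work'] == 'Developer')]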

This is the code I have so far:

# data is the dictionary with the values as shown, it will automatically get all
# the columns and distinct values, for any number of columns and any dataframe

# column_name is the name of the column I'm about to filter, and N is the condition
# (for example, df['OS'] == 'Linux' will only take rows that have Linux in that column)

for N in data:
    out = path + f'{name}({N})'
    df_aux = df[df[column_name] == N]
    with pandas.ExcelWriter(out) as writer:
        df_aux.to_excel(writer)  # ... exports the dataframe to an Excel .xlsx file

# this works for one column (working with a string and a set instead of a dictionary),
# but this is my (failing) attempt for multiple columns:

for col in data:
    for N in data[col]:
        #... and then filter with
        df_aux = df[df[col] == N]

#... and then export it to an Excel file at this level of indentation

I've tried different levels of indentation, using a multidimensional array instead of a dictionary, using an ordered dictionary... In the end, I really don't know how to make the loop work, and that's the core issue. My idea right now is to make a dataframe with the distinct values of the columns and simply generate all the different possibilities by walking through it, but I still don't know how to write the loop, because I don't know how to filter the original dataframe with an arbitrary number of conditions.
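To make the idea concrete, enumerating the combinations themselves seems doable, for example with itertools.product over the dictionary values (sketch below); what I'm missing is how to turn each combination back into a filter on the original dataframe and export it.

from itertools import product

data = {'OS': {'IOS', 'Linux', 'Windows'},
        'Work': {'Developer', 'CEO', 'Administrator', 'Engineer'}}

# one tuple per combination, e.g. ('IOS', 'Developer'), ('IOS', 'CEO'), ...
combinations = list(product(*data.values()))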

Asked By: Joaquin Hernandez


Answers:

This can be solved using the groupby function from pandas. A function for input data with arbitrary columns could look like this:

import pandas as pd

def create_dataframes_by_columns(data, columns_to_group_by):
    # one sub-dataframe per combination of values in the grouping columns
    dataframes = []
    for name, group in data.groupby(columns_to_group_by):
        dataframes.append(group)

    # distinct values of each grouping column in the original dataframe
    unique_values = {col: pd.unique(data[col]).tolist() for col in columns_to_group_by}

    return unique_values, dataframes

This returns two values: a dictionary of unique values for the columns you group by, and a list of dataframes, each of which contains only the rows with one combination of values in columns_to_group_by.
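For example, called on a small frame (hypothetical data, just to show the return shape):

df = pd.DataFrame({
    'OS':   ['IOS', 'IOS', 'Linux'],
    'Work': ['Developer', 'CEO', 'Developer'],
})

unique_values, dataframes = create_dataframes_by_columns(df, ['OS', 'Work'])
# unique_values -> {'OS': ['IOS', 'Linux'], 'Work': ['Developer', 'CEO']}
# dataframes    -> one dataframe per (OS, Work) combination present in df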

If you wanted to save each dataframe into an Excel file, you could do something like this (fully reproducible example):

import pandas as pd

df = pd.DataFrame({
    'name': [
        'Maria',
        'Ana',
        'Gabriel',
        'Marcos',
        'Ana',
        'Joaquin',
        'Alberto',
        'Maria',
        'Marta',
        'Belen'
    ],
    'work': [
        'Developer',
        'Administrator',
        'CEO',
        'Engineer',
        'Developer',
        'Developer',
        'Administrator',
        'CEO',
        'Developer',
        'Engineer'
    ],
    'OS': [
        'IOS',
        'Linux',
        'Linux',
        'Windows',
        'Linux',
        'Windows',
        'IOS',
        'IOS',
        'Windows',
        'Windows'
    ]
})
columns_to_group_by = ['work', 'OS']

for name, group in df.groupby(columns_to_group_by):
    filename_parts = ['data']
    for colname in name:
        filename_parts.append(colname)
    save_path = '_'.join(filename_parts) + '.xlsx'
    group.to_excel(save_path)

The value 'name' in groupby is a tuple containing the values that define a given group; I use those values to build the Excel filename.
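One caveat: if you end up grouping by a single column (or pass a plain string), name can be a scalar rather than a tuple, depending on the pandas version. A defensive variant of the loop above (reusing df and columns_to_group_by from the example):

for name, group in df.groupby(columns_to_group_by):
    # normalize: with a single grouping column, `name` may be a scalar
    # instead of a tuple, depending on the pandas version
    key = name if isinstance(name, tuple) else (name,)
    save_path = '_'.join(['data', *map(str, key)]) + '.xlsx'
    group.to_excel(save_path)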

Answered By: druskacik