How to aggregate rows from a CSV, excluding ones based on a list of values

Question:

I have a csv with the following data:

"id","Title","Author(s)","Format","Size","Tags"
"1","Horse","John","KFX","122","Classic"
"1","Horse","John","KFX","122","Drama"
"1","Horse","John","KFX","122","Horror"
"1","Horse","John","AZW3","122","Classic"
"1","Horse","John","AZW3","122","Drama"
"1","Horse","John","AZW3","122","Horror"
"1","Horse","John","PDF","122","Classic"
"1","Horse","John","PDF","122","Drama"
"1","Horse","John","PDF","122","Horror"
"2","Banana","Anna","AZW3","312","SciFi"
"2","Banana","Julia","AZW3","312","SciFi"
"2","Banana","Anna","PDF","312","SciFi"
"2","Banana","Julia","PDF","312","SciFi"

I would like to, using pandas, get this:

"id","Title","Author(s)","Format","Size","Tags"
"1","Horse","John","KFX","122","Classic, Drama, Horror"
"2","Banana","Anna, Julia","AZW3","312","SciFi"

Where it will exclude rows based in values of a list
KFX, AZW3, PDF

If it has KFX, exclude the other two.
If it doesn’t have KFX, check for the next in the list, AZW3, if exist, exclude the PDF (if exists)

Then, aggregate Author(s) and Tags.

I am searching for ways of doing this and so far had no success.
I’m trying to group by id, then exclude the rows based on list (KFX, AZW3, PDF), then aggregate.

I was able to aggregate using:

df.groupby(['id']).agg(lambda x: ','.join(x))

But there are two problems:
First, it will include rows containing all "Format".
Second, it will repeat other columns as well
"1, 1, 1","Horse, Horse, Horse","John, John, John"

and so on.

Asked By: Migu3litto

||

Answers:

You can use custom functions to aggregate:

def uniq_str(l):
    return ', '.join(dict.fromkeys(map(str, l)))

def agg_format(l):
    s = set(l) # not necessary if only 1 comparison
    if 'KFX' in s:
        return 'KFX'
    else:
        exclude = {'PDF'}
        return next((x for x in l if x not in exclude), None)
    
df = pd.read_csv('file.csv')
    
agg = {c: uniq_str for c in df}
agg['Format'] = agg_format

import csv 
(df.groupby('id', as_index=False).agg(agg)
   .to_csv('out.csv', index=False, quoting=csv.QUOTE_ALL)
)

output:

"id","Title","Author(s)","Format","Size","Tags"
"1","Horse","John","KFX","122","Classic, Drama, Horror"
"2","Banana","Anna, Julia","AZW3","312","SciFi"
Answered By: mozway
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.