Store rows of DataFrame with certain value in list

Question:

I have a DataFrame like:

id  country  city   amount  duplicated
1   France   Paris  200     1
2   France   Paris  200     1
3   France   Lyon   50      2
4   France   Lyon   50      2
5   France   Lyon   50      2

And I would like to store a list per distinct value in duplicated, like:

list 1

[
    {
        "id": 1,
        "country": "France",
        "city": "Paris",
        "amount": 200,
    },
    {
        "id": 2,
        "country": "France",
        "city": "Paris",
        "amount": 200,
    }
  ]

list 2

[
    {
        "id": 3,
        "country": "France",
        "city": "Lyon",
        "amount": 50,
    },
    {
        "id": 4,
        "country": "France",
        "city": "Lyon",
        "amount": 50,
    },
    {
        "id": 5,
        "country": "France",
        "city": "Lyon",
        "amount": 50,
    }
  ]

I tried filtering duplicates with

df[df.duplicated(['country', 'city', 'amount', 'duplicated'], keep=False)]

but it just returns the same df.

Asked By: Taco22


Answers:

If I understand you correctly, you can use DataFrame.to_dict('records') to make your lists:

list_1 = df[df['duplicated'] == 1].to_dict('records')
list_2 = df[df['duplicated'] == 2].to_dict('records')

Or for an arbitrary number of values in the column, you can make a dict:

result = {}
for value in df['duplicated'].unique():
    result[value] = df[df['duplicated'] == value].to_dict('records')
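
Put together with the sample data from the question, the loop above runs as a self-contained sketch:

```python
import pandas as pd

# Rebuild the sample DataFrame from the question
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'country': ['France'] * 5,
    'city': ['Paris', 'Paris', 'Lyon', 'Lyon', 'Lyon'],
    'amount': [200, 200, 50, 50, 50],
    'duplicated': [1, 1, 2, 2, 2],
})

# One list of record-dicts per distinct value in 'duplicated'
result = {}
for value in df['duplicated'].unique():
    result[value] = df[df['duplicated'] == value].to_dict('records')

print(len(result[1]), len(result[2]))  # 2 3
```

Note that to_dict('records') keeps every column, so each dict also carries the 'duplicated' key; drop that column first if you don't want it in the output.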
Answered By: jprebys

You can use groupby:

lst = (df.groupby(['country', 'city', 'amount'])  # or .groupby('duplicated')
         .apply(lambda x: x.to_dict('records'))
         .tolist())

Output:

>>> lst
[[{'id': 3,
   'country': 'France',
   'city': 'Lyon',
   'amount': 50,
   'duplicated': 2},
  {'id': 4,
   'country': 'France',
   'city': 'Lyon',
   'amount': 50,
   'duplicated': 2},
  {'id': 5,
   'country': 'France',
   'city': 'Lyon',
   'amount': 50,
   'duplicated': 2}],
 [{'id': 1,
   'country': 'France',
   'city': 'Paris',
   'amount': 200,
   'duplicated': 1},
  {'id': 2,
   'country': 'France',
   'city': 'Paris',
   'amount': 200,
   'duplicated': 1}]]
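
The commented alternative, grouping on the duplicated column directly, can be sketched the same way (sample data rebuilt from the question; the list comprehension over the groupby object is an equivalent substitute for the apply call above):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'country': ['France'] * 5,
    'city': ['Paris', 'Paris', 'Lyon', 'Lyon', 'Lyon'],
    'amount': [200, 200, 50, 50, 50],
    'duplicated': [1, 1, 2, 2, 2],
})

# Group on 'duplicated' directly: one inner list of record-dicts per distinct value
lst = [g.to_dict('records') for _, g in df.groupby('duplicated')]

print(len(lst))  # 2
```

Since groupby sorts its keys, grouping on 'duplicated' yields the inner lists in key order (1 then 2), whereas grouping on ['country', 'city', 'amount'] puts Lyon before Paris, as in the output above.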

Another solution if you want a dict indexed by duplicated key:

data = {k: v.to_dict('records') for k, v in df.set_index('duplicated').groupby(level=0)}
>>> data[1]
[{'id': 1, 'country': 'France', 'city': 'Paris', 'amount': 200},
 {'id': 2, 'country': 'France', 'city': 'Paris', 'amount': 200}]

>>> data[2]
[{'id': 3, 'country': 'France', 'city': 'Lyon', 'amount': 50},
 {'id': 4, 'country': 'France', 'city': 'Lyon', 'amount': 50},
 {'id': 5, 'country': 'France', 'city': 'Lyon', 'amount': 50}]
Answered By: Corralien