Python: most efficient way to categorize transactions

Question:

I have a large list of transactions that I want to categorize.
It looks like this:

transactions: [
     {
        "id": "20200117-16045-0",
        "date": "2020-01-17",
        "creationTime": null,
        "text": "SuperB Vesterbro T 74637",
        "originalText": "SuperB Vesterbro T 74637",
        "details": null,
        "category": null,
        "amount": {
            "value": -160.45,
            "currency": "DKK"
        },
        "balance": {
            "value": 12572.68,
            "currency": "DKK"
        },
        "type": "Card",
        "state": "Booked"
    },
    {
        "id": "20200117-4800-0",
        "date": "2020-01-17",
        "creationTime": null,
        "text": "Rent        45228",
        "originalText": "Rent        45228",
        "details": null,
        "category": null,
        "amount": {
            "value": -48.00,
            "currency": "DKK"
        },
        "balance": {
            "value": 12733.13,
            "currency": "DKK"
        },
        "type": "Card",
        "state": "Booked"
    },
    {
        "id": "20200114-1200-0",
        "date": "2020-01-14",
        "creationTime": null,
        "text": "Superbest          86125",
        "originalText": "SUPERBEST          86125",
        "details": null,
        "category": null,
        "amount": {
            "value": -12.00,
            "currency": "DKK"
        },
        "balance": {
            "value": 12781.13,
            "currency": "DKK"
        },
        "type": "Card",
        "state": "Booked"
    }
]

I loaded in the data like this:

import json
import pandas as pd
from pandas import json_normalize

with open('transactions.json') as transactions:
    file = json.load(transactions)

data = json_normalize(file)['transactions'][0]
df = pd.DataFrame(data)

These are the categories I have so far, by which I want to group the transactions:

CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent']
}

Now I would like to loop through each row in the DataFrame and categorize each transaction.
I would like to do this by checking whether text contains one of the values from the CATEGORIES dictionary.

If so, that transaction should be assigned the corresponding key of the CATEGORIES dictionary, for instance Groceries.

How do I do this most efficiently?

Asked By: Mathias Lund

Answers:

If I understand your requirement correctly, we can build a pipe-delimited regex pattern from each list in your dictionary and assign the category with .loc:

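for k, v in CATEGORIES.items():
    # join the keywords into one regex alternation, e.g. 'SuperB|Superbest'
    pat = '|'.join(v)
    # label every row whose text matches any of the keywords
    df.loc[df['text'].str.contains(pat), 'category'] = k
print(df[['text', 'category']])

                       text   category
0  SuperB Vesterbro T 74637  Groceries
1         Rent        45228    Housing
2  Superbest          86125  Groceries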
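
One caveat with the loop above: the keywords are used as a regular expression, and the match is case-sensitive. Here is a minimal sketch of a more defensive variant, assuming you want case-insensitive matching and that keywords may contain regex metacharacters. The sample rows (including the 'Unknown shop' row and the upper-case 'SUPERBEST') are made up for illustration.

```python
import re
import pandas as pd

CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent'],
}

# Illustrative sample data, not the real transactions file
df = pd.DataFrame({
    'text': [
        'SuperB Vesterbro T 74637',
        'Rent        45228',
        'SUPERBEST          86125',
        'Unknown shop 11111',
    ],
    'category': [None, None, None, None],
})

for category, keywords in CATEGORIES.items():
    # re.escape treats each keyword as literal text, not as regex syntax
    pattern = '|'.join(re.escape(word) for word in keywords)
    # case=False ignores case; na=False treats missing text as "no match"
    mask = df['text'].str.contains(pattern, case=False, na=False)
    df.loc[mask, 'category'] = category

print(df)
```

Rows that match no keyword keep an empty category. Note that case-insensitive substring matching is looser, so short keywords such as 'Rent' can also match inside unrelated words.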

A more efficient solution:

We build a single list of all your values and extract them with str.extract. At the same time we invert your dictionary so that each value becomes a key, which we then map onto the target DataFrame.

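words = []
mapping_dict = {}
for k, v in CATEGORIES.items():
    for item in v:
        words.append(item)
        # invert the dictionary: keyword -> category
        mapping_dict[item] = k

# one capture group matching any keyword; column 0 holds the matched keyword
ext = df['text'].str.extract(f"({'|'.join(words)})")
df['category'] = ext[0].map(mapping_dict)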
print(df)
                       text   category
0  SuperB Vesterbro T 74637  Groceries
1         Rent        45228    Housing
2  Superbest          86125  Groceries
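
For completeness, here is a self-contained sketch of the same str.extract idea that can be run as-is. It rests on a few assumptions that are not in the question: the sample rows are invented, keywords are escaped with re.escape, longer keywords are tried first so a short keyword cannot shadow a longer one that starts with it, and unmatched rows get a hypothetical 'Uncategorized' label.

```python
import re
import pandas as pd

CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent'],
}

# Illustrative sample data, not the real transactions file
df = pd.DataFrame({
    'text': [
        'SuperB Vesterbro T 74637',
        'Rent        45228',
        'Superbest          86125',
        'Unknown shop 11111',
    ],
})

# Invert the dictionary: keyword -> category
keyword_to_category = {
    keyword: category
    for category, keywords in CATEGORIES.items()
    for keyword in keywords
}

# Longest keywords first, so a longer keyword wins over a shorter prefix of it
keywords = sorted(keyword_to_category, key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(keyword) for keyword in keywords) + ')'

# expand=False returns a Series holding the matched keyword (NaN if none)
matched = df['text'].str.extract(pattern, expand=False)

# 'Uncategorized' is a placeholder label chosen for this sketch
df['category'] = matched.map(keyword_to_category).fillna('Uncategorized')

print(df)
```

This makes a single regex pass over the column instead of one pass per category, which is where the speed-up over the loop comes from.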
Answered By: Umar.H