Python: most efficient way to categorize transactions
Question:
I have a large list of transactions that I want to categorize.
It looks like this:
{
  "transactions": [
    {
      "id": "20200117-16045-0",
      "date": "2020-01-17",
      "creationTime": null,
      "text": "SuperB Vesterbro T 74637",
      "originalText": "SuperB Vesterbro T 74637",
      "details": null,
      "category": null,
      "amount": {
        "value": -160.45,
        "currency": "DKK"
      },
      "balance": {
        "value": 12572.68,
        "currency": "DKK"
      },
      "type": "Card",
      "state": "Booked"
    },
    {
      "id": "20200117-4800-0",
      "date": "2020-01-17",
      "creationTime": null,
      "text": "Rent 45228",
      "originalText": "Rent 45228",
      "details": null,
      "category": null,
      "amount": {
        "value": -48.00,
        "currency": "DKK"
      },
      "balance": {
        "value": 12733.13,
        "currency": "DKK"
      },
      "type": "Card",
      "state": "Booked"
    },
    {
      "id": "20200114-1200-0",
      "date": "2020-01-14",
      "creationTime": null,
      "text": "Superbest 86125",
      "originalText": "SUPERBEST 86125",
      "details": null,
      "category": null,
      "amount": {
        "value": -12.00,
        "currency": "DKK"
      },
      "balance": {
        "value": 12781.13,
        "currency": "DKK"
      },
      "type": "Card",
      "state": "Booked"
    }
  ]
}
I loaded in the data like this:
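import json
import pandas as pd
from pandas import json_normalize

def load_transactions():  # enclosing function; its name is assumed here
    with open('transactions.json') as transactions:
        file = json.load(transactions)
    # json_normalize gives one row whose 'transactions' cell holds the list of records
    data = json_normalize(file)['transactions'][0]
    return pd.DataFrame(data)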
And I have the following categories so far, which I want to group the transactions by:
CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent']
}
Now I would like to loop through each row in the DataFrame and group each transaction.
I would like to do this by checking if text contains one of the values from the CATEGORIES dictionary.
If so, that transaction should get categorized as the key of the CATEGORIES dictionary, for instance Groceries.
How do I do this most efficiently?
Answers:
If I understand your requirement correctly, we can build a pipe-delimited pattern from each list in your dictionary and assign the matching category with .loc:
print(df)
for k, v in CATEGORIES.items():
    pat = '|'.join(v)
    df.loc[df['text'].str.contains(pat), 'category'] = k
print(df[['text', 'category']])
                       text   category
0  SuperB Vesterbro T 74637  Groceries
1                Rent 45228    Housing
2           Superbest 86125  Groceries
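One thing to watch with the loop above: str.contains treats the joined string as a regular expression, so a keyword containing characters such as "+" or "(" would change the pattern. A minimal, self-contained sketch of the same loop with escaped keywords follows. The small DataFrame is a stand-in for the question's data, and the last row ('Unknown shop 00000') is a made-up value added only to show what happens when nothing matches.

```python
import re

import pandas as pd

CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent'],
}

# Stand-in for the question's DataFrame; the last row is hypothetical.
df = pd.DataFrame({
    'text': [
        'SuperB Vesterbro T 74637',
        'Rent 45228',
        'Superbest 86125',
        'Unknown shop 00000',
    ],
    'category': [None, None, None, None],
})

for category, keywords in CATEGORIES.items():
    # re.escape keeps regex metacharacters in a keyword from being
    # interpreted as pattern syntax.
    pattern = '|'.join(re.escape(word) for word in keywords)
    # na=False treats missing text as "no match", so the mask stays boolean.
    mask = df['text'].str.contains(pattern, na=False)
    df.loc[mask, 'category'] = category

print(df)
```

Rows that match no keyword keep their original empty category, and if a row matches keywords from two categories, the category that comes later in the dictionary overwrites the earlier one.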
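A more efficient solution, since it scans the text column once instead of once per category:
we build a single list of all your values and pull the matching one out of each row with str.extract.
At the same time we invert your dictionary, so that each value becomes a key pointing to its category, and then map that onto your target DataFrame.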
words = []
mapping_dict = {}
for k, v in CATEGORIES.items():
    for item in v:
        words.append(item)
        mapping_dict[item] = k
ext = df['text'].str.extract(f"({'|'.join(words)})")
df['category'] = ext[0].map(mapping_dict)
print(df[['text', 'category']])
                       text   category
0  SuperB Vesterbro T 74637  Groceries
1                Rent 45228    Housing
2           Superbest 86125  Groceries
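A variation on the str.extract approach, in case the text is not consistently cased (the sample data has "SUPERBEST" in originalText). This is only a sketch under a few assumptions that are not in the question: matching is made case-insensitive, keywords are tried longest first so that "Superbest" is not shadowed by "SuperB" once case is ignored, and unmatched rows get a placeholder label 'Uncategorized'. The DataFrame is a stand-in, and its last row is a made-up value.

```python
import re

import pandas as pd

CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent'],
}

# Stand-in for the question's DataFrame; the last row is hypothetical.
df = pd.DataFrame({
    'text': [
        'SuperB Vesterbro T 74637',
        'Rent 45228',
        'SUPERBEST 86125',
        'Unknown shop 00000',
    ],
})

# Invert the dictionary: lower-cased keyword -> category.
keyword_to_category = {
    word.lower(): category
    for category, words in CATEGORIES.items()
    for word in words
}

# Longest keywords first: regex alternation takes the first alternative
# that matches, so the longer keyword must come before its own prefix.
alternatives = sorted(keyword_to_category, key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(word) for word in alternatives) + ')'

# expand=False returns a Series (one capture group) instead of a DataFrame.
matched = df['text'].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Normalise the matched text to lower case before looking it up.
df['category'] = (
    matched.str.lower()
    .map(keyword_to_category)
    .fillna('Uncategorized')  # placeholder label, an assumption
)

print(df)
```

Unlike the loop version, each row gets the category of the first keyword found in its text, so the result does not depend on the order of the categories in the dictionary.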