How to select and store string values in pandas using regex
Question:
I want to select and store string values(regex) into variable. Here I want to extract USD, 1000.00, 21/01/2022 from Name and store into separate variables for amount = 1000.0, date = 21/01/2022 and currency = USD respectively.
import pandas as pd
data_dict = {
'Name': ['I paid USD 1000.00 on 21/01/2022 for car service']
}
df = pd.DataFrame(data_dict)
Answers:
If the order is consistent and the currency always right before the amount, a simple regex with str.extract
should work:
df['Name'].str.extract(r'(?P<currency>[A-Z]{3})s*(?P<amount>d+(?:.d+)).*(?P<date>dd/dd/d{4})')
output:
currency amount date
0 USD 1000.00 21/01/2022
Notes
- This won’t validate the date (99/00/9876 would match)
- This assumes the currency is 3 uppercase letters (you can easily adapt)
- The type of all columns is string, if you plan to use the amount or date for operations you must convert them
variant: numbers with decimal separator
out = df['Name'].str.extract(r'(?P<currency>[A-Z]{3})s*(?P<amount>d+(?:,d{3})*(?:.d+)).*(?P<date>dd/dd/d{4})')
example:
currency amount date
0 USD 1000.00 21/01/2022
1 USD 100,000.00 21/01/2022
I want to select and store string values(regex) into variable. Here I want to extract USD, 1000.00, 21/01/2022 from Name and store into separate variables for amount = 1000.0, date = 21/01/2022 and currency = USD respectively.
import pandas as pd
data_dict = {
'Name': ['I paid USD 1000.00 on 21/01/2022 for car service']
}
df = pd.DataFrame(data_dict)
If the order is consistent and the currency always right before the amount, a simple regex with str.extract
should work:
df['Name'].str.extract(r'(?P<currency>[A-Z]{3})s*(?P<amount>d+(?:.d+)).*(?P<date>dd/dd/d{4})')
output:
currency amount date
0 USD 1000.00 21/01/2022
Notes
- This won’t validate the date (99/00/9876 would match)
- This assumes the currency is 3 uppercase letters (you can easily adapt)
- The type of all columns is string, if you plan to use the amount or date for operations you must convert them
variant: numbers with decimal separator
out = df['Name'].str.extract(r'(?P<currency>[A-Z]{3})s*(?P<amount>d+(?:,d{3})*(?:.d+)).*(?P<date>dd/dd/d{4})')
example:
currency amount date
0 USD 1000.00 21/01/2022
1 USD 100,000.00 21/01/2022