Perform multiple regex operations on each line of a text file and store the extracted data in respective columns
Question:
Data in test.txt
<ServiceRQ xmlns_xsi="http://"><SaleInfo><CityCode>DXB</CityCode><CountryCode>EG</CountryCode><Currency>USD</Currency><Channel>TA</Channel></SaleInfo><Pricing><CustomParams><Param Name="AG"><Value>95HAJSTI</Value></Param></CustomParams></Pricing></ServiceRQ>
<SearchRQ xmlns_xsi="http://"><SaleInfo><CityCode>CPT</CityCode><CountryCode>US</CountryCode><Currency>USD</Currency><Channel>AY</Channel></SaleInfo><Pricing><CustomParams><Param Name="AG"><Value>56ASJSTS</Value></Param></CustomParams></Pricing></SearchRQ>
<ServiceRQ xmlns_xsi="http://"><SaleInfo><CityCode>BOM</CityCode><CountryCode>AU</CountryCode><Currency>USD</Currency><Channel>QA</Channel></SaleInfo><Pricing><CustomParams><Param Name="AG"><Value>85ATAKSQ</Value></Param></CustomParams></Pricing></ServiceRQ>
<ServiceRQ ......
<SearchRQ ........
My code:
import pandas as pd
import re
columns = ['Request Type','Channel','AG']
# data = pd.DataFrame
exp = re.compile(r'<(.*)\s+xmlns'
                 r'<Channel>(.*)</Channel>'
                 r'<Param Name="AG">.*?<Value>(.*?)</Value>')
final = []
with open(r"test.txt") as f:
    for line in f:
        result = re.search(exp,line)
        final.append(result)
df = pd.DataFrame(final, columns)
print(df)
I want to iterate through each line and apply three regex operations to extract data from each line of the text file:
1. r'<(.*)\s+xmlns'
2. r'<Channel>(.*)</Channel>'
3. r'<Param Name="AG">.*?<Value>(.*?)</Value>'
Each regex extracts its own piece of data from a single line:
- the type of request
- the name of the channel
- the value present for AG
My expected output (Excel sheet):
Request Type    Channel    AG
ServiceRQ       TA         95HAJSTI
SearchRQ        AY         56ASJSTS
ServiceRQ       QA         85ATAKSQ
...             ...        ...
and so on.
How can I achieve the expected output?
Answers:
Try this regex. I don’t know what the rest of your file looks like, but it works with the sample you have shown so far.
result.groups() returns a tuple containing the text matched by each capture group, so append that tuple (rather than the match object itself) to the list.
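To make the role of groups() concrete, here is a minimal sketch on a single line. The line is a shortened, hypothetical sample with the same shape as the data in test.txt, and the names `line`, `pattern`, and `row` are illustrative only.

```python
import re

# Shortened, hypothetical sample line (same shape as the lines in test.txt)
line = ('<ServiceRQ xmlns_xsi="http://"><Channel>TA</Channel>'
        '<Param Name="AG"><Value>95HAJSTI</Value></Param></ServiceRQ>')

# Three capture groups: request type, channel, AG value
pattern = re.compile(r'<(\w+)\s+xmlns.*?'
                     r'<Channel>(.*?)</Channel>.*?'
                     r'<Value>(.*?)</Value>')

match = pattern.search(line)

# search() returns a Match object (or None); groups() turns it into a plain tuple
row = match.groups() if match else None
print(row)
```

Appending the Match object itself would not give pandas usable row values; the tuple from groups() has one entry per column.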
Complete example:
import pandas as pd
import re
columns = ['Request Type','Channel','AG']
file_data = """
<ServiceRQ xmlns_xsi="http://"><SaleInfo><CityCode>DXB</CityCode><CountryCode>EG</CountryCode><Currency>USD</Currency><Channel>TA</Channel></SaleInfo><Pricing><CustomParams><Param Name="AG"><Value>95HAJSTI</Value></Param></CustomParams></Pricing></ServiceRQ>
<SearchRQ xmlns_xsi="http://"><SaleInfo><CityCode>CPT</CityCode><CountryCode>US</CountryCode><Currency>USD</Currency><Channel>AY</Channel></SaleInfo><Pricing><CustomParams><Param Name="AG"><Value>56ASJSTS</Value></Param></CustomParams></Pricing></SearchRQ>
<ServiceRQ xmlns_xsi="http://"><SaleInfo><CityCode>BOM</CityCode><CountryCode>AU</CountryCode><Currency>USD</Currency><Channel>QA</Channel></SaleInfo><Pricing><CustomParams><Param Name="AG"><Value>85ATAKSQ</Value></Param></CustomParams></Pricing></ServiceRQ>
"""
# One pattern with three capture groups: request type, channel, AG value
exp = re.compile(r'<(\w+)\s+xmlns.*?>.*?'
                 r'<Channel>(.*?)</Channel>.*?'
                 r'<Param Name="AG"><Value>(.*?)</Value>')
final = []
for line in file_data.splitlines():
    result = re.search(exp,line)
    if result:
        # Skip lines that do not match (e.g. the blank first line)
        final.append(result.groups())
df = pd.DataFrame(final, columns=columns)
print(df)
Output:
  Request Type Channel        AG
0    ServiceRQ      TA  95HAJSTI
1     SearchRQ      AY  56ASJSTS
2    ServiceRQ      QA  85ATAKSQ
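The combined pattern only matches when all three fields appear on the line in that order. If some lines might lack a field, one possible variant is to apply the three patterns separately, as the question describes. This is only a sketch under that assumption: `PATTERNS`, `extract_row`, and `extract_rows` are hypothetical names, and it uses the standard library only so the extraction step can be checked on its own.

```python
import re

# Hypothetical names; the patterns are the three from the question,
# with a word-only request type and a non-greedy channel group.
PATTERNS = [
    re.compile(r'<(\w+)\s+xmlns'),                            # request type
    re.compile(r'<Channel>(.*?)</Channel>'),                  # channel
    re.compile(r'<Param Name="AG">.*?<Value>(.*?)</Value>'),  # AG value
]

def extract_row(line):
    """Apply each pattern to the line; use None where a pattern does not match."""
    row = []
    for pattern in PATTERNS:
        match = pattern.search(line)
        row.append(match.group(1) if match else None)
    return tuple(row)

def extract_rows(lines):
    """Return one tuple per line in which at least one pattern matched."""
    rows = []
    for line in lines:
        row = extract_row(line)
        if any(value is not None for value in row):
            rows.append(row)
    return rows

# First line of the sample data from the question
sample = ('<ServiceRQ xmlns_xsi="http://"><SaleInfo><CityCode>DXB</CityCode>'
          '<CountryCode>EG</CountryCode><Currency>USD</Currency>'
          '<Channel>TA</Channel></SaleInfo><Pricing><CustomParams>'
          '<Param Name="AG"><Value>95HAJSTI</Value></Param></CustomParams>'
          '</Pricing></ServiceRQ>')

print(extract_rows([sample]))
```

To use it on the real file, an open file object can be passed directly, for example extract_rows(f) inside with open("test.txt") as f. The resulting list can then go to pd.DataFrame(rows, columns=columns), and df.to_excel("output.xlsx", index=False) should write the sheet, assuming an Excel writer engine such as openpyxl is installed. The file name output.xlsx is a placeholder.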