Complex data cleaning using regex on python
Question:
I have data in devanagari that needs some extraction to be done. This is an example of a few lines
तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1 <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि
<अभ्युदय-अर्थः>T4 अपि यः <प्रवृत्ति-लक्षणः>Bs6 धर्मः वर्णान् आश्रमान् च उद्दिश्य विहितः सः <<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6 अपि सन् <<ईश्वर-अर्पण>T6-बुद्ध्या>T6 अनुष्ठीयमानः <सत्त्व-शुद्धये>T6 भवति <<फल-अभिसन्धि>T6-वर्जितः>T3
The alphanumerics are the tags of the text. I need to extract the binary compounds along with their tags (the alphanumerics immediately after the compound) from the line. Binary compounds are the two words hyphenated in the angular brackets.
<अभ्युदय-अर्थः>
<गीता-शास्त्रम्>
<विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6
The first two are both examples of binary compounds whereas the third one is not. The simplest way to identify a binary compound is to find two words hyphenated enclosed by one set of angular brackets and followed by a single tag. So after extraction, of say the first line, I should get a list with this in it
<गीता-शास्त्रम्>K7, <दुर्विज्ञेय-अर्थम्>K1
The code that I tried was this
import re
cw = re.findall('<(.*?)>', f)
tags = re.findall('[a-zA-Z0-9]+', f)
cc = re.sub("[<>a-zA-z0-9]", '', f)
print(cw, tags, cc)
This, unfortunately, finds everything in a list but I cannot map the tags to their original compounds this way. Is there a more intuitive way to do this?
Answers:
You can use
re.findall(r'<([^<>]*)>(w+)', text)
See the regex demo. Details:
<([^<>]*)>
– <
, then zero or more chars other than <
and >
captured into Group 1, and then >
(w+)
– Group 2: one or more word chars.
See the Python demo:
import re
text = "तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1 <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामिn<अभ्युदय-अर्थः>T4 अपि यः <प्रवृत्ति-लक्षणः>Bs6 धर्मः वर्णान् आश्रमान् च उद्दिश्य विहितः सः <<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6 अपि सन् <<ईश्वर-अर्पण>T6-बुद्ध्या>T6 अनुष्ठीयमानः <सत्त्व-शुद्धये>T6 भवति <<फल-अभिसन्धि>T6-वर्जितः>T3"
matches = list(re.finditer(r'<([^<>]*)>(w+)', text))
# Show overall matches and their positions:
for m in matches:
print( "Match: ", m.group(), ", Start position: ", m.start(), sep="")
print("---")
# Show groups and their positions:
for m in matches:
print( "Word: ", m.group(1), ", Word start position: ", m.start(1),
", Tag: ", m.group(2), ", Tag start position: ", m.start(2), sep="")
Output:
Match: <गीता-शास्त्रम्>K7, Start position: 9
Match: <समस्त-वेद>K1, Start position: 32
Match: <दुर्विज्ञेय-अर्थम्>K1, Start position: 80
Match: <तत्-अर्थ>T6, Start position: 105
Match: <पद-अर्थ>T6, Start position: 152
Match: <वाक्य-अर्थ>T6, Start position: 164
Match: <अत्यन्त-विरुद्ध>K1, Start position: 202
Match: <अनेक-अर्थ>K1, Start position: 222
Match: <अर्थ-निर्धारण>T6, Start position: 285
Match: <अभ्युदय-अर्थः>T4, Start position: 341
Match: <प्रवृत्ति-लक्षणः>Bs6, Start position: 366
Match: <देव-आदि>Bs6, Start position: 436
Match: <ईश्वर-अर्पण>T6, Start position: 489
Match: <सत्त्व-शुद्धये>T6, Start position: 530
Match: <फल-अभिसन्धि>T6, Start position: 555
---
Word: गीता-शास्त्रम्, Word start position: 10, Tag: K7, Tag start position: 25
Word: समस्त-वेद, Word start position: 33, Tag: K1, Tag start position: 43
Word: दुर्विज्ञेय-अर्थम्, Word start position: 81, Tag: K1, Tag start position: 100
Word: तत्-अर्थ, Word start position: 106, Tag: T6, Tag start position: 115
Word: पद-अर्थ, Word start position: 153, Tag: T6, Tag start position: 161
Word: वाक्य-अर्थ, Word start position: 165, Tag: T6, Tag start position: 176
Word: अत्यन्त-विरुद्ध, Word start position: 203, Tag: K1, Tag start position: 219
Word: अनेक-अर्थ, Word start position: 223, Tag: K1, Tag start position: 233
Word: अर्थ-निर्धारण, Word start position: 286, Tag: T6, Tag start position: 300
Word: अभ्युदय-अर्थः, Word start position: 342, Tag: T4, Tag start position: 356
Word: प्रवृत्ति-लक्षणः, Word start position: 367, Tag: Bs6, Tag start position: 384
Word: देव-आदि, Word start position: 437, Tag: Bs6, Tag start position: 445
Word: ईश्वर-अर्पण, Word start position: 490, Tag: T6, Tag start position: 502
Word: सत्त्व-शुद्धये, Word start position: 531, Tag: T6, Tag start position: 546
Word: फल-अभिसन्धि, Word start position: 556, Tag: T6, Tag start position: 568
Similar to @WiktorStribizew, but slight variation.
[A-Z]d
will look for exactly 1 letter followed by 1 digit, example ‘K7’
import re
f = "तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1 <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि"
cw = re.findall(r'<[^<>]+>[A-Z]d', f)
print(cw)
Output
['<गीता-शास्त्रम्>K7', '<समस्त-वेद>K1', '<दुर्विज्ञेय-अर्थम्>K1', '<तत्-अर्थ>T6', '<पद-अर्थ>T6', '<वाक्य-अर्थ>T6', '<अत्यन्त-विरुद्ध>K1', '<अनेक-अर्थ>K1', '<अर्थ-निर्धारण>T6']
To locate the position of each item found, below codes will output the index number (first character location):
for item in cw:
print(f.index(item))
9
32
80
105
152
164
202
222
285
I have data in devanagari that needs some extraction to be done. This is an example of a few lines
तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1 <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि
<अभ्युदय-अर्थः>T4 अपि यः <प्रवृत्ति-लक्षणः>Bs6 धर्मः वर्णान् आश्रमान् च उद्दिश्य विहितः सः <<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6 अपि सन् <<ईश्वर-अर्पण>T6-बुद्ध्या>T6 अनुष्ठीयमानः <सत्त्व-शुद्धये>T6 भवति <<फल-अभिसन्धि>T6-वर्जितः>T3
The alphanumerics are the tags of the text. I need to extract the binary compounds along with their tags (the alphanumerics immediately after the compound) from the line. Binary compounds are the two words hyphenated in the angular brackets.
<अभ्युदय-अर्थः>
<गीता-शास्त्रम्>
<विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6
The first two are both examples of binary compounds whereas the third one is not. The simplest way to identify a binary compound is to find two words hyphenated enclosed by one set of angular brackets and followed by a single tag. So after extraction, of say the first line, I should get a list with this in it
<गीता-शास्त्रम्>K7, <दुर्विज्ञेय-अर्थम्>K1
The code that I tried was this
import re
cw = re.findall('<(.*?)>', f)
tags = re.findall('[a-zA-Z0-9]+', f)
cc = re.sub("[<>a-zA-z0-9]", '', f)
print(cw, tags, cc)
This, unfortunately, finds everything in a list but I cannot map the tags to their original compounds this way. Is there a more intuitive way to do this?
You can use
re.findall(r'<([^<>]*)>(w+)', text)
See the regex demo. Details:
<([^<>]*)>
–<
, then zero or more chars other than<
and>
captured into Group 1, and then>
(w+)
– Group 2: one or more word chars.
See the Python demo:
import re
text = "तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1 <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामिn<अभ्युदय-अर्थः>T4 अपि यः <प्रवृत्ति-लक्षणः>Bs6 धर्मः वर्णान् आश्रमान् च उद्दिश्य विहितः सः <<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6 अपि सन् <<ईश्वर-अर्पण>T6-बुद्ध्या>T6 अनुष्ठीयमानः <सत्त्व-शुद्धये>T6 भवति <<फल-अभिसन्धि>T6-वर्जितः>T3"
matches = list(re.finditer(r'<([^<>]*)>(w+)', text))
# Show overall matches and their positions:
for m in matches:
print( "Match: ", m.group(), ", Start position: ", m.start(), sep="")
print("---")
# Show groups and their positions:
for m in matches:
print( "Word: ", m.group(1), ", Word start position: ", m.start(1),
", Tag: ", m.group(2), ", Tag start position: ", m.start(2), sep="")
Output:
Match: <गीता-शास्त्रम्>K7, Start position: 9
Match: <समस्त-वेद>K1, Start position: 32
Match: <दुर्विज्ञेय-अर्थम्>K1, Start position: 80
Match: <तत्-अर्थ>T6, Start position: 105
Match: <पद-अर्थ>T6, Start position: 152
Match: <वाक्य-अर्थ>T6, Start position: 164
Match: <अत्यन्त-विरुद्ध>K1, Start position: 202
Match: <अनेक-अर्थ>K1, Start position: 222
Match: <अर्थ-निर्धारण>T6, Start position: 285
Match: <अभ्युदय-अर्थः>T4, Start position: 341
Match: <प्रवृत्ति-लक्षणः>Bs6, Start position: 366
Match: <देव-आदि>Bs6, Start position: 436
Match: <ईश्वर-अर्पण>T6, Start position: 489
Match: <सत्त्व-शुद्धये>T6, Start position: 530
Match: <फल-अभिसन्धि>T6, Start position: 555
---
Word: गीता-शास्त्रम्, Word start position: 10, Tag: K7, Tag start position: 25
Word: समस्त-वेद, Word start position: 33, Tag: K1, Tag start position: 43
Word: दुर्विज्ञेय-अर्थम्, Word start position: 81, Tag: K1, Tag start position: 100
Word: तत्-अर्थ, Word start position: 106, Tag: T6, Tag start position: 115
Word: पद-अर्थ, Word start position: 153, Tag: T6, Tag start position: 161
Word: वाक्य-अर्थ, Word start position: 165, Tag: T6, Tag start position: 176
Word: अत्यन्त-विरुद्ध, Word start position: 203, Tag: K1, Tag start position: 219
Word: अनेक-अर्थ, Word start position: 223, Tag: K1, Tag start position: 233
Word: अर्थ-निर्धारण, Word start position: 286, Tag: T6, Tag start position: 300
Word: अभ्युदय-अर्थः, Word start position: 342, Tag: T4, Tag start position: 356
Word: प्रवृत्ति-लक्षणः, Word start position: 367, Tag: Bs6, Tag start position: 384
Word: देव-आदि, Word start position: 437, Tag: Bs6, Tag start position: 445
Word: ईश्वर-अर्पण, Word start position: 490, Tag: T6, Tag start position: 502
Word: सत्त्व-शुद्धये, Word start position: 531, Tag: T6, Tag start position: 546
Word: फल-अभिसन्धि, Word start position: 556, Tag: T6, Tag start position: 568
Similar to @WiktorStribizew, but slight variation.
[A-Z]d
will look for exactly 1 letter followed by 1 digit, example ‘K7’
import re
f = "तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1 <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि"
cw = re.findall(r'<[^<>]+>[A-Z]d', f)
print(cw)
Output
['<गीता-शास्त्रम्>K7', '<समस्त-वेद>K1', '<दुर्विज्ञेय-अर्थम्>K1', '<तत्-अर्थ>T6', '<पद-अर्थ>T6', '<वाक्य-अर्थ>T6', '<अत्यन्त-विरुद्ध>K1', '<अनेक-अर्थ>K1', '<अर्थ-निर्धारण>T6']
To locate the position of each item found, below codes will output the index number (first character location):
for item in cw:
print(f.index(item))
9
32
80
105
152
164
202
222
285