Complex data cleaning using regex in Python

Question

I have data in Devanagari that needs some extraction. Here are a few example lines:

तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1 <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि

<अभ्युदय-अर्थः>T4 अपि यः <प्रवृत्ति-लक्षणः>Bs6 धर्मः वर्णान् आश्रमान् च उद्दिश्य विहितः सः <<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6 अपि सन् <<ईश्वर-अर्पण>T6-बुद्ध्या>T6 अनुष्ठीयमानः <सत्त्व-शुद्धये>T6 भवति <<फल-अभिसन्धि>T6-वर्जितः>T3

The alphanumerics are the tags of the text. I need to extract the binary compounds along with their tags (the alphanumerics immediately after the compound) from each line. A binary compound is two hyphenated words inside angle brackets.

<अभ्युदय-अर्थः>

<गीता-शास्त्रम्>

<विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6

The first two are binary compounds, whereas the third is not. The simplest way to identify a binary compound is to find two hyphenated words enclosed in a single pair of angle brackets and followed by a single tag. So after extraction from, say, the first line, I should get a list containing:
<गीता-शास्त्रम्>K7, <दुर्विज्ञेय-अर्थम्>K1

The code that I tried was this

import re
cw = re.findall('<(.*?)>', f)
tags = re.findall('[a-zA-Z0-9]+', f)
cc = re.sub("[<>a-zA-Z0-9]", '', f)
print(cw, tags, cc)

Unfortunately, this collects everything into separate lists, so I cannot map the tags back to their original compounds. Is there a more intuitive way to do this?

Asked By: Adideva98

Answer 1

You can use

re.findall(r'<([^<>]*)>(\w+)', text)

Details:

<([^<>]*)> - <, then zero or more chars other than < and > captured into Group 1, and then >
(\w+) - Group 2: one or more word chars (letters, digits, underscore).

See the Python demo:

import re
text = "तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1  <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि\n<अभ्युदय-अर्थः>T4 अपि यः <प्रवृत्ति-लक्षणः>Bs6 धर्मः वर्णान् आश्रमान् च उद्दिश्य विहितः  सः <<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6 अपि सन्  <<ईश्वर-अर्पण>T6-बुद्ध्या>T6 अनुष्ठीयमानः <सत्त्व-शुद्धये>T6 भवति <<फल-अभिसन्धि>T6-वर्जितः>T3"
matches = list(re.finditer(r'<([^<>]*)>(\w+)', text))
# Show overall matches and their positions:
for m in matches:
    print( "Match: ", m.group(), ", Start position: ", m.start(), sep="")
print("---")
# Show groups and their positions:
for m in matches:
    print( "Word: ", m.group(1), ", Word start position: ", m.start(1),
           ", Tag: ", m.group(2), ", Tag start position: ", m.start(2), sep="")

Output:

Match: <गीता-शास्त्रम्>K7, Start position: 9
Match: <समस्त-वेद>K1, Start position: 32
Match: <दुर्विज्ञेय-अर्थम्>K1, Start position: 80
Match: <तत्-अर्थ>T6, Start position: 105
Match: <पद-अर्थ>T6, Start position: 152
Match: <वाक्य-अर्थ>T6, Start position: 164
Match: <अत्यन्त-विरुद्ध>K1, Start position: 202
Match: <अनेक-अर्थ>K1, Start position: 222
Match: <अर्थ-निर्धारण>T6, Start position: 285
Match: <अभ्युदय-अर्थः>T4, Start position: 341
Match: <प्रवृत्ति-लक्षणः>Bs6, Start position: 366
Match: <देव-आदि>Bs6, Start position: 436
Match: <ईश्वर-अर्पण>T6, Start position: 489
Match: <सत्त्व-शुद्धये>T6, Start position: 530
Match: <फल-अभिसन्धि>T6, Start position: 555
---
Word: गीता-शास्त्रम्, Word start position: 10, Tag: K7, Tag start position: 25
Word: समस्त-वेद, Word start position: 33, Tag: K1, Tag start position: 43
Word: दुर्विज्ञेय-अर्थम्, Word start position: 81, Tag: K1, Tag start position: 100
Word: तत्-अर्थ, Word start position: 106, Tag: T6, Tag start position: 115
Word: पद-अर्थ, Word start position: 153, Tag: T6, Tag start position: 161
Word: वाक्य-अर्थ, Word start position: 165, Tag: T6, Tag start position: 176
Word: अत्यन्त-विरुद्ध, Word start position: 203, Tag: K1, Tag start position: 219
Word: अनेक-अर्थ, Word start position: 223, Tag: K1, Tag start position: 233
Word: अर्थ-निर्धारण, Word start position: 286, Tag: T6, Tag start position: 300
Word: अभ्युदय-अर्थः, Word start position: 342, Tag: T4, Tag start position: 356
Word: प्रवृत्ति-लक्षणः, Word start position: 367, Tag: Bs6, Tag start position: 384
Word: देव-आदि, Word start position: 437, Tag: Bs6, Tag start position: 445
Word: ईश्वर-अर्पण, Word start position: 490, Tag: T6, Tag start position: 502
Word: सत्त्व-शुद्धये, Word start position: 531, Tag: T6, Tag start position: 546
Word: फल-अभिसन्धि, Word start position: 556, Tag: T6, Tag start position: 568
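
The pattern above returns every innermost compound, including those that sit inside a larger compound. If only stand-alone compounds are wanted, as the expected output in the question suggests, one possible variant adds lookarounds. This is a sketch rather than part of the original answer: the name top_level_compounds is hypothetical, and "stand-alone" is assumed to mean that no < or - comes directly before the opening bracket and no -, > or further word character comes directly after the tag.

```python
import re

# Assumption: a stand-alone binary compound is <word-word>TAG where
#   - the opening "<" is not directly preceded by "<" or "-", and
#   - the tag is not directly followed by "-", ">" or another word character.
TOP_LEVEL = re.compile(r'(?<![<-])<([^<>-]+)-([^<>-]+)>(\w+)(?![\w>-])')

def top_level_compounds(text):
    """Return (compound, tag) pairs for compounds not nested in a larger one."""
    return [(f"{m.group(1)}-{m.group(2)}", m.group(3))
            for m in TOP_LEVEL.finditer(text)]

line = "तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1 <<तत्-अर्थ>T6-आविष्करणाय>T6"
for compound, tag in top_level_compounds(line):
    print(f"<{compound}>{tag}")
```

The lookahead includes \w so that the engine cannot backtrack to a shorter tag (for example T instead of T6) just to satisfy the "not followed by - or >" condition.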

Answered By: Wiktor Stribiżew

Answer 2

Similar to @WiktorStribizew's answer, with a slight variation.

[A-Z]\d matches exactly one uppercase letter followed by one digit, for example 'K7'. Tags of any other shape, such as Bs6, are not matched.

import re
f = "तत् इदम् <गीता-शास्त्रम्>K7 <<<<<समस्त-वेद>K1-अर्थ>T6-सार>T6-संग्रह>T6-भूतम्>T2 <दुर्विज्ञेय-अर्थम्>K1  <<तत्-अर्थ>T6-आविष्करणाय>T6 अनेकैः <विवृत-<<<पद-<पद-अर्थ>T6-<वाक्य-अर्थ>T6>Di-न्यायम्>T6>Bs6 अपि <<अत्यन्त-विरुद्ध>K1-<अनेक-अर्थ>K1>K1 त्वेन लौकिकैः गृह्यमाणम् उपलभ्य अहम् विवेकतः <<अर्थ-निर्धारण>T6-अर्थम्>T4 संक्षेपतः विवरणम् करिष्यामि"
cw = re.findall(r'<[^<>]+>[A-Z]\d', f)
print(cw)

Output

['<गीता-शास्त्रम्>K7', '<समस्त-वेद>K1', '<दुर्विज्ञेय-अर्थम्>K1', '<तत्-अर्थ>T6', '<पद-अर्थ>T6', '<वाक्य-अर्थ>T6', '<अत्यन्त-विरुद्ध>K1', '<अनेक-अर्थ>K1', '<अर्थ-निर्धारण>T6']

To locate each item found, the code below prints the index of its first character:

for item in cw:
    print(f.index(item))

9
32
80
105
152
164
202
222
285
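
One caveat with f.index(item): str.index always returns the first occurrence, so a compound that appears more than once in the text is reported at the same position every time. A possible alternative, sketched here on a hypothetical ASCII sample rather than the real data, is to take the offsets from the match objects that re.finditer yields.

```python
import re

# Hypothetical sample with a repeated compound, to show where str.index falls short.
sample = "<a-b>K1 x <a-b>K1"

# Same tag shape as in the answer above: one uppercase letter, then one digit.
pattern = re.compile(r'<[^<>]+>[A-Z]\d')

# Each match object carries its own start offset.
found = [(m.group(), m.start()) for m in pattern.finditer(sample)]
print(found)

# For comparison: str.index reports the first occurrence for both items.
first_only = [sample.index(item) for item, _ in found]
print(first_only)
```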

Answered By: perpetualstudent
