Python – Extracting only necessary elements from a string

Question:

I’m trying to extract only the parts I need from the table.

    2555    texttext    0   100 100 0   0   0   0   lowness 0
    2557    texttext    10  650 660 0   0   0   0   lowness 0
    2564    texttext    0   30  30  0   0   0   0   lowness 0
    2566    texttext    0   0   0   0   0   0   0   lowness 0
    2567    texttext    10  70  80  0   0   0   0   lowness 0

All I need is 'texttext', the two numbers that immediately follow it, and 'lowness', as shown below.

    texttext    0   100 lowness
    texttext    10  650 lowness
    texttext    0   30  lowness
    texttext    0   0   lowness
    texttext    10  70  lowness

I tried this but failed.

text = """
    2555    texttext    0   100 100 0   0   0   0   lowness 0
    2557    texttext    10  650 660 0   0   0   0   lowness 0
    2564    texttext    0   30  30  0   0   0   0   lowness 0
    2566    texttext    0   0   0   0   0   0   0   lowness 0
    2567    texttext    10  70  80  0   0   0   0   lowness 0
"""

for a in text.split('\n'):
    if a == "":
        continue
    else:
        print(a)
        m = re.match('(^\D\d*\D)(\w*\s)(\d*\s)(\d*\s)(\d*\s\d*\s\d*\s\d*\s\d*\s)(\w+)', a)
        print(m)
        print(m.group(2), m.group(3), m.group(4), m.group(6))

I tried to group the fields with a regex and pull out the parts I need, but I got the following error. Please help.

print(m.group(2), m.group(3), m.group(4), m.group(6))
AttributeError: 'NoneType' object has no attribute 'group'

Asked By: anfwkdrn


Answers:

Try this:

for a in text.split('\n'):
    if a == "":
        continue
    else:
        parts = a.split()
        print(parts[1],parts[2],parts[3],parts[9])

If you absolutely want to use a regular expression:

import re

text = """
    2555    texttext    0   100 100 0   0   0   0   lowness 0
    2557    texttext    10  650 660 0   0   0   0   lowness 0
    2564    texttext    0   30  30  0   0   0   0   lowness 0
    2566    texttext    0   0   0   0   0   0   0   lowness 0
    2567    texttext    10  70  80  0   0   0   0   lowness 0
"""
pattern = re.compile(
    r"\s*\d+\s+(\w+)\s+(\d+)\s+(\d+)\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+(\w+)\s+"
)

for line in text.strip().split('\n'):
    match = re.search(pattern, line)
    print(*match.groups())

Output:

texttext 0 100 lowness
texttext 10 650 lowness
texttext 0 30 lowness
texttext 0 0 lowness
texttext 10 70 lowness

But if every line really does have the same number of whitespace-separated fields, you are probably better off just splitting each line on whitespace:

for line in text.strip().split('\n'):
    parts = line.split()
    print(parts[1], parts[2], parts[3], parts[9])

Same output.
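
If some lines might be blank or shorter than expected, the split approach could be wrapped in a small helper that skips them instead of raising IndexError. This is only a sketch: the name extract_fields and the ten-field minimum are assumptions based on the sample data in the question, not something the question specifies.

```python
def extract_fields(text):
    """Collect (name, first, second, label) per line.

    Hypothetical helper; assumes at least 10 whitespace-separated
    fields per data line, as in the sample from the question.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 10:
            # Blank or malformed line: skip rather than raise IndexError.
            continue
        rows.append((parts[1], parts[2], parts[3], parts[9]))
    return rows


sample = """
    2555    texttext    0   100 100 0   0   0   0   lowness 0
    2557    texttext    10  650 660 0   0   0   0   lowness 0
    too short
"""

rows = extract_fields(sample)
for row in rows:
    print(*row)
```

Returning tuples instead of printing directly keeps the parsing separate from the output, which may be handy if the values are needed later.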

Answered By: Daniil Fajnberg

You are not getting a match because the pattern uses a single \D and a single \s in several places, and each of those matches exactly one character.

But in the example data, those characters are repeated several times before the next field starts.

If you fix that, you will get a match but with the wrong data in the groups, see https://regex101.com/r/v3ddai/1
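
A minimal sketch of that diagnosis, assuming the question's pattern was written with the usual backslash escapes (\D, \d, \w, \s). The sample line is copied from the question; the None check is an addition that avoids the AttributeError when nothing matches.

```python
import re

line = "    2555    texttext    0   100 100 0   0   0   0   lowness 0"

# The question's pattern, assuming the usual backslash escapes.
original = r"(^\D\d*\D)(\w*\s)(\d*\s)(\d*\s)(\d*\s\d*\s\d*\s\d*\s\d*\s)(\w+)"

m = re.match(original, line)
if m is None:
    # re.match returns None on failure, so calling m.group(...) would
    # raise AttributeError, as seen in the question.
    print("no match")
else:
    print(m.groups())
```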


Instead, you can just use 2 capture groups.

As there always seem to be digits present, you can change \d* to \d+

^\s*\d+\s+(\w+\s+\d+\s+\d+\s+)\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+(\w+)

Regex demo
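
One way the two-group pattern might be used from Python, shown as a sketch rather than the only option. The sample text is shortened from the question, and joining the pieces with single spaces is an assumption about the desired output format.

```python
import re

text = """
    2555    texttext    0   100 100 0   0   0   0   lowness 0
    2557    texttext    10  650 660 0   0   0   0   lowness 0
"""

pattern = re.compile(
    r"^\s*\d+\s+(\w+\s+\d+\s+\d+\s+)\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+(\w+)"
)

results = []
for line in text.splitlines():
    m = pattern.match(line)
    if m:  # blank or non-matching lines give None and are skipped
        # group(1) keeps the original spacing, so split and rejoin it.
        results.append(" ".join(m.group(1).split() + [m.group(2)]))

for r in results:
    print(r)
```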

Answered By: The fourth bird

for e in text.splitlines():
    if e:
        ls = e.split()
        print(ls[1:4] + ls[-2:-1])

Output:

['texttext', '0', '100', 'lowness']
['texttext', '10', '650', 'lowness']
['texttext', '0', '30', 'lowness']
['texttext', '0', '0', 'lowness']
['texttext', '10', '70', 'lowness']

Answered By: LetzerWille