How to extract the text between every two regex matches in a large chunk of text in Python?

Question:

I have a df with just one column ‘text’ that contains a large chunk of text structured like:

text
"dd ee Apple A1 a b c d Apple A2 e f g Apple B1 hi g Apple C1 r 5 6 Apple D1..."

How do I get the output below:

text_new
Apple A1 a b c d 
Apple A2 e f g 
Apple B1 hi g
Apple C1 r 5 6
Apple D1 ...
...

That is, every row should include all the text between consecutive occurrences of the regex "Apple[space][letter][number]".

Asked By: FloriaT


Answers:

Try this:

s = "Apple A1 a b c d Apple A2 e f g Apple B1 hi g Apple C1 r 5 6 Apple D1..."

result = ["Apple " + p
          for p in s.split("Apple ") if p]

print('\n'.join(result))

One-liner approach:

print(*(f'Apple {p}' for p in "Apple A1 a b c d Apple A2 e f g Apple B1 hi g Apple C1 r 5 6 Apple D1...".split("Apple ") if p), sep='\n')

Prints:

Apple A1 a b c d 
Apple A2 e f g 
Apple B1 hi g 
Apple C1 r 5 6 
Apple D1...
Answered By: rv.kvetch
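
One caveat for the split-based approach above: the text in the question starts with "dd ee" before the first "Apple", and a plain split would turn that prefix into a row of its own. Below is a minimal sketch, assuming anything before the first "Apple " should simply be discarded. The helper name split_on_apple is made up for illustration.

```python
def split_on_apple(s, marker="Apple "):
    """Return the segments of s that start with marker, dropping any leading text."""
    # partition() separates everything before the first marker from the rest
    _, found, tail = s.partition(marker)
    if not found:
        return []  # marker never occurs
    # Re-attach the marker that split() consumes and trim stray whitespace
    return [(marker + part).strip() for part in tail.split(marker) if part.strip()]


s = "dd ee Apple A1 a b c d Apple A2 e f g Apple B1 hi g"
print("\n".join(split_on_apple(s)))
```

This still splits on every literal "Apple ", so it only fits when that word never appears inside a segment.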

Your example doesn’t need a regular expression, but this should work if "Apple" might appear elsewhere in the text and the split has to happen only on Apple[space][letter][digit].

import re

text = "Apple A1 a b c d Apple A2 e f g Apple B1 hi g Apple C1 r 5 6 Apple D1..."

# Replace the space before each "Apple <letter><digit>" marker with a newline
print(re.sub(r' (Apple [A-Z]\d)', r'\n\1', text))
Answered By: Jules SJ
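
The substitution above returns one newline-joined string. If a list of segments is more convenient, a lookahead-based re.findall is one option. This is only a sketch: it assumes the marker is always "Apple", a space, one uppercase letter and one digit, and the names MARKER, SEGMENT and extract_segments are hypothetical.

```python
import re

# Marker: "Apple", a space, one uppercase letter, one digit (assumed format)
MARKER = r"Apple [A-Z]\d"
# Lazily take everything after a marker, stopping before the next marker or the end
SEGMENT = re.compile(rf"{MARKER}.*?(?=\s*{MARKER}|$)", re.DOTALL)


def extract_segments(text):
    """Return each 'Apple Xn ...' segment, ignoring text before the first marker."""
    return SEGMENT.findall(text)


text = "dd ee Apple A1 a b c d Apple A2 e f g Apple B1 hi g Apple C1 r 5 6 Apple D1..."
for segment in extract_segments(text):
    print(segment)
```

Because the lookahead requires the full marker, a stray "Apple" that is not followed by a letter and a digit stays inside the current segment instead of starting a new one.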

With Series.str.findall and a specific regex pattern:

pd.DataFrame({'text_col': df['text'].str.findall(r'(Apple .+?)(?=Apple|$)')[0]})

            text_col
0  Apple A1 a b c d 
1    Apple A2 e f g 
2     Apple B1 hi g 
3    Apple C1 r 5 6 
4       Apple D1...

Or with Series.str.extractall:

df['text'].str.extractall(r'(Apple .+?)(?=Apple|$)').reset_index(drop=True)

                   0
0  Apple A1 a b c d 
1    Apple A2 e f g 
2     Apple B1 hi g 
3    Apple C1 r 5 6 
4       Apple D1...
Answered By: RomanPerekhrest
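
Note that the [0] in the findall version above only looks at the first row of df. To handle every row and to key on the full Apple[space][letter][digit] marker, something like the following might work. It is a sketch, assuming a column named 'text' as in the question and a pandas version that provides Series.explode (0.25 or later); the sample data is invented.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "dd ee Apple A1 a b c d Apple A2 e f g",
    "Apple B1 hi g Apple C1 r 5 6",
]})

# No capture group, so findall returns the whole match for each segment
pattern = r"Apple [A-Z]\d.*?(?=\s*Apple [A-Z]\d|$)"

text_new = (
    df["text"]
    .str.findall(pattern)      # one list of segments per input row
    .explode()                 # one segment per output row
    .dropna()                  # input rows without any marker produce NaN
    .reset_index(drop=True)
    .to_frame("text_new")
)
print(text_new)
```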

I do not like wrecking my head with complicated regex. So let me offer an alternative.

Split the string at ‘Apple’ and remove the surrounding whitespace with .strip(). The result will contain one empty element from the start of my_str, which can be removed by keeping only the non-empty elements. The "if substr" condition at the end of the list comprehension performs this task.

import pandas as pd

my_str = "Apple A1 a b c d Apple A2 e f g Apple B1 hi g Apple C1 r 5 6 Apple D1"

# Split the input at 'Apple'
substrings = [substr.strip() for substr in my_str.split('Apple') if substr]

# add 'Apple' to the beginning of each substring
result = ['Apple ' + substr for substr in substrings]

df = pd.DataFrame({'text_new': result})

Output:

    text_new
0   Apple A1 a b c d
1   Apple A2 e f g
2   Apple B1 hi g
3   Apple C1 r 5 6
4   Apple D1

Answered By: Ugyen Norbu