How to extract the text between every two regex matches in a large chunk of text in Python?
Question:
I have a df with just one column ‘text’ that contains a large chunk of text structured like:
text
"dd ee Apple A1 a b c d Apple A2 e f g Apple B1 hi g Apple C1 r 5 6 Apple D1..."
How do I get the output below:
text_new
Apple A1 a b c d
Apple A2 e f g
Apple B1 hi g
Apple C1 r 5 6
Apple D1 ...
...
such that every row contains all the text from one occurrence of the regex "Apple[space][letter][number]" up to the next?
Answers:
Try this:
s = "Apple A1 a b c d Apple A2 e f g Apple B1 hi g Apple C1 r 5 6 Apple D1..."
result = ["Apple " + p for p in s.split("Apple ") if p]
print('\n'.join(result))
One-liner approach:
print(*(f'Apple {p}' for p in "Apple A1 a b c d Apple A2 e f g Apple B1 hi g Apple C1 r 5 6 Apple D1...".split("Apple ") if p), sep='\n')
Prints:
Apple A1 a b c d
Apple A2 e f g
Apple B1 hi g
Apple C1 r 5 6
Apple D1...
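The sample text in the question starts with "dd ee " before the first marker, which a plain split on "Apple " would glue onto the first row. Below is a minimal sketch of one way around that, assuming Python 3.7+ (where re.split can split on zero-width matches) and a marker of the form Apple[space][uppercase letter][digit]. The names MARKER and split_on_marker are made up for illustration.

```python
import re

# Assumed marker shape: "Apple", a space, one uppercase letter, one digit.
MARKER = r"Apple [A-Z]\d"

def split_on_marker(text):
    """Return the chunks of text that start at each marker occurrence."""
    # Zero-width lookahead: split *before* each marker without consuming it.
    pieces = re.split(rf"(?={MARKER})", text)
    # Drop anything before the first marker (e.g. "dd ee ") and trim spaces.
    return [p.strip() for p in pieces if re.match(MARKER, p)]

rows = split_on_marker("dd ee Apple A1 a b c d Apple A2 e f g Apple B1 hi g")
print("\n".join(rows))
```

Because the split only happens where the full marker matches, a stray "Apple" that is not followed by a letter and a digit does not start a new row.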
Your example doesn’t need a regular expression, but this should work if "Apple" might appear elsewhere in the text and the split has to happen only at Apple[space][letter][digit].
import re
text = "Apple A1 a b c d Apple A2 e f g Apple B1 hi g Apple C1 r 5 6 Apple D1..."
print(re.sub(r' (Apple [A-Z]\d)', r'\n\1', text))
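re.sub returns a new string, so to get one item per row the result still has to be split on the inserted newlines. A minimal sketch, assuming the marker is a single uppercase letter followed by one digit (the variable names with_newlines and rows are illustrative):

```python
import re

text = "Apple A1 a b c d Apple A2 e f g Apple B1 hi g Apple C1 r 5 6 Apple D1..."

# Replace the space before each marker with a newline;
# \1 re-inserts the captured marker itself.
with_newlines = re.sub(r" (Apple [A-Z]\d)", r"\n\1", text)

# One list item per line.
rows = with_newlines.splitlines()
print(rows)
```

The resulting list can be passed straight to pd.DataFrame({'text_new': rows}) if a DataFrame is needed.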
With Series.str.findall and a specific regex pattern:
pd.DataFrame({'text_col': df['text'].str.findall(r'(Apple .+?)(?=Apple|$)')[0]})
text_col
0 Apple A1 a b c d
1 Apple A2 e f g
2 Apple B1 hi g
3 Apple C1 r 5 6
4 Apple D1...
Or with Series.str.extractall:
df['text'].str.extractall(r'(Apple .+?)(?=Apple|$)').reset_index(drop=True)
0
0 Apple A1 a b c d
1 Apple A2 e f g
2 Apple B1 hi g
3 Apple C1 r 5 6
4 Apple D1...
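For completeness, here is a hedged end-to-end sketch that starts from a DataFrame like the one in the question, including the leading "dd ee ". It assumes pandas 0.25 or later (for Series.explode) and tightens the pattern to Apple[space][uppercase letter][digit]. The names pattern, text_new and df_new are illustrative.

```python
import pandas as pd

# One-row frame shaped like the one in the question.
df = pd.DataFrame({"text": ["dd ee Apple A1 a b c d Apple A2 e f g Apple B1 hi g"]})

# Lazily match from one marker up to the next marker or the end of the string.
pattern = r"Apple [A-Z]\d.*?(?=Apple [A-Z]\d|$)"

text_new = (
    df["text"]
    .str.findall(pattern)    # one list of matches per input row
    .explode()               # one match per output row
    .str.strip()             # drop the space left before the next marker
    .reset_index(drop=True)
)
df_new = text_new.to_frame("text_new")
print(df_new)
```

Unlike indexing the findall result with [0], explode handles every row of the column, so this also works if the frame has more than one row of text.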
I do not like racking my brain over complicated regex, so let me offer an alternative.
Split the string at 'Apple' and remove the surrounding whitespace with .strip(). Because the string starts with 'Apple', the split produces one empty element at the start of the list. The "if substr" condition at the end of the list comprehension filters it out.
import pandas as pd

my_str = "Apple A1 a b c d Apple A2 e f g Apple B1 hi g Apple C1 r 5 6 Apple D1"
# Split the input at 'Apple', skipping the empty leading element
substrings = [substr.strip() for substr in my_str.split('Apple') if substr]
# Add 'Apple' back to the beginning of each substring
result = ['Apple ' + substr for substr in substrings]
df = pd.DataFrame({'text_new': result})
print(df)
Output:
text_new
0 Apple A1 a b c d
1 Apple A2 e f g
2 Apple B1 hi g
3 Apple C1 r 5 6
4 Apple D1