How to find ellipses in a text string in Python?
Question:
Fairly new to Python (And Stack Overflow!) here. I have a data set with subject line data (text strings) that I am working on building a bag of words model with. I’m creating new variables that flag a 0 or 1 for various possible scenarios, but I’m stuck trying to identify where there is an ellipsis (“…”) in the text. Here’s where I’m starting from:
Data_Frame['Elipses'] = Data_Frame.Subject_Line.str.match(r'(\w+)\.{2,}(.+)')
Inputting (‘…’) doesn’t work for obvious reasons, but the above RegEx code was suggested–still not working. Also tried this:
Data_Frame['Elipses'] = Data_Frame.Subject_Line.str.match('...')
No dice.
The above code shell works for other variables I’ve created, but I’m also having trouble creating a 0-1 output instead of True/False (would be an ‘as.numeric’ argument in R.) Any help here would also be appreciated.
Thanks!
Answers:
Using search() instead of match() would spot an ellipsis at any point in the text, because match() only matches at the start of the string. In pandas, str.contains() supports regular expressions and searches anywhere in the string. For example, in pandas:
import pandas as pd
df = pd.DataFrame({'Text' : ["hello..", "again... this", "is......a test", "Real ellipses… here", "...not here"]})
df['Ellipses'] = df.Text.str.contains(r'\w+(\.{3,})|…')
print(df)
Giving you:
                  Text  Ellipses
0              hello..     False
1        again... this      True
2       is......a test      True
3  Real ellipses… here      True
4          ...not here     False
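For the 0/1 flag asked about in the question, one option is to cast the boolean result with astype(int). This is a minimal sketch, not part of the pattern shown above: dropping the capture group and passing na=False are assumptions made here, to avoid the pandas warning about match groups and to count missing subject lines as 0.

```python
import pandas as pd

df = pd.DataFrame({"Text": ["hello..", "again... this", "is......a test",
                            "Real ellipses… here", "...not here"]})

# Same idea as above, but without a capture group: a word followed by
# three or more literal dots, or the single-character ellipsis.
pattern = r"\w+\.{3,}|…"

# na=False makes missing values count as "no ellipsis" so the cast to int is safe.
df["Ellipses"] = df["Text"].str.contains(pattern, na=False).astype(int)

flags = df["Ellipses"].tolist()
print(flags)
```

The astype(int) call plays roughly the role that as.numeric would in R: True becomes 1 and False becomes 0.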
Or without pandas:
import re
for test in ["hello..", "again... this", "is......a test", "Real ellipses… here", "...not here"]:
    print(int(bool(re.search(r'\w+(\.{3,})|…', test))))
This matches the three middle test strings, giving:
0
1
1
1
0
Take a look at the "search() vs. match()" section of the Python re documentation for a good explanation.
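To make the difference concrete, here is a small sketch using only the standard library. The sample strings and variable names are made up for illustration. It also shows why an unescaped '...' appears to "work" on almost anything: each dot matches any character.

```python
import re

text = "again... this"
literal_dots = r"\.{3,}"  # three or more literal dots

# match() only tries at position 0, where the text starts with "a".
at_start = re.match(literal_dots, text)

# search() scans the whole string and finds the dots after "again".
anywhere = re.search(literal_dots, text)

# Unescaped dots are wildcards, so '...' matches any three characters.
wildcard = re.match("...", "abc")

print(at_start)
print(anywhere.group(0))
print(wildcard.group(0))
```

So the pattern needs both escaped dots and a search-style method before it flags ellipses in the middle of a subject line.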
To display the matching words:
import re
for test in ["hello..", "again... this", "is......a test", "...def"]:
    # Capture the word that comes directly before three or more dots
    ellipses = re.search(r'(\w+)\.{3,}', test)
    if ellipses:
        print(ellipses.group(1))
Giving you:
again
is
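If the data is already in a DataFrame, the same capture can probably be done with str.extract(). This is a sketch under the assumption that a column named Text holds the strings; the column names Text and Word are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"Text": ["hello..", "again... this", "is......a test", "...def"]})

# expand=False returns a Series; rows without a match come back as NaN.
df["Word"] = df["Text"].str.extract(r"(\w+)\.{3,}", expand=False)

words = df["Word"].dropna().tolist()
print(words)
```

Rows with no word directly before the dots are left as missing values rather than dropped, so the result stays aligned with the original rows.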