How can I remove characters after underscore?
Question:
I need to convert a text like
photo_id 102297_skdjksd223238 text black dog in a water
to
photo_id 102297 text black dog in a water
by removing the substring after the underscore
inputFile = open("text.txt", "r")
exportFile = open("result", "w")
sub_str = "_"
for line in inputFile:
    new_line = line[:line.index(sub_str) + len(sub_str)]
    exportFile.writelines(new_line)
but I couldn't reach the second underscore, as this removed all the text after the first one (the one in photo_id).
Answers:
You might use a pattern that captures the leading digits to make it a bit more specific, and then matches the underscore followed by optional non-whitespace characters.
In the replacement use the first capture group.
\b(\d+)_\S*
Explanation
\b
A word boundary to prevent a partial word match
(\d+)
Capture group 1, match 1+ digits
_\S*
Match an underscore and optional non-whitespace characters
See a regex101 demo.
import re
pattern = r"\b(\d+)_\S*"
s = "photo_id 102297_skdjksd223238 text black dog in a water"
result = re.sub(pattern, r"\1", s)
if result:
    print(result)
Output
photo_id 102297 text black dog in a water
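Since the question reads from a file, the same substitution can be applied line by line. The following is only a sketch under that assumption: strip_suffix and convert_file are hypothetical helper names, and the file names are the ones used in the question.

```python
import re

# Same pattern as above: leading digits, then an underscore and the rest of the token.
PATTERN = re.compile(r"\b(\d+)_\S*")


def strip_suffix(line: str) -> str:
    """Keep the captured digits and drop the underscore and what follows it."""
    return PATTERN.sub(r"\1", line)


def convert_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, trimming each line (hypothetical helper)."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            # \S* never matches the newline, so line endings are preserved.
            dst.write(strip_suffix(line))


print(strip_suffix("photo_id 102297_skdjksd223238 text black dog in a water"))
```

With the files from the question, the call would be convert_file("text.txt", "result"). The with statement also closes both files, which the original loop did not do.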
Another option that includes photo_id and matches up to the first underscore after it:
\b(photo_id\s+[^_\s]+)_\S*
See another regex101 demo.
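For completeness, this second pattern can be used with re.sub in the same way as the first one. This is a minimal sketch on the sample string from the question; the second input is a made-up string that only shows text without photo_id being left alone.

```python
import re

# Keep "photo_id", the whitespace, and the id up to (not including) the underscore.
pattern = r"\b(photo_id\s+[^_\s]+)_\S*"

s = "photo_id 102297_skdjksd223238 text black dog in a water"
result = re.sub(pattern, r"\1", s)
print(result)

# Tokens that do not follow "photo_id" are not touched.
untouched = re.sub(pattern, r"\1", "foo_baz 12_ab bar")
print(untouched)
```

Because the pattern is tied to the literal photo_id, it is stricter than the digit-based one and will not trim other number_suffix tokens in the text.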
Note: The question was tagged regex when I wrote this:
_[^\s]*
_ - a literal underscore
[^\s]* (or \S* if supported) - any character but whitespace, zero or more times
Substitute with a blank string.
import re
inp = 'photo_id 102297_skdjksd223238 text black dog in a water foo_baz bar'
res = re.sub(r'_[^\s]*', '', inp)
print(res)
Output
photo 102297 text black dog in a water foo bar
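Note that in this output photo_id itself became photo, because the pattern removes every underscore together with the rest of its token. If that is not wanted, a side-by-side comparison with the digit-anchored pattern from the other answer may help. This is only an illustrative sketch; the variable names broad and narrow are arbitrary.

```python
import re

inp = 'photo_id 102297_skdjksd223238 text black dog in a water foo_baz bar'

# Broad pattern: every underscore and the rest of its token is removed.
broad = re.sub(r'_[^\s]*', '', inp)

# Digit-anchored pattern: only tokens that start with digits are trimmed.
narrow = re.sub(r'\b(\d+)_\S*', r'\1', inp)

print(broad)   # photo 102297 text black dog in a water foo bar
print(narrow)  # photo_id 102297 text black dog in a water foo_baz bar
```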
You could split on the last underscore (the first one from the right):
s = "photo_id 102297_skdjksd223238 text black dog in a water"
prefix, suffix = s.rsplit('_', 1)
print(f"{prefix} {suffix.split(' ', 1)[-1]}")
Out:
photo_id 102297 text black dog in a water
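The two-name unpacking above raises a ValueError when a line has no underscore at all, which can happen when looping over a file. A slightly more defensive variant, sketched here with str.rpartition instead of rsplit, could look like this; drop_after_last_underscore is a hypothetical name.

```python
def drop_after_last_underscore(s: str) -> str:
    """Remove the last underscore and the rest of the token it belongs to."""
    # rpartition returns ("", "", s) when there is no underscore.
    prefix, sep, suffix = s.rpartition("_")
    if not sep:
        return s
    # Drop what followed the underscore up to the next space, keep the rest.
    _, space, rest = suffix.partition(" ")
    return f"{prefix}{space}{rest}"


print(drop_after_last_underscore("photo_id 102297_skdjksd223238 text black dog in a water"))
```

Like the original, this always acts on the last underscore in the string, so it assumes the id token is the last place an underscore appears on the line.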