How to match a changing pattern in Python?
Question:
So I have a collection of lyrics from different artists, but in the middle of all the lyrics there is always an advertisement I want to remove. It looks like this:
‘lyric lyric See John Mayer LiveGet tickets as low as $53 lyric lyric’
More generally, the pattern is always: ‘See ARTIST LiveGet tickets as low as $NUMBER’
Is there a way I can match this changing pattern so I can get rid of these advertisements in the text?
Answers:
Edit: fixed so it removes the space where the text was removed.
Assuming the ad is ALWAYS in that format, this is a very simplified version that you could expand upon:
import re
lyrics = "lyric lyric See John Mayer Live Get tickets as low as $53 lyric lyric"
# \s* after "Live" also matches the run-together "LiveGet" form from the question
pattern = r'See\s+(.*?)\s+Live\s*Get tickets as low as\s+\$[\d,]+'
# Remove the ad, then collapse the whitespace left behind
clean_lyrics = re.sub(pattern, '', lyrics).strip()
clean_lyrics = re.sub(r'\s+', ' ', clean_lyrics)
print(clean_lyrics)
# Output: lyric lyric lyric lyric
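In the pattern, \s+ matches one or more whitespace characters, .*? lazily matches any run of characters (captured in a group, here the artist name), and \d matches a digit (used inside [\d,]+ so that prices such as 1,200 also match). Together with the fixed words of the ad, these pieces let the regex match the parts that change.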
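If the ad shows up many times across the collection, the same idea can be wrapped in a small helper. The following is only a sketch under a few assumptions: the function name strip_ticket_ads is hypothetical, the ad always has the "See ARTIST Live" / "Get tickets as low as $NUMBER" shape described in the question, and only spaces and tabs are collapsed so that line breaks in the lyrics survive. The negative lookahead is an extra precaution (not part of the original answer) so that an earlier lyric starting with "See" is not swallowed along with the ad.

```python
import re

# Hypothetical helper, assuming the ad format from the question.
# (?:(?!See\s).)+?  lazily matches the artist name but refuses to run
#                   across another "See ", so lyrics like "See you later"
#                   that precede the ad are left alone.
# Live\s*Get        accepts both "LiveGet" and "Live Get".
# \$[\d,]+          a literal dollar sign followed by digits and commas.
AD_PATTERN = re.compile(
    r'See\s+(?:(?!See\s).)+?\s+Live\s*Get tickets as low as\s+\$[\d,]+'
)


def strip_ticket_ads(text):
    """Remove every ticket ad from text and tidy the leftover spaces."""
    without_ads = AD_PATTERN.sub(' ', text)
    # Collapse runs of spaces/tabs only, so newlines between lyric lines stay.
    return re.sub(r'[ \t]{2,}', ' ', without_ads).strip()


if __name__ == '__main__':
    sample = 'lyric lyric See John Mayer LiveGet tickets as low as $53 lyric lyric'
    print(strip_ticket_ads(sample))
```

Note that a price with cents (for example $53.50) would leave the ".50" behind; the character class would need a "." added if such prices occur in your data.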