python regex lookbehind to remove _sublabel1 in string like "__label__label1_sublabel1"
Question:
I have a dataset prepared for training with fastText, and I want to remove the sublabels from it.
For example:
__label__label1_sublabel1 __label__label2_sublabel1 __label__label3 __label__label1_sublabel4 sometext some sentce som data.
Any help is much appreciated.
Thanks.
I tried this:
r'(?<=__label__[^_]+)\w+'
but it isn't working.
Exact code:
ptrn = r'(?<=__label__[^_]+)\w+'
re.sub(ptrn, '', test_String)
and this error occurred:
error:
error Traceback (most recent call last)
c:UsersTHoseiniDesktopprojectsensani_classificationtes4t.ipynb
Cell 3 in <cell line: 3>()
1 ptrn = r'(?<=__label__[^_]+)\w+'
----> 3 re.sub(ptrn, '', test_String)
File c:\Users\THoseini\AppData\Local\Programs\Python\Python310\lib\re.py:209, in sub(pattern, repl, string, count, flags)
202 def sub(pattern, repl, string, count=0, flags=0):
203 """Return the string obtained by replacing the leftmost
204 non-overlapping occurrences of the pattern in string by the
205 replacement repl. repl can be either a string or a callable;
206 if a string, backslash escapes in it are processed. If it is
207 a callable, it's passed the Match object and must return
208 a replacement string to be used."""
--> 209 return _compile(pattern, flags).sub(repl, string, count)
File c:\Users\THoseini\AppData\Local\Programs\Python\Python310\lib\re.py:303, in _compile(pattern, flags)
301 if not sre_compile.isstring(pattern):
302 raise TypeError("first argument must be string or compiled pattern")
--> 303 p = sre_compile.compile(pattern, flags)
304 if not (flags & DEBUG):
305 if len(_cache) >= _MAXCACHE:
306 # Drop the oldest item
File c:\Users\THoseini\AppData\Local\Programs\Python\Python310\lib\sre_compile.py:792, in compile(p, flags)
--> 198 raise error("look-behind requires fixed-width pattern")
199 emit(lo) # look behind
200 _compile(code, av[1], flags)
error: look-behind requires fixed-width pattern
Answers:
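Try this regex:
(__label__[^_\s]+)\w*
and sample code in Python:
import re
test_string = """__label__label1_sublabel1 __label__label2_sublabel1 __label__label3 __label__label1_sublabel4 sometext some sentce som data."""
# Group 1 keeps "__label__" plus the main label; \w* consumes any "_sublabel" suffix.
ptrn = r'(__label__[^_\s]+)\w*'
print(re.sub(ptrn, r'\1', test_string))
re.sub() replaces every non-overlapping match of the pattern and returns the resulting string. The replacement \1 puts back only the text captured by group 1, so no look-behind is needed.
[^character_group] means negation: it matches any single character that is not in character_group. \w matches any word character (including the underscore), and \s matches any white-space character.
The quantifier is \w* rather than \w+ so that a label with no sublabel, such as __label__label3, is left intact; with \w+ it would come out as __label__label.
The output is as expected:
__label__label1 __label__label2 __label__label3 __label__label1 sometext some sentce som data.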
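One case worth checking with this kind of pattern is a label that has no sublabel at all. The following is a minimal sketch, not a definitive implementation: the names sample, lookbehind_ok, with_plus and with_star are made up for illustration and are not from the original post. It first confirms that the standard-library re module refuses the variable-width look-behind from the question, then compares the capture-group pattern with a \w+ and a \w* tail.

```python
import re

# Shortened sample line (hypothetical, modelled on the data in the question).
sample = "__label__label1_sublabel1 __label__label3 sometext"

# The look-behind from the question is variable-width ([^_]+),
# so the standard-library re module raises re.error when compiling it.
try:
    re.compile(r'(?<=__label__[^_]+)\w+')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Capture-group alternative: group 1 is kept, the trailing word characters are dropped.
# With \w+ at least one character must be consumed after group 1.
with_plus = re.sub(r'(__label__[^_\s]+)\w+', r'\1', sample)
# With \w* the tail may be empty, so labels without a sublabel stay whole.
with_star = re.sub(r'(__label__[^_\s]+)\w*', r'\1', sample)

print(lookbehind_ok)
print(with_plus)
print(with_star)
```

With \w+ the regex engine backtracks on __label__label3 and hands the final "3" to \w+, so that label is shortened to __label__label; with \w* the label is left unchanged.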