Python regex works on Regex101, but it does not work in Python 2
Question:
I created a regex to match the Chinese and English names of TV shows.
My regex is at https://regex101.com/r/rBJHDG. It works perfectly on Regex101. However, it does not work in Python 2.
For example, take the string 亿万.Billions.S01E01.中英字幕.HDTVrip.1024X576.mp4.
The regex does not match 亿万 as name_chs as expected. Instead, it matches 亿万.Billions as name_en.
In [68]: r = '^(?P<name_chs>(?:[\u3007\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+)(?=\.))?(?P<name_en>\S+).S(?P<season>\d{2})E(?P<episode>\d{2})'
In [69]: re.match(r, u'亿万.Billions.S01E01.中英字幕.HDTVrip.1024X576.mp4').groupdict()
Out[69]:
{'episode': u'01',
 'name_chs': None,
 'name_en': u'\u4ebf\u4e07.Billions',
 'season': u'01'}
Second question:
How can I remove the . in name_en that sits between the Chinese name and the English name?
# 亿万.Billions.S01E01.中英字幕.HDTVrip.1024X576.mp4
Full match 0-18 `亿万.Billions.S01E01`
Group `name_chs` 0-2 `亿万`
Group `name_en` 2-11 `.Billions` <---- This DOT!
Group `season` 13-15 `01`
Group `episode` 16-18 `01`
Answers:
It looks like the problem is that the regex tester includes the global and multiline flags but your code does not. If you uncheck those two flags in the regex tester, you'll find that the tester matches your current results.
You could try
r = re.compile('^(?P<name_chs>(?:[\u3007\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+)(?=\.))?(?P<name_en>\S+).S(?P<season>\d{2})E(?P<episode>\d{2})', re.MULTILINE)
and
re.search(r, u'亿万.Billions.S01E01.中英字幕.HDTVrip.1024X576.mp4').groupdict()
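One more thing that may be worth ruling out, offered as an assumption rather than a confirmed diagnosis: in Python 2 a pattern written as a plain '...' literal is a byte string, so \u4e00 stays as the six characters backslash, u, 4, e, 0, 0 instead of becoming one code point, and Python 2's re module does not expand \uXXXX itself. The character class then no longer describes CJK ranges, which would be consistent with name_chs coming back as None. A minimal sketch that builds the pattern as a unicode string instead (the unicode_literals import and the names CJK, PATTERN, FILENAME and groups are choices made for this example, not part of the original code):

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # plain '...' literals become unicode on Python 2
import re

# Non-raw unicode literals: \uXXXX turns into a real code point, so the regex
# escapes \S, \d and \. have to be written with doubled backslashes.
CJK = '[\u3007\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+'

# Same structure as the pattern in the question, only the string type differs.
PATTERN = re.compile(
    '^(?P<name_chs>' + CJK + '(?=\\.))?'
    '(?P<name_en>\\S+).S(?P<season>\\d{2})E(?P<episode>\\d{2})'
)

FILENAME = '亿万.Billions.S01E01.中英字幕.HDTVrip.1024X576.mp4'
groups = PATTERN.match(FILENAME).groupdict()
print(groups)
```

With the pattern built this way, name_chs should come back as 亿万 and name_en as .Billions, the same split that Regex101 reports, leading dot included.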
As for your second question:
I would just make that dot its own capture group by adding (.) in front of the English name, like so:
^(?P<name_chs>(?:[\u3007\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+)(?=\.))?(.)(?P<name_en>\S+).S(?P<season>\d{2})E(?P<episode>\d{2})
Now when you print the English name it will only be the word because the dot is in its own capture group.
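One possible variation on this idea, sketched under the assumption that the separating dot only appears when a Chinese name precedes it: wrap the Chinese name and its trailing dot in a single optional non-capturing group, so the dot is consumed but never lands in name_en, and a file name without a Chinese part does not lose its first character. This replaces the lookahead rather than adding a group, and the names CJK, PATTERN and parse are illustrative only.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # plain '...' literals become unicode on Python 2
import re

CJK = '[\u3007\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]+'

# (?: ... \.)? consumes "Chinese name + dot" as one optional unit;
# only the Chinese name itself is captured as name_chs.
PATTERN = re.compile(
    '^(?:(?P<name_chs>' + CJK + ')\\.)?'
    '(?P<name_en>\\S+)\\.S(?P<season>\\d{2})E(?P<episode>\\d{2})'
)


def parse(filename):
    """Return the named groups as a dict, or None when nothing matches."""
    m = PATTERN.match(filename)
    return m.groupdict() if m else None


print(parse('亿万.Billions.S01E01.中英字幕.HDTVrip.1024X576.mp4'))
print(parse('Billions.S01E01.mp4'))
```

For the first file name, name_en is Billions without the leading dot; for the second, name_chs is None and name_en is still Billions.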