Python: How to use RegEx in an if statement?
Question:
I have the following code which looks through the files in one directory and copies files that contain a certain string into another directory, but I am trying to use Regular Expressions as the string could be upper and lowercase or a mix of both.
Here is the code that works, before I tried to use RegEx’s
import os
import re
import shutil

def test():
    os.chdir("C:/Users/David/Desktop/Test/MyFiles")
    files = os.listdir(".")
    os.mkdir("C:/Users/David/Desktop/Test/MyFiles2")
    for x in (files):
        inputFile = open((x), "r")
        content = inputFile.read()
        inputFile.close()
        if ("Hello World" in content):
            shutil.copy(x, "C:/Users/David/Desktop/Test/MyFiles2")
Here is my code when I have tried to use RegEx’s
import os
import re
import shutil

def test2():
    os.chdir("C:/Users/David/Desktop/Test/MyFiles")
    files = os.listdir(".")
    os.mkdir("C:/Users/David/Desktop/Test/MyFiles2")
    regex_txt = "facebook.com"
    for x in (files):
        inputFile = open((x), "r")
        content = inputFile.read()
        inputFile.close()
        regex = re.compile(regex_txt, re.IGNORECASE)
I’m guessing that I need a line of code that is something like
if regex = re.compile(regex_txt, re.IGNORECASE) == True
But I can’t seem to get anything to work. If someone could point me in the right direction, it would be appreciated.
Answers:
First you compile the regex, then you have to use it with match, search, or some other method to actually run it against some input.
import os
import re
import shutil

def test():
    os.chdir("C:/Users/David/Desktop/Test/MyFiles")
    files = os.listdir(".")
    os.mkdir("C:/Users/David/Desktop/Test/MyFiles2")
    regex_txt = "facebook.com"  # pattern string from the question
    pattern = re.compile(regex_txt, re.IGNORECASE)
    for x in (files):
        with open((x), 'r') as input_file:
            for line in input_file:
                if pattern.search(line):
                    # first matching line is enough: copy the file and stop reading it
                    shutil.copy(x, "C:/Users/David/Desktop/Test/MyFiles2")
                    break
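As a rough sketch, the same idea can be wrapped in a reusable function. The name copy_matching_files and its three parameters are made up for illustration and do not come from the question. This version reads each file whole rather than line by line, and passes directories in explicitly instead of calling os.chdir.

```python
import re
import shutil
from pathlib import Path


def copy_matching_files(src_dir, dst_dir, regex_txt):
    """Copy files from src_dir to dst_dir if their text matches regex_txt.

    Hypothetical helper: matching is case-insensitive, and the sorted
    names of the copied files are returned.
    """
    pattern = re.compile(regex_txt, re.IGNORECASE)
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    copied = []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        # errors="replace" keeps undecodable bytes from raising an exception
        text = path.read_text(errors="replace")
        if pattern.search(text):
            shutil.copy(path, dst / path.name)
            copied.append(path.name)
    return copied
```

Called as copy_matching_files("C:/Users/David/Desktop/Test/MyFiles", "C:/Users/David/Desktop/Test/MyFiles2", r"facebook\.com"), it should behave like the loop above, assuming the files are readable as text.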
The REPL makes it easy to learn APIs. Just run python, create an object and then ask for help:
$ python
>>> import re
>>> help(re.compile(r''))
at the command line shows, among other things:
search(...)
    search(string[, pos[, endpos]]) --> match object or None.
    Scan through string looking for a match, and return a corresponding
    MatchObject instance. Return None if no position in the string matches.
so you can do
regex = re.compile(regex_txt, re.IGNORECASE)
match = regex.search(content)  # From your file reading code.
if match is not None:
    pass  # use match
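A small runnable illustration of that check, using made-up sample strings: search() gives back a match object when the pattern is found and None when it is not.

```python
import re

regex = re.compile("hello world", re.IGNORECASE)

hit = regex.search("Say HeLLo WoRLD!")   # a match object
miss = regex.search("nothing to see")    # None

for match in (hit, miss):
    if match is not None:
        # group() is the matched text, span() its (start, end) indices
        print("found", repr(match.group()), "at", match.span())
    else:
        print("no match")
```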
Incidentally, regex_txt = "facebook.com" has a . which matches any character, so re.compile("facebook.com").search("facebookkcom") is not None is true. You probably want
regex_txt = r"(?i)facebook\.com"
The \. matches a literal "." character instead of treating . as a special regular expression operator.
The r"..." bit means that the regular expression compiler gets the escape in \. instead of the python parser interpreting it.
The (?i) makes the regex case-insensitive like re.IGNORECASE but self-contained.
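To see what difference an escaped dot makes, here is a short comparison on two sample strings of my own choosing. It also shows re.escape(), a standard library function not mentioned above, which adds the backslashes for you when the search text is meant literally.

```python
import re

unescaped = re.compile(r"(?i)facebook.com")    # . matches any character
escaped = re.compile(r"(?i)facebook\.com")     # \. matches only a literal dot
# re.escape() escapes the regex metacharacters in a literal string
auto_escaped = re.compile(re.escape("facebook.com"), re.IGNORECASE)

for text in ["www.FaceBook.COM", "facebookkcom"]:
    print(text,
          bool(unescaped.search(text)),
          bool(escaped.search(text)),
          bool(auto_escaped.search(text)))
```

Only the unescaped pattern accepts "facebookkcom"; all three accept "www.FaceBook.COM".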
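import re
if re.match(regex, content):
    pass  # do something with the match
You could also use re.search, depending on how you want it to match: re.match only matches at the start of the string, while re.search finds a match anywhere in it.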
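The choice between the two matters for the file-content case, where the text of interest is rarely at the very start. A quick sketch with an invented sample string (re.fullmatch is included only for comparison):

```python
import re

content = "see https://facebook.com/page"
pattern = r"facebook\.com"

at_start = re.match(pattern, content)      # must match at index 0
anywhere = re.search(pattern, content)     # may match at any position
whole = re.fullmatch(pattern, content)     # must match the entire string

print(at_start, anywhere, whole, sep="\n")
```

Here only re.search returns a match object; the other two return None.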
You can run this example:
"""
very nice interface to try regexes: https://regex101.com/
"""
# %%
"""Simple if statement with a regex"""
import re

regex = r"\s*Proof.\s*"
contents = ['Proof.\n', '\nProof.\n']
for content in contents:
    assert re.match(regex, content), f'Failed on {content=} with {regex=}'
    if re.match(regex, content):
        print(content)
Regexes shouldn’t really be used in this fashion unless you want something more complicated than what you’re trying to do. For instance, you could just normalise your content string and comparison string:
if 'facebook.com' in content.lower():
    shutil.copy(x, "C:/Users/David/Desktop/Test/MyFiles2")
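The same plain-substring idea as a tiny helper, offered as a sketch. The function name contains_ignore_case is hypothetical, and it swaps lower() for str.casefold(), which handles a few non-ASCII cases (such as the German ß) more aggressively; for ASCII text like this the two behave the same.

```python
def contains_ignore_case(needle, haystack):
    """Return True if needle occurs in haystack, ignoring case."""
    return needle.casefold() in haystack.casefold()


print(contains_ignore_case("facebook.com", "Go to FACEBOOK.com now"))
print(contains_ignore_case("facebook.com", "facebookkcom"))
```

Because this is a literal comparison, the dot needs no escaping and "facebookkcom" is not a match.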
if re.search(r'pattern', string):
Simple if-regex example:
if re.search(r'ing\b', "seeking a great perhaps"):  # any words end with ing?
    print("yes")
Complex if-regex example (pattern check, extract a substring, case insensitive):
search_object = re.search(r'^OUGHT (.*) BE$', "ought to be", flags=re.IGNORECASE)
if search_object:
    assert "to" == search_object.group(1)  # what's between ought and be?
Notes:
- Use re.search() not re.match. The match method restricts to the start of the string, a confusing convention. If you want that, search explicitly with caret: re.search(r'^...', ...) (Or in re.MULTILINE mode use \A)
- Use raw string syntax r'pattern' for the first parameter. Otherwise you would need to double up backslashes, as in re.search('ing\\b', ...)
- In these examples, '\\b' or r'\b' is a special sequence meaning word-boundary for regex purposes. Not to be confused with '\b' or '\x08' backspace.
- re.search() returns None if it doesn’t find anything, which is always falsy.
- re.search() returns a Match object if it finds anything, which is always truthy.
- Even though re.search() returns a Match object (type(search_object) is re.Match), I have taken to calling the return value a search_object. I keep returning to my own bloody answer here because I can’t remember whether to use match or search. It’s search, dammit.
- A group is what matched inside pattern parentheses.
- Group numbering starts at 1.
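Several of these notes can be checked directly. The snippet below is only a sketch: the two-group pattern is my own variation on the example above, added to show that group(0) is the whole match while the parenthesised groups start at 1.

```python
import re

text = "seeking a great perhaps"

# r'\b' is two characters (backslash + b); '\b' is one backspace character.
raw_len = len(r"\b")
is_backspace = "\b" == "\x08"

with_raw = re.search(r"ing\b", text)     # word boundary after "ing"
without_raw = re.search("ing\b", text)   # looks for "ing" + backspace

search_object = re.search(r"^(\w+) (.*) BE$", "ought to be", flags=re.IGNORECASE)
if search_object:
    whole = search_object.group(0)    # the entire matched text
    first = search_object.group(1)    # first pair of parentheses
    second = search_object.group(2)   # second pair of parentheses
    print(whole, first, second, sep=" | ")
```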