re find date in string
Question:
Can someone explain how to use re.findall to extract only the dates from the following strings? The date can be in either format, 1.1.2001 or 11.11.2001, so the number of digits representing the day and the month varies.
import re
str = "This is my date: 1.1.2001 fooo bla bla bla"
str2 = "This is my date: 11.11.2001 bla bla foo bla"
I know I should use re.findall(pattern, string), but to be honest I am completely confused about those patterns. I don’t know how to assemble the pattern to fit my case.
I have found something like this, but I absolutely don’t know why there is the letter r before the pattern … means start of string? \d means digit? And the number in {} means how many?
match = re.search(r'\d{2}.\d{2}.\d{4}', text)
Thanks a lot!
Answers:
There are actually two distinct processes happening in this code.
- When you enter some text "...", it first needs to be interpreted by the Python interpreter.
- Then the Python interpreter passes the result, result("..."), to its own internal regex interpreter.
In order to match a special class of characters such as digits, Python’s internal regex interpreter supports special sequences like \d. So the regex interpreter is expecting to get \d. Unfortunately, the character \ is also an escape character for the Python interpreter in the first step of the process.
To stop the Python interpreter from eating the \ and passing only d to the regex interpreter, we put r in front of our strings, as in r"...", to indicate a "raw string", which means "Hey Python interpreter, don’t touch my \ characters!". This results in the correct special sequences being passed through. (Python happens to leave escapes it does not recognize, such as \d, alone, but recognized ones such as \b are replaced, so it is safest to always use a raw string for patterns.)
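To make the two-step process concrete, here is a minimal sketch that compares a plain and a raw string literal. It uses \b (backspace to Python, word boundary to regex) because that is an escape Python really does replace; the sample text "a cat sat" is made up for illustration.

```python
import re

# "\b" is a recognized Python escape (backspace), so the plain literal
# hands the regex engine one control character instead of backslash + b.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))            # 5 7

# The plain version looks for literal backspace characters and finds nothing.
print(re.findall(plain, "a cat sat"))  # []

# The raw version reaches the regex engine intact as a word-boundary pattern.
print(re.findall(raw, "a cat sat"))    # ['cat']
```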
The r prefix makes the string a raw string, which means \ is not treated as an escape character by Python inside it.
The Python documentation describes \ in a regex like this:
Either escapes special characters (permitting you to match characters like ‘*’, ‘?’, and so forth), or signals a special sequence;
Basically, if you put \ in front of a character that would normally be special to regex, regex ignores its special meaning and matches it literally.
{} is used for repetitions. {m,n} matches from m to n repetitions of the preceding RE, as many as possible; the documentation describes the non-greedy form {m,n}? like this:
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string ‘aaaaaa’, a{3,5} will match 5 ‘a’ characters, while a{3,5}? will only match 3 characters.
Meaning that it will repeat the preceding item the number of times you specified in {}.
\d is a special sequence that matches any digit from 0 to 9.
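The quoted ‘aaaaaa’ example can be checked directly. The sketch below also shows an exact count, {4}, applied to \d; the string "year 2001, code 12" is an invented sample.

```python
import re

text = "aaaaaa"

# Greedy: take as many repetitions as allowed (up to 5).
greedy = re.search(r"a{3,5}", text).group()
print(greedy)   # aaaaa

# Non-greedy: take as few repetitions as allowed (3).
lazy = re.search(r"a{3,5}?", text).group()
print(lazy)     # aaa

# Exact count: only runs of exactly four digits in a row match.
years = re.findall(r"\d{4}", "year 2001, code 12")
print(years)    # ['2001']
```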
I highly recommend this tutorial
re.findall() returns a list of everything it matches using that regex.
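As a rough illustration of what that list looks like, the sketch below runs re.findall with a deliberately simple pattern, \d+, on one of the strings from the question. It also shows one behaviour that is easy to trip over: when the pattern contains capturing groups, re.findall returns the groups rather than the whole match.

```python
import re

text = "This is my date: 1.1.2001 fooo bla bla bla"

# Every non-overlapping run of digits, in order of appearance.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['1', '1', '2001']

# With several capturing groups, each match becomes a tuple of the groups.
parts = re.findall(r"(\d+)\.(\d+)\.(\d+)", text)
print(parts)    # [('1', '1', '2001')]

# No match at all gives an empty list, not None.
nothing = re.findall(r"\d+", "no digits here")
print(nothing)  # []
```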
The r prefix on a string tells the Python interpreter that it is a raw string, which essentially means backslashes (\) are no longer treated as escape characters and are literal backslashes. For the re module this is useful because backslashes are used a lot, so to avoid writing a lot of \\ (escaping the backslash), most people use a raw string instead.
What you’re looking for is this:
match = re.search(r'\d{1,2}\.\d{1,2}\.\d{4}', text)
The {} tells regex how many occurrences of the preceding item you want. {1,2} means a minimum of 1 and a maximum of 2 occurrences of \d, and {4} means exactly 4 occurrences.
Note that the . is also escaped, as \., since in regex . means any character, but in this case you are looking for a literal dot, so you escape it to tell regex to look for the literal character.
See this for more explanation: https://regex101.com/r/v2QScR/1
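Since the question asked about re.findall, here is a small sketch that applies the same pattern with re.findall to the two strings from the question, and shows why the dots need escaping. The names DATE and loose and the sample "id 12-34-5678" are made up for illustration.

```python
import re

DATE = r"\d{1,2}\.\d{1,2}\.\d{4}"

s1 = "This is my date: 1.1.2001 fooo bla bla bla"
s2 = "This is my date: 11.11.2001 bla bla foo bla"

print(re.findall(DATE, s1))  # ['1.1.2001']
print(re.findall(DATE, s2))  # ['11.11.2001']

# With unescaped dots, "." matches any character, so non-dates slip through.
loose = r"\d{1,2}.\d{1,2}.\d{4}"
print(re.findall(loose, "id 12-34-5678"))  # ['12-34-5678']
print(re.findall(DATE, "id 12-34-5678"))   # []
```

Note that this pattern only checks the shape of the text: it would also accept 99.99.2001, and it can match inside a longer run of digits, so a stricter check may be needed depending on the input.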