re find date in string

Question:

Can someone explain how to use re.findall to extract only the dates from the following strings? The date can be in either format, 1.1.2001 or 11.11.2001, so the number of digits for the day and the month varies:

import re 
str = "This is my date: 1.1.2001 fooo bla bla bla"
str2 = "This is my date: 11.11.2001 bla bla foo bla"

I know I should use re.findall(pattern, string), but to be honest I am completely confused about those patterns. I don’t know how to assemble a pattern that fits my case.

I have found something like this, but I absolutely don’t know why there is the letter r before the pattern … means start of string? \d means digit? And the number in {} means how many?

match = re.search(r'\d{2}.\d{2}.\d{4}', text)

Thanks a lot!

Asked By: Slav3k


Answers:

There are actually two distinct processes happening in this code.

  1. When you enter some text "..." it first needs to be interpreted by the Python interpreter
  2. Then the Python interpreter passes the resulting string to the re module's own internal regex interpreter

In order to match a class of characters such as digits, Python’s internal regex interpreter supports special sequences like \d. So the regex interpreter is expecting to get \d. Unfortunately, the \ character is also an escape character for the Python interpreter in the first step of the process.

To keep the Python interpreter from processing the \ before the regex interpreter ever sees it, we put an r in front of the string literal, r"...", to mark it as a "raw string", which means "Hey Python interpreter, don’t touch my backslashes!". This results in the correct special sequences being passed through.
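To make the two steps concrete, here is a small sketch using \b, a sequence both interpreters understand differently (backspace for Python, word boundary for regex). The sample text and variable names are made up for illustration.

```python
import re

plain = "\bfoo\b"   # Python converts each \b into a backspace character
raw = r"\bfoo\b"    # raw string: both backslashes reach the regex engine

print(len(plain))   # 5 characters: backspace, f, o, o, backspace
print(len(raw))     # 7 characters: \, b, f, o, o, \, b

text = "a foo here"
plain_match = re.search(plain, text)  # searches for literal backspaces around "foo"
raw_match = re.search(raw, text)      # searches for "foo" between word boundaries

print(plain_match)        # None, the text contains no backspace characters
print(raw_match.group())  # foo
```

Note that for a sequence Python does not recognize, such as \d, a plain string currently keeps the backslash anyway, but newer Python versions warn about it, so the raw string is still the safer habit.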

Answered By: AlanSTACK

The r prefix marks a raw string, which means the backslashes (\) in the string are not treated as escape characters or otherwise altered.

The Python documentation describes \ like this:

Either escapes special characters (permitting you to match characters like ‘*’, ‘?’, and so forth), or signals a special sequence;

Basically, if you put \ in front of a character that would normally be special in a regex, the special meaning is ignored and the character is matched literally.

{} are used for repetitions. The documentation for the non-greedy form {m,n}? says:

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string ‘aaaaaa’, a{3,5} will match 5 ‘a’ characters, while a{3,5}? will only match 3 characters.

Meaning that it will repeat the preceding character the number of times you specified in {}.
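The example from the quoted documentation can be tried directly. This is just a quick sketch; the last line uses a made-up sample string to show an exact count.

```python
import re

s = "aaaaaa"  # the 6-character string from the documentation

greedy = re.match(r"a{3,5}", s).group()  # takes as many as allowed
lazy = re.match(r"a{3,5}?", s).group()   # takes as few as allowed

print(greedy)  # aaaaa
print(lazy)    # aaa

# {4} with a single number means exactly four repetitions
exact = re.findall(r"\d{4}", "2001 and 19 and 12345")
print(exact)   # ['2001', '1234']
```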

\d is a special sequence that matches any digit from 0 to 9.

I highly recommend this tutorial.

re.findall() returns a list of everything it matches using that regex.
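As a rough sketch with the two strings from the question, assuming a pattern of one or two digits, a literal dot, one or two digits, a literal dot, and four digits (the variable names here are my own):

```python
import re

# Assumed pattern: day.month.year with 1-2 digit day and month
pattern = r"\d{1,2}\.\d{1,2}\.\d{4}"

s1 = "This is my date: 1.1.2001 fooo bla bla bla"
s2 = "This is my date: 11.11.2001 bla bla foo bla"

dates1 = re.findall(pattern, s1)
dates2 = re.findall(pattern, s2)
nothing = re.findall(pattern, "no dates in here")

print(dates1)   # ['1.1.2001']
print(dates2)   # ['11.11.2001']
print(nothing)  # [] (an empty list when nothing matches)
```

Unlike re.search, which gives you a match object or None, re.findall always gives you a list, so you can loop over the result without checking for None first.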

Answered By: Xantium

The r prefix on a string tells the Python interpreter that it is a raw string, which essentially means backslashes are no longer treated as escape characters and are literal backslashes. For the re module it’s useful because backslashes are used a lot, so to avoid writing a lot of \\ (escaping the backslash) most people use a raw string instead.
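A quick way to convince yourself that the two spellings are the same string (a small illustration, not part of the original answer):

```python
escaped = "\\d{4}"  # plain string: the backslash itself has to be escaped
raw = r"\d{4}"      # raw string: written exactly as the regex engine should see it

print(escaped == raw)  # True
print(len(raw))        # 5 characters: \, d, {, 4, }
```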

What you’re looking for is this:

match = re.search(r'\d{1,2}\.\d{1,2}\.\d{4}', text)

The {} tells regex how many occurrences of the preceding set you want. {1,2} means a minimum of 1 and a maximum of 2 \d, and {4} means an exact match of 4 occurrences.

Note that the . is also escaped as \., since in regex . means any character, but in this case you are looking for the literal . so you escape it to tell regex to look for the literal character.
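To see why escaping the dot matters, here is a small comparison. The sample string and the names loose and strict are made up for the illustration.

```python
import re

loose = r"\d{1,2}.\d{1,2}.\d{4}"     # unescaped dot: any character is accepted as separator
strict = r"\d{1,2}\.\d{1,2}\.\d{4}"  # escaped dot: only a literal "." is accepted

sample = "1-1-2001 vs 1.1.2001"

loose_hits = re.findall(loose, sample)
strict_hits = re.findall(strict, sample)

print(loose_hits)   # ['1-1-2001', '1.1.2001']
print(strict_hits)  # ['1.1.2001']
```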

See this for more explanation: https://regex101.com/r/v2QScR/1

Answered By: r.ook