Regular Expression to Pull Information from String in Python

Question:

What I am trying to do is take my current string and remove all data from it that doesn’t contain the actual software version. Here is the string I am currently working with:

print (CurrentVersion)

Delivers the output:

2018, \\some\directory\is\here, \\some\directory\is\here,  2019, \\here\is\another\directory, \\here\is\another\directory,  2021, \\here\is\another\path_2021,   2020, http://some.will/even/look/like/this,   2022r2,   2023

When what I really want is this for an output:

2018, 2019, 2020, 2021, 2022r2, 2023

What I have tried was to come up with a regular expression to remove the excess data. It looks like ‘[0-9, ]’ will pull out the numbers and commas getting me closer to my goal. So I came up with this code:

RegexVersion = re.compile(r'[0-9, ]')
CurrentVersion = RegexVersion.search(CurrentVersion)
print (CurrentVersion.group())

But this only prints out an output of "2". Based on a regex calculator it looked like it was going to be a little closer to my expected output. From there I was planning on using .replace to get rid of the extra commas and spaces, but I can’t seem to get that far.

So the question is, how do I go from the current output of "CurrentVersion" stripped down to only versions, preferably in numerical order?

Asked By: Acuity


Answers:

You might use a capture group:

(?:^|,\s*)(\d{4}\w*)(?=,|$)

The pattern matches:

  • (?:^|,\s*) Match either the start of the string, or a comma followed by optional whitespace chars
  • (\d{4}\w*) Capture 4 digits followed by optional word characters
  • (?=,|$) Assert either a comma or the end of the string to the right


Example

import re
 
pattern = r"(?:^|,\s*)(\d{4}\w*)(?=,|$)"
 
s = ("2018, \\\\some\\directory\\is\\here, \\\\some\\directory\\is\\here,  2019, \\\\here\\is\\another\\directory, \\\\here\\is\\another\\directory,  2021, \\\\here\\is\\another\\path_2021,   2020, http://s...content-available-to-author-only...e.will/even/look/like/this,   2022r2,   2023\n")
 
print(re.findall(pattern, s))

Output

['2018', '2019', '2021', '2020', '2022r2', '2023']

Another option is to find all the years that start with 20 and then optionally match r followed by 1 or more digits:

(?:^|,\s*)(20\d\d(?:r\d+)?)(?=,|$)


Or matching 4 digits followed by all except a comma:

(?:^|,\s*)(\d{4}[^,]*)

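To compare the three patterns side by side, here is a small sketch. The sample string is a shortened stand-in modeled on the one in the question, and the dictionary labels are only illustrative names, not anything from the original answer.

```python
import re

# Shortened sample modeled on the question's string. Raw strings keep
# the backslashes literal.
s = (r"2018, \\some\directory\is\here,  2019, "
     r"\\here\is\another\path_2021,   2020, "
     r"http://some.will/even/look/like/this,   2022r2,   2023")

# The three patterns from this answer, keyed by an illustrative label.
patterns = {
    "four digits + word chars": r"(?:^|,\s*)(\d{4}\w*)(?=,|$)",
    "20xx with optional rN": r"(?:^|,\s*)(20\d\d(?:r\d+)?)(?=,|$)",
    "four digits + non-commas": r"(?:^|,\s*)(\d{4}[^,]*)",
}

# findall() returns the contents of the single capture group for each match.
results = {name: re.findall(p, s) for name, p in patterns.items()}
for name, found in results.items():
    print(f"{name}: {found}")
```

On this sample all three return the same list. They would differ on other input: the last pattern has no lookahead and keeps everything up to the next comma, so it would also capture trailing spaces or extra text after the four digits.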

Answered By: The fourth bird

Your first problem is that the regex [0-9, ] matches any single character that is a digit from 0 to 9, a comma, or a space. It matches each digit of a number individually, and it also matches the commas and spaces that you don't want. Additionally, it won't match the r in your expected output of 2022r2, and it will match the digits of 2021 in "\\here\is\another\path_2021".
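A quick way to see this behavior is to run the character class against a shortened sample. The `sample` string below is a hypothetical stand-in, not the full string from the question.

```python
import re

# Hypothetical shortened sample. The raw string keeps backslashes literal.
sample = r"2018, \\some\path_2021,   2022r2"

char_class = re.compile(r"[0-9, ]")

# The class matches exactly one character, and search() stops at the
# first match, so only the leading "2" comes back.
first = char_class.search(sample).group()
print(first)  # 2

# findall() returns every single-character match: the digits of
# "path_2021" are included, the "r" of "2022r2" is dropped, and the
# commas and spaces are kept.
pieces = char_class.findall(sample)
joined = "".join(pieces)
print(joined)
```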

I would instead recommend using (?: |^)(\d+(?:r\d)?). First, this checks that the year is preceded by either a space or the start of the string. Next is a capturing group, which matches 1 or more digits (\d+) and optionally an extension ((?:r\d)?) consisting of the letter "r" and one digit. If your input could contain more than one digit following the letter "r", you could instead replace this part with (?:r\d+)?.

Your second, bigger problem is that you use RegexVersion.search(CurrentVersion), which only returns the first match in the string.

I would instead recommend using RegexVersion.findall(CurrentVersion), which returns a list of all matches. You could then optionally join that list into one long comma-separated string using

", ".join(CurrentVersion)
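Putting the pieces together, and adding the numerical ordering the question asks for, might look like the sketch below. The names `current_version`, `version_re`, and `sort_key` are illustrative, the input is a shortened stand-in for the question's string, and the pattern is one possible combination of the ideas in the answers here. It assumes every version is a four-digit year with an optional rN suffix.

```python
import re

# Hypothetical stand-in for the CurrentVersion string from the question.
current_version = (r"2018, \\some\directory\is\here,  2019, "
                   r"\\here\is\another\path_2021,  2021,   2020, "
                   r"http://some.will/even/look/like/this,   2022r2,   2023")

# Assumed version format: a four-digit year with an optional "r<digits>"
# suffix, sitting alone between commas.
version_re = re.compile(r"(?:^|,\s*)(\d{4}(?:r\d+)?)(?=,|$)")

def sort_key(version):
    # Split "2022r2" into (2022, 2). A bare year such as "2023" gets release 0.
    year, _, release = version.partition("r")
    return int(year), int(release or 0)

# findall() collects every version, sorted() orders them numerically.
versions = sorted(version_re.findall(current_version), key=sort_key)
result = ", ".join(versions)
print(result)  # 2018, 2019, 2020, 2021, 2022r2, 2023
```

Sorting on an integer tuple rather than on the raw strings keeps a release such as 2022r10 after 2022r2, which a plain string sort would not.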

Answered By: NonBinaryProgrammer