Finding an exact match using regex in Python

Question:

I’m searching for exact course codes in a text. Codes look like this:

MAT1051
CMP1401*
PHY1001*
MAT1041*
ENG1003*

So 3 or 4 uppercase letters followed by 4 digits.

I only want the ones that do not end with a "*" symbol.

I have tried

course_code = re.compile('[A-Z]{4}[0-9]{4}|[A-Z]{3}[0-9]{4}')

which is probably one of the worst ways to do it, but it kinda works, as I can get all the courses listed above. The issue is that I don’t want the course codes ending with a "*" (failed courses have a * next to their codes) to be included in the list.

I tried adding \w or $ to the end of the expression. Whichever I add, the code returns an empty list.

Asked By: yokartikcem


Answers:

If I read your requirements correctly, you want this pattern:

^[A-Z]{3,4}[0-9]{4}$

This assumes that you are searching the entire text, stored in a single Python string, with the regex in multiline mode, as in this demo:

import re

inp = """MAT1051
CMP1401*
PHY1001*
MAT1041*
ENG1003*"""

matches = re.findall(r'^[A-Z]{3,4}[0-9]{4}$', inp, flags=re.M)
print(matches)  # ['MAT1051']
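
The flag is what makes the demo work. As a small sketch of the difference, reusing the same sample text: without re.M, "$" matches only at the very end of the string (or just before a trailing newline), and here the string ends in "*", which would likely explain the empty list mentioned in the question. The variable names are illustrative only.

```python
import re

inp = """MAT1051
CMP1401*
PHY1001*
MAT1041*
ENG1003*"""

pattern = r'[A-Z]{3,4}[0-9]{4}$'

# Without re.M, "$" anchors only at the end of the whole string,
# and the string ends with "ENG1003*", so nothing matches.
without_flag = re.findall(pattern, inp)

# With re.M, "$" also matches at the end of every line.
with_flag = re.findall(pattern, inp, flags=re.M)

print(without_flag)  # []
print(with_flag)     # ['MAT1051']
```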
Answered By: Tim Biegeleisen

import re

# Add a "$" at the end of the pattern.
# It requires the match to end right after the 4 digits.
course_code = re.compile(r'[A-Z]{4}[0-9]{4}$|[A-Z]{3}[0-9]{4}$')

# No match here: prints None
m = course_code.match("MAT1051*")
print(m)
# This matches: prints a match object for 'MAT1051'
m = course_code.match("MAT1051")
print(m)
Answered By: C. Pappy
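
Both answers assume each code sits on its own line or in its own string. If the codes are instead embedded in running text, a negative lookahead is one possible alternative. This is only a sketch, not something either answer proposes; the sample sentence and the name passed_code are made up for illustration.

```python
import re

# Hypothetical sample: codes embedded in a sentence rather than one per line.
text = "Passed MAT1051 and COMP2002, failed CMP1401* and PHY1001*."

# \b keeps the match from starting or ending inside a longer token;
# (?!\*) rejects a code that is immediately followed by "*".
passed_code = re.compile(r'\b[A-Z]{3,4}[0-9]{4}\b(?!\*)')

print(passed_code.findall(text))  # ['MAT1051', 'COMP2002']
```

Unlike the "$" approach, this does not depend on where lines end, so it should behave the same with or without re.M.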