Regex: getting the following data into groups

Question:

I’ve got the following 2 records:

Input
Marvel Comics Presents12 (1982) #125
Marvel Comics Presents #1427 (1988)

I want to parse them into the following format using RegEx:

Title                     Year    Serial Number
Marvel Comics Presents12  (1982)  #125
Marvel Comics Presents    (1988)  #1427

I do know basic RegEx but feel like I’m a little lackluster here. Is there a specific topic within RegEx that helps with this type of problem?

Asked By: Benjamin Stringer


Answers:

Try creating match groups for what’s inside the parentheses and the number after the #, then use the same RegEx again to replace that text with nothing. Like this:

import re


def extract(el):
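    # Capture the text inside the parentheses, then remove the parenthesized part.
    year = int(re.search(r'\((.*)\)', el).group(1))
    el = re.sub(r'\(.*\)', '', el)
    # Capture the digits after '#', then remove the '#' and its digits.
    serial = int(re.search(r'#(\d*)', el).group(1))
    el = re.sub(r'#\d*', '', el)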
    return {'year': year, 'serial': serial, 'title': el.strip()}


data = ['Marvel Comics Presents12 (1982) #125', 'Marvel Comics Presents #1427 (1988)']
data = [extract(el) for el in data]
print(data)  # => [{'year': 1982, 'serial': 125, 'title': 'Marvel Comics Presents12'}, {'year': 1988, 'serial': 1427, 'title': 'Marvel Comics Presents'}]

The RegExs here are:

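  1. \((.*)\) to match what is inside the parentheses. The outer \( and \) are literal parentheses; the inner ( ) is the match group.
  2. #(\d*) to match the number after the # symbol. \d matches a single digit.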
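
To see what each pattern captures on its own, here is a small sketch run against the two records from the question. The names `samples` and `captured` are made up for this illustration only.

```python
import re

samples = [
    'Marvel Comics Presents12 (1982) #125',
    'Marvel Comics Presents #1427 (1988)',
]

captured = []
for s in samples:
    # group(1) is the text inside the literal parentheses.
    year = re.search(r'\((.*)\)', s).group(1)
    # group(1) is the run of digits right after '#'.
    serial = re.search(r'#(\d*)', s).group(1)
    captured.append((year, serial))

print(captured)  # => [('1982', '125'), ('1988', '1427')]
```

Because each pattern is searched for independently, it does not matter whether the year or the serial number comes first in the record.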

I removed the match groups from the RegExs that replace text because they are not needed and might speed up the code a bit.
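
As for which RegEx topic to read up on: capturing groups, and in particular named groups, are probably the most relevant. Below is a possible variation on the same idea, not a definitive implementation. It assumes the year is always four digits and the serial always has at least one digit, which the question does not state. `YEAR`, `SERIAL` and `parse` are names I made up for the sketch.

```python
import re

# Assumption: a year is exactly four digits inside parentheses.
YEAR = re.compile(r'\((?P<year>\d{4})\)')
# Assumption: a serial is '#' followed by one or more digits.
SERIAL = re.compile(r'#(?P<serial>\d+)')


def parse(record):
    year = YEAR.search(record)
    serial = SERIAL.search(record)
    if year is None or serial is None:
        raise ValueError(f'unexpected record format: {record!r}')
    # Whatever is left after removing both parts is treated as the title.
    title = SERIAL.sub('', YEAR.sub('', record)).strip()
    return title, int(year.group('year')), int(serial.group('serial'))


print(parse('Marvel Comics Presents12 (1982) #125'))
# => ('Marvel Comics Presents12', 1982, 125)
print(parse('Marvel Comics Presents #1427 (1988)'))
# => ('Marvel Comics Presents', 1988, 1427)
```

Using \d{4} instead of .* keeps the year pattern from swallowing other text if a title ever contains its own parentheses, and the explicit None check gives a clearer error than an AttributeError on a record that does not match.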

Answered By: Michael M.