Python re.sub: problems backreferencing from a function

Question:

I want to ‘join’ certain numbers that clearly belong together, but I don’t want to join every number.

What I have:

'Canesten 1 500 mg meka kapsula za rodnicui'
'Clexane 10 000 IU (100 mg)/1 ml otopina za injekciju'
'Humulin M3 100 IU/ml suspenzija za injekciju u ulošku'
'Docile10 000 IU/ml oralne kapi, otopina'
'POLYGYNAX 35 000 IU / 35 000 IU / 100 000 IU kapsula za rodnicu, meka'
'Prostin E2 2 mg gel za rodnicu'
'Silapen K 1 000 000 IU filmom obložene tablete'

I want to have:

'Canesten 1500 mg meka kapsula za rodnicui'
'Clexane 10000 IU (100 mg)/1 ml otopina za injekciju'
'Humulin M3 100 IU/ml suspenzija za injekciju u ulošku'
'Docile10000 IU/ml oralne kapi, otopina'
'POLYGYNAX 35000 IU / 35000 IU / 100000 IU kapsula za rodnicu, meka'
'Prostin E2 2 mg gel za rodnicu'
'Silapen K 1000000 IU filmom obložene tablete'

It may be easier to see which ones I’m trying to join here: https://regex101.com/r/Ht9ZVi/1

I can match each one of the numbers I want to join using ([^a-zA-Z](?:\d+\s+)*\d+\s\d+0{2}), but because this regex is not perfect regarding the blank spaces, I thought about using a function to remove only the blank spaces between numbers.

def spaces(s):
    return re.sub('(?<=\d) (?=\d)', '', s)

cr['Name'].apply(lambda x: re.sub(r"([^a-zA-Z](?:\d+\s*)*\d+\s\d+0{2})", spaces(r'\1'), x))

This returns the strings unaltered. What am I doing wrong?
I know this is a common question, and the solution is probably really simple, but I can’t wrap my head around it.

Asked By: Pedro Domingues


Answers:

Your pattern starts with [^a-zA-Z], which matches a single leading character other than a-z or A-Z. Instead, you can assert that there is no uppercase A-Z directly to the left, which also accounts for Docile10 000.

Then you don’t need a capture group: you can match the digits with at least one whitespace character in between, followed by an assertion for one of the allowed units.

Then remove the spaces from the match in a replacement function, using .group(0).
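As a rough illustration of why the attempt in the question leaves the strings unaltered: spaces(r'\1') is evaluated before re.sub ever runs, so it only sees the two-character string \1 rather than the matched text. The sketch below uses a deliberately simplified pattern, (\d+ \d+), and a single sample string purely for demonstration; these are assumptions, not the pattern recommended in this answer.

```python
import re

def spaces(s):
    # Remove a single space that sits between two digits.
    return re.sub(r"(?<=\d) (?=\d)", "", s)

text = "Clexane 10 000 IU"

# spaces(r"\1") runs first, on the literal string "\1". That string has no
# "digit space digit" in it, so it comes back unchanged, and re.sub then
# receives a plain backreference that re-inserts the match as it was.
eager = re.sub(r"(\d+ \d+)", spaces(r"\1"), text)

# Passing a callable defers the work: re.sub calls it once per match with
# the match object, so the matched text is available through .group(0).
lazy = re.sub(r"(\d+ \d+)", lambda m: spaces(m.group(0)), text)

print(eager)  # Clexane 10 000 IU
print(lazy)   # Clexane 10000 IU
```

The only difference between the two calls is whether the replacement is computed once up front or once per match.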

This part [^\S\n]+ matches whitespace chars without newlines. If you want to allow crossing newlines, you can use \s+ instead.
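A small sketch of the difference between the two whitespace classes, using a made-up two-line string (the sample text is hypothetical, and the patterns are shortened to the digit-joining part only):

```python
import re

def strip_ws(m):
    # Drop every whitespace character inside the matched text.
    return re.sub(r"\s+", "", m.group())

# Hypothetical input: the first line ends in a number, the second starts with one.
text = "dose 10 000\n500 mg"

# [^\S\n] means "whitespace, but not a newline", so the match stops at the line end.
same_line = re.sub(r"\d+(?:[^\S\n]+\d+)+", strip_ws, text)

# \s also matches the newline, so digits on the next line are pulled in too.
any_space = re.sub(r"\d+(?:\s+\d+)+", strip_ws, text)

print(repr(same_line))  # 'dose 10000\n500 mg'
print(repr(any_space))  # 'dose 10000500 mg'
```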

(?<![A-Z])\d+(?:[^\S\n]+\d+)+(?=[^\S\n]*(?:mg|IU)\b)

Regex demo

You can also omit the assertion for the unit at the end for the current example data:

(?<![A-Z])\d+(?:[^\S\n]+\d+)+
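The two variants agree on the example data, but they can diverge on other input. The following sketch compares them on an invented string (the sample and the helper name join_digits are assumptions for illustration only) where two separate numbers are followed by something other than mg or IU:

```python
import re

with_unit = r"(?<![A-Z])\d+(?:[^\S\n]+\d+)+(?=[^\S\n]*(?:mg|IU)\b)"
without_unit = r"(?<![A-Z])\d+(?:[^\S\n]+\d+)+"

def join_digits(pattern, s):
    # Replace each match with the same text minus its whitespace.
    return re.sub(pattern, lambda m: re.sub(r"\s+", "", m.group()), s)

# Hypothetical input: "2 10" are two separate numbers, "1 500" is one amount.
sample = "Box 2 10 tablets, 1 500 mg each"

print(join_digits(with_unit, sample))     # Box 2 10 tablets, 1500 mg each
print(join_digits(without_unit, sample))  # Box 210 tablets, 1500 mg each
```

Keeping the unit assertion is the more conservative choice if the real data may contain unrelated numbers next to each other.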

Example

import re

strings = [
    'Canesten 1 500 mg meka kapsula za rodnicui',
    'Clexane 10 000 IU (100 mg)/1 ml otopina za injekciju',
    'Humulin M3 100 IU/ml suspenzija za injekciju u ulošku',
    'Docile10 000 IU/ml oralne kapi, otopina',
    'POLYGYNAX 35 000 IU / 35 000 IU / 100 000 IU kapsula za rodnicu, meka',
    'Prostin E2 2 mg gel za rodnicu',
    'Silapen K 1 000 000 IU filmom obložene tablete'
]

pattern = r"(?<![A-Z])\d+(?:[^\S\n]+\d+)+(?=[^\S\n]*(?:mg|IU)\b)"

for s in strings:
    # The lambda receives each match and strips the whitespace inside it.
    print(re.sub(pattern, lambda x: re.sub(r"\s+", "", x.group()), s))

Output

Canesten 1500 mg meka kapsula za rodnicui
Clexane 10000 IU (100 mg)/1 ml otopina za injekciju
Humulin M3 100 IU/ml suspenzija za injekciju u ulošku
Docile10000 IU/ml oralne kapi, otopina
POLYGYNAX 35000 IU / 35000 IU / 100000 IU kapsula za rodnicu, meka
Prostin E2 2 mg gel za rodnicu
Silapen K 1000000 IU filmom obložene tablete
Answered By: The fourth bird