How to remove white space in between ascii and nonascii chars?

Question:

For example:

import re

s1 = 'LOGO 设计'
## s2 = '设计 LOGO'

s = re.sub('[a-zA-Z0-9]{3,}(\s)[^a-zA-Z0-9]', '', s1)

print(s)

I want to find at least 3 ASCII characters followed by a space and then a non-ASCII character, and replace that whitespace with an empty string. My code has two issues:

  1. How do I write the replacement string for (\s)?

  2. How to make it also work for the reverse order of s2?:

    [^a-zA-Z0-9]

Asked By: marlon

||

Answers:

Put the strings that you want to keep in the result in capture groups, then reference them in the replacement.

s = re.sub(r'([a-zA-Z0-9]{3})\s([^a-zA-Z0-9])', r'\1\2', s1)

You don’t need to use {3,}; {3} is enough. The match then covers only the last 3 characters before the space, and the replacement copies them back. All the preceding characters are kept by default because they’re not part of the match.
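As a minimal sketch of the capture-group approach described above (the function name `join_ascii_then_other` is made up for illustration), the following shows which inputs are changed and which are left alone:

```python
import re

# Group 1: three ASCII letters/digits; then one whitespace character;
# group 2: one character that is not an ASCII letter or digit.
pattern = re.compile(r'([a-zA-Z0-9]{3})\s([^a-zA-Z0-9])')

def join_ascii_then_other(text):
    # Hypothetical helper: \1 and \2 put back the characters on
    # either side of the space, so only the space itself disappears.
    return pattern.sub(r'\1\2', text)

print(join_ascii_then_other('LOGO 设计'))  # LOGO设计
print(join_ascii_then_other('设计 LOGO'))  # 设计 LOGO (reverse order is not handled)
print(join_ascii_then_other('AB 设计'))    # AB 设计 (fewer than 3 ASCII characters)
```

The second call shows why this pattern alone does not answer the second part of the question.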

You can also do it with lookarounds, by matching a space that’s preceded by 3 ASCII characters and followed by a non-ASCII. Then you replace the space with an empty string.

s = re.sub(r'(?<=[a-zA-Z0-9]{3})\s(?=[^a-zA-Z0-9])', '', s1)

You can use alternation with this method to match both orders:

s = re.sub(r'(?<=[a-zA-Z0-9]{3})\s(?=[^a-zA-Z0-9])|(?<=[^a-zA-Z0-9])\s(?=[a-zA-Z0-9]{3})', '', s1)
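Here is a small runnable sketch of the lookaround version with both orders, split over two lines for readability. The function name `remove_boundary_space` is an assumption made for this example only:

```python
import re

# Two alternatives, each matching only the whitespace character itself:
both_orders = re.compile(
    r'(?<=[a-zA-Z0-9]{3})\s(?=[^a-zA-Z0-9])'    # 3 ASCII chars, space, other char
    r'|(?<=[^a-zA-Z0-9])\s(?=[a-zA-Z0-9]{3})'   # other char, space, 3 ASCII chars
)

def remove_boundary_space(text):
    # Hypothetical helper: delete the matched space, keep everything else.
    return both_orders.sub('', text)

print(remove_boundary_space('LOGO 设计'))      # LOGO设计
print(remove_boundary_space('设计 LOGO'))      # 设计LOGO
print(remove_boundary_space('LOGO 设计 SKY'))  # LOGO设计SKY
print(remove_boundary_space('LOGO DESIGN'))   # LOGO DESIGN (ASCII on both sides)
```

Note that `[^a-zA-Z0-9]` matches anything that is not an ASCII letter or digit, including punctuation, so `'abc !'` becomes `'abc!'`. If only genuinely non-ASCII characters should count, `[^\x00-\x7F]` is one possible substitute for that class.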
Answered By: Barmar

With lookahead and lookbehind

s1 = 'LOGO 设计 SKY  आकाश'

st = re.split(r'(?<=[^a-zA-Z])(?=[a-zA-Z])', s1)

[re.sub(r'\s+', '', e) for e in st]

['LOGO设计', 'SKYआकाश']
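To make the two steps above easier to follow, here is a sketch that prints the intermediate split result as well. The function name `squeeze_groups` is hypothetical, and splitting on a zero-width match assumes Python 3.7 or later:

```python
import re

def squeeze_groups(text):
    # Zero-width split: cut wherever a non-letter is directly followed
    # by an ASCII letter (splitting on empty matches needs Python 3.7+).
    pieces = re.split(r'(?<=[^a-zA-Z])(?=[a-zA-Z])', text)
    # Remove every run of whitespace inside each piece.
    return [re.sub(r'\s+', '', piece) for piece in pieces]

s1 = 'LOGO 设计 SKY  आकाश'
print(re.split(r'(?<=[^a-zA-Z])(?=[a-zA-Z])', s1))
# ['LOGO 设计 ', 'SKY  आकाश']
print(squeeze_groups(s1))
# ['LOGO设计', 'SKYआकाश']
```

Unlike the `re.sub` answers, this returns a list of pieces rather than a single string, and it removes all whitespace inside each piece.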
Answered By: LetzerWille