How can I remove all unneeded whitespace in a string but keep symbols like '\n'?

Question:

I have such a string:

s = 'Hello   \nWorld!\nToday is a wonderful day'

And I need to get this:

'Hello \nWorld!\nToday is a wonderful day'

I tried to use split and join like:

' '.join('Hello   \nWorld!\nToday is a wonderful day'.split())

But I’m getting this:

'Hello World! Today is a wonderful day'

Regular expressions like:

re.sub(r"\s+", " ", 'Hello   \nWorld!\nToday is a wonderful day')

are giving the same result.

Asked By: ArtGMlg

Answers:

There are several things you could do.

You could replace only runs of one or more space characters with a single space:

re.sub(r'( )+', ' ', s)                                                         

To cover more types of (horizontal) whitespace, you could include tab (\t) and form feed (\f) characters (see regex101):

re.sub(r'[\t\f ]+', ' ', s)

Alternatively, instead of specifying the characters you do want to replace, you could exclude the ones you don’t want to replace (double negative!):

re.sub(r'[^\S\n\r]+', ' ', s)

In this last example, the ^ signifies that any character not present in the list should be matched, \S signifies all non-whitespace characters, and \n and \r are the newline and carriage return characters. Taken together, the class matches any whitespace character except \n and \r. See regex101.
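
As a quick check, the three patterns above can be run side by side. This is only a minimal sketch: the variable names and the second sample string (which mixes a tab in with the spaces) are assumptions added here to show where the patterns differ.

```python
import re

s = 'Hello   \nWorld!\nToday is a wonderful day'  # string from the question
s_tab = 'Hello \t \nWorld!'  # hypothetical input: space, tab, space, newline

# 1) Collapse runs of plain spaces only.
spaces_only = re.sub(r'( )+', ' ', s)

# 2) Collapse runs of spaces, tabs and form feeds.
horizontal = re.sub(r'[\t\f ]+', ' ', s_tab)

# 3) Collapse runs of any whitespace except \n and \r.
not_newline = re.sub(r'[^\S\n\r]+', ' ', s_tab)

# Pattern 1 does not treat the tab as part of a run, so s_tab is unchanged.
spaces_only_on_tab = re.sub(r'( )+', ' ', s_tab)

print(repr(spaces_only))
print(repr(horizontal))
print(repr(not_newline))
print(repr(spaces_only_on_tab))
```

On input that contains only spaces and newlines, all three patterns behave the same; they diverge once tabs or other horizontal whitespace appear.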

Answered By: buddemat

Using str's methods, you can get the desired output the following way:

s = 'Hello   \nWorld!\nToday is a wonderful day'
' '.join(i for i in s.split(' ') if i)

gives

'Hello \nWorld!\nToday is a wonderful day'

Explanation: split at space characters, then use a generator expression to filter out the empty strings (these arise from adjacent spaces), then join what is left.
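
The same idea can be wrapped in a small helper to see its edge cases. This is a sketch only; the function name collapse_spaces and the extra sample inputs are hypothetical and not part of the answer.

```python
def collapse_spaces(text):
    # split(' ') yields an empty string between two adjacent spaces;
    # the `if part` filter drops those before joining with one space.
    return ' '.join(part for part in text.split(' ') if part)

s = 'Hello   \nWorld!\nToday is a wonderful day'
result = collapse_spaces(s)

# Only the space character is a separator, so tabs and newlines survive.
tabs_kept = collapse_spaces('a\t\tb')

# Leading and trailing spaces are removed entirely, not reduced to one.
edges = collapse_spaces(' a  b ')

print(repr(result))
print(repr(tabs_kept))
print(repr(edges))
```

Note that this approach strips leading and trailing spaces completely, whereas a regex substitution would leave a single space in those positions.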

Answered By: Daweo

Here are two ways to do that for each of two interpretations of the question.

First interpretation: where there are two or more of the same whitespace character in a row, other than newlines (\n), remove all but one of those characters.

Replace each match of the regular expression

([ \t\r\f\v])\1*(?=\1)

with an empty string.

Demo

This regular expression has the following elements.

(               Begin capture group 1
  [ \t\r\f\v]   Match a whitespace character other than a newline (`\n`)
)               End capture group 1
\1*             Match the content of capture group 1 zero or more times
(?=\1)          Positive lookahead asserts that the next character matches
                the content of capture group 1

Alternatively, replace each match of

([ \t\r\f\v])\1+

with the contents of capture group 1.

Demo

This regular expression has the following elements.

(               Begin capture group 1
  [ \t\r\f\v]   Match a whitespace character other than `\n`
)               End capture group 1
\1+             Match the content of capture group 1 one or more times
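
In Python, both variants for the first interpretation might look like the sketch below. The sample string with a run of spaces followed by a run of tabs is a hypothetical input chosen to show that each run of identical characters is reduced separately.

```python
import re

# Hypothetical sample: three spaces, then two tabs, then a newline.
s = 'Hello   \t\t\nWorld!'

# Variant 1: delete every character of a run of identical whitespace
# except the last one (the lookahead leaves that one in place).
v1 = re.sub(r'([ \t\r\f\v])\1*(?=\1)', '', s)

# Variant 2: replace each run of two or more identical whitespace
# characters with a single copy (the content of capture group 1).
v2 = re.sub(r'([ \t\r\f\v])\1+', r'\1', s)

# The string from the question, for comparison.
q = re.sub(r'([ \t\r\f\v])\1+', r'\1', 'Hello   \nWorld!\nToday is a wonderful day')

print(repr(v1))
print(repr(v2))
print(repr(q))
```

Both variants keep one space and one tab here, because the back-reference only matches repeats of the same character.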

Second interpretation: where there are two or more whitespace characters in a row, other than newlines (\n), remove all but the last whitespace character.

Replace each match of the regular expression

[ \t\r\f\v](?=[ \t\r\f\v])

with an empty string.

Demo

This regular expression has the following elements.

[ \t\r\f\v]     Match a whitespace character other than `\n`
(?=             Begin a positive lookahead
  [ \t\r\f\v]   Match a whitespace character other than `\n`
)               End positive lookahead
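
A minimal Python sketch of the lookahead approach for the second interpretation follows. The name PATTERN and the sample strings are assumptions made for illustration.

```python
import re

# Any non-newline whitespace character that is followed by another one.
PATTERN = r'[ \t\r\f\v](?=[ \t\r\f\v])'

# Hypothetical sample: space, tab, space, then a newline.
s = 'Hello \t \nWorld!'
result = re.sub(PATTERN, '', s)

# The last character of a mixed run is the one that survives, here a tab.
mixed = re.sub(PATTERN, '', 'a \tb')

print(repr(result))
print(repr(mixed))
```

Unlike the first interpretation, a mixed run such as a space followed by a tab is reduced to its final character, whatever that character is.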

Alternatively, replace each match of

[ \t\r\f\v]{2,}

Demo

This regular expression reads, "match a whitespace character other than a newline (\n) two or more times, as many as possible".

Answered By: Cary Swoveland