How can I remove all unneeded whitespace in a string but keep symbols like '\n'?
Question:
I have a string like this:
s = 'Hello  \nWorld!\nToday    is a   wonderful day'
And I need to get this:
'Hello \nWorld!\nToday is a wonderful day'
I tried to use `split` and `join` like:
' '.join('Hello  \nWorld!\nToday    is a   wonderful day'.split())
But I’m getting this:
'Hello World! Today is a wonderful day'
Regular expressions like:
re.sub(r"\s+", " ", 'Hello  \nWorld!\nToday    is a   wonderful day')
are giving the same result.
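Both attempts lose the newlines for the same reason: `str.split()` with no argument and the `\s` class each treat `\n` as ordinary whitespace. A minimal check, assuming the sample string below (the exact number of spaces is an assumption):

```python
import re

# Assumed sample input with runs of spaces and two newlines.
s = 'Hello  \nWorld!\nToday    is a   wonderful day'

# split() with no argument splits on any whitespace run, newlines included.
joined = ' '.join(s.split())

# \s matches '\n' too, so every newline is folded into a single space.
substituted = re.sub(r'\s+', ' ', s)

print(repr(joined))
print(joined == substituted)
print('\n' in joined)
```

Neither result contains a newline any more, which is why a narrower pattern or `split(' ')` is needed.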
Answers:
There are several things you could do.
You could simply replace every run of one or more spaces with a single space:
re.sub(r'( )+', ' ', s)
To cover more types of (horizontal) whitespace, you could include tab (`\t`) and form feed (`\f`) characters (see regex101):
re.sub(r'[\t\f ]+', ' ', s)
Alternatively, instead of specifying the characters you do want to replace, you could exclude the ones you don’t want to replace (double negative!):
re.sub(r'[^\S\n\r]+', ' ', s)
In this last example, the `^` signifies that any character not present in the list should be matched, the `\S` signifies all non-whitespace characters, and `\n` and `\r` are newline and carriage return characters. See regex101.
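To see how the three patterns differ, here is a small sketch run on a made-up test string (the `sample` value and the variable names are only for illustration):

```python
import re

# Hypothetical test string: spaces, a tab, a vertical tab, and line breaks.
sample = 'a \t b\n\n  c\r\nd \v e'

# Only runs of the space character are collapsed; tab and \v stay.
spaces_only = re.sub(r'( )+', ' ', sample)

# Spaces, tabs and form feeds are collapsed together; \v is not in the class.
horizontal = re.sub(r'[\t\f ]+', ' ', sample)

# Every whitespace character except \n and \r is collapsed.
not_newlines = re.sub(r'[^\S\n\r]+', ' ', sample)

print(repr(spaces_only))   # 'a \t b\n\n c\r\nd \x0b e'
print(repr(horizontal))    # 'a b\n\n c\r\nd \x0b e'
print(repr(not_newlines))  # 'a b\n\n c\r\nd e'
```

The negated class is the broadest of the three, since it also catches whitespace characters nobody thought to list, while still leaving every line break in place.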
Using `str`'s methods you might get the desired output the following way:
s1 = 'Hello  \nWorld!\nToday    is a   wonderful day'
' '.join(i for i in s1.split(' ') if i)
gives
'Hello \nWorld!\nToday is a wonderful day'
Explanation: split at space characters, then use a generator expression to filter out the empty strings (these arise from adjacent spaces), then join what is left.
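The same idea wrapped in a function, as a rough sketch (the name `collapse_spaces` is made up here). It only touches the space character, so tabs and newlines pass through unchanged:

```python
def collapse_spaces(text: str) -> str:
    """Collapse runs of the space character; leave other characters alone."""
    # split(' ') splits on each single space, so adjacent spaces produce
    # empty strings, which the filter drops before joining again.
    return ' '.join(part for part in text.split(' ') if part)


print(repr(collapse_spaces('Hello  \nWorld!\nToday    is a   wonderful day')))
# 'Hello \nWorld!\nToday is a wonderful day'
```

One side effect to be aware of: leading and trailing spaces are removed entirely rather than reduced to one, because they also produce empty strings.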
Here are two ways to do that for each of two interpretations of the question.
First interpretation: where there are two or more of the same whitespace character in a row, other than newlines (`\n`), remove all but one of those characters.
Replace each match of the regular expression
([ \t\r\f\v])\1*(?=\1)
with an empty string.
This regular expression has the following elements.
(               Begin capture group 1
[ \t\r\f\v]     Match a whitespace character other than a newline (`\n`)
)               End capture group 1
\1*             Match the content of capture group 1 zero or more times
(?=\1)          Positive lookahead asserts that the next character matches
                the content of capture group 1
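In Python this could look like the sketch below (the function name and the test string are illustrative only). Note that a space followed by a tab is left alone, because the lookahead requires the same character again:

```python
import re

# A whitespace character (not \n), optional repeats of it, and a lookahead
# requiring one more copy, so the last copy in the run is never consumed.
SAME_CHAR_RUN = re.compile(r'([ \t\r\f\v])\1*(?=\1)')


def dedupe_same_whitespace(text: str) -> str:
    """Reduce runs of one repeated whitespace character to a single copy."""
    return SAME_CHAR_RUN.sub('', text)


print(repr(dedupe_same_whitespace('a   b\t\tc \td\n\ne')))
# 'a b\tc \td\n\ne'
```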
Alternatively, replace each match of
([ \t\r\f\v])\1+
with the contents of capture group 1.
This regular expression has the following elements.
(               Begin capture group 1
[ \t\r\f\v]     Match a whitespace character other than `\n`
)               End capture group 1
\1+             Match the content of capture group 1 one or more times
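With `re.sub`, "the contents of capture group 1" is written as `r'\1'` in the replacement string. A short sketch, assuming a sample string with runs of spaces (the name `squeeze_repeats` is made up):

```python
import re


def squeeze_repeats(text: str) -> str:
    """Replace each run of a repeated whitespace character with one copy."""
    # Group 1 holds the first character of the run; \1+ consumes the repeats,
    # and the replacement puts a single copy back.
    return re.sub(r'([ \t\r\f\v])\1+', r'\1', text)


print(repr(squeeze_repeats('Hello  \nWorld!\nToday    is a   wonderful day')))
# 'Hello \nWorld!\nToday is a wonderful day'
```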
Second interpretation: where there are two or more whitespace characters in a row, other than newlines (`\n`), remove all but the last whitespace character.
Replace each match of the regular expression
[ \t\r\f\v](?=[ \t\r\f\v])
with an empty string.
This regular expression has the following elements.
[ \t\r\f\v]     Match a whitespace character other than `\n`
(?=             Begin a positive lookahead
[ \t\r\f\v]     Match a whitespace character other than `\n`
)               End positive lookahead
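A possible Python rendering of this second interpretation (function name and inputs are illustrative). Here mixed runs such as space-tab-space are reduced as well, and whichever character comes last in the run is the one that survives:

```python
import re

# Any non-newline whitespace character that is directly followed by another
# one is dropped, so only the last character of each run remains.
NOT_LAST_IN_RUN = re.compile(r'[ \t\r\f\v](?=[ \t\r\f\v])')


def keep_last_whitespace(text: str) -> str:
    """Reduce each run of non-newline whitespace to its final character."""
    return NOT_LAST_IN_RUN.sub('', text)


print(repr(keep_last_whitespace('a \t b')))          # 'a b'
print(repr(keep_last_whitespace('a  \tb')))          # 'a\tb'
print(repr(keep_last_whitespace('Hello  \nWorld')))  # 'Hello \nWorld'
```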
Alternatively, replace each match of
[ \t\r\f\v]{2,}
This regular expression reads, "match a whitespace character other than a newline (`\n`) two or more times, as many as possible".