Preserve whitespace when using split() and join() in Python

Question:

I have a data file with columns like

BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77

and the individual columns are separated by a varying amount of whitespace.

My goal is to read in those lines, do some math on several rows, for example multiplying column 4 by .95, and write them out to a new file. The new file should look like the original one, except for the values that I modified.

My approach would be reading in the lines as items of a list. And then I would use split() on those rows I am interested in, which will give me a sublist with the individual column values. Then I do the modification, join() the columns together and write the lines from the list to a new text file.

The problem is the varying amount of whitespace. I don't know how to put it back exactly as I read it in. The only way I can think of is to count the characters in each line before splitting it, which would be very tedious. Does someone have a better idea for tackling this problem?

Asked By: user2015601


Answers:

You want to use re.split() in that case, with a group:

re.split(r'(\s+)', line)

would return both the columns and the whitespace so you can rejoin the line later with the same amount of whitespace included.

Example:

>>> re.split(r'(\s+)', line)
['BBP1', '   ', '0.000000', '  ', '-0.150000', '    ', '2.033000', '  ', '0.00', ' ', '-0.150', '   ', '1.77']

You probably do want to remove the newline from the end.
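As a minimal sketch of the full round trip, the snippet below scales column 4 by 0.95 and rejoins the line. It assumes the line has no leading whitespace (so columns land on even indices) and that six decimal places is the wanted output format; both are assumptions, not something the question states.

```python
import re

line = 'BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77\n'

# Strip the newline first so it does not show up as a trailing separator.
parts = re.split(r'(\s+)', line.rstrip('\n'))

# Columns sit at even indices and whitespace runs at odd indices,
# so the 1-based column 4 is at index 2 * (4 - 1) = 6.
idx = 2 * (4 - 1)
# Assumed output format: six decimals, matching the original column.
parts[idx] = '{:.6f}'.format(float(parts[idx]) * 0.95)

new_line = ''.join(parts) + '\n'
print(repr(new_line))
# 'BBP1   0.000000  -0.150000    1.931350  0.00 -0.150   1.77\n'
```

Because the new value is formatted to the same width as the old one, the following columns stay aligned; a wider or narrower value would shift them.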

Answered By: Martijn Pieters

Another way to do this is:

>>> s = 'BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77'
>>> s.split(' ')
['BBP1', '', '', '0.000000', '', '-0.150000', '', '', '', '2.033000', '', '0.00', '-0.150', '', '', '1.77']

If a single space is passed to split() as the separator, consecutive spaces are not collapsed; each extra space produces an empty string in the list. Joining the list with ' '.join() therefore restores the original number of spaces.
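A short sketch of how a column could be modified with this approach: the real columns are the non-empty strings, so their positions have to be looked up first. Note that this only works when the separators are plain spaces; a tab would stay attached to its neighbouring column. The six-decimal format is an assumption.

```python
s = 'BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77'
items = s.split(' ')

# Indices of the real columns, i.e. the non-empty strings.
cols = [i for i, item in enumerate(items) if item]

# Scale the 4th column (1-based) by 0.95; the format is an assumption.
i = cols[3]
items[i] = '{:.6f}'.format(float(items[i]) * 0.95)

# Joining with a single space puts every original space back.
result = ' '.join(items)
print(result)
# BBP1   0.000000  -0.150000    1.931350  0.00 -0.150   1.77
```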

Answered By: Gaurav Bishnoi

For lines that have whitespace at the beginning and/or end, a more robust pattern is (\S+) to split at non-whitespace characters:

import re

line1 = ' 4   426.2   orange\n'
line2 = '12    82.1   apple\n'

re_S = re.compile(r'(\S+)')
items1 = re_S.split(line1)
items2 = re_S.split(line2)
print(items1)  # [' ', '4', '   ', '426.2', '   ', 'orange', '\n']
print(items2)  # ['', '12', '    ', '82.1', '   ', 'apple', '\n']

These two lines have the same number of items after splitting, which is handy. The first and last items are always whitespace strings (possibly empty). The lines can be reconstituted by joining with a zero-length string:

print(repr(''.join(items1)))  # ' 4   426.2   orange\n'
print(repr(''.join(items2)))  # '12    82.1   apple\n'

By contrast, with the similar lower-case pattern (\s+) used in the other answer here, the two lines split into lists of different lengths, with the items at different positions:

re_s = re.compile(r'(\s+)')
print(re_s.split(line1))  # ['', ' ', '4', '   ', '426.2', '   ', 'orange', '\n', '']
print(re_s.split(line2))  # ['12', '    ', '82.1', '   ', 'apple', '\n', '']

As you can see, this would be a bit more difficult to process in a consistent manner.
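Since the (\S+) split always puts fields at odd indices, a small helper can edit a given column the same way on every line. The sketch below is only illustrative: the name scale_column, its parameters, and the one-decimal default format are hypothetical choices, not part of the answer above.

```python
import re

re_S = re.compile(r'(\S+)')

def scale_column(line, column, factor, fmt='{:.1f}'):
    """Multiply one 1-based column by factor, keeping all whitespace.

    Hypothetical helper; fmt is an assumed output format.
    """
    items = re_S.split(line)
    idx = 2 * column - 1  # fields are at odd indices, whitespace at even
    items[idx] = fmt.format(float(items[idx]) * factor)
    return ''.join(items)

lines = [' 4   426.2   orange\n', '12    82.1   apple\n']
scaled = [scale_column(line, 2, 2) for line in lines]
for line in scaled:
    print(repr(line))
# ' 4   852.4   orange\n'
# '12    164.2   apple\n'
```

The whitespace runs are untouched, so a new value that is wider than the old one (164.2 versus 82.1 here) pushes the following columns one character to the right.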

Answered By: Mike T