Preserve whitespace when using split() and join() in Python
Question:
I have a data file with columns like
BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77
and the individual columns are separated by a varying number of whitespace characters.
My goal is to read in those lines, do some math on several rows, for example multiplying column 4 by .95, and write them out to a new file. The new file should look like the original one, except for the values that I modified.
My approach would be to read the lines in as items of a list. Then I would use split() on the rows I am interested in, which gives me a sublist with the individual column values. Then I do the modification, join() the columns together and write the lines from the list to a new text file.
The problem is that the amount of whitespace between columns varies, and I don't know how to put it back exactly as I read it in. The only way I could think of is to count characters in the line before I split it, which would be very tedious. Does someone have a better idea to tackle this problem?
Answers:
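You want to use re.split() in that case, with a capturing group:
re.split(r'(\s+)', line)
This returns both the columns and the whitespace between them, so you can rejoin the line later with the same amount of whitespace included.
Example:
>>> import re
>>> line = 'BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77'
>>> re.split(r'(\s+)', line)
['BBP1', '   ', '0.000000', '  ', '-0.150000', '    ', '2.033000', '  ', '0.00', ' ', '-0.150', '   ', '1.77']
You will probably want to strip the newline from the end of each line first (for example with line.rstrip('\n')); otherwise it comes back as an extra whitespace item followed by an empty string.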
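As a rough sketch of the whole round trip from the question (split with a capturing group, change one column, rejoin), something like the following could work. The helper name scale_column and the choice to keep the original number of decimals are assumptions made for illustration, not part of the answer above.

```python
import re


def scale_column(line, column, factor):
    """Multiply one whitespace-separated column (1-based) by `factor`.

    Hypothetical helper: every whitespace run is written back unchanged.
    """
    # Set the line ending aside so it does not show up as a whitespace item.
    body = line.rstrip("\n")
    ending = line[len(body):]

    # The capturing group makes re.split() return the separators as well.
    parts = re.split(r"(\s+)", body)

    # Fields sit at even indices; skip the empty strings that leading or
    # trailing whitespace would produce at the ends of the list.
    field_positions = [i for i in range(0, len(parts), 2) if parts[i]]
    idx = field_positions[column - 1]

    # Assumption: format the result with as many decimals as the original.
    old = parts[idx]
    decimals = len(old.partition(".")[2])
    parts[idx] = f"{float(old) * factor:.{decimals}f}"

    return "".join(parts) + ending


line = "BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77\n"
print(repr(scale_column(line, 4, 0.95)))
```

If the new value needs more or fewer characters than the old one, the columns to its right shift accordingly; only the whitespace runs themselves are preserved.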
Another way to do this is:
>>> s = 'BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77'
>>> s.split(' ')
['BBP1', '', '', '0.000000', '', '-0.150000', '', '', '', '2.033000', '', '0.00', '-0.150', '', '', '1.77']
If we pass a single space character as the argument to split(), it does not collapse successive spaces: every additional space produces an empty string in the list. Joining the list again with ' '.join() therefore restores the original number of spaces. Note that this only works for space characters; tabs and other whitespace are not treated as separators.
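A small sketch of how that could be used to change one column, assuming the separators are plain spaces only. The variable names and the fixed six-decimal format are illustrative choices, not something the answer prescribes.

```python
line = "BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77"

# Splitting on a literal space keeps one empty string per extra space,
# so joining with a single space reproduces the line exactly.
pieces = line.split(" ")
restored = " ".join(pieces)

# The real columns are the non-empty pieces; remember where they are.
field_positions = [i for i, piece in enumerate(pieces) if piece]

# Multiply column 4 (1-based) by 0.95, keeping six decimals (an assumption).
idx = field_positions[3]
pieces[idx] = f"{float(pieces[idx]) * 0.95:.6f}"
result = " ".join(pieces)

print(restored == line)
print(result)

# Limitation: a tab is not a space, so this line is not split at all.
print("a\tb".split(" "))
```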
For lines that have whitespace at the beginning and/or end, a more robust pattern is (\S+), which splits at runs of non-whitespace characters:
import re
line1 = ' 4 426.2 orange\n'
line2 = '12 82.1 apple\n'
re_S = re.compile(r'(\S+)')
items1 = re_S.split(line1)
items2 = re_S.split(line2)
print(items1) # [' ', '4', ' ', '426.2', ' ', 'orange', '\n']
print(items2) # ['', '12', ' ', '82.1', ' ', 'apple', '\n']
These two lines have the same number of items after splitting, which is handy. The first and last items are always whitespace strings (possibly empty), so the fields always sit at the odd indices. The lines can be reconstituted using a join with a zero-length string:
print(repr(''.join(items1))) # ' 4 426.2 orange\n'
print(repr(''.join(items2))) # '12 82.1 apple\n'
To contrast the example with the similar pattern (\s+) (lower-case) used in the other answer here, the two lines split into lists of different lengths, with the fields at different positions:
re_s = re.compile(r'(\s+)')
print(re_s.split(line1)) # ['', ' ', '4', ' ', '426.2', ' ', 'orange', '\n', '']
print(re_s.split(line2)) # ['12', ' ', '82.1', ' ', 'apple', '\n', '']
As you can see, this would be a bit more difficult to process in a consistent manner.
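Because splitting on non-whitespace always puts the fields at odd indices, picking a column becomes simple index arithmetic. The following is only a sketch under that observation; scale_field and its decimals parameter are hypothetical names introduced here.

```python
import re

NON_WS = re.compile(r"(\S+)")


def scale_field(line, column, factor, decimals):
    """Multiply the 1-based `column` by `factor`; whitespace stays as-is."""
    items = NON_WS.split(line)
    # Items alternate whitespace, field, whitespace, ... so that
    # column n is found at index 2*n - 1, with or without leading spaces.
    idx = 2 * column - 1
    items[idx] = f"{float(items[idx]) * factor:.{decimals}f}"
    return "".join(items)


print(repr(scale_field(" 4 426.2 orange\n", 2, 0.95, 1)))
print(repr(scale_field("12 82.1 apple\n", 2, 0.95, 1)))
```

The trailing newline needs no special handling here, since it is simply the last whitespace item.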