String split on new line, tab and some number of spaces
Question:
I’m trying to perform a string split on a set of somewhat irregular data that looks something like:
\n\tName: John Smith
\n\t Home: Anytown USA
\n\t Phone: 555-555-555
\n\t Other Home: Somewhere Else
\n\t Notes: Other data
\n\tName: Jane Smith
\n\t Misc: Data with spaces
I’d like to convert this into a tuple/dict where I will later split on the colon (:), but first I need to get rid of all the extra whitespace. I’m guessing a regex is the best way, but I can’t seem to get one that works. Below is my attempt:
data_string.split('\n\t *')
Answers:
Just use .strip(); it removes all leading and trailing whitespace for you, including tabs and newlines. The splitting itself can then be done with data_string.splitlines():
[s.strip() for s in data_string.splitlines()]
Output:
>>> [s.strip() for s in data_string.splitlines()]
['Name: John Smith', 'Home: Anytown USA', 'Phone: 555-555-555', 'Other Home: Somewhere Else', 'Notes: Other data', 'Name: Jane Smith', 'Misc: Data with spaces']
You can even inline the splitting on the colon as well now:
>>> [s.strip().split(': ') for s in data_string.splitlines()]
[['Name', 'John Smith'], ['Home', 'Anytown USA'], ['Phone', '555-555-555'], ['Other Home', 'Somewhere Else'], ['Notes', 'Other data'], ['Name', 'Jane Smith'], ['Misc', 'Data with spaces']]
You can use line.strip().split(":") on each line:
>>> ary = []
>>> for line in s.splitlines():
...     line = line.strip()
...     if not line: continue
...     ary.append(line.split(":"))
...
>>> ary
[['Name', ' John Smith'], ['Home', ' Anytown USA'], ['Misc', ' Data with spaces']]
>>> dict(ary)
{'Home': ' Anytown USA', 'Misc': ' Data with spaces', 'Name': ' John Smith'}
>>>
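Note that splitting on ":" alone leaves the leading space in each value, and a value that itself contains a colon would produce three pieces, which dict() rejects. A minimal sketch of one way around both, assuming every non-blank line contains at least one colon (the sample string and the Notes value are made up for illustration):

```python
# Hypothetical sample with real newline and tab characters.
s = "\n\tName: John Smith\n\t  Home: Anytown USA\n\t Notes: call after 5:30"

ary = []
for line in s.splitlines():
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # maxsplit=1 splits on the first colon only, so later colons stay in the value
    key, value = line.split(":", 1)
    ary.append((key.strip(), value.strip()))

result = dict(ary)
```

A line with no colon at all would raise ValueError at the unpacking step, so real input may need an extra check.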
You can kill two birds with one regex stone:
>>> r = """
... \n\tName: John Smith
... \n\t Home: Anytown USA
... \n\t Phone: 555-555-555
... \n\t Other Home: Somewhere Else
... \n\t Notes: Other data
... \n\tName: Jane Smith
... \n\t Misc: Data with spaces
... """
>>> import re
>>> print(re.findall(r'(\S[^:]+):\s*(.*\S)', r))
[('Name', 'John Smith'), ('Home', 'Anytown USA'), ('Phone', '555-555-555'), ('Other Home', 'Somewhere Else'), ('Notes', 'Other data'), ('Name', 'Jane Smith'), ('Misc', 'Data with spaces')]
>>>
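One thing to watch with a findall pattern like this: [^:]+ can match across newlines, so a line without a colon may get absorbed into the next line's key. A possible variant, offered as a sketch rather than the answer's own method, anchors the pattern to each line with re.MULTILINE. The name LINE_RE and the sample text are assumptions for illustration:

```python
import re

# One match per line: optional indentation, key up to the first colon, then the value.
# [^:\n] keeps the key on a single line; [ \t]* trims spaces and tabs around each part.
LINE_RE = re.compile(r"^[ \t]*([^:\n]+?)[ \t]*:[ \t]*(.*?)[ \t]*$", re.MULTILINE)

# Hypothetical sample, including one line that has no colon.
text = (
    "\n\tName: John Smith"
    "\n\t  Home: Anytown USA"
    "\n\tno colon here"
    "\n\t Misc: Data with spaces"
)

# findall returns one (key, value) tuple per matching line.
pairs = LINE_RE.findall(text)
```

Lines that do not fit the key: value shape are simply skipped instead of bleeding into a neighbouring match.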
Regexes aren’t really the best tool for the job here. As others have said, using a combination of str.strip() and str.split() is the way to go. Here’s a one-liner to do it:
>>> data = '''\n\tName: John Smith
... \n\t Home: Anytown USA
... \n\t Phone: 555-555-555
... \n\t Other Home: Somewhere Else
... \n\t Notes: Other data
... \n\tName: Jane Smith
... \n\t Misc: Data with spaces'''
>>> {line.strip().split(': ')[0]:line.split(': ')[1] for line in data.splitlines() if line.strip() != ''}
{'Name': 'Jane Smith', 'Other Home': 'Somewhere Else', 'Notes': 'Other data', 'Misc': 'Data with spaces', 'Phone': '555-555-555', 'Home': 'Anytown USA'}
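Because Name appears twice in the sample, a single dict keeps only the last value (Jane Smith) and John Smith is lost. If the data really describes several people, one option is a list of dicts. This is only a sketch: the function name parse_records and the assumption that every record begins with a Name line are mine, not something stated in the question.

```python
# Shortened sample with real newline and tab characters.
data = (
    "\n\tName: John Smith"
    "\n\t  Home: Anytown USA"
    "\n\tName: Jane Smith"
    "\n\t  Misc: Data with spaces"
)

def parse_records(text, first_key="Name"):
    """Group 'Key: value' lines into one dict per record.

    Assumption: each record starts with a line whose key is `first_key`.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        key, _, value = line.partition(": ")
        if key == first_key or not records:
            records.append({})  # a new record begins here
        records[-1][key] = value
    return records

records = parse_records(data)
```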
If you look at the documentation for str.split:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
In other words, if you’re trying to figure out what to pass to split to get from '\n\tName: Jane Smith' to ['Name:', 'Jane', 'Smith'], just pass nothing (or None).
This almost solves your whole problem. There are two parts left.
First, you’ve only got two fields, the second of which can contain spaces. So, you only want one split, not as many as possible. So:
s.split(None, 1)
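Next, you’ve still got those pesky colons. But you don’t need to split on them. For a one-word key such as Name, the colon always appears at the end of the first field, with no space before it and always a space after, so you can just remove it. (A key that itself contains a space, such as Other Home, would be cut at its first space, so this shortcut only suits one-word keys.)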
key, value = s.split(None, 1)
key = key[:-1]
There are a million other ways to do this, of course; this is just the one that seems closest to what you were already trying.
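Put together, the whitespace-split approach might look like the sketch below. The helper name parse_line is hypothetical, and rstrip(":") is used in place of key[:-1] so that a first field without a trailing colon is left intact. The second call shows what happens with a key that contains a space.

```python
def parse_line(s):
    """Split one line into (key, value) at the first run of whitespace."""
    # With sep=None, leading whitespace (including \n and \t) is skipped,
    # and maxsplit=1 leaves the rest of the line as a single value.
    key, value = s.split(None, 1)
    # Drop the trailing colon from the key and any trailing whitespace from the value.
    return key.rstrip(":"), value.rstrip()

single = parse_line("\n\tName: Jane Smith")
multi_word = parse_line("\n\t  Other Home: Somewhere Else")
```

Here single is ('Name', 'Jane Smith'), while multi_word comes out as ('Other', 'Home: Somewhere Else'), which is why splitting on the colon is the safer choice when keys may contain spaces.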
I had to split a string on newline (\n) and tab (\t). What I did was first replace \n with \t and then split on \t:
example_arr = example_string.replace("\n", "\t").split("\t")
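With input like the question's, where a newline is immediately followed by a tab, this replace-then-split approach produces empty strings between adjacent delimiters. A small sketch of the effect and one way to clean it up (the sample string is an assumption):

```python
# Hypothetical sample with real newline and tab characters.
example_string = "\n\tName: John Smith\n\t Home: Anytown USA"

# Turn every newline into a tab, then split on the one remaining delimiter.
raw_parts = example_string.replace("\n", "\t").split("\t")

# Adjacent delimiters leave empty strings behind; strip each piece and drop the blanks.
example_arr = [part.strip() for part in raw_parts if part.strip()]
```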
If you want to split your data on more than one delimiter, say tabs and newlines, note that the Boolean operator "or" does not do this: "\t" or "\n" simply evaluates to "\t", so passing it to split would split on tabs only. Use re.split with a character class instead:
import re
mysplit = re.split("[\t\n]", data_string)
Alternatively, you can replace each delimiter with one common delimiter, such as a comma, and then split on that (for example, when you want a .csv-style string as well as the final split list):
mylist = data_string.replace("\t", ",").replace("\n", ",")
mysplit = mylist.split(",")
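A quick way to check how "or" behaves between two strings, next to a regex split that handles both delimiters. This is an illustrative sketch and the sample string is made up:

```python
import re

data_string = "a\tb\nc\t\nd"

# `or` returns its first truthy operand, so this expression is just "\t".
sep = "\t" or "\n"

# Splitting on it therefore leaves the newlines in place.
tab_only = data_string.split(sep)

# A character class matches either delimiter; `+` collapses runs of them.
both = re.split(r"[\t\n]+", data_string)
```

tab_only still contains newline characters, whereas both holds the four clean fields.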