String split on new line, tab and some number of spaces

Question:

I’m trying to perform a string split on a set of somewhat irregular data that looks something like:

\n\tName: John Smith
\n\t  Home: Anytown USA
\n\t    Phone: 555-555-555
\n\t  Other Home: Somewhere Else
\n\t Notes: Other data
\n\tName: Jane Smith
\n\t  Misc: Data with spaces

I’d like to convert this into a tuple/dict where I will later split on the colon :, but first I need to get rid of all the extra whitespace. I’m guessing a regex is the best way, but I can’t seem to get one that works. Below is my attempt.

data_string.split('\n\t *')
Asked By: PopeJohnPaulII


Answers:

Just use .strip(); it removes leading and trailing whitespace, including tabs and newlines, from each piece. The splitting itself can then be done with data_string.splitlines():

[s.strip() for s in data_string.splitlines()]

Output:

>>> [s.strip() for s in data_string.splitlines()]
['Name: John Smith', 'Home: Anytown USA', 'Phone: 555-555-555', 'Other Home: Somewhere Else', 'Notes: Other data', 'Name: Jane Smith', 'Misc: Data with spaces']

You can even inline the splitting on : as well now:

>>> [s.strip().split(': ') for s in data_string.splitlines()]
[['Name', 'John Smith'], ['Home', 'Anytown USA'], ['Phone', '555-555-555'], ['Other Home', 'Somewhere Else'], ['Notes', 'Other data'], ['Name', 'Jane Smith'], ['Misc', 'Data with spaces']]
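
If a value could itself contain ": ", the plain split(': ') above would return more than two items for that line. A minimal sketch of one way around that, assuming each non-empty line has exactly one key followed by ": " (the Notes value below is a hypothetical example, not part of the question's data):

```python
# Sample data modelled on the question; the last line is a hypothetical
# addition whose value itself contains ": ".
data_string = "\n\tName: John Smith\n\t  Home: Anytown USA\n\t Notes: see: other data"

pairs = []
for line in data_string.splitlines():
    line = line.strip()
    if not line:
        # Skip lines that are empty after stripping (here, the first one).
        continue
    # maxsplit=1 splits only on the first ": ", so the value stays intact.
    key, value = line.split(": ", 1)
    pairs.append((key, value))

print(pairs)
```

Passing a maxsplit of 1 keeps every result a two-item pair, which is what dict() or tuple unpacking expects.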
Answered By: Martijn Pieters

You can use this

string.strip().split(":")
Answered By: Rakesh

>>> ary = []
>>> for line in s.splitlines():
...     line = line.strip()
...     if not line: continue
...     ary.append(line.split(":"))
...
>>> ary
[['Name', ' John Smith'], ['Home', ' Anytown USA'], ['Misc', ' Data with spaces']]
>>> dict(ary)
{'Home': ' Anytown USA', 'Misc': ' Data with spaces', 'Name': ' John Smith'}
>>>
Answered By: Joran Beasley

You can kill two birds with one regex stone:

>>> r = """
... \n\tName: John Smith
... \n\t  Home: Anytown USA
... \n\t    Phone: 555-555-555
... \n\t  Other Home: Somewhere Else
... \n\t Notes: Other data
... \n\tName: Jane Smith
... \n\t  Misc: Data with spaces
... """
>>> import re
>>> print(re.findall(r'(\S[^:]+):\s*(.*\S)', r))
[('Name', 'John Smith'), ('Home', 'Anytown USA'), ('Phone', '555-555-555'), ('Other Home', 'Somewhere Else'), ('Notes', 'Other data'), ('Name', 'Jane Smith'), ('Misc', 'Data with spaces')]
>>> 
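
For comparison, the pattern from the question's own attempt can also be made to work if it is handed to re.split rather than str.split, since str.split treats its argument as a literal string. A minimal sketch, using a shortened copy of the sample data:

```python
import re

# Shortened copy of the question's data; "\n" and "\t" are real control characters.
data_string = "\n\tName: John Smith\n\t  Home: Anytown USA\n\t    Phone: 555-555-555"

# re.split interprets the pattern as a regex: newline, tab, then any number of spaces.
parts = re.split(r"\n\t *", data_string)

# The string starts with a delimiter, so the first element is empty; drop it.
fields = [p for p in parts if p]

print(fields)
```

Each remaining item is a "Key: value" string that can then be split on the colon.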
Answered By: georg

Regexes aren’t really the best tool for the job here. As others have said, using a combination of str.strip() and str.split() is the way to go. Here’s a one-liner to do it:

>>> data = '''\n\tName: John Smith
... \n\t  Home: Anytown USA
... \n\t    Phone: 555-555-555
... \n\t  Other Home: Somewhere Else
... \n\t Notes: Other data
... \n\tName: Jane Smith
... \n\t  Misc: Data with spaces'''
>>> {line.strip().split(': ')[0]:line.split(': ')[1] for line in data.splitlines() if line.strip() != ''}
{'Name': 'Jane Smith', 'Other Home': 'Somewhere Else', 'Notes': 'Other data', 'Misc': 'Data with spaces', 'Phone': '555-555-555', 'Home': 'Anytown USA'}
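
Note that the output above keeps only one Name, because a dict holds one value per key and the later Jane Smith entry overwrites John Smith. If each person should be kept separately, one option is a list of dicts. This is only a sketch, and it assumes that every record begins with a Name line, which the question does not state explicitly:

```python
# Shortened copy of the question's data with two records.
data = "\n\tName: John Smith\n\t  Home: Anytown USA\n\tName: Jane Smith\n\t  Misc: Data with spaces"

records = []
for line in data.splitlines():
    line = line.strip()
    if not line:
        continue
    key, value = line.split(": ", 1)
    # Assumption: a "Name" line marks the start of a new record.
    # The "not records" check also covers data that does not start with Name.
    if key == "Name" or not records:
        records.append({})
    records[-1][key] = value

print(records)
```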
Answered By: Matthew Adams

If you look at the documentation for str.split:

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].

In other words, if you’re trying to figure out what to pass to split to get '\n\tName: Jane Smith' to ['Name:', 'Jane', 'Smith'], just pass nothing (or None).

This almost solves your whole problem. There are two parts left.

First, you’ve only got two fields, the second of which can contain spaces. So, you only want one split, not as many as possible. So:

s.split(None, 1)

Next, you’ve still got those pesky colons. But you don’t need to split on them. At least given the data you’ve shown us, the colon always appears at the end of the first field, with no space before and always space after, so you can just remove it:

key, value = s.split(None, 1)
key = key[:-1]

There are a million other ways to do this, of course; this is just the one that seems closest to what you were already trying.
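
Put together, the two steps might look like the sketch below. The sample lines are taken from the question; the variable names lines, pairs and caveat are only illustrative:

```python
# Two lines from the question, each with a single-word key.
lines = ["\n\tName: Jane Smith", "\t    Phone: 555-555-555"]

pairs = []
for s in lines:
    # Split once on any run of whitespace; leading whitespace is ignored.
    key, value = s.split(None, 1)
    # Drop the trailing colon from the key.
    key = key[:-1]
    pairs.append((key, value))

print(pairs)

# Caveat: a key that contains a space is cut at that space.
caveat = "  Other Home: Somewhere Else".split(None, 1)
print(caveat)
```

As the last two lines show, this works for single-word keys only. For a key such as Other Home, splitting on the colon is the safer choice.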

Answered By: abarnert

I had to split a string on newline (\n) and tab (\t). What I did was first replace \n with \t and then split on \t:

example_arr = example_string.replace("\n", "\t").split("\t")
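
On data shaped like the question's, this approach leaves empty strings and leading spaces in the result, because the delimiters sit next to each other. A small sketch of what comes out and one way to tidy it (the names raw and cleaned are only illustrative):

```python
# Shortened copy of the question's data.
example_string = "\n\tName: John Smith\n\t  Home: Anytown USA"

# Replace newlines with tabs, then split on tabs.
raw = example_string.replace("\n", "\t").split("\t")

# Adjacent delimiters produce empty strings, so strip each piece and
# keep only the ones that still have content.
cleaned = [part.strip() for part in raw if part.strip()]

print(raw)
print(cleaned)
```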
Answered By: Joe

If you want to split your data on more than one delimiter, say tabs and newlines, be aware that str.split accepts only a single separator. An expression such as "\t" or "\n" just evaluates to "\t", so passing it to split only splits on tabs. To split on either character, use re.split with a character class:

import re
mysplit = re.split(r"[\t\n]", data_string)

You can also replace each delimiter with one common delimiter, such as a comma, and then split on that (for example, when you want a comma-separated string similar to the .csv format as well as the final split list):

mylist = data_string.replace("\t", ",").replace("\n", ",")
mysplit = mylist.split(",")
Answered By: Hamed Sabagh