Get last "column" after .str.split() operation on column in pandas DataFrame
Question:
I have a column in a pandas DataFrame that I would like to split on a single space. The splitting is simple enough with DataFrame.str.split(' ')
, but I can’t make a new column from the last entry. When I .str.split()
the column I get a list of arrays and I don’t know how to manipulate this to get a new column for my DataFrame.
Here is an example. Each entry in the column contains ‘symbol data price’ and I would like to split off the price (and eventually remove the “p”… or “c” in half the cases).
import pandas as pd
temp = pd.DataFrame({'ticker' : ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
temp2 = temp.ticker.str.split(' ')
which yields
0 ['spx', '5/25/2001', 'p500']
1 ['spx', '5/25/2001', 'p600']
2 ['spx', '5/25/2001', 'p700']
But temp2[0]
just gives one list entry’s array and temp2[:][-1]
fails. How can I convert the last entry in each array to a new column? Thanks!
Answers:
You could use the tolist
method as an intermediary:
In [99]: import pandas as pd
In [100]: d1 = pd.DataFrame({'ticker' : ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
In [101]: d1.ticker.str.split().tolist()
Out[101]:
[['spx', '5/25/2001', 'p500'],
['spx', '5/25/2001', 'p600'],
['spx', '5/25/2001', 'p700']]
From which you could make a new DataFrame:
In [102]: d2 = pd.DataFrame(d1.ticker.str.split().tolist(),
.....: columns="symbol date price".split())
In [103]: d2
Out[103]:
symbol date price
0 spx 5/25/2001 p500
1 spx 5/25/2001 p600
2 spx 5/25/2001 p700
For good measure, you could fix the price:
In [104]: d2["price"] = d2["price"].str.replace("p","").astype(float)
In [105]: d2
Out[105]:
symbol date price
0 spx 5/25/2001 500
1 spx 5/25/2001 600
2 spx 5/25/2001 700
PS: but if you really just want the last column, apply
would suffice:
In [113]: temp2.apply(lambda x: x[2])
Out[113]:
0 p500
1 p600
2 p700
Name: ticker
Do this:
In [43]: temp2.str[-1]
Out[43]:
0 p500
1 p600
2 p700
Name: ticker
So all together it would be:
>>> temp = pd.DataFrame({'ticker' : ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
>>> temp['ticker'].str.split(' ').str[-1]
0 p500
1 p600
2 p700
Name: ticker, dtype: object
https://pandas.pydata.org/pandas-docs/stable/text.html
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_').str.get(1)
or
s2.str.split('_').str[1]
Using Pandas 0.20.3:
In [10]: import pandas as pd
...: temp = pd.DataFrame({'ticker' : ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
...:
In [11]: temp2 = temp.ticker.str.split(' ', expand=True) # the expand=True return a DataFrame
In [12]: temp2
Out[12]:
0 1 2
0 spx 5/25/2001 p500
1 spx 5/25/2001 p600
2 spx 5/25/2001 p700
In [13]: temp3 = temp.join(temp2[2])
In [14]: temp3
Out[14]:
ticker 2
0 spx 5/25/2001 p500 p500
1 spx 5/25/2001 p600 p600
2 spx 5/25/2001 p700 p700
If you are looking for a one-liner (like I came here for), this should do nicely:
temp2 = temp.ticker.str.split(' ', expand = True)[-1]
You can also trivially modify this answer to assign this column back to the original DataFrame as follows:
temp['last_split'] = temp.ticker.str.split(' ', expand = True)[-1]
Which I imagine is a popular use case here.
I have a column in a pandas DataFrame that I would like to split on a single space. The splitting is simple enough with DataFrame.str.split(' ')
, but I can’t make a new column from the last entry. When I .str.split()
the column I get a list of arrays and I don’t know how to manipulate this to get a new column for my DataFrame.
Here is an example. Each entry in the column contains ‘symbol data price’ and I would like to split off the price (and eventually remove the “p”… or “c” in half the cases).
import pandas as pd
temp = pd.DataFrame({'ticker' : ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
temp2 = temp.ticker.str.split(' ')
which yields
0 ['spx', '5/25/2001', 'p500']
1 ['spx', '5/25/2001', 'p600']
2 ['spx', '5/25/2001', 'p700']
But temp2[0]
just gives one list entry’s array and temp2[:][-1]
fails. How can I convert the last entry in each array to a new column? Thanks!
You could use the tolist
method as an intermediary:
In [99]: import pandas as pd
In [100]: d1 = pd.DataFrame({'ticker' : ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
In [101]: d1.ticker.str.split().tolist()
Out[101]:
[['spx', '5/25/2001', 'p500'],
['spx', '5/25/2001', 'p600'],
['spx', '5/25/2001', 'p700']]
From which you could make a new DataFrame:
In [102]: d2 = pd.DataFrame(d1.ticker.str.split().tolist(),
.....: columns="symbol date price".split())
In [103]: d2
Out[103]:
symbol date price
0 spx 5/25/2001 p500
1 spx 5/25/2001 p600
2 spx 5/25/2001 p700
For good measure, you could fix the price:
In [104]: d2["price"] = d2["price"].str.replace("p","").astype(float)
In [105]: d2
Out[105]:
symbol date price
0 spx 5/25/2001 500
1 spx 5/25/2001 600
2 spx 5/25/2001 700
PS: but if you really just want the last column, apply
would suffice:
In [113]: temp2.apply(lambda x: x[2])
Out[113]:
0 p500
1 p600
2 p700
Name: ticker
Do this:
In [43]: temp2.str[-1]
Out[43]:
0 p500
1 p600
2 p700
Name: ticker
So all together it would be:
>>> temp = pd.DataFrame({'ticker' : ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
>>> temp['ticker'].str.split(' ').str[-1]
0 p500
1 p600
2 p700
Name: ticker, dtype: object
https://pandas.pydata.org/pandas-docs/stable/text.html
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_').str.get(1)
or
s2.str.split('_').str[1]
Using Pandas 0.20.3:
In [10]: import pandas as pd
...: temp = pd.DataFrame({'ticker' : ['spx 5/25/2001 p500', 'spx 5/25/2001 p600', 'spx 5/25/2001 p700']})
...:
In [11]: temp2 = temp.ticker.str.split(' ', expand=True) # the expand=True return a DataFrame
In [12]: temp2
Out[12]:
0 1 2
0 spx 5/25/2001 p500
1 spx 5/25/2001 p600
2 spx 5/25/2001 p700
In [13]: temp3 = temp.join(temp2[2])
In [14]: temp3
Out[14]:
ticker 2
0 spx 5/25/2001 p500 p500
1 spx 5/25/2001 p600 p600
2 spx 5/25/2001 p700 p700
If you are looking for a one-liner (like I came here for), this should do nicely:
temp2 = temp.ticker.str.split(' ', expand = True)[-1]
You can also trivially modify this answer to assign this column back to the original DataFrame as follows:
temp['last_split'] = temp.ticker.str.split(' ', expand = True)[-1]
Which I imagine is a popular use case here.