How can I split a column of tuples in a Pandas dataframe?
Question:
I have a Pandas dataframe (this is only a little piece)
>>> d1
y norm test y norm train len(y_train) len(y_test)
0 64.904368 116.151232 1645 549
1 70.852681 112.639876 1645 549
SVR RBF
0 (35.652207342877873, 22.95533537448393)
1 (39.563683797747622, 27.382483096332511)
LCV
0 (19.365430594452338, 13.880062435173587)
1 (19.099614489458364, 14.018867136617146)
RIDGE CV
0 (4.2907610988480362, 12.416745648065584)
1 (4.18864306788194, 12.980833914392477)
RF
0 (9.9484841581029428, 16.46902345373697)
1 (10.139848213735391, 16.282141345406522)
GB
0 (0.012816232716538605, 15.950164822266007)
1 (0.012814519804493328, 15.305745202851712)
ET DATA
0 (0.00034337162272515505, 16.284800366214057) j2m
1 (0.00024811554516431878, 15.556506191784194) j2m
>>>
I want to split all the columns that contain tuples. For example, I want to replace the column LCV
with the columns LCV-a
and LCV-b
.
How can I do that?
Answers:
You can do this by doing pd.DataFrame(col.tolist())
on that column:
In [2]: df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})
In [3]: df
Out[3]:
a b
0 1 (1, 2)
1 2 (3, 4)
In [4]: df['b'].tolist()
Out[4]: [(1, 2), (3, 4)]
In [5]: pd.DataFrame(df['b'].tolist(), index=df.index)
Out[5]:
0 1
0 1 2
1 3 4
In [6]: df[['b1', 'b2']] = pd.DataFrame(df['b'].tolist(), index=df.index)
In [7]: df
Out[7]:
a b b1 b2
0 1 (1, 2) 1 2
1 2 (3, 4) 3 4
Note: in an earlier version, this answer recommended to use df['b'].apply(pd.Series)
instead of pd.DataFrame(df['b'].tolist(), index=df.index)
. That works as well (because it makes a Series of each tuple, which is then seen as a row of a dataframe), but it is slower / uses more memory than the tolist
version, as noted by the other answers here (thanks to denfromufa).
On much larger datasets, I found that .apply()
is few orders of magnitude slower than pd.DataFrame(df['b'].values.tolist(), index=df.index)
.
This performance issue was closed in GitHub, although I do not agree with this decision:
performance issue – apply with pd.Series vs tuple #11615
It is based on this answer.
A caveat of the second solution,
pd.DataFrame(df['b'].values.tolist())
is that it will explicitly discard the index, and add in a default sequential index, whereas the accepted answer
apply(pd.Series)
will not, since the result of apply will retain the row index. While the order is initially retained from the original array, Pandas will try to match the indices from the two dataframes.
This can be very important if you are trying to set the rows into an numerically indexed array, and Pandas will automatically try to match the index of the new array to the old, and cause some distortion in the ordering.
A better hybrid solution would be to set the index of the original dataframe onto the new, i.e.,
pd.DataFrame(df['b'].values.tolist(), index=df.index)
Which will retain the speed of using the second method while ensuring the order and indexing is retained on the result.
I think a simpler way is:
>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})
>>> df
a b
0 1 (1, 2)
1 2 (3, 4)
>>> df['b_a'] = df['b'].str[0]
>>> df['b_b'] = df['b'].str[1]
>>> df
a b b_a b_b
0 1 (1, 2) 1 2
1 2 (3, 4) 3 4
The str
accessor that is available to pandas.Series
objects of dtype == object
is actually an iterable.
Assume a pandas.DataFrame
df
:
df = pd.DataFrame(dict(col=[*zip('abcdefghij', range(10, 101, 10))]))
df
col
0 (a, 10)
1 (b, 20)
2 (c, 30)
3 (d, 40)
4 (e, 50)
5 (f, 60)
6 (g, 70)
7 (h, 80)
8 (i, 90)
9 (j, 100)
We can test if it is an iterable:
from collections import Iterable
isinstance(df.col.str, Iterable)
True
We can then assign from it like we do other iterables:
var0, var1 = 'xy'
print(var0, var1)
x y
Simplest solution
So in one line we can assign both columns:
df['a'], df['b'] = df.col.str
df
col a b
0 (a, 10) a 10
1 (b, 20) b 20
2 (c, 30) c 30
3 (d, 40) d 40
4 (e, 50) e 50
5 (f, 60) f 60
6 (g, 70) g 70
7 (h, 80) h 80
8 (i, 90) i 90
9 (j, 100) j 100
Faster solution
Only slightly more complicated, we can use zip
to create a similar iterable:
df['c'], df['d'] = zip(*df.col)
df
col a b c d
0 (a, 10) a 10 a 10
1 (b, 20) b 20 b 20
2 (c, 30) c 30 c 30
3 (d, 40) d 40 d 40
4 (e, 50) e 50 e 50
5 (f, 60) f 60 f 60
6 (g, 70) g 70 g 70
7 (h, 80) h 80 h 80
8 (i, 90) i 90 i 90
9 (j, 100) j 100 j 100
Inline
Meaning, don’t mutate existing df
.
This works because assign
takes keyword arguments where the keywords are the new (or existing) column names and the values will be the values of the new column. You can use a dictionary and unpack it with **
and have it act as the keyword arguments.
So this is a clever way of assigning a new column named 'g'
that is the first item in the df.col.str
iterable and 'h'
that is the second item in the df.col.str
iterable:
df.assign(**dict(zip('gh', df.col.str)))
col g h
0 (a, 10) a 10
1 (b, 20) b 20
2 (c, 30) c 30
3 (d, 40) d 40
4 (e, 50) e 50
5 (f, 60) f 60
6 (g, 70) g 70
7 (h, 80) h 80
8 (i, 90) i 90
9 (j, 100) j 100
My version of the list
approach
With modern list comprehension and variable unpacking.
Note: also inline using join
df.join(pd.DataFrame([*df.col], df.index, [*'ef']))
col g h
0 (a, 10) a 10
1 (b, 20) b 20
2 (c, 30) c 30
3 (d, 40) d 40
4 (e, 50) e 50
5 (f, 60) f 60
6 (g, 70) g 70
7 (h, 80) h 80
8 (i, 90) i 90
9 (j, 100) j 100
The mutating version would be
df[['e', 'f']] = pd.DataFrame([*df.col], df.index)
Naive Time Test
Short DataFrame
Use the one defined above:
%timeit df.assign(**dict(zip('gh', df.col.str)))
%timeit df.assign(**dict(zip('gh', zip(*df.col))))
%timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))
1.16 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
635 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
795 µs ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Long DataFrame
10^3 times bigger
df = pd.concat([df] * 1000, ignore_index=True)
%timeit df.assign(**dict(zip('gh', df.col.str)))
%timeit df.assign(**dict(zip('gh', zip(*df.col))))
%timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))
11.4 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.33 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
pandas.Series.str.extract is another option, use the data from
https://opendataportal-lasvegas.opendata.arcgis.com/datasets/restaurant-inspections-open-data/explore
import pandas as pd
df = pd.read_csv('raw_data.csv', low_memory=False)
df[['latitude', 'longitude']] = df['Location_1'].str.extract(pat = '(-?d+.d+),s*(-?d+.d+)')
df.to_csv('result.csv')
I have a Pandas dataframe (this is only a little piece)
>>> d1
y norm test y norm train len(y_train) len(y_test)
0 64.904368 116.151232 1645 549
1 70.852681 112.639876 1645 549
SVR RBF
0 (35.652207342877873, 22.95533537448393)
1 (39.563683797747622, 27.382483096332511)
LCV
0 (19.365430594452338, 13.880062435173587)
1 (19.099614489458364, 14.018867136617146)
RIDGE CV
0 (4.2907610988480362, 12.416745648065584)
1 (4.18864306788194, 12.980833914392477)
RF
0 (9.9484841581029428, 16.46902345373697)
1 (10.139848213735391, 16.282141345406522)
GB
0 (0.012816232716538605, 15.950164822266007)
1 (0.012814519804493328, 15.305745202851712)
ET DATA
0 (0.00034337162272515505, 16.284800366214057) j2m
1 (0.00024811554516431878, 15.556506191784194) j2m
>>>
I want to split all the columns that contain tuples. For example, I want to replace the column LCV
with the columns LCV-a
and LCV-b
.
How can I do that?
You can do this by doing pd.DataFrame(col.tolist())
on that column:
In [2]: df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})
In [3]: df
Out[3]:
a b
0 1 (1, 2)
1 2 (3, 4)
In [4]: df['b'].tolist()
Out[4]: [(1, 2), (3, 4)]
In [5]: pd.DataFrame(df['b'].tolist(), index=df.index)
Out[5]:
0 1
0 1 2
1 3 4
In [6]: df[['b1', 'b2']] = pd.DataFrame(df['b'].tolist(), index=df.index)
In [7]: df
Out[7]:
a b b1 b2
0 1 (1, 2) 1 2
1 2 (3, 4) 3 4
Note: in an earlier version, this answer recommended to use df['b'].apply(pd.Series)
instead of pd.DataFrame(df['b'].tolist(), index=df.index)
. That works as well (because it makes a Series of each tuple, which is then seen as a row of a dataframe), but it is slower / uses more memory than the tolist
version, as noted by the other answers here (thanks to denfromufa).
On much larger datasets, I found that .apply()
is few orders of magnitude slower than pd.DataFrame(df['b'].values.tolist(), index=df.index)
.
This performance issue was closed in GitHub, although I do not agree with this decision:
performance issue – apply with pd.Series vs tuple #11615
It is based on this answer.
A caveat of the second solution,
pd.DataFrame(df['b'].values.tolist())
is that it will explicitly discard the index, and add in a default sequential index, whereas the accepted answer
apply(pd.Series)
will not, since the result of apply will retain the row index. While the order is initially retained from the original array, Pandas will try to match the indices from the two dataframes.
This can be very important if you are trying to set the rows into an numerically indexed array, and Pandas will automatically try to match the index of the new array to the old, and cause some distortion in the ordering.
A better hybrid solution would be to set the index of the original dataframe onto the new, i.e.,
pd.DataFrame(df['b'].values.tolist(), index=df.index)
Which will retain the speed of using the second method while ensuring the order and indexing is retained on the result.
I think a simpler way is:
>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})
>>> df
a b
0 1 (1, 2)
1 2 (3, 4)
>>> df['b_a'] = df['b'].str[0]
>>> df['b_b'] = df['b'].str[1]
>>> df
a b b_a b_b
0 1 (1, 2) 1 2
1 2 (3, 4) 3 4
The str
accessor that is available to pandas.Series
objects of dtype == object
is actually an iterable.
Assume a pandas.DataFrame
df
:
df = pd.DataFrame(dict(col=[*zip('abcdefghij', range(10, 101, 10))]))
df
col
0 (a, 10)
1 (b, 20)
2 (c, 30)
3 (d, 40)
4 (e, 50)
5 (f, 60)
6 (g, 70)
7 (h, 80)
8 (i, 90)
9 (j, 100)
We can test if it is an iterable:
from collections import Iterable
isinstance(df.col.str, Iterable)
True
We can then assign from it like we do other iterables:
var0, var1 = 'xy'
print(var0, var1)
x y
Simplest solution
So in one line we can assign both columns:
df['a'], df['b'] = df.col.str
df
col a b
0 (a, 10) a 10
1 (b, 20) b 20
2 (c, 30) c 30
3 (d, 40) d 40
4 (e, 50) e 50
5 (f, 60) f 60
6 (g, 70) g 70
7 (h, 80) h 80
8 (i, 90) i 90
9 (j, 100) j 100
Faster solution
Only slightly more complicated, we can use zip
to create a similar iterable:
df['c'], df['d'] = zip(*df.col)
df
col a b c d
0 (a, 10) a 10 a 10
1 (b, 20) b 20 b 20
2 (c, 30) c 30 c 30
3 (d, 40) d 40 d 40
4 (e, 50) e 50 e 50
5 (f, 60) f 60 f 60
6 (g, 70) g 70 g 70
7 (h, 80) h 80 h 80
8 (i, 90) i 90 i 90
9 (j, 100) j 100 j 100
Inline
Meaning, don’t mutate existing df
.
This works because assign
takes keyword arguments where the keywords are the new (or existing) column names and the values will be the values of the new column. You can use a dictionary and unpack it with **
and have it act as the keyword arguments.
So this is a clever way of assigning a new column named 'g'
that is the first item in the df.col.str
iterable and 'h'
that is the second item in the df.col.str
iterable:
df.assign(**dict(zip('gh', df.col.str)))
col g h
0 (a, 10) a 10
1 (b, 20) b 20
2 (c, 30) c 30
3 (d, 40) d 40
4 (e, 50) e 50
5 (f, 60) f 60
6 (g, 70) g 70
7 (h, 80) h 80
8 (i, 90) i 90
9 (j, 100) j 100
My version of the list
approach
With modern list comprehension and variable unpacking.
Note: also inline using join
df.join(pd.DataFrame([*df.col], df.index, [*'ef']))
col g h
0 (a, 10) a 10
1 (b, 20) b 20
2 (c, 30) c 30
3 (d, 40) d 40
4 (e, 50) e 50
5 (f, 60) f 60
6 (g, 70) g 70
7 (h, 80) h 80
8 (i, 90) i 90
9 (j, 100) j 100
The mutating version would be
df[['e', 'f']] = pd.DataFrame([*df.col], df.index)
Naive Time Test
Short DataFrame
Use the one defined above:
%timeit df.assign(**dict(zip('gh', df.col.str)))
%timeit df.assign(**dict(zip('gh', zip(*df.col))))
%timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))
1.16 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
635 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
795 µs ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Long DataFrame
10^3 times bigger
df = pd.concat([df] * 1000, ignore_index=True)
%timeit df.assign(**dict(zip('gh', df.col.str)))
%timeit df.assign(**dict(zip('gh', zip(*df.col))))
%timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))
11.4 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.33 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
pandas.Series.str.extract is another option, use the data from
https://opendataportal-lasvegas.opendata.arcgis.com/datasets/restaurant-inspections-open-data/explore
import pandas as pd
df = pd.read_csv('raw_data.csv', low_memory=False)
df[['latitude', 'longitude']] = df['Location_1'].str.extract(pat = '(-?d+.d+),s*(-?d+.d+)')
df.to_csv('result.csv')