Using `split` on columns too slow – how can I get better performance?
Question:
I’ve a dataset (around 10 GB) of call records. There’s a column with IP addresses that I want to split into four new columns. I’m trying to use:
df['ip'].fillna('0.0.0.0', inplace=True)
df = df.join(df['ip'].apply(lambda x: pd.Series(x.split('.'))))
but it’s far too slow… the fillna is fast, around 10 seconds, but then it sits in the split for something like 5 hours…
Is there a better way to do it?
Answers:
Example data (your questions are more likely to be answered if you provide this):
import pandas as pd
import random
def make_ip():
    return '.'.join(str(random.randint(0, 255)) for n in range(4))
df = pd.DataFrame({'ip': [make_ip() for i in range(20)]})
df
Out[4]:
ip
0 153.1.219.147
1 110.170.184.123
2 91.100.92.150
3 61.148.99.64
4 94.175.253.3
5 30.29.220.218
6 7.118.167.173
7 71.99.78.94
8 240.122.200.194
9 48.16.177.0
10 81.155.96.173
11 202.91.134.9
12 90.155.159.176
13 169.74.28.73
14 149.133.115.45
15 168.196.41.132
16 145.195.15.234
17 12.200.28.27
18 146.255.29.80
19 228.226.143.45
Use pandas’ built-in str methods for efficient string operations, and assign the new columns directly to avoid a slow join:
df[['ip0', 'ip1', 'ip2', 'ip3']] = df.ip.str.split('.', return_type='frame')
df
Out[8]:
ip ip0 ip1 ip2 ip3
0 153.1.219.147 153 1 219 147
1 110.170.184.123 110 170 184 123
2 91.100.92.150 91 100 92 150
3 61.148.99.64 61 148 99 64
4 94.175.253.3 94 175 253 3
5 30.29.220.218 30 29 220 218
6 7.118.167.173 7 118 167 173
7 71.99.78.94 71 99 78 94
8 240.122.200.194 240 122 200 194
9 48.16.177.0 48 16 177 0
10 81.155.96.173 81 155 96 173
11 202.91.134.9 202 91 134 9
12 90.155.159.176 90 155 159 176
13 169.74.28.73 169 74 28 73
14 149.133.115.45 149 133 115 45
15 168.196.41.132 168 196 41 132
16 145.195.15.234 145 195 15 234
17 12.200.28.27 12 200 28 27
18 146.255.29.80 146 255 29 80
19 228.226.143.45 228 226 143 45
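In later pandas versions, the `return_type='frame'` argument used above was removed; `expand=True` is the equivalent. A minimal sketch of the modern form (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ip': ['153.1.219.147', '110.170.184.123']})

# expand=True makes str.split return a DataFrame, one column per part
df[['ip0', 'ip1', 'ip2', 'ip3']] = df['ip'].str.split('.', expand=True)
print(df)
```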
This answer is outdated, as is this question. The problem identified below was fixed some time ago; the pandas str.split method should now be fast.
It turns out that str.split in pandas (str_split in core/strings.py) is actually very slow; it isn’t any more efficient than apply, since it still iterates through in Python and offers no speedup whatsoever.
Actually, see below. Pandas performance on this is simply miserable; it’s not just Python vs C iteration, as doing the same thing with Python lists is the fastest method!
Interestingly, though, there’s a trick solution that’s much faster: writing the Series out to text, and then reading it in again, with ‘.’ as the separator:
df[['ip0', 'ip1', 'ip2', 'ip3']] = pd.read_table(
    StringIO(df['ip'].to_csv(None, index=None)), sep='.', header=None)
(header=None is needed here; without it the first row of values is consumed as the column header.)
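On Python 3 (where the StringIO module became io.StringIO), a self-contained sketch of the same round-trip trick, using read_csv; the column names are illustrative:

```python
import io
import pandas as pd

df = pd.DataFrame({'ip': ['153.1.219.147', '110.170.184.123']})

# Dump the column as plain text, then re-parse it with '.' as the separator
parts = pd.read_csv(io.StringIO(df['ip'].to_csv(index=False, header=False)),
                    sep='.', header=None)
parts.columns = ['ip0', 'ip1', 'ip2', 'ip3']
df = df.join(parts)
print(df)
```

Note that read_csv also parses the pieces as integers, which the str.split approaches do not.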
To compare, I use Marius’ code and generate 20,000 ips:
import pandas as pd
import random
import numpy as np
from StringIO import StringIO
def make_ip():
    return '.'.join(str(random.randint(0, 255)) for n in range(4))
df = pd.DataFrame({'ip': [make_ip() for i in range(20000)]})
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df.ip.str.split('.', return_type='frame')
# 1 loops, best of 3: 3.06 s per loop
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df['ip'].apply(lambda x: pd.Series(x.split('.')))
# 1 loops, best of 3: 3.1 s per loop
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = pd.read_table(StringIO(df['ip'].to_csv(None, index=None)), sep='.', header=None)
# 10 loops, best of 3: 46.4 ms per loop
OK, so I wanted to compare all of this with just using a Python list and plain Python split, which ought to be slower than the supposedly more efficient pandas methods:
iplist = list(df['ip'])
%timeit [ x.split('.') for x in iplist ]
# 100 loops, best of 3: 10 ms per loop
What!? Apparently, the best way to do a simple string operation on a large number of strings is to throw out pandas entirely: using pandas makes the process roughly 300 times slower (3.06 s vs. 10 ms). If you want the result in pandas, though, you may as well just convert to a Python list and back:
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = pd.DataFrame([x.split('.') for x in list(df['ip'])])
# 100 loops, best of 3: 18.4 ms per loop
There’s something very wrong here.
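Putting the fastest variant together: split with a plain list comprehension and build the new columns in one DataFrame constructor. A sketch (the column names and the int cast are illustrative additions):

```python
import pandas as pd

df = pd.DataFrame({'ip': ['153.1.219.147', '110.170.184.123']})

# Split in plain Python, then construct all four columns in one shot;
# reuse df's index so join aligns the rows correctly
parts = pd.DataFrame([x.split('.') for x in df['ip']],
                     columns=['ip0', 'ip1', 'ip2', 'ip3'],
                     index=df.index).astype(int)
df = df.join(parts)
print(df)
```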