Using `split` on columns too slow – how can I get better performance?
Question:
I’ve a dataset (around 10 GB) of call records. There’s a column with IP addresses that I want to split into four new columns. I’m trying to use:
df['ip'].fillna('0.0.0.0', inplace=True)
df = df.join(df['ip'].apply(lambda x: pd.Series(x.split('.'))))
but it’s far too slow… the fillna is fast, around 10 seconds, but then it sits in the split for something like 5 hours…
Is there a better way to do it?
Answers:
Example data (your questions are more likely to be answered if you provide this):
import pandas as pd
import random
def make_ip():
    return '.'.join(str(random.randint(0, 255)) for n in range(4))
df = pd.DataFrame({'ip': [make_ip() for i in range(20)]})
df
Out[4]:
ip
0 153.1.219.147
1 110.170.184.123
2 91.100.92.150
3 61.148.99.64
4 94.175.253.3
5 30.29.220.218
6 7.118.167.173
7 71.99.78.94
8 240.122.200.194
9 48.16.177.0
10 81.155.96.173
11 202.91.134.9
12 90.155.159.176
13 169.74.28.73
14 149.133.115.45
15 168.196.41.132
16 145.195.15.234
17 12.200.28.27
18 146.255.29.80
19 228.226.143.45
Use pandas’ built-in str methods for efficient string operations, and assign the new columns directly to avoid a slow join:
df[['ip0', 'ip1', 'ip2', 'ip3']] = df.ip.str.split('.', return_type='frame')
df
Out[8]:
ip ip0 ip1 ip2 ip3
0 153.1.219.147 153 1 219 147
1 110.170.184.123 110 170 184 123
2 91.100.92.150 91 100 92 150
3 61.148.99.64 61 148 99 64
4 94.175.253.3 94 175 253 3
5 30.29.220.218 30 29 220 218
6 7.118.167.173 7 118 167 173
7 71.99.78.94 71 99 78 94
8 240.122.200.194 240 122 200 194
9 48.16.177.0 48 16 177 0
10 81.155.96.173 81 155 96 173
11 202.91.134.9 202 91 134 9
12 90.155.159.176 90 155 159 176
13 169.74.28.73 169 74 28 73
14 149.133.115.45 149 133 115 45
15 168.196.41.132 168 196 41 132
16 145.195.15.234 145 195 15 234
17 12.200.28.27 12 200 28 27
18 146.255.29.80 146 255 29 80
19 228.226.143.45 228 226 143 45
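In later pandas versions, the `return_type='frame'` argument used above was removed; `expand=True` is the equivalent. A minimal sketch of the modern form (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ip': ['153.1.219.147', '110.170.184.123']})

# expand=True makes str.split return a DataFrame, one column per part
df[['ip0', 'ip1', 'ip2', 'ip3']] = df['ip'].str.split('.', expand=True)
print(df)
```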
This answer is outdated, as is this question. The problem identified below was fixed some time ago; the pandas str.split method should now be fast.
It turns out that str.split in pandas (str_split in core/strings.py) is actually very slow; it isn’t any more efficient than apply, since it still iterates through in Python and offers no speedup whatsoever.
Actually, see below. Pandas performance on this is simply miserable; it’s not just Python vs C iteration, as doing the same thing with Python lists is the fastest method!
Interestingly, though, there’s a trick solution that’s much faster: writing the Series out to text, and then reading it in again, with ‘.’ as the separator:
df[['ip0', 'ip1', 'ip2', 'ip3']] = pd.read_table(
    StringIO(df['ip'].to_csv(None, index=None)), sep='.', header=None)
(header=None is needed here; without it the first row of values is consumed as the column header.)
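On Python 3 (where the StringIO module became io.StringIO), a self-contained sketch of the same round-trip trick, using read_csv; the column names are illustrative:

```python
import io
import pandas as pd

df = pd.DataFrame({'ip': ['153.1.219.147', '110.170.184.123']})

# Dump the column as plain text, then re-parse it with '.' as the separator
parts = pd.read_csv(io.StringIO(df['ip'].to_csv(index=False, header=False)),
                    sep='.', header=None)
parts.columns = ['ip0', 'ip1', 'ip2', 'ip3']
df = df.join(parts)
print(df)
```

Note that read_csv also parses the pieces as integers, which the str.split approaches do not.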
To compare, I use Marius’ code and generate 20,000 ips:
import pandas as pd
import random
import numpy as np
from StringIO import StringIO
def make_ip():
    return '.'.join(str(random.randint(0, 255)) for n in range(4))
df = pd.DataFrame({'ip': [make_ip() for i in range(20000)]})
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df.ip.str.split('.', return_type='frame')
# 1 loops, best of 3: 3.06 s per loop
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df['ip'].apply(lambda x: pd.Series(x.split('.')))
# 1 loops, best of 3: 3.1 s per loop
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = pd.read_table(StringIO(df['ip'].to_csv(None, index=None)), sep='.', header=None)
# 10 loops, best of 3: 46.4 ms per loop
OK, so I wanted to compare all of this with just using a Python list and plain Python split, which ought to be slower than the supposedly more efficient pandas methods:
iplist = list(df['ip'])
%timeit [ x.split('.') for x in iplist ]
# 100 loops, best of 3: 10 ms per loop
What!? Apparently, the best way to do a simple string operation on a large number of strings is to throw out pandas entirely: using pandas makes the process roughly 300 times slower (3.06 s vs. 10 ms). If you want the result in pandas, though, you may as well just convert to a Python list and back:
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = pd.DataFrame([x.split('.') for x in list(df['ip'])])
# 100 loops, best of 3: 18.4 ms per loop
There’s something very wrong here.
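Putting the fastest variant together: split with a plain list comprehension and build the new columns in one DataFrame constructor. A sketch (the column names and the int cast are illustrative additions):

```python
import pandas as pd

df = pd.DataFrame({'ip': ['153.1.219.147', '110.170.184.123']})

# Split in plain Python, then construct all four columns in one shot;
# reuse df's index so join aligns the rows correctly
parts = pd.DataFrame([x.split('.') for x in df['ip']],
                     columns=['ip0', 'ip1', 'ip2', 'ip3'],
                     index=df.index).astype(int)
df = df.join(parts)
print(df)
```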