pandas dataframe: loc vs query performance

Question:

I have 2 dataframes in python that I would like to query for data.

  • DF1: 4M records x 3 columns. The query function seems more
    efficient than the loc function.

  • DF2: 2K records x 6 columns. The loc function seems much more
    efficient than the query function.

Both queries return a single record. The simulation was done by running the same operation 10K times.

Running python 2.7 and pandas 0.16.0

Any recommendations to improve the query speed?

Asked By: Syntax_Error

||

Answers:

For improve performance is possible use numexpr:

import numexpr

np.random.seed(125)
N = 40000000
df = pd.DataFrame({'A':np.random.randint(10, size=N)})

def ne(df):
    x = df.A.values
    return df[numexpr.evaluate('(x > 5)')]
print (ne(df))

In [138]: %timeit (ne(df))
1 loop, best of 3: 494 ms per loop

In [139]: %timeit df[df.A > 5]
1 loop, best of 3: 536 ms per loop

In [140]: %timeit df.query('A > 5')
1 loop, best of 3: 781 ms per loop

In [141]: %timeit df[df.eval('A > 5')]
1 loop, best of 3: 770 ms per loop

import numexpr
np.random.seed(125)

def ne(x):
    x = x.A.values
    return x[numexpr.evaluate('(x > 5)')]

def be(x):
    return x[x.A > 5]

def q(x):
    return x.query('A > 5')

def ev(x):
    return x[x.eval('A > 5')]


def make_df(n):
    df = pd.DataFrame(np.random.randint(10, size=n), columns=['A'])
    return df


perfplot.show(
    setup=make_df,
    kernels=[ne, be, q, ev],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,  
    xlabel='len(df)')

graph

Edit:

Graph with modified ne and improved be:

def ne(x):
    return x[numexpr.evaluate('(x > 5)')]

def bex(x):
    return x[x.A.values > 5]

enter image description here

Answered By: jezrael

I’ve added the .loc function and used the same code @jezrael used to test the performance in python 3.9.5 and pandas 1.2.5.

import numpy as np
import pandas as pd
import numexpr
import perfplot
np.random.seed(125)

def ne(x):
    return x[numexpr.evaluate('(x > 5)')]

def bex(x):
    return x[x.A.values > 5]

def be(x):
    return x[x.A > 5]

def lc(x):
    return x.loc[x.A > 5]

def lcx(x):
    return x.loc[x.A.values > 5]

def q(x):
    return x.query('A > 5')

def ev(x):
    return x[x.eval('A > 5')]

def make_df(n):
    df = pd.DataFrame(np.random.randint(10, size=n), columns=['A'])
    return df


perfplot.show(
    setup=make_df,
    kernels=[ne, lc, lcx, be, bex, q, ev],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,  
    xlabel='len(df)')

The using the .loc function does not afect the result of the be funcion and until 10^6 lines, using te column .values seens to be fastest option. The query method was the slowest.

top5

here just the loc and be, to see the overlap.
loc vs. be

Answered By: Rafael Braga