Pandas read_csv with JIT Bodo is slower than regular Python

Question:

I’m trying out Bodo to speed up certain Pandas operations, starting with pd.read_csv(...). Bodo requires Bodo-compatible pandas code to live in its own function, separate from code that is not Bodo-compatible. For example, this is my code:

With Bodo:

import bodo
import pandas as pd

@bodo.jit
def loadDataFileWithJIT(filePath):
    # Tab-separated input: keep five columns, rename them, and read them as strings.
    df = pd.read_csv(filePath, header=0, sep="\t", names=["patid", "eventdate", "prodcode", "consid", "issueseq"],
                       usecols=[0, 1, 3, 4, 12],
                       dtype={"patid": "str", "eventdate": "str", "prodcode": "str", "consid": "str", "issueseq": "str"},
                       low_memory=False)
    return df

Over 5 files I see these times:

  • 14.24 <— first time, so this is when JIT compiles
  • 9.67
  • 10.72
  • 9.51
  • 9.42

Without Bodo (the function decorator and import statement have been removed… nothing else has changed):

  • 4.66
  • 4.68
  • 4.59
  • 4.61
  • 4.60

Each file is approximately 170MB.

Update

After speaking with the authors of Bodo, I learned that I need to launch Python through mpiexec -n # (where # is the number of cores, greater than 1) to see a speedup.

Asked By: Anthony Nash

Answers:

TL;DR: Speeding up I/O operations requires parallelism, so you need to run under mpiexec with more than one process.

Bodo currently reuses pandas read_csv under the hood to ensure full compatibility. JIT compilation enables parallelism, but doesn’t improve anything on a single core (and in fact has some overhead as you are observing).
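To see how much of the measured time is one-time compilation and how much is per-call overhead, a small timing harness along these lines may help. This is only a sketch: time_calls and summarize are hypothetical helper names, not part of Bodo or pandas, and the numbers fed in at the end are simply the timings reported in the question.

```python
import time
from statistics import mean


def time_calls(fn, inputs):
    """Call fn once per input and return the wall-clock seconds of each call."""
    durations = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Report the first call separately and average the remaining calls."""
    first, rest = durations[0], durations[1:]
    return {
        "first_call": first,
        "steady_state": mean(rest) if rest else None,
    }


# Timings copied from the question (five files each).
bodo_times = [14.24, 9.67, 10.72, 9.51, 9.42]
pandas_times = [4.66, 4.68, 4.59, 4.61, 4.60]

bodo_summary = summarize(bodo_times)
pandas_summary = summarize(pandas_times)

# Extra time spent on the first Bodo call, which includes JIT compilation.
first_call_extra = bodo_summary["first_call"] - bodo_summary["steady_state"]

print("Bodo:  ", bodo_summary)
print("pandas:", pandas_summary)
print("first-call extra:", round(first_call_extra, 2))
```

With the figures from the question, the later Bodo calls still average roughly twice the pandas time, which matches the single-core overhead described above. To time real reads, time_calls could be given the loader function and a list of file paths, and the script (under any name, for example a hypothetical time_read.py) launched as mpiexec -n 4 python time_read.py.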

You can use ipyparallel to launch and manage Bodo/MPI processes from within a single process:
https://github.com/ipython/ipyparallel

Bodo Slack discussion:
https://bodocommunity.slack.com/archives/C01KRTQ1KDY/p1661704632557289

Answered By: Ehsan