Improving time efficiency of code, working with a big Data Set using Python

Question

I’m trying to optimize my code to reduce the time it takes to run through the data set.

I’m working with a .csv file that has three columns: time_UTC, vmag2D and vdir. The data set has around 1,420,000 lines (one million, four hundred and twenty thousand).

This for loop took around 15 to 20 minutes to run on my Mac with the M1 processor. I’m sure I overcomplicated something for it to take so much time; I believe the processor is good enough to run this small piece of code faster.

import pandas as pd

path_data = '" *insert a path here* "'
file = path_data + ' *name of the .csv file* '

data = pd.read_csv(file)

time_UTC = []
vmag2D = []
vdir = []

for i in range(len(data)):
    x = data.iloc[i][0]
    x1 = x.split(' ')
    x2 = x1[1].split(';')
    date = x.split(' ')[0]
    time_UTC.append(x2[0])
    vmag2D.append(x2[1])
    vdir.append(x2[2])

The code parses each line of the .csv file, and every line follows the same “template”: ‘1994-01-01 00:05:00;0.52;193’

Asked By: João Roquette Saldanha


Answer 1

You can split the entire column at once instead of looping over the rows:

import pandas as pd

# one sample row repeated, standing in for the real single-column data
df = pd.DataFrame({"all": ["1994-01-01 00:05:00;0.52;193"]*1000})

# split at space " "
df[["date", "time vmag vdir"]] = df["all"].str.split(" ", expand=True)

# split at ";"
df[["time", "vmag2D", "vdir"]] = df['time vmag vdir'].str.split(';', expand=True)

# convert the string columns to proper dtypes
date = pd.to_datetime(df["date"]).to_list()
# note: parsing a bare time attaches a default date; use df["time"].to_list() to keep plain strings
time_UTC = pd.to_datetime(df["time"]).to_list()
vmag2D = pd.to_numeric(df["vmag2D"]).to_list()
vdir = pd.to_numeric(df["vdir"]).to_list()
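
To get a feel for the difference, here is a rough sketch that times the row-by-row loop from the question against the whole-column split on synthetic data. The sample size `n_rows` and the helper names `loop_split` and `vectorized_split` are made up for illustration, and the timings will vary by machine and pandas version.

```python
import time
import pandas as pd

n_rows = 20_000  # hypothetical sample size, far smaller than the real file
df = pd.DataFrame({"all": ["1994-01-01 00:05:00;0.52;193"] * n_rows})

def loop_split(frame):
    """Row-by-row parsing, in the style of the question's loop."""
    times, vmags, vdirs = [], [], []
    for i in range(len(frame)):
        line = frame.iloc[i, 0]
        fields = line.split(" ")[1].split(";")
        times.append(fields[0])
        vmags.append(fields[1])
        vdirs.append(fields[2])
    return times, vmags, vdirs

def vectorized_split(frame):
    """Whole-column parsing with the .str accessor."""
    rest = frame["all"].str.split(" ", expand=True)[1]
    parts = rest.str.split(";", expand=True)
    return parts[0].to_list(), parts[1].to_list(), parts[2].to_list()

start = time.perf_counter()
loop_result = loop_split(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vec_result = vectorized_split(df)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.3f}s, vectorized: {vec_seconds:.3f}s")
```

Both helpers return the same three lists of strings, so the vectorized version can replace the loop without changing the downstream code.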

Answered By: Colim

Answer 2

You shouldn’t need a for loop at all here. You are reading the CSV with pandas, but you don’t seem to be passing the right parameters.

import pandas as pd

path_data = '" *insert a path here* "'
file = path_data + ' *name of the .csv file* '

df = pd.read_csv(file, sep=';', parse_dates=[0], engine='c', header=None)

time_UTC = df.iloc[:, 0]
vmag2D = df.iloc[:, 1]
vdir = df.iloc[:, 2]

If you run this, your resulting variables (time_UTC, …) will be of type pandas.Series. You can convert them to lists with .to_list() or access the underlying numpy array using .values.
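
As a small sketch of those conversions, the snippet below reads two made-up rows in the question's format from an in-memory buffer instead of a file. The sample values and the variable names `vmag2D_list` and `vmag2D_array` are illustrative assumptions, not part of the original data.

```python
import io
import pandas as pd

# Two sample rows in the question's format; the real file is much larger.
sample = io.StringIO(
    "1994-01-01 00:05:00;0.52;193\n"
    "1994-01-01 00:10:00;0.48;201\n"
)
df = pd.read_csv(sample, sep=";", parse_dates=[0], header=None)

vmag2D = df.iloc[:, 1]             # pandas.Series of floats
vmag2D_list = vmag2D.to_list()     # plain Python list
vmag2D_array = vmag2D.to_numpy()   # NumPy array, same data as .values here

print(type(vmag2D).__name__, type(vmag2D_list).__name__, type(vmag2D_array).__name__)
```

Staying with the Series or the NumPy array is usually preferable for further numeric work; a list is mainly useful when other code expects one.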

Note that I am specifying engine='c' for the pandas CSV parser here. This uses a native C parser that is faster than its Python equivalent, which matters because you are processing a large file.
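
If you want to check the effect of the engine on your own machine, a sketch like the following times both parsers on synthetic rows. The row count `n_rows` and the helper `timed_read` are hypothetical, and the measured times depend on hardware and pandas version.

```python
import io
import time
import pandas as pd

n_rows = 50_000  # hypothetical sample size for a quick comparison
csv_text = "1994-01-01 00:05:00;0.52;193\n" * n_rows

def timed_read(engine):
    """Parse the same in-memory CSV with the given engine and time it."""
    start = time.perf_counter()
    frame = pd.read_csv(io.StringIO(csv_text), sep=";", header=None, engine=engine)
    return frame, time.perf_counter() - start

df_c, seconds_c = timed_read("c")
df_py, seconds_py = timed_read("python")

print(f"c engine: {seconds_c:.3f}s, python engine: {seconds_py:.3f}s")
```

Date parsing is left out here on purpose so that the comparison only measures the tokenizing and type inference done by each engine.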

Answered By: carlo_barth
