Why does this code take more than 5 seconds on its first run, but less than 0.2 s when repeated?

Question:

I will start with the question:

Why does the first execution take so much longer than usual? Should I be worried about malware or something similar?

The code is below. I have NumPy ndarrays pickled in files. I read all of the files and stack them into one combined ndarray.

I noticed some strange behavior.

Execution takes more than 5 s, but only on the first run after I power the Linux machine off and on again and open a terminal.
Subsequent runs take less than 0.2 s.

Why does the first execution take longer than usual?
At first I thought the extra time was spent creating the pycache.

I am running this code in a venv, and the venv is activated via main.sh.

read_pickle.py:

#!/home/****/bin/python3
# coding=utf-8

from os import listdir
from os.path import isfile, join
import multiprocessing
import numpy as np
import pandas as pd
import time as t
from utilsC0 import read_pickle_from_file
from enums import *


def process(file):
    directory = "some localization" # external hdd
    path = directory + f"/{file}"
    arr = read_pickle_from_file(path)
    return arr


def f2():
    directory = "some localization" # external hdd
    onlyfiles = [f for f in listdir(directory) if isfile(join(directory, f))]

    with multiprocessing.Pool() as p: 
        result = p.map(process, onlyfiles)

    result = np.vstack(result)

    print("End f2")


if __name__ == "__main__":
    t1 = t.time()
    f2()
    t2 = t.time()
    print(f"{t2 - t1}")

And simple main.sh:

#!/usr/bin/env bash

source /home/***/bin/activate
python3 src/read_pickle.py
Asked By: luki


Answers:

Most likely because, after the first run, all the files are held in a cache.

Once the OS has read a file, reading it again shortly afterwards is much quicker. I would only expect a difference of this magnitude if the files are very numerous or on a very slow external drive. Is that the case?

The asker replied: "199 files with capacity 150,6 MB … external HDD"

Ah, that sounds very plausible. That would take many seconds to read the first time, and might then be stored in a cache which is very much faster. (See also @Jérôme Richard’s comment pointing out that it is the sheer number of files that is probably the driver here.)
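To see why the number of files matters, here is a rough back-of-envelope estimate. Only the file count and total size come from the question. The seek time and throughput are assumed, typical-looking figures for a spinning external disk, not measurements of the asker's drive, and `estimate_cold_read_seconds` is a hypothetical helper name.

```python
# Figures taken from the question.
N_FILES = 199
TOTAL_MB = 150.6

# ASSUMPTIONS for illustration only (not measured on the asker's drive).
ASSUMED_SEEK_MS = 15.0            # time to locate one file on a spinning disk
ASSUMED_THROUGHPUT_MB_S = 100.0   # sequential read speed once located


def estimate_cold_read_seconds(n_files, total_mb, seek_ms, throughput_mb_s):
    """Crude model: one seek per file plus one sequential pass over the data."""
    seek_seconds = n_files * seek_ms / 1000.0
    transfer_seconds = total_mb / throughput_mb_s
    return seek_seconds + transfer_seconds


if __name__ == "__main__":
    estimate = estimate_cold_read_seconds(
        N_FILES, TOTAL_MB, ASSUMED_SEEK_MS, ASSUMED_THROUGHPUT_MB_S
    )
    print(f"Estimated cold read time: {estimate:.1f} s")
```

Under these assumptions the per-file seeks cost more than the data transfer itself, which fits the observation that a cold run takes seconds. Once the data is in RAM, neither cost applies.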

The cache is likely in the OS rather than in Python. As pointed out by @slothorp in the comments below, it could even be on the HDD.
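One way to check whether the OS cache is responsible is to time the same read twice. The sketch below is a minimal illustration, not the asker's code: it uses a temporary file as a stand-in for the pickles, and the helper names `timed_read` and `try_evict_from_page_cache` are made up for this example. `os.posix_fadvise` with `POSIX_FADV_DONTNEED` is only a hint to the kernel and is only available on Unix, so the first read may still be served from the cache.

```python
import os
import tempfile
import time


def timed_read(path):
    """Read a whole file and return (elapsed_seconds, bytes_read)."""
    start = time.perf_counter()
    with open(path, "rb") as fh:
        n_bytes = len(fh.read())
    return time.perf_counter() - start, n_bytes


def try_evict_from_page_cache(path):
    """Hint to the kernel that cached pages of this file can be dropped.

    Returns True if the hint was issued, False if it is unavailable.
    The kernel is free to ignore the hint.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    except OSError:
        return False
    finally:
        os.close(fd)
    return True


if __name__ == "__main__":
    # Stand-in data file; in the question this would be a pickle on the HDD.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
        tmp.flush()
        os.fsync(tmp.fileno())  # make sure the data is on disk before evicting
        path = tmp.name
    try:
        hinted = try_evict_from_page_cache(path)
        first, _ = timed_read(path)
        second, _ = timed_read(path)
        print(f"eviction hint issued: {hinted}")
        print(f"first read:  {first:.4f} s")
        print(f"second read: {second:.4f} s")
    finally:
        os.remove(path)
```

On a fast internal SSD the gap between the two reads may be small. On a slow external HDD with many files, as in the question, the first read would be expected to dominate.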

No, this does not signify malware.

It is normal for computer systems to be designed to notice what you are doing and, where possible, do it more efficiently the next time. That is the main reason for caching.

Answered By: ProfDFrancis