How to optimize the code and reduce memory usage in Python

Question:

The goal is to reduce memory usage while keeping the output identical, i.e. the optimized code must still produce the same hash as the test hash.

What I’ve tried so far:

  1. Adding __slots__, but it didn’t make any difference.
  2. Changing the default dtype from float64 to float32. Although it reduces memory usage significantly, it breaks the test by changing the hash.
  3. Converting the data into an np.array reduced CPU time from 13 s to 2.05 s, but it didn’t affect memory usage.
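To illustrate point 2, here is a minimal check (on a small hypothetical frame, not the question’s 40M rows) showing that casting to float32 changes the pandas hash, which is why that attempt breaks the test:

```python
import hashlib

import numpy as np
import pandas as pd

# Small hypothetical frame; the exact values don't matter.
df64 = pd.DataFrame({"v0": np.array([0.1, 0.2, 0.3], dtype=np.float64)})
df32 = df64.astype(np.float32)

h64 = hashlib.sha256(pd.util.hash_pandas_object(df64, index=True).values).hexdigest()
h32 = hashlib.sha256(pd.util.hash_pandas_object(df32, index=True).values).hexdigest()

# The bit patterns of float32 and float64 values differ, so the per-row
# hashes differ, and any test pinned to the float64 digest fails.
print(h64 == h32)  # False
```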

The code to reproduce:

import hashlib
import random
import tracemalloc

import numpy as np
import pandas as pd

rows = 40_000_000
trs = 10

random.seed(42)

generated_data: np.ndarray = np.array([random.random() for _ in range(rows)])



def df_upd(df_initial: pd.DataFrame, df_new: pd.DataFrame) -> pd.DataFrame:
    return pd.concat((df_initial, df_new), axis=1)


class Transform:
    """adding a column of random data"""
    __slots__ = ['var']
    def __init__(self, var: float):
        self.var = var

    def transform(self, df_initial: pd.DataFrame) -> pd.DataFrame:
        return df_upd(df_initial, pd.DataFrame({self.var: generated_data}))


class Pipeline:
    __slots__ = ['df', 'transforms']
    def __init__(self):
        self.df = pd.DataFrame()
        self.transforms = np.array([Transform(f"v{i}") for i in range(trs)])

    def run(self):
        for t in self.transforms:
            self.df = t.transform(self.df)
        return self.df


if __name__ == "__main__":

    # starting the monitoring
    tracemalloc.start()

    # function call
    pipe = Pipeline()
    %time df = pipe.run()
    print("running")

    # displaying the memory
    current, peak = tracemalloc.get_traced_memory()
    print(f"Current memory usage is {current / 10**3} KB ({(current / 10**3)*0.001} MB); Peak was {peak / 10**3} KB ({(peak / 10**3)*0.001} MB); Diff = {(peak - current) / 10**3} KB ({((peak - current) / 10**3)*0.001} MB)")

    # stopping the library
    tracemalloc.stop()
    
    # should stay unchanged
    %time hashed_df = hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest()
    print("hashed_df", hashed_df)    
    
    assert hashed_df == test_hash

    print("Success!")
Asked By: Vagner


Answers:

If you avoid pd.concat() and use the preferred way of augmenting dataframes:

df["new_col_name"] = new_col_data

this will reduce peak memory consumption significantly.
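As a sanity check, here is a small tracemalloc comparison of the two approaches; it uses a hypothetical 1M-row array rather than the question’s 40M so it runs quickly:

```python
import tracemalloc

import numpy as np
import pandas as pd

rows, cols = 1_000_000, 10
data = np.random.default_rng(42).random(rows)
peaks = {}

def build_concat() -> pd.DataFrame:
    # Grow the frame with pd.concat: every step copies the accumulated frame.
    df = pd.DataFrame()
    for i in range(cols):
        df = pd.concat((df, pd.DataFrame({f"v{i}": data})), axis=1)
    return df

def build_assign() -> pd.DataFrame:
    # Grow the frame by direct column assignment: existing columns stay put.
    df = pd.DataFrame()
    for i in range(cols):
        df[f"v{i}"] = data
    return df

for build in (build_concat, build_assign):
    tracemalloc.start()
    build()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    peaks[build.__name__] = peak
    print(f"{build.__name__}: peak {peak / 1e6:.0f} MB")
```

On the concat path the last step briefly holds the old frame, the new column, and the concatenated result at once, which is where the roughly 2-3x peak comes from.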


In your code it is sufficient to fix the Transform class:

class Transform:
    """adding a column of random data"""
    __slots__ = ['var']
    def __init__(self, var: str):
        self.var = var

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        df[self.var] = generated_data
        return df

(Note that I also changed the type of var from float to str to reflect how it is used in the code).


On my machine I went from:

Current memory usage is 1600110.987 KB (1600.110987 MB); Peak was 4480116.325 KB (4480.116325 MB); Diff = 2880005.338 KB (2880.005338 MB)

to:

Current memory usage is 1760101.105 KB (1760.101105 MB); Peak was 1760103.477 KB (1760.1034769999999 MB); Diff = 2.372 KB (0.002372 MB)

(I am not sure why the current memory usage is slightly higher in this case).


For faster computation, you may want to do some pre-allocation.

To do that, you could replace, in Pipeline’s __init__():

self.df = pd.DataFrame()

with:

self.df = pd.DataFrame(data=np.empty((rows, trs)), columns=[f"v{i}" for i in range(trs)])
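A runnable sketch of that pre-allocation, with scaled-down hypothetical sizes (the names `rows`, `trs`, and `generated_data` mirror the question’s):

```python
import numpy as np
import pandas as pd

rows, trs = 1_000, 10  # scaled down from the question's 40M x 10
generated_data = np.random.default_rng(42).random(rows)

# Allocate the final shape once, then fill the columns.  Note that plain
# column assignment still replaces each column's buffer; the win here is
# avoiding repeated whole-frame copies while growing the frame.
df = pd.DataFrame(data=np.empty((rows, trs)),
                  columns=[f"v{i}" for i in range(trs)])
for col in df.columns:
    df[col] = generated_data

print(df.shape)  # (1000, 10)
```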

If you want to get even faster, you can compute the DataFrame right away in the Pipeline’s __init__, e.g.:

class Pipeline:
    __slots__ = ['df', 'transforms']
    def __init__(self):
        self.df = pd.DataFrame(data=generated_data[:, None] + np.zeros(trs)[None, :], columns=[f"v{i}" for i in range(trs)])

    def run(self):
        return self.df

but I assume your Transform is a proxy for a more complex operation, and I am not sure this simplification is easy to adapt beyond the toy code in the question.
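If the columns really are identical copies of one array, np.broadcast_to can build the (rows, trs) layout without the intermediate np.zeros addition; this is a sketch under that assumption, with hypothetical small sizes:

```python
import numpy as np
import pandas as pd

rows, trs = 1_000, 4  # hypothetical small sizes
generated_data = np.random.default_rng(42).random(rows)

# broadcast_to yields a read-only view of shape (rows, trs); .copy()
# materializes it in a single allocation, with no temporary zeros array
# and no elementwise addition.
tiled = np.broadcast_to(generated_data[:, None], (rows, trs)).copy()
df = pd.DataFrame(data=tiled, columns=[f"v{i}" for i in range(trs)])

print(df.shape)  # (1000, 4)
```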

Answered By: norok2