How to optimize the code and reduce memory usage in Python
Question:
The purpose is to reduce memory usage, meaning the code should be optimized in a way that keeps the resulting hash equal to the test hash.
What I’ve tried so far:
- Adding __slots__, but it didn’t make any difference.
- Changing the default dtype from float64 to float32. Although it reduces memory usage significantly, it breaks the test by changing the hash.
- Converting the data into an np.array, which reduced CPU time from 13 s to 2.05 s but didn’t affect the memory usage.
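The float32 failure is easy to confirm at a small scale: pd.util.hash_pandas_object hashes the raw bytes of the values, so the same numbers stored in a different dtype produce a different digest. A minimal sketch with toy data (not the 40M-row array from the question):

```python
import hashlib

import numpy as np
import pandas as pd

values = np.array([0.1, 0.2, 0.3])  # float64 by default
df64 = pd.DataFrame({"v0": values})
df32 = pd.DataFrame({"v0": values.astype(np.float32)})

# same hashing scheme as in the question's test
h64 = hashlib.sha256(pd.util.hash_pandas_object(df64, index=True).values).hexdigest()
h32 = hashlib.sha256(pd.util.hash_pandas_object(df32, index=True).values).hexdigest()

# the bit patterns of float32 and float64 values differ, so the hashes differ
print(h64 != h32)
```

This is why any dtype change, even one that preserves the printed values, will fail the `hashed_df == test_hash` assertion.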
The code to reproduce:
import hashlib
import random
import tracemalloc
import typing as tp

import numpy as np
import pandas as pd

rows = 40000000
trs = 10
random.seed(42)
generated_data: tp.List[float] = np.array([random.random() for _ in range(rows)])
def df_upd(df_initial: pd.DataFrame, df_new: pd.DataFrame) -> pd.DataFrame:
    return pd.concat((df_initial, df_new), axis=1)

class T:
    """adding a column of random data"""
    __slots__ = ['var']

    def __init__(self, var: float):
        self.var = var

    def transform(self, df_initial: pd.DataFrame) -> pd.DataFrame:
        return df_upd(df_initial, pd.DataFrame({self.var: generated_data}))

class Pipeline:
    __slots__ = ['df', 'transforms']

    def __init__(self):
        self.df = pd.DataFrame()
        self.transforms = np.array([T(f"v{i}") for i in range(trs)])

    def run(self):
        for t in self.transforms:
            self.df = t.transform(self.df)
        return self.df

if __name__ == "__main__":
    # starting the monitoring
    tracemalloc.start()
    # function call
    pipe = Pipeline()
    %time df = pipe.run()
    print("running")
    # displaying the memory
    current, peak = tracemalloc.get_traced_memory()
    print(f"Current memory usage is {current / 10**3} KB ({(current / 10**3)*0.001} MB); Peak was {peak / 10**3} KB ({(peak / 10**3)*0.001} MB); Diff = {(peak - current) / 10**3} KB ({((peak - current) / 10**3)*0.001} MB)")
    # stopping the library
    tracemalloc.stop()
    # should stay unchanged
    %time hashed_df = hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest()
    print("hashed_df", hashed_df)
    assert hashed_df == test_hash
    print("Success!")
Answers:
If you avoid pd.concat() and instead use the preferred way of adding a column to a DataFrame:
df["new_col_name"] = new_col_data
this will reduce peak memory consumption significantly.
In your code it is sufficient to fix the Transform class (T in the question):
class Transform:
    """adding a column of random data"""
    __slots__ = ['var']

    def __init__(self, var: str):
        self.var = var

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        df[self.var] = generated_data
        return df
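The difference is observable with tracemalloc at a fraction of the question's size. The sketch below uses toy sizes and hypothetical helper names (build_with_concat, build_in_place, peak_kb are mine, not from the question); each pd.concat allocates a fresh frame holding copies of all existing columns, while in-place assignment only allocates the new column:

```python
import tracemalloc

import numpy as np
import pandas as pd

n_rows, n_cols = 200_000, 10
col = np.random.default_rng(42).random(n_rows)

def build_with_concat() -> pd.DataFrame:
    df = pd.DataFrame()
    for i in range(n_cols):
        # each concat builds a new DataFrame containing every existing column
        df = pd.concat((df, pd.DataFrame({f"v{i}": col})), axis=1)
    return df

def build_in_place() -> pd.DataFrame:
    df = pd.DataFrame()
    for i in range(n_cols):
        df[f"v{i}"] = col  # only the new column is added
    return df

def peak_kb(fn) -> float:
    """Peak traced allocation of one call, in KB."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 10**3

print(f"concat peak:   {peak_kb(build_with_concat):.0f} KB")
print(f"in-place peak: {peak_kb(build_in_place):.0f} KB")
```

Both builders produce identical frames, so the hash in the question's test is unaffected; only the peak footprint changes. Exact numbers depend on the pandas version and its copy behavior.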
(Note that I also changed the type of var from float to str to reflect how it is used in the code.)
On my machine I went from:
Current memory usage is 1600110.987 KB (1600.110987 MB); Peak was 4480116.325 KB (4480.116325 MB); Diff = 2880005.338 KB (2880.005338 MB)
to:
Current memory usage is 1760101.105 KB (1760.101105 MB); Peak was 1760103.477 KB (1760.1034769999999 MB); Diff = 2.372 KB (0.002372 MB)
(I am not sure why the current memory usage is slightly higher in this case).
For faster computation, you may want to do some pre-allocation.
To do that, you could replace the following line in Pipeline’s __init__():
self.df = pd.DataFrame()
with:
self.df = pd.DataFrame(data=np.empty((rows, trs)), columns=[f"v{i}" for i in range(trs)])
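A reduced-size sketch of the pre-allocation idea (toy values of rows and trs replace the question's): the columns "v0".."v9" already exist after construction, so each transform only overwrites data instead of growing the frame one column at a time.

```python
import numpy as np
import pandas as pd

rows, trs = 1_000, 10
generated_data = np.random.default_rng(42).random(rows)

# allocate the full (rows, trs) block up front, with the final column names
df = pd.DataFrame(data=np.empty((rows, trs)),
                  columns=[f"v{i}" for i in range(trs)])

for i in range(trs):
    df[f"v{i}"] = generated_data  # fill the pre-allocated column

print(df.shape)  # (1000, 10)
```

Note that np.empty leaves garbage in the columns, so every column must actually be written by a transform before the frame is hashed.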
If you want to go even faster, you can compute the DataFrame right away in Pipeline’s __init__(), e.g.:
class Pipeline:
    __slots__ = ['df', 'transforms']

    def __init__(self):
        self.df = pd.DataFrame(data=generated_data[:, None] + np.zeros(trs)[None, :],
                               columns=[f"v{i}" for i in range(trs)])

    def run(self):
        return self.df
but I assume your Transform is a proxy for a more complex operation, and I am not sure this simplification is easy to adapt beyond the toy code in the question.
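For anyone unfamiliar with the broadcasting trick used in that __init__, a quick check at toy size: adding a (rows, 1) column vector to a (1, trs) row of zeros broadcasts to a (rows, trs) array whose columns are all bit-identical copies of generated_data, so the values (and hence the hash) match the column-by-column version.

```python
import numpy as np
import pandas as pd

rows, trs = 1_000, 4
generated_data = np.random.default_rng(42).random(rows)

# (rows, 1) + (1, trs) broadcasts to (rows, trs); adding 0.0 leaves
# each float's bit pattern unchanged
data = generated_data[:, None] + np.zeros(trs)[None, :]
df = pd.DataFrame(data=data, columns=[f"v{i}" for i in range(trs)])

print(df.shape)  # (1000, 4)
print(all((df[c].values == generated_data).all() for c in df.columns))
```

np.broadcast_to(generated_data[:, None], (rows, trs)) would express the same intent without the arithmetic, at the cost of producing a read-only view that pandas then copies.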