pandas.DataFrame.convert_dtypes increasing memory usage

Question:

I would like to understand a bit more about pandas.DataFrame.convert_dtypes.

I have this DF imported from a SAS table:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   cd_unco_tab      857613 non-null  object        
 1   cd_ref_cnv       856389 non-null  object        
 2   cd_cli           849637 non-null  object        
 3   nm_prd           857613 non-null  object        
 4   nm_ctgr_cpr      857613 non-null  object        
 5   ts_cpr           857229 non-null  datetime64[ns]
 6   ts_cnfc          857613 non-null  datetime64[ns]
 7   ts_incl          857613 non-null  datetime64[ns]
 8   vl_cmss_rec      857613 non-null  float64       
 9   qt_prd           857613 non-null  float64       
 10  pc_cmss_rec      857242 non-null  float64       
 11  nm_loja          857242 non-null  object        
 12  vl_brto_cpr      857242 non-null  float64       
 13  vl_cpr           857242 non-null  float64       
 14  qt_dvlc          857613 non-null  float64       
 15  cd_in_evt_espl   857613 non-null  float64       
 16  cd_mm_aa_ref     840959 non-null  object        
 17  nr_est_ctbc_evt  857613 non-null  float64       
 18  nr_est_cnfc_pcr  18963 non-null   float64       
 19  cd_tran_pcr      0 non-null       object        
 20  ts_est           18963 non-null   datetime64[ns]
 21  tx_est_tran      18963 non-null   object        
 22  vl_tran          18963 non-null   float64       
 23  cd_pcr           0 non-null       float64       
 24  vl_cbac_cli      653563 non-null  float64       
 25  pc_cbac_cli      653563 non-null  float64       
 26  cd_vndr          18963 non-null   float64       
dtypes: datetime64[ns](4), float64(14), object(9)
memory usage: 176.7+ MB

Basically, the DF is composed of datetime64, float64 and object types, none of which are memory efficient (as far as I know).

I read that DataFrame.convert_dtypes can help optimize memory usage. This is the result:

dfcompras = dfcompras.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   cd_unco_tab      857613 non-null  string        
 1   cd_ref_cnv       856389 non-null  string        
 2   cd_cli           849637 non-null  string        
 3   nm_prd           857613 non-null  string        
 4   nm_ctgr_cpr      857613 non-null  string        
 5   ts_cpr           857229 non-null  datetime64[ns]
 6   ts_cnfc          857613 non-null  datetime64[ns]
 7   ts_incl          857613 non-null  datetime64[ns]
 8   vl_cmss_rec      857613 non-null  Float64       
 9   qt_prd           857613 non-null  Int64         
 10  pc_cmss_rec      857242 non-null  Float64       
 11  nm_loja          857242 non-null  string        
 12  vl_brto_cpr      857242 non-null  Float64       
 13  vl_cpr           857242 non-null  Float64       
 14  qt_dvlc          857613 non-null  Int64         
 15  cd_in_evt_espl   857613 non-null  Int64         
 16  cd_mm_aa_ref     840959 non-null  string        
 17  nr_est_ctbc_evt  857613 non-null  Int64         
 18  nr_est_cnfc_pcr  18963 non-null   Int64         
 19  cd_tran_pcr      0 non-null       Int64         
 20  ts_est           18963 non-null   datetime64[ns]
 21  tx_est_tran      18963 non-null   string        
 22  vl_tran          18963 non-null   Float64       
 23  cd_pcr           0 non-null       Int64         
 24  vl_cbac_cli      653563 non-null  Float64       
 25  pc_cbac_cli      653563 non-null  Float64       
 26  cd_vndr          18963 non-null   Int64         
dtypes: Float64(7), Int64(8), datetime64[ns](4), string(8)
memory usage: 188.9 MB

Most columns were changed from object to string and from float64 to Int64, so I expected memory usage to go down. Instead, as we can see, it increased!

Any guess?

Asked By: FábioRB

||

Answers:

After some analysis, it seems there is additional memory overhead when using the new nullable Int64/Float64 dtypes. An Int64/Float64 column takes approximately 9 bytes per value, while an int64/float64 column takes 8 bytes per value. The extra byte comes from the boolean mask that the nullable dtypes store next to the data to record which values are missing.

Here is a small example to demonstrate this:

import pandas as pd

pd.DataFrame({'col': range(10)}).astype('float64').memory_usage()

Index    128
col       80 # 8 bytes per item * 10 items
dtype: int64

pd.DataFrame({'col': range(10)}).astype('Float64').memory_usage()

Index    128
col       90 # 9 bytes per item * 10 items
dtype: int64

Now, coming back to your example: after executing convert_dtypes, 15 columns were converted to the Int64/Float64 dtypes (the 14 float64 columns plus the all-null object column cd_tran_pcr). Let's calculate the number of extra bytes required to store the data with the new types. The formula is fairly simple: n_columns * n_rows * overhead_in_bytes

>>> extra_bytes = 15 * 857613 * 1
>>> extra_mega_bytes = extra_bytes / 1024 ** 2
>>> extra_mega_bytes 
12.2682523727417

It turns out extra_mega_bytes is around 12.27 MB, which is approximately the same as the difference between the memory usage of your new and old dataframes (188.9 MB - 176.7 MB = 12.2 MB).
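The same arithmetic can be checked per column on a small synthetic frame. This is only a sketch: the column names `amount` and `quantity` and the row count are made up for illustration, and it assumes a pandas version recent enough to have the Float64 dtype. It compares `memory_usage` before and after `convert_dtypes`, which should show one extra byte per row for each converted column.

```python
import numpy as np
import pandas as pd

n_rows = 1_000  # hypothetical size, much smaller than the frame in the question

df = pd.DataFrame({
    # fractional values: convert_dtypes should turn this into Float64
    "amount": np.linspace(0.5, 99.5, n_rows),
    # whole numbers stored as float64: convert_dtypes should turn this into Int64
    "quantity": np.arange(n_rows, dtype="float64"),
})

before = df.memory_usage(index=False)        # bytes per column, 8 per value
converted = df.convert_dtypes()
after = converted.memory_usage(index=False)  # bytes per column after conversion

overhead = after - before                    # extra bytes per column
print(converted.dtypes)
print(overhead)
```

If the overhead printed for each column equals the number of rows, that matches the 1 byte per value used in the formula above.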

Some details about the new nullable integer datatype:

Int64/Float64 (notice the capital first letter) are some of the new nullable types. The nullable integer types were first introduced in pandas 0.24, pd.NA followed in pandas 1.0, and Float64 in pandas 1.2. At a high level, they allow you to use pd.NA instead of np.nan to represent missing values. The implication of this is easier to understand with the following example:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])
print(s)

0    1.0
1    2.0
2    NaN
dtype: float64

Let’s say you have a series s. When you check its dtype, you will see that pandas has automatically cast it to float64 because of the presence of null values. This is not problematic in most cases, but if the column acts as an identifier, the automatic conversion to float is undesirable. To prevent this, pandas introduced the new nullable integer types.

s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s)

0       1
1       2
2    <NA>
dtype: Int64
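This is also what happened to columns such as cd_vndr in your frame: they held whole numbers but were stored as float64 because of the missing values. A minimal sketch of that conversion, using made-up identifier values:

```python
import numpy as np
import pandas as pd

# Hypothetical identifier column: whole numbers, but float64 because of the NaN
ids = pd.Series([101.0, 102.0, np.nan])
print(ids.dtype)

# convert_dtypes picks the nullable integer type when every
# non-missing value is a whole number
converted = ids.convert_dtypes()
print(converted.dtype)
print(converted.isna().tolist())
```

The missing entry is kept as pd.NA, so the remaining values can be displayed and compared as integers again.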

Some details on the string dtype:

As of now, there isn’t much difference in performance or memory usage when using the new string type, but this may change in the future. See this quote from the pandas docs:

Currently, the performance of object dtype arrays of strings and
StringArray are about the same. We expect future enhancements to
significantly increase the performance and lower the memory overhead
of StringArray.
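If you want to check this on your own pandas version, you can measure a column both ways. The sketch below uses made-up sample values and passes `deep=True` so that the Python string objects themselves are counted. The `+` in `176.7+ MB` in your first `info()` output indicates that this deep size was not included for the object columns. The exact numbers depend on the pandas version and on the string storage backend, so no expected output is shown.

```python
import pandas as pd

# Hypothetical sample values, repeated to get a few thousand rows
words = ["alpha", "beta", "gamma", None] * 1_000

as_object = pd.Series(words, dtype="object")
as_string = as_object.astype("string")

# Shallow size counts only the array of references for object columns
shallow_object = as_object.memory_usage(index=False)
# Deep size also counts the Python objects the references point to
deep_object = as_object.memory_usage(index=False, deep=True)
deep_string = as_string.memory_usage(index=False, deep=True)

print("object (shallow):", shallow_object)
print("object (deep):   ", deep_object)
print("string (deep):   ", deep_string)
```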

Answered By: Shubham Sharma