Meaning of sparse=False when pre-processing data with OneHotEncoder
Question:
I came across the meaning of setting sparse=False while pre-processing my data with a OneHotEncoder. I did:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ct = ColumnTransformer([
    ("scaling", StandardScaler(), sca_col),  # sca_col contains 3 columns
    ("onehot", OneHotEncoder(sparse=False, handle_unknown='ignore'), ohe_col)])  # ohe_col contains 15 columns
Then I train my model with:
from sklearn.model_selection import train_test_split

feat = df.drop("label", axis=1)
X_train, X_test, y_train, y_test = train_test_split(feat, df.label, random_state=0)
ct.fit(X_train)
I get the error
[...]
MemoryError: Unable to allocate 151. GiB for an array with shape (239076, 84497) and data type float64
The shape is right given my data and columns, but the array obviously does not fit in my RAM.
If I set sparse=True, which is the default, it works.
In which cases do you need to set sparse=False, which I apparently did for no obvious reason a couple of weeks ago?
Answers:
With sparse=True, you choose to represent your data in a sparse format. This saves a lot of memory when most of the elements of the array are zero.
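To make that concrete, here is a minimal sketch (toy data invented for illustration, not the asker's) comparing the footprint of the same one-hot output in dense and sparse form:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column with 1,000 distinct values.
rng = np.random.default_rng(0)
X = rng.integers(0, 1000, size=(100_000, 1))

# Dense output: a 100,000 x 1,000 float64 array, almost entirely zeros.
# (In scikit-learn >= 1.2 the parameter is spelled sparse_output.)
dense = OneHotEncoder(sparse=False).fit_transform(X)
print(dense.nbytes / 1e9)        # ~0.8 GB

# Sparse output (the default): only the non-zero entries are stored.
sp = OneHotEncoder(sparse=True).fit_transform(X)
print(sp.data.nbytes / 1e6)      # ~0.8 MB of values, plus small index arrays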
From scikit-learn’s ColumnTransformer documentation:
sparse_threshold : float, default=0.3
If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.
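A short sketch of how that threshold plays out on the transformer from the question (sca_col, ohe_col and X_train as above; whether the density actually falls below 0.3 depends on the data, though for wide one-hot output it almost always does):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# OneHotEncoder left at its sparse default, sparse_threshold left at 0.3.
ct = ColumnTransformer([
    ("scaling", StandardScaler(), sca_col),
    ("onehot", OneHotEncoder(handle_unknown='ignore'), ohe_col)])

Xt = ct.fit_transform(X_train)
print(ct.sparse_output_)   # True: the one-hot columns dominate, so density < 0.3
print(type(Xt))            # a scipy.sparse matrix

# sparse_threshold=0 would force a dense stacked result instead, which on
# data this wide reproduces the MemoryError from the question.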
Whether to use a sparse matrix depends on the matrix’s sparsity, i.e. the percentage of its values that are zero. In your case, if a sparse matrix resolves your memory restrictions, then it’s the way to go. If your matrix is not sparse enough, you won’t gain any memory savings, nor any speed-up from algorithms designed for sparse matrices.
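If you want to check the sparsity yourself before deciding, a quick way, using the sparse output Xt from the sketch above (the per-row count assumes the 3 scaled values are non-zero, plus one 1 per one-hot-encoded column):

density = Xt.nnz / (Xt.shape[0] * Xt.shape[1])
print(density)   # ~18 non-zeros per row out of 84,497 columns ≈ 0.0002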
In which cases do you need to set sparse=False…?
Some algorithms are not written to operate on sparse matrices, so forcing your OneHotEncoder to produce dense output is desirable despite the additional memory use. (See this question for a recent example.) It’s also easier to inspect the output of a dense matrix.
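On the viewing point, you rarely need the whole matrix dense at once; densifying a small slice is a common middle ground (a sketch, again with Xt and ct from above):

# Densify just a few rows for eyeballing, instead of the full 151 GiB array.
print(Xt[:5].toarray())

# Feature names make the one-hot columns readable
# (ColumnTransformer.get_feature_names_out exists in scikit-learn >= 1.0).
print(ct.get_feature_names_out()[:10])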
If your matrix has, say, fewer than ~64K columns/rows, then very fast BLAS routines can be used, built from a mix of highly optimized low-level C and/or Fortran matrix primitives. It takes a reasonably high degree of sparsity to "catch up" to the speed of those libraries on dense matrix calculations.
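A rough way to see that effect (an illustrative micro-benchmark with arbitrary sizes and density, not a rigorous measurement):

import numpy as np
import scipy.sparse as sps
from time import perf_counter

rng = np.random.default_rng(0)
A = rng.random((2000, 2000))
A[A < 0.7] = 0.0              # ~70% zeros: sparse-ish, but not extremely so
S = sps.csr_matrix(A)

t0 = perf_counter(); _ = A @ A; t1 = perf_counter()
t2 = perf_counter(); _ = S @ S; t3 = perf_counter()
print(f"dense BLAS matmul: {t1 - t0:.3f}s   sparse matmul: {t3 - t2:.3f}s")
# At this density the dense BLAS call usually wins; push the zeros towards
# 99%+ and the sparse product starts to pay off.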