Pandas: ValueError (any way to convert Sparse[float64, 0.0] dtypes to float64 datatype?)
Question:
I have a DataFrame X_train to which I am concatenating two other DataFrames. The second and third DataFrames are built from sparse matrices generated by a TF-IDF vectorizer:
q1_train_df = pd.DataFrame.sparse.from_spmatrix(q1_tdidf_train,index=X_train.index,columns=q1_features)
q2_train_df = pd.DataFrame.sparse.from_spmatrix(q2_tdidf_train,index=X_train.index,columns=q2_features)
X_train_final = pd.concat([X_train,q1_train_df,q2_train_df],axis=1)
The dtypes of X_train_final look like this:
X_train_final.dtypes
cwc_min float64
cwc_max float64
csc_min float64
csc_max float64
ctc_min float64
...
q2_zealand Sparse[float64, 0.0]
q2_zero Sparse[float64, 0.0]
q2_zinc Sparse[float64, 0.0]
q2_zone Sparse[float64, 0.0]
q2_zuckerberg Sparse[float64, 0.0]
Length: 10015, dtype: object
I am using XGBoost to train on this final DataFrame, and it throws an error when I try to fit the data:
model.fit( X_train_final,y_train)
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields q1_04, q1_10, q1_100, q
I think the error is caused by the Sparse[float64, 0.0] dtypes in the DataFrame. Can you please help me out? I can't figure out how to get past this error.
Answers:
I just came across the exact same issue. I had a list of columns generated by a TF-IDF vectorizer and was trying to use XGBoost on the dataset.
This ended up working for me:
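import pandas as pd
import xgboost as xgb

# Coerce every column to a numeric dtype; unparseable values become NaN.
train_df = train_df.apply(pd.to_numeric, errors='coerce')
# Convert the sparse TF-IDF columns to dense float64.
train_df[tf_idf_column_names] = train_df[tf_idf_column_names].sparse.to_dense()
# The first column is the label; the remaining columns are the features.
train_x = train_df.iloc[:, 1:]
train_y = train_df.iloc[:, :1]
dtrain = xgb.DMatrix(data=train_x, label=train_y)
param = {'max_depth': 2, 'eta': 1, 'verbosity': 0, 'objective': 'binary:logistic'}
num_round = 2
bst = xgb.train(param, dtrain, num_round)
# dtest is a DMatrix built from the test set in the same way (not shown here).
preds = bst.predict(dtest)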
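If the names of the TF-IDF columns are not at hand, the sparse columns can be found from the dtypes instead. The following is only a minimal sketch on toy data: the column names and values are made up for illustration, and it builds a new DataFrame rather than assigning back in place.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for the question's data (hypothetical names and values).
X_train = pd.DataFrame({"cwc_min": [0.5, 0.0, 1.0]})
tfidf = sparse.csr_matrix(np.array([[0.0, 0.3], [0.7, 0.0], [0.0, 0.0]]))
q1_train_df = pd.DataFrame.sparse.from_spmatrix(
    tfidf, index=X_train.index, columns=["q1_a", "q1_b"]
)
X_train_final = pd.concat([X_train, q1_train_df], axis=1)

# Collect the columns whose dtype is a pandas SparseDtype.
sparse_cols = [
    col for col, dtype in X_train_final.dtypes.items()
    if isinstance(dtype, pd.SparseDtype)
]

# Densify only those columns and glue them back onto the dense ones.
dense_part = X_train_final[sparse_cols].sparse.to_dense()
X_dense = pd.concat(
    [X_train_final.drop(columns=sparse_cols), dense_part], axis=1
)
print(X_dense.dtypes)
```

Keep in mind that densifying a wide TF-IDF block allocates a full float64 array, so memory use grows with rows times columns.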
from scipy.sparse import hstack

X_train_final = hstack(blocks=(
    x_tr_cwc_min,
    x_tr_cwc_max,
    x_tr_csc_min,
    x_tr_csc_max,
    x_tr_ctc_min,
    x_tr_ctc_max,
    x_tr_last_word_eq,
    x_tr_first_word_eq,
    x_tr_abs_len_diff,
    x_tr_mean_len,
    x_tr_token_set_ratio,
    x_tr_token_sort_ratio,
    x_tr_fuzz_ratio,
    x_tr_fuzz_partial_ratio,
    x_tr_longest_substr_ratio,
    q1_tdidf_train,
    q2_tdidf_train,
)).tocsr()
Here, instead of using the X_train DataFrame directly, I took the individual columns of X_train and converted each of them to an ndarray.
Converting to dense worked, but for the DataFrame I used it consumed almost 3 GB of memory, so I had to go with this approach instead.
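A small self-contained sketch of this stacking approach follows. The arrays are toy data with hypothetical values, and it assumes each dense feature starts out as a 1-D ndarray, in which case it has to be reshaped into a single column so that its row count matches the TF-IDF matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical toy data: two dense feature columns and one TF-IDF block.
x_tr_cwc_min = np.array([0.5, 0.0, 1.0])
x_tr_cwc_max = np.array([0.2, 0.9, 0.4])
q1_tfidf_train = csr_matrix(np.array([[0.0, 0.3], [0.7, 0.0], [0.0, 0.0]]))

# hstack needs 2-D blocks with the same number of rows, so each 1-D
# feature is reshaped to (n_rows, 1) before stacking.
blocks = [
    csr_matrix(x_tr_cwc_min.reshape(-1, 1)),
    csr_matrix(x_tr_cwc_max.reshape(-1, 1)),
    q1_tfidf_train,
]
X_train_final = hstack(blocks).tocsr()
print(X_train_final.shape)
```

XGBoost accepts SciPy CSR matrices as input, so a matrix built this way can be passed to it without ever materializing the dense TF-IDF block. The column names are lost, though, so any feature-name bookkeeping has to be done separately.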
If df is Sparse[float64, 0], you can use df.values to convert it to float64.
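As a rough illustration of this suggestion, the sketch below uses a tiny all-sparse DataFrame with made-up column names. It also shows the DataFrame.sparse.to_coo() accessor as an alternative that gets back to a SciPy matrix without densifying.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical 2x2 TF-IDF-like matrix wrapped in a sparse DataFrame.
mat = sparse.csr_matrix(np.array([[0.0, 0.3], [0.7, 0.0]]))
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=["q1_a", "q1_b"])

# .values materializes the data as a dense NumPy array.
dense = df.values
print(dense.dtype, dense.shape)

# .sparse.to_coo() returns a SciPy COO matrix and stays sparse.
coo = df.sparse.to_coo()
print(coo.shape, coo.nnz)
```

Note that df.values allocates the full dense array, so it carries the same memory cost as sparse.to_dense() on a large vocabulary.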