Pandas : TypeError: float() argument must be a string or a number
Question:
I have a dataframe that contains
user_id  date        browser  conversion  test  sex  age   country
1        2015-12-03  IE       1           0     M    32.0  US
Here is my code:
import pandas as pd
from sklearn import tree
data['date'] = pd.to_datetime(data.date)
columns = [c for c in data.columns.tolist() if c not in ["test"]]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(data[columns], data["test"])
I am getting this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-560-95a8a54aa939> in <module>()
4 from sklearn import tree
5 clf = tree.DecisionTreeClassifier(max_depth=2, min_samples_leaf = (len(data)/100) )
----> 6 clf = clf.fit(data[columns],data["test"])
C:\Users\SnehaPriya\Anaconda2\lib\site-packages\sklearn\tree\tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
152 random_state = check_random_state(self.random_state)
153 if check_input:
--> 154 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
155 if issparse(X):
156 X.sort_indices()
C:\Users\SnehaPriya\Anaconda2\lib\site-packages\sklearn\utils\validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
371 force_all_finite)
372 else:
--> 373 array = np.array(array, dtype=dtype, order=order, copy=copy)
374
375 if ensure_2d:
TypeError: float() argument must be a string or a number
How do I overcome this error?
Answers:
IIUC, you also need to exclude the date column:
columns = [c for c in columns if c not in ["test", 'date']]
because the datetime values cannot be converted to float. The full error is:
TypeError: float() argument must be a string or a number, not 'Timestamp'
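Before fitting, it can help to check which columns are not numeric, since those are the ones that cannot be cast to float as they are. The following is only a sketch: the two rows are made up to resemble the sample in the question, and `non_numeric` is a name chosen for illustration.

```python
import pandas as pd

# Made-up rows modeled on the question's sample; not the asker's real data.
data = pd.DataFrame({
    "user_id": [1, 2],
    "date": ["2015-12-03", "2015-12-04"],
    "browser": ["IE", "Chrome"],
    "conversion": [1, 0],
    "test": [0, 1],
    "sex": ["M", "F"],
    "age": [32.0, 27.0],
    "country": ["US", "UK"],
})
data["date"] = pd.to_datetime(data["date"])

# Columns with a non-numeric dtype (datetimes, strings) need handling first.
non_numeric = data.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

In this sketch the string columns (browser, sex, country) show up alongside date, so they would likely need encoding or dropping as well.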
A solution which keeps the date(time) column:
data['date'] = pd.to_numeric(pd.to_datetime(data['date']))
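As a rough end-to-end sketch of this approach, the snippet below converts the dates to integers and then fits the tree. The four rows are invented for illustration, and the one-hot encoding of the string columns with `pd.get_dummies` is an assumption added here so that the fit runs; it is not part of the original answer.

```python
import pandas as pd
from sklearn import tree

# Made-up rows modeled on the question's sample; not the asker's real data.
data = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "date": ["2015-12-03", "2015-12-04", "2015-12-05", "2015-12-06"],
    "browser": ["IE", "Chrome", "IE", "Safari"],
    "conversion": [1, 0, 0, 1],
    "test": [0, 1, 0, 1],
    "sex": ["M", "F", "F", "M"],
    "age": [32.0, 27.0, 41.0, 35.0],
    "country": ["US", "UK", "US", "FR"],
})

# Turn the dates into integers so they can be cast to float.
data["date"] = pd.to_numeric(pd.to_datetime(data["date"]))

# Assumption: one-hot encode the string columns, which would otherwise fail too.
features = pd.get_dummies(
    data.drop(columns=["test"]),
    columns=["browser", "sex", "country"],
    dtype=int,
)

clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(features, data["test"])
predictions = clf.predict(features)
```

scikit-learn's trees cast the features to float32 (the `dtype=DTYPE` in the traceback), so very large integer timestamps lose some fine-grained precision. For dates a day apart, as here, the values remain distinguishable.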
Ideas to preserve datetime as features in the model
If the dates matter only in terms of how much time has passed since each observation, you can keep the datetime column as a feature by converting it to the time elapsed between now and each datetime:
data['date'] = (pd.Timestamp('now') - pd.to_datetime(data['date'])).dt.total_seconds()
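A small variation, sketched under the assumption that a fixed reference date is acceptable: using a constant instead of `pd.Timestamp('now')` keeps the feature identical between runs. The reference date and the variable names below are hypothetical.

```python
import pandas as pd

dates = pd.Series(["2015-12-03", "2015-12-31"])

# Assumption: a fixed reference date instead of pd.Timestamp('now'),
# so the feature does not change each time the code is run.
reference = pd.Timestamp("2016-01-01")

elapsed_seconds = (reference - pd.to_datetime(dates)).dt.total_seconds()
elapsed_days = elapsed_seconds / 86400  # seconds per day
```

Dividing by 86400 expresses the same feature in days, which keeps the values smaller and easier to read when inspecting the fitted tree.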
Alternatively, convert the datetimes directly to integers (nanoseconds since the Unix epoch for the default datetime64[ns] dtype):
data['date'] = pd.to_datetime(data['date']).astype('int64')
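To illustrate that this is the same conversion as the `pd.to_numeric` approach, here is a minimal sketch on two made-up dates. The variable names are chosen for illustration only.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2015-12-03", "2015-12-04"]))

# Integer ticks since the Unix epoch, in the unit of the datetime dtype.
as_int = dates.astype("int64")

# pd.to_numeric exposes the same underlying integers.
via_to_numeric = pd.to_numeric(dates)

print(as_int.equals(via_to_numeric))
```

Either way, later dates map to larger integers, so the ordering of the observations is preserved for the tree's threshold splits.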
N.B. When converting strings to datetime, passing format= makes the conversion run much faster (about 25 times faster in the linked benchmark). See this post for the benchmark, and see this post for ideas on passing the format if your datetime column doesn't have a uniform format.
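For the ISO-style dates shown in the question, passing the format might look like the sketch below. The format string is an assumption based on the single sample row (2015-12-03) and would need adjusting for other layouts.

```python
import pandas as pd

raw = pd.Series(["2015-12-03", "2015-12-04"])

# Assumption: every value follows year-month-day, as in the sample row.
# Stating the layout up front means pandas does not have to guess it.
parsed = pd.to_datetime(raw, format="%Y-%m-%d")
```

With a format given, a value that does not match it raises an error by default rather than being silently parsed differently, which also helps catch inconsistent rows early.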