Pandas: TypeError: float() argument must be a string or a number

Question:

I have a dataframe that contains the following:

user_id    date       browser  conversion  test  sex  age  country
   1    2015-12-03       IE        1         0    M   32.0   US

Here is my code:

import pandas as pd
from sklearn import tree
data['date'] = pd.to_datetime(data.date)
columns = [c for c in data.columns.tolist() if c not in ["test"]]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(data[columns], data["test"])

I am getting this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-560-95a8a54aa939> in <module>()
      4 from sklearn import tree
      5 clf = tree.DecisionTreeClassifier(max_depth=2, min_samples_leaf = (len(data)/100) )
----> 6 clf = clf.fit(data[columns],data["test"])

C:\Users\SnehaPriya\Anaconda2\lib\site-packages\sklearn\tree\tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    152         random_state = check_random_state(self.random_state)
    153         if check_input:
--> 154             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    155             if issparse(X):
    156                 X.sort_indices()

C:\Users\SnehaPriya\Anaconda2\lib\site-packages\sklearn\utils\validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    371                                       force_all_finite)
    372     else:
--> 373         array = np.array(array, dtype=dtype, order=order, copy=copy)
    374 
    375         if ensure_2d:

TypeError: float() argument must be a string or a number

How do I overcome this error?

Asked By: Gingerbread

Answers:

IIUC, you also need to exclude the date column:

columns = [c for c in columns if c not in ["test", 'date']]

because the Timestamp values in the date column cannot be converted to float, as the full error message shows:

TypeError: float() argument must be a string or a number, not 'Timestamp'
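
A minimal sketch of why dropping the column helps, without needing scikit-learn installed. It mimics the np.array(array, dtype=dtype) call from the traceback by casting the frame to float32 directly. The frame, the helper name can_cast_to_float, and the float32 dtype are assumptions made for illustration; the frame keeps only numeric columns plus date so that the date column is the only possible culprit.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: numeric columns only, plus the parsed date column.
data = pd.DataFrame({
    "user_id": [1, 2, 3],
    "date": pd.to_datetime(["2015-12-03", "2015-12-04", "2015-12-05"]),
    "conversion": [1, 0, 1],
    "age": [32.0, 41.0, 27.0],
    "test": [0, 1, 0],
})


def can_cast_to_float(frame):
    """Return True if the frame converts to a float32 array (hypothetical helper)."""
    try:
        np.asarray(frame, dtype=np.float32)
        return True
    except (TypeError, ValueError):
        # Timestamp objects have no float() conversion, so the cast fails.
        return False


with_date = [c for c in data.columns if c != "test"]
without_date = [c for c in data.columns if c not in ["test", "date"]]

print(can_cast_to_float(data[with_date]))
print(can_cast_to_float(data[without_date]))
```

Only the second selection should cast cleanly, which is the selection you would then pass to clf.fit.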

Answered By: jezrael

A solution which keeps the date(time) column:

data['date'] = pd.to_numeric(pd.to_datetime(data['date']))

Answered By: niowniow

Ideas to preserve datetime as features in the model

Assuming the dates are relevant only with respect to how much time has passed since the observation, one way to keep the datetime column as a feature in the model is to convert it into the time difference between now and each datetime.

data['date'] = (pd.Timestamp('now') - pd.to_datetime(data['date'])).dt.total_seconds()
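
One caveat with pd.Timestamp('now') is that the feature changes every time the code runs. A hedged variation, assuming a fixed reference date suits your use case (the reference value and the two sample rows below are made up for illustration):

```python
import pandas as pd

# Two hypothetical rows in the same string format as the question's data.
data = pd.DataFrame({"date": ["2015-12-03", "2015-12-31"]})

# Fixed reference point (an assumption) so reruns produce identical features.
reference = pd.Timestamp("2016-01-01")

# Timedelta -> elapsed seconds as plain floats, which the classifier accepts.
data["date"] = (reference - pd.to_datetime(data["date"])).dt.total_seconds()

print(data["date"].tolist())  # [2505600.0, 86400.0], i.e. 29 days and 1 day
```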

Or you can convert the datetimes directly into integers (the underlying integer representation of the timestamps).

data['date'] = pd.to_datetime(data['date']).astype('int64')
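
A small sketch of what the integer conversion gives you, using made-up dates. It checks that the result is a plain int64 column, that chronological order is preserved, and that it matches the pd.to_numeric approach from the other answer. The exact integer values depend on the datetime resolution pandas uses, so they are not printed here.

```python
import pandas as pd

# Hypothetical dates, already in chronological order.
dates = pd.to_datetime(pd.Series(["2015-12-03", "2015-12-04", "2016-01-01"]))

as_int = dates.astype("int64")         # integer view of the timestamps
via_to_numeric = pd.to_numeric(dates)  # the approach from the other answer

print(as_int.dtype)                                # int64
print(as_int.is_monotonic_increasing)              # later dates map to larger integers
print(as_int.tolist() == via_to_numeric.tolist())  # both approaches agree
```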

N.B. When converting strings to datetime, passing format= makes the conversion run much faster (about 25 times faster in the linked benchmark). See this post for the benchmark and see this post for ideas to pass the format if your datetime column doesn't have a uniform format.
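
For the date strings shown in the question, passing the format might look like the sketch below. The format string "%Y-%m-%d" is an assumption based on the single sample row (2015-12-03); adjust it if your column differs. No timing is shown here, only the call itself.

```python
import pandas as pd

# Hypothetical strings in the same layout as the question's date column.
raw = pd.Series(["2015-12-03", "2015-12-04", "2016-01-01"])

# An explicit format means pandas does not have to infer the layout.
parsed = pd.to_datetime(raw, format="%Y-%m-%d")

print(parsed.dt.year.tolist())  # [2015, 2015, 2016]
```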

Answered By: cottontail