Unsupported operand type on Time series management

Question:

I want to prepare my time series dataset for use with the tsfresh library and perform time series classification.

I am following this tutorial and adapting its code to my data.
I have completed some of the steps so far, but an error occurs in the data-splitting part:

import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Import the data path (my time series path)
data_path = 'PATH'

#Import the csv containing the label (in my case "Reussite_Sevrage")
target_df = pd.read_csv("PATH.csv",encoding="ISO-8859-1", dtype={'ID': 'str'})

# Drop the unused rows (they contain NaN values at the end of the dataset)
target_df = target_df.iloc[0:57,:]

# Definition of the labels
labels = target_df['Reussite_sevrage']

# Definition of the df containing the IDs
sequence_ids=target_df['ID']


#Splitting the data 
train_ids, test_ids, train_labels, test_labels = train_test_split(sequence_ids, labels, test_size=0.2)

#Create the X_train and X_test dataframe
X_train = pd.DataFrame()
X_test = pd.DataFrame()

# Now, we will loop through the training sequence IDs and the testing sequence IDs. 
# For each of these sequence IDs, we will read the corresponding time series data CSV file and add it to the main dataframe.
# We will also add a column for the sequence number and a step column which contains integers representing the time step in the sequence

for i, sequence in enumerate(train_ids):
    inputfile = 'PATH'/ f"{sequence}.txt"
    if inputfile.exists():
        df = pd.read_csv(os.path.join(data_path, 'PAD/', "%s.txt" % sequence), 
                delimiter='\t',  # columns are separated by tabs
                header=None,  # there's no header information
                #parse_dates=[[0, 1]],  # the first and second columns should be combined and converted to datetime objects
                #infer_datetime_format=True,
                decimal=",")
        df = df.iloc[:,1]
        df = df.to_frame(name ='values')
        df.insert(0, 'sequence', i)
        df['step'] = np.arange(df.shape[0]) # creates a range of integers starting from 0 to the number of the measurements.
        X_train = pd.concat([X_train, df])

I added a condition in the loop so that only existing files are processed. Missing data are represented by missing files. If I omit this condition, the loop stops as soon as it encounters a missing file.

inputfile = PATH / f"{sequence}.txt"
if inputfile.exists():

But this error occurs:
unsupported operand type(s) for /: 'str' and 'str'

I don't know whether the error is caused by dtype={'ID': 'str'} in the data loading step, but I need it because the IDs are formatted like this: 0001, 0002, 0003… Without this option, the IDs are converted to 1, 2, 3…
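To illustrate the leading-zero behaviour described above, here is a minimal sketch. The CSV content is a hypothetical stand-in for the real label file, read from memory rather than from disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real label file
csv_text = "ID,Reussite_sevrage\n0001,1\n0002,0\n"

# Without dtype, pandas infers integers and the leading zeros are lost
as_int = pd.read_csv(io.StringIO(csv_text))

# With dtype={'ID': 'str'}, the IDs stay as zero-padded strings
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"ID": "str"})

print(as_int["ID"].tolist())  # [1, 2]
print(as_str["ID"].tolist())  # ['0001', '0002']
```

Keeping the IDs as strings matters here because they are later used to build file names such as 0001.txt.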

sequence_ids, train_ids, train_labels, test_ids and test_labels are pandas Series, and each sequence is a str.

Do you have a solution for this problem, please?

Thank you very much

Asked By: Romain LOMBARDI

||

Answers:

I would suggest using the pathlib module to handle file paths. The error occurs because the / operator is not defined between two plain strings; it only joins paths when at least one operand is a Path object. Import it with from pathlib import Path and build inputfile as Path(data_path) / f"PAD/{sequence}.txt". This creates a Path object pointing to the sequence file, so you can call its exists() method.

Final code:

from pathlib import Path

# Import the data path (my time series path)
data_path = 'PATH'

...


for i, sequence in enumerate(train_ids):
    inputfile = Path(data_path) / f"PAD/{sequence}.txt"
    
    if inputfile.exists():
        df = pd.read_csv(
            inputfile, 
            delimiter='\t',  # columns are separated by tabs
            header=None,  # there's no header information
            decimal=","
        )

        ...
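
As a small self-contained check of the point above, the sketch below reproduces the error with two plain strings and then shows the same join working with a Path object. The temporary directory, the ID "0001" and the file contents are hypothetical and only there to make the example runnable.

```python
import tempfile
from pathlib import Path

sequence = "0001"  # hypothetical sequence ID

# Dividing two plain strings raises the TypeError from the question
try:
    "PATH" / f"{sequence}.txt"
except TypeError as err:
    message = str(err)
    print(message)

# With a Path on the left, "/" joins path components instead
with tempfile.TemporaryDirectory() as tmp:
    data_path = tmp  # stands in for the real data folder
    pad_dir = Path(data_path) / "PAD"
    pad_dir.mkdir()
    (pad_dir / "0001.txt").write_text("a\t1,5\n")

    existing = Path(data_path) / f"PAD/{sequence}.txt"
    missing = Path(data_path) / "PAD/0002.txt"
    found = existing.exists()      # the file was created above
    not_found = missing.exists()   # no such file was created

print(found, not_found)
```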

Answered By: Ashyam

Thank you for your help!

I tried to improve my code by taking the empty files (which correspond to missing data) into account. I tried adding an empty dataframe for each empty file, but no rows are added. Would it be possible to add missing values for these files instead, in order to avoid a later mismatch (and the resulting errors) between the number of sequences in X_train and in train_labels? Like this:

from pathlib import Path
from pandas.errors import EmptyDataError

for i, sequence in enumerate(test_ids):
    inputfile = Path(data_path) / f"PAD/{sequence}.txt"
    if inputfile.exists():
        try :
            df = pd.read_csv(os.path.join(data_path, 'PAD/', "%s.txt" % sequence), 
                delimiter='\t',  # columns are separated by tabs
                header=None,  # there's no header information
                #parse_dates=[[0, 1]],  # the first and second columns should be combined and converted to datetime objects
                #infer_datetime_format=True,
                decimal=",")
            df = df.iloc[:,1]
            df = df.to_frame(name ='values')
        except EmptyDataError:
            df = pd.DataFrame()
        df.insert(0, 'sequence', i)
        df['step'] = np.arange(df.shape[0]) # creates a range of integers starting from 0 to the number of the measurements.
        X_test = pd.concat([X_test, df])

I don’t know how to do that.
Thank you !

Answered By: Romain LOMBARDI
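
One possible way to keep every sequence represented, offered as a sketch and not as a confirmed solution: inserting a scalar into an empty DataFrame produces zero rows, so the placeholder needs at least one row of its own. The helper load_sequence, the sample files and the choice of a single NaN row are assumptions made for illustration. The NaN placeholders will probably need imputing or filtering before feature extraction, which is not checked here.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from pandas.errors import EmptyDataError


def load_sequence(inputfile, seq_number):
    """Hypothetical helper: one frame per file, or a one-row NaN placeholder."""
    try:
        raw = pd.read_csv(inputfile, delimiter="\t", header=None, decimal=",")
        df = raw.iloc[:, 1].to_frame(name="values")
    except (FileNotFoundError, EmptyDataError):
        # Missing or empty file: keep the sequence with a single NaN value
        df = pd.DataFrame({"values": [np.nan]})
    df.insert(0, "sequence", seq_number)
    df["step"] = np.arange(df.shape[0])
    return df


with tempfile.TemporaryDirectory() as tmp:
    pad = Path(tmp) / "PAD"
    pad.mkdir()
    (pad / "0001.txt").write_text("t0\t1,5\nt1\t2,5\n")  # two measurements
    (pad / "0002.txt").write_text("")                      # empty file
    # "0003.txt" is deliberately absent

    ids = ["0001", "0002", "0003"]
    frames = [load_sequence(pad / f"{s}.txt", i) for i, s in enumerate(ids)]
    X = pd.concat(frames, ignore_index=True)

print(X)
```

With this layout the number of distinct values in the sequence column matches the number of IDs, so the labels stay aligned with the sequences.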