Why do I get a TypeError (unsupported operand type(s) for -: 'str' and 'int') while the data is numerical?

Question:

I want to create a new column and calculate the value based on the other column values of the dataframe. The input data is a large .csv file containing temperature values for each hour of the day, for multiple years.
The dataframe looks like this:

   HH      T        date  t_min  t_max
0   8   94.0  1991-04-01     81    110
1   9   90.0  1991-04-01     81    110
2  10   95.0  1991-04-01     81    110
3  11  108.0  1991-04-01     81    110
4  12  110.0  1991-04-01     81    110
5  13  109.0  1991-04-01     81    110
6  14   81.0  1991-04-01     81    110
7  15   85.0  1991-04-01     81    110
8  16   85.0  1991-04-01     81    110
9  17   87.0  1991-04-01     81    110

HH = hours; T = temperature of the hour; t_min = lowest day temp; t_max = highest day temp

I tried calculating a new column "HTD" (hourly temp deviation) with the following code:

import pandas as pd
df_t = pd.read_csv('file.csv')

# calculation = (T - Tmin) / (Tmax - Tmin) * (Tmax - Tmin)
df_t[ 'HTD'] = (df_t.T - df_t.t_min) / (df_t.t_max - df_t.t_min) * (df_t.t_max - df_t.t_min)

This raises TypeError: unsupported operand type(s) for -: 'str' and 'int' on the last line. The problem seems to be the T column, because the code runs when I use df_t.t_min instead of df_t.T. I checked the data in the T column:

# First check: null values
df_t.isnull().sum()

# Second check: non-numerical values
import numpy as np

result = df_t.applymap(np.isreal)
for value in result['T']:
    if not value:
        print(value)

This showed no null values and reported no non-numerical values. I also tried using .astype() to make sure the data is the right type.

What is my best course of action to solve this issue?
(Apologies if my question is incomplete or unclear; this is my first time.)

Asked By: Ellis

||

Answers:

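Normally you can use df.<colname> as a shortcut for df['<colname>']. However, df.T is a pandas property that returns the transpose of the dataframe (rows become columns, columns become rows), so df_t.T never refers to your T column. The transposed frame also contains the date strings, which is presumably why the subtraction fails with a 'str' and 'int' error.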

As a simple example:

df = pd.DataFrame({"Something": [1,2,3], 'T': [35,36,37]})
df.T # returns the transpose, not the 'T' column

You can fix this by simply accessing the column using square brackets:

df["T"] # returns the column you want
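To see the difference between the two forms of access, here is a minimal sketch. The frame is a made-up, two-row stand-in for the data in the question, with a string date column added so the effect on the transposed frame is visible.

```python
import pandas as pd

# Illustrative data only: a numeric "T" column plus a string column,
# loosely mirroring the layout described in the question.
df = pd.DataFrame({
    "T": [94.0, 90.0],
    "date": ["1991-04-01", "1991-04-01"],
    "t_min": [81, 81],
})

transposed = df.T   # attribute access: the whole frame, transposed
column = df["T"]    # bracket access: the column named "T"

print(type(transposed).__name__, transposed.shape)  # DataFrame (3, 2)
print(type(column).__name__, column.dtype)          # Series float64

# Each column of the transposed frame now mixes numbers and date strings,
# so pandas stores it with the generic object dtype.
print(transposed.dtypes)
```

Because the transposed frame holds the date strings alongside the numbers, arithmetic on it can run into string operands, which would be consistent with the error message in the question.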

Or by renaming your column to one that doesn't clash with a built-in pandas DataFrame property or method (other examples include df.shape and df.size; use dir(df) to see all the potential clashes).
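One possible way to spot clashes programmatically is sketched below. It checks each column name against the DataFrame class itself, since hasattr on an instance is also true for ordinary columns. The "size" column and the replacement names "temp" and "n_items" are hypothetical and only there for illustration.

```python
import pandas as pd

# Illustrative frame: "T" and "size" both collide with DataFrame attributes.
df = pd.DataFrame({"Something": [1, 2, 3], "T": [35, 36, 37], "size": [1, 2, 3]})

# Compare against the class, not the instance: hasattr(df, name) would
# also be True for any regular column reachable via attribute access.
clashes = [c for c in df.columns if isinstance(c, str) and hasattr(pd.DataFrame, c)]
print(clashes)  # ['T', 'size']

# Rename the clashing columns (the new names are hypothetical choices).
df = df.rename(columns={"T": "temp", "size": "n_items"})
print(df.temp.tolist())  # [35, 36, 37]
```

After the rename, attribute access returns the column again, although bracket access would have worked without renaming.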

In general, square bracket access is safer, if a bit less convenient, because it is guaranteed never to clash with an attribute name. I would use .attribute access only as a shortcut, or for names I am certain will not clash (most pandas attributes are lowercase, so capitalized column names rarely clash, although T is an exception).

Answered By: T. Hall

Don't use df_t.T; it transposes the dataframe. Here is the solution:

import pandas as pd

df_t = pd.DataFrame({
    "HH": [8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
    "T": [94.0, 90.0, 95.0, 108.0, 110.0, 109.0, 81.0, 85.0, 85.0, 87.0],
    "date": ["1991-04-01"] * 10,
    "t_min": [81] * 10,
    "t_max": [110] * 10,
})

# calculation = (T - Tmin) / (Tmax - Tmin) * (Tmax - Tmin)
# Use df_t['T'] for the column; df_t.T would return the transposed dataframe
df_t['HTD'] = (df_t['T'] - df_t.t_min) / (df_t.t_max - df_t.t_min) * (df_t.t_max - df_t.t_min)
print(df_t.head())

   HH      T        date  t_min  t_max   HTD
0   8   94.0  1991-04-01     81    110  13.0
1   9   90.0  1991-04-01     81    110   9.0
2  10   95.0  1991-04-01     81    110  14.0
3  11  108.0  1991-04-01     81    110  27.0
4  12  110.0  1991-04-01     81    110  29.0
Answered By: Ajeet Verma