I think I'm building my dataframe in a bad way
Question:
I'm trying to do some gymnastics on a badly formatted CSV so that I can do some analysis. Here is what the ugly CSV looks like when I import it:
df = pd.read_csv(File)
print(df)
Flow_DK Unnamed: 1 ... Unnamed: 14 Unnamed: 15
0 Data was last updated 31-12-2022 NaN ... NaN NaN
1 NaN Hours ... DK1 > NL NL > DK1
2 2022-01-01 00:00:00 00 - 01 ... 700,0 0,0
3 2022-01-01 00:00:00 01 - 02 ... 700,0 0,0
4 2022-01-01 00:00:00 02 - 03 ... 700,0 0,0
[5 rows x 16 columns]
The important parts are the date, which I want in an hourly datetime format rather than split across two columns, and the quantities in each column with the three-letter codes, e.g. DK1 > NL.
In the next few lines of code I am trying to format the dataframe so that I get something that looks like this:
DateTimeUtc DK2 > SE4 SE4 > DK2 ... DE > DK2 DK1 > NL NL > DK1
2 2022-01-01 00:00:00 0.0 261.5 ... 150.0 700.0 0.0
3 2022-01-01 01:00:00 0.0 493.2 ... 0.0 700.0 0.0
4 2022-01-01 02:00:00 0.0 951.0 ... 0.0 700.0 0.0
5 2022-01-01 03:00:00 0.0 940.3 ... 0.0 700.0 0.0
6 2022-01-01 04:00:00 0.0 464.8 ... 0.0 0.0 0.0
[5 rows x 15 columns]
This is how I do it. First, I create a date array that will effectively replace the first and second columns, as a datetime-formatted column of the length I require.
dates = pd.DataFrame(pd.date_range("2022-01-01", "2023-1-1", freq='H'))
dates.drop(dates.index[-1], inplace=True)
df_NP['DateTimeUtc'] = dates
df.drop(df.index[-1], inplace=True)
df_NP = df_NP.set_index('DateTimeUtc')
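As an aside, a sketch of a slightly tidier way to build the same range (assuming pandas >= 1.4, where the `inclusive=` parameter exists): the trailing midnight row that the drop removes can be excluded when the range is built.

```python
import pandas as pd

# Hourly range covering all of 2022; inclusive="left" excludes the
# 2023-01-01 00:00 endpoint, so no manual drop of the last row is needed
dates = pd.date_range("2022-01-01", "2023-01-01", freq="h", inclusive="left")
print(len(dates))  # 8760 hours in a non-leap year
```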
Then I get rid of the offending columns I have replaced:
df.drop(['Unnamed: 1'], axis=1, inplace=True)
df.drop([0], axis=0, inplace=True)
I rename the first column and row entry so I can copy the full list of row names into the column headers:
df.iloc[0, 0] = 'DateTimeUtc'
cols = df.iloc[0,:].to_list()
df.set_axis([cols],axis = 1, inplace = True)
df.drop([1], inplace=True)
df.iloc[0:, 0] = dates.iloc[0:, 0]
Then I have to convert the entries from strings to floats. Here I would rather do it in place on the existing DataFrame (df), in which I'm organising everything, but I'm not sure how to do that. Any suggestions? Instead, I'm making a new DataFrame (df_2), to which I then add the datetime column.
df.replace(',', '.', regex=True,inplace = True)
df_2 = df.iloc[0:, 1:].astype(float)
df_2['DateTimeUtc'] = df['DateTimeUtc']
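For what it's worth, the comma-to-dot conversion in isolation behaves like this on a toy frame (hypothetical sample values, two of the real column names):

```python
import pandas as pd

# Danish-style decimal commas come in as strings
raw = pd.DataFrame({"DK1 > NL": ["700,0", "0,0"], "NL > DK1": ["0,0", "261,5"]})

# Swap "," for "." everywhere, then cast the whole frame to float
as_float = raw.replace(",", ".", regex=True).astype(float)
print(as_float["NL > DK1"].tolist())  # [0.0, 261.5]
```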
Then I take a list of the column names so I can rearrange them into a better order. I don't like this step, and I think this is where things start to get messy.
df_2 = df_2[cols]
Now when I set the datetime column as the index, it looks different in the new DataFrame. The index shows Timestamp('2022-01-01 00:00:00')
for each entry, and I'm not sure why.
df_2.set_index('DateTimeUtc',inplace = True)
Now I want to sum the relevant flows columns and take the difference between inputs and outputs.
# Flows
df_2['flow_in_DK1'] = df_2.iloc[:,[2,4,7,9,13]].sum(axis = 1)
df_2['flow_out_DK1'] = df_2.iloc[:,[3,5,6,8,12]].sum(axis = 1)
df_2['flow_in_DK2'] = df_2.iloc[:,[1,6,11]].sum(axis = 1)
df_2['flow_out_DK2'] = df_2.iloc[:,[0,7,10]].sum(axis = 1)
df_2['DK1_NP'] = df_2.iloc[:,15] - df_2.iloc[:,14]
df_2['DK2_NP'] = df_2.iloc[:,17] - df_2.iloc[:,16]
Now when I try to call a column by name, df_2['DK2_NP'],
I get the error:
TypeError: unhashable type: 'numpy.ndarray'
But I don't know why. I can call it by df_2.iloc[:,15],
but why not by column name?
Lastly, when I try to call df_2.head()
I get the same error. Can anyone explain what's going on here?
TypeError: unhashable type: 'numpy.ndarray'
First 10 lines of csv:
Flow
Data was last updated 31-12-2022
Hours DK2 > SE4 SE4 > DK2 SE3 > DK1 DK1 > SE3 NO2 > DK1 DK1 > NO2 DK1 > DK2 DK2 > DK1 DK1 > DE DE > DK1 DK2 > DE DE > DK2 DK1 > NL NL > DK1
01/01/2022 00 - 01 0,0 261,5 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 766,8 0,0 150,0 700,0 0,0
01/01/2022 01 - 02 0,0 493,2 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 709,8 154,2 0,0 700,0 0,0
01/01/2022 02 - 03 0,0 951,0 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 728,6 672,0 0,0 700,0 0,0
01/01/2022 03 - 04 0,0 940,3 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 775,7 682,0 0,0 700,0 0,0
01/01/2022 04 - 05 0,0 464,8 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 398,1 151,9 0,0 0,0 0,0
01/01/2022 05 - 06 0,0 529,7 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 679,3 160,5 0,0 0,0 0,0
01/01/2022 06 - 07 0,0 495,1 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 769,2 0,0 0,0 0,0 0,0
Answers:
Try to read your file with the following parameters:
# Read CSV correctly
df = pd.read_csv('data.csv', sep='\t', index_col=0, skiprows=2,
                 decimal=',', skipinitialspace=True)
# Combine date and time
combined = df.index + ' ' + df.pop('Hours').str.split(r'\s+-\s+').str[0]
dti = pd.DatetimeIndex(pd.to_datetime(combined, format='%d/%m/%Y %H'), name='DateTimeUtc')
# Fix your dataframe
df = df.set_axis(dti).reset_index()
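The date-plus-hour combination in the snippet above can be checked on a couple of inline sample values (a standalone sketch, no file needed):

```python
import pandas as pd

# Two sample rows: the raw date index and the "Hours" window column
dates = pd.Series(["01/01/2022", "01/01/2022"])
hours = pd.Series(["00 - 01", "01 - 02"])

# Keep only the start of each "HH - HH" window, then parse
start = hours.str.split(r"\s+-\s+").str[0]
combined = pd.to_datetime(dates + " " + start, format="%d/%m/%Y %H")
print(combined[1])  # 2022-01-01 01:00:00
```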
Another solution for the same result inspired by @wjandrea:
# Read CSV correctly
df = pd.read_csv('data.csv', sep='\t', index_col=0, skiprows=2,
                 decimal=',', skipinitialspace=True,
                 parse_dates=[0], dayfirst=True)
# Combine date and time
df.index += pd.TimedeltaIndex(df.pop('Hours').str.split(r'\s+-\s+').str[0].astype(int), unit='H')
# Fix your dataframe
df = df.rename_axis('DateTimeUtc').reset_index()
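The timedelta variant does the same thing by adding the start hour as an offset to the parsed date; on sample values (a sketch, independent of the file):

```python
import pandas as pd

# Start hours extracted from "HH - HH" windows, as integers
hours = pd.Series(["00 - 01", "05 - 06"])
start = hours.str.split(r"\s+-\s+").str[0].astype(int)

# Add each hour as an offset to the parsed base date
base = pd.to_datetime(pd.Series(["01/01/2022", "01/01/2022"]), format="%d/%m/%Y")
stamped = base + pd.to_timedelta(start, unit="h")
print(stamped.tolist())
```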
Output
>>> df
DateTimeUtc DK2 > SE4 SE4 > DK2 SE3 > DK1 DK1 > SE3 NO2 > DK1 DK1 > NO2 DK1 > DK2 DK2 > DK1 DK1 > DE DE > DK1 DK2 > DE DE > DK2 DK1 > NL NL > DK1
0 2022-01-01 00:00:00 0.0 261.5 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 766.8 0.0 150.0 700.0 0.0
1 2022-01-01 01:00:00 0.0 493.2 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 709.8 154.2 0.0 700.0 0.0
2 2022-01-01 02:00:00 0.0 951.0 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 728.6 672.0 0.0 700.0 0.0
3 2022-01-01 03:00:00 0.0 940.3 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 775.7 682.0 0.0 700.0 0.0
4 2022-01-01 04:00:00 0.0 464.8 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 398.1 151.9 0.0 0.0 0.0
5 2022-01-01 05:00:00 0.0 529.7 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 679.3 160.5 0.0 0.0 0.0
6 2022-01-01 06:00:00 0.0 495.1 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 769.2 0.0 0.0 0.0 0.0
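Once the headers are plain single-level strings like this, the flow sums from the question can be done by label rather than by iloc position (toy numbers, hypothetical subset of the columns):

```python
import pandas as pd

# Hypothetical subset of the flow columns with made-up values
df = pd.DataFrame({
    "SE3 > DK1": [250.0, 250.0],
    "DK1 > SE3": [0.0, 10.0],
    "NO2 > DK1": [0.0, 5.0],
    "DK1 > NO2": [1143.0, 1143.0],
})

# Sum imports and exports by name, then take the net position
flow_in = df[["SE3 > DK1", "NO2 > DK1"]].sum(axis=1)
flow_out = df[["DK1 > SE3", "DK1 > NO2"]].sum(axis=1)
df["DK1_NP"] = flow_in - flow_out
print(df["DK1_NP"].tolist())  # [-893.0, -898.0]
```

Selecting by name this way also sidesteps the index-bookkeeping that made the iloc version fragile.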