I think I'm building my dataframe in a bad way
Question:
I'm trying to do some gymnastics on a badly formatted CSV so that I can do some analysis. Here is what the ugly CSV looks like when I import it:
df = pd.read_csv(File)
print(df)
Flow_DK Unnamed: 1 ... Unnamed: 14 Unnamed: 15
0 Data was last updated 31-12-2022 NaN ... NaN NaN
1 NaN Hours ... DK1 > NL NL > DK1
2 2022-01-01 00:00:00 00 - 01 ... 700,0 0,0
3 2022-01-01 00:00:00 01 - 02 ... 700,0 0,0
4 2022-01-01 00:00:00 02 - 03 ... 700,0 0,0
[5 rows x 16 columns]
The important parts are the date, which I want in an hourly datetime format rather than split across two columns, and the quantities in each column with the three-letter codes, e.g. DK1 > NL.
In the next few lines of code I am trying to format the dataframe so that I get something that looks like this:
DateTimeUtc DK2 > SE4 SE4 > DK2 ... DE > DK2 DK1 > NL NL > DK1
2 2022-01-01 00:00:00 0.0 261.5 ... 150.0 700.0 0.0
3 2022-01-01 01:00:00 0.0 493.2 ... 0.0 700.0 0.0
4 2022-01-01 02:00:00 0.0 951.0 ... 0.0 700.0 0.0
5 2022-01-01 03:00:00 0.0 940.3 ... 0.0 700.0 0.0
6 2022-01-01 04:00:00 0.0 464.8 ... 0.0 0.0 0.0
[5 rows x 15 columns]
This is how I do it. First, I create a date array that will effectively replace the first and second columns, as a datetime-formatted column of the length I require.
dates = pd.DataFrame(pd.date_range("2022-01-01", "2023-1-1", freq='H'))
dates.drop(dates.index[-1], inplace=True)
df_NP['DateTimeUtc'] = dates
df.drop(df.index[-1], inplace=True)
df_NP = df_NP.set_index('DateTimeUtc')
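As an aside, a sketch of a slightly tidier way to build the same range (assuming pandas >= 1.4, where the `inclusive=` parameter exists): the trailing midnight row that the drop removes can be excluded when the range is built.

```python
import pandas as pd

# Hourly range covering all of 2022; inclusive="left" excludes the
# 2023-01-01 00:00 endpoint, so no manual drop of the last row is needed
dates = pd.date_range("2022-01-01", "2023-01-01", freq="h", inclusive="left")
print(len(dates))  # 8760 hours in a non-leap year
```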
Then I get rid of the offending columns I have replaced:
df.drop(['Unnamed: 1'], axis=1, inplace=True)
df.drop([0], axis=0, inplace=True)
I rename the first column and row entry so I can copy the full list of row names into the column headers:
df.iloc[0, 0] = 'DateTimeUtc'
cols = df.iloc[0,:].to_list()
df.set_axis([cols],axis = 1, inplace = True)
df.drop([1], inplace=True)
df.iloc[0:, 0] = dates.iloc[0:, 0]
Then I have to convert the entries from strings to floats. Here I would rather do it in place on the existing DataFrame (df), in which I'm organising everything, but I'm not sure how to do that. Any suggestions? Instead, I'm making a new DataFrame (df_2), to which I then add the datetime column.
df.replace(',', '.', regex=True,inplace = True)
df_2 = df.iloc[0:, 1:].astype(float)
df_2['DateTimeUtc'] = df['DateTimeUtc']
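For what it's worth, the comma-to-dot conversion in isolation behaves like this on a toy frame (hypothetical sample values, two of the real column names):

```python
import pandas as pd

# Danish-style decimal commas come in as strings
raw = pd.DataFrame({"DK1 > NL": ["700,0", "0,0"], "NL > DK1": ["0,0", "261,5"]})

# Swap "," for "." everywhere, then cast the whole frame to float
as_float = raw.replace(",", ".", regex=True).astype(float)
print(as_float["NL > DK1"].tolist())  # [0.0, 261.5]
```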
Then I take a list of the column names so I can rearrange them into a better order. I don't like this step, and I think this is where things start to get messy.
df_2 = df_2[cols]
Now when I set the datetime column as the index, it looks different in the new DataFrame. The index shows Timestamp('2022-01-01 00:00:00')
for each entry, and I'm not sure why.
df_2.set_index('DateTimeUtc',inplace = True)
Now I want to sum the relevant flows columns and take the difference between inputs and outputs.
# Flows
df_2['flow_in_DK1'] = df_2.iloc[:,[2,4,7,9,13]].sum(axis = 1)
df_2['flow_out_DK1'] = df_2.iloc[:,[3,5,6,8,12]].sum(axis = 1)
df_2['flow_in_DK2'] = df_2.iloc[:,[1,6,11]].sum(axis = 1)
df_2['flow_out_DK2'] = df_2.iloc[:,[0,7,10]].sum(axis = 1)
df_2['DK1_NP'] = df_2.iloc[:,15] - df_2.iloc[:,14]
df_2['DK2_NP'] = df_2.iloc[:,17] - df_2.iloc[:,16]
Now when I try to call a column by name, df_2['DK2_NP'],
I get the error:
TypeError: unhashable type: 'numpy.ndarray'
But I don't know why. I can call it by df_2.iloc[:,15],
but why not by column name?
Lastly, when I try to call df_2.head()
I get the same error. Can anyone explain what's going on here?
TypeError: unhashable type: 'numpy.ndarray'
First 10 lines of csv:
Flow
Data was last updated 31-12-2022
Hours DK2 > SE4 SE4 > DK2 SE3 > DK1 DK1 > SE3 NO2 > DK1 DK1 > NO2 DK1 > DK2 DK2 > DK1 DK1 > DE DE > DK1 DK2 > DE DE > DK2 DK1 > NL NL > DK1
01/01/2022 00 - 01 0,0 261,5 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 766,8 0,0 150,0 700,0 0,0
01/01/2022 01 - 02 0,0 493,2 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 709,8 154,2 0,0 700,0 0,0
01/01/2022 02 - 03 0,0 951,0 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 728,6 672,0 0,0 700,0 0,0
01/01/2022 03 - 04 0,0 940,3 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 775,7 682,0 0,0 700,0 0,0
01/01/2022 04 - 05 0,0 464,8 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 398,1 151,9 0,0 0,0 0,0
01/01/2022 05 - 06 0,0 529,7 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 679,3 160,5 0,0 0,0 0,0
01/01/2022 06 - 07 0,0 495,1 250,0 0,0 0,0 1143,0 0,0 600,0 0,0 769,2 0,0 0,0 0,0 0,0
Answers:
Try to read your file with the following parameters:
# Read CSV correctly
df = pd.read_csv('data.csv', sep='\t', index_col=0, skiprows=2,
                 decimal=',', skipinitialspace=True)
# Combine date and time
combined = df.index + ' ' + df.pop('Hours').str.split(r'\s+-\s+').str[0]
dti = pd.DatetimeIndex(pd.to_datetime(combined, format='%d/%m/%Y %H'), name='DateTimeUtc')
# Fix your dataframe
df = df.set_axis(dti).reset_index()
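The date-plus-hour combination in the snippet above can be checked on a couple of inline sample values (a standalone sketch, no file needed):

```python
import pandas as pd

# Two sample rows: the raw date index and the "Hours" window column
dates = pd.Series(["01/01/2022", "01/01/2022"])
hours = pd.Series(["00 - 01", "01 - 02"])

# Keep only the start of each "HH - HH" window, then parse
start = hours.str.split(r"\s+-\s+").str[0]
combined = pd.to_datetime(dates + " " + start, format="%d/%m/%Y %H")
print(combined[1])  # 2022-01-01 01:00:00
```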
Another solution for the same result inspired by @wjandrea:
# Read CSV correctly
df = pd.read_csv('data.csv', sep='\t', index_col=0, skiprows=2,
                 decimal=',', skipinitialspace=True,
                 parse_dates=[0], dayfirst=True)
# Combine date and time
df.index += pd.TimedeltaIndex(df.pop('Hours').str.split(r'\s+-\s+').str[0].astype(int), unit='H')
# Fix your dataframe
df = df.rename_axis('DateTimeUtc').reset_index()
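The timedelta variant does the same thing by adding the start hour as an offset to the parsed date; on sample values (a sketch, independent of the file):

```python
import pandas as pd

# Start hours extracted from "HH - HH" windows, as integers
hours = pd.Series(["00 - 01", "05 - 06"])
start = hours.str.split(r"\s+-\s+").str[0].astype(int)

# Add each hour as an offset to the parsed base date
base = pd.to_datetime(pd.Series(["01/01/2022", "01/01/2022"]), format="%d/%m/%Y")
stamped = base + pd.to_timedelta(start, unit="h")
print(stamped.tolist())
```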
Output
>>> df
DateTimeUtc DK2 > SE4 SE4 > DK2 SE3 > DK1 DK1 > SE3 NO2 > DK1 DK1 > NO2 DK1 > DK2 DK2 > DK1 DK1 > DE DE > DK1 DK2 > DE DE > DK2 DK1 > NL NL > DK1
0 2022-01-01 00:00:00 0.0 261.5 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 766.8 0.0 150.0 700.0 0.0
1 2022-01-01 01:00:00 0.0 493.2 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 709.8 154.2 0.0 700.0 0.0
2 2022-01-01 02:00:00 0.0 951.0 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 728.6 672.0 0.0 700.0 0.0
3 2022-01-01 03:00:00 0.0 940.3 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 775.7 682.0 0.0 700.0 0.0
4 2022-01-01 04:00:00 0.0 464.8 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 398.1 151.9 0.0 0.0 0.0
5 2022-01-01 05:00:00 0.0 529.7 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 679.3 160.5 0.0 0.0 0.0
6 2022-01-01 06:00:00 0.0 495.1 250.0 0.0 0.0 1143.0 0.0 600.0 0.0 769.2 0.0 0.0 0.0 0.0
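Once the headers are plain single-level strings like this, the flow sums from the question can be done by label rather than by iloc position (toy numbers, hypothetical subset of the columns):

```python
import pandas as pd

# Hypothetical subset of the flow columns with made-up values
df = pd.DataFrame({
    "SE3 > DK1": [250.0, 250.0],
    "DK1 > SE3": [0.0, 10.0],
    "NO2 > DK1": [0.0, 5.0],
    "DK1 > NO2": [1143.0, 1143.0],
})

# Sum imports and exports by name, then take the net position
flow_in = df[["SE3 > DK1", "NO2 > DK1"]].sum(axis=1)
flow_out = df[["DK1 > SE3", "DK1 > NO2"]].sum(axis=1)
df["DK1_NP"] = flow_in - flow_out
print(df["DK1_NP"].tolist())  # [-893.0, -898.0]
```

Selecting by name this way also sidesteps the index-bookkeeping that made the iloc version fragile.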