Changing weather data frequency from 3 hours to 1 hour
Question:
I have weather data with the following columns; the first 3 rows look like this:
date | hour | city | condition | snow | rain |
---|---|---|---|---|---|
2023-01-30 | 3 | berlin | snow | 1 | 0 |
2023-01-30 | 6 | berlin | rain | 0 | 1 |
2023-01-30 | 9 | berlin | clear | 0 | 0 |
I want to write code that creates rows for the missing hours and fills in their values from the row closest to that hour for the same city and date. The resulting dataframe should look like this:
date | hour | city | condition | snow | rain |
---|---|---|---|---|---|
2023-01-30 | 3 | berlin | snow | 1 | 0 |
2023-01-30 | 4 | berlin | snow | 1 | 0 |
2023-01-30 | 5 | berlin | snow | 1 | 0 |
2023-01-30 | 6 | berlin | rain | 0 | 1 |
2023-01-30 | 7 | berlin | rain | 0 | 1 |
2023-01-30 | 8 | berlin | rain | 0 | 1 |
2023-01-30 | 9 | berlin | clear | 0 | 0 |
2023-01-30 | 10 | berlin | clear | 0 | 0 |
2023-01-30 | 11 | berlin | clear | 0 | 0 |
Note: I have many cities and many rows.
I tried this, but it didn't give the right result, and it is not efficient for a large number of rows (cities and hours):
df_expanded = (
    df.set_index(['date', 'city', 'condition'])
    .hour.unstack().reset_index().melt(id_vars=['date', 'city', 'condition'], value_name='hour')
    .dropna()
    .drop(columns=['variable'])
)
df_expanded = (
    df_expanded.sort_values(by=['date', 'city', 'condition', 'hour'])
    .ffill()
)
result = (
    df_expanded.merge(df, on=['date', 'city', 'condition', 'hour'], how='left')
    .dropna()
    .drop_duplicates()
)
Open to easier and simpler solutions
Answers:
It is easiest to ffill the missing data as shown below, but I will also try to think of a solution for the closest time.
import pandas as pd

# some sample data
d = {'date': ['2023-01-30', '2023-01-30', '2023-01-30', '2023-01-30', '2023-01-30', '2023-01-30'],
     'hour': [3, 6, 9, 3, 6, 9],
     'city': ['berlin', 'berlin', 'berlin', 'chicago', 'chicago', 'chicago'],
     'condition': ['snow', 'rain', 'clear', 'snow', 'snow', 'clear'],
     'snow': [1, 0, 0, 1, 1, 0],
     'rain': [0, 1, 0, 0, 0, 0]}
df = pd.DataFrame(d)
# convert to datetime and the hour to a timedelta and set as the index
df = df.set_index(pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h')).drop(columns=['date', 'hour'])
# groupby the city and resample to the hour and ffill the missing data
df.groupby('city').resample('h').ffill().reset_index(level=0, drop=True)
city condition snow rain
2023-01-30 03:00:00 berlin snow 1 0
2023-01-30 04:00:00 berlin snow 1 0
2023-01-30 05:00:00 berlin snow 1 0
2023-01-30 06:00:00 berlin rain 0 1
2023-01-30 07:00:00 berlin rain 0 1
2023-01-30 08:00:00 berlin rain 0 1
2023-01-30 09:00:00 berlin clear 0 0
2023-01-30 03:00:00 chicago snow 1 0
2023-01-30 04:00:00 chicago snow 1 0
2023-01-30 05:00:00 chicago snow 1 0
2023-01-30 06:00:00 chicago snow 1 0
2023-01-30 07:00:00 chicago snow 1 0
2023-01-30 08:00:00 chicago snow 1 0
2023-01-30 09:00:00 chicago clear 0 0
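Note that resample stops at the last timestamp in each group, whereas the expected output in the question continues past the last observation at hour 9. A minimal sketch of one way to pad the tail, assuming each 3-hourly observation should also cover the two hours after it. The names obs, to_hourly, extra_hours and padded are made up for this example and are not part of the code above:

```python
import pandas as pd

# Same sample data as above, rebuilt so this sketch runs on its own.
d = {'date': ['2023-01-30'] * 6,
     'hour': [3, 6, 9, 3, 6, 9],
     'city': ['berlin'] * 3 + ['chicago'] * 3,
     'condition': ['snow', 'rain', 'clear', 'snow', 'snow', 'clear'],
     'snow': [1, 0, 0, 1, 1, 0],
     'rain': [0, 1, 0, 0, 0, 0]}
obs = pd.DataFrame(d)
obs = obs.set_index(
    pd.to_datetime(obs['date']) + pd.to_timedelta(obs['hour'], unit='h')
).drop(columns=['date', 'hour'])


def to_hourly(group, extra_hours=2):
    """Hypothetical helper: hourly rows from the first observation
    up to `extra_hours` past the last one (assumed coverage)."""
    hours = pd.date_range(group.index.min(),
                          group.index.max() + pd.Timedelta(hours=extra_hours),
                          freq='h')
    # method='ffill' copies the most recent earlier observation into each new row
    return group.reindex(hours, method='ffill')


# Process each city separately, then stack the results back together.
padded = pd.concat([to_hourly(g) for _, g in obs.groupby('city')])
print(padded)
```

With the sample data this gives hours 3 through 11 for each city. If the last observation should not be extended, pass extra_hours=0.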
If you want the original date and hour columns, add the following:
new_df = df.groupby('city').resample('h').ffill().reset_index(level=0, drop=True)
new_df = new_df.reset_index().rename(columns={'index': 'date'})
new_df['hour'] = new_df['date'].dt.hour
new_df['date'] = new_df['date'].dt.date
date city condition snow rain hour
0 2023-01-30 berlin snow 1 0 3
1 2023-01-30 berlin snow 1 0 4
2 2023-01-30 berlin snow 1 0 5
3 2023-01-30 berlin rain 0 1 6
4 2023-01-30 berlin rain 0 1 7
5 2023-01-30 berlin rain 0 1 8
6 2023-01-30 berlin clear 0 0 9
7 2023-01-30 chicago snow 1 0 3
8 2023-01-30 chicago snow 1 0 4
9 2023-01-30 chicago snow 1 0 5
10 2023-01-30 chicago snow 1 0 6
11 2023-01-30 chicago snow 1 0 7
12 2023-01-30 chicago snow 1 0 8
13 2023-01-30 chicago clear 0 0 9
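For the closest-time variant mentioned at the top of this answer, one possible approach is pd.merge_asof with direction='nearest'. This is only a sketch under assumptions: it builds an hourly grid per city between that city's first and last observation and matches each grid row to the nearest observation of the same city. The names obs, grid, nearest and the ts column are hypothetical and chosen for this example.

```python
import pandas as pd

# Same sample data as above, rebuilt so this sketch runs on its own.
obs = pd.DataFrame({
    'date': ['2023-01-30'] * 6,
    'hour': [3, 6, 9, 3, 6, 9],
    'city': ['berlin'] * 3 + ['chicago'] * 3,
    'condition': ['snow', 'rain', 'clear', 'snow', 'snow', 'clear'],
    'snow': [1, 0, 0, 1, 1, 0],
    'rain': [0, 1, 0, 0, 0, 0],
})
# Combine date and hour into one timestamp column (hypothetical name 'ts').
obs['ts'] = pd.to_datetime(obs['date']) + pd.to_timedelta(obs['hour'], unit='h')

# One row per city and hour between that city's first and last observation.
grid = pd.concat(
    [pd.DataFrame({'city': city,
                   'ts': pd.date_range(g['ts'].min(), g['ts'].max(), freq='h')})
     for city, g in obs.groupby('city')],
    ignore_index=True,
)
# merge_asof needs identical key dtypes on both sides.
grid['ts'] = grid['ts'].astype(obs['ts'].dtype)

# Both frames must be sorted by the 'on' key; 'by' restricts matches to the same city.
nearest = pd.merge_asof(
    grid.sort_values('ts'),
    obs.drop(columns=['date', 'hour']).sort_values('ts'),
    on='ts', by='city', direction='nearest',
)
nearest = nearest.sort_values(['city', 'ts']).reset_index(drop=True)

# Restore the original date and hour columns.
nearest['date'] = nearest['ts'].dt.date
nearest['hour'] = nearest['ts'].dt.hour
print(nearest[['date', 'hour', 'city', 'condition', 'snow', 'rain']])
```

The difference from ffill shows up one hour before each observation: for berlin, hour 5 takes the values from hour 6 (rain) and hour 8 takes the values from hour 9 (clear), while hours 4 and 7 still take the earlier observation.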