How can I split my dataframe by year or month

Question:

I have a dataframe that contains a time series with hourly data form 2015 to 2020. I want to create a new dataframe that has a column with the values of the time series for each year or for each month of each year to perform a separate analysis. As I have 1 leap year, I want them to share index but have a NaN value at that position (29 Feb) on the years that are not leap. I tried using merge creating two new columns called month and day_of_month but index gets crazy and ends up having millions of entries instead of the ~40.000 it should have, and in the end it ends up with a more than 20GB file on RAM and breaks:

years = pd.DataFrame(index=pd.date_range('2016-01-01', '2017-01-01', freq='1H'))
years['month'] = years.index.month
years['day_of_month'] = years.index.day
gp = data_md[['value', 'month', 'day_of_month']].groupby(pd.Grouper(freq='1Y'))
for name, group in gp:
    years = years.merge(group, right_on=['month', 'day_of_month'], left_on=['month', 'day_of_month'])


RESULT:

    month   day_of_month    value
0   1   1   0
1   1   1   6
2   1   1   2
3   1   1   0
4   1   1   1
...     ...     ...     ...
210259  12  31  6
210260  12  31  2
210261  12  31  4
210262  12  31  5
210263  12  31  1

How can I get the frame constructed having one value column for each single year or month?

Here I leave the original frame from which I want to create the new one, only needed column by now is value

    value    month  day_of_month    week    day_name    year    hour    season  dailyp  day_of_week     ...     hourly_no_noise     daily_trend     daily_seasonal  daily_residuals     daily_no_noise  daily_trend_h   daily_seasonal_h    daily_residuals_h   daily_no_noise_h    Total
date                                                                                    
2015-01-01 00:00:00     0   1   1   1   Thursday    2015    0   Invierno    165.0   3   ...     NaN     NaN     -9.053524   NaN     NaN     NaN     -3.456929   NaN     NaN     6436996.0
2015-01-01 01:00:00     6   1   1   1   Thursday    2015    1   Invierno    NaN     3   ...     NaN     NaN     -9.053524   NaN     NaN     NaN     -4.879983   NaN     NaN     NaN
2015-01-01 02:00:00     2   1   1   1   Thursday    2015    2   Invierno    NaN     3   ...     NaN     NaN     -9.053524   NaN     NaN     NaN     -5.895367   NaN     NaN     NaN
2015-01-01 03:00:00     0   1   1   1   Thursday    2015    3   Invierno    NaN     3   ...     NaN     NaN     -9.053524   NaN     NaN     NaN     -6.468616   NaN     NaN     NaN
2015-01-01 04:00:00     1   1   1   1   Thursday    2015    4   Invierno    NaN     3   ...     NaN     NaN     -9.053524   NaN     NaN     NaN     -6.441830   NaN     NaN     NaN
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
2019-12-31 19:00:00     6   12  31  1   Tuesday     2019    19  Invierno    NaN     1   ...     11.529465   230.571429  -4.997480   -11.299166  237.299166  9.613095    2.805720    1.176491    17.823509   NaN
2019-12-31 20:00:00     3   12  31  1   Tuesday     2019    20  Invierno    NaN     1   ...     11.314857   230.571429  -4.997480   -11.299166  237.299166  9.613095    2.928751    1.176491    17.823509   NaN
2019-12-31 21:00:00     3   12  31  1   Tuesday     2019    21  Invierno    NaN     1   ...     10.141139   230.571429  -4.997480   -11.299166  237.299166  9.613095    1.774848    1.176491    17.823509   NaN
2019-12-31 22:00:00     3   12  31  1   Tuesday     2019    22  Invierno    NaN     1   ...     8.823152    230.571429  -4.997480   -11.299166  237.299166  9.613095    0.663344    1.176491    17.823509   NaN
2019-12-31 23:00:00     6   12  31  1   Tuesday     2019    23  Invierno    NaN     1   ...     6.884636    230.571429  -4.997480   -11.299166  237.299166  9.613095    -1.624980   1.176491    17.823509   NaN

I would like to end up with a dataframe like this:

                         2015      2016      2017      2018      2019
2016-01-01 00:00:00  0.074053  0.218161  0.606810  0.687365  0.352672
2016-01-01 01:00:00  0.465167  0.210297  0.722825  0.683341  0.885175
2016-01-01 02:00:00  0.175964  0.610560  0.722479  0.016842  0.205916
2016-01-01 03:00:00  0.945955  0.807490  0.627525  0.187677  0.535116
2016-01-01 04:00:00  0.757608  0.797835  0.639215  0.455989  0.042285
...                       ...       ...       ...       ...       ...
2016-12-30 20:00:00  0.046138  0.139100  0.397547  0.738687  0.335306
2016-12-30 21:00:00  0.672800  0.802090  0.617625  0.787601  0.007535
2016-12-30 22:00:00  0.698141  0.776686  0.423712  0.667808  0.298338
2016-12-30 23:00:00  0.198089  0.642073  0.586527  0.106567  0.514569
2016-12-31 00:00:00  0.367572  0.390791  0.105193  0.592167  0.007365

where 29 Feb is NaN on non-leap years:

df['2016-02']
                         2015      2016      2017      2018      2019
2016-02-01 00:00:00  0.656703  0.348784  0.383639  0.208786  0.183642
2016-02-01 01:00:00  0.488729  0.909498  0.873642  0.122028  0.547563
2016-02-01 02:00:00  0.210427  0.912393  0.505873  0.085149  0.358841
2016-02-01 03:00:00  0.281107  0.534750  0.622473  0.643611  0.258437
2016-02-01 04:00:00  0.187434  0.327459  0.701008  0.887041  0.385816
...                       ...       ...       ...       ...       ...
2016-02-29 19:00:00       NaN  0.742402       NaN       NaN       NaN
2016-02-29 20:00:00       NaN  0.013419       NaN       NaN       NaN
2016-02-29 21:00:00       NaN  0.517194       NaN       NaN       NaN
2016-02-29 22:00:00       NaN  0.003136       NaN       NaN       NaN
2016-02-29 23:00:00       NaN  0.128406       NaN       NaN       NaN
Asked By: Apollo

||

Answers:

So let’s assume we have the following dataframe:

import pandas as pd
import numpy as np
df = pd.DataFrame(pd.date_range('2015-01-01', '2020-01-01', freq='1H'),
                  columns = ['Date and Time'])
df['str'] = df['Date and Time'].dt.strftime('%Y-%m-%d')
df[['Year', 'Month','Day']] = df['str'].apply(lambda x: pd.Series(str(x).split("-")))
df['Values'] = np.random.rand(len(df))
print(df)

Output:

            Date and Time         str  Year Month Day    Values
0     2015-01-01 00:00:00  2015-01-01  2015    01  01  0.153948
1     2015-01-01 01:00:00  2015-01-01  2015    01  01  0.663132
2     2015-01-01 02:00:00  2015-01-01  2015    01  01  0.141534
3     2015-01-01 03:00:00  2015-01-01  2015    01  01  0.263551
4     2015-01-01 04:00:00  2015-01-01  2015    01  01  0.094391
...                   ...         ...   ...   ...  ..       ...
43820 2019-12-31 20:00:00  2019-12-31  2019    12  31  0.055802
43821 2019-12-31 21:00:00  2019-12-31  2019    12  31  0.952963
43822 2019-12-31 22:00:00  2019-12-31  2019    12  31  0.106768
43823 2019-12-31 23:00:00  2019-12-31  2019    12  31  0.834583
43824 2020-01-01 00:00:00  2020-01-01  2020    01  01  0.325849

[43825 rows x 6 columns]

Now we separate the dataframe by year and save it in a disk:

d = {}
for i in range(2015,2020):
  d[i] = pd.DataFrame(df[df['Year'] == str(i)])
  d[i].sort_values(by = 'Date and Time',inplace=True,ignore_index=True)
for i in range(2015,2020):
  print('Feb', i,':',(d[i][d[i]['Month'] == '02']).shape)
  print((d[i][d[i]['Month'] == '02']).tail(3))
  print('-----------------------------------------------------------------')

Output:

Feb 2015 : (672, 6) 
           Date and Time         str  Year Month Day    Values
1413 2015-02-28 21:00:00  2015-02-28  2015    02  28  0.517525
1414 2015-02-28 22:00:00  2015-02-28  2015    02  28  0.404741
1415 2015-02-28 23:00:00  2015-02-28  2015    02  28  0.299090
-----------------------------------------------------------------
Feb 2016 : (696, 6) 
           Date and Time         str  Year Month Day    Values
1437 2016-02-29 21:00:00  2016-02-29  2016    02  29  0.854047
1438 2016-02-29 22:00:00  2016-02-29  2016    02  29  0.035787
1439 2016-02-29 23:00:00  2016-02-29  2016    02  29  0.955364
-----------------------------------------------------------------
Feb 2017 : (672, 6) 
           Date and Time         str  Year Month Day    Values
1413 2017-02-28 21:00:00  2017-02-28  2017    02  28  0.936354
1414 2017-02-28 22:00:00  2017-02-28  2017    02  28  0.954680
1415 2017-02-28 23:00:00  2017-02-28  2017    02  28  0.625131
-----------------------------------------------------------------
Feb 2018 : (672, 6) 
           Date and Time         str  Year Month Day    Values
1413 2018-02-28 21:00:00  2018-02-28  2018    02  28  0.965274
1414 2018-02-28 22:00:00  2018-02-28  2018    02  28  0.848050
1415 2018-02-28 23:00:00  2018-02-28  2018    02  28  0.238984
-----------------------------------------------------------------
Feb 2019 : (672, 6) 
           Date and Time         str  Year Month Day    Values
1413 2019-02-28 21:00:00  2019-02-28  2019    02  28  0.476142
1414 2019-02-28 22:00:00  2019-02-28  2019    02  28  0.498278
1415 2019-02-28 23:00:00  2019-02-28  2019    02  28  0.127525
-----------------------------------------------------------------

To fix the leap year problem:

There is definitely a better way, but the only thing I can think of is to create the value rows, add them, and then join the dataframes.

indexs = list(range(1416,1440))
lines = pd.DataFrame(np.nan ,columns =  df.columns.values , index = indexs)
print(lines.head())

Output:

      Date and Time  str  Year  Month  Day  Values
1416            NaN  NaN   NaN    NaN  NaN     NaN
1417            NaN  NaN   NaN    NaN  NaN     NaN
1418            NaN  NaN   NaN    NaN  NaN     NaN
1419            NaN  NaN   NaN    NaN  NaN     NaN
1420            NaN  NaN   NaN    NaN  NaN     NaN

Then I add the NaN rows to the data frame with the following code:

b = {}
for i in range(2015,2020):
  if list(d[i][d[i]['Month'] == '02'].tail(1)['Day'])[0] == '28':
    bi = pd.concat([d[i].iloc[0:1416], lines]).reset_index(drop=True)
    b[i] = pd.concat([bi,d[i].iloc[1416:8783]]).reset_index(drop=True)
  else:
    b[i] = d[i].copy()
for i in range(2015,2020):
  print(i,':',b[i].shape)
  print(b[i].iloc[1438:1441])
  print('-----------------------------------------------------------------')

Output:

2015 : (8784, 6) 
     Date and Time         str  Year Month  Day    Values
1438           NaT         NaN   NaN   NaN  NaN       NaN
1439           NaT         NaN   NaN   NaN  NaN       NaN
1440    2015-03-01  2015-03-01  2015    03   01  0.676486
-----------------------------------------------------------------
2016 : (8784, 6) 
           Date and Time         str  Year Month Day    Values
1438 2016-02-29 22:00:00  2016-02-29  2016    02  29  0.035787
1439 2016-02-29 23:00:00  2016-02-29  2016    02  29  0.955364
1440 2016-03-01 00:00:00  2016-03-01  2016    03  01  0.014158
-----------------------------------------------------------------
2017 : (8784, 6) 
     Date and Time         str  Year Month  Day    Values
1438           NaT         NaN   NaN   NaN  NaN       NaN
1439           NaT         NaN   NaN   NaN  NaN       NaN
1440    2017-03-01  2017-03-01  2017    03   01  0.035952
-----------------------------------------------------------------
2018 : (8784, 6) 
     Date and Time         str  Year Month  Day   Values
1438           NaT         NaN   NaN   NaN  NaN      NaN
1439           NaT         NaN   NaN   NaN  NaN      NaN
1440    2018-03-01  2018-03-01  2018    03   01  0.44876
-----------------------------------------------------------------
2019 : (8784, 6) 
     Date and Time         str  Year Month  Day    Values
1438           NaT         NaN   NaN   NaN  NaN       NaN
1439           NaT         NaN   NaN   NaN  NaN       NaN
1440    2019-03-01  2019-03-01  2019    03   01  0.096433
-----------------------------------------------------------------

And finally, if we want to create the dataframe you want:

final_df = pd.DataFrame(index = b[2016]['Date and Time'])
for i in range(2015,2020):
  final_df[i] = np.array(b[i]['Values'])

Output:

                         2015      2016      2017      2018      2019
Date and Time                                                        
2016-01-01 00:00:00  0.153948  0.145602  0.957265  0.427620  0.868948
2016-01-01 01:00:00  0.663132  0.318746  0.013658  0.380105  0.442332
2016-01-01 02:00:00  0.141534  0.483471  0.048050  0.139065  0.702211
2016-01-01 03:00:00  0.263551  0.737948  0.528827  0.472889  0.165095
2016-01-01 04:00:00  0.094391  0.939737  0.120343  0.134011  0.297611
...                       ...       ...       ...       ...       ...
2016-02-28 22:00:00  0.404741  0.864423  0.954680  0.848050  0.498278
2016-02-28 23:00:00  0.299090  0.348466  0.625131  0.238984  0.127525
2016-02-29 00:00:00       NaN  0.375469       NaN       NaN       NaN
2016-02-29 01:00:00       NaN  0.186092       NaN       NaN       NaN
...                       ...       ...       ...       ...       ...
2016-02-29 22:00:00       NaN  0.035787       NaN       NaN       NaN
2016-02-29 23:00:00       NaN  0.955364       NaN       NaN       NaN
2016-03-01 00:00:00  0.676486  0.014158  0.035952  0.448760  0.096433
2016-03-01 01:00:00  0.792168  0.520436  0.138874  0.229396  0.913848
...                       ...       ...       ...       ...       ...
2016-12-31 19:00:00  0.517459  0.956219  0.116335  0.736170  0.739740
2016-12-31 20:00:00  0.814362  0.324332  0.324911  0.485508  0.055802
2016-12-31 21:00:00  0.870459  0.809150  0.335461  0.124459  0.952963
2016-12-31 22:00:00  0.549891  0.043623  0.997053  0.144286  0.106768
2016-12-31 23:00:00  0.047090  0.730074  0.698159  0.235253  0.834583

[8784 rows x 5 columns]

IIUC, you just need the original DataFrame:

origin = 2016  # or whatever year of your chosing
newidx = pd.to_datetime(df.index.strftime(f'{origin}-%m-%d %H:%M:%S'))
newdf = (
    df[['value']]
    .assign(year=df.index.year)
    .set_axis(newidx, axis=0)
    .pivot(columns='year', values='value')
)

Using the small sample data you provided for that "original frame" df, we get:

>>> newdf
year                 2015  2019
date                           
2016-01-01 00:00:00   0.0   NaN
2016-01-01 01:00:00   6.0   NaN
2016-01-01 02:00:00   2.0   NaN
...                   ...   ...
2016-12-31 21:00:00   NaN   3.0
2016-12-31 22:00:00   NaN   3.0
2016-12-31 23:00:00   NaN   6.0

On a larger (made-up) DataFrame:

np.random.seed(0)

ix = pd.date_range('2015', '2020', freq='H', inclusive='left')
df = pd.DataFrame({'value': np.random.randint(0, 100, len(ix))}, index=ix)

# (code above)

>>> newdf
year                 2015  2016  2017  2018  2019
2016-01-01 00:00:00  44.0  82.0  96.0  68.0  71.0
2016-01-01 01:00:00  47.0  99.0  54.0  44.0  71.0
2016-01-01 02:00:00  64.0  28.0  11.0  10.0  55.0
...                   ...   ...   ...   ...   ...
2016-12-31 21:00:00   0.0  30.0  28.0  53.0  14.0
2016-12-31 22:00:00  47.0  82.0  19.0   6.0  64.0
2016-12-31 23:00:00  22.0  75.0  13.0  37.0  35.0

and, as expected, only 2016 has values for 02/29:

>>> newdf[:'2016-02-29 02:00:00'].tail()
year                 2015  2016  2017  2018  2019
2016-02-28 22:00:00  74.0  54.0  22.0  17.0  39.0
2016-02-28 23:00:00  37.0  61.0  31.0   8.0  62.0
2016-02-29 00:00:00   NaN  34.0   NaN   NaN   NaN
2016-02-29 01:00:00   NaN  82.0   NaN   NaN   NaN
2016-02-29 02:00:00   NaN  67.0   NaN   NaN   NaN

Addendum: by months

The code above can easily be adapted for month columns:

Either using MultiIndex columns:

origin = 2016
newidx = pd.to_datetime(df.index.strftime(f'{origin}-01-%d %H:%M:%S'))
newdf = (
    df[['value']]
    .assign(year=df.index.year, month=df.index.month)
    .set_axis(newidx, axis=0)
    .pivot(columns=['year', 'month'], values='value')
)
>>> newdf
year                 2015                                                        ...  2019                                                      
month                  1     2     3     4     5     6     7     8     9     10  ...    3     4     5     6     7     8     9     10    11    12
2016-01-01 00:00:00  44.0  49.0  40.0  60.0  71.0  67.0  63.0  16.0  71.0  78.0  ...  32.0  35.0  51.0  35.0  68.0  43.0   4.0  23.0  65.0  19.0
2016-01-01 01:00:00  47.0  71.0  27.0  88.0  68.0  58.0  74.0  67.0  98.0  49.0  ...  85.0  27.0  70.0   8.0   9.0  29.0  78.0  29.0  21.0  68.0
2016-01-01 02:00:00  64.0  90.0   4.0  61.0  95.0   3.0  57.0  41.0  28.0  24.0  ...   7.0  93.0  21.0  10.0  72.0  79.0  46.0  45.0  25.0  99.0
...                   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
2016-01-31 21:00:00  48.0   NaN  24.0   NaN  79.0   NaN  55.0  47.0   NaN  20.0  ...  87.0   NaN  19.0   NaN  56.0  76.0   NaN  91.0   NaN  14.0
2016-01-31 22:00:00  82.0   NaN   6.0   NaN  46.0   NaN   9.0  57.0   NaN  21.0  ...  69.0   NaN  67.0   NaN  85.0  38.0   NaN  34.0   NaN  64.0
2016-01-31 23:00:00  51.0   NaN  97.0   NaN  45.0   NaN  55.0  41.0   NaN  87.0  ...  94.0   NaN  80.0   NaN  37.0  81.0   NaN  98.0   NaN  35.0

or a simple string column made of %Y-%m to indicate year/month:

origin = 2016
newidx = pd.to_datetime(df.index.strftime(f'{origin}-01-%d %H:%M:%S'))
newdf = (
    df[['value']]
    .assign(ym=df.index.strftime(f'%Y-%m'))
    .set_axis(newidx, axis=0)
    .pivot(columns='ym', values='value')
)
>>> newdf
ym                   2015-01  2015-02  2015-03  2015-04  2015-05  2015-06  2015-07  2015-08  2015-09  2015-10  ...  2019-03  2019-04  2019-05  2019-06  2019-07  2019-08  2019-09  
2016-01-01 00:00:00     44.0     49.0     40.0     60.0     71.0     67.0     63.0     16.0     71.0     78.0  ...     32.0     35.0     51.0     35.0     68.0     43.0      4.0   
2016-01-01 01:00:00     47.0     71.0     27.0     88.0     68.0     58.0     74.0     67.0     98.0     49.0  ...     85.0     27.0     70.0      8.0      9.0     29.0     78.0   
2016-01-01 02:00:00     64.0     90.0      4.0     61.0     95.0      3.0     57.0     41.0     28.0     24.0  ...      7.0     93.0     21.0     10.0     72.0     79.0     46.0   
...                      ...      ...      ...      ...      ...      ...      ...      ...      ...      ...  ...      ...      ...      ...      ...      ...      ...      ...   
2016-01-31 21:00:00     48.0      NaN     24.0      NaN     79.0      NaN     55.0     47.0      NaN     20.0  ...     87.0      NaN     19.0      NaN     56.0     76.0      NaN   
2016-01-31 22:00:00     82.0      NaN      6.0      NaN     46.0      NaN      9.0     57.0      NaN     21.0  ...     69.0      NaN     67.0      NaN     85.0     38.0      NaN   
2016-01-31 23:00:00     51.0      NaN     97.0      NaN     45.0      NaN     55.0     41.0      NaN     87.0  ...     94.0      NaN     80.0      NaN     37.0     81.0      NaN   

ym                   2019-10  2019-11  2019-12  
2016-01-01 00:00:00     23.0     65.0     19.0  
2016-01-01 01:00:00     29.0     21.0     68.0  
2016-01-01 02:00:00     45.0     25.0     99.0  
...                      ...      ...      ...  
2016-01-31 21:00:00     91.0      NaN     14.0  
2016-01-31 22:00:00     34.0      NaN     64.0  
2016-01-31 23:00:00     98.0      NaN     35.0  

The former gives you more flexibility to index sub-parts. For example, here is a selection of rows for "all February months":

>>> newdf.loc[:'2016-01-29 02:00:00', (slice(None), 2)].tail()
year                 2015  2016  2017  2018  2019
month                   2     2     2     2     2
2016-01-28 22:00:00  74.0  54.0  22.0  17.0  39.0
2016-01-28 23:00:00  37.0  61.0  31.0   8.0  62.0
2016-01-29 00:00:00   NaN  34.0   NaN   NaN   NaN
2016-01-29 01:00:00   NaN  82.0   NaN   NaN   NaN
2016-01-29 02:00:00   NaN  67.0   NaN   NaN   NaN
Answered By: Pierre D
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.