How to replace blank values and rearrange data by date?

Question:

I have a CSV file containing values for 1988-2020.
The first two columns in the file contain the year and month (1988,1 … 2020,12). The remaining columns contain the values for each day of the month. However, values are missing for some dates, and some months are missing entirely. The missing values need to be replaced by "NA" or "-99", and leap years need to be taken into account.
I need to convert this data into the following format, without skipping any date from 01-01-1988 to 31-12-2020.

Column1,Column2
“dd-mm-yyyy”, “Value”

example data

Asked By: Bjrj92


Answers:

Here is an attempt to reshape your data into one row per day from 01-01-1988 to 31-12-2020, with the days of every missing month added and filled with NaN. I may have missed something, so please check the result yourself and let me know if anything is wrong:

import pandas as pd
import numpy as np

# parse first two columns (year, month) as a single date column
# (note: nested-list parse_dates is deprecated as of pandas 2.2)
df = pd.read_csv('/path_to_file/Data.csv', parse_dates=[[0, 1]], header=None)

# rename columns
df.columns = range(0, 33)

# inspecting the df has shown that the row with index 2016-12-01 is invalid
# and the column 32 is empty  
df = df.drop(columns=32).set_index(0).drop(index='2016-12-01')

# fill missing data for days 1-28 (which exist in every month) with -99
df.iloc[:, 0:28] = df.iloc[:, 0:28].fillna(value=-99, axis=0)

# reformat index
df.index = df.index.strftime('%m-%Y')

print(df)

Intermediate output:

           1     2     3     4     5     6     7     8     9    10  ...   22  
0                                                                   ...        
01-1988   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.0  ...  0.0   
02-1988   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.0  ...  0.0   
03-1988   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.0  ...  0.0   
04-1988   0.0   0.0   0.0   0.0   0.0   0.0  36.1   0.0   0.0  0.0  ...  0.0   
01-1990   9.8  13.0   0.0   0.0   0.0   0.0  16.8  26.0  10.0  0.0  ...  0.0   
...       ...   ...   ...   ...   ...   ...   ...   ...   ...  ...  ...  ...   
08-2020   0.0   8.0  15.0   9.0   8.0  13.0   0.0  12.0   0.0  0.0  ...  0.0   
09-2020  49.0  15.0   7.0   9.0   5.0  22.0   0.0   0.0   5.0  3.0  ...  0.0   
10-2020   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.0  ...  0.0   
11-2020  32.0   4.0  26.0  20.0  10.0   0.0  10.0   0.0   0.0  0.0  ...  0.0   
12-2020   0.0   0.0   3.0   2.0   0.0   8.0   6.0  23.0   3.0  0.0  ...  0.0   

          23    24    25   26    27   28   29   30    31  
0                                                         
01-1988  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0   0.0  
02-1988  0.0   0.0   0.0  0.0   0.0  0.0  0.0  NaN   NaN  
03-1988  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0   0.0  
04-1988  0.0  35.3  48.0  0.0   4.0  0.0  0.0  0.0   NaN  
01-1990  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0   0.0  
...      ...   ...   ...  ...   ...  ...  ...  ...   ...  
08-2020  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0  36.0  
09-2020  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0   NaN  
10-2020  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0   0.0  
11-2020  0.0   0.0   0.0  0.0  32.0  2.0  5.0  0.0   NaN  
12-2020  0.0   0.0   0.0  0.0   0.0  0.0  0.0  0.0   0.0  

[199 rows x 31 columns]

The monthly rows are then flattened into a daily series:

# flatten df into 1-d array
flattened = df.to_numpy().flatten()

# expand months in df into days
valid_days = [pd.date_range(start=m, periods=pd.Period(m).days_in_month, freq='D')
              for m in df.index]

# create datetime index by days
valid_days = pd.to_datetime([item for sublist in valid_days 
                             for item in sublist]).strftime('%d-%m-%Y')

# create new DataFrame from flattened values and daily index
# (dropping NaN here assumes the only NaNs left are days that do not exist in a month)
df_days = pd.DataFrame(flattened[~np.isnan(flattened)], 
                       index=valid_days, 
                       columns=['value'])

# create the range of all days within the period
total_days = pd.date_range('01-01-1988', '31-12-2020', freq='D').strftime('%d-%m-%Y')

# reindex df_days with all days within the period
df_days = df_days.reindex(total_days)

print(df_days)

Output:

          value
01-01-1988  0.0
02-01-1988  0.0
03-01-1988  0.0
04-01-1988  0.0
05-01-1988  0.0
    ...     ...
27-12-2020  0.0
28-12-2020  0.0
29-12-2020  0.0
30-12-2020  0.0
31-12-2020  0.0
12054 rows × 1 columns
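
The question asks for "NA" or "-99" rather than NaN in the final file. A minimal sketch of both options follows, using a small hypothetical stand-in for df_days (the sample values and the "date" column label are assumptions, not taken from the real data):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for df_days: five days around 29-02-1988, one value missing
days = pd.date_range("1988-02-27", "1988-03-02", freq="D").strftime("%d-%m-%Y")
df_days = pd.DataFrame({"value": [0.0, 1.5, np.nan, 3.0, 0.0]}, index=days)

# Option 1: keep NaN in memory and write it out as the text "NA"
buf = io.StringIO()  # a file path would work here as well
df_days.to_csv(buf, index_label="date", na_rep="NA")
csv_text = buf.getvalue()

# Option 2: replace missing values with the numeric sentinel -99
df_filled = df_days.fillna(-99)

print(csv_text)
print(df_filled)
```

Writing "NA" only at export time keeps the column numeric while you are still working with it, whereas -99 has to be excluded by hand from any later sums or means.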

I would be glad to learn of a more elegant way to solve this.
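
One possibly more compact route, sketched here under assumptions rather than tested on the real file: melt the day columns into rows, let pandas build the dates, and reindex onto the full calendar. The column names year, month and 1..31, the sample values, and the shortened date range are all hypothetical. Impossible dates such as 30 February become NaT and are dropped, so leap years are handled by the date parser and the result does not depend on where NaN happens to sit in the table.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the layout described in the question:
# year, month, then one column per day of the month (1..31).
# January 1988 is deliberately absent to mimic a missing month.
wide = pd.DataFrame(
    [[1988, 2] + [1.0] * 29 + [np.nan] * 2,
     [1988, 3] + [2.0] * 31],
    columns=["year", "month"] + list(range(1, 32)),
)

# Wide -> long: one row per (year, month, day) cell
tidy = wide.melt(id_vars=["year", "month"], var_name="day", value_name="value")

# Build real dates; impossible ones (e.g. 30 February) become NaT and are dropped
parts = tidy[["year", "month", "day"]].astype(int)
tidy["date"] = pd.to_datetime(parts, errors="coerce")
tidy = tidy.dropna(subset=["date"]).set_index("date").sort_index()

# Reindex onto every calendar day of the period; absent days become NaN
all_days = pd.date_range("1988-01-01", "1988-03-31", freq="D")
daily = tidy["value"].reindex(all_days)

# Format as requested: dd-mm-yyyy labels, -99 for missing values
daily.index = daily.index.strftime("%d-%m-%Y")
daily = daily.fillna(-99)

print(daily)
```

For the full data set the range would run from "1988-01-01" to "2020-12-31", and any invalid rows (such as the 2016-12 one mentioned above) would still need to be removed first.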

Answered By: n.shabankin