Working with dates in multi-indexed columns

Question:

In my code, I’m assigning an index to the date field, then converting it from string to datetime as below:

import pandas as pd
import matplotlib.pyplot as plt

path = "/content/drive/MyDrive/ColabNotebooks/dataset/"
data = pd.read_csv(path+'2022Sales.csv', sep='\t', lineterminator='\r')

df = data.groupby(['DATE1']).agg({'QTY': "sum"})
df

Then I split the dataset into training and testing and plotting the data, as below:

train = df.loc[df.index < '2022-05-01'] 
test = df.loc[df.index >= '2022-05-01']

fig, ax = plt.subplots(figsize=(15, 5))
train.plot(ax=ax, label='Training Set', title='Data Train/Test Split')
test.plot(ax=ax, label='Test Set')
ax.axvline('2022-05-01', color='black', ls='--')
ax.legend(['Training Set', 'Test Set'])
plt.show()

Everything is fine, and I’m getting the required output.

Now, I’m trying to do the same with multiple indexes, so I read the data as below:

path = "/content/drive/MyDrive/ColabNotebooks/dataset/"
data = pd.read_csv(path+'2022Sales.csv', sep='\t', lineterminator='\r')

df = data.groupby(['DATE1', 'ITEM_ID', 'SLS_CNTR_ID']).agg({'QTY': "sum"})
df

And I got the required dataframe correctly.

But now I have two issues:

  1. How can I convert level 0 of the index, i.e. the 'DATE1' field, from str to datetime?
  2. How can I compare level 0 of the index, i.e. 'DATE1', against the required periods, so that I can plot the chart as smoothly as I did before?

Asked By: Hasan A Yousef


Answers:

Suppose we have a dataframe whose dates are stored as strings:

dt_string = ['2022-01-01 00:00:00:000', '2022-01-01 00:03:00:000', '2022-01-01 01:00:00:000', '2022-01-01 00:00:00:379']
dt_time = pd.DataFrame(dt_string, columns=['str_dt'])
dt_time
###
                    str_dt
0  2022-01-01 00:00:00:000
1  2022-01-01 00:03:00:000
2  2022-01-01 01:00:00:000
3  2022-01-01 00:00:00:379

Convert the strings to datetime, then normalize to get the date part:

dt_time['str_dt'] = pd.to_datetime(dt_time['str_dt'], format='%Y-%m-%d %H:%M:%S:%f')
dt_time['dt'] = dt_time['str_dt'].dt.normalize()
dt_time
###
                    str_dt          dt
0  2022-01-01 00:00:00.000  2022-01-01
1  2022-01-01 00:03:00.000  2022-01-01
2  2022-01-01 01:00:00.000  2022-01-01
3  2022-01-01 00:00:00.379  2022-01-01

dt_time.dtypes
###
str_dt    datetime64[ns]
dt        datetime64[ns]
dtype: object
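The same conversion can be applied to one level of a MultiIndex rather than to a column. The following is only a minimal sketch: the `data` frame is a made-up stand-in for the sales file (its values are hypothetical; only the column names come from the question), and it assumes the strings in `DATE1` are in a format `pd.to_datetime` can parse without an explicit `format`.

```python
import pandas as pd

# Hypothetical stand-in for the sales data; only the column names
# are taken from the question.
data = pd.DataFrame({
    "DATE1": ["2022-04-30", "2022-04-30", "2022-05-01", "2022-05-02"],
    "ITEM_ID": [95, 106, 95, 106],
    "SLS_CNTR_ID": [4.0, 30.0, 4.0, 30.0],
    "QTY": [1, 2, 3, 4],
})
df = data.groupby(["DATE1", "ITEM_ID", "SLS_CNTR_ID"]).agg({"QTY": "sum"})

# Swap the string labels of the DATE1 level for parsed datetimes.
# set_levels returns a new MultiIndex, so assign it back to df.index.
new_dates = pd.to_datetime(df.index.levels[0])
df.index = df.index.set_levels(new_dates, level="DATE1")

print(df.index.get_level_values("DATE1").dtype)
```

Once the level holds datetimes, a mask such as `df.index.get_level_values('DATE1') >= '2022-05-01'` compares dates rather than strings.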

Multi-index Selection

df
###
                                QTY
DATE1      ITEM_ID SLS_CNTR_ID     
2022-01-01 95      4.0            1
           106     30.0           1
                   100.0          1
           133     19.0           1
           282     30.0           1
                   4.0            1
2022-01-02 96      30.0           1
                   100.0          1
2022-01-03 97      19.0           1
                   30.0           1
           199     4.0            1
           200     30.0           1
2022-01-04 23      100.0          1
           42      19.0           1
                   30.0           1

You can use any of the following; they all return the same result:

df[df.index.get_level_values('DATE1')=='2022-01-01']
df.query('DATE1=="2022-01-01"')
df.loc[['2022-01-01']]

To select more than one date:

CASE 1:
Select '2022-01-01' and '2022-01-03'

df[(df.index.get_level_values('DATE1')=='2022-01-01')|(df.index.get_level_values('DATE1')=='2022-01-03')]

OR

df.query('DATE1=="2022-01-01" or DATE1=="2022-01-03"')
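When the list of dates grows, chaining `|` or `or` gets unwieldy. One possible alternative, shown here as a sketch on a hypothetical four-row miniature of the frame above, is to call `isin` on the values of the `DATE1` level:

```python
import pandas as pd

# Hypothetical miniature of the multi-indexed frame shown above.
idx = pd.MultiIndex.from_tuples(
    [
        ("2022-01-01", 95, 4.0),
        ("2022-01-02", 96, 30.0),
        ("2022-01-03", 97, 19.0),
        ("2022-01-04", 23, 100.0),
    ],
    names=["DATE1", "ITEM_ID", "SLS_CNTR_ID"],
)
df = pd.DataFrame({"QTY": [1, 1, 1, 1]}, index=idx)

# Keep every row whose DATE1 label is in the wanted list.
wanted = ["2022-01-01", "2022-01-03"]
selected = df[df.index.get_level_values("DATE1").isin(wanted)]
print(selected)
```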

CASE 2:
Select date range from '2022-01-02' to '2022-01-04'

df.query('DATE1>="2022-01-02" and DATE1<="2022-01-04"')
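For the train/test split in the question, the same idea can be used with a boolean mask on the `DATE1` level. This is a sketch under assumptions: the `data` frame below is invented for illustration, the dates are parsed before grouping, and the other two levels are summed away so that there is one value per date to plot (whether summing is the right aggregation depends on what the chart should show).

```python
import pandas as pd

# Hypothetical stand-in for the sales data; only the column names
# are taken from the question.
data = pd.DataFrame({
    "DATE1": ["2022-04-29", "2022-04-30", "2022-04-30", "2022-05-01", "2022-05-02"],
    "ITEM_ID": [95, 95, 106, 95, 106],
    "SLS_CNTR_ID": [4.0, 4.0, 30.0, 4.0, 30.0],
    "QTY": [1, 2, 3, 4, 5],
})

# Parse the dates before grouping so the DATE1 level is datetime from the start.
data["DATE1"] = pd.to_datetime(data["DATE1"])
df = data.groupby(["DATE1", "ITEM_ID", "SLS_CNTR_ID"]).agg({"QTY": "sum"})

# Split on the DATE1 level only.
split = pd.Timestamp("2022-05-01")
dates = df.index.get_level_values("DATE1")
train = df[dates < split]
test = df[dates >= split]

# Collapse the remaining levels: one total per date, indexed by DATE1.
train_daily = train.groupby(level="DATE1")["QTY"].sum()
test_daily = test.groupby(level="DATE1")["QTY"].sum()
print(train_daily)
print(test_daily)
```

`train_daily` and `test_daily` have a plain datetime index, so they can be passed to `.plot(ax=ax, ...)` in the same way as `train` and `test` in the single-index version of the question.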
Answered By: Baron Legendre