How do I handle dates in multi-indexed columns?
Question:
In my code, I’m assigning an index to the date field, then converting it from string to datetime as below:
path = "/content/drive/MyDrive/ColabNotebooks/dataset/"
data = pd.read_csv(path+'2022Sales.csv', sep='\t', lineterminator='\r')
df = data.groupby(['DATE1']).agg({'QTY': "sum"})
df
Then I split the dataset into training and testing sets and plot the data, as below:
train = df.loc[df.index < '2022-05-01']
test = df.loc[df.index >= '2022-05-01']
fig, ax = plt.subplots(figsize=(15, 5))
train.plot(ax=ax, label='Training Set', title='Data Train/Test Split')
test.plot(ax=ax, label='Test Set')
ax.axvline('2022-05-01', color='black', ls='--')
ax.legend(['Training Set', 'Test Set'])
plt.show()
Everything is fine, and I’m getting the required output:
Now, I’m trying to do the same with multiple indexes, so I read the data as below:
path = "/content/drive/MyDrive/ColabNotebooks/dataset/"
data = pd.read_csv(path+'2022Sales.csv', sep='\t', lineterminator='\r')
df = data.groupby(['DATE1', 'ITEM_ID', 'SLS_CNTR_ID']).agg({'QTY': "sum"})
df
And I got the required dataframe correctly:
But now I have 2 issues:
- How can I convert index[0], i.e. the 'DATE1' field, from str to date?
- How can I compare the value of index[0], i.e. the 'DATE1' field, to the required periods, so I can plot the chart smoothly as I did before?
Answers:
Suppose we have a dataframe whose dates are stored as strings:
dt_string = ['2022-01-01 00:00:00:000', '2022-01-01 00:03:00:000', '2022-01-01 01:00:00:000', '2022-01-01 00:00:00:379']
dt_time = pd.DataFrame(dt_string, columns=['str_dt'])
dt_time
###
str_dt
0 2022-01-01 00:00:00:000
1 2022-01-01 00:03:00:000
2 2022-01-01 01:00:00:000
3 2022-01-01 00:00:00:379
Turn it into datetime format:
dt_time['str_dt'] = pd.to_datetime(dt_time['str_dt'], format='%Y-%m-%d %H:%M:%S:%f')
dt_time['dt'] = dt_time['str_dt'].dt.normalize()
dt_time
###
str_dt dt
0 2022-01-01 00:00:00:000 2022-01-01
1 2022-01-01 00:03:00:000 2022-01-01
2 2022-01-01 01:00:00:000 2022-01-01
3 2022-01-01 00:00:00:379 2022-01-01
dt_time.dtypes
###
str_dt datetime64[ns]
dt datetime64[ns]
dtype: object
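The same conversion can be applied to the 'DATE1' level of the grouped dataframe from the question. The following is a minimal sketch, not the asker's actual data: the rows in `raw` are made up for illustration, and the `%Y-%m-%d` format is an assumption about how 'DATE1' is stored in the CSV. It shows two possible routes: replacing the level on an existing MultiIndex with `set_levels`, or parsing the column before grouping.

```python
import pandas as pd

# Hypothetical rows standing in for 2022Sales.csv (not the real data)
raw = pd.DataFrame({
    "DATE1": ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03"],
    "ITEM_ID": [95, 106, 96, 97],
    "SLS_CNTR_ID": [4.0, 30.0, 30.0, 19.0],
    "QTY": [1, 2, 3, 4],
})
keys = ["DATE1", "ITEM_ID", "SLS_CNTR_ID"]

# Grouping the raw strings gives a MultiIndex whose first level holds str values
df = raw.groupby(keys).agg({"QTY": "sum"})

# Route A: convert the existing level in place of the string level
df_a = df.copy()
new_dates = pd.to_datetime(df_a.index.levels[0], format="%Y-%m-%d")
df_a.index = df_a.index.set_levels(new_dates, level="DATE1")

# Route B: parse the column first, then group
parsed = raw.assign(DATE1=pd.to_datetime(raw["DATE1"], format="%Y-%m-%d"))
df_b = parsed.groupby(keys).agg({"QTY": "sum"})

print(df_a.index.get_level_values("DATE1").dtype)
print(df_b.index.get_level_values("DATE1").dtype)
```

Route B is usually the simpler choice when the CSV is being read anyway; Route A is useful when only the grouped dataframe is at hand.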
Multi-index Selection
df
###
                                QTY
DATE1      ITEM_ID SLS_CNTE_ID
2022-01-01 95      4.0            1
           106     30.0           1
                   100.0          1
           133     19.0           1
           282     30.0           1
                   4.0            1
2022-01-02 96      30.0           1
                   100.0          1
2022-01-03 97      19.0           1
                   30.0           1
           199     4.0            1
           200     30.0           1
2022-01-04 23      100.0          1
           42      19.0           1
                   30.0           1
You can use any of the following; they all return the same result:
df[df.index.get_level_values('DATE1')=='2022-01-01']
df.query('DATE1=="2022-01-01"')
df.loc[['2022-01-01']]
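To check that claim on a small case, here is a rough sketch that runs the three selections side by side. The sample rows are invented for illustration and the 'DATE1' level is left as strings, as in the dataframe above; `same_rows` is a hypothetical helper written just for this comparison.

```python
import pandas as pd

# Hypothetical sample rows (not the asker's data)
data = pd.DataFrame({
    "DATE1": ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03"],
    "ITEM_ID": [95, 106, 96, 97],
    "SLS_CNTR_ID": [4.0, 30.0, 30.0, 19.0],
    "QTY": [1, 1, 1, 1],
})
df = data.groupby(["DATE1", "ITEM_ID", "SLS_CNTR_ID"]).agg({"QTY": "sum"})

# Boolean mask built from the values of one index level
by_mask = df[df.index.get_level_values("DATE1") == "2022-01-01"]
# query() can refer to a named index level as if it were a column
by_query = df.query('DATE1 == "2022-01-01"')
# A list of labels passed to .loc selects on the first level and keeps all levels
by_loc = df.loc[["2022-01-01"]]


def same_rows(left, right):
    """Hypothetical helper: compare index tuples and QTY values row by row."""
    return (list(left.index) == list(right.index)
            and left["QTY"].tolist() == right["QTY"].tolist())


print(same_rows(by_mask, by_query), same_rows(by_mask, by_loc))
```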
For selecting more than one date:
CASE 1:
Select '2022-01-01' and '2022-01-03'
df[(df.index.get_level_values('DATE1')=='2022-01-01')|(df.index.get_level_values('DATE1')=='2022-01-03')]
OR
df.query('DATE1=="2022-01-01" or DATE1=="2022-01-03"')
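When the list of dates grows, chaining `|` conditions gets unwieldy. One alternative, which is not part of the answer above, is `Index.isin` on the level values. A minimal sketch with made-up rows:

```python
import pandas as pd

# Hypothetical sample rows (not the asker's data)
data = pd.DataFrame({
    "DATE1": ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04"],
    "ITEM_ID": [95, 96, 97, 23],
    "SLS_CNTR_ID": [4.0, 30.0, 19.0, 100.0],
    "QTY": [1, 1, 1, 1],
})
df = data.groupby(["DATE1", "ITEM_ID", "SLS_CNTR_ID"]).agg({"QTY": "sum"})

# Any number of dates can go in this list
wanted = ["2022-01-01", "2022-01-03"]

# isin() returns a boolean array aligned with the rows of df
picked = df[df.index.get_level_values("DATE1").isin(wanted)]
print(picked)
```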
CASE 2:
Select the date range from '2022-01-02' to '2022-01-04'
df.query('DATE1>="2022-01-02" and DATE1<="2022-01-04"')
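Putting the pieces together for the train/test split in the question: the sketch below assumes 'DATE1' has been parsed to datetime before grouping, uses invented rows around the 2022-05-01 cut-off, and sums 'QTY' per date so that each set has one value per day, as in the single-index version. Whether summing across items and centres is the right aggregation for the chart is an assumption.

```python
import pandas as pd

# Hypothetical rows around the split date (not the asker's data)
data = pd.DataFrame({
    "DATE1": ["2022-04-29", "2022-04-30", "2022-05-01", "2022-05-01", "2022-05-02"],
    "ITEM_ID": [95, 106, 95, 106, 95],
    "SLS_CNTR_ID": [4.0, 30.0, 4.0, 30.0, 4.0],
    "QTY": [1, 2, 3, 4, 5],
})
data["DATE1"] = pd.to_datetime(data["DATE1"], format="%Y-%m-%d")
df = data.groupby(["DATE1", "ITEM_ID", "SLS_CNTR_ID"]).agg({"QTY": "sum"})

# Compare the datetime level against a Timestamp instead of a string
split = pd.Timestamp("2022-05-01")
dates = df.index.get_level_values("DATE1")
train = df[dates < split]
test = df[dates >= split]

# Collapse the other index levels so each set has one QTY value per date
train_daily = train.groupby(level="DATE1")["QTY"].sum()
test_daily = test.groupby(level="DATE1")["QTY"].sum()
print(train_daily)
print(test_daily)
```

The two daily series have a plain datetime index, so they can be passed to `.plot(ax=ax)` in the same way as `train` and `test` in the original single-index code.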