Getting the "ValueError: Columns must be same length as key" when splitting out the date column by "/"
Question:
I tried to split the date column into 3 separate columns
df[['date1', 'date2', 'date3']] = df['Date'].str.split('/')
Here’s the error I’m getting
ValueError: Columns must be same length as key
It works fine when I just type in df[‘Date’].str.split(‘/’) and I was able to confirm that each list has only 3 elements.
I think it might have to do with some of the dates not having the full year, so the length of the column varies but I’m not sure why this would matter here.
df['Date'].value_counts()
22/05/2022 10
26/12/05 10
26/12/08 10
11/05/14 10
12/05/2019 10
..
28/02/05 1
14/09/2015 1
27/09/2015 1
28/09/2015 1
17/08/2015 1
df['Date'].str.len().value_counts()
8.0 4850
10.0 2280
Name: Date, dtype: int64
Answers:
The error is caused by not having all dates in the ‘Date’ column haveing the same length.
The first format 22/05/2022 the length is 8 characters, but in the second format 26/12/05 the length is only 6 characters.
A solution would be to first convert all dates to a uniform length:
def convert_date(date):
if len(date) == 8:
date = "20" + date[-2:] + "/" + date[:2] + "/" + date[3:5]
return date
df['Date'] = df['Date'].apply(convert_date)
You can then split the dates into separate columns using:
df[['date1', 'date2', 'date3']] = df['Date'].str.split('/')
By default Series.str.split
will return a Series (i.e. a single "column") with lists that contain the split elements).
df['Date'].str.split('/')
0 [22, 05, 2022]
1 [26, 12, 05]
2 [26, 12, 08]
3 [11, 5, 2014]
4 [12, 5, 2019]
5 [28, 02, 05]
6 [14, 09, 2015]
7 [27, 09, 2015]
8 [28, 09, 2015]
9 [17, 08, 2015]
Name: Date, dtype: object
So, the error you are getting is the result of trying to assign this single "column" to three new columns. To fix this, you need to set the expand
parameter to True
. This will expand the result into three different columns, which we can then assign as intended:
df[['date1', 'date2', 'date3']] = df['Date'].str.split('/', expand=True)
df
Date Count date1 date2 date3
0 22/05/2022 10 22 05 2022
1 26/12/05 10 26 12 05
2 26/12/08 10 26 12 08
3 11/5/2014 10 11 5 2014
4 12/5/2019 10 12 5 2019
5 28/02/05 1 28 02 05
6 14/09/2015 1 14 09 2015
7 27/09/2015 1 27 09 2015
8 28/09/2015 1 28 09 2015
9 17/08/2015 1 17 08 2015
Data used
import pandas as pd
data = {'Date': {0: '22/05/2022', 1: '26/12/05', 2: '26/12/08', 3: '11/5/2014',
4: '12/5/2019', 5: '28/02/05', 6: '14/09/2015', 7: '27/09/2015',
8: '28/09/2015', 9: '17/08/2015'},
'Count': {0: 10, 1: 10, 2: 10, 3: 10, 4: 10, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}}
df = pd.DataFrame(data)
Incidentally, while the above should make the assignment work, if you are sure that the logic of the date strings is always: day / month / year (despite the differences in formatting), it is probably a better idea to rely on pd.to_datetime
with dayfirst
parameter set to True
, and then to use Series.dt.day
and same for month
and year
:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['day'] = df['Date'].dt.day
df['month'] = df['Date'].dt.month
df['year'] = df['Date'].dt.year
df
Date Count day month year
0 2022-05-22 10 22 5 2022
1 2005-12-26 10 26 12 2005
2 2008-12-26 10 26 12 2008
3 2014-05-11 10 11 5 2014
4 2019-05-12 10 12 5 2019
5 2005-02-28 1 28 2 2005
6 2015-09-14 1 14 9 2015
7 2015-09-27 1 27 9 2015
8 2015-09-28 1 28 9 2015
9 2015-08-17 1 17 8 2015
Note that you’ll end up with proper and consistent int
values this way.
I tried to split the date column into 3 separate columns
df[['date1', 'date2', 'date3']] = df['Date'].str.split('/')
Here’s the error I’m getting
ValueError: Columns must be same length as key
It works fine when I just type in df[‘Date’].str.split(‘/’) and I was able to confirm that each list has only 3 elements.
I think it might have to do with some of the dates not having the full year, so the length of the column varies but I’m not sure why this would matter here.
df['Date'].value_counts()
22/05/2022 10
26/12/05 10
26/12/08 10
11/05/14 10
12/05/2019 10
..
28/02/05 1
14/09/2015 1
27/09/2015 1
28/09/2015 1
17/08/2015 1
df['Date'].str.len().value_counts()
8.0 4850
10.0 2280
Name: Date, dtype: int64
The error is caused by not having all dates in the ‘Date’ column haveing the same length.
The first format 22/05/2022 the length is 8 characters, but in the second format 26/12/05 the length is only 6 characters.
A solution would be to first convert all dates to a uniform length:
def convert_date(date):
if len(date) == 8:
date = "20" + date[-2:] + "/" + date[:2] + "/" + date[3:5]
return date
df['Date'] = df['Date'].apply(convert_date)
You can then split the dates into separate columns using:
df[['date1', 'date2', 'date3']] = df['Date'].str.split('/')
By default Series.str.split
will return a Series (i.e. a single "column") with lists that contain the split elements).
df['Date'].str.split('/')
0 [22, 05, 2022]
1 [26, 12, 05]
2 [26, 12, 08]
3 [11, 5, 2014]
4 [12, 5, 2019]
5 [28, 02, 05]
6 [14, 09, 2015]
7 [27, 09, 2015]
8 [28, 09, 2015]
9 [17, 08, 2015]
Name: Date, dtype: object
So, the error you are getting is the result of trying to assign this single "column" to three new columns. To fix this, you need to set the expand
parameter to True
. This will expand the result into three different columns, which we can then assign as intended:
df[['date1', 'date2', 'date3']] = df['Date'].str.split('/', expand=True)
df
Date Count date1 date2 date3
0 22/05/2022 10 22 05 2022
1 26/12/05 10 26 12 05
2 26/12/08 10 26 12 08
3 11/5/2014 10 11 5 2014
4 12/5/2019 10 12 5 2019
5 28/02/05 1 28 02 05
6 14/09/2015 1 14 09 2015
7 27/09/2015 1 27 09 2015
8 28/09/2015 1 28 09 2015
9 17/08/2015 1 17 08 2015
Data used
import pandas as pd
data = {'Date': {0: '22/05/2022', 1: '26/12/05', 2: '26/12/08', 3: '11/5/2014',
4: '12/5/2019', 5: '28/02/05', 6: '14/09/2015', 7: '27/09/2015',
8: '28/09/2015', 9: '17/08/2015'},
'Count': {0: 10, 1: 10, 2: 10, 3: 10, 4: 10, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}}
df = pd.DataFrame(data)
Incidentally, while the above should make the assignment work, if you are sure that the logic of the date strings is always: day / month / year (despite the differences in formatting), it is probably a better idea to rely on pd.to_datetime
with dayfirst
parameter set to True
, and then to use Series.dt.day
and same for month
and year
:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['day'] = df['Date'].dt.day
df['month'] = df['Date'].dt.month
df['year'] = df['Date'].dt.year
df
Date Count day month year
0 2022-05-22 10 22 5 2022
1 2005-12-26 10 26 12 2005
2 2008-12-26 10 26 12 2008
3 2014-05-11 10 11 5 2014
4 2019-05-12 10 12 5 2019
5 2005-02-28 1 28 2 2005
6 2015-09-14 1 14 9 2015
7 2015-09-27 1 27 9 2015
8 2015-09-28 1 28 9 2015
9 2015-08-17 1 17 8 2015
Note that you’ll end up with proper and consistent int
values this way.