Convert different length list in pandas dataframe to row in one column
Question:
I have a table like this in pandas. The date is always a Friday, but the dates may not be continuous because of holidays or other reasons. The Performance column holds a list containing the performance for each day of the following week. The list in the last row can have fewer than 5 entries: today is Wednesday, so for this week I only have Monday and Tuesday data:
| Date | Performance |
| 2022/01/27 | [0.1,0.1,0.2,0.1,0.3]|
| 2022/02/10 | [0.1,0.1,0.2,0.1,0.3]|
| 2022/02/17 | [0.1,0.1,0.2,0.1,0.3]|
| 2022/02/24 | [0.1,0.1] |
I want to convert this table to a date/performance 2d table with the date of the actual performance day and the performance of each day:
| Date | Performance |
| 2022/01/30 |0.1 |
| 2022/01/31 |0.1 |
| 2022/02/01 |0.2 |
| 2022/02/02 |0.1 |
| 2022/02/03 |0.3 |
| 2022/02/13 |0.1 |
| 2022/02/14 |0.1 |
| 2022/02/15 |0.2 |
| ... |... |
| 2022/02/27 |0.1 |
| 2022/02/28 |0.1 |
How can I do this in Python?
I tried to use sum on the lists to join them all into a 1D array, but then it is a problem to attach the result to the date column.
Answers:
import pandas as pd

# create the input DataFrame
df = pd.DataFrame({'Date': ['2022/01/27', '2022/02/10', '2022/02/17', '2022/02/24'],
                   'Performance': [[0.1, 0.1, 0.2, 0.1, 0.3], [0.1, 0.1, 0.2, 0.1, 0.3], [0.1, 0.1, 0.2, 0.1, 0.3], [0.1, 0.1]]})
# collect the result rows in a list (DataFrame.append was removed in pandas 2.0)
rows = []
# iterate through each row of the input DataFrame
for i in range(len(df)):
    date = pd.to_datetime(df['Date'][i])
    performance = df['Performance'][i]
    # get the minimum value between the length of the performance list and 5
    n = min(len(performance), 5)
    # iterate through each performance value and add it to the result rows
    for j in range(n):
        rows.append({'Date': date + pd.DateOffset(days=j), 'Performance': performance[j]})
# build the final result DataFrame and print it
result_df = pd.DataFrame(rows, columns=['Date', 'Performance'])
print(result_df)
The output looks like this:
         Date  Performance
0  2022-01-27          0.1
1  2022-01-28          0.1
2  2022-01-29          0.2
3  2022-01-30          0.1
4  2022-01-31          0.3
5  2022-02-10          0.1
6  2022-02-11          0.1
7  2022-02-12          0.2
8  2022-02-13          0.1
9  2022-02-14          0.3
10 2022-02-17          0.1
11 2022-02-18          0.1
12 2022-02-19          0.2
13 2022-02-20          0.1
14 2022-02-21          0.3
15 2022-02-24          0.1
16 2022-02-25          0.1
Here is an approach using df.explode() and df.groupby().cumcount():
df = df.explode('Performance')
df['Date'] = pd.to_datetime(df['Date']) + pd.to_timedelta(
    df.groupby(level=0).cumcount(), unit='D')
df = df.reset_index(drop=True)
print(df)
Date Performance
0 2022-01-27 0.1
1 2022-01-28 0.1
2 2022-01-29 0.2
3 2022-01-30 0.1
4 2022-01-31 0.3
5 2022-02-10 0.1
6 2022-02-11 0.1
7 2022-02-12 0.2
8 2022-02-13 0.1
9 2022-02-14 0.3
10 2022-02-17 0.1
11 2022-02-18 0.1
12 2022-02-19 0.2
13 2022-02-20 0.1
14 2022-02-21 0.3
15 2022-02-24 0.1
16 2022-02-25 0.1
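To see why this works, it can help to look at the intermediate values. The sketch below uses a shortened two-row sample, and the names `exploded` and `day_offset` are chosen here only for illustration. It shows that explode() repeats the original index label for every list element, so groupby(level=0).cumcount() numbers the elements within each original row starting from 0.

```python
import pandas as pd

# Shortened sample input (two rows only, for illustration).
df = pd.DataFrame({
    "Date": ["2022/02/17", "2022/02/24"],
    "Performance": [[0.1, 0.2, 0.3], [0.1, 0.1]],
})

# explode() gives each list element its own row and keeps the
# original index label, so labels repeat once per element.
exploded = df.explode("Performance")
print(exploded.index.tolist())   # [0, 0, 0, 1, 1]

# Grouping by the repeated label and counting within each group
# gives the position of every element inside its original list.
day_offset = exploded.groupby(level=0).cumcount()
print(day_offset.tolist())       # [0, 1, 2, 0, 1]

# That position, read as a number of days, is added to the row's date.
dates = pd.to_datetime(exploded["Date"]) + pd.to_timedelta(day_offset, unit="D")
print(dates.dt.strftime("%Y-%m-%d").tolist())
```

The position counter is what turns one date per list into one date per element.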
From what I understand about your description of the DataFrame, its columns represent the following:
- date: contains dates, all of which are Fridays (though not necessarily consecutive ones).
- performance: contains lists of performances corresponding to consecutive days in the next week (from Monday up to at most Friday), i.e. starting 3 days after the value in date.
And the problem is how to form a DataFrame that has each performance and its corresponding date on a separate row.
Input data
import pandas as pd
data = {
    'date': ['2022/01/27', '2022/02/10', '2022/02/17', '2022/02/24'],
    'performance': [
        [0.1,0.1,0.2,0.1,0.3],
        [0.1,0.1,0.2,0.1,0.3],
        [0.1,0.1,0.2,0.1,0.3],
        [0.1,0.1]
    ]
}
df = pd.DataFrame(data)
print(df)
date performance
0 2022/01/27 [0.1, 0.1, 0.2, 0.1, 0.3]
1 2022/02/10 [0.1, 0.1, 0.2, 0.1, 0.3]
2 2022/02/17 [0.1, 0.1, 0.2, 0.1, 0.3]
3 2022/02/24 [0.1, 0.1]
Simple solution
Jamiu S. provided a much more compact solution than my original one, so I’ve included it here first, with the addition of pd.DateOffset() to fully answer the question.
df = df.explode('performance')
df['date'] = pd.to_datetime(df['date']) + pd.DateOffset(days=3) + pd.to_timedelta(
    df.groupby(level=0).cumcount(), unit='D')
df = df.reset_index(drop=True)
print(df)
Output:
date performance
0 2022-01-30 0.1
1 2022-01-31 0.1
2 2022-02-01 0.2
3 2022-02-02 0.1
4 2022-02-03 0.3
5 2022-02-13 0.1
6 2022-02-14 0.1
7 2022-02-15 0.2
8 2022-02-16 0.1
9 2022-02-17 0.3
10 2022-02-20 0.1
11 2022-02-21 0.1
12 2022-02-22 0.2
13 2022-02-23 0.1
14 2022-02-24 0.3
15 2022-02-27 0.1
16 2022-02-28 0.1
Original solution
Consider the following steps:
Step 1: Converting dates to datetime
If not done so already, ensure the date values are represented as datetime objects rather than strings. The pd.to_datetime() function can be used to accomplish this.
# Convert the date column to a datetime object, so it can be manipulated later.
df['date'] = pd.to_datetime(df['date'])
print(df)
date performance
0 2022-01-27 [0.1, 0.1, 0.2, 0.1, 0.3]
1 2022-02-10 [0.1, 0.1, 0.2, 0.1, 0.3]
2 2022-02-17 [0.1, 0.1, 0.2, 0.1, 0.3]
3 2022-02-24 [0.1, 0.1]
Output of df.info():
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 4 non-null datetime64[ns]
1 performance 4 non-null object
dtypes: datetime64[ns](1), object(1)
memory usage: 192.0+ bytes
Step 2: Adding the date of next week
Add a new column 'start_of_week', representing the Monday of the next week (3 days after Friday).
To calculate these dates, pd.DateOffset() can be used to advance the original dates by a certain number of days.
# Create a column representing the start of the next week (Monday) - 3 days after the current date (Friday)
df['start_of_week'] = df['date'] + pd.DateOffset(days=3)
print(df)
date performance start_of_week
0 2022-01-27 [0.1, 0.1, 0.2, 0.1, 0.3] 2022-01-30
1 2022-02-10 [0.1, 0.1, 0.2, 0.1, 0.3] 2022-02-13
2 2022-02-17 [0.1, 0.1, 0.2, 0.1, 0.3] 2022-02-20
3 2022-02-24 [0.1, 0.1] 2022-02-27
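The fixed 3-day offset relies on every value in date really being a Friday. If that cannot be guaranteed, one possible alternative (not part of the original answer) is an anchored offset, pd.offsets.Week(weekday=0), which moves each date to the next Monday after it. The sample dates and the name `next_monday` below are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical sample: a Friday (2022-01-28) and a Thursday (2022-02-10).
dates = pd.to_datetime(pd.Series(["2022/01/28", "2022/02/10"]))

# Week(weekday=0) is anchored on Monday (0 = Monday). Adding it moves
# each date forward to the next Monday strictly after that date.
next_monday = dates + pd.offsets.Week(weekday=0)

print(next_monday.dt.strftime("%Y-%m-%d").tolist())
# ['2022-01-31', '2022-02-14']
```

Note that a date which is already a Monday would be moved a full week ahead, so this only fits data where the stored date precedes the week it describes.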
Step 3: Creating a performance table generator
Create a function that can be applied to each row, to form a two-dimensional "performance table" out of it.
The pd.date_range() function can be used to form a sequence of consecutive dates corresponding to each performance value.
# Generates a sub-DataFrame out of a row containing a week-date and performances.
def create_performance_table(r):
    # Extract the performance values.
    perfs = r['performance']
    # Construct the range of dates corresponding to each of these performances
    dates = pd.date_range(r['start_of_week'], periods=len(perfs))
    # Create a DataFrame out of these values and return it.
    return pd.DataFrame({"date": dates, "performance": perfs})
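As a quick check of what the generator produces for a single row, here is a minimal standalone sketch using the values from the last (shorter) row of the sample data. The names `perfs`, `dates` and `table` are local to this example.

```python
import pandas as pd

# Values taken from the last row of the sample: two performances,
# with the week starting on 2022-02-27.
perfs = [0.1, 0.1]

# With only a start and `periods`, date_range defaults to a daily
# frequency, so it yields one consecutive date per performance value.
dates = pd.date_range("2022-02-27", periods=len(perfs))

table = pd.DataFrame({"date": dates, "performance": perfs})
print(table)
```

Because the number of periods comes from len(perfs), shorter lists simply produce shorter sub-tables.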
Step 4: Creating the sub-tables and combining them
Use the newly defined create_performance_table() function to construct the DataFrame representing the whole performance table.
- The .apply() method applies the function to each row of the DataFrame and collects the returned sub-tables.
- Since the resulting sub-tables will be held in a single Series object, they need to be joined together to form a single DataFrame. The pd.concat() function can do just that (but the Series must first be converted to a list).
# Apply the performance table generator to every row, storing the results as a Series of sub-DataFrames.
tables = df[['performance', 'start_of_week']].apply(create_performance_table, axis=1)
# Concatenate each of these sub-DataFrames to form the final performance table
out_df = pd.concat(tables.tolist(), ignore_index=True)
print(out_df)
Final output:
date performance
0 2022-01-30 0.1
1 2022-01-31 0.1
2 2022-02-01 0.2
3 2022-02-02 0.1
4 2022-02-03 0.3
5 2022-02-13 0.1
6 2022-02-14 0.1
7 2022-02-15 0.2
8 2022-02-16 0.1
9 2022-02-17 0.3
10 2022-02-20 0.1
11 2022-02-21 0.1
12 2022-02-22 0.2
13 2022-02-23 0.1
14 2022-02-24 0.3
15 2022-02-27 0.1
16 2022-02-28 0.1
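The same sub-tables can also be built without .apply(), by iterating over the two columns directly. This is only a sketch of an alternative under the same assumptions as above (a shortened two-row sample, the same column names), not a replacement for the steps described.

```python
import pandas as pd

# Shortened sample input (two rows only, for illustration).
df = pd.DataFrame({
    "date": pd.to_datetime(["2022/02/17", "2022/02/24"]),
    "performance": [[0.1, 0.2, 0.3], [0.1, 0.1]],
})
df["start_of_week"] = df["date"] + pd.DateOffset(days=3)

# Build one sub-DataFrame per row by pairing each start date
# with its list of performances.
tables = [
    pd.DataFrame({
        "date": pd.date_range(start, periods=len(perfs)),
        "performance": perfs,
    })
    for start, perfs in zip(df["start_of_week"], df["performance"])
]

# Join the sub-DataFrames into the final table.
out_df = pd.concat(tables, ignore_index=True)
print(out_df)
```

A plain list of DataFrames can be passed straight to pd.concat(), so no conversion step is needed.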
Full code
import pandas as pd

# --- Input data
data = {
    'date': ['2022/01/27', '2022/02/10', '2022/02/17', '2022/02/24'],
    'performance': [
        [0.1,0.1,0.2,0.1,0.3],
        [0.1,0.1,0.2,0.1,0.3],
        [0.1,0.1,0.2,0.1,0.3],
        [0.1,0.1]
    ]
}
df = pd.DataFrame(data)

# --- Convert dates to datetime
df['date'] = pd.to_datetime(df['date'])

# --- Add the date of next week
df['start_of_week'] = df['date'] + pd.DateOffset(days=3)

# --- Performance table generator
def create_performance_table(r):
    perfs = r['performance']
    dates = pd.date_range(r['start_of_week'], periods=len(perfs))
    return pd.DataFrame({"date": dates, "performance": perfs})

# --- Create the sub-tables and combine them
tables = df[['performance', 'start_of_week']].apply(create_performance_table, axis=1)

# The final output
out_df = pd.concat(tables.tolist(), ignore_index=True)