Convert different length list in pandas dataframe to row in one column

Question:

I have a table like this in pandas. The date is always a Friday, but the dates may not be continuous because of holidays or other reasons. The Performance column holds a list with the performance of each day of the following week. The list in the last row can have fewer than 5 entries: today is Wednesday, so for this week I only have Monday and Tuesday data:

| Date         | Performance          |
| 2022/01/27   | [0.1,0.1,0.2,0.1,0.3]|
| 2022/02/10   | [0.1,0.1,0.2,0.1,0.3]|
| 2022/02/17   | [0.1,0.1,0.2,0.1,0.3]|
| 2022/02/24   | [0.1,0.1]            |

I want to convert this table to a date/performance 2d table with the date of the actual performance day and the performance of each day:

| Date         | Performance |
| 2022/01/30   | 0.1         |
| 2022/01/31   | 0.1         |
| 2022/02/01   | 0.2         |
| 2022/02/02   | 0.1         |
| 2022/02/03   | 0.3         |
| 2022/02/13   | 0.1         |
| 2022/02/14   | 0.1         |
| 2022/02/15   | 0.2         |
| ...          | ...         |
| 2022/02/27   | 0.1         |
| 2022/02/28   | 0.1         |

How can I do this in Python?

I tried using sum on the lists to join them into a single 1d list, but then it is hard to attach the result to the date column.

Asked By: Yiwei

Answers:

import pandas as pd

# create the input DataFrame
df = pd.DataFrame({'Date': ['2022/01/27', '2022/02/10', '2022/02/17', '2022/02/24'],
                   'Performance': [[0.1, 0.1, 0.2, 0.1, 0.3], [0.1, 0.1, 0.2, 0.1, 0.3], [0.1, 0.1, 0.2, 0.1, 0.3], [0.1, 0.1]]})

# collect one record per (date, performance value) pair
rows = []

# iterate through each row of the input DataFrame
for i in range(len(df)):
    date = pd.to_datetime(df['Date'][i])
    performance = df['Performance'][i]

    # use at most 5 values per row (one per weekday)
    n = min(len(performance), 5)

    # add one record per performance value, advancing the date by one day each time
    for j in range(n):
        rows.append({'Date': date + pd.DateOffset(days=j), 'Performance': performance[j]})

# build the result DataFrame once from the collected records
# (DataFrame.append is not available in pandas 2.0 and later)
result_df = pd.DataFrame(rows, columns=['Date', 'Performance'])

# print the final result DataFrame
print(result_df)

The output looks like this:

         Date  Performance
0  2022-01-27          0.1
1  2022-01-28          0.1
2  2022-01-29          0.2
3  2022-01-30          0.1
4  2022-01-31          0.3
5  2022-02-10          0.1
6  2022-02-11          0.1
7  2022-02-12          0.2
8  2022-02-13          0.1
9  2022-02-14          0.3
10 2022-02-17          0.1
11 2022-02-18          0.1
12 2022-02-19          0.2
13 2022-02-20          0.1
14 2022-02-21          0.3
15 2022-02-24          0.1
16 2022-02-25          0.1
Answered By: Rajender Kumar

Here is an approach using df.explode() and df.groupby().cumcount()

df = df.explode('Performance')
df['Date'] = pd.to_datetime(df['Date']) + pd.to_timedelta(
             df.groupby(level=0).cumcount(), unit='D')

df = df.reset_index(drop=True) 
print(df)


         Date Performance
0  2022-01-27         0.1
1  2022-01-28         0.1
2  2022-01-29         0.2
3  2022-01-30         0.1
4  2022-01-31         0.3
5  2022-02-10         0.1
6  2022-02-11         0.1
7  2022-02-12         0.2
8  2022-02-13         0.1
9  2022-02-14         0.3
10 2022-02-17         0.1
11 2022-02-18         0.1
12 2022-02-19         0.2
13 2022-02-20         0.1
14 2022-02-21         0.3
15 2022-02-24         0.1
16 2022-02-25         0.1
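
To see why this works, it may help to look at the intermediate values. The sketch below uses a shortened copy of the sample data, and the names exploded and day_index are only for illustration. It shows that explode() repeats the original index label for every list element, so groupby(level=0).cumcount() numbers the elements 0, 1, 2, ... within each original row:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2022/02/17', '2022/02/24'],
    'Performance': [[0.1, 0.2, 0.3], [0.1, 0.1]],
})

# explode() gives each list element its own row and keeps the original index label
exploded = df.explode('Performance')
print(exploded.index.tolist())   # [0, 0, 0, 1, 1]

# cumcount() numbers the rows within each original index label, starting at 0
day_index = exploded.groupby(level=0).cumcount()
print(day_index.tolist())        # [0, 1, 2, 0, 1]
```

Converting day_index to a timedelta in days and adding it to the date is what spreads each list over consecutive days.
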
Answered By: Jamiu S.

From what I understand about your description of the DataFrame, its columns represent the following:

  • date: contains dates, all of which are described as Fridays (not necessarily consecutive).

  • performance: contains lists of performances for consecutive days of the following week (from Monday up to at most Friday), with the first value falling 3 days after the value in date.

And the problem is how to form a DataFrame that has each performance and its corresponding date on a separate row.
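
As a quick sanity check on that reading, the weekday names can be inspected with Series.dt.day_name(). This is only a sketch on the sample dates; the 3-day offset is taken from the expected output in the question. Note that in the 2022 calendar the sample dates as written fall on Thursdays, so adding 3 days lands on a Sunday. With real Friday dates the same offset gives a Monday.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2022/01/27', '2022/02/10', '2022/02/17', '2022/02/24']))

# Weekday of each date as written in the question
print(dates.dt.day_name().tolist())     # ['Thursday', 'Thursday', 'Thursday', 'Thursday']

# Weekday after applying the 3-day offset used in the expected output
shifted = dates + pd.Timedelta(days=3)
print(shifted.dt.day_name().tolist())   # ['Sunday', 'Sunday', 'Sunday', 'Sunday']
```
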


Input data

import pandas as pd

data = {
    'date': ['2022/01/27', '2022/02/10', '2022/02/17', '2022/02/24'],
    'performance': [
        [0.1,0.1,0.2,0.1,0.3],
        [0.1,0.1,0.2,0.1,0.3],
        [0.1,0.1,0.2,0.1,0.3],
        [0.1,0.1]
    ]
}

df = pd.DataFrame(data)

print(df)
         date                performance
0  2022/01/27  [0.1, 0.1, 0.2, 0.1, 0.3]
1  2022/02/10  [0.1, 0.1, 0.2, 0.1, 0.3]
2  2022/02/17  [0.1, 0.1, 0.2, 0.1, 0.3]
3  2022/02/24                 [0.1, 0.1]

Simple solution

Jamiu S. provided a much more compact solution than my original one, so I've included it here first, with the addition of pd.DateOffset() to fully answer the question.

df = df.explode('performance')

# The column is named 'date' (lowercase) in this DataFrame
df['date'] = pd.to_datetime(df['date']) + pd.DateOffset(days=3) + pd.to_timedelta(
             df.groupby(level=0).cumcount(), unit='D')

df = df.reset_index(drop=True)
print(df)

Output:

         date performance
0  2022-01-30         0.1
1  2022-01-31         0.1
2  2022-02-01         0.2
3  2022-02-02         0.1
4  2022-02-03         0.3
5  2022-02-13         0.1
6  2022-02-14         0.1
7  2022-02-15         0.2
8  2022-02-16         0.1
9  2022-02-17         0.3
10 2022-02-20         0.1
11 2022-02-21         0.1
12 2022-02-22         0.2
13 2022-02-23         0.1
14 2022-02-24         0.3
15 2022-02-27         0.1
16 2022-02-28         0.1
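
If this is needed in more than one place, the same idea can be wrapped in a small function. The following is a minimal sketch, not part of the original answer: the function name expand_weekly and its parameters (date_col, value_col, offset_days) are hypothetical, and the default of 3 days is simply the offset used above.

```python
import pandas as pd

def expand_weekly(df, date_col='date', value_col='performance', offset_days=3):
    """Hypothetical helper: return one row per list element, dated
    offset_days + position-in-list days after the original date."""
    out = df.explode(value_col)                  # new DataFrame, input is not modified
    position = out.groupby(level=0).cumcount()   # 0, 1, 2, ... within each original row
    out[date_col] = (pd.to_datetime(out[date_col])
                     + pd.to_timedelta(position + offset_days, unit='D'))
    return out.reset_index(drop=True)

sample = pd.DataFrame({
    'date': ['2022/02/17', '2022/02/24'],
    'performance': [[0.1, 0.1, 0.2, 0.1, 0.3], [0.1, 0.1]],
})
result = expand_weekly(sample)
print(result)
```

One thing to keep in mind: explode() turns an empty list into a single row containing NaN, so rows with no data yet may need to be filtered out first.
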

Original solution

Consider the following steps:

Step 1: Converting dates to datetime

If this has not been done already, ensure the date values are represented as datetime objects rather than strings. The pd.to_datetime() function can be used to accomplish this.

# Convert the date column to a datetime object, so it can be manipulated later.
df['date'] = pd.to_datetime(df['date'])

print(df)
        date                performance
0 2022-01-27  [0.1, 0.1, 0.2, 0.1, 0.3]
1 2022-02-10  [0.1, 0.1, 0.2, 0.1, 0.3]
2 2022-02-17  [0.1, 0.1, 0.2, 0.1, 0.3]
3 2022-02-24                 [0.1, 0.1]

Output of df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         4 non-null      datetime64[ns]
 1   performance  4 non-null      object        
dtypes: datetime64[ns](1), object(1)
memory usage: 192.0+ bytes
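
As a side note, the date layout can also be stated explicitly through the format parameter of pd.to_datetime(), which avoids any guessing about day/month order. A small sketch, where raw and parsed are illustrative names:

```python
import pandas as pd

raw = pd.Series(['2022/01/27', '2022/02/10'])

# Explicit format: four-digit year, month, day, separated by slashes
parsed = pd.to_datetime(raw, format='%Y/%m/%d')

print(parsed.dt.year.tolist())    # [2022, 2022]
print(parsed.dt.month.tolist())   # [1, 2]
print(parsed.dt.day.tolist())     # [27, 10]
```
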

Step 2: Adding the date of next week

Add a new column 'start_of_week', representing the Monday of the next week (3 days after Friday).

To calculate these dates, pd.DateOffset() can be used to advance the original dates by a certain number of days.

# Create a column representing the start of the next week (Monday) - 3 days after the current date (Friday)
df['start_of_week'] = df['date'] + pd.DateOffset(days=3)

print(df)
        date                performance start_of_week
0 2022-01-27  [0.1, 0.1, 0.2, 0.1, 0.3]    2022-01-30
1 2022-02-10  [0.1, 0.1, 0.2, 0.1, 0.3]    2022-02-13
2 2022-02-17  [0.1, 0.1, 0.2, 0.1, 0.3]    2022-02-20
3 2022-02-24                 [0.1, 0.1]    2022-02-27
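
If the intent is "the Monday following each date" rather than "exactly 3 days later", one possible alternative is the pd.offsets.Week offset with weekday=0. This is an assumption about the intent, not something the question states, and it gives different results from the fixed 3-day offset whenever the input date is not a Friday. A sketch:

```python
import pandas as pd

# 2022-01-27 is a Thursday and 2022-01-28 is a Friday
dates = pd.to_datetime(pd.Series(['2022/01/27', '2022/01/28']))

# Week(weekday=0) moves each date forward to the next Monday
next_monday = dates + pd.offsets.Week(weekday=0)
print(next_monday.dt.strftime('%Y-%m-%d').tolist())   # ['2022-01-31', '2022-01-31']
```

A date that is already a Monday would be moved a full week ahead by this offset, so it only suits inputs that are known not to be Mondays.
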

Step 3: Creating a performance table generator

Create a function that can be applied to each row, to form a two-dimensional "performance table" out of it.

The pd.date_range() function can be used to form a sequence of consecutive dates corresponding to each performance value.

# Generates a sub-DataFrame out of a row containing a week-date and performances.
def create_performance_table(r):
    
    # Extract the list of performance values.
    perfs = r['performance']
    
    # Construct the range of dates corresponding to each of these performances
    dates = pd.date_range(r['start_of_week'], periods = len(perfs))

    # Create a DataFrame out of these values and return it.
    return pd.DataFrame({"date": dates, "performance": perfs})
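
For reference, here is a small standalone sketch of how pd.date_range() behaves with the periods argument. The start dates are arbitrary examples. By default the range steps one calendar day at a time, while freq='B' steps over business days only, which may be useful if a list should skip a weekend:

```python
import pandas as pd

# Two consecutive calendar days starting on 2022-02-27 (default frequency is daily)
dates = pd.date_range('2022-02-27', periods=2)
print(dates.strftime('%Y-%m-%d').tolist())      # ['2022-02-27', '2022-02-28']

# freq='B' restricts the range to business days (Monday to Friday);
# 2022-02-25 is a Friday, so the weekend is skipped
business = pd.date_range('2022-02-25', periods=3, freq='B')
print(business.strftime('%Y-%m-%d').tolist())   # ['2022-02-25', '2022-02-28', '2022-03-01']
```
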

Step 4: Creating the sub-tables and combining them

Use the newly defined create_performance_table() function to construct the DataFrame representing the whole performance table.

  • The .apply() method with axis=1 applies the function to each row of the DataFrame and collects the results.

  • Since the resulting sub-tables are returned as a single Series of DataFrames, they need to be joined together to form a single DataFrame. The pd.concat() function can do just that (after the Series has been converted to a list).

# Apply the performance table generator to every row, storing the results as a Series of sub-DataFrames.
tables = df[['performance', 'start_of_week']].apply(create_performance_table, axis=1)

# Concatenate each of these sub-DataFrames to form the final performance table
out_df = pd.concat(tables.tolist(), ignore_index=True)

print(out_df)

Final output:

         date  performance
0  2022-01-30          0.1
1  2022-01-31          0.1
2  2022-02-01          0.2
3  2022-02-02          0.1
4  2022-02-03          0.3
5  2022-02-13          0.1
6  2022-02-14          0.1
7  2022-02-15          0.2
8  2022-02-16          0.1
9  2022-02-17          0.3
10 2022-02-20          0.1
11 2022-02-21          0.1
12 2022-02-22          0.2
13 2022-02-23          0.1
14 2022-02-24          0.3
15 2022-02-27          0.1
16 2022-02-28          0.1
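
The same combination step can also be written without .apply(), by building the list of sub-tables directly with a list comprehension. This is a sketch of an equivalent formulation on a shortened copy of the data, not a change to the method above:

```python
import pandas as pd

df = pd.DataFrame({
    'start_of_week': pd.to_datetime(['2022-02-20', '2022-02-27']),
    'performance': [[0.1, 0.1, 0.2, 0.1, 0.3], [0.1, 0.1]],
})

# Build one small DataFrame per input row
tables = [
    pd.DataFrame({'date': pd.date_range(start, periods=len(perfs)), 'performance': perfs})
    for start, perfs in zip(df['start_of_week'], df['performance'])
]

# Stack the sub-tables and renumber the rows 0..n-1
out_df = pd.concat(tables, ignore_index=True)
print(out_df)
```

Because tables is already a plain list here, no .tolist() conversion is needed before calling pd.concat().
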

Full code

import pandas as pd

# --- Input data

data = {
    'date': ['2022/01/27', '2022/02/10', '2022/02/17', '2022/02/24'],
    'performance': [
        [0.1,0.1,0.2,0.1,0.3],
        [0.1,0.1,0.2,0.1,0.3],
        [0.1,0.1,0.2,0.1,0.3],
        [0.1,0.1]
    ]
}

df = pd.DataFrame(data)


# --- Convert dates to datetime

df['date'] = pd.to_datetime(df['date'])


# --- Add the date of next week

df['start_of_week'] = df['date'] + pd.DateOffset(days=3)


# --- Performance table generator

def create_performance_table(r):
    
    perfs = r['performance']
    
    dates = pd.date_range(r['start_of_week'], periods = len(perfs))

    return pd.DataFrame({"date": dates, "performance": perfs})


# --- Create the sub-tables and combine them

tables = df[['performance', 'start_of_week']].apply(create_performance_table, axis=1)

# The final output
out_df = pd.concat(tables.tolist(), ignore_index=True)
Answered By: user21283023