Tricky Reverse Aggregate values to unique rows per category in Pandas

Question:

I have a dataset where I would like to de aggregate the values into their own unique rows as well as perform a pivot and grouping by category.

Data

Date         start      end         area    BB_stat AA_stat BB_test AA_test final   
10/1/2022   11/1/2022   12/1/2022   NY      10      80      0       1       1/1/2022    
11/1/2022   12/1/2022   01/1/2023   NY      5       90      1       0       1/1/2022    
10/1/2022   11/1/2022   12/1/2022   CA      6       100     3       1       1/1/2022    
11/1/2022   12/1/2022   01/1/2023   CA      7       0       2       8       1/1/2022    
                                
                                
                                
                                
                                

Desired

#create a new column by string transformation

Date        start       end         type    area    stat    test    final       
10/1/2022   11/1/2022   12/1/2022   BB      NY      10      0       1/1/2022        
11/1/2022   12/1/2022   01/1/2023   BB      NY      5       1       1/1/2022        
10/1/2022   11/1/2022   12/1/2022   AA      NY      80      1       1/1/2022        
11/1/2022   12/1/2022   01/1/2023   AA      NY      90      0       1/1/2022        
10/1/2022   11/1/2022   12/1/2022   BB      CA      6       3       1/1/2022        
11/1/2022   12/1/2022   01/1/2023   BB      CA      7       2       1/1/2022        
10/1/2022   11/1/2022   12/1/2022   AA      CA      100     1       1/1/2022        
11/1/2022   12/1/2022   01/1/2023   AA      CA      0       8       1/1/2022

Doing

#some help from previous SO post/member

df = df.set_index(["Date", "start", "end"])
new_df = pd.concat([pd.Series(c, index=df.index.repeat(df[c]))
                    for c in df]).reset_index(name="type")

# then sort values
new_df = new_df.sort_values(["Date", "start", "end"], ignore_index=True) 

Any suggestion is appreciated

Asked By: Lynn

||

Answers:

In the code, the value_vars variable is defined as a list of column names that contain the stat and test values. The melt function is then used to unpivot the DataFrame from wide format to long format, using the list of column names in the value_vars variable as the values to unpivot. This results in a new DataFrame where each row contains a single stat or test value.

Next, the str.split method is used to split the values in the "variable" column at the underscore character, resulting in two new columns called "type" and "name". The "variable" column is then dropped.

The resulting DataFrame is split into two half-sized DataFrames, and the two halves are merged together on the common columns, using suffixes to distinguish the columns from each half. The merged DataFrame is then renamed to have "stat" and "test" columns instead of the "value_1" and "value_2" columns. The unnecessary "name_1" and "name_2" columns are also dropped, and the columns are rearranged to have the desired order.

This results in a DataFrame that has the stat and test values in their own unique rows, grouped by category.

df["CC_test"] = [10,20,30,40]
df["CC_stat"] = [20,30,40,50]

value_vars = ["BB_stat", "AA_stat", "BB_test", "AA_test", "CC_stat", "CC_test"]
df = df.melt(id_vars=["Date", "start", "end", "area", "final"], value_vars=value_vars)

temp_df = df.variable.str.split("_", 1, expand=True)
df["type"] = temp_df[0]
df["name"] = temp_df[1]
df = df.drop(columns=["variable"])

grouping = df.groupby("name")
new_df = grouping.get_group("stat").rename(columns = {'name':'name_0'})
for i, (_, sub_df) in enumerate(grouping):
    new_df = pd.merge(new_df, sub_df, on=["Date", "start", "end", "area", "final", "type"], suffixes=("", f"_{i}"))

new_df.rename(columns = {f'value_{i}': f'{new_df["name_"+str(i)].iloc[0]}' for i in range(len(grouping))}), inplace = True)
new_df.drop(columns=[f"name_{i}" for i in range(len(grouping))] + ["name"], inplace=True)
new_df = new_df[["Date", "start", "end", "type", "area", "stat", "test","final"]]

new_df.sort_values(["area", "type"], ascending=False, inplace=True)
new_df.to_markdown()

Outputs:

|    | Date      | start     | end       | type   | area   |   stat |   test | final    |
|---:|:----------|:----------|:----------|:-------|:-------|-------:|-------:|:---------|
|  8 | 10/1/2022 | 11/1/2022 | 12/1/2022 | CC     | NY     |     20 |     10 | 1/1/2022 |
|  9 | 11/1/2022 | 12/1/2022 | 01/1/2023 | CC     | NY     |     30 |     20 | 1/1/2022 |
|  0 | 10/1/2022 | 11/1/2022 | 12/1/2022 | BB     | NY     |     10 |      0 | 1/1/2022 |
|  1 | 11/1/2022 | 12/1/2022 | 01/1/2023 | BB     | NY     |      5 |      1 | 1/1/2022 |
|  4 | 10/1/2022 | 11/1/2022 | 12/1/2022 | AA     | NY     |     80 |      1 | 1/1/2022 |
|  5 | 11/1/2022 | 12/1/2022 | 01/1/2023 | AA     | NY     |     90 |      0 | 1/1/2022 |
| 10 | 10/1/2022 | 11/1/2022 | 12/1/2022 | CC     | CA     |     40 |     30 | 1/1/2022 |
| 11 | 11/1/2022 | 12/1/2022 | 01/1/2023 | CC     | CA     |     50 |     40 | 1/1/2022 |
|  2 | 10/1/2022 | 11/1/2022 | 12/1/2022 | BB     | CA     |      6 |      3 | 1/1/2022 |
|  3 | 11/1/2022 | 12/1/2022 | 01/1/2023 | BB     | CA     |      7 |      2 | 1/1/2022 |
|  6 | 10/1/2022 | 11/1/2022 | 12/1/2022 | AA     | CA     |    100 |      1 | 1/1/2022 |
|  7 | 11/1/2022 | 12/1/2022 | 01/1/2023 | AA     | CA     |      0 |      8 | 1/1/2022 |

What Changed:
First, two new columns called "CC_test" and "CC_stat" are added to the DataFrame.

Next, the list of columns to unpivot is updated to include the new "CC_test" and "CC_stat" columns. The DataFrame is then melted using this updated list of columns.

Next, the DataFrame is grouped by the "name" column. From the grouping, the subgroup with "stat" as values is selected as a "template" and the resulting DataFrame is renamed to have a "name_0" column. This DataFrame is then merged with each of the other groups in the grouping, using the common columns as the merge keys.

From the resulting DataFrame each of unnecessary "name_i" columns are dropped.

Answered By: woblob
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.