Shift dataframe values for all columns to make monotonically increasing

Question:

I have a dataframe of measures in multiple columns, that are aggregated. This means that the function they represent is a monotonically increasing one. Now, due to reset of an apparatus, all measurements are reset to zero, after which the aggregation resumes. But to work with the data, I need to disgard the reset and shift all values in all columns to mimic that th reset never occured.

Hence, I what this sistuation:

enter image description here

to become

enter image description here

What I want is a function that will shift all values in all columns to to the last measured maximum.

for some sample data, I have created this:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


date_range = pd.date_range(start='2021-01-01', end='2021-01-05', freq='1D')
df1 = pd.DataFrame({'Date': date_range, 'Column 1': range(5), 'Column 2': range(5)})

date_range = pd.date_range(start='2021-01-06', end='2021-01-10', freq='1D')
df2 = pd.DataFrame({'Date': date_range, 'Column 1': range(5), 'Column 2': range(5)})


df = pd.concat([df1,df2])

which I want to become

date_range = pd.date_range(start='2021-01-01', end='2021-01-10', freq='1D')
df3 = pd.DataFrame({'Date': date_range, 'Column 1': range(10), 'Column 2': range(10)})

I know how to do this in the case where I know that df is constructed from df1 and df2

def shift_df(df, df1, df2):

    columns = list(df.columns)
    columns.remove('Date')

    max_values = {}
    for col in columns:
        max_values[col] = df1[col].max()

    min_values = {}
    for col in columns:
        min_values[col] = df2[col].min()

    differences = {}
    for col in columns:
        differences[col] = max_values[col] - min_values[col]+1

    for col in columns:
        df[col] = np.where(df['Date'].isin(df2['Date']),
                            df[col] + differences[col],
                            df[col])

    return df

But I do not know how to generalize it if I only have the knowledge of df. Basically, how do I transform my function

shift_df(df, df1, df2)

to

shift_df(df)

Answers:

Here is one option assuming you want to handle all columns independently:

  • compute and identify the negative jumps with a diff
  • keep only those, inverse the values, add 1
  • compute a cumsum
  • add to original Series
def shifter(s):
    m = s.diff()
    return s.add((-m.sub(1)).where(m.lt(0), 0).cumsum())

out = df.set_index('Date').apply(shifter)

Output:

            Column 1  Column 2
Date                          
2021-01-01       0.0       0.0
2021-01-02       1.0       1.0
2021-01-03       2.0       2.0
2021-01-04       3.0       3.0
2021-01-05       4.0       4.0
2021-01-06       5.0       5.0
2021-01-07       6.0       6.0
2021-01-08       7.0       7.0
2021-01-09       8.0       8.0
2021-01-10       9.0       9.0

Another example:

df = pd.DataFrame({'Date': [Timestamp('2021-01-01 00:00:00'), Timestamp('2021-01-02 00:00:00'), Timestamp('2021-01-03 00:00:00'), Timestamp('2021-01-04 00:00:00'), Timestamp('2021-01-05 00:00:00'), Timestamp('2021-01-06 00:00:00'), Timestamp('2021-01-07 00:00:00'), Timestamp('2021-01-08 00:00:00'), Timestamp('2021-01-09 00:00:00'), Timestamp('2021-01-10 00:00:00')],
                   'Column 1': [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
                   'Column 2': [3, 4, 5, 6, 7, 3, 4, 5, 6, 7]}
)

out = df.set_index('Date').apply(shifter)

Output:

            Column 1  Column 2
Date                          
2021-01-01       0.0       3.0
2021-01-02       1.0       4.0
2021-01-03       2.0       5.0
2021-01-04       3.0       6.0
2021-01-05       4.0       7.0
2021-01-06       5.0       8.0
2021-01-07       6.0       9.0
2021-01-08       7.0      10.0
2021-01-09       8.0      11.0
2021-01-10       9.0      12.0
Answered By: mozway
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.