How can I convert cumulative values to incremental values to a pandas dataframe with null values?

Question:

I have a pandas dataframe which has cumulative values like this:

import pandas as pd
import numpy as np

data = {'Date':['2017-01-27','2017-01-28','2017-01-29','2017-01-30','2017-01-31','2017-02-01','2017-02-02','2017-02-03','2017-02-04','2017-02-05'],
'Kentucky':[1,4,5,7,np.NaN,np.NaN,10,np.NaN,12,13],
'New York':[np.NaN,2,np.NaN,7,np.NaN,np.NaN,np.NaN,np.NaN,9,np.NaN],
'California':[np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,1,np.NaN,2,np.NaN],
'Maine':[np.NaN,2,3,7,np.NaN,np.NaN,10,np.NaN,12,13]}

df = pd.DataFrame(data)

enter image description here

How can I convert the cumulative values to incremental values if the dataframe has null values? I have tried incremental values conversion solution but it does not produce the desired result.

Asked By: Mazil_tov998

||

Answers:

First convert to a data frame:

data = pd.DataFrame(data)

Next use fillna’s forward fill to account for the NaN issue. Below shows the code for Maine, but could be applied to the other states:

data['Maine_Inc_Values'] = data['Maine'].fillna(method='ffill').diff()

Then use .loc to correct for the first incremental value (which returns a NaN and not 2):

data.loc[data['Maine_Inc_Values'].isnull(), 'Maine_Inc_Values'] = data['Maine']

Results in:

enter image description here

Answered By: Brian

You can use the .interpolate() method to fill up NaN based on surrounding values, then use .diff() to get increments. This assumes the quantities change smoothly over time.

Taking back your example with date-based indexing (last line added):

import pandas as pd
import numpy as np

data = {'Date':['2017-01-27','2017-01-28','2017-01-29','2017-01-30','2017-01-31','2017-02-01','2017-02-02','2017-02-03','2017-02-04','2017-02-05'],
'Kentucky':[1,4,5,7,np.NaN,np.NaN,10,np.NaN,12,13],
'New York':[np.NaN,2,np.NaN,7,np.NaN,np.NaN,np.NaN,np.NaN,9,np.NaN],
'California':[np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,1,np.NaN,2,np.NaN],
'Maine':[np.NaN,2,3,7,np.NaN,np.NaN,10,np.NaN,12,13]}

# Turn to dataframe
df = pd.DataFrame(data)

# Set up dates as index to operate on similar-type columns
df = df.set_index("Date")

Proposed solution

df.interpolate().diff().fillna(0)

Output

            Kentucky  New York  California  Maine
Date                                             
2017-01-27       0.0       0.0         0.0    0.0
2017-01-28       3.0       0.0         0.0    0.0
2017-01-29       1.0       2.5         0.0    1.0
2017-01-30       2.0       2.5         0.0    4.0
2017-01-31       1.0       0.4         0.0    1.0
2017-02-01       1.0       0.4         0.0    1.0
2017-02-02       1.0       0.4         0.0    1.0
2017-02-03       1.0       0.4         0.5    1.0
2017-02-04       1.0       0.4         0.5    1.0
2017-02-05       1.0       0.0         0.0    1.0
Answered By: Leonard