Last valid value of certain column

Question:

I have a large dataframe where I calculate means with a condition. I need to change NaN to the last valid value for that city.

I’ve tried df['Mean3big'].fillna(method='ffill', inplace=True), but then I get the wrong values since it doesn’t consider the city.

df  = pd.DataFrame([["Gothenburg", "2018", 1.5, 2.3, 107],
["Gothenburg", 2018, 1.3, 3.3, 10],
["Gothenburg", 2018, 2.2, 2.3, 20],
["Gothenburg", 2018, 1.5, 2.1, 30],
["Gothenburg", 2018, 2.5, 2.3, 20],
["Malmo", 2018, 1.6, 2.3, 10],
["Gothenburg", 2018, 1.9, 2.8, 10],
["Malmo", 2018, 0.7, 4.3, 30],
["Gothenburg", 2018, 1.7, 3.2, 40],
["Malmo", 2018, 1.0, 3.3, 40],
["Gothenburg", 2018, 3.7, 2.3, 10],
["Malmo", 2018, 1.0, 2.9, 112],
["Gothenburg", 2018, 2.7, 2.3, 20],
["Gothenburg", 2019, 1.3, 3.3, 10],
["Gothenburg", 2019, 1.2, 2.3, 20],
["Gothenburg", 2019, 1.6, 2.1, 10],
["Gothenburg", 2019, 1.8, 2.3, 10],
["Malmo", 2019, 1.6, 1.3, 20],
["Gothenburg", 2019, 1.9, 2.8, 30]])

df.columns = ['City', 'Year', 'Val1', 'Val2', 'Val3']
df["Mean3big"] = round(df.groupby(['City', "Year"])['Val3'].transform(lambda x: x.expanding().mean().shift()).where(df['Val1'] > 1.6), 2)

My result:

      City  Year  Val1  Val2  Val3  Mean3big
0   Gothenburg  2018   1.5   2.3   107       NaN
1   Gothenburg  2018   1.3   3.3    10       NaN
2   Gothenburg  2018   2.2   2.3    20     10.00
3   Gothenburg  2018   1.5   2.1    30       NaN
4   Gothenburg  2018   2.5   2.3    20     20.00
5        Malmo  2018   1.6   2.3    10       NaN
6   Gothenburg  2018   1.9   2.8    10     20.00
7        Malmo  2018   0.7   4.3    30       NaN
8   Gothenburg  2018   1.7   3.2    40     18.00
9        Malmo  2018   1.0   3.3    40       NaN
10  Gothenburg  2018   3.7   2.3    10     21.67
11       Malmo  2018   1.0   2.9   112       NaN
12  Gothenburg  2018   2.7   2.3    20     20.00
13  Gothenburg  2019   1.3   3.3    10       NaN
14  Gothenburg  2019   1.2   2.3    20       NaN
15  Gothenburg  2019   1.6   2.1    10       NaN
16  Gothenburg  2019   1.8   2.3    10     13.33
17       Malmo  2019   1.6   1.3    20       NaN
18  Gothenburg  2019   1.9   2.8    30     12.50

I want Mean3big in row 3 to take the last valid value for the city "Gothenburg", which is 10. Rows 0 and 1 are fine as NaN since there is no prior valid value.

Row 7 should be 20, which is the last valid value for "Malmo". Row 5 is fine as NaN because there are no prior valid values, and so on…

Asked By: TobiasS

||

Answers:

This does not take the last sentence of your post into account, but maybe give this a try:

import pandas as pd

df = pd.DataFrame(
    [
        ["Gothenburg", "2018", 1.5, 2.3, 107],
        ["Gothenburg", 2018, 1.3, 3.3, 10],
        ["Gothenburg", 2018, 2.2, 2.3, 20],
        ["Gothenburg", 2018, 1.5, 2.1, 30],
        ["Gothenburg", 2018, 2.5, 2.3, 20],
        ["Malmo", 2018, 1.6, 2.3, 10],
        ["Gothenburg", 2018, 1.9, 2.8, 10],
        ["Malmo", 2018, 0.7, 4.3, 30],
        ["Gothenburg", 2018, 1.7, 3.2, 40],
        ["Malmo", 2018, 1.0, 3.3, 40],
        ["Gothenburg", 2018, 3.7, 2.3, 10],
        ["Malmo", 2018, 1.0, 2.9, 112],
        ["Gothenburg", 2018, 2.7, 2.3, 20],
        ["Gothenburg", 2019, 1.3, 3.3, 10],
        ["Gothenburg", 2019, 1.2, 2.3, 20],
        ["Gothenburg", 2019, 1.6, 2.1, 10],
        ["Gothenburg", 2019, 1.8, 2.3, 10],
        ["Malmo", 2019, 1.6, 1.3, 20],
        ["Gothenburg", 2019, 1.9, 2.8, 30],
    ]
)

df.columns = ['City', 'Year', 'Val1', 'Val2', 'Val3']
df["Mean3big"] = round(
    df.groupby(['City', "Year"])['Val3']
    .transform(lambda x: x.expanding().mean().shift())
    .where(df['Val1'] > 1.6),
    2,
)
print(df)

# Last valid Mean3big seen so far for each city
valids = {}
for index, row in df.iterrows():
    if pd.isna(row['Mean3big']):
        # Fill NaN only if this city already has a valid value
        if row['City'] in valids:
            df.at[index, 'Mean3big'] = valids[row['City']]
    else:
        # Remember the most recent valid value for this city
        valids[row['City']] = row['Mean3big']

print(df)

The loop makes a single pass over the rows, so the time complexity is O(n).
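If the row-by-row loop turns out to be slow on a large dataframe, the same per-city fill can probably be done without iterating, by grouping on the city and forward-filling within each group. The following is only a minimal sketch: the helper name `ffill_per_city` and the small `demo` frame are made up for illustration and are not the data from the question.

```python
import numpy as np
import pandas as pd


def ffill_per_city(frame, group_col="City", value_col="Mean3big"):
    """Return value_col with each NaN replaced by the last valid value
    seen earlier in the same group (hypothetical helper for illustration)."""
    return frame.groupby(group_col)[value_col].ffill()


# Small made-up frame, not the data from the question.
demo = pd.DataFrame(
    {
        "City": ["Gothenburg", "Malmo", "Gothenburg", "Malmo", "Malmo", "Gothenburg"],
        "Mean3big": [np.nan, 5.0, 10.0, np.nan, np.nan, np.nan],
    }
)

# Rows with no earlier valid value for their city stay NaN.
demo["Filled"] = ffill_per_city(demo)
print(demo)
```

Applied to the question's dataframe, this would be `df["Mean3big"] = df.groupby("City")["Mean3big"].ffill()`. Grouping only by city (not by year) matches what the loop does, since `valids` is keyed on the city alone.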

Answered By: Perplexabot