Last valid value of certain column
Question:
I have a large dataframe where I calculate means with a condition. I need to change NaN to the last valid value for that city.
I’ve tried df['Mean3big'].fillna(method='ffill', inplace=True), but then I get the wrong values since it doesn’t consider the city.
df = pd.DataFrame([["Gothenburg", "2018", 1.5, 2.3, 107],
["Gothenburg", 2018, 1.3, 3.3, 10],
["Gothenburg", 2018, 2.2, 2.3, 20],
["Gothenburg", 2018, 1.5, 2.1, 30],
["Gothenburg", 2018, 2.5, 2.3, 20],
["Malmo", 2018, 1.6, 2.3, 10],
["Gothenburg", 2018, 1.9, 2.8, 10],
["Malmo", 2018, 0.7, 4.3, 30],
["Gothenburg", 2018, 1.7, 3.2, 40],
["Malmo", 2018, 1.0, 3.3, 40],
["Gothenburg", 2018, 3.7, 2.3, 10],
["Malmo", 2018, 1.0, 2.9, 112],
["Gothenburg", 2018, 2.7, 2.3, 20],
["Gothenburg", 2019, 1.3, 3.3, 10],
["Gothenburg", 2019, 1.2, 2.3, 20],
["Gothenburg", 2019, 1.6, 2.1, 10],
["Gothenburg", 2019, 1.8, 2.3, 10],
["Malmo", 2019, 1.6, 1.3, 20],
["Gothenburg", 2019, 1.9, 2.8, 30]])
df.columns = ['City', 'Year', 'Val1', 'Val2', 'Val3']
df["Mean3big"] = round(df.groupby(['City', "Year"])['Val3'].transform(lambda x: x.expanding().mean().shift()).where(df['Val1'] > 1.6), 2)
My result:
City Year Val1 Val2 Val3 Mean3big
0 Gothenburg 2018 1.5 2.3 107 NaN
1 Gothenburg 2018 1.3 3.3 10 NaN
2 Gothenburg 2018 2.2 2.3 20 10.00
3 Gothenburg 2018 1.5 2.1 30 NaN
4 Gothenburg 2018 2.5 2.3 20 20.00
5 Malmo 2018 1.6 2.3 10 NaN
6 Gothenburg 2018 1.9 2.8 10 20.00
7 Malmo 2018 0.7 4.3 30 NaN
8 Gothenburg 2018 1.7 3.2 40 18.00
9 Malmo 2018 1.0 3.3 40 NaN
10 Gothenburg 2018 3.7 2.3 10 21.67
11 Malmo 2018 1.0 2.9 112 NaN
12 Gothenburg 2018 2.7 2.3 20 20.00
13 Gothenburg 2019 1.3 3.3 10 NaN
14 Gothenburg 2019 1.2 2.3 20 NaN
15 Gothenburg 2019 1.6 2.1 10 NaN
16 Gothenburg 2019 1.8 2.3 10 13.33
17 Malmo 2019 1.6 1.3 20 NaN
18 Gothenburg 2019 1.9 2.8 30 12.50
I want Mean3big in row 3 to hold the last valid value for city "Gothenburg", which is 10. Rows 0 and 1 are fine as NaN since there is no prior valid value.
Row 7 should be 20, which is the last valid value for "Malmo". Row 5 is fine as NaN because there are no prior valid values, and so on…
Answers:
This doesn't take the last sentence of your post into account, but maybe give this a try:
import pandas as pd
df = pd.DataFrame(
    [
        ["Gothenburg", "2018", 1.5, 2.3, 107],
        ["Gothenburg", 2018, 1.3, 3.3, 10],
        ["Gothenburg", 2018, 2.2, 2.3, 20],
        ["Gothenburg", 2018, 1.5, 2.1, 30],
        ["Gothenburg", 2018, 2.5, 2.3, 20],
        ["Malmo", 2018, 1.6, 2.3, 10],
        ["Gothenburg", 2018, 1.9, 2.8, 10],
        ["Malmo", 2018, 0.7, 4.3, 30],
        ["Gothenburg", 2018, 1.7, 3.2, 40],
        ["Malmo", 2018, 1.0, 3.3, 40],
        ["Gothenburg", 2018, 3.7, 2.3, 10],
        ["Malmo", 2018, 1.0, 2.9, 112],
        ["Gothenburg", 2018, 2.7, 2.3, 20],
        ["Gothenburg", 2019, 1.3, 3.3, 10],
        ["Gothenburg", 2019, 1.2, 2.3, 20],
        ["Gothenburg", 2019, 1.6, 2.1, 10],
        ["Gothenburg", 2019, 1.8, 2.3, 10],
        ["Malmo", 2019, 1.6, 1.3, 20],
        ["Gothenburg", 2019, 1.9, 2.8, 30],
    ]
)
df.columns = ['City', 'Year', 'Val1', 'Val2', 'Val3']
df["Mean3big"] = round(
    df.groupby(['City', "Year"])['Val3']
    .transform(lambda x: x.expanding().mean().shift())
    .where(df['Val1'] > 1.6),
    2,
)
print(df)
valids = {}
for index, row in df.iterrows():
    # NaN is the only value that is not equal to itself; pd.isna() would also work here
    if row['Mean3big'] != row['Mean3big']:
        # fill only if this city already has a valid value
        if row['City'] in valids:
            df.at[index, 'Mean3big'] = valids[row['City']]
    else:
        # remember the most recent valid value for this city
        valids[row['City']] = row['Mean3big']
print(df)
The loop makes a single pass over the rows, so the time complexity is O(n).
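If the row-by-row loop turns out to be slow on a large dataframe, the same per-city fill can probably be done without iterating, by forward-filling inside a groupby on City. This is only a sketch on a small made-up frame (the name demo and its values are hypothetical, not the data from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same two columns that matter here.
demo = pd.DataFrame(
    {
        "City": ["Gothenburg", "Gothenburg", "Malmo", "Gothenburg", "Malmo", "Malmo"],
        "Mean3big": [np.nan, 10.0, np.nan, np.nan, 20.0, np.nan],
    }
)

# Forward-fill within each city only. NaNs that come before the first
# valid value of a city have nothing to copy from, so they stay NaN.
demo["Mean3big"] = demo.groupby("City")["Mean3big"].ffill()
print(demo)
```

Grouping by City alone (not City and Year) means a value can carry over from one year to the next for the same city, which is also how the loop above behaves.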