Set value based on previous value in previous group if it exists
Question:
Say I have this:
import numpy
import pandas

df = pandas.DataFrame(
[ dict(a=75, b=numpy.nan, d='2023-01-01 00:00')
, dict(a=82, b=numpy.nan, d='2023-01-01 10:00')
, dict(a=39, b=numpy.nan, d='2023-01-01 20:00')
, dict(a=10, b=82 , d='2023-01-05 00:00')
, dict(a=90, b=82 , d='2023-01-05 20:00')
, dict(a=61, b=numpy.nan, d='2023-02-08 00:00')
, dict(a=35, b=numpy.nan, d='2023-02-08 10:00')
, dict(a=95, b=numpy.nan, d='2023-02-08 20:00')
, dict(a=21, b=35 , d='2023-04-15 00:00')
, dict(a=60, b=35 , d='2023-04-15 10:00')
])
df['d'] = pandas.to_datetime(df['d'])
df = df.set_index('d')
print(df)
which outputs:
a b
d
2023-01-01 00:00:00 75 NaN
2023-01-01 10:00:00 82 NaN
2023-01-01 20:00:00 39 NaN
2023-01-05 00:00:00 10 82.0
2023-01-05 20:00:00 90 82.0
2023-02-08 00:00:00 61 NaN
2023-02-08 10:00:00 35 NaN
2023-02-08 20:00:00 95 NaN
2023-04-15 00:00:00 21 35.0
2023-04-15 10:00:00 60 35.0
In real life, I only have column a and my desired output is in column b.
Here, b equals the value in a from the previous available date at 10:00. Dates are not necessarily consecutive. Value at 10:00 may not exist for the previous available date, in which case b should be NaN.
Logically, I’d solve this by grouping by date and extracting the value from the previous group.
Without resorting to iterating over (previous group, group) tuples or something of the sort, can that be done with pandas?
More generally, are there any pandas idioms to deal with these "look up value from the previous group" situations?
I’ll be adding edits here as answers come in, to show additional info that doesn’t fit nicely in a comment.
For https://stackoverflow.com/a/75599866/3821009
df['c'] = df.groupby(df.index.date)['a'].shift()
print(df)
produces:
a b c
d
2023-01-01 00:00:00 75 NaN NaN
2023-01-01 10:00:00 82 NaN 75.0
2023-01-01 20:00:00 39 NaN 82.0
2023-01-05 00:00:00 10 82.0 NaN
2023-01-05 20:00:00 90 82.0 10.0
2023-02-08 00:00:00 61 NaN NaN
2023-02-08 10:00:00 35 NaN 61.0
2023-02-08 20:00:00 95 NaN 35.0
2023-04-15 00:00:00 21 35.0 NaN
2023-04-15 10:00:00 60 35.0 21.0
so that’s not what I’m looking for.
Answers:
Yes, I believe you can use the groupby() method along with the shift() method to accomplish this.
You could do something like,
df['b'] = df.groupby(df.index.date)['a'].shift()
This code splits the rows into groups by calendar date. Within each group, it moves the values in the ‘a’ column down by one row.
As a result, each row of ‘b’ holds the value of ‘a’ from the preceding row on the same date (NaN for the first row of each date). Note that the shift happens within a date, so it does not pull a value from the previous date’s group; the output in the question’s edit shows this.
The general idea is:
- Get the value where the time is 10:00
- Get the date group id
- If the index is sorted by time, the current group id is just 1 greater than the previous one
- Map the previous group's 10:00 value to the current group by group id
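# values of 'a' on the rows whose time of day is 10:00
time = df.loc[df.index.time == pandas.to_datetime('10:00:00').time(), 'a']
# consecutive integer id for each date group, in date order
gid = df.groupby(df.index.date).ngroup()
# key each 10:00 value by the id of the following group, then look it up for every row
df['c'] = gid.map(dict(zip(time.index.map(gid)+1, time)))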
$ print(time)
d
2023-01-01 10:00:00 82
2023-02-08 10:00:00 35
2023-04-15 10:00:00 60
Name: a, dtype: int64
$ print(gid)
d
2023-01-01 00:00:00 0
2023-01-01 10:00:00 0
2023-01-01 20:00:00 0
2023-01-05 00:00:00 1
2023-01-05 20:00:00 1
2023-02-08 00:00:00 2
2023-02-08 10:00:00 2
2023-02-08 20:00:00 2
2023-04-15 00:00:00 3
2023-04-15 10:00:00 3
dtype: int64
$ print(df)
a b c
d
2023-01-01 00:00:00 75 NaN NaN
2023-01-01 10:00:00 82 NaN NaN
2023-01-01 20:00:00 39 NaN NaN
2023-01-05 00:00:00 10 82.0 82.0
2023-01-05 20:00:00 90 82.0 82.0
2023-02-08 00:00:00 61 NaN NaN
2023-02-08 10:00:00 35 NaN NaN
2023-02-08 20:00:00 95 NaN NaN
2023-04-15 00:00:00 21 35.0 35.0
2023-04-15 10:00:00 60 35.0 35.0
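For the more general "look up a value from the previous group" case, one option is to reduce each group to a single value, shift that one-row-per-group series by one position, and broadcast it back onto the rows. The following is only a minimal sketch of that idea, not the method from either answer. The helper name `previous_group_value` and its `column` and `at` parameters are made up for illustration, and it assumes the frame has a DatetimeIndex.

```python
import pandas as pd

# Sample data from the question (only column a is needed)
df = pd.DataFrame(
    {"a": [75, 82, 39, 10, 90, 61, 35, 95, 21, 60]},
    index=pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 10:00", "2023-01-01 20:00",
        "2023-01-05 00:00", "2023-01-05 20:00",
        "2023-02-08 00:00", "2023-02-08 10:00", "2023-02-08 20:00",
        "2023-04-15 00:00", "2023-04-15 10:00",
    ]),
)
df.index.name = "d"


def previous_group_value(df, column="a", at="10:00"):
    """Hypothetical helper: value of `column` at time `at` on the previous date present in df."""
    # Group key for every row: the row's date as a midnight timestamp
    dates = df.index.normalize()
    # Rows whose time of day is exactly `at`
    at_time = df[column].at_time(at)
    # One value per date that has such a row
    per_date = at_time.groupby(at_time.index.normalize()).first()
    # Add the dates that lack such a row; they become NaN
    per_date = per_date.reindex(dates.unique().sort_values())
    # Each date now holds the value of the previous available date
    prev = per_date.shift()
    # Broadcast the per-date values back onto the original rows
    return pd.Series(prev.reindex(dates).to_numpy(), index=df.index, name=column)


df["b"] = previous_group_value(df)
print(df)
```

On the sample data this fills b with NaN for 2023-01-01 and 2023-02-08, 82 for 2023-01-05, and 35 for 2023-04-15, which matches the desired column. Because the shift is applied to a series with one row per date, a date without a 10:00 row becomes NaN before shifting and therefore passes NaN on to the next date.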