Python Pandas GroupBy to calculate differences in months
Question:
I have the data frame below and want to calculate the interval in months between consecutive rows for each name.
Lines so far:
import pandas as pd
from io import StringIO
import numpy as np
csvfile = StringIO(
"""Name Year - Month Score
Mike 2022-11 31
Mike 2022-11 136
Lilly 2022-11 23
Lilly 2022-10 44
Kate 2023-01 1393
Kate 2022-10 2360
Kate 2022-08 1648
Kate 2022-06 543
Kate 2022-04 1935
Peter 2022-04 302
David 2023-01 1808
David 2022-12 194
David 2022-09 4077
David 2022-06 666
David 2022-03 3362""")
df = pd.read_csv(csvfile, sep='\t', engine='python')
df['Year - Month'] = pd.to_datetime(df['Year - Month'], format='%Y-%m')
df['Interval'] = (df.groupby(['Name'])['Year - Month'].transform(lambda x: x.diff())/ np.timedelta64(1, 'M'))
df['Interval'] = df['Interval'].replace(np.nan, 1).astype(int)
But the output looks wrong: the intervals are not calculated correctly.
Where has this gone wrong, and how can I correct it?
     Name  Year - Month  Score  Interval
0    Mike       2022-11     31         1  <- should be 0
1    Mike       2022-11    136         0
2   Lilly       2022-11     23         1
3   Lilly       2022-10     44         1  <- should be 0
4    Kate       2023-01   1393         1  <- should be 3
5    Kate       2022-10   2360         3  <- should be 2
6    Kate       2022-08   1648         2
7    Kate       2022-06    543         2
8    Kate       2022-04   1935         2  <- should be 0
9   Peter       2022-04    302         1  <- should be 0
10  David       2023-01   1808         1  <- should be 1
11  David       2022-12    194         1  <- should be 3
12  David       2022-09   4077         2  <- should be 3
13  David       2022-06    666         3
14  David       2022-03   3362         3  <- should be 0
Answers:
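You need to take the difference with the next value instead of the previous one. You can do so by passing -1 to diff(). Rows that have no next value within their group come out as NaN, which is then filled with 0.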
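To make the direction of diff() concrete, here is a minimal sketch on a made-up three-value Series (the numbers are illustrative only, not taken from the question's data). It compares the default diff() with diff(-1):

```python
import pandas as pd

# Hypothetical values, sorted from largest to smallest like the dates in the question
s = pd.Series([10, 7, 5])

prev_diff = s.diff()    # each row minus the row above it; first row is NaN
next_diff = s.diff(-1)  # each row minus the row below it; last row is NaN

print(prev_diff.tolist())  # [nan, -3.0, -2.0]
print(next_diff.tolist())  # [3.0, 2.0, nan]
```

With diff(-1) the NaN moves from the first row to the last row, which matches the expected output where the oldest entry of each name gets 0.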
...
df['Interval'] = df.groupby(['Name'])['Year - Month'].transform(lambda x: x.diff(-1)) / np.timedelta64(1, 'M')
df['Interval'] = df['Interval'].fillna(0).round().astype(int)
Result:
     Name Year - Month  Score  Interval
0    Mike   2022-11-01     31         0
1    Mike   2022-11-01    136         0
2   Lilly   2022-11-01     23         1
3   Lilly   2022-10-01     44         0
4    Kate   2023-01-01   1393         3
5    Kate   2022-10-01   2360         2
6    Kate   2022-08-01   1648         2
7    Kate   2022-06-01    543         2
8    Kate   2022-04-01   1935         0
9   Peter   2022-04-01    302         0
10  David   2023-01-01   1808         1
11  David   2022-12-01    194         3
12  David   2022-09-01   4077         3
13  David   2022-06-01    666         3
14  David   2022-03-01   3362         0
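Dividing by np.timedelta64(1, 'M') treats a month as an average length, which is why the result has to be rounded, and some newer pandas versions may reject the 'M' unit for timedeltas. As an alternative sketch (not part of the answer above), the interval can be computed from whole-number month indices, year * 12 + month. The small frame and the name month_index below are illustrative assumptions; only a subset of the question's rows is used:

```python
import pandas as pd

# A subset of the question's rows, built directly to keep the sketch self-contained
df = pd.DataFrame({
    "Name": ["Kate", "Kate", "Kate", "Peter", "David", "David"],
    "Year - Month": ["2023-01", "2022-10", "2022-08", "2022-04", "2023-01", "2022-12"],
})

dates = pd.to_datetime(df["Year - Month"], format="%Y-%m")

# Whole-number month index, so differences are exact month counts
month_index = dates.dt.year * 12 + dates.dt.month

# Difference with the next row inside each name; the last row per name becomes 0
df["Interval"] = month_index.groupby(df["Name"]).diff(-1).fillna(0).astype(int)

print(df["Interval"].tolist())  # [3, 2, 0, 0, 1, 0]
```

Because the month index is an integer, no rounding step is needed.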