Keep observations with two or more consecutive years of data by group
Question:
I have a dataset consisting of directorid, match_id, and calyear. I would like to keep only observations by director_id and match_id that have at least 2 consecutive years of data. I have tried a few different ways to do this, and haven’t been able to get it quite right. The few different things I have tried have also required multiple steps and weren’t particularly clean.
Here is what I have:
director_id
match_id
calyear
282
1111
2006
282
1111
2007
356
2222
2005
356
2222
2007
600
3333
2010
600
3333
2011
600
3333
2012
600
3355
2013
600
3355
2015
600
3355
2016
753
4444
2005
753
4444
2008
753
4444
2009
Here is what I want:
director_id
match_id
calyear
282
1111
2006
282
1111
2007
600
3333
2010
600
3333
2011
600
3333
2012
600
3355
2015
600
3355
2016
753
4444
2008
753
4444
2009
I started by creating a variable equal to one:
df['tosum'] = 1
And then count the number of observations where the difference in calyear by group is equal to 1.
df['num_years'] = (
df.groupby(['directorid','match_id'])['tosum'].transform('sum').where(df.groupby(['match_id'])['calyear'].diff()==1, np.nan)
)
And then I keep all observations with ‘num_years’ greater than 1.
However, the first observation per director_id match_id gets set equal to NaN. In general, I think I am going about this in a convoluted way…it feels like there should be a simpler way to achieve my goal. Any help is greatly appreciated!
Answers:
Yes you need to groupby ‘director_id’, ‘match_id’ and then do a transform but the transform just needs to look at the difference between next element in both directions. In one direction you need to see if it equals 1 and in another -1 and then subset using the resulting True/False values.
df = df[
df.groupby(["director_id", "match_id"])["calyear"].transform(
lambda x: (x.diff().eq(1)) | (x[::-1].diff().eq(-1))
)
]
print(df):
director_id match_id calyear
0 282 1111 2006
1 282 1111 2007
4 600 3333 2010
5 600 3333 2011
6 600 3333 2012
8 600 3355 2015
9 600 3355 2016
11 753 4444 2008
12 753 4444 2009
I have a dataset consisting of directorid, match_id, and calyear. I would like to keep only observations by director_id and match_id that have at least 2 consecutive years of data. I have tried a few different ways to do this, and haven’t been able to get it quite right. The few different things I have tried have also required multiple steps and weren’t particularly clean.
Here is what I have:
director_id | match_id | calyear |
---|---|---|
282 | 1111 | 2006 |
282 | 1111 | 2007 |
356 | 2222 | 2005 |
356 | 2222 | 2007 |
600 | 3333 | 2010 |
600 | 3333 | 2011 |
600 | 3333 | 2012 |
600 | 3355 | 2013 |
600 | 3355 | 2015 |
600 | 3355 | 2016 |
753 | 4444 | 2005 |
753 | 4444 | 2008 |
753 | 4444 | 2009 |
Here is what I want:
director_id | match_id | calyear |
---|---|---|
282 | 1111 | 2006 |
282 | 1111 | 2007 |
600 | 3333 | 2010 |
600 | 3333 | 2011 |
600 | 3333 | 2012 |
600 | 3355 | 2015 |
600 | 3355 | 2016 |
753 | 4444 | 2008 |
753 | 4444 | 2009 |
I started by creating a variable equal to one:
df['tosum'] = 1
And then count the number of observations where the difference in calyear by group is equal to 1.
df['num_years'] = (
df.groupby(['directorid','match_id'])['tosum'].transform('sum').where(df.groupby(['match_id'])['calyear'].diff()==1, np.nan)
)
And then I keep all observations with ‘num_years’ greater than 1.
However, the first observation per director_id match_id gets set equal to NaN. In general, I think I am going about this in a convoluted way…it feels like there should be a simpler way to achieve my goal. Any help is greatly appreciated!
Yes you need to groupby ‘director_id’, ‘match_id’ and then do a transform but the transform just needs to look at the difference between next element in both directions. In one direction you need to see if it equals 1 and in another -1 and then subset using the resulting True/False values.
df = df[
df.groupby(["director_id", "match_id"])["calyear"].transform(
lambda x: (x.diff().eq(1)) | (x[::-1].diff().eq(-1))
)
]
print(df):
director_id match_id calyear
0 282 1111 2006
1 282 1111 2007
4 600 3333 2010
5 600 3333 2011
6 600 3333 2012
8 600 3355 2015
9 600 3355 2016
11 753 4444 2008
12 753 4444 2009