Labelling version upgrades, sorted by commit date
Question:
My dataframe is like this:
    commitDate  info_version  id
0   2021-04-07  1.1.0         84
1   2021-05-31  1.1.0         84
2   2021-06-21  1.1.0         84
3   2021-06-18  1.1.0         84
4   2020-12-06  0.1.0         124
5   2020-11-14  0.1.0         124
6   2021-02-17  3.0.0         164
7   2021-03-23  3.1.0         164
8   2021-03-08  3.1.0         164
9   2021-03-03  3.0.0         164
10  2021-05-12  3.1.0         164
11  2021-05-28  3.1.0         164
12  2021-06-21  3.2.0         164
13  2019-10-14  1.2.2         184
14  2019-09-10  1.0.1         184
15  2019-01-19  1.0.0         184
I want to label these version upgrades. For example, for id 84 all versions are the same, so every row can be labelled no change. For id 164, however, the rows are not in chronological order, so I want to sort them by commit date first so that they get labelled correctly. If the minor and patch numbers both change at the same time, I want the row to be labelled minor-patch.
The expected output should be like this:
    commitDate  info_version  id   Label
0   2021-04-07  1.1.0         84   no change
1   2021-05-31  1.1.0         84   no change
2   2021-06-21  1.1.0         84   no change
3   2021-06-18  1.1.0         84   no change
4   2020-12-06  0.1.0         124  no change
5   2020-11-14  0.1.0         124  no change
6   2021-02-17  3.0.0         164  no change
7   2021-03-23  3.1.0         164  no change
8   2021-03-08  3.1.0         164  minor
9   2021-03-03  3.0.0         164  no change
10  2021-05-12  3.1.0         164  no change
11  2021-05-28  3.1.0         164  no change
12  2021-06-21  3.2.0         164  minor
13  2019-10-14  1.2.2         184  minor-patch
14  2019-09-10  1.0.1         184  patch
15  2019-01-19  1.0.0         184  no change
I tried this code:
import pandas as pd
from packaging import version

def version_upgrade(prev_version, current_version):
    if prev_version is None:
        return None
    elif version.parse(current_version) > version.parse(prev_version):
        if version.parse(current_version).major > version.parse(prev_version).major:
            return "major"
        elif version.parse(current_version).minor > version.parse(prev_version).minor:
            return "minor"
        else:
            return "patch"
    else:
        return None

semver_df["label"] = None
prev_version_list = semver_df["info_version"].shift(1).tolist()
semver_df["label"] = semver_df["info_version"].apply(lambda x: version_upgrade(prev_version_list.pop(0), x))
However, it does not give me the desired output when I try it on my dataset. Also, the info_version field in my data is of object datatype. Another potential issue is that some of the info_version fields have only two numbers, like 1.0 or 0.2. I am not sure how I could handle those exceptions.
Any help with this would be highly appreciated!
Answers:
Here are some thoughts about this, although, as mentioned in the comment, I don’t understand some aspects of your question (e.g. what is a "minor-patch" change? Why do versions have several conflicting dates?)
First, let’s augment packaging.version.Version so that we can get the "difference" between two versions:
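import packaging.version  # a bare "import packaging" may not load the version submodule

class Vers(packaging.version.Version):
    def parts(self):
        # components compared by __sub__, most significant first
        return (
            self.epoch,
            self.major,
            self.minor,
            self.micro,
        )

    def __sub__(self, other):
        if self == other:
            return 'no change'
        for name, p0, p1 in zip(['epoch', 'major', 'minor', 'patch'], self.parts(), other.parts()):
            if p0 != p1:
                return name
        # the versions differ only below the patch level (pre/post/dev/local)
        return 'nano'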
(Note: a previous version of this answer used tuple, but could only handle simple numerical versions.)
Examples:
>>> Vers('1.1.2') == Vers('1.1.2')
True
>>> Vers('1.1') == Vers('1.1.2') # different .micro (or "patch")
False
>>> Vers('1.2.1') < Vers('1.13') # change of minor
True
>>> Vers('1.2.1') < Vers('1!1.2.1') # change of epoch
True
>>> Vers('1.2.1a01') == Vers('1.2.1a1') # same "alpha"
True
We introduced a funny __sub__ operator. Unlike normal subtraction, it is a bit weird because it is symmetrical (a - b == b - a). What it provides is a description of the version change as a string:
>>> Vers('1.0') - Vers('1.0.0')
'no change'
>>> Vers('1.0.1') - Vers('1.0')
'patch'
>>> Vers('1.2') - Vers('1.1.3')
'minor'
# any change below 'patch' will be labeled 'nano', e.g.:
>>> Vers('1.0a0') - Vers('1.0')
'nano'
# we also handle 'epoch' (change of versioning scheme):
>>> Vers('2014.01') - Vers('1!1.0')
'epoch'
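On the question's worry about versions with only two numbers (like 1.0 or 0.2): as far as I can tell, packaging treats a missing component as zero, so no special-casing should be needed. A quick check, using the stock Version class rather than Vers (the variable name short is just for illustration):

```python
from packaging.version import Version

short = Version("0.2")
print(short.release)                          # (0, 2): only the components that were written
print(short.major, short.minor, short.micro)  # 0 2 0: the missing patch number reads as 0
print(Version("1.0") == Version("1.0.0"))     # True: trailing zeros do not affect equality
```

Since Vers only reads epoch, major, minor and micro, two-number versions go through the same comparison as three-number ones.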
With this, we can label version changes in a Series. E.g.:
txt = """2014.12.0
2014.12.1
2014.13.4
2015.01
2015.1
1!1.dev0
1!1.0.dev456
1!1.0a1
1!1.0a2.dev456
1!1.0rc1
1!1.0
1!1.0+abc.5
1!1.0.post456
1!1.0.15
1!1.1.dev1"""
z = pd.DataFrame(txt.split(), columns=['version'])
>>> z.assign(label=z['version'].apply(Vers).diff().fillna('NA'))
    version         label
0   2014.12.0       NA
1   2014.12.1       patch
2   2014.13.4       minor
3   2015.01         major
4   2015.1          no change
5   1!1.dev0        epoch
6   1!1.0.dev456    nano
7   1!1.0a1         nano
8   1!1.0a2.dev456  nano
9   1!1.0rc1        nano
10  1!1.0           nano
11  1!1.0+abc.5     nano
12  1!1.0.post456   nano
13  1!1.0.15        patch
14  1!1.1.dev1      minor
We can then use this on your DataFrame. But first, I think we should clean up the conflicting dates:
z = (
    df.groupby(['id', 'info_version'])['commitDate'].min()
    .reset_index().sort_values(['id', 'commitDate'])[df.columns]
)
>>> z
    commitDate  info_version  id
0   2021-04-07  1.1.0         84
1   2020-11-14  0.1.0         124
2   2021-02-17  3.0.0         164
3   2021-03-08  3.1.0         164
4   2021-06-21  3.2.0         164
5   2019-01-19  1.0.0         184
6   2019-09-10  1.0.1         184
7   2019-10-14  1.2.2         184
With this done, we can now apply Vers() to each info_version, and take the .diff() within each id to get our labels:
df2 = z.assign(
    label=z.assign(v=z['info_version'].apply(Vers))
    .groupby('id')['v'].transform(pd.Series.diff)
    .fillna('no change'))
>>> df2
    commitDate  info_version  id   label
0   2021-04-07  1.1.0         84   no change
1   2020-11-14  0.1.0         124  no change
2   2021-02-17  3.0.0         164  no change
3   2021-03-08  3.1.0         164  minor
4   2021-06-21  3.2.0         164  minor
5   2019-01-19  1.0.0         184  no change
6   2019-09-10  1.0.1         184  patch
7   2019-10-14  1.2.2         184  minor
Note, if we don’t clean up the multiple commitDates per (id, version), that is fine too (but I’m not sure what that means):
z = df.sort_values(['id', 'commitDate'])
# ... (same as above)
>>> df2
    commitDate  info_version  id   label
0   2021-04-07  1.1.0         84   no change
1   2021-05-31  1.1.0         84   no change
3   2021-06-18  1.1.0         84   no change
2   2021-06-21  1.1.0         84   no change
5   2020-11-14  0.1.0         124  no change
4   2020-12-06  0.1.0         124  no change
6   2021-02-17  3.0.0         164  no change
9   2021-03-03  3.0.0         164  no change
8   2021-03-08  3.1.0         164  minor
7   2021-03-23  3.1.0         164  no change
10  2021-05-12  3.1.0         164  no change
11  2021-05-28  3.1.0         164  no change
12  2021-06-21  3.2.0         164  minor
15  2019-01-19  1.0.0         184  no change
14  2019-09-10  1.0.1         184  patch
13  2019-10-14  1.2.2         184  minor
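The labels above never produce "minor-patch", because Vers.__sub__ reports only the most significant component that differs. If the intent is to name every component that went up, here is one possible sketch. It is not part of the approach above: the helper name combined_label and the rule "list each of major/minor/patch that increased" are my assumptions about what "minor-patch" should mean, and the data is only a subset of the question's rows for ids 164 and 184.

```python
import pandas as pd
from packaging.version import Version

def combined_label(prev, cur):
    """Join the names of all release components that increased from prev to cur.

    Hypothetical helper: the 'increased components' rule is an assumption.
    """
    if pd.isna(prev):
        return "no change"  # first version seen for this id
    p, c = Version(prev), Version(cur)
    old = (p.major, p.minor, p.micro)
    new = (c.major, c.minor, c.micro)
    bumped = [name for name, a, b in zip(("major", "minor", "patch"), old, new) if b > a]
    return "-".join(bumped) if bumped else "no change"

# Subset of the question's data, in its original (unsorted) row order.
df = pd.DataFrame({
    "commitDate": pd.to_datetime([
        "2021-02-17", "2021-03-23", "2021-03-08", "2021-03-03", "2021-06-21",
        "2019-10-14", "2019-09-10", "2019-01-19",
    ]),
    "info_version": ["3.0.0", "3.1.0", "3.1.0", "3.0.0", "3.2.0",
                     "1.2.2", "1.0.1", "1.0.0"],
    "id": [164, 164, 164, 164, 164, 184, 184, 184],
})

# Sort chronologically within each id, pair each row with the previous version,
# label the pair, then restore the original row order.
out = df.sort_values(["id", "commitDate"]).copy()
out["prev"] = out.groupby("id")["info_version"].shift()
out["Label"] = [combined_label(p, c) for p, c in zip(out["prev"], out["info_version"])]
out = out.drop(columns="prev").sort_index()
print(out)
```

Under this rule 3.0.5 -> 3.1.0 is labelled "minor" (the patch number went down, not up), and downgrades fall back to "no change"; adjust the comparison if that is not what you want.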