Label version upgrades sorted by commit date

Question:

My dataframe is like this:

  commitDate  info_version id
0 2021-04-07     1.1.0   84
1 2021-05-31     1.1.0   84
2 2021-06-21     1.1.0   84
3 2021-06-18     1.1.0   84
4 2020-12-06     0.1.0  124
5 2020-11-14     0.1.0  124
6 2021-02-17     3.0.0  164
7 2021-03-23     3.1.0  164
8 2021-03-08     3.1.0  164
9 2021-03-03     3.0.0  164
10 2021-05-12    3.1.0  164
11 2021-05-28    3.1.0  164
12 2021-06-21    3.2.0  164
13 2019-10-14    1.2.2  184
14 2019-09-10    1.0.1  184
15 2019-01-19    1.0.0  184

I want to label these version upgrades. For example, for id 84 all versions are the same, so every row can be labelled no change. For id 164, however, the rows are not in chronological order, so I want to sort them by commit date before labelling. If the minor and patch components both change in a single step, I want the label to be minor-patch.

The expected output should be like this:

   commitDate info_version   id  Label
0  2021-04-07        1.1.0   84  no change
1  2021-05-31        1.1.0   84  no change
2  2021-06-21        1.1.0   84  no change
3  2021-06-18        1.1.0   84  no change
4  2020-12-06        0.1.0  124  no change
5  2020-11-14        0.1.0  124  no change
6  2021-02-17        3.0.0  164  no change
7  2021-03-23        3.1.0  164  no change
8  2021-03-08        3.1.0  164  minor
9  2021-03-03        3.0.0  164  no change
10 2021-05-12        3.1.0  164  no change
11 2021-05-28        3.1.0  164  no change
12 2021-06-21        3.2.0  164  minor
13 2019-10-14        1.2.2  184  minor-patch
14 2019-09-10        1.0.1  184  patch
15 2019-01-19        1.0.0  184  no change

I tried this code:

import pandas as pd
from packaging import version

def version_upgrade(prev_version, current_version):
    if prev_version is None:
        return None
    elif version.parse(current_version) > version.parse(prev_version):
        if version.parse(current_version).major > version.parse(prev_version).major:
            return "major"
        elif version.parse(current_version).minor > version.parse(prev_version).minor:
            return "minor"
        else:
            return "patch"
    else:
        return None

semver_df["label"] = None

prev_version_list = semver_df["info_version"].shift(1).tolist()
semver_df["label"] = semver_df["info_version"].apply(lambda x: version_upgrade(prev_version_list.pop(0), x))

However, it does not give me the desired output on my dataset. Note that the info_version column has the object dtype. Another potential issue is that some info_version values have only two components, like 1.0 or 0.2, and I am not sure how to handle those cases.

Any help with this would be highly appreciated!

Asked By: Brie MerryWeather

||

Answers:

Here are some thoughts about this, although, as mentioned in the comment, I don’t understand some aspects of your question (e.g. what is a "minor-patch" change? Why do versions have several conflicting dates?)

First, let’s augment packaging.version.Version so that we can get the "difference" between two versions:

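import packaging.version  # "import packaging" alone does not load the submodule
import pandas as pd

class Vers(packaging.version.Version):
    def parts(self):
        # Release components compared by __sub__, most significant first.
        return (
            self.epoch,
            self.major,
            self.minor,
            self.micro,
        )

    def __sub__(self, other):
        # Not a numeric difference: returns the name of the most
        # significant component that differs between the two versions.
        if self == other:
            return 'no change'

        for name, p0, p1 in zip(['epoch', 'major', 'minor', 'patch'], self.parts(), other.parts()):
            if p0 != p1:
                return name
        # The versions differ only below the patch level (pre/post/dev/local).
        return 'nano'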

(Note: a previous version of this answer used tuple, but it could only handle simple numerical versions.)

Examples:

>>> Vers('1.1.2') == Vers('1.1.2')
True

>>> Vers('1.1') == Vers('1.1.2')  # different .micro (or "patch")
False

>>> Vers('1.2.1') < Vers('1.13')  # change of minor
True

>>> Vers('1.2.1') < Vers('1!1.2.1')  # change of epoch
True

>>> Vers('1.2.1a01') == Vers('1.2.1a1')  # same "alpha"
True
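
The question also worries about versions with only two components, such as 1.0 or 0.2. As far as I can tell, these need no special handling: packaging reads missing release components as zero. A minimal sketch using plain packaging.version.Version (the variable name short is only for illustration):

```python
from packaging.version import Version

# Missing release components are read as zero, so short versions
# such as "1.0" or "0.2" parse without special handling.
short = Version("0.2")
print(short.release)                          # components as written
print(short.major, short.minor, short.micro)  # micro is padded with 0

# Padding with zeros does not change equality or ordering.
print(Version("1.0") == Version("1.0.0"))
print(Version("0.2") < Version("0.2.1"))
```

Since the info_version column has the object dtype, its values are expected to be plain strings here. A string that is not a valid version raises packaging.version.InvalidVersion, so it may be worth checking the column for such values first.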

We introduced a funny __sub__ operator. Unlike normal subtraction, it is a bit weird because it is symmetrical (a - b == b - a). What it provides is a description of version change as a string:

>>> Vers('1.0') - Vers('1.0.0')
'no change'

>>> Vers('1.0.1') - Vers('1.0')
'patch'

>>> Vers('1.2') - Vers('1.1.3')
'minor'

# any change below 'patch' will be labeled 'nano', e.g.:
>>> Vers('1.0a0') - Vers('1.0')
'nano'

# we also handle 'epoch' (change of versioning scheme):
>>> Vers('2014.01') - Vers('1!1.0')
'epoch'

With this, we can label version changes in a Series. E.g.:

txt = """2014.12.0
2014.12.1
2014.13.4
2015.01
2015.1
1!1.dev0
1!1.0.dev456
1!1.0a1
1!1.0a2.dev456
1!1.0rc1
1!1.0
1!1.0+abc.5
1!1.0.post456
1!1.0.15
1!1.1.dev1"""

z = pd.DataFrame(txt.split(), columns=['version'])
>>> z.assign(label=z['version'].apply(Vers).diff().fillna('NA'))
           version      label
0        2014.12.0         NA
1        2014.12.1      patch
2        2014.13.4      minor
3          2015.01      major
4           2015.1  no change
5         1!1.dev0      epoch
6     1!1.0.dev456       nano
7          1!1.0a1       nano
8   1!1.0a2.dev456       nano
9         1!1.0rc1       nano
10           1!1.0       nano
11     1!1.0+abc.5       nano
12   1!1.0.post456       nano
13        1!1.0.15      patch
14      1!1.1.dev1      minor

We can then use this on your DataFrame. But first, I think we should clean up the conflicting dates:

z = (
    df.groupby(['id', 'info_version'])['commitDate'].min()
    .reset_index().sort_values(['id', 'commitDate'])[df.columns]
)
>>> z
  commitDate info_version   id
0 2021-04-07        1.1.0   84
1 2020-11-14        0.1.0  124
2 2021-02-17        3.0.0  164
3 2021-03-08        3.1.0  164
4 2021-06-21        3.2.0  164
5 2019-01-19        1.0.0  184
6 2019-09-10        1.0.1  184
7 2019-10-14        1.2.2  184

With this done, we can now apply Vers() to each info_version, and take the .diff() to get our messages:

df2 = z.assign(
    label=z.assign(v=z['info_version'].apply(Vers))
    .groupby('id')['v'].transform(pd.Series.diff)
    .fillna('no change'))
>>> df2
  commitDate info_version   id      label
0 2021-04-07        1.1.0   84  no change
1 2020-11-14        0.1.0  124  no change
2 2021-02-17        3.0.0  164  no change
3 2021-03-08        3.1.0  164      minor
4 2021-06-21        3.2.0  164      minor
5 2019-01-19        1.0.0  184  no change
6 2019-09-10        1.0.1  184      patch
7 2019-10-14        1.2.2  184      minor

Note, if we don’t clean up the multiple commitDates per (ID, version), that is fine too (but I’m not sure what that means):

z = df.sort_values(['id', 'commitDate'])
# ... (same as above)
>>> df2
   commitDate info_version   id      label
0  2021-04-07        1.1.0   84  no change
1  2021-05-31        1.1.0   84  no change
3  2021-06-18        1.1.0   84  no change
2  2021-06-21        1.1.0   84  no change
5  2020-11-14        0.1.0  124  no change
4  2020-12-06        0.1.0  124  no change
6  2021-02-17        3.0.0  164  no change
9  2021-03-03        3.0.0  164  no change
8  2021-03-08        3.1.0  164      minor
7  2021-03-23        3.1.0  164  no change
10 2021-05-12        3.1.0  164  no change
11 2021-05-28        3.1.0  164  no change
12 2021-06-21        3.2.0  164      minor
15 2019-01-19        1.0.0  184  no change
14 2019-09-10        1.0.1  184      patch
13 2019-10-14        1.2.2  184      minor
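
The labels above name only the most significant component that changed, so 1.0.1 -> 1.2.2 comes out as minor rather than the minor-patch asked for in the question. If minor-patch is meant to list every release component that differs, a variant could look like the sketch below. This is my reading of the question, not something it confirms. The helper changed_parts and the small sample frame are hypothetical, the epoch is ignored, and pre-release-only changes fall back to no change.

```python
import pandas as pd
from packaging.version import Version

def changed_parts(prev, curr):
    """Join the names of all release components that differ (hypothetical helper)."""
    if pd.isna(prev):                      # first row of each id has no predecessor
        return 'no change'
    a, b = Version(prev), Version(curr)
    names = ('major', 'minor', 'patch')
    old = (a.major, a.minor, a.micro)
    new = (b.major, b.minor, b.micro)
    changed = [n for n, p0, p1 in zip(names, old, new) if p0 != p1]
    return '-'.join(changed) if changed else 'no change'

# A small subset of the question's data, deliberately out of order.
df = pd.DataFrame({
    'commitDate': pd.to_datetime(
        ['2019-10-14', '2019-09-10', '2019-01-19', '2021-03-08', '2021-02-17']),
    'info_version': ['1.2.2', '1.0.1', '1.0.0', '3.1.0', '3.0.0'],
    'id': [184, 184, 184, 164, 164],
})

# Sort by date within each id, then compare each version with the previous one.
z = df.sort_values(['id', 'commitDate'])
prev = z.groupby('id')['info_version'].shift()
z = z.assign(label=[changed_parts(p, c) for p, c in zip(prev, z['info_version'])])

out = z.sort_index()                       # back to the original row order
print(out)
```

Note that under this reading a jump such as 1.2.3 -> 2.0.0 would be labelled major-minor-patch, because all three components differ. Whether that is desirable depends on what the labels are used for.
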
Answered By: Pierre D