Pairing two sets of data in Python, any idea?

Question:

I am trying to pair the elements of two sets of data so that the standard deviation of their element-wise differences is minimized.

set1= (300, 420, 541, 600)
set2= (1470, 1250, 1250, 1360)

For example, here are three possible pairings:
#1

(300, 420, 541, 600)
(1250, 1250, 1360, 1470)

difference is (950, 830, 819, 870) so the stdv is 59.35

#2

(300, 420, 541, 600)
(1470, 1250, 1360, 1250)

difference is (1170, 830, 819, 650) so the stdv is 217.99

#3

(300, 420, 541, 600)
(1250, 1470, 1360, 1250)

difference is (950, 1050, 819, 650) so the stdv is 172.98

so #1 is desired with smaller stdv
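
As a quick check, the figures for the three pairings above can be recomputed with statistics.stdev from the standard library, which returns the sample standard deviation. This is only a sketch; the helper name pairing_stdev and the candidates dictionary are made up for illustration.

```python
from statistics import stdev

set1 = (300, 420, 541, 600)

# The three candidate orderings of set2 listed in the examples.
candidates = {
    "#1": (1250, 1250, 1360, 1470),
    "#2": (1470, 1250, 1360, 1250),
    "#3": (1250, 1470, 1360, 1250),
}

def pairing_stdev(a, b):
    """Sample standard deviation of the element-wise differences b - a."""
    return stdev(y - x for x, y in zip(a, b))

results = {name: pairing_stdev(set1, order) for name, order in candidates.items()}
for name, value in results.items():
    print(name, round(value, 2))

# Name of the pairing with the smallest standard deviation.
best = min(results, key=results.get)
print("smallest:", best)
```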

How can this be achieved in Python? Thanks in advance.

I couldn’t think of how to do it; it was previously done in Access.

Update:

The data comes in as dataframes:

data1 = {'col1': ["A", "B", "C", "C"], 'col2': [300, 420, 541, 600]}
data2 = {'col1': ["A", "C", "B", "D"], 'col2': [1470, 1250, 1250, 1360]}

with data1.col2 and data2.col2 playing the roles of set1 and set2.

I want to end up with

data1 = {'col1': ["A", "B", "C", "C"], 'col2': [300, 420, 541, 600], 'col3': [1250, 1250, 1360, 1470]}

where the new ‘col3’ is filled in the order given by min_diff_s2.

Update 2: If I want the final data1 to also carry set2’s col1 information, is a join based on min_diff_s2 a safe way to do that?

Asked By: Connie Xu

Answers:

I’m assuming you have tuples, not sets, since that is how they are written. To get their element-wise difference, you can use a generator expression:

>>> set1= (300, 420, 541, 600)
>>> set2= (1470, 1250, 1250, 1360)
>>> diff = tuple(s1 - s2 for s1, s2 in zip(set1, set2))
>>> diff
(-1170, -830, -709, -760)

For the standard deviation, there is a function in the statistics module:

>>> from statistics import stdev
>>> stdev(diff)
207.83867942870177
Answered By: Jab

Those are tuples, not sets.

set1 = (300, 420, 541, 600)
set2 = (1470, 1250, 1250, 1360)

First, let’s define a function that, given any two tuples, will return the standard deviation of the difference between them:

from statistics import stdev

def diff_std(s1, s2):
    return stdev(i1 - i2 for i1, i2 in zip(s1, s2))

You want all possible permutations of one of the tuples. If you compare the other tuple to every permutation of the first, you will have covered every possible pairing of the two tuples.

Then, apply the subtraction and standard deviation to obtain a single value, and find the combination that gives you the smallest standard deviation.

from itertools import permutations

set1_perms = permutations(set1)

min_diff_s1, min_diff_s2 = min(((s1, set2) for s1 in set1_perms), 
                    key=lambda x: diff_std(x[0], x[1]))

Which gives the same pairing you specified in #1, just listed in a different order:

min_diff_s1 = (600, 300, 420, 541)
min_diff_s2 = (1470, 1250, 1250, 1360)
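
A side note that goes beyond the permutation search: the number of permutations grows as n!, so 10 items already mean 3,628,800 candidates. Every pairing has the same sum of differences, and therefore the same mean, so minimizing the standard deviation is equivalent to minimizing the sum of squared differences. By the rearrangement inequality, that minimum is reached when both tuples are in the same sorted order. A minimal sketch under that reasoning follows; the name pair_by_sorting is hypothetical, and the brute-force search is kept only as a cross-check.

```python
from itertools import permutations
from statistics import stdev

def diff_std(s1, s2):
    return stdev(i1 - i2 for i1, i2 in zip(s1, s2))

def pair_by_sorting(s1, s2):
    """Reorder s2 so its k-th smallest value sits next to the k-th smallest of s1."""
    # Positions of s1, ordered from its smallest to its largest value.
    order = sorted(range(len(s1)), key=lambda i: s1[i])
    paired = [None] * len(s1)
    for position, value in zip(order, sorted(s2)):
        paired[position] = value
    return tuple(paired)

set1 = (300, 420, 541, 600)
set2 = (1470, 1250, 1250, 1360)

fast = pair_by_sorting(set1, set2)
print(fast)  # (1250, 1250, 1360, 1470)

# Cross-check against the exhaustive search over permutations of set2.
brute = min(permutations(set2), key=lambda p: diff_std(set1, p))
print(diff_std(set1, fast) == diff_std(set1, brute))
```

For small inputs like these, the permutation search is perfectly adequate.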

With your dataframes, you need to do the same thing (almost):

import pandas as pd

df1 = pd.DataFrame({'col1': ["A", "B", "C", "C"], 'col2': [300, 420, 541, 600]})
df2 = pd.DataFrame({'col1': ["A", "C", "B", "D"], 'col2': [1470, 1250, 1250, 1360]})

A couple of differences:

  • You want to put the data from df2 into df1, so let’s permute df2.col2 instead of permuting the first set like we did above.
  • We don’t really need to get the first set’s values again, since we never permute them.
  • Pandas columns are Series objects, which inherently support element-wise subtraction and standard deviation, so we can redefine the diff_std function to take advantage of this:

def diff_std(s1, s2):
    return (s1 - s2).std()

set2_perms = permutations(df2.col2.values)
min_diff_s2 = min(set2_perms,
                  key=lambda x: diff_std(df1.col2, x))

In diff_std, s2 is a tuple and s1 is a pd.Series object. Subtracting a tuple from a Series does element-wise subtraction and returns a Series. Then we take the std() of that resulting Series object.

We don’t really care about the first set, since we didn’t permute it.

Now that we have min_diff_s2, we can set the column of df1:

df1["col3"] = min_diff_s2

Which results in:

  col1  col2  col3
0    A   300  1250
1    B   420  1250
2    C   541  1360
3    C   600  1470
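
On the follow-up about bringing df2’s col1 along: a join on the values in min_diff_s2 is ambiguous here, because 1250 appears twice in df2.col2 (once with "C" and once with "B"). One possible alternative is to permute row positions rather than values, so that each value keeps its label. The sketch below uses plain lists that mirror the dataframe columns (values1, labels2, values2 and order_std are illustrative names, not part of the code above). With the real dataframes, the resulting best_order could be passed as df2.iloc[list(best_order)] to reorder whole rows.

```python
from itertools import permutations
from statistics import stdev

values1 = [300, 420, 541, 600]        # mirrors df1.col2
labels2 = ["A", "C", "B", "D"]        # mirrors df2.col1
values2 = [1470, 1250, 1250, 1360]    # mirrors df2.col2

def order_std(order):
    """Std dev of the differences when df2's rows are taken in the given order."""
    return stdev(values2[j] - v1 for v1, j in zip(values1, order))

# Permute row positions (0..n-1) instead of the values themselves.
best_order = min(permutations(range(len(values2))), key=order_std)

col3 = [values2[j] for j in best_order]   # reordered df2.col2 values
col4 = [labels2[j] for j in best_order]   # the df2.col1 labels that travel with them
print(best_order, col3, col4)
```

The two 1250 rows are interchangeable as far as the standard deviation is concerned, so which of "C" and "B" lands in the first row is decided only by the order in which the permutations are generated.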
Answered By: Pranav Hosangadi