Pairing two sets of data in Python, any idea?
Question:
I am trying to pair two sets of data based on their difference and minimize the standard deviation
set1= (300, 420, 541, 600)
set2= (1470, 1250, 1250, 1360)
so, three examples here
#1
(300, 420, 541, 600)
(1250, 1250, 1360, 1470)
difference is (950, 830, 819, 870)
so the stdv is 59.35
#2
(300, 420, 541, 600)
(1470, 1250, 1360, 1250)
difference is (1170, 830, 819, 650)
so the stdv is 217.99
#3
(300, 420, 541, 600)
(1250, 1470, 1360, 1250)
difference is (950, 1050, 819, 650)
so the stdv is 172.98
so #1 is desired, since it has the smallest stdv:
(300, 420, 541, 600)
(1250, 1250, 1360, 1470)
How can this be achieved in Python? I couldn't think of how; it was previously done in Access. Thanks in advance.
Update:
The data comes in as dataframes:
data1 = {'col1': ["A", "B", "C", "C"], 'col2': [300, 420, 541, 600]}
data2 = {'col1': ["A", "C", "B", "D"], 'col2': [1470, 1250, 1250, 1360]}
using data1.col2 and data2.col2 like set1 and set2.
I want to see
data1 = {'col1': ["A", "B", "C", "C"], 'col2': [300, 420, 541, 600], 'col3': [1250, 1250, 1360, 1470]}
with the new 'col3' filled in the order given by min_diff_s2.
Update2: If I want the final data1 to also carry set2's col1 information, is a join based on min_diff_s2 a good way to do it?
Answers:
I’m assuming you have tuples, not sets, as that is how they are written. To get their differences you can use a generator expression inside tuple():
>>> set1= (300, 420, 541, 600)
>>> set2= (1470, 1250, 1250, 1360)
>>> diff = tuple(s1 - s2 for s1, s2 in zip(set1, set2))
>>> diff
(-1170, -830, -709, -760)
And for the standard deviation, there is a function in the statistics library:
>>> from statistics import stdev
>>> stdev(diff)
207.83867942870177
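Note that the snippet above subtracts the tuples in their original order, so it scores only one pairing. As a rough sketch (the names `candidates` and `diff_stdev` are mine, not from the question), the three orderings listed in the question can be scored side by side like this. The question takes set2 - set1 while the snippet above takes set1 - set2; the sign does not change the standard deviation.

```python
from statistics import stdev

set1 = (300, 420, 541, 600)

# The three candidate orderings of set2 listed in the question.
candidates = {
    "#1": (1250, 1250, 1360, 1470),
    "#2": (1470, 1250, 1360, 1250),
    "#3": (1250, 1470, 1360, 1250),
}

def diff_stdev(a, b):
    """Sample standard deviation of the element-wise differences b - a."""
    return stdev([y - x for x, y in zip(a, b)])

# Score every candidate and pick the one with the smallest spread.
results = {name: round(diff_stdev(set1, s2), 2) for name, s2 in candidates.items()}
best = min(results, key=results.get)

print(results)
print("best:", best)
```

`statistics.stdev` is the sample standard deviation (n - 1 in the denominator), which is what the figures in the question appear to use.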
Those are tuples, not sets.
set1 = (300, 420, 541, 600)
set2 = (1470, 1250, 1250, 1360)
First, let’s define a function that, given any two tuples, will return the standard deviation of the difference between them:
from statistics import stdev

def diff_std(s1, s2):
    return stdev(i1 - i2 for i1, i2 in zip(s1, s2))
You want all possible permutations of one of the tuples. If you compare the other tuple to every permutation of the first tuple, you will have compared all possible pairings of the two tuples.
Then, apply the subtraction and standard deviation to obtain a single value, and find the combination that gives you the smallest standard deviation.
from itertools import permutations
set1_perms = permutations(set1)
min_diff_s1, min_diff_s2 = min(((s1, set2) for s1 in set1_perms),
key=lambda x: diff_std(x[0], x[1]))
Which gives the combination you specified in #1:
min_diff_s1 = (600, 300, 420, 541)
min_diff_s2 = (1470, 1250, 1250, 1360)
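One caveat: `permutations` yields n! orderings, so the brute-force search only stays practical for short inputs. A possible shortcut, which is not part of the original approach: the sum of the differences is the same for every pairing, so the standard deviation depends only on the sum of products of paired values, and by the rearrangement inequality that sum is largest when both sequences are in the same sorted order. The sketch below (the helper name `pair_by_sorting` is hypothetical) pairs by sorting and checks the result against brute force on the question's data.

```python
from itertools import permutations
from statistics import stdev

def diff_std(s1, s2):
    """Sample standard deviation of the element-wise differences."""
    return stdev([i1 - i2 for i1, i2 in zip(s1, s2)])

def pair_by_sorting(s1, s2):
    """Pair the k-th smallest value of s1 with the k-th smallest of s2."""
    return tuple(sorted(s1)), tuple(sorted(s2))

set1 = (300, 420, 541, 600)
set2 = (1470, 1250, 1250, 1360)

sorted_s1, sorted_s2 = pair_by_sorting(set1, set2)

# Brute force over all orderings of set2, only as a cross-check on small input.
brute_best = min(permutations(set2), key=lambda p: diff_std(sorted_s1, p))

print(sorted_s1)
print(sorted_s2)
print(diff_std(sorted_s1, sorted_s2), diff_std(sorted_s1, brute_best))
```

Sorting costs O(n log n) instead of O(n!), at the price of reordering both inputs, so keep track of the original positions if they matter.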
With your dataframes, you need to do the same thing (almost):
import pandas as pd

df1 = pd.DataFrame({'col1': ["A", "B", "C", "C"], 'col2': [300, 420, 541, 600]})
df2 = pd.DataFrame({'col1': ["A", "C", "B", "D"], 'col2': [1470, 1250, 1250, 1360]})
A couple of differences:
- You want to put the data from df2 into df1, so let’s permute df2.col2 instead of permuting the first set like we did above.
- We don’t really need to get the first set’s values again, since we never permute them.
- Pandas columns are Series objects, which inherently support element-wise subtraction and standard deviation, so we can define the diff_std function to account for this:
def diff_std(s1, s2):
    return (s1 - s2).std()

set2_perms = permutations(df2.col2.values)
min_diff_s2 = min(set2_perms,
                  key=lambda x: diff_std(df1.col2, x))
In diff_std, s2 is a tuple and s1 is a pd.Series object. Subtracting a tuple from a Series does element-wise subtraction and returns a Series. Then we take the std() of that resulting Series object.
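To make that step concrete, here is a small standalone check using the values from the question (the variable names `perm` and `diff` are only for illustration). It subtracts one candidate ordering, held as a plain tuple, from a Series.

```python
import pandas as pd

s1 = pd.Series([300, 420, 541, 600])
perm = (1250, 1250, 1360, 1470)  # one candidate ordering, as a plain tuple

# Element-wise subtraction; the tuple must have the same length as the Series.
diff = s1 - perm

print(type(diff).__name__)
print(diff.tolist())
print(diff.std())  # Series.std() defaults to ddof=1, like statistics.stdev
```

Because `Series.std()` uses the sample standard deviation by default, it agrees with the `statistics.stdev` version of diff_std used for the tuples.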
We don’t really care about the first set, since we didn’t permute it.
Now that we have min_diff_s2, we can set the column of df1:
df1["col3"] = min_diff_s2
Which results in:
  col1  col2  col3
0    A   300  1250
1    B   420  1250
2    C   541  1360
3    C   600  1470
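Regarding Update2: joining on the values in min_diff_s2 looks risky here, because 1250 appears twice in df2 and a value-based join cannot tell those two rows apart. One way around it, sketched under the assumption that both frames have the same number of rows, is to permute the row positions of df2 instead of its values, so each value stays tied to its own col1. The column name `col3_label` is made up for the example.

```python
from itertools import permutations

import pandas as pd

df1 = pd.DataFrame({'col1': ["A", "B", "C", "C"], 'col2': [300, 420, 541, 600]})
df2 = pd.DataFrame({'col1': ["A", "C", "B", "D"], 'col2': [1470, 1250, 1250, 1360]})

left = df1["col2"].to_numpy()
right = df2["col2"].to_numpy()

# Permute row positions of df2 rather than its values, so every chosen value
# keeps its own row and therefore its own col1 label.
best_order = min(
    permutations(range(len(df2))),
    key=lambda order: (right[list(order)] - left).std(ddof=1),
)

# Reorder df2 by the winning positions and line it up with df1 row by row.
matched = df2.iloc[list(best_order)].reset_index(drop=True)
result = df1.assign(col3=matched["col2"], col3_label=matched["col1"])

print(result)
```

The two 1250 rows are interchangeable as far as the standard deviation goes, so which of them lands in which slot is simply whichever ordering `min` meets first.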