Doing simple operations with itertools combinatorics?
Question:
I have a pandas DataFrame with the following structure:
cluster  pts  lon  lat
0        5    45   24
1        6    47   23
2        10   45   20
The columns are the cluster ID, the number of points in the cluster, and the cluster's representative longitude and latitude. The whole DataFrame has 140 clusters.
For every pair of clusters I would like to compute the following weight:
$$\mathrm{weight}(i, j) = -\frac{n_i + n_j}{\mathrm{dist}(i, j)}$$
where i and j refer to two different clusters and n is the number of points (pts) in a cluster.
The numerator is the sum of the points in cluster i and cluster j. The denominator is the haversine distance between the two clusters, computed from their representative coordinates.
I started writing code that uses itertools, but I don't know how to continue. Any ideas?
from itertools import combinations

for c in combinations(df['cluster'], 2):
    sum_pts =
    distance =
    weight = -(sum_pts / distance)
    print(c, weight)
Answers:
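As you mentioned, you can use itertools to generate the combinations.
To calculate the distance you can use geopy.distance.distance, which computes the geodesic distance by default (geopy.distance.great_circle gives the spherical great-circle distance instead). Refer to the documentation for details: https://geopy.readthedocs.io/en/stable/#module-geopy.distance
This should work when the cluster IDs match the DataFrame index: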
from itertools import combinations
from geopy.distance import distance

for p1, p2 in combinations(df['cluster'], 2):
    # p1 and p2 are cluster IDs, used here as index labels
    sum_pts = df.loc[p1, 'pts'] + df.loc[p2, 'pts']
    # distance in km; geopy expects (latitude, longitude) order
    dist = distance(df.loc[p1, ['lat', 'lon']], df.loc[p2, ['lat', 'lon']]).km
    weight = -sum_pts / dist
    print((p1, p2), weight)
Edit: for the case where cluster IDs don't necessarily correspond to the index:
for c1, c2 in combinations(df['cluster'], 2):
    # look up each cluster's row by its ID rather than by index label
    p1, p2 = df[df['cluster'] == c1].iloc[0], df[df['cluster'] == c2].iloc[0]
    sum_pts = p1['pts'] + p2['pts']
    dist = distance((p1['lat'], p1['lon']), (p2['lat'], p2['lon'])).km
    weight = -sum_pts / dist
    print((c1, c2), weight)
Output:
(0, 1) -0.04733881547464973
(0, 2) -0.033865977446857085
(1, 2) -0.04086856230889897
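Because the question asks specifically for the haversine distance, the same loop can also be written without geopy. The following is a minimal sketch rather than part of the answer above: the helper haversine_km, the constant EARTH_RADIUS_KM (6371 km, a commonly used mean radius) and the output column names are assumptions made for illustration, and the sample data is the three-row table from the question.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Sample rows from the question
df = pd.DataFrame({
    'cluster': [0, 1, 2],
    'pts': [5, 6, 10],
    'lon': [45, 47, 45],
    'lat': [24, 23, 20],
})

rows = []
# Iterate over unordered pairs of rows, so each (i, j) appears once
for a, b in combinations(df.itertuples(index=False), 2):
    dist = haversine_km(a.lat, a.lon, b.lat, b.lon)
    rows.append({
        'cluster_i': a.cluster,
        'cluster_j': b.cluster,
        'weight': -(a.pts + b.pts) / dist,
    })

# One row per pair, easier to reuse than printed output
weights = pd.DataFrame(rows)
print(weights)
```

Iterating over rows instead of cluster IDs avoids any dependence on the index. The weights should differ slightly from the geodesic figures above, because the haversine formula treats the Earth as a sphere.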
If you care about performance, you may want to use a cross merge (how="cross" requires pandas 1.2 or later) and vectorized operations instead of a Python loop.
import numpy as np
import pandas as pd

def haversine_distance(lat1, lon1, lat2, lon2):
    # Inputs are in degrees; works elementwise on arrays and Series
    R = 6372800  # Earth radius in meters
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlambda / 2) ** 2
    return 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# lon/lat values as given in the question's table
df = pd.DataFrame({
    'cluster': [0, 1, 2],
    'pts': [5, 6, 10],
    'lon': [45, 47, 45],
    'lat': [24, 23, 20],
})
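# Pair every cluster with every other cluster
df = pd.merge(df, df, suffixes=('_1', '_2'), how="cross")
# Drop self-pairs; this keeps both (i, j) and (j, i), use `<` instead of `!=` for unique pairs
df = df[df['cluster_1'] != df['cluster_2']]
# Parentheses are needed so the sum of points, not only pts_2, is divided by the distance.
# haversine_distance returns meters; divide by 1000 for km.
df["weight"] = -(df['pts_1'] + df['pts_2']) / haversine_distance(df['lat_1'], df['lon_1'], df['lat_2'], df['lon_2'])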
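If only the unique pairs are needed, as with itertools.combinations, one alternative is to build the full distance matrix with NumPy broadcasting and keep its upper triangle. This is a sketch under assumptions, not the method from the answer above: the function haversine_matrix_km, its radius_km default of 6371 km and the output column names are hypothetical, and the sample data is the table from the question.

```python
import numpy as np
import pandas as pd


def haversine_matrix_km(lat, lon, radius_km=6371.0):
    """Pairwise great-circle distances in km for coordinates in degrees."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    # Broadcasting a column against a row gives all pairwise differences
    dphi = lat[:, None] - lat[None, :]
    dlambda = lon[:, None] - lon[None, :]
    a = (np.sin(dphi / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlambda / 2) ** 2)
    # Clip guards against tiny floating-point overshoots outside [0, 1]
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


clusters = pd.DataFrame({
    'cluster': [0, 1, 2],
    'pts': [5, 6, 10],
    'lon': [45, 47, 45],
    'lat': [24, 23, 20],
})

dist = haversine_matrix_km(clusters['lat'], clusters['lon'])
pts = clusters['pts'].to_numpy()
ids = clusters['cluster'].to_numpy()

# Indices strictly above the diagonal: each unordered pair exactly once
i, j = np.triu_indices(len(clusters), k=1)

pairs = pd.DataFrame({
    'cluster_i': ids[i],
    'cluster_j': ids[j],
    'weight': -(pts[i] + pts[j]) / dist[i, j],
})
print(pairs)
```

For 140 clusters this gives 140 * 139 / 2 = 9730 rows, half of what the cross merge keeps after dropping self-pairs.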