sklearn clustering with custom metric: pairwise_distances throwing error
Question:
I would like to cluster sets of spatial data using my own metric. The data comes as pairs of (x,y)
values in a dataframe, where each set of pairs has an id
. Like in the following example where I have three sets of points:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1] * 4 + [2] * 5 + [3] * 3,
'x': np.random.random(12),
'y': np.random.random(12)})
df['xy'] = df[['x','y']].apply(lambda row: [row['x'],row['y']], axis = 1)
Here is the distance function I would like to use:
from scipy.spatial.distance import directed_hausdorff
def some_distance(u, v):
return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
This function computes the Hausdorff distance, i.e. the distance between two subsets u
and v
of n
-dimensional space. In my case, I would like to use this distance function to cluster subsets of the real plane. In the data above there are three such subsets (id
s from 1 to 3) so the resulting distance matrix should be 3×3.
My idea for the clustering step was to use sklearn.cluster.AgglomerativeClustering
with a precomputed metric, which in turn I want to compute with sklearn.metrics.pairwise import pairwise_distances
.
from sklearn.metrics.pairwise import pairwise_distances
def to_np_array(col):
return np.array(list(col.values))
X = df.groupby('id')['xy'].apply(to_np_array).as_matrix()
m = pairwise_distances(X, X, metric=some_distance)
However, the last line is giving me an error:
ValueError: setting an array element with a sequence.
What does work fine, however, is calling some_distance(X[1], X[2])
.
My hunch is that X
needs to be a different format for pairwise_distances
to work. Any ideas on how to make this work, or how to compute the matrix myself so I can stick it into sklearn.cluster.AgglomerativeClustering
?
The error stack is
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-e34155622595> in <module>
12 def some_distance(u, v):
13 return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
---> 14 m = pairwise_distances(X, X, metric=some_distance)
C:ProgramDataAnaconda3libsite-packagessklearnmetricspairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1430 func = partial(distance.cdist, metric=metric, **kwds)
1431
-> 1432 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1433
1434
C:ProgramDataAnaconda3libsite-packagessklearnmetricspairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1065
1066 if effective_n_jobs(n_jobs) == 1:
-> 1067 return func(X, Y, **kwds)
1068
1069 # TODO: in some cases, backend='threading' may be appropriate
C:ProgramDataAnaconda3libsite-packagessklearnmetricspairwise.py in _pairwise_callable(X, Y, metric, **kwds)
1079 """Handle the callable case for pairwise_{distances,kernels}
1080 """
-> 1081 X, Y = check_pairwise_arrays(X, Y)
1082
1083 if X is Y:
C:ProgramDataAnaconda3libsite-packagessklearnmetricspairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
106 if Y is X or Y is None:
107 X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
--> 108 warn_on_dtype=warn_on_dtype, estimator=estimator)
109 else:
110 X = check_array(X, accept_sparse='csr', dtype=dtype,
C:ProgramDataAnaconda3libsite-packagessklearnutilsvalidation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
525 try:
526 warnings.simplefilter('error', ComplexWarning)
--> 527 array = np.asarray(array, dtype=dtype, order=order)
528 except ComplexWarning:
529 raise ValueError("Complex data not supportedn"
C:ProgramDataAnaconda3libsite-packagesnumpycorenumeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: setting an array element with a sequence.
Answers:
Try this:
import numpy as np
import pandas as pd
from scipy.spatial.distance import directed_hausdorff
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import AgglomerativeClustering
df = pd.DataFrame({'id': [1] * 4 + [2] * 5 + [3] * 3, 'x':
np.random.random(12), 'y': np.random.random(12)})
df['xy'] = df[['x','y']].apply(lambda row: [row['x'],row['y']], axis = 1)
df.groupby('id')['xy'].apply(to_np_array)
def some_distance(u, v):
return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
def to_np_array(col):
return np.array(list(col.values))
X = df.groupby('id')['xy'].apply(to_np_array)
d = np.zeros((len(X),len(X)))
for i, u in enumerate(X):
for j, v in list(enumerate(X))[i:]:
d[i,j] = some_distance(u,v)
d[j,i] = d[i,j]
And now when you print d
you get this:
array([[0. , 0.58928274, 0.40767213],
[0.58928274, 0. , 0.510095 ],
[0.40767213, 0.510095 , 0. ]])
And for clustering:
cluster = AgglomerativeClustering(n_clusters=2, affinity='precomputed', linkage = 'average')
cluster.fit(d)
It would help if you showed some of the variables. Fortunately you gave enough code to run it. For example the dataframe:
In [9]: df
Out[9]:
id x y xy
0 1 0.428437 0.267264 [0.42843730501201727, 0.2672637429997736]
1 1 0.944687 0.023323 [0.9446872371859233, 0.023322969159167317]
2 1 0.091055 0.683154 [0.09105472832178496, 0.6831542985617349]
3 1 0.474522 0.313541 [0.4745222021519122, 0.3135405569298565]
4 2 0.835237 0.491541 [0.8352366339973815, 0.4915408434083248]
5 2 0.905918 0.854030 [0.9059178939221513, 0.8540297797160584]
6 2 0.182154 0.909656 [0.18215390836391654, 0.9096555360282939]
7 2 0.225270 0.522193 [0.22527013482912195, 0.5221926076838651]
8 2 0.924208 0.858627 [0.9242076604008371, 0.8586274362498842]
9 3 0.419813 0.634741 [0.41981292371175905, 0.6347409684931891]
10 3 0.954141 0.795452 [0.9541413559045294, 0.7954524369652217]
11 3 0.896593 0.271187 [0.8965932351250882, 0.2711872631673109]
And your X
:
In [10]: X
Out[10]:
array([array([[0.42843731, 0.26726374],
[0.94468724, 0.02332297],
[0.09105473, 0.6831543 ],
[0.4745222 , 0.31354056]]),
array([[0.83523663, 0.49154084],
[0.90591789, 0.85402978],
[0.18215391, 0.90965554],
[0.22527013, 0.52219261],
[0.92420766, 0.85862744]]),
array([[0.41981292, 0.63474097],
[0.95414136, 0.79545244],
[0.89659324, 0.27118726]])], dtype=object)
That is a (3,) object array – in effect a list of 3 2d arrays, with different sizes ((3,2),(5,2),(4,2)). That’s one array for each group.
How is pairwise
supposed to feed that to your distance code? pairwise
docs says X
should be a (n,m)
array – n samples, m features. Your X
doesn’t fit that description!
The error is probably produced by when trying to make a float array from X
:
In [12]: np.asarray(X,dtype=float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-a6e08bb1590c> in <module>
----> 1 np.asarray(X,dtype=float)
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: setting an array element with a sequence.
I would like to cluster sets of spatial data using my own metric. The data comes as pairs of (x,y)
values in a dataframe, where each set of pairs has an id
. Like in the following example where I have three sets of points:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1] * 4 + [2] * 5 + [3] * 3,
'x': np.random.random(12),
'y': np.random.random(12)})
df['xy'] = df[['x','y']].apply(lambda row: [row['x'],row['y']], axis = 1)
Here is the distance function I would like to use:
from scipy.spatial.distance import directed_hausdorff
def some_distance(u, v):
return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
This function computes the Hausdorff distance, i.e. the distance between two subsets u
and v
of n
-dimensional space. In my case, I would like to use this distance function to cluster subsets of the real plane. In the data above there are three such subsets (id
s from 1 to 3) so the resulting distance matrix should be 3×3.
My idea for the clustering step was to use sklearn.cluster.AgglomerativeClustering
with a precomputed metric, which in turn I want to compute with sklearn.metrics.pairwise import pairwise_distances
.
from sklearn.metrics.pairwise import pairwise_distances
def to_np_array(col):
return np.array(list(col.values))
X = df.groupby('id')['xy'].apply(to_np_array).as_matrix()
m = pairwise_distances(X, X, metric=some_distance)
However, the last line is giving me an error:
ValueError: setting an array element with a sequence.
What does work fine, however, is calling some_distance(X[1], X[2])
.
My hunch is that X
needs to be a different format for pairwise_distances
to work. Any ideas on how to make this work, or how to compute the matrix myself so I can stick it into sklearn.cluster.AgglomerativeClustering
?
The error stack is
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-e34155622595> in <module>
12 def some_distance(u, v):
13 return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
---> 14 m = pairwise_distances(X, X, metric=some_distance)
C:ProgramDataAnaconda3libsite-packagessklearnmetricspairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1430 func = partial(distance.cdist, metric=metric, **kwds)
1431
-> 1432 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1433
1434
C:ProgramDataAnaconda3libsite-packagessklearnmetricspairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1065
1066 if effective_n_jobs(n_jobs) == 1:
-> 1067 return func(X, Y, **kwds)
1068
1069 # TODO: in some cases, backend='threading' may be appropriate
C:ProgramDataAnaconda3libsite-packagessklearnmetricspairwise.py in _pairwise_callable(X, Y, metric, **kwds)
1079 """Handle the callable case for pairwise_{distances,kernels}
1080 """
-> 1081 X, Y = check_pairwise_arrays(X, Y)
1082
1083 if X is Y:
C:ProgramDataAnaconda3libsite-packagessklearnmetricspairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
106 if Y is X or Y is None:
107 X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
--> 108 warn_on_dtype=warn_on_dtype, estimator=estimator)
109 else:
110 X = check_array(X, accept_sparse='csr', dtype=dtype,
C:ProgramDataAnaconda3libsite-packagessklearnutilsvalidation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
525 try:
526 warnings.simplefilter('error', ComplexWarning)
--> 527 array = np.asarray(array, dtype=dtype, order=order)
528 except ComplexWarning:
529 raise ValueError("Complex data not supportedn"
C:ProgramDataAnaconda3libsite-packagesnumpycorenumeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: setting an array element with a sequence.
Try this:
import numpy as np
import pandas as pd
from scipy.spatial.distance import directed_hausdorff
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import AgglomerativeClustering
df = pd.DataFrame({'id': [1] * 4 + [2] * 5 + [3] * 3, 'x':
np.random.random(12), 'y': np.random.random(12)})
df['xy'] = df[['x','y']].apply(lambda row: [row['x'],row['y']], axis = 1)
df.groupby('id')['xy'].apply(to_np_array)
def some_distance(u, v):
return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
def to_np_array(col):
return np.array(list(col.values))
X = df.groupby('id')['xy'].apply(to_np_array)
d = np.zeros((len(X),len(X)))
for i, u in enumerate(X):
for j, v in list(enumerate(X))[i:]:
d[i,j] = some_distance(u,v)
d[j,i] = d[i,j]
And now when you print d
you get this:
array([[0. , 0.58928274, 0.40767213],
[0.58928274, 0. , 0.510095 ],
[0.40767213, 0.510095 , 0. ]])
And for clustering:
cluster = AgglomerativeClustering(n_clusters=2, affinity='precomputed', linkage = 'average')
cluster.fit(d)
It would help if you showed some of the variables. Fortunately you gave enough code to run it. For example the dataframe:
In [9]: df
Out[9]:
id x y xy
0 1 0.428437 0.267264 [0.42843730501201727, 0.2672637429997736]
1 1 0.944687 0.023323 [0.9446872371859233, 0.023322969159167317]
2 1 0.091055 0.683154 [0.09105472832178496, 0.6831542985617349]
3 1 0.474522 0.313541 [0.4745222021519122, 0.3135405569298565]
4 2 0.835237 0.491541 [0.8352366339973815, 0.4915408434083248]
5 2 0.905918 0.854030 [0.9059178939221513, 0.8540297797160584]
6 2 0.182154 0.909656 [0.18215390836391654, 0.9096555360282939]
7 2 0.225270 0.522193 [0.22527013482912195, 0.5221926076838651]
8 2 0.924208 0.858627 [0.9242076604008371, 0.8586274362498842]
9 3 0.419813 0.634741 [0.41981292371175905, 0.6347409684931891]
10 3 0.954141 0.795452 [0.9541413559045294, 0.7954524369652217]
11 3 0.896593 0.271187 [0.8965932351250882, 0.2711872631673109]
And your X
:
In [10]: X
Out[10]:
array([array([[0.42843731, 0.26726374],
[0.94468724, 0.02332297],
[0.09105473, 0.6831543 ],
[0.4745222 , 0.31354056]]),
array([[0.83523663, 0.49154084],
[0.90591789, 0.85402978],
[0.18215391, 0.90965554],
[0.22527013, 0.52219261],
[0.92420766, 0.85862744]]),
array([[0.41981292, 0.63474097],
[0.95414136, 0.79545244],
[0.89659324, 0.27118726]])], dtype=object)
That is a (3,) object array – in effect a list of 3 2d arrays, with different sizes ((3,2),(5,2),(4,2)). That’s one array for each group.
How is pairwise
supposed to feed that to your distance code? pairwise
docs says X
should be a (n,m)
array – n samples, m features. Your X
doesn’t fit that description!
The error is probably produced by when trying to make a float array from X
:
In [12]: np.asarray(X,dtype=float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-a6e08bb1590c> in <module>
----> 1 np.asarray(X,dtype=float)
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: setting an array element with a sequence.