Is there an easy way to use DBSCAN in Python with more than two dimensions?
Question:
I’ve been working on a machine learning project using clustering algorithms, and I’m looking into using scikit-learn’s DBSCAN implementation based on the data that I’m working with. However, whenever I try to run it with my feature arrays, it throws the following error:
ValueError: Found array with dim 3. Estimator expected <= 2.
This gives me the impression that scikit-learn’s DBSCAN only supports two-dimensional features. Am I wrong in thinking this? If not, is there an implementation of DBSCAN that supports higher-dimensional feature arrays? Thanks for any help you can offer.
Edit
Here’s the code that I’m using for my DBSCAN script. The idea is to read data from a number of different CSVs, save them into an array, and then dump them into a pickle file so that the model can load them in the future and run DBSCAN.
import pickle
import pandas as pd
from sklearn.cluster import DBSCAN

def get_clusters(fileList, arraySavePath):
    # Create empty array
    fitting = []
    # Get values from all files, save to singular array
    for filePath in fileList:
        df = pd.read_csv(filePath, usecols=use_cols)
        fitting.append(df.values.tolist())
    # Save array to its own pickle file
    with open(arraySavePath, "wb") as fp:
        pickle.dump(fitting, fp)

def predict_cluster(modelPath, predictInput):
    # Load the cluster data
    with open(modelPath, "rb") as fp:
        fitting = pickle.load(fp)
    # DBSCAN fit
    clustering = DBSCAN(eps=3, min_samples=2)
    clustering.fit(fitting)
    # Predict the label
    return clustering.predict_fit(predictInput)
Answers:
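I believe the issue is the shape of the array you pass to fit(), not the number of features. scikit-learn estimators, DBSCAN included, expect a 2-D array of shape (n_samples, n_features), and n_features can be as large as you like. "Found array with dim 3" means the array itself has three axes: because get_clusters appends each file’s rows as a separate nested list, fitting ends up with shape (n_files, n_rows, n_features). Stack the rows from all files into a single 2-D array before fitting, for example by calling fitting.extend(...) instead of fitting.append(...). As for "min_samples": a common rule of thumb is to set it to at least the number of features plus one, but that is a tuning guideline rather than a requirement, and "min_samples=2" does not cause this error.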
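The "dim 3" in the error message refers to the number of axes of the array passed to fit(), so it is worth checking the shape of fitting before clustering. Here is a minimal sketch, assuming each CSV yields rows with the same three feature columns; the per_file_rows data is made up for illustration and stands in for what the loop over the files collects.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for the rows read from two CSV files,
# each row holding 3 features.
per_file_rows = [
    [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5], [1.0, 1.0, 1.0]],
    [[10.0, 10.0, 10.0], [10.5, 10.5, 10.5]],
]

# Stack the rows of all files into ONE 2-D array of shape
# (n_samples, n_features) instead of keeping one nested list per file.
fitting = np.vstack([np.asarray(rows) for rows in per_file_rows])
print(fitting.shape)  # (5, 3): five samples, three features

# DBSCAN accepts any number of features as long as the input is 2-D.
clustering = DBSCAN(eps=3, min_samples=2)
labels = clustering.fit_predict(fitting)
print(labels)
```

With this toy data the first three rows fall into one cluster and the last two into another. The same stacking works for any number of feature columns.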
I have an example of DBSCAN on my blog.
import statsmodels.api as sm
import pandas as pd
from matplotlib import pyplot
from sklearn.cluster import KMeans, DBSCAN

# load the mtcars dataset (downloaded from the Rdatasets collection)
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()

# define dataset: two columns for the KMeans plot
# (.copy() so that adding a column below does not modify a view of df_cars)
X = df_cars[['mpg', 'hp']].copy()
# define the model
model = KMeans(n_clusters=8)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
X['kmeans'] = yhat
pyplot.scatter(X['mpg'], X['hp'], c=X['kmeans'], cmap='rainbow', s=50, alpha=0.8)
pyplot.show()

# DBSCAN on all columns of df_cars, i.e. more than two features
# eps is measured in the units of the features; on unscaled columns a value
# this small may label most rows as noise (-1), so consider scaling first
model = DBSCAN(eps=0.30, min_samples=9)
# fit and get the cluster label of each row
label = model.fit_predict(df_cars)
label
df_cars['dbscan'] = label
df_cars
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
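One more point about the code in the question: predict_cluster calls clustering.predict_fit, which scikit-learn's DBSCAN does not provide. fit_predict labels only the rows it is fitted on, and there is no separate predict for unseen points. One possible workaround, sketched here under the assumption of the default Euclidean metric, is to give a new point the label of its nearest core sample when that sample lies within eps, and -1 otherwise. assign_to_cluster is a hypothetical helper, not part of scikit-learn, and the toy data is made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def assign_to_cluster(model, new_points):
    """Hypothetical helper: label each new point with the cluster of its
    nearest core sample, or -1 (noise) if none lies within model.eps."""
    new_points = np.atleast_2d(np.asarray(new_points, dtype=float))
    labels = np.full(len(new_points), -1, dtype=int)
    if len(model.components_) == 0:
        return labels  # fitting found no core samples at all
    # Cluster label of every core sample, aligned with model.components_
    core_labels = model.labels_[model.core_sample_indices_]
    for i, point in enumerate(new_points):
        # Euclidean distance from this point to every core sample
        distances = np.linalg.norm(model.components_ - point, axis=1)
        nearest = np.argmin(distances)
        if distances[nearest] <= model.eps:
            labels[i] = core_labels[nearest]
    return labels

# Toy data with four features per sample
X = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.2, 0.1, 0.0, 0.1],
    [0.1, 0.2, 0.1, 0.0],
    [5.0, 5.0, 5.0, 5.0],
    [5.1, 5.0, 5.2, 5.1],
    [5.0, 5.1, 5.0, 5.2],
])
model = DBSCAN(eps=0.5, min_samples=2).fit(X)

new_labels = assign_to_cluster(
    model,
    [[0.1, 0.1, 0.1, 0.1], [5.0, 5.0, 5.1, 5.1], [20.0, 20.0, 20.0, 20.0]],
)
print(new_labels)
```

This keeps the fitted clusters fixed, unlike refitting with the new point included, which can merge or split clusters. Whether that is the behaviour you want depends on the project.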