cluster-analysis

Running regression with clusters from k-means

Question: I ran k-means clustering with the code below:

    X_std = StandardScaler().fit_transform(df_logret)
    km = KMeans(n_clusters=2, max_iter=100)
    km.fit(X_std)
    centroids = km.cluster_centers_

I’d like to put cluster 1 in x_1 and cluster 2 in x_2 and run a regression of the form y = a*x_1 + b*x_2. I’ve been …

Total answers: 1

Efficient Method to Group by Columns

Question: I’ve been trying to deduplicate strings of names over a very, very large dataset. I’ve been using the recordlinkage library. However, while it does generate a nice list of paired indices, it does not provide any way to re-group them. I’ve run several similarity measures on the strings, …

Total answers: 1

Split list of strings into groups of "similar strings"

Question: I have a list of strings (about 300 strings). Some of those strings start with the same characters. For example: "ACTUATOR_alim5V0ResetErrorRatio_1", "ACTUATOR_alim5V0ResetErrorRatio_2", "ACTUATOR_alim5V0ResetErrorSensors_1", "SENSOR_inputPwm2DutyCycle", "SENSOR_inputPwm2DigitalLevel", "SENSOR_inputPwm2RisingEdgeCounter", etc… I want to group strings starting with the same characters. For example: "ACTUATOR_alim5V0ResetErrorRatio_1", "ACTUATOR_alim5V0ResetErrorRatio_2", "ACTUATOR_alim5V0ResetErrorSensors_1" will belong to …

Total answers: 2
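
One possible way to approach the grouping asked about above, as a minimal sketch. It assumes that "starting with the same characters" means sharing the text before the first underscore, which the excerpt does not confirm. The function name `group_by_prefix` and the `sep` parameter are made up for illustration.

```python
from collections import defaultdict


def group_by_prefix(names, sep="_"):
    """Group names by the text before the first `sep` (an assumed grouping rule)."""
    groups = defaultdict(list)
    for name in names:
        # split(sep, 1)[0] is the whole name when `sep` does not occur in it
        prefix = name.split(sep, 1)[0]
        groups[prefix].append(name)
    return dict(groups)


names = [
    "ACTUATOR_alim5V0ResetErrorRatio_1",
    "ACTUATOR_alim5V0ResetErrorRatio_2",
    "ACTUATOR_alim5V0ResetErrorSensors_1",
    "SENSOR_inputPwm2DutyCycle",
    "SENSOR_inputPwm2DigitalLevel",
    "SENSOR_inputPwm2RisingEdgeCounter",
]
groups = group_by_prefix(names)
```

If the groups should be finer (for example, one group per signal name rather than per module), the rule inside the loop is the only part that would need to change.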

Clustering geospatial data on coordinates AND non-spatial feature

Question: Say I have the following dataframe stored as a variable called coordinates, where the first few rows look like:

       business_lat  business_lng  business_rating
    0     19.111841     72.910729               5.
    1     19.111342     72.908387               5.
    2     19.111342     72.908387               4.
    3     19.137815     72.914085               5.
    4     19.119677     72.905081               2.
    5     19.119677     72.905081 …

Total answers: 2

'KMeansModel' object has no attribute 'computeCost' in apache pyspark

Question: I’m experimenting with a clustering model in pyspark. I’m trying to get the mean squared cost of the cluster fit for different values of K:

    def meanScore(k, df):
        inputCols = df.columns[:38]
        assembler = VectorAssembler(inputCols=inputCols, outputCol="features")
        kmeans = KMeans().setK(k)
        pipeModel2 = Pipeline(stages=[assembler, kmeans])
        kmeansModel = pipeModel2.fit(df).stages[-1]
        kmeansModel.computeCost(assembler.transform(df)) / df.count()

When I …

Total answers: 5

Getting the center point of a cluster for latitude and longitude in Python

Question: I have a list of coordinates that have areas mapped out as follows:

    df = pd.DataFrame({
        'user_id': [55,55,356,356,356,356,632,752,938,963,963,1226,2663,2663,2663,2663,2663,3183,3197,3344,3387,3387,3387,3387,3396,3515,3536,3570,3819,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,4584,4594,4713,4931,4931,5026,5487,5487,5575,5575,5575,5602,5639,5639,5639,5639,5783,5783,5783,5783,5783,5801,6373,6718,6886,6886,7055,7055,7608,7608,7777,8186,8186,8307,8712,9271,9896,9991,9991,9991,],
        'latitude': [13.2633943,13.2633964,12.809677124,12.8099212646,12.8100585938,12.810065981,12.9440132,12.2958104,12.5265661,13.0767648,13.0853577,12.6301221,12.8558120728,12.8558349609,12.8558654785,12.8558807373,12.8558959961,12.9141417,13.0696411133,13.0708333,10.7904833,10.7904833,10.7904833,12.884091,13.0694428,13.204637,12.6922086,13.0767648,13.3489958,12.8653798,12.8654014,12.8654124,12.8654448,12.8654521,12.8654658,12.8654733,12.8654815,12.8654844,12.8655367,12.8655376,12.865576,12.4025539,13.1986348,12.9548317,11.664325,11.6690603,13.0656551,13.1137554,13.1137978,12.770418,12.9141417,12.9141417,15.3530727,12.8285405054,12.8285925,12.8288406,12.829668,12.2958104,12.5583190918,12.7367172241,12.7399597168,12.7422103882,12.8631981,13.3378762,12.5638375681,13.1961683,13.1993678,12.1210997,12.5265661,13.1332778931,13.13331604,12.1210997,13.0649372,13.0658797,12.6955714,12.8213806152,13.0641708374,13.2013835,13.1154662,13.1957473755,13.2329025269,],
        'longitude': [75.4341412,75.4341377,77.6955155017,77.6952344177,77.6952628334,77.6952629697,75.7926285,76.6393805,78.2149575,77.6397007,77.6445166,77.1145378,77.7985897361,77.7985953164,77.798622112,77.7985610742,77.7986275271,74.8559568,77.6520116309,77.6519444,78.7046725,78.7046725,78.7046725,74.8372421,77.6523596,77.6506622,78.6181131,77.6397007,74.7855559,77.7972191,77.7971733,77.7971429,77.7971621,77.7970823,77.7970327,77.7970371,77.7972272,77.7970335,77.7969649,77.796956,77.7971244,75.9811564,77.7065928,77.4739615,78.1460142,78.139311,77.4380296,77.5732437,77.573201,74.8609332,74.8559568,74.8559568,75.1386825,77.6891233027,77.6899376,77.6892531,77.6902955,76.6393805,77.7842363745,77.7841222429,77.7837989946,77.7830295359,77.4336428,77.117325,75.5833357573,77.7053231,77.7095658,78.1582143,78.2149575,77.5728687166,77.5729374436,78.1582143,77.7435873,77.7444939,78.0620963,77.6606095672,77.746332751,77.7082838,77.6069977,77.7055573788,77.6956690934,],
    })

For the following latitude/longitude pairs I am using DBSCAN to cluster them:

    X = np.array(df[['latitude', 'longitude']])
    kms_per_radian = 6371.0088
    epsilon = 1 / kms_per_radian …

Total answers: 3
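
For readers wondering what a "center point" of a latitude/longitude cluster could look like, here is a minimal sketch using only the standard library. It assumes the center is taken as the mean direction on a sphere: each point is converted to a 3D unit vector, the vectors are averaged, and the result is converted back to degrees. This is one common definition, not necessarily the one used in the answers, and the function name `geographic_center` is hypothetical. The two sample points are the first two rows of the question's data.

```python
import math


def geographic_center(points):
    """Return (lat, lon) in degrees of the mean direction of `points`.

    `points` is a non-empty sequence of (latitude, longitude) pairs in degrees.
    """
    x = y = z = 0.0
    for lat_deg, lon_deg in points:
        lat, lon = math.radians(lat_deg), math.radians(lon_deg)
        # Unit vector on the sphere for this point
        x += math.cos(lat) * math.cos(lon)
        y += math.cos(lat) * math.sin(lon)
        z += math.sin(lat)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    # Convert the averaged vector back to angles
    lon_c = math.atan2(y, x)
    lat_c = math.atan2(z, math.hypot(x, y))
    return math.degrees(lat_c), math.degrees(lon_c)


cluster = [(13.2633943, 75.4341412), (13.2633964, 75.4341377)]
center = geographic_center(cluster)
```

For clusters as tight as the ones in the question, a plain arithmetic mean of the latitudes and longitudes gives nearly the same result; the vector form mainly matters for widely spread points or points near the antimeridian.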

How to get the centroids in DBSCAN sklearn?

Question: I am using DBSCAN for clustering. Now I want to pick a point from each cluster that represents it, but I realized that DBSCAN does not have centroids as k-means does. However, I noticed that DBSCAN has something called core points. I am thinking if …

Total answers: 2
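
As an illustration of picking one representative point per cluster, here is a minimal sketch that selects the medoid of each cluster, meaning the member with the smallest total distance to the other members. This is a swapped-in technique and not something DBSCAN itself provides. It assumes Euclidean distance and a list of labels in which noise is marked as -1 (the convention used by scikit-learn's `labels_`); the function name `cluster_medoids` and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def cluster_medoids(points, labels, noise_label=-1):
    """Map each cluster label to the member with the smallest total distance to the rest."""
    members = defaultdict(list)
    for point, label in zip(points, labels):
        if label != noise_label:  # skip points flagged as noise
            members[label].append(point)
    return {
        label: min(pts, key=lambda p: sum(math.dist(p, q) for q in pts))
        for label, pts in members.items()
    }


# Hypothetical points and labels standing in for a clustering result
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 11.0), (50.0, 50.0)]
labels = [0, 0, 0, 1, 1, -1]
medoids = cluster_medoids(points, labels)
```

Unlike a centroid, a medoid is always an actual data point, which may matter for non-convex clusters. The cost is quadratic in the cluster size, so this is only a sketch for small clusters.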

Is there an easy way to use DBSCAN in python with dimensions higher than 2?

Question: I’ve been working on a machine learning project using clustering algorithms, and I’m looking into using scikit-learn’s DBSCAN implementation based on the data that I’m working with. However, whenever I try to run it with my feature arrays, it …

Total answers: 2

TypeError help, plt.scatter reading my .csv as true/false rather than numerical values

Question: I’m following this article, using my own data and trying to plot a customer’s number of orders against their lifetime spend, when I get this error. I’ve tried removing true/false values from my dataframe and updating relevant packages. TypeError Traceback (most recent call …

Total answers: 4

How do I automate the number of clusters?

Question: I’ve been playing with the below script:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score
    import textract
    import os

    folder_to_scan = '/media/sf_Documents/clustering'
    dict_of_docs = {}

    # Gets all the files to scan with textract
    for root, sub, files in os.walk(folder_to_scan):
        for file …

Total answers: 1