How to identify which region a new point will lie in using scikit-learn (Python)?

Question:

I have sample code taken from the scikit-learn website. I am trying to learn how to classify points using scikit-learn. Here is the code:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.inspection import DecisionBoundaryDisplay

names = [
    "Nearest Neighbors",
]

classifiers = [
    KNeighborsClassifier(3),
]

X, y = make_classification(
    n_features=2, n_redundant=0, n_informative=2, random_state=1, n_clusters_per_class=1
)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [
    linearly_separable,
]

figure = plt.figure(figsize=(27, 9))
i = 1
# iterate over datasets
for ds_cnt, ds in enumerate(datasets):
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42
    )

    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(["#FF0000", "#0000FF"])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    if ds_cnt == 0:
        ax.set_title("Input data")
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k")
    # Plot the testing points
    ax.scatter(
        X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6, edgecolors="k"
    )
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        All_Value_Response = DecisionBoundaryDisplay.from_estimator(
            clf, X, cmap=cm, alpha=0.8, ax=ax, eps=0.5
        )

        # Plot the training points
        ax.scatter(
            X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k"
        )
        # Plot the testing points
        ax.scatter(
            X_test[:, 0],
            X_test[:, 1],
            c=y_test,
            cmap=cm_bright,
            edgecolors="k",
            alpha=0.6,
        )

        ax.set_xlim(x_min, x_max)
        ax.set_ylim(y_min, y_max)
        ax.set_xticks(())
        ax.set_yticks(())
        if ds_cnt == 0:
            ax.set_title(name)
        ax.text(
            x_max - 0.3,
            y_min + 0.3,
            ("%.2f" % score).lstrip("0"),
            size=15,
            horizontalalignment="right",
        )
        i += 1

plt.tight_layout()
plt.show()

Here is the output:

[Output image: input data and the Nearest Neighbors decision regions]

As one can see, the regions formed are not regular shapes, so it is difficult to tell which region a new point will fall in when it arrives. I managed to capture the data of the regions (the All_Value_Response variable stores that information), but it does not seem helpful to me.

So, if I want to know which region the point (1, 3) lies in, how can I deduce that through code? I can do it by looking at the graph, but how do I make it work in code?

Please help me find a solution to my problem.

Asked By: Jaffer Wilson


Answers:

So, you have X_train and X_test. These are both lists containing tuples. The values in the tuples (a, b) have some range, like 0 to 1. In your graphs, these are the x and y coordinates of your dots.

You also have y_train and y_test. These are the known classifications of all the tuples in X_train and X_test. These values can only be 0 or 1, nothing in between. If a dot in your graph lies in the red region, that means the predicted value for that dot (a, b) is 0 (your cm_bright maps class 0 to red). If the dot lies in the blue region, the predicted value is 1.

# if X_train is this
X_train = [(0.0, 0.0), (0.1, 0.9), (0.9, 0.0), (1, 0.9)]

# then y_train has to be this, for your chart
y_train = [0, 0, 1, 1]

If you then train a classifier on this (normally with more data), you can ask it about any point (a, b) and it will tell you 0 or 1 (i.e., red or blue).

So, for example, I predict for a point (a, b) that the classifier has not seen in X_train (i.e., something that is in X_test):

result = clf.predict([(0.2, 0.2)])

result then equals [0]. Looking at your graph, and assuming both the x-axis and y-axis range from 0 to 1, the tuple (0.2, 0.2) falls in the red (class 0) region.

It knows this because it has learned the red/blue classification you see in your graph from X_train and y_train. So when it gets a new tuple, it sees which region that dot falls in and classifies it as 0 or 1: the red region or the blue region.

To summarize: the colored regions show which value will be predicted for any given tuple (a, b). The dot positions in the scatter plot are given by the values (a, b) in the tuple, each in the range 0 to 1. The color is not a range but a classification, 0 or 1.
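
Putting the pieces together, here is a minimal, self-contained sketch of the idea (it reuses the toy data above and the same KNeighborsClassifier(3) as in the question):

from sklearn.neighbors import KNeighborsClassifier

# Toy training data: four (a, b) points with known classes (0 = red, 1 = blue)
X_train = [(0.0, 0.0), (0.1, 0.9), (0.9, 0.0), (1.0, 0.9)]
y_train = [0, 0, 1, 1]

clf = KNeighborsClassifier(3)
clf.fit(X_train, y_train)

# Ask the trained classifier which region a new point falls in
print(clf.predict([(0.2, 0.2)]))  # -> [0], i.e. the class-0 (red) region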

Hopefully it helps, good luck!

Answered By: Bas van der Linden

Well, we can definitely determine which region the new point will lie in, but before we do that I want to call attention to something in your code that is going to come back to bite you.

This line right here, X = StandardScaler().fit_transform(X), will come back to smack you harder than you know.

Remember, the point of StandardScaler is to standardize the data (zero mean, unit variance). Also remember that whatever you do to your training set you must also do to your test set, with the caveat that the operations you perform on the test set must be learned from the training set. I'll give a condensed form of your code to illustrate this.

X, y = datasets[0]

# Save an instance of the standard scaler so we can apply it to unknown values later
standard_scalar = StandardScaler()
standard_scalar.fit(X)  # Fit the "training data"

X = standard_scalar.transform(X)  # Transform the data

# You ended up with a good score because the transformation was applied to the entire dataset "X" before splitting. Normally you'd fit on X_train and transform X_test separately
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Train the classifier
clf = KNeighborsClassifier(3)
clf.fit(X_train, y_train)

score = clf.score(X_test, y_test)  # 0.925. Good score but let's see what happens next

# This is where things will go wrong in your code.
# Instead, make sure you transform the test point you want to predict
>>> clf.predict([[1,3]])
array([0])

>>> clf.predict(standard_scalar.transform([[1,3]]))
array([1])

So even though predict works in both cases, you need to make sure you apply the same transformation to the data point you're passing in.
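
One way to avoid forgetting the transformation altogether (a sketch on my part, not something from the question's code) is to bundle the scaler and the classifier into a scikit-learn Pipeline, so that predict applies the learned scaling automatically:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A dataset like the one in the question
X, y = make_classification(
    n_features=2, n_redundant=0, n_informative=2, random_state=1, n_clusters_per_class=1
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# The scaler is fitted on the training data only; score() and predict()
# then apply that same scaling to any new data automatically
model = make_pipeline(StandardScaler(), KNeighborsClassifier(3))
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # test data is scaled with the training statistics
print(model.predict([[1, 3]]))      # no manual transform needed for new points

This also removes the train/test leak caused by scaling the whole dataset before splitting.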

Answered By: Chrispresso

Try this.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.inspection import DecisionBoundaryDisplay

names = [
    "Nearest Neighbors",
]

classifiers = [
    KNeighborsClassifier(3),
]

X, y = make_classification(
    n_features=2, n_redundant=0, n_informative=2, random_state=1, n_clusters_per_class=1
)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [
    linearly_separable,
]

figure = plt.figure(figsize=(27, 9))
i = 1
# iterate over datasets
for ds_cnt, ds in enumerate(datasets):
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42
    )

    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(["#FF0000", "#0000FF"])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    if ds_cnt == 0:
        ax.set_title("Input data")
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k")
    # Plot the testing points
    ax.scatter(
        X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6, edgecolors="k"
    )
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        All_Value_Response = DecisionBoundaryDisplay.from_estimator(
            clf, X, cmap=cm, alpha=0.8, ax=ax, eps=0.5
        )

        # Plot the training points
        ax.scatter(
            X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k"
        )
        # Plot the testing points
        ax.scatter(
            X_test[:, 0],
            X_test[:, 1],
            c=y_test,
            cmap=cm_bright,
            edgecolors="k",
            alpha=0.6,
        )
        
        # Flatten the decision-surface grid: each grid point (X1[k], Y1[k])
        # carries the classifier's predicted response (region value) in Color[k]
        X1 = All_Value_Response.xx0.ravel()
        Y1 = All_Value_Response.xx1.ravel()
        Color = All_Value_Response.response.ravel()

        Outputs = []

        # For every point, find the nearest grid point and read off its region
        # value -- the same lookup tells you which region any new point is in
        for X2, Y2 in X:
            XD = X2 - X1
            YD = Y2 - Y1
            Distance = (XD * XD) + (YD * YD)
            Color_Gradient = Color[Distance.argmin()]
            Outputs.append(Color_Gradient)

        print(Outputs)
        ax.set_xlim(x_min, x_max)
        ax.set_ylim(y_min, y_max)
        ax.set_xticks(())
        ax.set_yticks(())
        if ds_cnt == 0:
            ax.set_title(name)
        ax.text(
            x_max - 0.3,
            y_min + 0.3,
            ("%.2f" % score).lstrip("0"),
            size=15,
            horizontalalignment="right",
        )
        i += 1

plt.tight_layout()
plt.show()

I found a reference for this in an answer to a related question: https://stackoverflow.com/a/74613354/4948889

The output of the above code looks something like this:
[output of the code]
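
For a single query such as the point (1, 3) from the question, the same nearest-grid-point lookup can be wrapped in a small helper. This is a sketch on my part: it assumes the All_Value_Response display from the loop above is still in scope, and that the query point is expressed in the same standardized feature space as X.

def region_of(display, x, y):
    """Return the response value of the decision-surface grid point
    nearest to (x, y), i.e. which colored region the point falls in."""
    xd = x - display.xx0.ravel()
    yd = y - display.xx1.ravel()
    nearest = (xd * xd + yd * yd).argmin()
    return display.response.ravel()[nearest]

# (1, 3) lies outside the plotted grid, so the nearest edge grid point is
# used; scale the point first if it comes from the original, unscaled data
print(region_of(All_Value_Response, 1.0, 3.0))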
