{ "cells": [ { "cell_type": "markdown", "id": "c0a59dc9-bbf7-428d-8d1b-468a396af4da", "metadata": {}, "source": [ "# Clustering\n", "\n", "Until now we have looked exclusively at supervised methods: to create a model we always had a dataset containing features and a **target** to predict. The goal of those methods was to make a **prediction**, i.e. given a set of new features, predict a variable (continuous as in regression, or categorical as in classification). In clustering, **we don't have the target** in our dataset. Instead, we try to identify sub-groups, or clusters, in our dataset. In scikit-learn terms, when we use supervised learning we always have features ```X``` and targets ```y```, and when we do clustering we only have the features ```X```." ] }, { "cell_type": "markdown", "id": "95aba150-497a-4b2d-87d3-3c0dcc088f1f", "metadata": {}, "source": [ "## Clustering methods\n", "\n", "There are many clustering algorithms, but the general idea is to find sub-groups in our dataset in which data points are close together according to some metric. We'll first look at some artificial data to get the idea. For this we use a scikit-learn function that creates blobs of data:" ] }, { "cell_type": "code", "execution_count": 1, "id": "05020310-6a4f-44a6-89c3-b1ce69792d14", "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 5, "id": "5c273b0b-06c9-4b1d-b5a1-d3d5b965b8e8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | feature1 | feature2 | label |
|---|---|---|---|
| 0 | -2.978672 | 9.556846 | 0 |
| 1 | 3.161357 | 1.253325 | 1 |
| 2 | 3.488885 | 2.348868 | 1 |
| 3 | 4.038172 | 3.825448 | 1 |
| 4 | -1.043549 | 8.788510 | 0 |
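
A table with this layout can be produced along the following lines. This is only a minimal sketch: the number of samples, the number of centers, the random seed, and the variable name `blobs` are assumptions chosen for illustration, so the values will not match the rows shown above.

```python
from sklearn.datasets import make_blobs
import pandas as pd

# Assumed settings (not from the original notebook): 100 points, 3 blobs, fixed seed.
X, y = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)

# Put the two features in a DataFrame and keep the generating blob index as "label".
# Clustering itself only uses the feature columns; "label" is kept for comparison.
blobs = pd.DataFrame(X, columns=["feature1", "feature2"])
blobs["label"] = y

print(blobs.head())
```

The `label` column records which blob each point was drawn from, which makes it possible to check afterwards how well a clustering algorithm recovers the groups.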
KMeans(n_clusters=3)
KMeans(n_clusters=2)
KMeans(n_clusters=2)
MeanShift()
DBSCAN(eps=0.1)
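
The estimators listed above all follow the same scikit-learn pattern: create the model, then fit it on the features ```X``` alone. The sketch below illustrates that pattern on synthetic blobs. The data, the seed, the `n_init` setting, and the names `models` and `labels` are assumptions for illustration, not the original notebook's cells.

```python
from sklearn.cluster import DBSCAN, KMeans, MeanShift
from sklearn.datasets import make_blobs

# Assumed data: 150 points in 3 blobs. Only the features X are used, no targets.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Same estimators as shown above; n_init and random_state are added for reproducibility.
models = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "meanshift": MeanShift(),
    "dbscan": DBSCAN(eps=0.1),
}

# fit_predict returns one cluster index per data point.
labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, found in labels.items():
    # DBSCAN marks noise points with -1, which is not a cluster.
    n_found = len(set(found) - {-1})
    print(name, "clusters found:", n_found)
```

KMeans needs the number of clusters up front, while MeanShift and DBSCAN infer it from the data. How many clusters DBSCAN finds depends strongly on `eps` relative to the scale of the features.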