completed 2 clustering parts of unsupervised learning section

2019-07-25 00:12:04 +01:00
parent 9648dfe7db
commit 15dfbd5d91
23 changed files with 5877 additions and 0 deletions
--- a/Learning/Clustering/Changing
+++ b/Learning/Clustering/Changing
@@ -0,0 +1,291 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Changing K - Solution\n",
+    "\n",
+    "In this notebook, you will get some practice with different values of **k** and see how they change the clusters that are observed in the data.  You will also see how to determine what the best value for **k** might be for a dataset.\n",
+    "\n",
+    "To get started, let's read in our necessary libraries."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "from mpl_toolkits.mplot3d import Axes3D\n",
+    "from sklearn.cluster import KMeans\n",
+    "from sklearn.datasets import make_blobs\n",
+    "import helpers2 as h\n",
+    "import tests as t\n",
+    "from IPython import display\n",
+    "\n",
+    "%matplotlib inline\n",
+    "\n",
+    "# Make the images larger\n",
+    "plt.rcParams['figure.figsize'] = (16, 9)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`1.` To get started, there is a function called **simulate_data** within the **helpers2** module.  Read the documentation on the function by running the cell below.  Then use the function to simulate a dataset with 200 data points (rows), 5 features (columns), and 4 centers."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "h.simulate_data?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = h.simulate_data(200, 5, 4)\n",
+    "\n",
+    "# This will check that your dataset appears to match ours before moving forward\n",
+    "t.test_question_1(data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`2.` Because of how you set up the data, what should the value of **k** be?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "k_value = 4\n",
+    "\n",
+    "# Check your solution against ours.\n",
+    "t.test_question_2(k_value)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`3.` Let's try a few different values for **k** and fit them to our data using **KMeans**.\n",
+    "\n",
+    "To use KMeans, you need to follow three steps:\n",
+    "\n",
+    "**I.** Instantiate your model.\n",
+    "\n",
+    "**II.** Fit your model to the data.\n",
+    "\n",
+    "**III.** Predict the labels for the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Try instantiating a model with 4 centers\n",
+    "kmeans_4 = KMeans(n_clusters=4)\n",
+    "\n",
+    "# Then fit the model to your data using the fit method\n",
+    "model_4 = kmeans_4.fit(data)\n",
+    "\n",
+    "# Finally predict the labels on the same data to show the category that each point belongs to\n",
+    "labels_4 = model_4.predict(data)\n",
+    "\n",
+    "# If you did all of that correctly, this should provide a plot of your data colored by center\n",
+    "h.plot_data(data, labels_4)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`4.` Now try again, but this time fit kmeans to your data using 2 clusters instead of 4."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Try instantiating a model with 2 centers\n",
+    "kmeans_2 = KMeans(n_clusters=2)\n",
+    "\n",
+    "# Then fit the model to your data using the fit method\n",
+    "model_2 = kmeans_2.fit(data)\n",
+    "\n",
+    "# Finally predict the labels on the same data to show the category that each point belongs to\n",
+    "labels_2 = model_2.predict(data)\n",
+    "\n",
+    "# If you did all of that correctly, this should provide a plot of your data colored by center\n",
+    "h.plot_data(data, labels_2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`5.` Now try one more time, but with the number of clusters in kmeans set to 7."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Try instantiating a model with 7 centers\n",
+    "kmeans_7 = KMeans(n_clusters=7)\n",
+    "\n",
+    "# Then fit the model to your data using the fit method\n",
+    "model_7 = kmeans_7.fit(data)\n",
+    "\n",
+    "# Finally predict the labels on the same data to show the category that each point belongs to\n",
+    "labels_7 = model_7.predict(data)\n",
+    "\n",
+    "# If you did all of that correctly, this should provide a plot of your data colored by center\n",
+    "h.plot_data(data, labels_7)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Visually, we get some indication of how well our model is doing, but it isn't totally apparent. Each time additional centers are considered, the distances between the points and their closest center will decrease.  However, at some point, that decrease is not substantial enough to suggest the need for an additional cluster.  \n",
+    "\n",
+    "Using a scree plot is a common method for understanding if an additional cluster center is needed.  Reading the elbow off a scree plot is still fairly subjective, but let's take a look to see how many cluster centers might be indicated.\n",
+    "_________\n",
+    "\n",
+    "`6.` Once you have **fit** a kmeans model to some data in sklearn, you can call its **score** method on the data.  The score is the negative of the sum of squared distances from each point to its closest centroid, so its absolute value indicates how far the points are from the centroids.  By fitting models with 1 to 10 centroids, and keeping track of the score and the number of centroids, you should be able to build a scree plot.  \n",
+    "\n",
+    "This plot should have the number of centroids on the x-axis, and the absolute value of the score result on the y-axis.  You can see the plot I retrieved by running the solution code.  Try creating your own scree plot, as you will need it for the final questions."
+   ]
+  },
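+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A minimal sketch of what the **score** method reports, under stated assumptions: for the data a model was fit on, the negated score should match the fitted model's `inertia_` attribute (the within-cluster sum of squared distances).  The names `demo_data` and `demo_model`, and the `n_init` and `random_state` values, are assumptions chosen only for this illustration."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from sklearn.cluster import KMeans\n",
+    "from sklearn.datasets import make_blobs\n",
+    "\n",
+    "# Illustrative data only (assumed settings, separate from the exercise data)\n",
+    "demo_data, _ = make_blobs(n_samples=200, n_features=5, centers=4, random_state=42)\n",
+    "\n",
+    "# Fit a model; n_init and random_state are fixed here only for repeatability\n",
+    "demo_model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(demo_data)\n",
+    "\n",
+    "# score returns the negative of the k-means objective on the given data\n",
+    "sse_from_score = -demo_model.score(demo_data)\n",
+    "\n",
+    "# inertia_ stores the same objective for the training data\n",
+    "sse_from_inertia = demo_model.inertia_\n",
+    "\n",
+    "# The two values should agree up to floating point error\n",
+    "print(sse_from_score, sse_from_inertia, np.isclose(sse_from_score, sse_from_inertia))"
+   ]
+  },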
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A place for your work - create a scree plot - you will need to\n",
+    "# Fit a kmeans model with changing k from 1-10\n",
+    "# Obtain the score for each model (take the absolute value)\n",
+    "# Plot the score against k\n",
+    "\n",
+    "def get_kmeans_score(data, center):\n",
+    "    '''\n",
+    "    Returns the SSE (sum of squared distances from each point to its closest center) for a kmeans model\n",
+    "    INPUT:\n",
+    "        data - the dataset you want to fit kmeans to\n",
+    "        center - the number of centers you want (the k value)\n",
+    "    OUTPUT:\n",
+    "        score - the SSE score for the kmeans model fit to the data\n",
+    "    '''\n",
+    "    #instantiate kmeans\n",
+    "    kmeans = KMeans(n_clusters=center)\n",
+    "\n",
+    "    # Then fit the model to your data using the fit method\n",
+    "    model = kmeans.fit(data)\n",
+    "    \n",
+    "    # Obtain a score related to the model fit\n",
+    "    score = np.abs(model.score(data))\n",
+    "    \n",
+    "    return score\n",
+    "\n",
+    "scores = []\n",
+    "centers = list(range(1,11))\n",
+    "\n",
+    "for center in centers:\n",
+    "    scores.append(get_kmeans_score(data, center))\n",
+    "    \n",
+    "plt.plot(centers, scores, linestyle='--', marker='o', color='b');\n",
+    "plt.xlabel('K');\n",
+    "plt.ylabel('SSE');\n",
+    "plt.title('SSE vs. K');"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Run our solution\n",
+    "centers, scores = h.fit_mods()\n",
+    "\n",
+    "# Your plot should look similar to the one below\n",
+    "plt.plot(centers, scores, linestyle='--', marker='o', color='b');\n",
+    "plt.xlabel('K');\n",
+    "plt.ylabel('SSE');\n",
+    "plt.title('SSE vs. K');"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`7.` Using the scree plot, how many clusters would you suggest as being in the data?  What is K?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "value_for_k = 4\n",
+    "\n",
+    "# Test your solution against ours\n",
+    "display.HTML(t.test_question_7(value_for_k))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/Learning/Clustering/Changing
+++ b/Learning/Clustering/Changing
--- a/Learning/Clustering/Feature
+++ b/Learning/Clustering/Feature
--- a/Learning/Clustering/Feature
+++ b/Learning/Clustering/Feature
--- a/Learning/Clustering/Feature
+++ b/Learning/Clustering/Feature
--- a/Learning/Clustering/Feature
+++ b/Learning/Clustering/Feature
--- a/Learning/Clustering/Identifying_Clusters.ipynb
+++ b/Learning/Clustering/Identifying_Clusters.ipynb
--- a/Learning/Clustering/data.csv
+++ b/Learning/Clustering/data.csv
@@ -0,0 +1,100 @@
+0.78051,-0.063669,1
+0.28774,0.29139,1
+0.40714,0.17878,1
+0.2923,0.4217,1
+0.50922,0.35256,1
+0.27785,0.10802,1
+0.27527,0.33223,1
+0.43999,0.31245,1
+0.33557,0.42984,1
+0.23448,0.24986,1
+0.0084492,0.13658,1
+0.12419,0.33595,1
+0.25644,0.42624,1
+0.4591,0.40426,1
+0.44547,0.45117,1
+0.42218,0.20118,1
+0.49563,0.21445,1
+0.30848,0.24306,1
+0.39707,0.44438,1
+0.32945,0.39217,1
+0.40739,0.40271,1
+0.3106,0.50702,1
+0.49638,0.45384,1
+0.10073,0.32053,1
+0.69907,0.37307,1
+0.29767,0.69648,1
+0.15099,0.57341,1
+0.16427,0.27759,1
+0.33259,0.055964,1
+0.53741,0.28637,1
+0.19503,0.36879,1
+0.40278,0.035148,1
+0.21296,0.55169,1
+0.48447,0.56991,1
+0.25476,0.34596,1
+0.21726,0.28641,1
+0.67078,0.46538,1
+0.3815,0.4622,1
+0.53838,0.32774,1
+0.4849,0.26071,1
+0.37095,0.38809,1
+0.54527,0.63911,1
+0.32149,0.12007,1
+0.42216,0.61666,1
+0.10194,0.060408,1
+0.15254,0.2168,1
+0.45558,0.43769,1
+0.28488,0.52142,1
+0.27633,0.21264,1
+0.39748,0.31902,1
+0.5533,1,0
+0.44274,0.59205,0
+0.85176,0.6612,0
+0.60436,0.86605,0
+0.68243,0.48301,0
+1,0.76815,0
+0.72989,0.8107,0
+0.67377,0.77975,0
+0.78761,0.58177,0
+0.71442,0.7668,0
+0.49379,0.54226,0
+0.78974,0.74233,0
+0.67905,0.60921,0
+0.6642,0.72519,0
+0.79396,0.56789,0
+0.70758,0.76022,0
+0.59421,0.61857,0
+0.49364,0.56224,0
+0.77707,0.35025,0
+0.79785,0.76921,0
+0.70876,0.96764,0
+0.69176,0.60865,0
+0.66408,0.92075,0
+0.65973,0.66666,0
+0.64574,0.56845,0
+0.89639,0.7085,0
+0.85476,0.63167,0
+0.62091,0.80424,0
+0.79057,0.56108,0
+0.58935,0.71582,0
+0.56846,0.7406,0
+0.65912,0.71548,0
+0.70938,0.74041,0
+0.59154,0.62927,0
+0.45829,0.4641,0
+0.79982,0.74847,0
+0.60974,0.54757,0
+0.68127,0.86985,0
+0.76694,0.64736,0
+0.69048,0.83058,0
+0.68122,0.96541,0
+0.73229,0.64245,0
+0.76145,0.60138,0
+0.58985,0.86955,0
+0.73145,0.74516,0
+0.77029,0.7014,0
+0.73156,0.71782,0
+0.44556,0.57991,0
+0.85275,0.85987,0
+0.51912,0.62359,0
--- a/Learning/Clustering/helper_functions.py
+++ b/Learning/Clustering/helper_functions.py
@@ -0,0 +1,41 @@
+import numpy as np
+import matplotlib.pyplot as plt
+from mpl_toolkits.mplot3d import Axes3D
+from sklearn.cluster import KMeans
+from sklearn.datasets import make_blobs
+
+# Generate Question 1 Data
+X, y = make_blobs(n_samples=500, n_features=3, centers=4, random_state=5)
+
+
+def plot_q1_data():
+    fig = plt.figure()
+    # Create the 3D axes through the figure so they are attached to it
+    ax = fig.add_subplot(111, projection='3d')
+    ax.scatter(X[:, 0], X[:, 1], X[:, 2])
+
+
+# Generate Question 2 Data
+Z, y = make_blobs(n_samples=500, n_features=5, centers=2, random_state=42)
+
+
+def plot_q2_data():
+    fig = plt.figure()
+    plt.scatter(Z[:, 0], Z[:, 1]);
+
+
+# Generate Question 3 Data
+T, y = make_blobs(n_samples=500, n_features=5, centers=8, random_state=5)
+
+
+def plot_q3_data():
+    fig = plt.figure()
+    # Create the 3D axes through the figure so they are attached to it
+    ax = fig.add_subplot(111, projection='3d')
+    ax.scatter(T[:, 1], T[:, 3], T[:, 4])
+
+# Plot data for Question 4
+
+
+def plot_q4_data():
+    fig = plt.figure()
+    # Create the 3D axes through the figure so they are attached to it
+    ax = fig.add_subplot(111, projection='3d')
+    ax.scatter(T[:, 1], T[:, 2], T[:, 3])
--- a/Learning/Clustering/helpers2.py
+++ b/Learning/Clustering/helpers2.py
@@ -0,0 +1,58 @@
+import numpy as np
+import matplotlib.pyplot as plt
+from mpl_toolkits.mplot3d import Axes3D
+from sklearn.cluster import KMeans
+from sklearn.datasets import make_blobs
+
+def simulate_data(n = 500, features = 10, centroids = 3):
+    '''
+    Simulates n data points, each with number of features equal to features, with a number of centers equal to centroids
+    INPUT (defaults)
+        n = number of rows (500)
+        features = number of columns (10)
+        centroids = number of centers (3)
+    OUTPUT
+        dataset = a dataset with the specified characteristics
+    '''
+    dataset, y = make_blobs(n_samples=n, n_features=features, centers=centroids, random_state=42)
+
+    return dataset
+
+def plot_data(data, labels):
+    '''
+    Plot data with colors associated with labels
+    '''
+    fig = plt.figure()
+    # Create the 3D axes through the figure so they are attached to it
+    ax = fig.add_subplot(111, projection='3d')
+    ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=labels, cmap='tab10')
+
+data = simulate_data(200, 5, 4)
+
+def get_kmeans_score(data, center):
+    '''
+    Returns the SSE (sum of squared distances from each point to its closest center) for a kmeans model
+    INPUT:
+        data - the dataset you want to fit kmeans to
+        center - the number of centers you want (the k value)
+    OUTPUT:
+        score - the SSE score for the kmeans model fit to the data
+    '''
+    #instantiate kmeans
+    kmeans = KMeans(n_clusters=center)
+
+    # Then fit the model to your data using the fit method
+    model = kmeans.fit(data)
+
+    # Obtain a score related to the model fit
+    score = np.abs(model.score(data))
+
+    return score
+
+def fit_mods():
+    scores = []
+    centers = list(range(1,11))
+
+    for center in centers:
+        scores.append(get_kmeans_score(data, center))
+
+    return centers, scores
--- a/Learning/Clustering/images/high_epsilon_and_high_min_sample.png
+++ b/Learning/Clustering/images/high_epsilon_and_high_min_sample.png
--- a/Learning/Clustering/images/high_epsilon_and_low_min_sample.png
+++ b/Learning/Clustering/images/high_epsilon_and_low_min_sample.png
--- a/Learning/Clustering/images/low_epsilon_and_high_min_sample.png
+++ b/Learning/Clustering/images/low_epsilon_and_high_min_sample.png
--- a/Learning/Clustering/images/low_epsilon_and_low_min_sample.png
+++ b/Learning/Clustering/images/low_epsilon_and_low_min_sample.png
--- a/Learning/Clustering/test_file.py
+++ b/Learning/Clustering/test_file.py
@@ -0,0 +1,27 @@
+def display_gif(fn):
+    return '<img src="{}">'.format(fn)
+
+
+def test_question_1(clusters):
+    if clusters == 4:
+        print("That's right!  There are 4 clusters in this dataset.")
+    elif clusters < 4:
+        print("Oops!  We were thinking there were actually more clusters than what you suggested. Try again.  A cluster is a group of points that are closer together and separated from other points in the dataset.")
+    else:
+        print("Oops!  We were thinking there were fewer clusters than what you suggested. Try again.  A cluster is a group of points that are closer together and separated from other points in the dataset.")
+
+
+def test_question_2(clusters):
+    if clusters == 2:
+        print("That's right!  There are 2 clusters in this dataset.")
+    else:
+        print("Oops!  That doesn't look like what we expected for the number of clusters.  Try again.  A cluster is a group of points that are closer together and separated from other points in the dataset.")
+
+
+def test_question_3(clusters):
+    print("{} is a reasonable guess for the number of clusters here.  In the next question, you will see a different angle of this data.".format(clusters))
+
+
+def test_question_4(clusters):
+    print("This data is actually the same as the data used in question 3.  Isn't it crazy how looking at data from a different angle can make us believe there are a different number of clusters in the data!  We will look at how to address this in the upcoming parts of this lesson.")
+    return display_gif('http://www.reactiongifs.com/wp-content/uploads/2013/03/mind-blown.gif')
--- a/Learning/Clustering/tests.py
+++ b/Learning/Clustering/tests.py
@@ -0,0 +1,29 @@
+import numpy as np
+import matplotlib.pyplot as plt
+from mpl_toolkits.mplot3d import Axes3D
+from sklearn.cluster import KMeans
+from sklearn.datasets import make_blobs
+
+def display_gif(fn):
+    return '<img src="{}">'.format(fn)
+
+
+def test_question_1(data1):
+    if data1.shape[0] == 200 and data1.shape[1] == 5:
+        print("Looks good!  Continue!")
+    else:
+        print("Oops, that looks different than what we expected!  The first argument should be the number of rows, the second the number of columns, and the final should be the number of centers.")
+
+def test_question_2(k_value):
+    if k_value == 4:
+        print("That's right!  The value of k is the same as the number of centroids used to create your dataset.")
+    else:
+        print("Oops! That doesn't seem right!  The value of k should be the same as the number of centroids you used in your dataset.  In this case, the value for k should be 4.")
+
+def test_question_7(k_value):
+    if k_value == 4:
+        print("That's right!  We set up the data with 4 centers, and the plot is consistent!  We can see a strong leveling off after 4 clusters, which suggests 4 clusters should be used.")
+
+        return display_gif("https://media2.giphy.com/media/3ohzdIuqJoo8QdKlnW/giphy.gif")
+    else:
+        print("Oops! That doesn't seem right!  The value of k should be where the 'elbow' can be found in the scree plot.  You can see 4-10 all have similar SSE values, suggesting that 4 clusters is the minimum number of clusters to significantly reduce the SSE from centroids to each point.")
--- a/Learning/Clustering/tests2.py
+++ b/Learning/Clustering/tests2.py
@@ -0,0 +1,41 @@
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+import seaborn as sns
+from sklearn.cluster import KMeans
+from IPython.display import Image
+from sklearn.datasets import make_blobs
+
+
+
+def check_q1(stuff):
+    a = 0
+    b = 60
+    c = 22.9
+    d = 4.53
+    e = 511.7
+
+    q1_dict = {
+    'number of missing values': a,
+    'the mean 5k time in minutes': c,
+    'the mean test score as a raw value': e,
+    'number of individuals in the dataset': b
+    }
+
+    if stuff == q1_dict:
+        print("That looks right!")
+
+    else:
+        print("Oops!  That doesn't look quite right! Try again.")
+
+
+def check_q5(stuff):
+    a = 'We should always use normalizing'
+    b = 'We should always scale our variables between 0 and 1.'
+    c = 'Variable scale will frequently influence your results, so it is important to standardize for all of these algorithms.'
+    d = 'Scaling will not change the results of your output.'
+
+    if stuff == c:
+        return Image(filename="./giphy.gif")
+    else:
+        print("Oops!  That doesn't look quite right.  Try again!")