{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature Scaling - Solution\n", "\n", "With any machine learning model that is sensitive to feature scale (regularized regression methods, neural networks, and now K-Means), you will want to scale your data.\n", "\n", "If some features are on completely different scales, the feature with the largest range dominates the distance calculation, which can greatly change the clusters K-Means finds.\n", "\n", "In this notebook, you will get to see this firsthand. To begin, let's import the necessary libraries." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.cluster import KMeans\n", "import matplotlib.pyplot as plt\n", "from sklearn import preprocessing as p\n", "\n", "%matplotlib inline\n", "\n", "plt.rcParams['figure.figsize'] = (16, 9)\n", "import helpers2 as h\n", "import tests as t\n", "\n", "\n", "# Create the dataset for the notebook\n", "data = h.simulate_data(200, 2, 4)\n", "df = pd.DataFrame(data)\n", "df.columns = ['height', 'weight']\n", "# Put height on a much larger scale than weight\n", "df['height'] = np.abs(df['height']*100)\n", "# Add noise centered at 50 (standard deviation 10) to weight\n", "df['weight'] = df['weight'] + np.random.normal(50, 10, 200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`1.` Next, take a look at the data to get familiar with it. The dataset has two columns, and it is stored in the **df** variable. It is useful to check the spread of each column, as well as a plot of the points." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | height | weight |
|---|---|---|
| count | 200.000000 | 200.000000 |
| mean | 569.726207 | 52.141503 |
| std | 246.966215 | 11.342247 |
| min | 92.998481 | 16.263732 |
| 25% | 357.542793 | 45.228737 |
| 50% | 545.766752 | 52.065198 |
| 75% | 773.310607 | 61.207241 |
| max | 1096.222348 | 77.908867 |
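
The summary above shows a standard deviation of roughly 247 for height against roughly 11 for weight. As a rough sketch of why that gap matters for a distance-based method, the snippet below measures how much of the total squared Euclidean distance between points comes from each column, before and after standardizing. The data here is a hypothetical stand-in drawn from normal distributions that approximately match the table (it is not the output of `h.simulate_data`), and `distance_share` is an illustrative helper, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in data, drawn to roughly match the means and standard
# deviations in the summary table (this is NOT the notebook's simulated data).
height = rng.normal(570, 247, 200)
weight = rng.normal(52, 11, 200)
X = np.column_stack([height, weight])


def distance_share(X):
    """Fraction of the total squared pairwise Euclidean distance due to each column."""
    diffs = X[:, None, :] - X[None, :, :]        # shape (n, n, n_features)
    per_feature = (diffs ** 2).sum(axis=(0, 1))  # sum over all pairs of points
    return per_feature / per_feature.sum()


raw_share = distance_share(X)

# Standardize each column to zero mean and unit variance (z-scores).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_share = distance_share(X_scaled)

print("raw share (height, weight):   ", raw_share.round(3))
print("scaled share (height, weight):", scaled_share.round(3))
```

On the raw values, nearly all of the distance comes from height, so K-Means would effectively cluster on height alone. After standardizing, both columns contribute equally, because each column's share of the summed squared distance is proportional to its variance.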