udacity/python/Unsupervised Learning/Dimensionality Reduction and PCA/Interpret_PCA_Results_Solution.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Your Turn! (Solution)\n",
    "\n",
    "In the last video, you saw two of the main aspects of principal components:\n",
    "\n",
    "1. **The amount of variability captured by the component.**\n",
    "2. **The components themselves.**\n",
    "\n",
    "In this notebook, you will get a chance to explore these a bit more yourself.  First, let's read in the necessary libraries, as well as the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.decomposition import PCA\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import confusion_matrix, accuracy_score\n",
    "from helper_functions import show_images, do_pca, scree_plot, plot_component\n",
    "import test_code as t\n",
    "\n",
    "import matplotlib.image as mpimg\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
    "#read in our dataset\n",
    "train = pd.read_csv('./data/train.csv')\n",
    "train.fillna(0, inplace=True)\n",
    "\n",
    "# save the labels to a Pandas series target\n",
    "y = train['label']\n",
    "# Drop the label feature\n",
    "X = train.drop(\"label\",axis=1)\n",
    "\n",
    "show_images(30)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`1.` Perform PCA on the **X** matrix using on your own or using the **do_pca** function from the **helper_functions** module. Reduce the original more than 700 features to only 10 principal components."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pca, X_pca = do_pca(10, X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "`2.` Now use the **scree_plot** function from the **helper_functions** module to take a closer look at the results of your analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scree_plot(pca)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`3.` Using the results of your scree plot, match each letter as the value to the correct key in the **solution_three** dictionary.  Once you are confident in your solution run the next cell to see if your solution matches ours."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = True\n",
    "b = False\n",
    "c = 6.13\n",
    "d = 'The total amount of variability in the data explained by the first two principal components'\n",
    "e = None\n",
    "\n",
    "solution_three = {\n",
    "    '10.42' : d, \n",
    "    'The first component will ALWAYS have the most amount of variability explained.': a,\n",
    "    'The total amount of variability in the data explained by the first component': c,\n",
    "    'The sum of the variability explained by all the components can be greater than 100%': b\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Run this cell to see if your solution matches ours\n",
    "t.question_3_check(solution_three)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`4.` Use the **plot_component** function from the **helper_functions** module to look at each of the components (remember they are 0 indexed).  Use the results to assist with question 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_component(pca, 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`5.` Using the results from viewing each of your principal component weights in question 4, change the following values of the **solution_five** dictionary to the **number of the index** for the principal component that best matches the description.  Once you are confident in your solution run the next cell to see if your solution matches ours."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "solution_five = {\n",
    "    'This component looks like it will assist in identifying zero': 0,\n",
    "    'This component looks like it will assist in identifying three': 3\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Run this cell to see if your solution matches ours\n",
    "t.question_5_check(solution_five)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From this notebook, you have had an opportunity to look at the two major parts of PCA:\n",
    "\n",
    "`I.` The amount of **variance explained by each component**.  This is called an **eigenvalue**.\n",
    "\n",
    "`II.` The principal components themselves, each component is a vector of weights.  In this case, the principal components help us understand which pixels of the image are most helpful in identifying the difference between between digits. **Principal components** are also known as **eigenvectors**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}