{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Your Turn! (Solution)\n", "\n", "In the last video, you saw two of the main aspects of principal components:\n", "\n", "1. **The amount of variability captured by the component.**\n", "2. **The components themselves.**\n", "\n", "In this notebook, you will get a chance to explore these a bit more yourself. First, let's import the necessary libraries and read in the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, accuracy_score\n", "from helper_functions import show_images, do_pca, scree_plot, plot_component\n", "import test_code as t\n", "\n", "import matplotlib.image as mpimg\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "%matplotlib inline\n", "\n", "# Read in our dataset\n", "train = pd.read_csv('./data/train.csv')\n", "train.fillna(0, inplace=True)\n", "\n", "# Save the labels to a pandas Series, y\n", "y = train['label']\n", "# Drop the label column to get the feature matrix, X\n", "X = train.drop(\"label\", axis=1)\n", "\n", "show_images(30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`1.` Perform PCA on the **X** matrix, either on your own or using the **do_pca** function from the **helper_functions** module. Reduce the original features (more than 700) to only 10 principal components."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pca, X_pca = do_pca(10, X)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "`2.` Now use the **scree_plot** function from the **helper_functions** module to take a closer look at the results of your analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scree_plot(pca)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`3.` Using the results of your scree plot, assign each letter as the value of the matching key in the **solution_three** dictionary. Once you are confident in your solution, run the next cell to see if it matches ours." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = True\n", "b = False\n", "c = 6.13\n", "d = 'The total amount of variability in the data explained by the first two principal components'\n", "e = None\n", "\n", "solution_three = {\n", " '10.42' : d, \n", " 'The first component will ALWAYS have the most amount of variability explained.': a,\n", " 'The total amount of variability in the data explained by the first component': c,\n", " 'The sum of the variability explained by all the components can be greater than 100%': b\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Run this cell to see if your solution matches ours\n", "t.question_3_check(solution_three)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`4.` Use the **plot_component** function from the **helper_functions** module to look at each of the components (remember that they are zero-indexed). Use the results to assist with question 5."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_component(pca, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`5.` Using the principal component weights you viewed in question 4, change the values of the **solution_five** dictionary to the **index** of the principal component that best matches each description. Once you are confident in your solution, run the next cell to see if it matches ours." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "solution_five = {\n", " 'This component looks like it will assist in identifying zero': 0,\n", " 'This component looks like it will assist in identifying three': 3\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Run this cell to see if your solution matches ours\n", "t.question_5_check(solution_five)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this notebook, you have had an opportunity to look at the two major parts of PCA:\n", "\n", "`I.` The amount of **variance explained by each component**. This is called an **eigenvalue**.\n", "\n", "`II.` The principal components themselves. Each component is a vector of weights. In this case, the principal components help us understand which pixels of the image are most helpful in distinguishing between digits. **Principal components** are also known as **eigenvectors**."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }