{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Your Turn! (Solution)\n",
    "\n",
    "In the last video, you saw two of the main aspects of principal components:\n",
    "\n",
    "1. **The amount of variability captured by the component.**\n",
    "2. **The components themselves.**\n",
    "\n",
    "In this notebook, you will get a chance to explore these a bit more yourself. First, let's import the necessary libraries and read in the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.decomposition import PCA\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import confusion_matrix, accuracy_score\n",
    "from helper_functions import show_images, do_pca, scree_plot, plot_component\n",
    "import test_code as t\n",
    "\n",
    "import matplotlib.image as mpimg\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
    "# Read in our dataset and replace missing values with 0\n",
    "train = pd.read_csv('./data/train.csv')\n",
    "train.fillna(0, inplace=True)\n",
    "\n",
    "# Save the labels to a pandas Series, y\n",
    "y = train['label']\n",
    "# Drop the label column so X holds only the pixel features\n",
    "X = train.drop(\"label\", axis=1)\n",
    "\n",
    "show_images(30)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`1.` Perform PCA on the **X** matrix, either on your own or using the **do_pca** function from the **helper_functions** module. Reduce the more than 700 original features to only 10 principal components."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pca, X_pca = do_pca(10, X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "`2.` Now use the **scree_plot** function from the **helper_functions** module to take a closer look at the results of your analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scree_plot(pca)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`3.` Using the results of your scree plot, assign the correct letter as the value for each key in the **solution_three** dictionary. Once you are confident in your solution, run the next cell to see if it matches ours."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = True\n",
    "b = False\n",
    "c = 6.13\n",
    "d = 'The total amount of variability in the data explained by the first two principal components'\n",
    "e = None\n",
    "\n",
    "solution_three = {\n",
    "    '10.42': d,\n",
    "    'The first component will ALWAYS have the most amount of variability explained.': a,\n",
    "    'The total amount of variability in the data explained by the first component': c,\n",
    "    'The sum of the variability explained by all the components can be greater than 100%': b\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run this cell to see if your solution matches ours\n",
    "t.question_3_check(solution_three)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`4.` Use the **plot_component** function from the **helper_functions** module to look at each of the components (remember that they are zero-indexed). Use the results to assist with question 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_component(pca, 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`5.` Using what you saw when viewing the principal component weights in question 4, set each value in the **solution_five** dictionary to the **index** of the principal component that best matches the description. Once you are confident in your solution, run the next cell to see if it matches ours."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "solution_five = {\n",
    "    'This component looks like it will assist in identifying zero': 0,\n",
    "    'This component looks like it will assist in identifying three': 3\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run this cell to see if your solution matches ours\n",
    "t.question_5_check(solution_five)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From this notebook, you have had an opportunity to look at the two major parts of PCA:\n",
    "\n",
    "`I.` The amount of **variance explained by each component**. This is called an **eigenvalue**.\n",
    "\n",
    "`II.` The principal components themselves. Each component is a vector of weights. In this case, the principal components help us understand which pixels of the image are most helpful in distinguishing between digits. **Principal components** are also known as **eigenvectors**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}