{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Your Turn\n",
|
||
"\n",
|
||
"In the previous video, you saw an example of working with some MNIST digits data. The MNIST dataset can be found here: http://yann.lecun.com/exdb/mnist/.\n",
|
||
"\n",
|
||
"First, let's import the necessary libraries. Notice there are also some imports from a file called `helper_functions`, which contains the functions used in the previous video."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import pandas as pd\n",
|
||
"import numpy as np\n",
|
||
"from sklearn.decomposition import PCA\n",
|
||
"from sklearn.preprocessing import StandardScaler\n",
|
||
"from sklearn.ensemble import RandomForestClassifier\n",
|
||
"from sklearn.model_selection import train_test_split\n",
|
||
"from sklearn.metrics import confusion_matrix, accuracy_score\n",
|
||
"from helper_functions import show_images, show_images_by_digit, fit_random_forest_classifier2 \n",
|
||
"from helper_functions import fit_random_forest_classifier, do_pca, plot_components\n",
|
||
"import test_code as t\n",
|
||
"\n",
|
||
"import matplotlib.image as mpimg\n",
|
||
"import matplotlib.pyplot as plt\n",
|
||
"import seaborn as sns\n",
|
||
"\n",
|
||
"%matplotlib inline"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"`1.` Use pandas to read in the dataset, which can be found in this workspace using the filepath **'./data/train.csv'**. If you have missing values, fill them with 0. Take a look at info about the data using `head`, `tail`, `describe`, `info`, etc. You can learn more about the data values from the article here: https://homepages.inf.ed.ac.uk/rbf/HIPR2/value.htm."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"df = pd.read_csv('./data/train.csv')\n",
|
||
"df.fillna(value=0, inplace=True)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 20,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>label</th>\n",
|
||
" <th>pixel0</th>\n",
|
||
" <th>pixel1</th>\n",
|
||
" <th>pixel2</th>\n",
|
||
" <th>pixel3</th>\n",
|
||
" <th>pixel4</th>\n",
|
||
" <th>pixel5</th>\n",
|
||
" <th>pixel6</th>\n",
|
||
" <th>pixel7</th>\n",
|
||
" <th>pixel8</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>pixel774</th>\n",
|
||
" <th>pixel775</th>\n",
|
||
" <th>pixel776</th>\n",
|
||
" <th>pixel777</th>\n",
|
||
" <th>pixel778</th>\n",
|
||
" <th>pixel779</th>\n",
|
||
" <th>pixel780</th>\n",
|
||
" <th>pixel781</th>\n",
|
||
" <th>pixel782</th>\n",
|
||
" <th>pixel783</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>4</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 785 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 \\\n",
|
||
"0 1 0 0 0 0 0 0 0 0 \n",
|
||
"1 0 0 0 0 0 0 0 0 0 \n",
|
||
"2 1 0 0 0 0 0 0 0 0 \n",
|
||
"3 4 0 0 0 0 0 0 0 0 \n",
|
||
"4 0 0 0 0 0 0 0 0 0 \n",
|
||
"\n",
|
||
" pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 \\\n",
|
||
"0 0 ... 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"1 0 ... 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"2 0 ... 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"3 0 ... 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"4 0 ... 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"\n",
|
||
" pixel779 pixel780 pixel781 pixel782 pixel783 \n",
|
||
"0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"1 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"2 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"3 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"4 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"\n",
|
||
"[5 rows x 785 columns]"
|
||
]
|
||
},
|
||
"execution_count": 20,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df.head(5)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 21,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>label</th>\n",
|
||
" <th>pixel0</th>\n",
|
||
" <th>pixel1</th>\n",
|
||
" <th>pixel2</th>\n",
|
||
" <th>pixel3</th>\n",
|
||
" <th>pixel4</th>\n",
|
||
" <th>pixel5</th>\n",
|
||
" <th>pixel6</th>\n",
|
||
" <th>pixel7</th>\n",
|
||
" <th>pixel8</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>pixel774</th>\n",
|
||
" <th>pixel775</th>\n",
|
||
" <th>pixel776</th>\n",
|
||
" <th>pixel777</th>\n",
|
||
" <th>pixel778</th>\n",
|
||
" <th>pixel779</th>\n",
|
||
" <th>pixel780</th>\n",
|
||
" <th>pixel781</th>\n",
|
||
" <th>pixel782</th>\n",
|
||
" <th>pixel783</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>count</th>\n",
|
||
" <td>6304.000000</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>6304.000000</td>\n",
|
||
" <td>6304.000000</td>\n",
|
||
" <td>6304.000000</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" <td>6304.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>mean</th>\n",
|
||
" <td>4.440355</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.139594</td>\n",
|
||
" <td>0.142291</td>\n",
|
||
" <td>0.026967</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>std</th>\n",
|
||
" <td>2.885613</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>5.099940</td>\n",
|
||
" <td>5.531089</td>\n",
|
||
" <td>1.675547</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>min</th>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>25%</th>\n",
|
||
" <td>2.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>50%</th>\n",
|
||
" <td>4.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>75%</th>\n",
|
||
" <td>7.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>max</th>\n",
|
||
" <td>9.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>253.000000</td>\n",
|
||
" <td>253.000000</td>\n",
|
||
" <td>130.000000</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>8 rows × 785 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 \\\n",
|
||
"count 6304.000000 6304.0 6304.0 6304.0 6304.0 6304.0 6304.0 6304.0 \n",
|
||
"mean 4.440355 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"std 2.885613 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"min 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"25% 2.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"50% 4.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"75% 7.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"max 9.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"\n",
|
||
" pixel7 pixel8 ... pixel774 pixel775 pixel776 \\\n",
|
||
"count 6304.0 6304.0 ... 6304.000000 6304.000000 6304.000000 \n",
|
||
"mean 0.0 0.0 ... 0.139594 0.142291 0.026967 \n",
|
||
"std 0.0 0.0 ... 5.099940 5.531089 1.675547 \n",
|
||
"min 0.0 0.0 ... 0.000000 0.000000 0.000000 \n",
|
||
"25% 0.0 0.0 ... 0.000000 0.000000 0.000000 \n",
|
||
"50% 0.0 0.0 ... 0.000000 0.000000 0.000000 \n",
|
||
"75% 0.0 0.0 ... 0.000000 0.000000 0.000000 \n",
|
||
"max 0.0 0.0 ... 253.000000 253.000000 130.000000 \n",
|
||
"\n",
|
||
" pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783 \n",
|
||
"count 6304.0 6304.0 6304.0 6304.0 6304.0 6304.0 6304.0 \n",
|
||
"mean 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"std 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"min 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"25% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"50% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"75% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"max 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
|
||
"\n",
|
||
"[8 rows x 785 columns]"
|
||
]
|
||
},
|
||
"execution_count": 21,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df.describe()"
|
||
]
|
||
},
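As a rough illustration of step `1.`, the sketch below reads a tiny in-memory CSV, fills missing values with 0, and inspects the result. The CSV text and the `demo_df` name are made up for this example and stand in for `./data/train.csv`; only the pandas calls mirror the cells above.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for ./data/train.csv (the real file has 784 pixel columns).
csv_text = "label,pixel0,pixel1,pixel2\n1,0,,0\n0,0,255,\n4,0,0,0\n"

demo_df = pd.read_csv(io.StringIO(csv_text))
n_missing_before = int(demo_df.isna().sum().sum())  # total NaN cells before filling

demo_df = demo_df.fillna(0)  # non-inplace equivalent of fillna(value=0, inplace=True)
n_missing_after = int(demo_df.isna().sum().sum())

print(demo_df.head())
demo_df.info()
print(n_missing_before, n_missing_after)
```

Columns that contained missing values are read as floats, which may be why some of the later pixel columns above display `0.0` rather than `0`.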
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"`2.` Create a vector called y that holds the **label** column of the dataset. Store all other columns holding the pixel data of your images in X."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"y = df['label']\n",
|
||
"X = df.drop('label', axis=1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 23,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"That looks right!\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"#Check Your Solution \n",
|
||
"t.question_two_check(y, X)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"`3.` Now use the `show_images_by_digit` function from the `helper_functions` module to take a look at some of the `1`'s, `2`'s, `3`'s, or any other value you are interested in looking at. Do they all look like what you would expect?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"/home/workspace/helper_functions.py:63: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.\n",
|
||
" mat_data = X.iloc[indices[0][digit_num]].as_matrix().reshape(28,28) #reshape images\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"image/png": "\n",
|
||
"text/plain": [
|
||
"<matplotlib.figure.Figure at 0x7f6be2e500f0>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"show_images_by_digit(7) # Try looking at a few other digits"
|
||
]
|
||
},
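The `FutureWarning` above comes from `.as_matrix()` inside the helper. The sketch below shows how one row of 784 pixel values might be turned into a 28x28 image with `.to_numpy()` instead. The frame `demo_X` and the lit pixel are invented for illustration; this is not the helper's actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical single-image frame with 784 pixel columns named like the dataset's.
pixel_cols = ["pixel{}".format(i) for i in range(784)]
demo_X = pd.DataFrame(np.zeros((1, 784)), columns=pixel_cols)
demo_X.loc[0, "pixel406"] = 255.0  # 406 = 14 * 28 + 14, so row 14, column 14

# .to_numpy() (or .values on older pandas) replaces the deprecated .as_matrix().
image = demo_X.iloc[0].to_numpy().reshape(28, 28)

row, col = np.argwhere(image == 255.0)[0]
print(image.shape, row, col)
```

Passing `image` to `plt.imshow(image, cmap="gray")` would then display it, assuming matplotlib is imported as in the first code cell.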
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"`4.` Now that you have had a chance to look through some of the data, you can try different algorithms to see which best predicts the response from the X matrix. If you would like to use the function I used in the video regarding random forests, you can run the code below, but you might also try any of the supervised techniques you learned in the previous course to see what works best.\n",
|
||
"\n",
|
||
"If you decide to put together your own classifier, remember the 4 steps to this process:\n",
|
||
"\n",
|
||
"**I.** Instantiate your model. (with all the hyperparameter values you care about)\n",
|
||
"\n",
|
||
"**II.** Fit your model. (to the training data)\n",
|
||
"\n",
|
||
"**III.** Predict using your fitted model. (on the test data)\n",
|
||
"\n",
|
||
"**IV.** Score your model. (comparing the predictions to the actual values on the test data)\n",
|
||
"\n",
|
||
"You can also try a grid search to see if you can improve on your initial predictions."
|
||
]
|
||
},
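The four steps above can be sketched as follows. This is a minimal example under stated assumptions, not the implementation of `fit_random_forest_classifier`: scikit-learn's bundled 8x8 digits stand in for the MNIST CSV, and the split size, seed, and number of trees are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: the small bundled digits dataset stands in for ./data/train.csv.
X_demo, y_demo = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# I. Instantiate the model with the hyperparameters you care about.
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# II. Fit the model to the training data.
clf.fit(X_train, y_train)

# III. Predict on the test data.
y_pred = clf.predict(X_test)

# IV. Score the predictions against the actual test labels.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(acc)
```

Swapping in another classifier only changes step I; the fit, predict, and score steps keep the same shape.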
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[[202 0 0 0 0 0 5 0 2 0]\n",
|
||
" [ 0 234 3 0 0 1 1 2 2 0]\n",
|
||
" [ 1 6 212 1 2 0 0 5 1 0]\n",
|
||
" [ 2 0 6 169 0 10 0 1 2 1]\n",
|
||
" [ 0 0 0 0 171 0 2 1 0 4]\n",
|
||
" [ 2 1 0 5 0 174 3 0 1 0]\n",
|
||
" [ 1 0 0 0 1 2 206 1 0 0]\n",
|
||
" [ 0 0 6 1 7 0 0 204 1 6]\n",
|
||
" [ 0 1 4 7 0 4 1 0 183 2]\n",
|
||
" [ 2 1 0 2 8 1 0 5 6 183]]\n",
|
||
"0.931283037001\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.93128303700144166"
|
||
]
|
||
},
|
||
"execution_count": 25,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"# Run the line below to fit the RF model from the video; you can also try fitting your own!\n",
|
||
"fit_random_forest_classifier(X, y)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"`5.` Now, for the purpose of this lesson, let's look at PCA. In the video, I created a model using just two features. Replicate the process below. You can use the same `do_pca` function that was created in the previous video. Store your variables in **pca** and **X_pca**."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 42,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(6304, 11)"
|
||
]
|
||
},
|
||
"execution_count": 42,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"do_pca?\n",
|
||
"pca, X_pca = do_pca(11, X)\n",
|
||
"X_pca.shape"
|
||
]
|
||
},
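The source of `do_pca` is not shown in this notebook. A helper with the same call shape, `do_pca(n_components, data)` returning `(pca, X_pca)`, could plausibly standardize the features and then fit PCA, as sketched below. The function `do_pca_sketch` and its scaling step are assumptions, and the bundled 8x8 digits again stand in for the MNIST data.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def do_pca_sketch(n_components, data):
    """Hypothetical stand-in for do_pca: scale the features, then project them."""
    X_scaled = StandardScaler().fit_transform(data)  # zero mean, unit variance per column
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X_scaled)  # shape: (n_samples, n_components)
    return pca, X_pca


# Assumption: scikit-learn's bundled digits replace the 784-column pixel matrix X.
X_demo, _ = load_digits(return_X_y=True)
pca_demo, X_pca_demo = do_pca_sketch(2, X_demo)

print(X_pca_demo.shape)
print(pca_demo.explained_variance_ratio_.sum())
```

The sum of `explained_variance_ratio_` gives the share of total variance kept by the chosen components, which is one way to judge how much information a given `n_components` retains.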
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"`6.` **X_pca** has reduced the original 784 pixel features down to only the 2 components that capture the most variability in the pixel values. Use the space below to fit a model using these two features to predict the written digit. You can use the random forest model by running `fit_random_forest_classifier` the same way as in the video. How well does it perform?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 43,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[[189 0 2 0 1 7 7 0 1 2]\n",
|
||
" [ 0 231 2 0 0 0 3 0 7 0]\n",
|
||
" [ 2 2 207 3 3 1 4 3 3 0]\n",
|
||
" [ 1 0 12 157 0 6 1 2 11 1]\n",
|
||
" [ 0 0 2 0 147 1 4 3 1 20]\n",
|
||
" [ 7 0 4 7 3 150 1 0 8 6]\n",
|
||
" [ 6 0 1 0 2 1 200 0 1 0]\n",
|
||
" [ 1 2 4 1 3 1 1 195 5 12]\n",
|
||
" [ 2 2 3 7 3 10 0 1 173 1]\n",
|
||
" [ 3 0 1 2 31 0 0 18 6 147]]\n",
|
||
"0.863046612206\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.86304661220567036"
|
||
]
|
||
},
|
||
"execution_count": 43,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"fit_random_forest_classifier?\n",
|
||
"fit_random_forest_classifier(X_pca, y)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"`7.` Now you can look at the separation of the values using the `plot_components` function. If you plot all of the points (more than 6,000 in this dataset), you will likely not be able to see much of what is happening. I recommend plotting just a subset of the data. Which digit(s) are separated well enough to be predicted better than others based on these two components?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 37,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "\n",
|
||
"text/plain": [
|
||
"<matplotlib.figure.Figure at 0x7f6be270c278>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Try plotting some of the numbers below - you can change the number\n",
|
||
"# of digits that are plotted, but it is probably best not to plot the \n",
|
||
"# entire dataset. Your visual will not be readable.\n",
|
||
"\n",
|
||
"plot_components(X_pca[:200], y[:200])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"`8.` See if you can find a reduced number of features that provides better separation to make predictions. Say you want separation that allows for an accuracy of more than 90%. How many principal components are needed to obtain this level of accuracy? Were you able to substantially reduce the number of features needed in your final model?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 44,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[[111 0 38 13 7 7 26 0 7 0]\n",
|
||
" [ 0 208 0 4 1 8 4 7 7 4]\n",
|
||
" [ 38 2 72 25 16 23 35 1 13 3]\n",
|
||
" [ 13 2 31 18 22 27 25 15 25 13]\n",
|
||
" [ 5 4 18 19 47 17 16 11 10 31]\n",
|
||
" [ 8 8 31 24 14 23 30 3 34 11]\n",
|
||
" [ 30 4 54 17 13 28 38 2 23 2]\n",
|
||
" [ 3 6 3 10 33 9 3 90 8 60]\n",
|
||
" [ 7 6 23 29 24 32 28 7 36 10]\n",
|
||
" [ 2 12 2 17 29 9 4 54 8 71]]\n",
|
||
"0.34310427679\n",
|
||
"[[157 0 6 2 1 11 18 1 13 0]\n",
|
||
" [ 0 224 1 4 2 5 5 0 2 0]\n",
|
||
" [ 18 3 122 30 5 3 38 0 8 1]\n",
|
||
" [ 3 3 59 54 6 15 27 4 17 3]\n",
|
||
" [ 6 0 2 4 95 15 1 13 16 26]\n",
|
||
" [ 7 0 8 17 22 63 27 7 29 6]\n",
|
||
" [ 17 8 18 15 2 10 128 0 13 0]\n",
|
||
" [ 0 2 0 4 24 9 1 120 8 57]\n",
|
||
" [ 15 2 6 29 27 40 36 2 44 1]\n",
|
||
" [ 2 2 1 2 45 4 1 55 8 88]]\n",
|
||
"0.526189332052\n",
|
||
"[[175 0 8 0 4 11 4 0 7 0]\n",
|
||
" [ 0 225 1 1 0 2 3 2 9 0]\n",
|
||
" [ 13 2 167 10 4 8 20 0 3 1]\n",
|
||
" [ 2 0 22 122 0 11 3 1 28 2]\n",
|
||
" [ 5 0 4 1 124 1 7 15 0 21]\n",
|
||
" [ 5 0 5 28 16 81 3 5 34 9]\n",
|
||
" [ 4 0 28 0 2 3 173 0 1 0]\n",
|
||
" [ 3 3 3 1 20 4 0 133 9 49]\n",
|
||
" [ 14 6 6 29 4 42 1 3 97 0]\n",
|
||
" [ 2 1 2 2 47 6 1 45 7 95]]\n",
|
||
"0.66890917828\n",
|
||
"[[176 0 9 0 4 6 5 1 7 1]\n",
|
||
" [ 0 229 2 1 0 2 2 2 4 1]\n",
|
||
" [ 14 2 176 10 5 8 11 1 1 0]\n",
|
||
" [ 1 0 24 119 0 13 1 2 29 2]\n",
|
||
" [ 3 0 1 0 120 1 10 6 1 36]\n",
|
||
" [ 9 1 10 23 3 93 2 9 31 5]\n",
|
||
" [ 3 0 17 0 3 0 187 0 1 0]\n",
|
||
" [ 2 2 5 3 9 6 0 172 4 22]\n",
|
||
" [ 14 2 3 19 4 49 2 3 105 1]\n",
|
||
" [ 2 1 2 1 50 8 1 24 4 115]]\n",
|
||
"0.716962998558\n",
|
||
"[[177 0 5 0 4 11 8 0 4 0]\n",
|
||
" [ 0 231 2 1 0 1 2 0 6 0]\n",
|
||
" [ 2 2 204 6 5 1 1 2 5 0]\n",
|
||
" [ 0 0 18 134 0 10 2 3 23 1]\n",
|
||
" [ 1 0 3 0 125 1 8 7 1 32]\n",
|
||
" [ 7 0 1 10 2 146 2 4 9 5]\n",
|
||
" [ 4 0 6 1 4 0 195 0 1 0]\n",
|
||
" [ 1 3 7 0 10 1 0 182 4 17]\n",
|
||
" [ 4 3 5 22 6 13 1 2 145 1]\n",
|
||
" [ 0 1 0 0 56 2 1 24 11 113]]\n",
|
||
"0.793849111004\n",
|
||
"[[178 0 4 0 4 10 8 0 5 0]\n",
|
||
" [ 0 231 2 2 0 0 2 0 5 1]\n",
|
||
" [ 2 2 205 6 4 0 3 2 4 0]\n",
|
||
" [ 0 1 15 132 0 10 2 5 25 1]\n",
|
||
" [ 0 0 2 0 132 1 4 7 5 27]\n",
|
||
" [ 10 1 2 11 5 141 0 5 9 2]\n",
|
||
" [ 3 0 5 1 4 0 196 0 2 0]\n",
|
||
" [ 1 2 6 0 8 2 1 180 7 18]\n",
|
||
" [ 1 2 4 15 2 13 0 2 161 2]\n",
|
||
" [ 1 0 0 1 52 1 1 28 10 114]]\n",
|
||
"0.802498798654\n",
|
||
"[[184 0 3 1 1 8 10 0 1 1]\n",
|
||
" [ 0 231 2 3 0 1 2 0 4 0]\n",
|
||
" [ 2 2 202 7 5 1 2 3 4 0]\n",
|
||
" [ 1 0 13 153 0 8 3 2 11 0]\n",
|
||
" [ 0 0 3 0 135 2 7 2 3 26]\n",
|
||
" [ 6 0 4 6 5 154 1 0 8 2]\n",
|
||
" [ 6 0 4 0 3 0 196 0 2 0]\n",
|
||
" [ 0 0 5 1 6 1 1 195 4 12]\n",
|
||
" [ 1 2 4 14 4 12 0 2 162 1]\n",
|
||
" [ 1 0 1 1 46 1 0 18 9 131]]\n",
|
||
"0.837578087458\n",
|
||
"[[187 0 2 1 1 8 8 0 1 1]\n",
|
||
" [ 0 234 3 0 0 0 2 0 4 0]\n",
|
||
" [ 1 2 205 6 4 0 6 3 1 0]\n",
|
||
" [ 1 0 12 152 0 10 0 3 11 2]\n",
|
||
" [ 0 0 3 0 135 1 5 2 2 30]\n",
|
||
" [ 7 0 2 4 3 159 0 0 7 4]\n",
|
||
" [ 4 0 2 0 2 0 202 0 1 0]\n",
|
||
" [ 1 2 5 1 6 0 1 191 6 12]\n",
|
||
" [ 2 2 2 10 3 13 0 0 169 1]\n",
|
||
" [ 2 0 0 1 45 2 1 18 8 131]]\n",
|
||
"0.848149927919\n",
|
||
"[[191 0 3 0 1 5 7 0 1 1]\n",
|
||
" [ 0 233 2 1 0 0 2 0 4 1]\n",
|
||
" [ 2 2 208 5 3 0 1 3 4 0]\n",
|
||
" [ 0 0 14 158 0 6 0 2 10 1]\n",
|
||
" [ 0 0 0 0 144 2 5 3 1 23]\n",
|
||
" [ 4 0 2 9 3 154 0 0 8 6]\n",
|
||
" [ 6 0 1 0 4 0 199 0 1 0]\n",
|
||
" [ 0 2 4 1 6 0 1 193 7 11]\n",
|
||
" [ 2 2 1 11 4 11 0 1 169 1]\n",
|
||
" [ 1 0 0 3 32 1 1 16 10 144]]\n",
|
||
"0.861604997597\n",
|
||
"[[189 0 4 0 0 6 8 0 1 1]\n",
|
||
" [ 0 232 2 2 0 0 2 0 4 1]\n",
|
||
" [ 1 2 206 4 3 1 4 3 4 0]\n",
|
||
" [ 1 0 10 159 0 7 0 2 12 0]\n",
|
||
" [ 0 0 1 0 143 1 6 2 0 25]\n",
|
||
" [ 6 0 2 9 4 153 1 0 6 5]\n",
|
||
" [ 5 0 2 0 3 0 200 0 1 0]\n",
|
||
" [ 0 1 4 1 6 0 1 193 6 13]\n",
|
||
" [ 2 2 2 8 3 10 0 1 172 2]\n",
|
||
" [ 2 2 0 2 31 0 1 17 8 145]]\n",
|
||
"0.861124459395\n",
|
||
"[[190 0 3 1 0 4 7 2 2 0]\n",
|
||
" [ 0 232 2 1 0 0 2 0 5 1]\n",
|
||
" [ 2 2 207 4 2 1 3 3 4 0]\n",
|
||
" [ 1 0 10 156 0 8 1 2 12 1]\n",
|
||
" [ 0 0 2 0 149 1 4 2 1 19]\n",
|
||
" [ 6 0 3 7 2 157 0 0 6 5]\n",
|
||
" [ 4 0 1 0 2 0 203 0 1 0]\n",
|
||
" [ 0 1 4 3 4 0 1 194 6 12]\n",
|
||
" [ 2 2 1 9 3 12 0 1 171 1]\n",
|
||
" [ 2 0 1 3 26 1 0 16 9 150]]\n",
|
||
"0.869293608842\n",
|
||
"[[191 0 4 0 0 4 8 1 1 0]\n",
|
||
" [ 0 233 2 0 0 0 2 0 5 1]\n",
|
||
" [ 3 3 207 4 3 0 3 2 3 0]\n",
|
||
" [ 1 0 13 158 0 6 0 2 9 2]\n",
|
||
" [ 0 0 2 0 145 1 6 1 1 22]\n",
|
||
" [ 2 0 2 8 4 158 3 0 5 4]\n",
|
||
" [ 5 0 1 0 2 0 201 1 1 0]\n",
|
||
" [ 1 2 4 2 2 0 1 197 6 10]\n",
|
||
" [ 2 1 1 9 4 13 0 0 171 1]\n",
|
||
" [ 2 0 1 2 32 0 0 17 8 146]]\n",
|
||
"0.868332532436\n",
|
||
"[[190 0 3 2 1 4 7 1 1 0]\n",
|
||
" [ 0 234 2 0 0 1 2 0 4 0]\n",
|
||
" [ 1 2 208 6 3 0 2 3 3 0]\n",
|
||
" [ 2 0 8 165 0 4 1 2 6 3]\n",
|
||
" [ 0 0 3 0 149 1 6 1 0 18]\n",
|
||
" [ 2 0 2 6 0 166 2 0 5 3]\n",
|
||
" [ 5 0 1 0 2 1 201 0 1 0]\n",
|
||
" [ 0 2 5 2 2 0 1 196 3 14]\n",
|
||
" [ 3 2 2 11 1 10 0 0 171 2]\n",
|
||
" [ 3 0 1 2 30 0 0 17 3 152]]\n",
|
||
"0.880345987506\n",
|
||
"[[191 0 3 0 0 4 8 1 1 1]\n",
|
||
" [ 0 233 2 0 0 0 2 0 5 1]\n",
|
||
" [ 2 2 206 5 3 1 3 4 2 0]\n",
|
||
" [ 0 0 8 163 0 8 1 2 8 1]\n",
|
||
" [ 0 0 2 0 158 0 4 1 0 13]\n",
|
||
" [ 2 0 4 5 3 161 2 0 5 4]\n",
|
||
" [ 6 0 1 0 2 2 199 0 1 0]\n",
|
||
" [ 1 2 5 3 2 0 1 197 2 12]\n",
|
||
" [ 3 2 3 13 1 10 1 0 166 3]\n",
|
||
" [ 3 0 1 2 23 0 0 15 2 162]]\n",
|
||
"0.882268140317\n",
|
||
"[[191 0 3 1 1 4 7 0 1 1]\n",
|
||
" [ 0 234 2 0 0 0 2 0 4 1]\n",
|
||
" [ 2 2 204 6 3 1 4 3 3 0]\n",
|
||
" [ 1 0 8 164 0 5 0 2 8 3]\n",
|
||
" [ 0 0 4 0 151 1 6 1 0 15]\n",
|
||
" [ 1 0 3 8 2 160 2 0 6 4]\n",
|
||
" [ 5 0 1 0 3 1 200 0 1 0]\n",
|
||
" [ 0 1 6 2 4 0 0 196 3 13]\n",
|
||
" [ 2 2 1 10 1 10 0 0 175 1]\n",
|
||
" [ 2 0 1 2 17 1 0 14 4 167]]\n",
|
||
"0.885151369534\n",
|
||
"[[190 0 3 1 1 4 8 0 1 1]\n",
|
||
" [ 0 233 2 0 0 1 2 0 4 1]\n",
|
||
" [ 3 2 204 6 2 0 4 3 4 0]\n",
|
||
" [ 1 0 8 164 0 5 0 2 9 2]\n",
|
||
" [ 0 0 4 0 154 1 6 1 0 12]\n",
|
||
" [ 1 0 3 7 1 164 0 0 7 3]\n",
|
||
" [ 3 0 1 0 4 2 200 0 1 0]\n",
|
||
" [ 0 1 6 3 2 0 2 193 2 16]\n",
|
||
" [ 2 2 1 10 1 8 0 0 176 2]\n",
|
||
" [ 2 0 1 3 13 0 0 15 2 172]]\n",
|
||
"0.888995675156\n",
|
||
"[[192 0 3 1 0 3 7 1 2 0]\n",
|
||
" [ 0 233 2 1 0 0 2 0 4 1]\n",
|
||
" [ 2 2 206 6 3 0 2 3 4 0]\n",
|
||
" [ 0 0 8 168 0 5 0 2 7 1]\n",
|
||
" [ 0 0 3 0 156 1 5 1 0 12]\n",
|
||
" [ 2 0 2 7 1 166 2 0 3 3]\n",
|
||
" [ 4 0 1 0 3 1 201 0 1 0]\n",
|
||
" [ 0 0 5 3 6 0 0 196 2 13]\n",
|
||
" [ 3 2 2 13 1 8 1 0 169 3]\n",
|
||
" [ 2 1 2 3 17 0 0 16 1 166]]\n",
|
||
"0.890437289765\n",
|
||
"[[194 0 3 0 0 2 8 0 1 1]\n",
|
||
" [ 0 233 2 0 0 2 2 0 4 0]\n",
|
||
" [ 2 2 207 5 2 0 2 3 3 2]\n",
|
||
" [ 0 0 8 161 0 8 1 2 10 1]\n",
|
||
" [ 0 0 3 0 156 1 4 1 0 13]\n",
|
||
" [ 2 0 4 5 2 164 1 0 5 3]\n",
|
||
" [ 5 0 1 0 4 1 199 0 1 0]\n",
|
||
" [ 1 1 6 3 5 0 0 195 3 11]\n",
|
||
" [ 3 2 1 11 2 8 0 0 174 1]\n",
|
||
" [ 1 0 1 3 13 0 0 14 1 175]]\n",
|
||
"0.892839980778\n",
|
||
"[[191 0 4 0 0 2 9 1 1 1]\n",
|
||
" [ 0 234 2 0 0 1 2 0 4 0]\n",
|
||
" [ 3 2 206 5 2 0 3 3 4 0]\n",
|
||
" [ 0 0 8 167 0 9 0 2 4 1]\n",
|
||
" [ 0 0 3 0 159 1 2 1 0 12]\n",
|
||
" [ 2 0 1 7 1 168 0 0 4 3]\n",
|
||
" [ 5 0 2 0 4 0 199 0 1 0]\n",
|
||
" [ 0 1 5 2 5 0 1 200 2 9]\n",
|
||
" [ 2 2 1 10 0 11 0 0 174 2]\n",
|
||
" [ 2 0 2 3 17 0 0 16 1 167]]\n",
|
||
"0.896203748198\n",
|
||
"[[194 0 3 0 0 1 8 1 1 1]\n",
|
||
" [ 0 234 2 1 0 0 2 0 3 1]\n",
|
||
" [ 2 2 209 5 2 0 3 2 3 0]\n",
|
||
" [ 0 0 9 165 0 4 1 2 9 1]\n",
|
||
" [ 0 0 3 0 155 1 4 1 0 14]\n",
|
||
" [ 1 0 1 7 2 166 0 0 6 3]\n",
|
||
" [ 5 0 1 0 2 2 201 0 0 0]\n",
|
||
" [ 2 0 6 3 4 0 1 193 2 14]\n",
|
||
" [ 3 2 1 11 1 4 0 0 178 2]\n",
|
||
" [ 2 0 2 3 14 0 0 14 1 172]]\n",
|
||
"0.897164824604\n",
|
||
"[[194 0 2 1 1 1 7 1 1 1]\n",
|
||
" [ 0 233 2 1 0 0 2 0 5 0]\n",
|
||
" [ 2 2 209 4 2 0 2 3 2 2]\n",
|
||
" [ 0 0 8 167 0 6 0 2 6 2]\n",
|
||
" [ 0 0 4 0 152 1 4 1 0 16]\n",
|
||
" [ 2 0 1 9 1 166 1 0 3 3]\n",
|
||
" [ 5 0 1 0 2 2 201 0 0 0]\n",
|
||
" [ 1 1 7 1 2 0 0 193 4 16]\n",
|
||
" [ 3 1 1 12 0 7 1 0 174 3]\n",
|
||
" [ 1 0 2 3 17 1 0 14 2 168]]\n",
|
||
"0.892359442576\n",
|
||
"[[190 0 3 1 2 1 8 2 1 1]\n",
|
||
" [ 0 234 2 0 0 0 2 0 4 1]\n",
|
||
" [ 2 2 209 3 2 0 3 4 3 0]\n",
|
||
" [ 0 0 7 167 0 5 1 2 8 1]\n",
|
||
" [ 1 0 4 0 155 1 5 1 0 11]\n",
|
||
" [ 2 0 3 7 1 163 3 0 4 3]\n",
|
||
" [ 4 0 1 0 2 2 201 0 1 0]\n",
|
||
" [ 1 2 6 1 5 0 0 196 2 12]\n",
|
||
" [ 3 1 1 12 2 6 1 0 174 2]\n",
|
||
" [ 2 0 2 3 15 0 0 12 2 172]]\n",
|
||
"0.894281595387\n",
|
||
"[[193 0 2 1 1 1 6 3 1 1]\n",
|
||
" [ 0 234 2 1 0 0 2 0 3 1]\n",
|
||
" [ 3 2 207 4 1 0 3 3 4 1]\n",
|
||
" [ 0 0 9 168 0 6 0 2 5 1]\n",
|
||
" [ 0 0 4 0 154 1 4 1 0 14]\n",
|
||
" [ 2 0 3 7 1 165 2 0 3 3]\n",
|
||
" [ 6 0 1 0 1 1 201 0 1 0]\n",
|
||
" [ 0 0 7 1 6 0 0 196 3 12]\n",
|
||
" [ 0 1 1 12 0 6 1 0 178 3]\n",
|
||
" [ 2 0 2 3 18 0 0 16 1 166]]\n",
|
||
"0.89476213359\n",
"[[191   0   4   1   1   1   7   1   1   2]\n",
" [  0 234   2   0   0   1   2   0   3   1]\n",
" [  2   2 208   3   2   0   4   3   4   0]\n",
" [  0   0   8 169   0   5   0   2   7   0]\n",
" [  0   0   3   0 157   1   3   1   0  13]\n",
" [  1   0   3   7   2 166   1   0   2   4]\n",
" [  2   0   2   0   3   2 202   0   0   0]\n",
" [  0   1   7   1   4   0   0 194   2  16]\n",
" [  0   1   1  12   1   5   0   1 181   0]\n",
" [  1   0   1   2  14   1   0  14   2 173]]\n",
"0.901009130226\n",
"With only 25 components, a random forest achieved an accuracy of 0.9010091302258529.\n"
]
}
],
"source": [
"# Find the smallest number of components (starting from 2) whose accuracy exceeds 0.90.\n",
"for comp in range(2, 100):\n",
"    pca, X_pca = do_pca(comp, X)\n",
"    acc = fit_random_forest_classifier(X_pca, y)\n",
"    if acc > .90:\n",
"        print(\"With only {} components, a random forest achieved an accuracy of {}.\".format(comp, acc))\n",
"        break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`9.` Extra features in a dataset can lead to overfitting or to the [curse of dimensionality](https://stats.stackexchange.com/questions/65379/machine-learning-curse-of-dimensionality-explained). Do you have evidence of either happening for this dataset? Can you support your evidence with a visual or table? To avoid printing out all of the metric results, I created another function called `fit_random_forest_classifier2`. I ran through a large number of components (2 through 699) to create the visual for the solution, but I strongly recommend that you stay in the range below 100 principal components!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Running the loop below as written is not recommended: it had to run overnight to complete.\n",
"# Instead, uncomment it with a smaller range of components, which still shows the trend.\n",
"\n",
"# accs = []\n",
"# comps = []\n",
"# for comp in range(2, 700):\n",
"#     comps.append(comp)\n",
"#     pca, X_pca = do_pca(comp, X)\n",
"#     acc = fit_random_forest_classifier2(X_pca, y)\n",
"#     accs.append(acc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Requires `comps` and `accs` from the loop in the previous cell; uncomment and run it first,\n",
"# otherwise this cell raises NameError: name 'comps' is not defined.\n",
"plt.plot(comps, accs, 'bo');\n",
"plt.xlabel('Number of Components');\n",
"plt.ylabel('Accuracy');\n",
"plt.title('Accuracy by Number of Components');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}