{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Your Turn\n", "\n", "In the previous video, you saw an example of working with some MNIST digits data. The MNIST dataset can be found here: http://yann.lecun.com/exdb/mnist/.\n", "\n", "First, let's import the necessary libraries. Notice there are also some imports from a file called `helper_functions`, which contains the functions used in the previous video." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, accuracy_score\n", "from helper_functions import show_images, show_images_by_digit, fit_random_forest_classifier2 \n", "from helper_functions import fit_random_forest_classifier, do_pca, plot_components\n", "import test_code as t\n", "\n", "import matplotlib.image as mpimg\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`1.` Use pandas to read in the dataset, which can be found in this workspace using the filepath **'./data/train.csv'**. If you have missing values, fill them with 0. Take a look at info about the data using `head`, `tail`, `describe`, `info`, etc. You can learn more about the data values from the article here: https://homepages.inf.ed.ac.uk/rbf/HIPR2/value.htm." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('./data/train.csv')\n", "df.fillna(value=0, inplace=True)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | label | \n", "pixel0 | \n", "pixel1 | \n", "pixel2 | \n", "pixel3 | \n", "pixel4 | \n", "pixel5 | \n", "pixel6 | \n", "pixel7 | \n", "pixel8 | \n", "... | \n", "pixel774 | \n", "pixel775 | \n", "pixel776 | \n", "pixel777 | \n", "pixel778 | \n", "pixel779 | \n", "pixel780 | \n", "pixel781 | \n", "pixel782 | \n", "pixel783 | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| 1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| 2 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| 3 | \n", "4 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| 4 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
5 rows × 785 columns
\n", "| \n", " | label | \n", "pixel0 | \n", "pixel1 | \n", "pixel2 | \n", "pixel3 | \n", "pixel4 | \n", "pixel5 | \n", "pixel6 | \n", "pixel7 | \n", "pixel8 | \n", "... | \n", "pixel774 | \n", "pixel775 | \n", "pixel776 | \n", "pixel777 | \n", "pixel778 | \n", "pixel779 | \n", "pixel780 | \n", "pixel781 | \n", "pixel782 | \n", "pixel783 | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | \n", "6304.000000 | \n", "6304.0 | \n", "6304.0 | \n", "6304.0 | \n", "6304.0 | \n", "6304.0 | \n", "6304.0 | \n", "6304.0 | \n", "6304.0 | \n", "6304.0 | \n", "... | \n", "6304.000000 | \n", "6304.000000 | \n", "6304.000000 | \n", "6304.0 | \n", "6304.0 | \n", "6304.0 | \n", "6304.0 | \n", "6304.0 | \n", "6304.0 | \n", "6304.0 | \n", "
| mean | \n", "4.440355 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.139594 | \n", "0.142291 | \n", "0.026967 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| std | \n", "2.885613 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "5.099940 | \n", "5.531089 | \n", "1.675547 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| min | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| 25% | \n", "2.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| 50% | \n", "4.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| 75% | \n", "7.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.000000 | \n", "0.000000 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| max | \n", "9.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "253.000000 | \n", "253.000000 | \n", "130.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
8 rows × 785 columns
\n", "