{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab: Titanic Survival Exploration with Decision Trees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Started\n", "In this lab, you will see how decision trees work by implementing a decision tree in sklearn.\n", "\n", "We'll start by loading the dataset and displaying some of its rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import libraries necessary for this project\n", "import numpy as np\n", "import pandas as pd\n", "from IPython.display import display # Allows the use of display() for DataFrames\n", "\n", "# Pretty display for notebooks\n", "%matplotlib inline\n", "\n", "# Set a random seed\n", "import random\n", "random.seed(42)\n", "\n", "# Load the dataset\n", "in_file = 'titanic_data.csv'\n", "full_data = pd.read_csv(in_file)\n", "\n", "# Print the first few entries of the RMS Titanic data\n", "display(full_data.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that these are the various features present for each passenger on the ship:\n", "- **Survived**: Outcome of survival (0 = No; 1 = Yes)\n", "- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)\n", "- **Name**: Name of passenger\n", "- **Sex**: Sex of the passenger\n", "- **Age**: Age of the passenger (Some entries contain `NaN`)\n", "- **SibSp**: Number of siblings and spouses of the passenger aboard\n", "- **Parch**: Number of parents and children of the passenger aboard\n", "- **Ticket**: Ticket number of the passenger\n", "- **Fare**: Fare paid by the passenger\n", "- **Cabin** Cabin number of the passenger (Some entries contain `NaN`)\n", "- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)\n", "\n", "Since we're interested in the outcome of survival for each passenger or crew member, we can remove the **Survived** feature from this dataset and store it as its own separate variable `outcomes`. We will use these outcomes as our prediction targets. \n", "Run the code cell below to remove **Survived** as a feature of the dataset and store it in `outcomes`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Store the 'Survived' feature in a new variable and remove it from the dataset\n", "outcomes = full_data['Survived']\n", "features_raw = full_data.drop('Survived', axis = 1)\n", "\n", "# Show the new dataset with 'Survived' removed\n", "display(features_raw.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The very same sample of the RMS Titanic data now shows the **Survived** feature removed from the DataFrame. Note that `data` (the passenger data) and `outcomes` (the outcomes of survival) are now *paired*. That means for any passenger `data.loc[i]`, they have the survival outcome `outcomes[i]`.\n", "\n", "## Preprocessing the data\n", "\n", "Now, let's do some data preprocessing. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Removing the names\n", "features_no_names = features_raw.drop(['Name'], axis=1)\n", "\n", "# One-hot encoding\n", "features = pd.get_dummies(features_no_names)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "And now we'll fill in any blanks with zeroes." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = features.fillna(0.0)\n", "display(features.head())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## (TODO) Training the model\n", "\n", "Now we're ready to train a model in sklearn. First, let's split the data into training and testing sets. Then we'll train the model on the training set." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the classifier from sklearn\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "# TODO: Define the classifier, and fit it to the data\n", "model = None" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Testing the model\n", "Now, let's see how our model does. Let's calculate the accuracy on both the training and the testing sets." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Making predictions\n", "y_train_pred = model.predict(X_train)\n", "y_test_pred = model.predict(X_test)\n", "\n", "# Calculate the accuracy\n", "from sklearn.metrics import accuracy_score\n", "train_accuracy = accuracy_score(y_train, y_train_pred)\n", "test_accuracy = accuracy_score(y_test, y_test_pred)\n", "print('The training accuracy is', train_accuracy)\n", "print('The test accuracy is', test_accuracy)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise: Improving the model\n", "\n", "High training accuracy and a lower testing accuracy: we may be overfitting a bit.\n", "\n", "So now it's your turn to shine! Train a new model, and try specifying some parameters in order to improve the testing accuracy, such as:\n", "- `max_depth`\n", "- `min_samples_leaf`\n", "- `min_samples_split`\n", "\n", "You can use your intuition or trial and error, or even better, feel free to use Grid Search!\n", "\n", "**Challenge:** Try to get to 85% accuracy on the testing set. If you'd like a hint, take a look at the solutions notebook next." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Train the model\n", "\n", "# TODO: Make predictions\n", "\n", "# TODO: Calculate the accuracy" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 1 }