{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab: Titanic Survival Exploration with Decision Trees"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting Started\n",
"In this lab, you will see how decision trees work by building a decision tree classifier in sklearn.\n",
"\n",
"We'll start by loading the dataset and displaying some of its rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import libraries necessary for this project\n",
"import numpy as np\n",
"import pandas as pd\n",
"from IPython.display import display # Allows the use of display() for DataFrames\n",
"\n",
"# Pretty display for notebooks\n",
"%matplotlib inline\n",
"\n",
"# Set random seeds for reproducibility (both Python's and NumPy's generators,\n",
"# since sklearn draws its randomness from NumPy)\n",
"import random\n",
"random.seed(42)\n",
"np.random.seed(42)\n",
"\n",
"# Load the dataset\n",
"in_file = 'titanic_data.csv'\n",
"full_data = pd.read_csv(in_file)\n",
"\n",
"# Print the first few entries of the RMS Titanic data\n",
"display(full_data.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall that these are the various features present for each passenger on the ship:\n",
"- **Survived**: Outcome of survival (0 = No; 1 = Yes)\n",
"- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)\n",
"- **Name**: Name of passenger\n",
"- **Sex**: Sex of the passenger\n",
"- **Age**: Age of the passenger (Some entries contain `NaN`)\n",
"- **SibSp**: Number of siblings and spouses of the passenger aboard\n",
"- **Parch**: Number of parents and children of the passenger aboard\n",
"- **Ticket**: Ticket number of the passenger\n",
"- **Fare**: Fare paid by the passenger\n",
"- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)\n",
"- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)\n",
"\n",
"Since we're interested in the outcome of survival for each passenger, we can remove the **Survived** feature from this dataset and store it as its own separate variable `outcomes`. We will use these outcomes as our prediction targets.\n",
"Run the code cell below to remove **Survived** as a feature of the dataset and store it in `outcomes`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Store the 'Survived' feature in a new variable and remove it from the dataset\n",
"outcomes = full_data['Survived']\n",
"features_raw = full_data.drop('Survived', axis=1)\n",
"\n",
"# Show the new dataset with 'Survived' removed\n",
"display(features_raw.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The very same sample of the RMS Titanic data now shows the **Survived** feature removed from the DataFrame. Note that `features_raw` (the passenger data) and `outcomes` (the outcomes of survival) are now *paired*. That means for any passenger `features_raw.loc[i]`, they have the survival outcome `outcomes[i]`.\n",
"\n",
"## Preprocessing the data\n",
"\n",
"Now, let's do some data preprocessing. First, we'll remove the names of the passengers, and then one-hot encode the features.\n",
"\n",
"**Question:** Why would it be a terrible idea to one-hot encode the data without removing the names?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Remove the names, which are unique per passenger and not useful for prediction\n",
"features_no_names = features_raw.drop(['Name'], axis=1)\n",
"\n",
"# One-hot encode the categorical features\n",
"features = pd.get_dummies(features_no_names)"
]
},
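{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what one-hot encoding does, here is a small illustration on a made-up toy DataFrame (the values below are not from the Titanic data). Each categorical column becomes one 0/1 indicator column per distinct value, so one-hot encoding a column of unique names would have produced one new column per passenger."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy example of one-hot encoding (illustrative data, not the lab dataset)\n",
"toy = pd.DataFrame({'Sex': ['male', 'female', 'male'], 'Embarked': ['S', 'C', 'Q']})\n",
"# Each distinct value in each column becomes its own 0/1 indicator column\n",
"print(pd.get_dummies(toy))"
]
},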
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we'll fill in any missing values (`NaN`) with zeroes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Replace any remaining NaN entries with 0.0\n",
"features = features.fillna(0.0)\n",
"display(features.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## (TODO) Training the model\n",
"\n",
"Now we're ready to train a model in sklearn. First, let's split the data into training and testing sets. Then we'll train the model on the training set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Split the data: 80% for training, 20% for testing\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import the classifier from sklearn\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# TODO: Define the classifier, and fit it to the data\n",
"model = None"
]
},
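{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're not sure where to start, a minimal version looks like the sketch below. The hyperparameter values here are arbitrary starting points for illustration, not the intended answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# One possible starting point (arbitrary parameters, for illustration only)\n",
"model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=6, random_state=42)\n",
"model.fit(X_train, y_train)"
]
},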
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Testing the model\n",
"Now let's see how our model does. We'll calculate its accuracy on both the training and the testing set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Making predictions\n",
"y_train_pred = model.predict(X_train)\n",
"y_test_pred = model.predict(X_test)\n",
"\n",
"# Calculate the accuracy\n",
"from sklearn.metrics import accuracy_score\n",
"train_accuracy = accuracy_score(y_train, y_train_pred)\n",
"test_accuracy = accuracy_score(y_test, y_test_pred)\n",
"print('The training accuracy is', train_accuracy)\n",
"print('The test accuracy is', test_accuracy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise: Improving the model\n",
"\n",
"We got a high training accuracy but a lower testing accuracy, so we may be overfitting a bit.\n",
"\n",
"So now it's your turn to shine! Train a new model, and try specifying some parameters in order to improve the testing accuracy, such as:\n",
"- `max_depth`\n",
"- `min_samples_leaf`\n",
"- `min_samples_split`\n",
"\n",
"You can use your intuition, trial and error, or even better, feel free to use Grid Search!\n",
"\n",
"**Challenge:** Try to get to 85% accuracy on the testing set. If you'd like a hint, take a look at the solutions notebook next."
]
},
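{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you go the Grid Search route, a sketch with sklearn's `GridSearchCV` might look like the cell below. The grid values are arbitrary starting points, so feel free to change them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A Grid Search sketch (the grid values are arbitrary starting points)\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"param_grid = {'max_depth': [4, 6, 8], 'min_samples_leaf': [2, 6, 10], 'min_samples_split': [2, 10]}\n",
"# 5-fold cross-validation over every combination in the grid\n",
"grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)\n",
"grid.fit(X_train, y_train)\n",
"print(grid.best_params_)"
]
},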
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: Train the model\n",
"\n",
"# TODO: Make predictions\n",
"\n",
"# TODO: Calculate the accuracy"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
|