{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Our Mission ##\n",
"\n",
"You recently used Naive Bayes to classify spam in this [dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). In this notebook, we will expand on the previous analysis by using a few of the new techniques you saw throughout this lesson.\n",
"\n",
"\n",
 In order to get">
"> To get back up to speed with what was done in the previous notebook, run the cell below."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy score: 0.9885139985642498\n",
"Precision score: 0.9720670391061452\n",
"Recall score: 0.9405405405405406\n",
"F1 score: 0.9560439560439562\n"
]
}
],
"source": [
"# Import our libraries\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
"\n",
"\n",
"# Read in our dataset\n",
"df = pd.read_table('smsspamcollection/SMSSpamCollection',\n",
" sep='\\t', \n",
" header=None, \n",
" names=['label', 'sms_message'])\n",
"\n",
"# Fix our response value\n",
"df['label'] = df.label.map({'ham':0, 'spam':1})\n",
"\n",
"# Split our dataset into training and testing data\n",
"X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], \n",
" df['label'], \n",
" random_state=1)\n",
"\n",
"# Instantiate the CountVectorizer method\n",
"count_vector = CountVectorizer()\n",
"\n",
"# Fit the training data and then return the matrix\n",
"training_data = count_vector.fit_transform(X_train)\n",
"\n",
"# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()\n",
"testing_data = count_vector.transform(X_test)\n",
"\n",
"# Instantiate our model\n",
"naive_bayes = MultinomialNB()\n",
"\n",
"# Fit our model to the training data\n",
"naive_bayes.fit(training_data, y_train)\n",
"\n",
"# Predict on the test data\n",
"predictions = naive_bayes.predict(testing_data)\n",
"\n",
"# Score our model\n",
"print('Accuracy score: ', format(accuracy_score(y_test, predictions)))\n",
"print('Precision score: ', format(precision_score(y_test, predictions)))\n",
"print('Recall score: ', format(recall_score(y_test, predictions)))\n",
"print('F1 score: ', format(f1_score(y_test, predictions)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Turns Out...\n",
"\n",
"It turns out that our naive bayes model actually does a pretty good job. However, let's take a look at a few additional models to see if we can't improve anyway.\n",
"\n",
"Specifically in this notebook, we will take a look at the following techniques:\n",
"\n",
"* [BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)\n",
"* [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)\n",
"* [AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)\n",
"\n",
"Another really useful guide for ensemble methods can be found [in the documentation here](http://scikit-learn.org/stable/modules/ensemble.html).\n",
"\n",
"These ensemble methods use a combination of techniques you have seen throughout this lesson:\n",
"\n",
"* **Bootstrap the data** passed through a learner (bagging).\n",
"* **Subset the features** used for a learner (combined with bagging signifies the two random components of random forests).\n",
"* **Ensemble learners** together in a way that allows those that perform best in certain areas to create the largest impact (boosting).\n",
"\n",
"\n",
"In this notebook, let's get some practice with these methods, which will also help you get comfortable with the process used for performing supervised machine learning in python in general.\n",
"\n",
"Since you cleaned and vectorized the text in the previous notebook, this notebook can be focused on the fun part - the machine learning part.\n",
"\n",
"### This Process Looks Familiar...\n",
"\n",
"In general, there is a five step process that can be used each type you want to use a supervised learning method (which you actually used above):\n",
"\n",
"1. **Import** the model.\n",
"2. **Instantiate** the model with the hyperparameters of interest.\n",
"3. **Fit** the model to the training data.\n",
"4. **Predict** on the test data.\n",
"5. **Score** the model by comparing the predictions to the actual values.\n",
"\n",
"Follow the steps through this notebook to perform these steps using each of the ensemble methods: **BaggingClassifier**, **RandomForestClassifier**, and **AdaBoostClassifier**.\n",
"\n",
"> **Step 1**: First use the documentation to `import` all three of the models."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Import the Bagging, RandomForest, and AdaBoost Classifier\n",
"from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
 **Step 2:** Now">
"> **Step 2:** Now that you have imported each of the classifiers, `instantiate` each with the hyperparameters specified in each comment. In the upcoming lessons, you will see how we can automate the process of finding the best hyperparameters. For now, let's get comfortable with the process and our new algorithms."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Instantiate a BaggingClassifier with:\n",
"# 200 weak learners (n_estimators) and everything else as default values\n",
"bagging = BaggingClassifier(n_estimators=200)\n",
"\n",
"\n",
"# Instantiate a RandomForestClassifier with:\n",
"# 200 weak learners (n_estimators) and everything else as default values\n",
"forest = RandomForestClassifier(n_estimators=200)\n",
"\n",
"# Instantiate an a AdaBoostClassifier with:\n",
"# With 300 weak learners (n_estimators) and a learning_rate of 0.2\n",
"ada = AdaBoostClassifier(n_estimators=300, learning_rate=0.2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
 **Step 3:** Now">
"> **Step 3:** Now that you have instantiated each of your models, `fit` them using the **training_data** and **y_train**. This may take a bit of time; you are fitting 700 weak learners, after all!"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,\n",
" learning_rate=0.2, n_estimators=300, random_state=None)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fit your BaggingClassifier to the training data\n",
"bagging.fit(training_data, y_train)\n",
"\n",
"# Fit your RandomForestClassifier to the training data\n",
"forest.fit(training_data, y_train)\n",
"\n",
"# Fit your AdaBoostClassifier to the training data\n",
"ada.fit(training_data, y_train)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Step 4:** Now that you have fit each of your models, you will use each to `predict` on the **testing_data**."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Predict using BaggingClassifier on the test data\n",
"bag_result = bagging.predict(testing_data)\n",
"\n",
"# Predict using RandomForestClassifier on the test data\n",
"forest_result = forest.predict(testing_data)\n",
"\n",
"# Predict using AdaBoostClassifier on the test data\n",
"ada_result = ada.predict(testing_data)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
 **Step 5:** Now">
"> **Step 5:** Now that you have made your predictions, compare them to the actual values using the function below for each of your models - this will give you the `score` for how well each model is performing. It is also useful to score the Naive Bayes model again here for comparison."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"def print_metrics(y_true, preds, model_name=None):\n",
" '''\n",
" INPUT:\n",
" y_true - the y values that are actually true in the dataset (numpy array or pandas series)\n",
" preds - the predictions for those values from some model (numpy array or pandas series)\n",
" model_name - (str - optional) a name associated with the model if you would like to add it to the print statements \n",
" \n",
" OUTPUT:\n",
" None - prints the accuracy, precision, recall, and F1 score\n",
" '''\n",
" if model_name == None:\n",
" print('Accuracy score: ', format(accuracy_score(y_true, preds)))\n",
" print('Precision score: ', format(precision_score(y_true, preds)))\n",
" print('Recall score: ', format(recall_score(y_true, preds)))\n",
" print('F1 score: ', format(f1_score(y_true, preds)))\n",
" print('\\n\\n')\n",
" \n",
" else:\n",
" print('Accuracy score for ' + model_name + ' :' , format(accuracy_score(y_true, preds)))\n",
" print('Precision score ' + model_name + ' :', format(precision_score(y_true, preds)))\n",
" print('Recall score ' + model_name + ' :', format(recall_score(y_true, preds)))\n",
" print('F1 score ' + model_name + ' :', format(f1_score(y_true, preds)))\n",
" print('\\n\\n')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy score for Bagging : 0.9734386216798278\n",
"Precision score Bagging : 0.9111111111111111\n",
"Recall score Bagging : 0.8864864864864865\n",
"F1 score Bagging : 0.8986301369863013\n",
"\n",
"\n",
"\n",
"Accuracy score for Random Forest : 0.9827709978463748\n",
"Precision score Random Forest : 1.0\n",
"Recall score Random Forest : 0.8702702702702703\n",
"F1 score Random Forest : 0.930635838150289\n",
"\n",
"\n",
"\n",
"Accuracy score for AdaBoost : 0.9770279971284996\n",
"Precision score AdaBoost : 0.9693251533742331\n",
"Recall score AdaBoost : 0.8540540540540541\n",
"F1 score AdaBoost : 0.9080459770114943\n",
"\n",
"\n",
"\n",
"Accuracy score for NB : 0.9885139985642498\n",
"Precision score NB : 0.9720670391061452\n",
"Recall score NB : 0.9405405405405406\n",
"F1 score NB : 0.9560439560439562\n",
"\n",
"\n",
"\n"
]
}
],
"source": [
"# Print Bagging scores\n",
"print_metrics(y_test, bag_result, model_name='Bagging')\n",
"\n",
"# Print Random Forest scores\n",
"print_metrics(y_test, forest_result, model_name='Random Forest')\n",
"\n",
"# Print AdaBoost scores\n",
"print_metrics(y_test, ada_result, model_name='AdaBoost')\n",
"\n",
"# Naive Bayes Classifier scores\n",
"print_metrics(y_test, predictions, model_name='NB')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recap\n",
"\n",
"Now you have seen the whole process for a few ensemble models! \n",
"\n",
"1. **Import** the model.\n",
"2. **Instantiate** the model with the hyperparameters of interest.\n",
"3. **Fit** the model to the training data.\n",
"4. **Predict** on the test data.\n",
"5. **Score** the model by comparing the predictions to the actual values.\n",
"\n",
"And that's it. This is a very common process for performing machine learning.\n",
"\n",
"\n",
"### But, Wait...\n",
"\n",
"You might be asking - \n",
"\n",
"* What do these metrics mean? \n",
"\n",
"* How do I optimize to get the best model? \n",
"\n",
"* There are so many hyperparameters to each of these models, how do I figure out what the best values are for each?\n",
"\n",
"**This is exactly what the last two lessons of this course on supervised learning are all about.**\n",
"\n",
"**Notice, you can obtain a solution to this notebook by clicking the orange icon in the top left!**\n"
]
},
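{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick reference while those lessons are still ahead (a preview, not a substitute), the four scores printed above are all built from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):\n",
"\n",
"* **Accuracy** = (TP + TN) / all predictions\n",
"* **Precision** = TP / (TP + FP) - of the messages flagged as spam, how many really were spam?\n",
"* **Recall** = TP / (TP + FN) - of the actual spam messages, how many were caught?\n",
"* **F1** = 2 * precision * recall / (precision + recall) - the harmonic mean of precision and recall\n",
"\n",
"For the hyperparameter question, scikit-learn's `GridSearchCV` automates the search. The cell below is a minimal sketch of how it could be applied to the `AdaBoostClassifier` above; the grid values are illustrative guesses, not tuned recommendations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of automated hyperparameter search with GridSearchCV.\n",
"# The grid values below are illustrative guesses, not tuned recommendations.\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"param_grid = {'n_estimators': [100, 200, 300],\n",
"              'learning_rate': [0.1, 0.2, 1.0]}\n",
"\n",
"# Try every combination with 3-fold cross-validation, scoring by F1\n",
"grid = GridSearchCV(AdaBoostClassifier(), param_grid=param_grid, scoring='f1', cv=3)\n",
"grid.fit(training_data, y_train)\n",
"\n",
"print('Best parameters: ', grid.best_params_)\n",
"print('Best cross-validated F1: ', grid.best_score_)"
]
},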
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}