{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Diabetes Case Study\n", "\n", "You have now had the opportunity to work with a range of supervised machine learning techniques for both classification and regression. Before you apply these in the project, let's do one more example to see how the machine learning process works from beginning to end with another popular dataset.\n", "\n", "We will start out by reading in the dataset and the necessary libraries. You will then gain an understanding of how to optimize a number of models using grid searching as you work through the notebook. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n", "0 6 148 72 35 0 33.6 \n", "1 1 85 66 29 0 26.6 \n", "2 8 183 64 0 0 23.3 \n", "3 1 89 66 23 94 28.1 \n", "4 0 137 40 35 168 43.1 \n", "\n", " DiabetesPedigreeFunction Age Outcome \n", "0 0.627 50 1 \n", "1 0.351 31 0 \n", "2 0.672 32 1 \n", "3 0.167 21 0 \n", "4 2.288 33 1 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import our libraries\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV\n", "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n", "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier\n", "\n", "import matplotlib.pyplot as plt\n", "from sklearn.svm import SVC\n", "import seaborn as sns\n", "sns.set(style=\"ticks\")\n", "\n", "import check_file as ch\n", "\n", "%matplotlib inline\n", "\n", "# Read in our dataset\n", "diabetes = pd.read_csv('diabetes.csv')\n", "\n", "# Take a look at the first few rows of the dataset\n", "diabetes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because this course has been aimed at understanding machine learning techniques, we have largely ignored items related to parts of the data analysis process that come before building machine learning models - exploratory data analysis, feature engineering, data cleaning, and data wrangling. \n", "\n", "> **Step 1:** Let's do a few of these steps here. Take a look at some of the usual summary statistics, then match each value to the appropriate key in the dictionary below. 
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Cells for work\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0\n", "Outcome \n", "0 0.65\n", "1 0.35" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outcome_count=diabetes.groupby(['Outcome']).size()\n", "pd.DataFrame((outcome_count/outcome_count.sum()).round(2))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Pregnancies Glucose BloodPressure SkinThickness Insulin \\\n", "count 768.000000 768.000000 768.000000 768.000000 768.000000 \n", "mean 3.845052 120.894531 69.105469 20.536458 79.799479 \n", "std 3.369578 31.972618 19.355807 15.952218 115.244002 \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "25% 1.000000 99.000000 62.000000 0.000000 0.000000 \n", "50% 3.000000 117.000000 72.000000 23.000000 30.500000 \n", "75% 6.000000 140.250000 80.000000 32.000000 127.250000 \n", "max 17.000000 199.000000 122.000000 99.000000 846.000000 \n", "\n", " BMI DiabetesPedigreeFunction Age Outcome \n", "count 768.000000 768.000000 768.000000 768.000000 \n", "mean 31.992578 0.471876 33.240885 0.348958 \n", "std 7.884160 0.331329 11.760232 0.476951 \n", "min 0.000000 0.078000 21.000000 0.000000 \n", "25% 27.300000 0.243750 24.000000 0.000000 \n", "50% 32.000000 0.372500 29.000000 0.000000 \n", "75% 36.600000 0.626250 41.000000 1.000000 \n", "max 67.100000 2.420000 81.000000 1.000000 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diabetes.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Awesome! 
These all look great!\n" ] } ], "source": [ "# Possible values for the dictionary\n", "a = '0.65'\n", "b = '0'\n", "c = 'Age'\n", "d = '0.35'\n", "e = 'Glucose'\n", "f = '0.5'\n", "g = \"More than zero\"\n", "\n", "# Fill in the dictionary with the correct values here\n", "answers_one = {\n", " 'The proportion of diabetes outcomes in the dataset': d,\n", " 'The number of missing data points in the dataset': b,\n", " 'A dataset with a symmetric distribution': e,\n", " 'A dataset with a right-skewed distribution': c, \n", " 'This variable has the strongest correlation with the outcome': e\n", "}\n", "\n", "# Just to check your answer, don't change this\n", "ch.check_one(answers_one)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Step 2**: Since our dataset here is quite clean, we will jump straight into the machine learning. Our goal here is to be able to predict cases of diabetes. First, you need to identify the y vector and X matrix. Then, the following code will divide your dataset into training and test data. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "y = diabetes['Outcome']\n", "X = diabetes.drop(['Outcome'], axis=1)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you have a training and testing dataset, we need to create some models and ultimately find the best of them. However, unlike in earlier lessons, where we used the defaults, we can now tune these models to be the very best models they can be.\n", "\n", "It can often be difficult (and extremely time consuming) to test all the possible hyperparameter combinations to find the best models. Therefore, it is often useful to set up a randomized search. \n", "\n", "In practice, randomized searches across hyperparameters have been shown to be less time consuming, while still optimizing quite well. 
One article related to this topic is available [here](https://blog.h2o.ai/2016/06/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/). The documentation for using randomized search in sklearn can be found [here](http://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py) and [here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html).\n", "\n", "In order to use the randomized search effectively, you will want a reasonable understanding of the range or distribution of values that makes sense for each hyperparameter. Understanding what values are possible for your hyperparameters will allow you to write a search that performs well (and doesn't break).\n", "\n", "> **Step 3**: In this step, I will show you how to use randomized search, and then you can set up grid searches for the other models in Step 4. However, you will be helping, as I don't remember exactly what each of the hyperparameters in SVMs does. 
Match each hyperparameter to its corresponding tuning functionality.\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for random forest : 0.7532467532467533\n", "Precision score random forest : 0.6545454545454545\n", "Recall score random forest : 0.6545454545454545\n", "F1 score random forest : 0.6545454545454545\n", "\n", "\n", "\n" ] } ], "source": [ "# Build a classifier\n", "clf_rf = RandomForestClassifier()\n", "\n", "# Set up the hyperparameter search\n", "# max_features ranges from 1 up to the number of feature columns\n", "param_dist = {\"max_depth\": [3, None],\n", " \"n_estimators\": list(range(10, 200)),\n", " \"max_features\": list(range(1, X_train.shape[1]+1)),\n", " \"min_samples_split\": list(range(2, 11)),\n", " \"min_samples_leaf\": list(range(1, 11)),\n", " \"bootstrap\": [True, False],\n", " \"criterion\": [\"gini\", \"entropy\"]}\n", "\n", "\n", "# Run a randomized search over the hyperparameters\n", "random_search = RandomizedSearchCV(clf_rf, param_distributions=param_dist)\n", "\n", "# Fit the model on the training data\n", "random_search.fit(X_train, y_train)\n", "\n", "# Make predictions on the test data\n", "rf_preds = random_search.best_estimator_.predict(X_test)\n", "\n", "ch.print_metrics(y_test, rf_preds, 'random forest')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Step 4**: Now that you have seen how to run a randomized search using a random forest, try this out for the AdaBoost and SVC classifiers. You might also decide to try out other classifiers that you saw earlier in the lesson to see what works best."
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for adaboost : 0.7792207792207793\n", "Precision score adaboost : 0.7441860465116279\n", "Recall score adaboost : 0.5818181818181818\n", "F1 score adaboost : 0.6530612244897959\n", "\n", "\n", "\n", "AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,\n", " learning_rate=0.4, n_estimators=10, random_state=None)\n" ] } ], "source": [ "# build a classifier for ada boost\n", "clf_ada = AdaBoostClassifier()\n", "\n", "# Set up the hyperparameter search\n", "# look at setting up your search for n_estimators, learning_rate\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html\n", "param_dist = {\"n_estimators\": [10, 100, 200, 400],\n", " \"learning_rate\": [0.001, 0.005, .01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1, 2, 10, 20]}\n", "\n", "# Run a randomized search over the hyperparameters\n", "ada_search = RandomizedSearchCV(clf_ada, param_distributions=param_dist)\n", "\n", "# Fit the model on the training data\n", "ada_search.fit(X_train, y_train)\n", "\n", "# Make predictions on the test data\n", "ada_preds = ada_search.best_estimator_.predict(X_test)\n", "\n", "# Return your metrics on test data\n", "ch.print_metrics(y_test, ada_preds, 'adaboost')\n", "\n", "# Print the hyperparams used\n", "print(ada_search.best_estimator_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for adaboost : 0.7597402597402597\n", "Precision score adaboost : 0.6551724137931034\n", "Recall score adaboost : 0.6909090909090909\n", "F1 score adaboost : 0.6725663716814159\n", "\n", "\n", "\n", "AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1,\n", " n_estimators=10, 
random_state=None)\n" ] } ], "source": [ "# Exhaustive grid search over the same AdaBoost parameter grid, for comparison\n", "ada_grid_search = GridSearchCV(clf_ada, param_dist)\n", "\n", "# Fit the model on the training data\n", "ada_grid_search.fit(X_train, y_train)\n", "\n", "# Make predictions on the test data\n", "ada_grid_preds = ada_grid_search.best_estimator_.predict(X_test)\n", "\n", "ch.print_metrics(y_test, ada_grid_preds, 'adaboost')\n", "\n", "print(ada_grid_search.best_estimator_)\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for svc : 0.7532467532467533\n", "Precision score svc : 0.6545454545454545\n", "Recall score svc : 0.6545454545454545\n", "F1 score svc : 0.6545454545454545\n", "\n", "\n", "\n", "SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False)\n" ] } ], "source": [ "# build a classifier for support vector machines\n", "clf_svc = SVC()\n", "\n", "# Set up the hyperparameter search\n", "# look at setting up your search for C (recommend 0-10 range), \n", "# kernel, and degree\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html\n", "param_dist = {\"C\": [0.1, 0.5, 1, 3, 5],\n", " \"kernel\": ['linear','rbf']\n", " }\n", "\n", "\n", "# Run a randomized search over the hyperparameters\n", "svc_search = RandomizedSearchCV(clf_svc, param_distributions=param_dist)\n", "\n", "# Fit the model on the training data\n", "svc_search.fit(X_train, y_train)\n", "\n", "# Make predictions on the test data\n", "svc_preds = svc_search.best_estimator_.predict(X_test)\n", "\n", "ch.print_metrics(y_test, svc_preds, 'svc')\n", "print(svc_search.best_estimator_)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Step 5**: Use the test below to see if your best model matches what we found after running the grid search. 
" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nice! It looks like your best model matches the best model I found as well! It makes sense to use f1 score to determine best in this case given the imbalance of classes. There might be justification for precision or recall being the best metric to use as well - precision showed to be best with adaboost again. With recall, SVMs proved to be the best for our models.\n" ] } ], "source": [ "a = 'randomforest'\n", "b = 'adaboost'\n", "c = 'supportvector'\n", "\n", "best_model = b\n", "\n", "# See if your best model was also mine. \n", "# Notice these might not match depending on your search!\n", "ch.check_best(best_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have found your best model, it is also important to understand why it is performing well. In regression models where you can see the weights, it can be much easier to interpret results. \n", "\n", "> **Step 6**: Despite the fact that your models here are more difficult to interpret, there are some ways to get an idea of which features are important. Using the \"best model\" from the previous question, find the features that were most important in helping determine if an individual would have diabetes or not. Do your conclusions match what you might have expected during the exploratory phase of this notebook?"
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 3 4 2 6 7 5 1]\n", "[ 0.04991133 0.32363753 0.08106476 0.05030821 0.05292798 0.17536724\n", " 0.11839917 0.14838379]\n" ] } ], "source": [ "# Column indices sorted from least to most important, then the raw importances\n", "print(np.argsort(random_search.best_estimator_.feature_importances_))\n", "print(random_search.best_estimator_.feature_importances_)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Show your work here - the plot below was helpful for me\n", "# https://stackoverflow.com/questions/44101458/random-forest-feature-importance-chart-using-python\n", "\n", "# Use the feature columns only (X excludes the Outcome column)\n", "features = X.columns\n", "importances = random_search.best_estimator_.feature_importances_\n", "indices = np.argsort(importances)\n", "\n", "plt.title('Feature Importances')\n", "plt.barh(range(len(indices)), importances[indices], color='b', align='center')\n", "plt.yticks(range(len(indices)), features[indices])\n", "plt.xlabel('Relative Importance');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Step 7**: Use your results above to complete the dictionary below." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "That's right! 
Some of these were expected, but some were a bit unexpected too!\n" ] } ], "source": [ "# Check your solution by matching the correct values in the dictionary\n", "# and running this cell\n", "a = 'Age'\n", "b = 'BloodPressure'\n", "c = 'BMI'\n", "d = 'DiabetesPedigreeFunction'\n", "e = 'Insulin'\n", "f = 'Glucose'\n", "g = 'Pregnancy'\n", "h = 'SkinThickness'\n", "\n", "\n", "\n", "sol_seven = {\n", " 'The variable that is most related to the outcome of diabetes' : f,\n", " 'The second most related variable to the outcome of diabetes' : c,\n", " 'The third most related variable to the outcome of diabetes' : a,\n", " 'The fourth most related variable to the outcome of diabetes' : d\n", "}\n", "\n", "ch.check_q_seven(sol_seven)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Step 8**: Now provide a summary of what you did through this notebook, and how you might explain the results to a non-technical individual. When you are done, check out the solution notebook by clicking the orange icon in the upper left." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "In this case study, we looked at predicting diabetes for 768 patients. There was a reasonable amount of class imbalance with just under 35% of patients having diabetes. There were no missing data, and initial looks at the data showed it would be difficult to separate patients with diabetes from those that did not have diabetes.\n", "\n", "Three advanced modeling techniques were used to predict whether or not a patient has diabetes. 
The most successful of these proved to be the AdaBoost classifier, which had the following metrics:\n", "\n", "Accuracy score for adaboost : 0.7792207792207793\n", "\n", "Precision score adaboost : 0.7560975609756098\n", "\n", "Recall score adaboost : 0.5636363636363636\n", "\n", "F1 score adaboost : 0.6458333333333333\n", "\n", "Based on the initial look at the data, it is unsurprising that Glucose, BMI, and Age were important in understanding if a patient has diabetes. These findings were consistent with the more sophisticated approaches. One interesting finding was that the number of pregnancies appeared to be correlated with the outcome in the initial look at the data. However, this was likely due to its strong correlation with age.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }