{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at how to implement a logistic regression model in Python. First, we need to import the required packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:37.509336Z", "iopub.status.busy": "2024-05-31T21:41:37.509056Z", "iopub.status.idle": "2024-05-31T21:41:39.996531Z", "shell.execute_reply": "2024-05-31T21:41:39.995766Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.preprocessing import StandardScaler, MinMaxScaler\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, recall_score, precision_score, roc_curve\n", "pd.set_option('display.max_columns', 50) # Display up to 50 columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's download the dataset automatically, unzip it, and place it in a folder called `data` if you haven't done so already" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:40.001148Z", "iopub.status.busy": "2024-05-31T21:41:40.000773Z", "iopub.status.idle": "2024-05-31T21:41:43.055849Z", "shell.execute_reply": "2024-05-31T21:41:43.055214Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading dataset...\n", "DONE!\n" ] } ], "source": [ "from io import BytesIO\n", "from urllib.request import urlopen\n", "from zipfile import ZipFile\n", "import os.path\n", "\n", "# Check if the file exists\n", "if not os.path.isfile('data/card_transdata.csv'):\n", "\n", "    print('Downloading dataset...')\n", "\n", "    # Define the dataset to be downloaded\n", "    zipurl = 'https://www.kaggle.com/api/v1/datasets/download/dhanushnarayananr/credit-card-fraud'\n", "\n", "    # Download and unzip the dataset in the data folder\n", "    with urlopen(zipurl) as zipresp:\n", "        with ZipFile(BytesIO(zipresp.read())) as zfile:\n", "            zfile.extractall('data')\n", "\n", "    print('DONE!')\n", "\n", "else:\n", "\n", "    print('Dataset already downloaded!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we can load the data into a DataFrame using the `read_csv` function from the `pandas` library" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:43.059169Z", "iopub.status.busy": "2024-05-31T21:41:43.058925Z", "iopub.status.idle": "2024-05-31T21:41:43.901661Z", "shell.execute_reply": "2024-05-31T21:41:43.901060Z" } }, "outputs": [], "source": [ "df = pd.read_csv('data/card_transdata.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that it is common to call this variable `df`, which is short for DataFrame.\n", "\n", "This is a **dataset of credit card transactions** from [Kaggle.com](https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud/data). The target variable $y$ is `fraud`, which indicates whether the transaction is fraudulent or not. The other variables are the features $x$ of the transactions.\n", "\n", "\n", "### Data Exploration & Preprocessing\n", "\n", "The first step whenever you load a new dataset is to familiarize yourself with it. You need to understand what the variables represent, what the target variable is, and what the data looks like. This is called **data exploration**. Depending on the dataset, you might need to preprocess it (e.g., check for missing values and duplicates, or create new variables) before you can use it to train a machine-learning model. 
This is called **data preprocessing**.\n", "\n", "#### Basic DataFrame Operations {-}\n", "\n", "Let's see how many rows and columns the dataset has" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:43.905015Z", "iopub.status.busy": "2024-05-31T21:41:43.904762Z", "iopub.status.idle": "2024-05-31T21:41:43.912619Z", "shell.execute_reply": "2024-05-31T21:41:43.912039Z" } }, "outputs": [ { "data": { "text/plain": [ "(1000000, 8)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset has 1 million rows (observations) and 8 columns (variables)! Now, let's have a look at the first few rows of the dataset with the `head()` method (transposed with `.T`, so that each variable appears as a row)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:43.915615Z", "iopub.status.busy": "2024-05-31T21:41:43.915385Z", "iopub.status.idle": "2024-05-31T21:41:43.929105Z", "shell.execute_reply": "2024-05-31T21:41:43.928533Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3 \\\n", "distance_from_home 57.877857 10.829943 5.091079 2.247564 \n", "distance_from_last_transaction 0.311140 0.175592 0.805153 5.600044 \n", "ratio_to_median_purchase_price 1.945940 1.294219 0.427715 0.362663 \n", "repeat_retailer 1.000000 1.000000 1.000000 1.000000 \n", "used_chip 1.000000 0.000000 0.000000 1.000000 \n", "used_pin_number 0.000000 0.000000 0.000000 0.000000 \n", "online_order 0.000000 0.000000 1.000000 1.000000 \n", "fraud 0.000000 0.000000 0.000000 0.000000 \n", "\n", " 4 \n", "distance_from_home 44.190936 \n", "distance_from_last_transaction 0.566486 \n", "ratio_to_median_purchase_price 2.222767 \n", "repeat_retailer 1.000000 \n", "used_chip 1.000000 \n", "used_pin_number 0.000000 \n", "online_order 1.000000 \n", "fraud 0.000000 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you would like to see more entries in the dataset, you can use the `head()` method with an argument corresponding to the number of rows, e.g.," ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:43.932192Z", "iopub.status.busy": "2024-05-31T21:41:43.931949Z", "iopub.status.idle": "2024-05-31T21:41:43.949990Z", "shell.execute_reply": "2024-05-31T21:41:43.949346Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " distance_from_home distance_from_last_transaction \\\n", "0 57.877857 0.311140 \n", "1 10.829943 0.175592 \n", "2 5.091079 0.805153 \n", "3 2.247564 5.600044 \n", "4 44.190936 0.566486 \n", "5 5.586408 13.261073 \n", "6 3.724019 0.956838 \n", "7 4.848247 0.320735 \n", "8 0.876632 2.503609 \n", "9 8.839047 2.970512 \n", "10 14.263530 0.158758 \n", "11 13.592368 0.240540 \n", "12 765.282559 0.371562 \n", "13 2.131956 56.372401 \n", "14 13.955972 0.271522 \n", "15 179.665148 0.120920 \n", "16 114.519789 0.707003 \n", "17 3.589649 6.247458 \n", "18 11.085152 34.661351 \n", "19 6.194671 1.142014 \n", "\n", " ratio_to_median_purchase_price repeat_retailer used_chip \\\n", "0 1.945940 1.0 1.0 \n", "1 1.294219 1.0 0.0 \n", "2 0.427715 1.0 0.0 \n", "3 0.362663 1.0 1.0 \n", "4 2.222767 1.0 1.0 \n", "5 0.064768 1.0 0.0 \n", "6 0.278465 1.0 0.0 \n", "7 1.273050 1.0 0.0 \n", "8 1.516999 0.0 0.0 \n", "9 2.361683 1.0 0.0 \n", "10 1.136102 1.0 1.0 \n", "11 1.370330 1.0 1.0 \n", "12 0.551245 1.0 1.0 \n", "13 6.358667 1.0 0.0 \n", "14 2.798901 1.0 0.0 \n", "15 0.535640 1.0 1.0 \n", "16 0.516990 1.0 0.0 \n", "17 1.846451 1.0 0.0 \n", "18 2.530758 1.0 0.0 \n", "19 0.307217 1.0 0.0 \n", "\n", " used_pin_number online_order fraud \n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 1.0 0.0 \n", "3 0.0 1.0 0.0 \n", "4 0.0 1.0 0.0 \n", "5 0.0 0.0 0.0 \n", "6 0.0 1.0 0.0 \n", "7 1.0 0.0 0.0 \n", "8 0.0 0.0 0.0 \n", "9 0.0 1.0 0.0 \n", "10 0.0 1.0 0.0 \n", "11 0.0 1.0 0.0 \n", "12 0.0 0.0 0.0 \n", "13 0.0 1.0 1.0 \n", "14 0.0 1.0 0.0 \n", "15 1.0 1.0 0.0 \n", "16 0.0 0.0 0.0 \n", "17 0.0 0.0 0.0 \n", "18 0.0 1.0 0.0 \n", "19 0.0 0.0 0.0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that analogously you can also use the `tail()` method to see the last few rows of the dataset.\n", "\n", "We can also check what the variables in our dataset are 
called" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:43.953563Z", "iopub.status.busy": "2024-05-31T21:41:43.953287Z", "iopub.status.idle": "2024-05-31T21:41:43.958035Z", "shell.execute_reply": "2024-05-31T21:41:43.957439Z" } }, "outputs": [ { "data": { "text/plain": [ "Index(['distance_from_home', 'distance_from_last_transaction',\n", " 'ratio_to_median_purchase_price', 'repeat_retailer', 'used_chip',\n", " 'used_pin_number', 'online_order', 'fraud'],\n", " dtype='object')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and the data types of the variables" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:43.961158Z", "iopub.status.busy": "2024-05-31T21:41:43.960918Z", "iopub.status.idle": "2024-05-31T21:41:43.965614Z", "shell.execute_reply": "2024-05-31T21:41:43.965085Z" } }, "outputs": [ { "data": { "text/plain": [ "distance_from_home float64\n", "distance_from_last_transaction float64\n", "ratio_to_median_purchase_price float64\n", "repeat_retailer float64\n", "used_chip float64\n", "used_pin_number float64\n", "online_order float64\n", "fraud float64\n", "dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, all our variables are floating-point numbers (`float`). This means that they can hold values with a fractional part, such as 1.5 or 3.14 (even the binary 0/1 columns are stored this way). The number after `float` (`64` in this case) refers to the number of bits used to represent each number in the computer's memory. With 64 bits you can store more significant digits than you could with, for example, 32 bits, so the results of computations can be more precise. 
But for the topics discussed in this course, this is not very important. Other common data types that you might encounter are integers (`int`) such as 1, 3, 5, etc., or strings (`str`) such as `'hello'`, `'world'`, etc.\n", "\n", "Let's dig deeper into the dataset and see some summary statistics" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:43.968537Z", "iopub.status.busy": "2024-05-31T21:41:43.968304Z", "iopub.status.idle": "2024-05-31T21:41:44.273250Z", "shell.execute_reply": "2024-05-31T21:41:44.272566Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " count mean std min \\\n", "distance_from_home 1000000.0 26.628792 65.390784 0.004874 \n", "distance_from_last_transaction 1000000.0 5.036519 25.843093 0.000118 \n", "ratio_to_median_purchase_price 1000000.0 1.824182 2.799589 0.004399 \n", "repeat_retailer 1000000.0 0.881536 0.323157 0.000000 \n", "used_chip 1000000.0 0.350399 0.477095 0.000000 \n", "used_pin_number 1000000.0 0.100608 0.300809 0.000000 \n", "online_order 1000000.0 0.650552 0.476796 0.000000 \n", "fraud 1000000.0 0.087403 0.282425 0.000000 \n", "\n", " 25% 50% 75% max \n", "distance_from_home 3.878008 9.967760 25.743985 10632.723672 \n", "distance_from_last_transaction 0.296671 0.998650 3.355748 11851.104565 \n", "ratio_to_median_purchase_price 0.475673 0.997717 2.096370 267.802942 \n", "repeat_retailer 1.000000 1.000000 1.000000 1.000000 \n", "used_chip 0.000000 0.000000 1.000000 1.000000 \n", "used_pin_number 0.000000 0.000000 0.000000 1.000000 \n", "online_order 0.000000 1.000000 1.000000 1.000000 \n", "fraud 0.000000 0.000000 0.000000 1.000000 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the `describe()` method we can see the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values of each variable in the dataset.\n", "\n", "\n", "#### Checking for Missing Values and Duplicated Rows {-}\n", "\n", "It is also important to check for missing values and duplicated rows in the dataset. Missing values can be problematic for machine learning models, as they might not be able to handle them. Duplicated rows can also be problematic, as they might introduce bias in the model.\n", "\n", "We can check for missing values (NA) that are encoded as None or `numpy.NaN` (Not a Number) with the `isna()` method. 
This method returns a boolean DataFrame (i.e., a DataFrame with `True` and `False` values) with the same shape as the original DataFrame, where `True` values indicate missing values." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:44.276571Z", "iopub.status.busy": "2024-05-31T21:41:44.276326Z", "iopub.status.idle": "2024-05-31T21:41:44.292844Z", "shell.execute_reply": "2024-05-31T21:41:44.291889Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " distance_from_home distance_from_last_transaction \\\n", "0 False False \n", "1 False False \n", "2 False False \n", "3 False False \n", "4 False False \n", "... ... ... \n", "999995 False False \n", "999996 False False \n", "999997 False False \n", "999998 False False \n", "999999 False False \n", "\n", " ratio_to_median_purchase_price repeat_retailer used_chip \\\n", "0 False False False \n", "1 False False False \n", "2 False False False \n", "3 False False False \n", "4 False False False \n", "... ... ... ... \n", "999995 False False False \n", "999996 False False False \n", "999997 False False False \n", "999998 False False False \n", "999999 False False False \n", "\n", " used_pin_number online_order fraud \n", "0 False False False \n", "1 False False False \n", "2 False False False \n", "3 False False False \n", "4 False False False \n", "... ... ... ... \n", "999995 False False False \n", "999996 False False False \n", "999997 False False False \n", "999998 False False False \n", "999999 False False False \n", "\n", "[1000000 rows x 8 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or to make it easier to see, we can sum the number of missing values for each variable" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:44.296013Z", "iopub.status.busy": "2024-05-31T21:41:44.295745Z", "iopub.status.idle": "2024-05-31T21:41:44.312444Z", "shell.execute_reply": "2024-05-31T21:41:44.311788Z" } }, "outputs": [ { "data": { "text/plain": [ "distance_from_home 0\n", "distance_from_last_transaction 0\n", "ratio_to_median_purchase_price 0\n", "repeat_retailer 0\n", "used_chip 0\n", "used_pin_number 0\n", "online_order 0\n", "fraud 0\n", "dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Luckily, there seem to be no missing values. However, you need to be careful! Sometimes missing values are encoded as empty strings `''` or `numpy.inf` (infinity), which are not considered missing values by the `isna()` method. If you suspect that this might be the case, you need to make additional checks.\n", "\n", "As an alternative, we could also look at the `info()` method, which provides a summary of the DataFrame, including the number of non-null values in each column. If there are missing values, the number of non-null values will be less than the number of rows in the dataset." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:44.315680Z", "iopub.status.busy": "2024-05-31T21:41:44.315410Z", "iopub.status.idle": "2024-05-31T21:41:44.341337Z", "shell.execute_reply": "2024-05-31T21:41:44.340575Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 1000000 entries, 0 to 999999\n", "Data columns (total 8 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 distance_from_home 1000000 non-null float64\n", " 1 distance_from_last_transaction 1000000 non-null float64\n", " 2 ratio_to_median_purchase_price 1000000 non-null float64\n", " 3 repeat_retailer 1000000 non-null float64\n", " 4 used_chip 1000000 non-null float64\n", " 5 used_pin_number 1000000 non-null float64\n", " 6 online_order 1000000 non-null float64\n", " 7 fraud 1000000 non-null float64\n", "dtypes: float64(8)\n", "memory usage: 61.0 MB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also check for duplicated rows with the `duplicated()` method. 
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:44.344806Z", "iopub.status.busy": "2024-05-31T21:41:44.344474Z", "iopub.status.idle": "2024-05-31T21:41:44.755545Z", "shell.execute_reply": "2024-05-31T21:41:44.754828Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [distance_from_home, distance_from_last_transaction, ratio_to_median_purchase_price, repeat_retailer, used_chip, used_pin_number, online_order, fraud]\n", "Index: []" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df.duplicated()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Luckily, there are also no duplicated rows.\n", "\n", "\n", "#### Data Visualization {-}\n", "\n", "Let's continue with some data visualization. We can use the `matplotlib` library to create plots. We have already imported the library at the beginning of the notebook.\n", "\n", "Let's start by plotting the distribution of the target variable `fraud` which can only take values zero and one. We can type" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:44.758854Z", "iopub.status.busy": "2024-05-31T21:41:44.758591Z", "iopub.status.idle": "2024-05-31T21:41:44.775173Z", "shell.execute_reply": "2024-05-31T21:41:44.774500Z" } }, "outputs": [ { "data": { "text/plain": [ "fraud\n", "0.0 912597\n", "1.0 87403\n", "Name: count, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['fraud'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "to get the count of each value. 
We can also use the `normalize=True` argument to get the fraction of observations instead of the count" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:44.778957Z", "iopub.status.busy": "2024-05-31T21:41:44.778650Z", "iopub.status.idle": "2024-05-31T21:41:44.790903Z", "shell.execute_reply": "2024-05-31T21:41:44.790082Z" } }, "outputs": [ { "data": { "text/plain": [ "fraud\n", "0.0 0.912597\n", "1.0 0.087403\n", "Name: proportion, dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['fraud'].value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then plot it as follows" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:44.794842Z", "iopub.status.busy": "2024-05-31T21:41:44.794563Z", "iopub.status.idle": "2024-05-31T21:41:44.988475Z", "shell.execute_reply": "2024-05-31T21:41:44.987957Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df['fraud'].value_counts(normalize=True).plot(kind='bar')\n", "plt.xlabel('Fraud')\n", "plt.ylabel('Fraction of Observations')\n", "plt.title('Distribution of Fraud')\n", "ax = plt.gca()\n", "ax.set_ylim([0.0, 1.0])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can plot it as a pie chart" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:44.992370Z", "iopub.status.busy": "2024-05-31T21:41:44.992157Z", "iopub.status.idle": "2024-05-31T21:41:45.098729Z", "shell.execute_reply": "2024-05-31T21:41:45.097631Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.value_counts(\"fraud\").plot.pie(autopct = \"%.1f\")\n", "plt.ylabel('')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our dataset seems to be quite imbalanced, as only 8.7% of the transactions are fraudulent. This is a common problem in fraud detection datasets, as fraudulent transactions are usually very rare. We will need to **keep this in mind** when evaluating our machine learning model: the accuracy measure will be very high even for bad models, as the model can just predict that all transactions are not fraudulent and still get an accuracy of 91.3%.\n", "\n", "Let's look at some distributions. Most of the variables in the dataset are binary (0 or 1) variables. However, we also have some continuous variables. Let's plot the distribution of the variable `ratio_to_median_purchase_price`, which is a continuous variable." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:45.102989Z", "iopub.status.busy": "2024-05-31T21:41:45.102739Z", "iopub.status.idle": "2024-05-31T21:41:45.302505Z", "shell.execute_reply": "2024-05-31T21:41:45.301894Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df['ratio_to_median_purchase_price'].hist(bins = 50, range=[0, 30])\n", "plt.xlabel('Ratio to Median Purchase Price')\n", "plt.ylabel('Count')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also plot the distribution of the variable `ratio_to_median_purchase_price` by the target variable `fraud` to see if there are any differences between fraudulent and non-fraudulent transactions" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:45.305505Z", "iopub.status.busy": "2024-05-31T21:41:45.305276Z", "iopub.status.idle": "2024-05-31T21:41:45.747771Z", "shell.execute_reply": "2024-05-31T21:41:45.746974Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(1,2)\n", "df['ratio_to_median_purchase_price'].hist(bins = 50, range=[0, 30], by=df['fraud'], ax = ax)\n", "ax[0].set_xlabel('Ratio to Median Purchase Price')\n", "ax[1].set_xlabel('Ratio to Median Purchase Price')\n", "ax[0].set_ylabel('Count')\n", "ax[0].set_title('No Fraud')\n", "ax[1].set_title('Fraud')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are indeed some differences between fraudulent and non-fraudulent transactions. For example, fraudulent transactions seem to have a higher ratio to the median purchase price, which is expected as fraudsters might try to make large transactions to maximize their profit.\n", "\n", "We can also look at the correlation between the variables in the dataset. The correlation is a measure of how two variables move together" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:45.751131Z", "iopub.status.busy": "2024-05-31T21:41:45.750873Z", "iopub.status.idle": "2024-05-31T21:41:45.933494Z", "shell.execute_reply": "2024-05-31T21:41:45.932881Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " distance_from_home \\\n", "distance_from_home 1.000000 \n", "distance_from_last_transaction 0.000193 \n", "ratio_to_median_purchase_price -0.001374 \n", "repeat_retailer 0.143124 \n", "used_chip -0.000697 \n", "used_pin_number -0.001622 \n", "online_order -0.001301 \n", "fraud 0.187571 \n", "\n", " distance_from_last_transaction \\\n", "distance_from_home 0.000193 \n", "distance_from_last_transaction 1.000000 \n", "ratio_to_median_purchase_price 0.001013 \n", "repeat_retailer -0.000928 \n", "used_chip 0.002055 \n", "used_pin_number -0.000899 \n", "online_order 0.000141 \n", "fraud 0.091917 \n", "\n", " ratio_to_median_purchase_price \\\n", "distance_from_home -0.001374 \n", "distance_from_last_transaction 0.001013 \n", "ratio_to_median_purchase_price 1.000000 \n", "repeat_retailer 0.001374 \n", "used_chip 0.000587 \n", "used_pin_number 0.000942 \n", "online_order -0.000330 \n", "fraud 0.462305 \n", "\n", " repeat_retailer used_chip used_pin_number \\\n", "distance_from_home 0.143124 -0.000697 -0.001622 \n", "distance_from_last_transaction -0.000928 0.002055 -0.000899 \n", "ratio_to_median_purchase_price 0.001374 0.000587 0.000942 \n", "repeat_retailer 1.000000 -0.001345 -0.000417 \n", "used_chip -0.001345 1.000000 -0.001393 \n", "used_pin_number -0.000417 -0.001393 1.000000 \n", "online_order -0.000532 -0.000219 -0.000291 \n", "fraud -0.001357 -0.060975 -0.100293 \n", "\n", " online_order fraud \n", "distance_from_home -0.001301 0.187571 \n", "distance_from_last_transaction 0.000141 0.091917 \n", "ratio_to_median_purchase_price -0.000330 0.462305 \n", "repeat_retailer -0.000532 -0.001357 \n", "used_chip -0.000219 -0.060975 \n", "used_pin_number -0.000291 -0.100293 \n", "online_order 1.000000 0.191973 \n", "fraud 0.191973 1.000000 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.corr() # Pearson correlation (for linear relationships)" ] }, { "cell_type": "code", "execution_count": 21, 
"metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:45.936547Z", "iopub.status.busy": "2024-05-31T21:41:45.936299Z", "iopub.status.idle": "2024-05-31T21:41:46.996272Z", "shell.execute_reply": "2024-05-31T21:41:46.995477Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>distance_from_home</th><th>distance_from_last_transaction</th><th>ratio_to_median_purchase_price</th><th>repeat_retailer</th><th>used_chip</th><th>used_pin_number</th><th>online_order</th><th>fraud</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>distance_from_home</th><td>1.000000</td><td>-0.001068</td><td>-0.000152</td><td>0.559724</td><td>-0.000118</td><td>-0.000338</td><td>-0.001812</td><td>0.095032</td></tr>\n",
"    <tr><th>distance_from_last_transaction</th><td>-0.001068</td><td>1.000000</td><td>-0.000111</td><td>-0.001352</td><td>-0.000165</td><td>0.000555</td><td>-0.001076</td><td>0.034661</td></tr>\n",
"    <tr><th>ratio_to_median_purchase_price</th><td>-0.000152</td><td>-0.000111</td><td>1.000000</td><td>0.001202</td><td>-0.000099</td><td>0.000251</td><td>-0.000376</td><td>0.342838</td></tr>\n",
"    <tr><th>repeat_retailer</th><td>0.559724</td><td>-0.001352</td><td>0.001202</td><td>1.000000</td><td>-0.001345</td><td>-0.000417</td><td>-0.000532</td><td>-0.001357</td></tr>\n",
"    <tr><th>used_chip</th><td>-0.000118</td><td>-0.000165</td><td>-0.000099</td><td>-0.001345</td><td>1.000000</td><td>-0.001393</td><td>-0.000219</td><td>-0.060975</td></tr>\n",
"    <tr><th>used_pin_number</th><td>-0.000338</td><td>0.000555</td><td>0.000251</td><td>-0.000417</td><td>-0.001393</td><td>1.000000</td><td>-0.000291</td><td>-0.100293</td></tr>\n",
"    <tr><th>online_order</th><td>-0.001812</td><td>-0.001076</td><td>-0.000376</td><td>-0.000532</td><td>-0.000219</td><td>-0.000291</td><td>1.000000</td><td>0.191973</td></tr>\n",
"    <tr><th>fraud</th><td>0.095032</td><td>0.034661</td><td>0.342838</td><td>-0.001357</td><td>-0.060975</td><td>-0.100293</td><td>0.191973</td><td>1.000000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " distance_from_home \\\n", "distance_from_home 1.000000 \n", "distance_from_last_transaction -0.001068 \n", "ratio_to_median_purchase_price -0.000152 \n", "repeat_retailer 0.559724 \n", "used_chip -0.000118 \n", "used_pin_number -0.000338 \n", "online_order -0.001812 \n", "fraud 0.095032 \n", "\n", " distance_from_last_transaction \\\n", "distance_from_home -0.001068 \n", "distance_from_last_transaction 1.000000 \n", "ratio_to_median_purchase_price -0.000111 \n", "repeat_retailer -0.001352 \n", "used_chip -0.000165 \n", "used_pin_number 0.000555 \n", "online_order -0.001076 \n", "fraud 0.034661 \n", "\n", " ratio_to_median_purchase_price \\\n", "distance_from_home -0.000152 \n", "distance_from_last_transaction -0.000111 \n", "ratio_to_median_purchase_price 1.000000 \n", "repeat_retailer 0.001202 \n", "used_chip -0.000099 \n", "used_pin_number 0.000251 \n", "online_order -0.000376 \n", "fraud 0.342838 \n", "\n", " repeat_retailer used_chip used_pin_number \\\n", "distance_from_home 0.559724 -0.000118 -0.000338 \n", "distance_from_last_transaction -0.001352 -0.000165 0.000555 \n", "ratio_to_median_purchase_price 0.001202 -0.000099 0.000251 \n", "repeat_retailer 1.000000 -0.001345 -0.000417 \n", "used_chip -0.001345 1.000000 -0.001393 \n", "used_pin_number -0.000417 -0.001393 1.000000 \n", "online_order -0.000532 -0.000219 -0.000291 \n", "fraud -0.001357 -0.060975 -0.100293 \n", "\n", " online_order fraud \n", "distance_from_home -0.001812 0.095032 \n", "distance_from_last_transaction -0.001076 0.034661 \n", "ratio_to_median_purchase_price -0.000376 0.342838 \n", "repeat_retailer -0.000532 -0.001357 \n", "used_chip -0.000219 -0.060975 \n", "used_pin_number -0.000291 -0.100293 \n", "online_order 1.000000 0.191973 \n", "fraud 0.191973 1.000000 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.corr('spearman') # Spearman correlation (for monotonic relationships)" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "This is still a bit hard to read. We can visualize the correlation matrix with a heatmap using the Seaborn library, which we have already imported at the beginning of the notebook." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:47.000105Z", "iopub.status.busy": "2024-05-31T21:41:46.999804Z", "iopub.status.idle": "2024-05-31T21:41:48.139298Z", "shell.execute_reply": "2024-05-31T21:41:48.138743Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "corr = df.corr('spearman')\n", "cmap = sns.diverging_palette(10, 255, as_cmap=True) # Create a color map\n", "mask = np.triu(np.ones_like(corr, dtype=bool)) # Create a mask to only show the lower triangle of the matrix\n", "sns.heatmap(corr, cmap=cmap, vmax=1, center=0, mask=mask) # Create a heatmap of the correlation matrix (Note: vmax=1 makes sure that the color map goes up to 1 and center=0 are used to center the color map at 0)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how `ratio_to_median_purchase_price` is positively correlated with `fraud`, which is expected as we saw in the previous plot that fraudulent transactions have a higher ratio to the median purchase price. Furthermore, `used_chip` and `used_pin_number` are negatively correlated with `fraud`, which makes sense as transactions, where the chip or the pin is used, are supposed to be more secure.\n", "\n", "We can also plot boxplots to visualize the distribution of the variables" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:48.142332Z", "iopub.status.busy": "2024-05-31T21:41:48.142081Z", "iopub.status.idle": "2024-05-31T21:41:48.841655Z", "shell.execute_reply": "2024-05-31T21:41:48.840495Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "selector = ['distance_from_home', 'distance_from_last_transaction', 'ratio_to_median_purchase_price'] # Select the variables we want to plot\n", "plt.figure()\n", "ax = sns.boxplot(data = df[selector], orient = 'h') \n", "ax.set(xscale = \"log\") # Set the x-axis to a logarithmic scale to better visualize the data\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Boxplots are a good way to visualize the distribution of a variable, as they show the median, the interquartile range, and the outliers. Each of the distributions shown in the boxplots above has a long right tail, which explains the large number of outliers. However, you have to be careful: you cannot just remove these outliers since these are likely to be fraudulent transactions.\n", "\n", "Let's see how many fraudulent transactions we would remove if we blindly remove the outliers according to the interquartile range" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:48.845243Z", "iopub.status.busy": "2024-05-31T21:41:48.844878Z", "iopub.status.idle": "2024-05-31T21:41:48.897660Z", "shell.execute_reply": "2024-05-31T21:41:48.896990Z" } }, "outputs": [ { "data": { "text/plain": [ "fraud\n", "1.0 53092\n", "0.0 31294\n", "Name: count, dtype: int64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compute the interquartile range\n", "Q1 = df['ratio_to_median_purchase_price'].quantile(0.25)\n", "Q3 = df['ratio_to_median_purchase_price'].quantile(0.75)\n", "IQR = Q3 - Q1\n", "\n", "# Identify outliers based on the interquartile range\n", "threshold = 1.5\n", "outliers = df[(df['ratio_to_median_purchase_price'] < Q1 - threshold * IQR) | (df['ratio_to_median_purchase_price'] > Q3 + threshold * IQR)]\n", "\n", "# Count the number of fraudulent transactions amoung our selected outliers\n", 
"outliers['fraud'].value_counts()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:48.900879Z", "iopub.status.busy": "2024-05-31T21:41:48.900613Z", "iopub.status.idle": "2024-05-31T21:41:48.912427Z", "shell.execute_reply": "2024-05-31T21:41:48.911733Z" } }, "outputs": [ { "data": { "text/plain": [ "fraud\n", "0.0 912597\n", "1.0 87403\n", "Name: count, dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['fraud'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "53092 of 87403 (more than half!) of our fraudulent transactions would be removed if we would have blindly removed the outliers according to the interquartile range. This is a significant number of observations, which would likely hurt the performance of our machine-learning model. Therefore, we should not remove these outliers. It would make the imbalance of our dataset even worse.\n", "\n", "\n", "#### Splitting the Data into Training and Test Sets {-}\n", "\n", "Before we can train a machine learning model, we need to split our dataset into a training set and a test set. " ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:48.915777Z", "iopub.status.busy": "2024-05-31T21:41:48.915196Z", "iopub.status.idle": "2024-05-31T21:41:48.928414Z", "shell.execute_reply": "2024-05-31T21:41:48.927866Z" } }, "outputs": [], "source": [ "X = df.drop('fraud', axis=1) # All variables except `fraud`\n", "y = df['fraud'] # Only our fraud variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training set is used to train the model, while the test set is used to evaluate the model. We will use the `train_test_split` function from the `sklearn.model_selection` module to split our dataset. We will use 70% of the data for training and 30% for testing. 
We will also set the `stratify` argument to `y` to make sure that the distribution of the target variable is the same in the training and test sets. Otherwise, the share of fraudulent transactions could differ between the two sets purely by chance, which would distort the evaluation of our model." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:48.931420Z", "iopub.status.busy": "2024-05-31T21:41:48.931186Z", "iopub.status.idle": "2024-05-31T21:41:49.298773Z", "shell.execute_reply": "2024-05-31T21:41:49.297427Z" } }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size = 0.3, random_state = 42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Scaling Features {-}\n", "\n", "To improve the performance of our machine learning model, we should scale the features. This is especially important for models that are sensitive to the scale of the features, such as logistic regression with regularization (the default in scikit-learn). We will use the `StandardScaler` class from the `sklearn.preprocessing` module to scale the features. The `StandardScaler` class scales the features so that they have a mean of 0 and a standard deviation of 1. 
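To make this concrete, here is a minimal sketch of the standardization on made-up numbers, using only the standard library. The function `standardize` and the list `toy_values` are hypothetical names for illustration; the sketch assumes the population standard deviation, which is the convention `StandardScaler` follows.

```python
from statistics import mean, pstdev

# Toy feature column (made-up values, not taken from the fraud dataset)
toy_values = [1.0, 2.0, 3.0, 4.0, 10.0]

def standardize(values):
    # Subtract the mean and divide by the population standard deviation
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

scaled = standardize(toy_values)

# After scaling, the mean is (numerically) 0 and the standard deviation is 1
print(round(mean(scaled), 6), round(pstdev(scaled), 6))
```

Note that the long right tail survives the scaling: the largest toy value is still far from the others, only expressed in units of standard deviations.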
Since we don't want to scale features that are binary (0 or 1), we will define a small function that scales only the features that we want" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:49.303388Z", "iopub.status.busy": "2024-05-31T21:41:49.302883Z", "iopub.status.idle": "2024-05-31T21:41:49.307848Z", "shell.execute_reply": "2024-05-31T21:41:49.307075Z" } }, "outputs": [], "source": [ "def scale_features(scaler, df, col_names, only_transform=False):\n", "\n", "    # Extract the features we want to scale\n", "    features = df[col_names]\n", "\n", "    # Transform the features, fitting the scaler first unless only_transform is set\n", "    if only_transform:\n", "        features = scaler.transform(features.values)\n", "    else:\n", "        features = scaler.fit_transform(features.values)\n", "\n", "    # Replace the original features with the scaled features\n", "    df[col_names] = features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we need to run the function" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:49.311533Z", "iopub.status.busy": "2024-05-31T21:41:49.311238Z", "iopub.status.idle": "2024-05-31T21:41:49.369827Z", "shell.execute_reply": "2024-05-31T21:41:49.368806Z" } }, "outputs": [], "source": [ "col_names = ['distance_from_home', 'distance_from_last_transaction', 'ratio_to_median_purchase_price'] \n", "scaler = StandardScaler() \n", "scale_features(scaler, X_train, col_names)\n", "scale_features(scaler, X_test, col_names, only_transform=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we only fit the scaler to the training set and then transform both the training and test set. This ensures that the same values for the features produce the same output in the training and test set. 
Otherwise, if we fit the scaler to the test data as well, the meaning of certain values in the test set might change, which would make it impossible to evaluate the model correctly.\n", "\n", ":::{.callout-note}\n", "### Mini-Exercise\n", "Try switching to `MinMaxScaler` instead of `StandardScaler` and see how it affects the performance of the model. `MinMaxScaler` scales the features so that they are between 0 and 1.\n", ":::\n", "\n", "\n", "### Implementing Logistic Regression\n", "\n", "Now that we have explored and preprocessed our dataset, we can move on to the next step: training a machine learning model. We will use a logistic regression model to predict whether a transaction is fraudulent or not.\n", "\n", "Using the `LogisticRegression` class from the `sklearn.linear_model` module, fitting the model to the data is straightforward using the `fit` method" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:49.373677Z", "iopub.status.busy": "2024-05-31T21:41:49.373372Z", "iopub.status.idle": "2024-05-31T21:41:50.436153Z", "shell.execute_reply": "2024-05-31T21:41:50.435419Z" } }, "outputs": [], "source": [ "clf = LogisticRegression().fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then use the `predict` method to predict the class of the test set" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:50.439618Z", "iopub.status.busy": "2024-05-31T21:41:50.439377Z", "iopub.status.idle": "2024-05-31T21:41:50.445848Z", "shell.execute_reply": "2024-05-31T21:41:50.445106Z" } }, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 0., 0., 1.])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.predict(X_test.head(5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The actual classes of the first five observations in the test 
dataset are" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:50.448995Z", "iopub.status.busy": "2024-05-31T21:41:50.448754Z", "iopub.status.idle": "2024-05-31T21:41:50.453822Z", "shell.execute_reply": "2024-05-31T21:41:50.453110Z" } }, "outputs": [ { "data": { "text/plain": [ "217309 0.0\n", "902387 0.0\n", "175152 0.0\n", "527113 0.0\n", "973041 1.0\n", "Name: fraud, dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This seems to match quite well. Let's have a look at different performance metrics" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:50.457063Z", "iopub.status.busy": "2024-05-31T21:41:50.456840Z", "iopub.status.idle": "2024-05-31T21:41:50.895730Z", "shell.execute_reply": "2024-05-31T21:41:50.895063Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.95908\n", "Precision: 0.8954682094038908\n", "Recall: 0.6021128103428549\n", "ROC AUC: 0.9671832218100465\n" ] } ], "source": [ "y_pred = clf.predict(X_test)\n", "y_proba = clf.predict_proba(X_test)\n", "\n", "print(f\"Accuracy: {accuracy_score(y_test, y_pred)}\")\n", "print(f\"Precision: {precision_score(y_test, y_pred)}\")\n", "print(f\"Recall: {recall_score(y_test, y_pred)}\")\n", "print(f\"ROC AUC: {roc_auc_score(y_test, y_proba[:, 1])}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, the accuracy is quite high since we do not have many fraudulent transactions: a model that always predicts no fraud would already reach an accuracy of about 91%. 
Recall that the precision ($\\text{Precision} = \\frac{\\text{TP}}{\\text{TP}+\\text{FP}}$) is the fraction of correctly predicted fraudulent transactions among all transactions predicted to be fraudulent. 
The recall ($\\text{Recall} = \\frac{\\text{TP}}{\\text{TP}+\\text{FN}}$) is the fraction of correctly predicted fraudulent transactions among the actual fraudulent transactions. The ROC AUC is the area under the curve for the receiver operating characteristic (ROC) curve" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:50.899220Z", "iopub.status.busy": "2024-05-31T21:41:50.898944Z", "iopub.status.idle": "2024-05-31T21:41:51.177719Z", "shell.execute_reply": "2024-05-31T21:41:51.177134Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Compute the ROC curve\n", "y_proba = clf.predict_proba(X_test)\n", "fpr, tpr, thresholds = roc_curve(y_test, y_proba[:,1])\n", "\n", "# Plot the ROC curve\n", "plt.plot(fpr, tpr)\n", "plt.plot([0, 1], [0, 1], linestyle='--', color='grey')\n", "plt.xlabel('False Positive Rate (FPR)')\n", "plt.ylabel('True Positive Rate (TPR)')\n", "plt.title('ROC Curve')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The confusion matrix for the test set can be computed as follows" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:51.180496Z", "iopub.status.busy": "2024-05-31T21:41:51.180281Z", "iopub.status.idle": "2024-05-31T21:41:51.377066Z", "shell.execute_reply": "2024-05-31T21:41:51.376466Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[ 15788, 1843],\n", " [ 10433, 271936]])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conf_mat = confusion_matrix(y_test, y_pred, labels=[1, 0]).transpose() # Transpose the sklearn confusion matrix to match the convention in the lecture\n", "conf_mat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also plot the confusion matrix as a heatmap" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:51.380150Z", "iopub.status.busy": "2024-05-31T21:41:51.379899Z", "iopub.status.idle": "2024-05-31T21:41:51.509694Z", "shell.execute_reply": "2024-05-31T21:41:51.508995Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.heatmap(conf_mat, annot=True, cmap='Blues', fmt='g', xticklabels=['Fraud', 'No Fraud'], yticklabels=['Fraud', 'No Fraud'])\n", "plt.xlabel(\"Actual\")\n", "plt.ylabel(\"Predicted\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, we have mostly true negatives and true positives. However, there is still a significant number of false negatives, which means that we are missing fraudulent transactions, and a significant number of false positives, which means that we are predicting transactions as fraudulent that are not fraudulent.\n", "\n", "If we would like to use a threshold other than 0.5 to predict the class of the test set, we can do so as follows" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:51.512829Z", "iopub.status.busy": "2024-05-31T21:41:51.512625Z", "iopub.status.idle": "2024-05-31T21:41:51.819314Z", "shell.execute_reply": "2024-05-31T21:41:51.818695Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.9112033333333334\n", "Precision: 0.49579121188932296\n", "Recall: 0.9389420693337401\n" ] } ], "source": [ "# Alternative threshold\n", "threshold = 0.1\n", "\n", "# Predict the class of the test set\n", "y_pred_alt = (y_proba[:, 1] >= threshold).astype(int)\n", "\n", "# Show the performance metrics\n", "print(f\"Accuracy: {accuracy_score(y_test, y_pred_alt)}\")\n", "print(f\"Precision: {precision_score(y_test, y_pred_alt)}\")\n", "print(f\"Recall: {recall_score(y_test, y_pred_alt)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setting a lower threshold increases the recall but decreases the precision. 
This is because we are more likely to predict a transaction as fraudulent, which increases the number of true positives but also the number of false positives.\n", "\n", "What the correct threshold is depends on the problem at hand. For example, if the cost of missing a fraudulent transaction is very high, you might want to set a lower threshold to increase the recall. If the cost of falsely predicting a transaction as fraudulent is very high, you might want to set a higher threshold to increase the precision.\n", "\n", "We can also plot the performance metrics for different thresholds" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "execution": { "iopub.execute_input": "2024-05-31T21:41:51.822493Z", "iopub.status.busy": "2024-05-31T21:41:51.822238Z", "iopub.status.idle": "2024-05-31T21:42:06.668768Z", "shell.execute_reply": "2024-05-31T21:42:06.667656Z" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "N = 50\n", "thresholds_array = np.linspace(0.0, 0.999, N)\n", "accuracy_array = np.zeros(N)\n", "precision_array = np.zeros(N)\n", "recall_array = np.zeros(N)\n", "\n", "# Compute the performance metrics for different thresholds\n", "for ii, thresh in enumerate(thresholds_array):\n", " y_pred_alt_tmp = (y_proba[:, 1] > thresh).astype(int)\n", " accuracy_array[ii] = accuracy_score(y_test, y_pred_alt_tmp)\n", " precision_array[ii] = precision_score(y_test, y_pred_alt_tmp)\n", " recall_array[ii] = recall_score(y_test, y_pred_alt_tmp)\n", "\n", "# Plot the performance metrics\n", "plt.plot(thresholds_array, accuracy_array, label='Accuracy')\n", "plt.plot(thresholds_array, precision_array, label='Precision')\n", "plt.plot(thresholds_array, recall_array, label='Recall')\n", "plt.xlabel('Threshold')\n", "plt.ylabel('Score')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusions\n", "\n", "In this notebook, we have seen how to implement a logistic regression model in Python. We have loaded a dataset, explored and preprocessed it, and trained a logistic regression model to predict whether a transaction is fraudulent or not. We have evaluated the model using different performance metrics and have seen how the choice of threshold affects the performance of the model.\n", "\n", "There are many ways to improve the performance of the model. For example, we could try different machine learning models, or engineer new features. We could also try to deal with the imbalanced dataset by using techniques such as oversampling or undersampling. 
However, this is beyond the scope of this notebook.\n", "\n", "\n", "" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.19" } }, "nbformat": 4, "nbformat_minor": 4 }