In collaboration with Shifaz Ali and Clayton Bond, both students at the University of Waikato, July 2023.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

Introduction

The dataset contains 699 observations (683 after rows with missing values are removed) with 9 predictors for classifying outcomes as malignant or benign. A kNN (k-nearest neighbours) classifier is built to predict breast cancer.

Result

A kNN with a k-value of 3 yields the highest test accuracy of 98.5% when the classifier is trained on a simple 80/20 train/test split.

Methodology

  • A simple kNN classifier is created to predict breast cancer
  • A pairplot is used to explore the data and confirm that there is sufficient boundary separation between the classes
  • 80% of the data is set aside for training and 20% for testing
  • A confusion matrix and a classification report are generated to further analyse the fitted model
  • Multiple models are trained using k-values ranging from 1 to 100, and their performance is plotted

Load data

Wisconsin Breast Cancer Database

Data obtained from https://www.kaggle.com/datasets/mariolisboa/breast-cancer-wisconsin-original-data-set

  • Columns 2 through 10 in the dataset are the measurements taken
  • The 'class' column holds the label: 0 (benign) or 1 (malignant), recoded from the original values 2 and 4
In [2]:
# Define id for generating random state
ID = 1586140

# Load in the file
df = pd.read_csv('https://raw.githubusercontent.com/cbondnz/data_analytics_files_2023/main/breast_cancer_bd.csv')

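# Remove rows containing the missing-value marker '?'
df = df[(df != '?').all(axis=1)].copy()

# Columns that contained '?' were read as strings; convert them back to numbers
df = df.apply(pd.to_numeric)

# Convert class values 2 -> 0, and 4 -> 1
df['class'] = df['class'].replace({2: 0, 4: 1})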
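
As a minimal sketch of the cleaning steps above, the snippet below applies the same '?' filter and class recoding to a tiny made-up table. The column names `id` and `bare_nuclei` and all of the values are assumptions for illustration and are not taken from the real file. The `pd.to_numeric` cast is an extra step shown here because a column that held '?' is read as text.

```python
import pandas as pd

# Toy stand-in for the real file; column names and values are hypothetical.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "bare_nuclei": ["1", "?", "10", "3"],
    "class": [2, 4, 4, 2],
})

# Keep only rows where no cell equals the '?' placeholder.
clean = toy[(toy != "?").all(axis=1)].copy()

# The column that held '?' was read as strings, so cast it back to numbers.
clean["bare_nuclei"] = pd.to_numeric(clean["bare_nuclei"])

# Recode the original labels: 2 (benign) -> 0, 4 (malignant) -> 1.
clean["class"] = clean["class"].replace({2: 0, 4: 1})

print(clean)
```

Only the row containing '?' is dropped, and the remaining labels become 0 and 1.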

Generate pairplot

Opacity is used to make overlapping markers more visible. The ID column is dropped as it adds no value to the analysis.

Benign cases (green dots) cluster in the low range, suggesting that lower measurement values are associated with benign outcomes. A boundary can be drawn that largely separates the two classes.

In [3]:
sns.pairplot(data=df.iloc[:,1:], hue='class', palette='Set2', kind='scatter', plot_kws={'alpha':0.3}, height=1.2, aspect=1.2)
Out[3]:
<seaborn.axisgrid.PairGrid at 0x7af6e8d5d4b0>

Split data - 80% train and 20% test

Stratifying means that the training and test sets keep the same class proportions as the full dataset. The student ID is used as the random seed.

In [4]:
# Start after the ID Column
X = df.iloc[:, 1:-1]
y = df.iloc[:, -1]

# test_size: 20% for test set, 80% for training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=ID, stratify=y)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(546, 9) (546,)
(137, 9) (137,)
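
To illustrate what stratification does, here is a small sketch on synthetic labels. The 65/35 class balance, the variable names, and the seed of 0 are assumptions chosen for the example and are unrelated to the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 65 samples of class 0 and 35 of class 1 (illustrative only).
y_demo = np.array([0] * 65 + [1] * 35)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same call pattern as the notebook, with stratify set to the labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The share of class 1 should be the same in every subset.
print("full :", y_demo.mean())
print("train:", y_tr.mean())
print("test :", y_te.mean())
```

With `stratify` set, the share of class 1 stays at 35% in both subsets. Without it, the share can drift from one random split to the next.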

Train kNN classifier and predict

In [5]:
# Initialize with the default n_neighbors=5 and fit on the training set
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Showing some values of y_test compared to the predicted y_pred
compare = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
compare
Out[5]:
Actual Predicted
334 1 1
608 1 1
488 1 1
674 0 0
468 0 0
... ... ...
48 0 0
453 1 1
195 0 0
564 0 0
133 0 0

137 rows × 2 columns
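
For intuition about what the classifier does when it predicts, the following is a simplified from-scratch sketch of kNN using Euclidean distance and a majority vote. The function name `knn_predict` and the sample points are hypothetical, and this is not how scikit-learn implements it internally.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the labels among those neighbours and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up two-feature points: low values labelled 0, high values labelled 1.
train_X = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (10, 10)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (2, 2), k=3))
print(knn_predict(train_X, train_y, (9, 9), k=3))
```

A query near the low-valued cluster is assigned class 0 and one near the high-valued cluster is assigned class 1, mirroring the pattern seen in the pairplot.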

Generate the Confusion Matrix and Classification Report

Confusion Matrix:

  • 88 patients were predicted to be benign and were actually benign
  • 1 patient was predicted to be malignant but was actually benign
  • 2 patients were predicted to be benign but were actually malignant
  • 46 patients were predicted to be malignant and were actually malignant

Therefore, 134 patients were accurately predicted to be benign/malignant, while 3 patients received an inaccurate prediction of their diagnosis.
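
The arithmetic behind these counts can be checked with a short sketch. It assumes scikit-learn's layout, where rows are the actual classes and columns are the predicted classes. The variable names are illustrative.

```python
# Confusion matrix as printed by the notebook:
# rows are actual classes, columns are predicted classes.
cm = [[88, 1],
      [2, 46]]

tn, fp = cm[0]  # actual benign: predicted benign, predicted malignant
fn, tp = cm[1]  # actual malignant: predicted benign, predicted malignant

correct = tn + tp
total = tn + fp + fn + tp
accuracy = correct / total

print(correct, total - correct, round(accuracy, 3))
```

This gives 134 correct and 3 incorrect predictions out of 137, an accuracy of about 0.978.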

Classification Report:

Precision column: of the samples predicted to be in a class, the proportion that actually belong to it

  • The kNN model has high precision for both classes: 98% of the cases predicted to be benign are actually benign, and 98% of the cases predicted to be malignant are actually malignant.

Recall column: of the samples that actually belong to a class, the proportion that were correctly identified

  • Recall is also high for both classes: 99% of all benign cases and 96% of all malignant cases are correctly identified.

High precision and high recall mean the model is performing well. This is reflected in the f1-score, the harmonic mean of each class's precision and recall, being close to 1.

The overall accuracy is 98% with n_neighbors at its default value of 5.
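
The per-class figures in the classification report can be reproduced by hand from the confusion matrix counts. The sketch below uses a hypothetical helper, `precision_recall_f1`, and computes the F1 score as the harmonic mean of precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its counts."""
    precision = tp / (tp + fp)  # share of predictions for this class that were right
    recall = tp / (tp + fn)     # share of actual members of this class that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Counts taken from the confusion matrix [[88, 1], [2, 46]].
# For the benign class, the 2 malignant cases predicted benign are false positives.
benign = precision_recall_f1(tp=88, fp=2, fn=1)
# For the malignant class, the 1 benign case predicted malignant is a false positive.
malignant = precision_recall_f1(tp=46, fp=1, fn=2)

print([round(v, 2) for v in benign])
print([round(v, 2) for v in malignant])
```

Rounded to two decimals, the results are 0.98, 0.99 and 0.98 for the benign class and 0.98, 0.96 and 0.97 for the malignant class, matching the report.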

In [6]:
# Generate the Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Generate the classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Confusion Matrix:
[[88  1]
 [ 2 46]]

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.99      0.98        89
           1       0.98      0.96      0.97        48

    accuracy                           0.98       137
   macro avg       0.98      0.97      0.98       137
weighted avg       0.98      0.98      0.98       137

Train kNN models using k-values ranging from 1 to 100

  • n_neighbors: determines the number of neighbours to consider when making a prediction
  • Different k-values from 1 to 100 are tried to find the one with the highest accuracy on the test set
In [7]:
# Generate k-values from 1 to 100
ks = list(range(1, 101))

# Fit one classifier per k and store its test-set accuracy in a list
acc = [accuracy_score(y_test, KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_test)) for k in ks]

# The best k is the index of the first maximum in acc, plus 1 (0-based index)
k_val = acc.index(max(acc)) + 1
print(acc)
print(f"The highest accuracy is {max(acc)}")
[0.9708029197080292, 0.9635036496350365, 0.9854014598540146, 0.9708029197080292, 0.9781021897810219, 0.9708029197080292, 0.9781021897810219, 0.9635036496350365, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9708029197080292, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365, 0.9635036496350365]
The highest accuracy is 0.9854014598540146
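
As a small sketch of how the best and next-best k-values can be read off the accuracy list, the snippet below uses only the first eight accuracies printed above, rounded to four decimals. The variable names are illustrative.

```python
# First eight accuracies from the output above, rounded to four decimals.
ks = list(range(1, 9))
acc = [0.9708, 0.9635, 0.9854, 0.9708, 0.9781, 0.9708, 0.9781, 0.9635]

# All k-values that reach the highest accuracy.
best_acc = max(acc)
best_ks = [k for k, a in zip(ks, acc) if a == best_acc]

# All k-values that reach the second-highest accuracy.
runner_up_acc = max(a for a in acc if a < best_acc)
runner_up_ks = [k for k, a in zip(ks, acc) if a == runner_up_acc]

print(best_ks, runner_up_ks)
```

Collecting every k that reaches a given score makes ties visible, whereas `acc.index(max(acc))` reports only the first maximum.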

Generate Plot for K vs Accuracy

From the list of accuracy scores, a k-value of 3 has the highest accuracy of 98.5%, as the plot below confirms. K-values of 5 and 7 have the next highest score of 97.8%.

Ideally, we would pick an odd k-value, which avoids ties in a two-class vote. A low k-value can result in overfitting, while a larger k-value tends to give a classifier that generalises better, although a very large k-value can underfit, as the lower accuracies towards k = 100 suggest.

In our situation, we could pick a k-value of 3, but this could lead to overfitting. Picking 5 or 7 reduces the accuracy to 97.8%. There are clear trade-offs to consider when choosing the k-value.
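
The point about odd k-values can be illustrated with a minimal vote counter. The helper `vote` is hypothetical and only detects a tie. It does not reproduce how scikit-learn resolves one.

```python
from collections import Counter

def vote(neighbour_labels):
    """Return (winning_label, is_tie) for a list of neighbour labels."""
    counts = Counter(neighbour_labels).most_common()
    top_label, top_count = counts[0]
    # A tie occurs when the second most common label has the same count.
    is_tie = len(counts) > 1 and counts[1][1] == top_count
    return top_label, is_tie

print(vote([0, 1, 1]))     # k = 3: one class must have the majority
print(vote([0, 0, 1, 1]))  # k = 4: the two classes can split evenly
```

With two classes, an odd number of votes can never split evenly, so the tie flag is only ever set for even k.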

In [8]:
plt.plot(ks, acc)
plt.grid()
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.title('K vs Accuracy')
plt.show()