Predicting Breast Cancer

Classifying breast cancer with a deep learning model built with the newly released Tensorflow 2.0.


Tensorflow 2.0 was recently released, and I’m really excited about it! When I first started learning about deep learning back in 2017, I was using Tensorflow 1.0. And it was a pain in the ass! Tensorflow was released as a research tool for deep learning, so it was a very low-level library. Simple programs ran to hundreds of lines of code, most of it boilerplate or data manipulation.

This is where Keras came in. If you weren’t a researcher developing some new type of architecture, then you could abstract away a lot of that boilerplate. With Keras, you could build your deep learning neural network in a few lines of code.

As a side note: Tensorflow was developed by Google, and Keras was created by François Chollet, an engineer at Google.

So what’s so great about Tensorflow 2.0?

The biggest thing for me is that Keras is now part of Tensorflow 2.0.

Think about this for a sec. You still have access to the low-level functionality (if you’re a researcher)… But if you’re looking to apply a well-known model architecture, which most people are, it’s now accomplished all within the same library.

This suggests a shift towards consolidation of ideas and standardization of processes. And when this happens, the technology becomes available to more people who aren’t necessarily domain experts in deep learning or artificial intelligence.

My First Tensorflow 2.0 Program

When learning a new software library, it helps to recreate something you’ve done before. After all, the underlying theory doesn’t change, only the tools do.

For this project I’m going to predict breast cancer using the Breast Cancer Wisconsin (Diagnostic) Data Set.

Creating a deep learning model pretty much boils down to the following steps:

  1. Load the data.
  2. Split the data into a training set and a test set.
  3. Build the model.
  4. Train the model on the training data.
  5. Evaluate the model with the test data.
  6. Make predictions.

I’ll be using Google Colab. It’s a Jupyter Notebook environment hosted by Google. Your programming environment is already set up for you, and you get FREE access to a GPU for training your model. This is huge! Training models is the most time-consuming part of all of this. Training on a GPU is faster, but you need access to a GPU (obviously). There are entire companies, like FloydHub, that exist just to give you access to a GPU for training.

Step 1: Load the Data

As of this writing, Google Colab still loads Tensorflow 1.x by default. So we need to tell it to load 2.0.

%tensorflow_version 2.x
import tensorflow as tf
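
It doesn’t hurt to confirm what Colab actually gave us. A quick check (in TF 2.0 the device-listing helper lives under tf.config.experimental):

print(tf.__version__)  # should print a 2.x version

# Any GPUs Colab has assigned us; an empty list means we're running on CPU only
print(tf.config.experimental.list_physical_devices('GPU'))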

Now we load our dataset. It can be downloaded from the UCI Machine Learning Repository, but it’s also included in Scikit Learn.

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

Let’s examine this data.

If we examine the data with type(data), we find that data is a sklearn.utils.Bunch object. This is sort of like a standard Python dictionary.
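
You can poke around in it much like a dictionary. A quick sketch:

print(data.keys())        # 'data', 'target', 'target_names', 'feature_names', ...
print(data.data.shape)    # (569, 30): 569 tumor samples, 30 measurements each
print(data.target_names)  # ['malignant' 'benign']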

Let’s look at what was actually measured with data.feature_names:

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

Step 2: Split the Data into a Training Set and a Test Set

Let’s split up our data. We’ll use 67% of the data as training data, and hold out the remaining 33% as test data.

from sklearn.model_selection import train_test_split

# Note: the split is random, so your exact numbers will differ slightly from mine
# unless you pass a fixed random_state
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.33)

Now we need to scale our data.

What’s the point of scaling our data? We need all of the numbers to be roughly the same magnitude. We may have a wide range of numbers in our data, some might be small like 0.0001, others might be big like 1,000,000. If we just feed these numbers into our model, the big numbers will dominate and all of the decisions will be biased towards them.

We’ll use Scikit Learn’s StandardScaler, which subtracts the mean (average) and divides by the standard deviation, column by column. This centers each feature around zero with unit variance, so for roughly normally distributed features about 68% of the values fall within the -1 to +1 range.
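
To make that concrete, here’s a toy sketch of what StandardScaler computes (independent of our dataset):

import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0, 1000.0],
                [2.0, 2000.0],
                [3.0, 3000.0]])

# StandardScaler subtracts each column's mean and divides by its standard deviation
print(StandardScaler().fit_transform(toy))
print((toy - toy.mean(axis=0)) / toy.std(axis=0))  # same result, computed manually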

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the scaler on the training data only, then apply that same transformation
# to the test data; fitting on the test set would leak information into training
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 3: Build the Model

Now for the fun part, actually building our model! We’ll use a simple one here, consisting of an input layer and an output layer.

Note that this wouldn’t be considered a “deep neural network”, since there are no layers between the input and output. “Deep” doesn’t really have a strict threshold, but generally a deep neural network has two or more (sometimes hundreds of) layers between the input and output. These are called hidden layers.

N, D = X_train.shape  # D is the number of input features (30)

model = tf.keras.models.Sequential([
  tf.keras.layers.Input(shape=(D,)),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Here our output layer is denoted by tf.keras.layers.Dense(1, activation='sigmoid'), where the 1 means we only have 1 output (a binary label), and our activation function is a sigmoid, σ(x) = 1 / (1 + e^(-x)). The sigmoid squashes any number into the range (0, 1), so we can read the output as the probability of the class labeled 1 given the inputs (in this dataset, 1 happens to mean benign and 0 malignant).

Sigmoid function
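
If you did want a proper deep network, you’d just stack layers between the input and output. A sketch of what that might look like (not used in this post; the 32-unit hidden layer is an arbitrary choice):

deep_model = tf.keras.models.Sequential([
  tf.keras.layers.Input(shape=(D,)),
  tf.keras.layers.Dense(32, activation='relu'),    # hidden layer
  tf.keras.layers.Dense(1, activation='sigmoid')   # output layer
])
deep_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])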

Step 4: Train the Model

We can train our model with just 1 line of code.

r = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100)

We feed in our training data, and the model makes 100 passes (epochs) over it. We also pass the test set as validation_data, so after every epoch we can see how the model performs on data it isn’t training on.

The last 5 training epochs output:

Epoch 96/100
381/381 [==============================] - 0s 88us/sample - loss: 0.0906 - accuracy: 0.9764 - val_loss: 0.0902 - val_accuracy: 0.9734
Epoch 97/100
381/381 [==============================] - 0s 86us/sample - loss: 0.0902 - accuracy: 0.9764 - val_loss: 0.0899 - val_accuracy: 0.9734
Epoch 98/100
381/381 [==============================] - 0s 88us/sample - loss: 0.0898 - accuracy: 0.9764 - val_loss: 0.0896 - val_accuracy: 0.9734
Epoch 99/100
381/381 [==============================] - 0s 82us/sample - loss: 0.0894 - accuracy: 0.9790 - val_loss: 0.0893 - val_accuracy: 0.9734
Epoch 100/100
381/381 [==============================] - 0s 91us/sample - loss: 0.0890 - accuracy: 0.9790 - val_loss: 0.0891 - val_accuracy: 0.9734
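
As a side note, you don’t have to hard-code 100 epochs. Keras ships an EarlyStopping callback that halts training once the validation loss stops improving. A quick sketch (not used in this post):

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

r = model.fit(X_train, y_train,
              validation_data=(X_test, y_test),
              epochs=100,
              callbacks=[early_stop])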

Step 5: Evaluate the Model

print('Train score:', model.evaluate(X_train, y_train))
print('Test score:',  model.evaluate(X_test, y_test))

Which outputs:

381/381 [==============================] - 0s 53us/sample - loss: 0.0888 - accuracy: 0.9790
Train score: [0.08877973610491265, 0.9790026]
188/188 [==============================] - 0s 51us/sample - loss: 0.0891 - accuracy: 0.9734
Test score: [0.08907421020434257, 0.9734042]

Our basic model is predicting breast cancer with 97.34% accuracy on the test set! (evaluate() returns the loss followed by the accuracy.)

Let’s look at how our model performed while training.

import matplotlib.pyplot as plt

plt.plot(r.history['loss'], label='loss')
plt.plot(r.history['val_loss'], label='val_loss')
plt.legend()

Training and validation loss (error)

The loss, aka error, measures how far the model’s predicted probabilities are from the true labels (here it’s binary cross-entropy, not a simple count of right/wrong predictions). This is what you want to see: it starts off big and gets smaller as training progresses.
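
For the curious, binary cross-entropy is easy to compute by hand. A sketch, assuming the trained model from above (the clip guards against log(0)):

import numpy as np

p = model.predict(X_test).flatten()    # predicted probabilities
p = np.clip(p, 1e-7, 1 - 1e-7)         # avoid log(0)
bce = -np.mean(y_test * np.log(p) + (1 - y_test) * np.log(1 - p))
print(bce)  # should be close to the val_loss Keras reported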

We can make a similar plot of the model’s accuracy.

plt.plot(r.history['accuracy'], label='acc')
plt.plot(r.history['val_accuracy'], label='val_acc')
plt.legend()

Training and validation accuracy

Step 6: Make Predictions!

Let’s make some predictions now!

P = model.predict(X_test)

P is a vector of predicted probabilities, one for each row in X_test. Each entry is the conditional probability p(y=1 | X), i.e. the probability of the class labeled 1 (benign, in this dataset).

But we don’t need probabilities. We want a vector of 0’s and 1’s matching the dataset’s own labels, where 0 means malignant and 1 means benign.

Let’s round our predictions vector, and flatten it so it has the same shape as our y_test data (predict() returns a column of shape (n_samples, 1)).

import numpy as np

P = np.round(P).flatten()

print(P)

Our P vector now looks like:

[1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 0.
 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0.
 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 1. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1.
 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 1.
 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0.
 1. 1. 0. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1.]

Now let’s calculate the accuracy, and compare it to our model.evaluate() output from before.

print('Manually calculated accuracy:', np.mean(P == y_test))
print('Evaluate() output:', model.evaluate(X_test, y_test))

Which outputs:

Manually calculated accuracy: 0.973404255319149
188/188 [==============================] - 0s 64us/sample - loss: 0.0891 - accuracy: 0.9734
Evaluate() output: [0.08907421020434257, 0.9734042]

As expected, it’s the same.
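
Accuracy alone hides which kind of mistakes the model makes, which matters a lot for a medical problem. A quick sketch using Scikit Learn’s confusion_matrix (assuming P and y_test from above):

from sklearn.metrics import confusion_matrix

# Rows are the true labels, columns the predicted labels
# (in this dataset 0 = malignant, 1 = benign)
print(confusion_matrix(y_test, P.astype(int)))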

Final Thoughts

Think about what this means for a sec… With just a few lines of code, and using real data from real medical examinations, we’re able to predict breast cancer with 97% accuracy.

They took our jobs!

Of course the point of this exercise wasn’t to take doctors’ jobs. It was to get a feel for Tensorflow 2.0.

Stay tuned for more practical deep learning and artificial intelligence posts!

View the full notebook here.