A Principal Component Analysis and Logistic Regression of Breast Cancer Data
As usual, it will be presented from a R Notebook.
<!DOCTYPE html>
Logistic Regression with PCA Example
Dr. Clifton Baldwin
April 2020
This script reads in breast cancer data, does a PCA to determine significant features, and then does some logistic regressions using the significant features from the first few principal components
library(tidyverse) # For read_csv and ggplot
library(factoextra) # For fviz_screeplot and fviz_contrib
Cancer data
- https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
- https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
- http://jotterbach.github.io/2016/03/24/Principal_Component_Analysis/
I want to determine Diagnosis, which requires supervised learning, but I do not know what features to include in my model. I need to determine what columns in the dataset are appropriate for analysis, for example, I want to identify possible data from identifiers like IDs or Names. Then I can use PCA to determine the features for a logistic regression supervised learning model. I will compute the principal components. Then I will select a subset of original features from the weights of the principal components. Using the original variables and then again with the principal components, I will run a logistic regression. Finally I will look at the statistics to see how good each model does.
Prep the data using “cleaned data”
cancer <- read_csv("wdbc.csv", col_names = TRUE)
Parsed with column specification:
cols(
.default = col_double(),
Diagnosis = [31mcol_character()[39m
)
See spec(...) for full column specifications.
# Let us see the data
glimpse(cancer)
Observations: 569
Variables: 32
$ ID [3m[38;5;246m<dbl>[39m[23m 842302, 842517, 84300903, 84348301, 84358402, 843786, 844…
$ Diagnosis [3m[38;5;246m<chr>[39m[23m "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M…
$ mean_radius [3m[38;5;246m<dbl>[39m[23m 17.990, 20.570, 19.690, 11.420, 20.290, 12.450, 18.250, 1…
$ mean_texture [3m[38;5;246m<dbl>[39m[23m 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.98, 20.83, 2…
$ mean_perimeter [3m[38;5;246m<dbl>[39m[23m 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, 119.60, 90.…
$ mean_area [3m[38;5;246m<dbl>[39m[23m 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, 1040.0, 577…
$ mean_smoothness [3m[38;5;246m<dbl>[39m[23m 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0.12780, 0.0…
$ mean_compactness [3m[38;5;246m<dbl>[39m[23m 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0.17000, 0.1…
$ mean_concavity [3m[38;5;246m<dbl>[39m[23m 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0.15780, 0.1…
$ mean_concave_points [3m[38;5;246m<dbl>[39m[23m 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0.08089, 0.0…
$ mean_symmetry [3m[38;5;246m<dbl>[39m[23m 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087, 0.1794, 0…
$ mean_fractal_dimension [3m[38;5;246m<dbl>[39m[23m 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0.07613, 0.0…
$ se_radius [3m[38;5;246m<dbl>[39m[23m 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345, 0.4467, 0…
$ se_texture [3m[38;5;246m<dbl>[39m[23m 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902, 0.7732, 1…
$ se_perimeter [3m[38;5;246m<dbl>[39m[23m 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.180, 3.856, 2…
$ se_area [3m[38;5;246m<dbl>[39m[23m 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.91, 50.96, …
$ se_smoothness [3m[38;5;246m<dbl>[39m[23m 0.006399, 0.005225, 0.006150, 0.009110, 0.011490, 0.00751…
$ se_compactness [3m[38;5;246m<dbl>[39m[23m 0.049040, 0.013080, 0.040060, 0.074580, 0.024610, 0.03345…
$ se_concavity [3m[38;5;246m<dbl>[39m[23m 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0.03672, 0.0…
$ se_concave_points [3m[38;5;246m<dbl>[39m[23m 0.015870, 0.013400, 0.020580, 0.018670, 0.018850, 0.01137…
$ se_symmetry [3m[38;5;246m<dbl>[39m[23m 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0.02165, 0.0…
$ se_fractal_dimension [3m[38;5;246m<dbl>[39m[23m 0.006193, 0.003532, 0.004571, 0.009208, 0.005115, 0.00508…
$ worst_radius [3m[38;5;246m<dbl>[39m[23m 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.88, 17.06, 1…
$ worst_texture [3m[38;5;246m<dbl>[39m[23m 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.66, 28.14, 3…
$ worst_perimeter [3m[38;5;246m<dbl>[39m[23m 184.60, 158.80, 152.50, 98.87, 152.20, 103.40, 153.20, 11…
$ worst_area [3m[38;5;246m<dbl>[39m[23m 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, 1606.0, 897…
$ worst_smoothness [3m[38;5;246m<dbl>[39m[23m 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791, 0.1442, 0…
$ worst_compactness [3m[38;5;246m<dbl>[39m[23m 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249, 0.2576, 0…
$ worst_concavity [3m[38;5;246m<dbl>[39m[23m 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0.53550, 0.3…
$ worst_concave_points [3m[38;5;246m<dbl>[39m[23m 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0.17410, 0.1…
$ worst_symmetry [3m[38;5;246m<dbl>[39m[23m 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985, 0.3063, 0…
$ worst_fractal_dimension [3m[38;5;246m<dbl>[39m[23m 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0.12440, 0.0…
Notice that Diagnosis is a character with either B (benign) or M (malignant). Since there is a finite (only 2) choices for this feature, I will convert it to a factor
cancer$Diagnosis <- factor(cancer$Diagnosis)
Also note that the first column is ID. I do not think it will be useful in any analysis. I will use the remaining columns as features in a PCA.
pcaCancer <- prcomp(cancer[3:32], center = TRUE, scale. = TRUE)
Proportion of variance explained for 1st principal component
vars <- pcaCancer$rotation[,1] / sum(pcaCancer$rotation[,1])
head(vars[order(-vars)])
mean_concave_points mean_concavity worst_concave_points mean_compactness
0.05169734 0.05121114 0.04972187 0.04742280
worst_perimeter worst_concavity
0.04689847 0.04533833
mean_concave_points is the strongest (not by much) contributor.
Proportion of variance explained for 2nd principal component
vars <- pcaCancer$rotation[,2] / sum(pcaCancer$rotation[,2])
head(vars[order(-vars)])
mean_fractal_dimension se_fractal_dimension worst_fractal_dimension
0.2841535 0.2171152 0.2134313
se_compactness se_smoothness se_concavity
0.1803913 0.1584657 0.1528666
mean_fractal_dimension is the strongest contributor.
Proportion of variance explained for 3rd principal component
vars <- pcaCancer$rotation[,3] / sum(pcaCancer$rotation[,3])
head(vars[order(-vars)])
se_texture se_smoothness se_symmetry se_radius se_perimeter
0.4634608 0.3820659 0.3570088 0.3321394 0.3298680
se_concave_points
0.2779248
se_texture
Proportion of variance explained for 4th principal component
vars <- pcaCancer$rotation[,4] / sum(pcaCancer$rotation[,4])
head(vars[order(-vars)])
worst_texture mean_texture se_texture
0.77147824 0.73519936 0.43871247
worst_compactness worst_fractal_dimension worst_concavity
0.11134167 0.09393858 0.09015647
worst_texture
We could graph the contributions to each principal component
fviz_contrib(pcaCancer, choice = "var", sort.val = c("desc"), axes = 1, top = 10, title = "1st Principal Component")
It does not appear that there is a clear winner?? Maybe we will have to use the principal components instead of the contributing variables?
fviz_contrib(pcaCancer, choice = "var", sort.val = c("desc"), axes = 2, top = 5, title = "2nd Principal Component")
mean_fractal_dimension is clearly the strongest contributor
I can get these four columns in this case using cancer[,c(10,12,14,24)] (I had to figure that out by looking at names(cancer))
names(cancer[,c(10,12,14,24)])
[1] "mean_concave_points" "mean_fractal_dimension" "se_texture"
[4] "worst_texture"
I want to make sure se_texture and worst_texture are NOT correlated.
plot(cancer$se_texture, cancer$worst_texture, main = "se_texture vs. worst_texture")
Or using ggplot so I can add a regression line
ggplot(cancer, aes(x=se_texture, y=worst_texture)) +
geom_point(size=2, shape=23) +
geom_smooth(method='lm') +
labs(title="se_texture vs. worst_texture")
There does not appear to be correlation, but we can double check with statistics
cor(cancer$se_texture, cancer$worst_texture)
[1] 0.4090028
Hmm, those two features might be weakly correlated, but I will keep them anyway. However, I have to keep in mind they may be correlated, in which case I may end up not using both.
Let us separate training and test data.
set.seed(23)
indexes <- sample(1:nrow(cancer), size=0.75*nrow(cancer))
training <- cancer[indexes,]
testing <- cancer[-indexes,]
rm(indexes)
Now run a logistic regression using the four determined features.
diagnose = glm(Diagnosis ~ mean_concave_points + mean_fractal_dimension + se_texture + worst_texture, data=training, family=binomial)
Look at the results of the logistic regression.
summary(diagnose)
Call:
glm(formula = Diagnosis ~ mean_concave_points + mean_fractal_dimension +
se_texture + worst_texture, family = binomial, data = training)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9869 -0.1365 -0.0258 0.0372 3.3857
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.97074 2.79201 -1.780 0.075019 .
mean_concave_points 150.02211 19.76962 7.589 3.24e-14 ***
mean_fractal_dimension -162.51587 45.33638 -3.585 0.000338 ***
se_texture -2.09973 0.71709 -2.928 0.003410 **
worst_texture 0.36795 0.06359 5.786 7.19e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 564.91 on 425 degrees of freedom
Residual deviance: 102.04 on 421 degrees of freedom
AIC: 112.04
Number of Fisher Scoring iterations: 8
Using the training data, let us look at the confusion matrix. I will use 0.45 instead of 0.5 as the boundary because I want to err on the side of catching a cancer.
predictTrain = predict(diagnose, type="response")
table(training$Diagnosis, predictTrain >= 0.45)
FALSE TRUE
B 256 9
M 12 149
The sensitivity
t1 = table(training$Diagnosis, predictTrain >= 0.45)
t1[4]/(t1[2] + t1[4]) # sensitivity
[1] 0.9254658
And the specificity
t1[1]/(t1[1] + t1[3]) # specificity
[1] 0.9660377
And the accuracy
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
[1] 0.9507042
Really good! But that was using the training data. I really need to get the confusion matrix using the testing data.
predictTest = predict(diagnose, type="response", newdata=testing)
table(testing$Diagnosis, predictTest >= 0.45)
FALSE TRUE
B 88 4
M 4 47
The metrics are
t1 = table(testing$Diagnosis, predictTest >= 0.45)
t1[4]/(t1[2] + t1[4]) # sensitivity
[1] 0.9215686
t1[1]/(t1[1] + t1[3]) # specificity
[1] 0.9565217
And the accuracy is
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
[1] 0.9440559
94% That is really good
Maybe we can do even better without our possible correlation.
diagnose = glm(Diagnosis ~ mean_concave_points + mean_fractal_dimension + se_texture, data=training, family=binomial)
predictTest = predict(diagnose, type="response", newdata=testing)
The accuracy is computed as follows:
t1 = table(testing$Diagnosis, predictTest >= 0.45)
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
[1] 0.8951049
rm(t1)
A bit worse, but maybe different features again
diagnose = glm(Diagnosis ~ mean_concave_points + mean_fractal_dimension + worst_texture, data=training, family=binomial)
predictTest = predict(diagnose, type="response", newdata=testing)
t1 = table(testing$Diagnosis, predictTest >= 0.45)
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
[1] 0.9440559
rm(t1)
That was the best so far! I am going with mean_concave_points + mean_fractal_dimension + worst_texture. I could try other features (I did not look at the 5th principal component, but I could choose any)
Now let us try again but this time with the principal components. Create a graph to determine how many principal components I should use.
fviz_screeplot(pcaCancer, ncp=5)
Looks like two or maybe three principal components would be enough.
Again, I need training and test data
set.seed(23)
indexes <- sample(1:nrow(cancer), size=0.75*nrow(cancer))
# Because we are using the principal components, from pcaCancer, we need to include the target variable
Diagnosis <- cancer[indexes,]$Diagnosis
training <- data.frame(Diagnosis, pcaCancer$x[indexes,] )
Diagnosis <- cancer[-indexes,]$Diagnosis
testing <- data.frame(Diagnosis, pcaCancer$x[-indexes,] )
rm(Diagnosis)
Now run the logistic regression using the first three principal components
diagnose = glm(Diagnosis ~ PC1 + PC2 + PC3, data = training, family=binomial)
predictTest = predict(diagnose, type="response", newdata=testing)
t1 = table(testing$Diagnosis, predictTest >= 0.45)
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
[1] 0.9440559
rm(t1)
Maybe we can do better with only two principal components
diagnose = glm(Diagnosis ~ PC1 + PC2, data = training, family=binomial)
predictTest = predict(diagnose, type="response", newdata=testing)
t1 = table(testing$Diagnosis, predictTest >= 0.45)
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
[1] 0.9440559
rm(t1)
WOW! 94% accuracy - with only two principal components!
I suspect that the lack of a clear contributor to the 1st principal component should have convinced us to use the principal components anyway.