Clifton_Baldwin

Data Science Journal of Clif Baldwin

Predicting Food Inspection Results

16 April ’20

Predicting Food Facility Inspections

I did not get a chance to submit this exploration and model to the MachineHack challenge, but it makes for a good post.

Predicting Food Inspections

Machine Hack Challenge for Food Inspections

March - April 2020

Objective:

Given data from past inspections, build a predictive model that is capable of predicting the outcome of an inspection conducted in a facility based on the given set of features

The Data

Before proceeding, load the required R libraries

library(tidyverse)
Registered S3 method overwritten by 'dplyr':
  method           from
  print.rowwise_df     
── Attaching packages ────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(readxl) # used to read the given Excel file
library(lubridate) # used to handle the Date column

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date

From the provided documentation, the features are:

ID: A unique id for each inspection
Date: The date at which the inspection was done in a particular facility
LicenseNo: De-identified license number for a particular facility
FacilityID: De-identified unique facility id for a facility
FacilityName: The encoded name of a facility
Type: The type of the facility being inspected
Street: The encoded street where the facility is located
City: The encoded city where the facility is located
State: The encoded state where the facility is located
LocationID: An encoded location feature
Reason: The primary reason for the inspection
SectionViolations: Laws violated by the facility
RiskLevel: The level of risk the facility poses to the consumers
Geo_Loc: De-identified geo location of the facility

Target: Inspection_Results: The result of the inspection. It can have any of the following values:
0: 'FACILITY CHANGED'
1: 'FAIL'
2: 'FURTHER INSPECTION REQUIRED'
3: 'INSPECTION OVERRULED'
4: 'PASS'
5: 'PASS(CONDITIONAL)'
6: 'SHUT-DOWN'

Now load the training dataset as a data frame and take a look

df <- read_excel("Data_Train.xlsx", sheet = 1)
glimpse(df)
Observations: 147,443
Variables: 15
$ ID                 <dbl> 3.110349e+13, 1.008900e+13, 4.014897e+13, 3.715771e+13, 4.747805e+13, 2.523429e+13, 1.764017e+13,…
$ Date               <chr> "26-04-2010", "21-06-2009", "01-05-2013", "28-09-2015", "09-12-2015", "07-02-2016", "04-04-2011",…
$ LicenseNo          <dbl> 4744, 2973, 18223, 20825, 2136, 13009, 9621, 32170, 27027, 7131, 26522, 3350, 15323, 34228, 37390…
$ FacilityID         <dbl> 8123, 12268, 1112, 20007, 16867, 7942, 3791, 24054, 24785, 2892, 5574, 12919, 2017, 24693, 26965,…
$ FacilityName       <dbl> 7715, 11664, 969, 19115, 10409, 7547, 3576, 8766, 23652, 2707, 5307, 12263, 1829, 23563, 5722, 11…
$ Type               <chr> "RESTAURANT", "GROCERY STORE", "RESTAURANT", "RESTAURANT", "RESTAURANT", "RESTAURANT", "RESTAURAN…
$ Street             <dbl> 15522, 3057, 14988, 3661, 7876, 12275, 2133, 5006, 4554, 5899, 16017, 4538, 17881, 14317, 4306, 1…
$ City               <chr> "id-11235901", "id-11235901", "id-11235901", "id-11235901", "id-11235901", "id-11235901", "id-112…
$ State              <chr> "id_1890134", "id_1890134", "id_1890134", "id_1890134", "id_1890134", "id_1890134", "id_1890134",…
$ LocationID         <dbl> 81876, 81862, 81883, 81859, 81886, 81877, 81887, 81851, 81892, 81877, 81864, 81867, 81862, 81905,…
$ Reason             <chr> "CANVASS", "COMPLAINT", "CANVASS", "CANVASS RE-INSPECTION", "COMPLAINT", "CANVASS", "CANVASS RE-I…
$ SectionViolations  <dbl> 33, 33, NA, 31, 30, 18, 32, NA, NA, 32, NA, 33, NA, 16, 2, 32, 33, 18, 21, NA, 32, 32, 33, NA, 32…
$ RiskLevel          <chr> "High", "High", "High", "Medium", "High", "High", "High", "Low", "High", "High", "Medium", "High"…
$ Geo_Loc            <chr> "locid16406", "locid878", "locid3368", "locid11839", "locid12264", "locid3935", "locid8876", "loc…
$ Inspection_Results <chr> "4", "4", "6", "4", "4", "1", "4", "1", "4", "4", "6", "4", "3", "1", "5", "4", "4", "1", "5", "1…

Most of the columns are not in the appropriate format for this analysis. Although I might not need to use ID, I know it should be a factor instead of a number.

# Convert ID to type factor
df$ID <- factor(df$ID)

Change Date from character to a date type. Attempting the conversion, some values fail to parse. Investigate…

x <- dmy(df$Date)
 53 failed to parse.
ids <- df[is.na(x),"ID"]
ids <- ids %>% pull()
df %>% filter(ID %in% ids) %>% select(Date)

There are 53 February 29 dates in non-leap years! Since I cannot ask anyone what these dates should be, I guess I will make them March 1?

df <- df %>%
  mutate(Date = case_when(ID %in% ids ~ str_replace(Date, "29-02","01-03"), TRUE ~ Date))
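The remapping rule can be sketched in Python for illustration: a date string that fails to parse because of an impossible February 29 is remapped to March 1 (the function name and example dates are mine, not from the original program):

```python
from datetime import datetime

def parse_dmy(s):
    """Parse a dd-mm-YYYY string; remap an impossible Feb 29 to Mar 1."""
    try:
        return datetime.strptime(s, "%d-%m-%Y").date()
    except ValueError:
        if s.startswith("29-02"):
            return parse_dmy("01-03" + s[5:])  # same remap as the mutate() above
        raise

print(parse_dmy("29-02-2015"))  # 2015 is not a leap year -> remapped to 2015-03-01
print(parse_dmy("28-09-2015"))  # a valid date parses normally
```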

Now Date will convert from character to date

df$Date <- dmy(df$Date)

There are many other categorical variables that are stored as double or character. Convert these remaining appropriate (categorical) features to factors

df$LicenseNo <- factor(df$LicenseNo)
df$FacilityID <- factor(df$FacilityID)
df$FacilityName <- factor(df$FacilityName)
df$Type <- factor(df$Type)
df$City <- factor(df$City)
df$State <- factor(df$State) # State should be unique, but Street may not (same street name in different states?)
df$LocationID <- factor(df$LocationID)
df$Reason <- factor(df$Reason)
df$SectionViolations <- factor(df$SectionViolations)
df$RiskLevel <- factor(df$RiskLevel)
df$Geo_Loc <- factor(df$Geo_Loc)

Inspection_Results needs to be a factor also, but I want to make it an Integer for some histograms first.

Analysis

I wonder if inspections take place throughout the year, and if so, is there a seasonal effect? A count of observations per month should show if any months are lacking.

sapply(1:12, function(x) { length(which(month(df$Date)==x)) })
 [1] 11267 10543 12863 12455 13708 13596 11073 13372 13708 13522 11471  9865

Ok, at least 10 thousand observations per month. Well, December has 9,865, but that is close enough, since it is not unusual for there to be fewer inspections around the holidays. Although inspections do not appear seasonal in the dataset, perhaps time of year has an impact on inspection results. I will create a variable to distinguish cold months from warm months.

df$winter <- ifelse((month(df$Date) %in% c(10,11,12,1,2,3)), 1L, 0L)
df$winter <- factor(df$winter)

How many different Type are there

length(levels(df$Type))
[1] 409

Hmm, 409 different levels. How many are NA?

sum(is.na(df$Type))
[1] 3485

There are 3485 observations without a Type. Investigate what is happening when Type is missing

nas <- df[is.na(df$Type),]
sum(is.na(nas$SectionViolations))
[1] 3401

Almost all (3,401 of the 3,485) observations that have no Type also have no SectionViolations. Although interesting, do these NA values have any impact on Inspection_Results, the target variable? Let us look at a histogram of all Inspection_Results

hist(as.integer(df$Inspection_Results), main = "Histogram of All Inspection_Results", xlab = "Inspection Results")

Now let us compare the histogram with Type observations of only NA and SectionViolations of NA

hist(as.integer(nas[is.na(nas$SectionViolations),]$Inspection_Results), main = "Inspection_Results with Type NA and SectionViolations NA", xlab = "Inspection Results")

Interesting! Why are most of the missing SectionViolations the ones that are ShutDown??

sum(which(nas[is.na(nas$SectionViolations),]$Inspection_Results != "6"))
[1] 806390
sum(which(nas[is.na(nas$SectionViolations),]$Inspection_Results == "6"))
[1] 4978711

Careful: sum(which(...)) adds up the row indices returned by which() rather than counting rows, so 806,390 and 4,978,711 are index sums, not observation counts; length(which(...)) would give the counts. Still, consistent with the histogram, most of these observations are ShutDown. I am thinking I cannot remove these NA observations without introducing bias, but I should look at the data without those observations first.
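Incidentally, since R's which() returns (1-based) positions, sum(which(...)) sums positions while length(which(...)) counts matches. A quick Python illustration of the difference (hypothetical values):

```python
# Hypothetical inspection results; "6" means SHUT-DOWN
results = ["6", "4", "6", "6", "1", "6"]

# Like R's which(results == "6"): the 1-based positions of the matches
idx = [i + 1 for i, v in enumerate(results) if v == "6"]

print(sum(idx))  # sum of positions, what sum(which(...)) computes -> 14
print(len(idx))  # count of matches, what length(which(...)) computes -> 4
```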

nas <- df[!is.na(df$Type),]
hist(as.integer(nas$Inspection_Results), main = "Histogram of Inspection_Results without Type = NA", xlab = "Inspection Results")

What if we just remove the SectionViolations that are NA

hist(as.integer(nas[is.na(nas$SectionViolations),]$Inspection_Results), main = "Inspection_Results without Type = NA and with SectionViolations NA", xlab = "Inspection Results")

It appears we lose many of the "5"s (PASS(CONDITIONAL)). So I am not going to remove the NA values at this time. Of the non-NA Type, how many are unique (i.e. only one observation)

df %>%
  filter(!is.na(Type)) %>%  
  group_by(Type) %>%
  summarise(n = n()) %>% 
  filter(n < 2) %>% nrow()
[1] 71

71 Types appear only once. How many more have fewer than 10 observations

df %>%
  filter(!is.na(Type)) %>%  
  group_by(Type) %>%
  summarise(n = n()) %>% 
  filter(n < 10 & n > 1) %>% nrow()
[1] 245

Another 245 Types, which means 316 have fewer than 10 observations. Going to the other extreme, how many Types have more than 20 observations

df %>%
  filter(!is.na(Type)) %>%  
  group_by(Type) %>%
  summarise(n = n()) %>% 
  filter(n > 20) %>% nrow()
[1] 53

53 Types have more than 20 observations. How many total observations is that

df %>%
  filter(!is.na(Type)) %>%  
  group_by(Type) %>%
  summarise(n = n()) %>% 
  filter(n > 20) %>% summarise(sum(n))

There are 142,350 observations with more than 20 observations per Type. So do I keep or exclude the NA’s and the rare Types?? For now, I am keeping all the observations. Moving on to City,

levels(df$City)
[1] "id-11235901" "id-11275913"

Only two cities in the dataset. Are they from different states?

levels(df$State)
[1] "id_1890134" "id_1890135"

Also two levels. Although I am sure there is one city per state, I should verify it.

length(which(df$State[which(df$City == "id-11235901")] == 'id_1890135'))
[1] 19
length(which(df$State[which(df$City == "id-11275913")] == 'id_1890135'))
[1] 22

Wait, what?? What about the two cities in the other state

length(which(df$State[which(df$City == "id-11235901")] == 'id_1890134'))
[1] 147177
length(which(df$State[which(df$City == "id-11275913")] == 'id_1890134'))
[1] 225

It appears there are a total of four, not two, combinations of City and State. Running a chi-square test to determine independence

chisq.test(df$City, df$State, correct = FALSE)
Chi-squared approximation may be incorrect

    Pearson's Chi-squared test

data:  df$City and df$State
X-squared = 7016.5, df = 1, p-value < 2.2e-16

So City and State are not independent; the tiny p-value rejects independence. How about Reason
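To sanity-check the statistic, the Pearson chi-square value can be recomputed from the four counts above in pure Python (no continuity correction, matching correct = FALSE):

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (no correction)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

# City x State counts reported above
print(round(chi_square([[147177, 19], [225, 22]])))  # approximately 7016, as R reported
```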

levels(df$Reason)
 [1] "CANVASS"                                "CANVASS RE-INSPECTION"                 
 [3] "COMPLAINT"                              "COMPLAINT RE-INSPECTION"               
 [5] "COMPLAINT-FIRE"                         "CONSULTATION"                          
 [7] "LICENSE"                                "LICENSE RE-INSPECTION"                 
 [9] "LICENSE-TASK FORCE"                     "OUT OF BUSINESS"                       
[11] "RECENT INSPECTION"                      "SHORT FORM COMPLAINT"                  
[13] "SHORT FORM FIRE-COMPLAINT"              "SUSPECTED FOOD POISONING"              
[15] "SUSPECTED FOOD POISONING RE-INSPECTION" "TAG REMOVAL"                           
[17] "TASK FORCE LIQUOR 1475"                

17 levels. How many are NA

sum(is.na(df$Reason))
[1] 0

And no NA for Reason. Next, SectionViolations

length(levels(df$SectionViolations))
[1] 61

61 levels of SectionViolations. How many are NA

sum(is.na(df$SectionViolations))
[1] 39068

39,068 are NA (about 26%). Let us look at the histogram with the NA SectionViolations removed

hist(as.integer(df[!is.na(df$SectionViolations),]$Inspection_Results), main = "Inspection_Results without SectionViolations of NA", xlab = "Inspection Results")

We lose the 2s, the 6s, and most of the 3s. So the presence of SectionViolations indicates the result is probably 1, 4, or 5 (FAIL, PASS, or PASS(CONDITIONAL)). So I guess we cannot exclude them.

Look at the RiskLevel

levels(df$RiskLevel)
[1] "High"      "Low"       "Medium"    "Uncertain"

Note 4 levels with one being “Uncertain”. How many are NA

sum(is.na(df$RiskLevel))
[1] 0

No RiskLevel are NA. How many are “Uncertain” though

length(which(df$RiskLevel=="Uncertain"))
[1] 21

Only 21 are “Uncertain”

Finally, let us look at Geo_Loc

length(levels(df$Geo_Loc))
[1] 16316

No! Over 16k levels, far too many to use directly. I think I will use City, Reason, and RiskLevel, and also try Type and SectionViolations.

I need to make sure the target variable is clean

df$Inspection_Results <- factor(df$Inspection_Results)
levels(df$Inspection_Results)
[1] "0" "1" "2" "3" "4" "5" "6"

Validation dataset

I need to split the training dataset into training and validation sets. (Note that the split below is not seeded, so it will differ between runs.)

indexes = sample(1:nrow(df), size=0.75*nrow(df))
training = df[indexes,]  
testing = df[-indexes,]  
rm(indexes)

Modeling

Of course there are many options, but an appropriate model needs to be a classifier that accepts categorical features. So I cannot select features using Principal Component Analysis, since that requires numerical data. I am selecting a Naive Bayes classification model.
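For intuition, with categorical features Naive Bayes just multiplies a class prior by per-feature conditional frequencies. A toy Python sketch of the idea (illustrative only, not the e1071 implementation; the sample rows are made up, and it uses add-one smoothing over the values seen for each class):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class priors and per-(feature, class) value frequencies."""
    prior = Counter(labels)
    cond = defaultdict(Counter)  # (feature, class) -> Counter of values
    for row, y in zip(rows, labels):
        for f, v in row.items():
            cond[(f, y)][v] += 1
    return prior, cond

def predict_nb(prior, cond, row):
    """Pick the class maximizing prior * product of smoothed frequencies."""
    total = sum(prior.values())
    best, best_score = None, float("-inf")
    for y, n in prior.items():
        score = n / total
        for f, v in row.items():
            counts = cond[(f, y)]
            score *= (counts[v] + 1) / (n + len(counts) + 1)  # add-one smoothing
        if score > best_score:
            best, best_score = y, score
    return best

# Tiny hypothetical sample in the spirit of the inspection data
rows = [{"Reason": "CANVASS", "RiskLevel": "High"},
        {"Reason": "COMPLAINT", "RiskLevel": "High"},
        {"Reason": "CANVASS", "RiskLevel": "Low"}]
labels = ["4", "1", "4"]  # 4 = PASS, 1 = FAIL
prior, cond = train_nb(rows, labels)
print(predict_nb(prior, cond, {"Reason": "CANVASS", "RiskLevel": "High"}))
```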

Load the R library that has a Naive Bayes function

library(e1071)

I will use set.seed so that each model run starts off “the same.” Therefore the only change should be the features input to the model. For the first Naive Bayes model, I will use City + Reason + RiskLevel + Type + SectionViolations + winter as the features.

set.seed(7)
model = naiveBayes(Inspection_Results ~ City + Reason + RiskLevel + Type + SectionViolations + winter, data=training)

Using the test dataset, predict using the model

Predict <- predict(model,newdata = testing )

Create a confusion matrix to check accuracy

cm <- table(Predict, testing$Inspection_Results )
sum(diag(cm)) / sum(cm) # Accuracy
[1] 0.7669352
rm(model, Predict, cm)

76% accurate. I am not sure whether we prefer higher sensitivity or higher specificity, where sensitivity is the true positive rate (the probability of detection) and specificity is the true negative rate (the proportion of actual negatives correctly identified as such). Furthermore, I am not sure whether we care more about certain results than others. Therefore I will just use accuracy as the measure of the model.
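For reference, with a binary confusion matrix these measures are simple ratios; a quick Python sketch with hypothetical counts:

```python
# Hypothetical binary confusion matrix counts
tp, fn = 80, 20  # actual positives: detected vs. missed
fp, tn = 10, 90  # actual negatives: false alarms vs. correctly cleared

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # fraction of all cases correct
sensitivity = tp / (tp + fn)                   # true positive rate
specificity = tn / (tn + fp)                   # true negative rate

print(accuracy, sensitivity, specificity)
```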

Let us try the model without the winter feature

set.seed(7)
model = naiveBayes(Inspection_Results ~ City + Reason + RiskLevel + Type + SectionViolations, data=training)

# Predict using the model and the validation dataset
Predict <- predict(model,newdata = testing )

#Confusion matrix to check accuracy
cm <- table(Predict, testing$Inspection_Results )
sum(diag(cm)) / sum(cm) # Accuracy
[1] 0.7663113
rm(model, Predict, cm)

Almost no difference in accuracy. Assuming a simpler model is a better model, I will omit winter. I am skeptical of SectionViolations and will try the model without it.

set.seed(7)
model = naiveBayes(Inspection_Results ~ City + Reason + RiskLevel + Type, data=training)

# Predict using the model and the validation dataset
Predict <- predict(model,newdata = testing )

#Confusion matrix to check accuracy
cm <- table(Predict, testing$Inspection_Results )

(sum(cm) - sum(diag(cm))) / sum(cm) # Error rate (1 - accuracy)
[1] 0.4508559
rm(model, Predict, cm)

I was wrong!! (Note that the chunk above printed the error rate, not accuracy, so accuracy fell to about 55%.) SectionViolations is important! I will put SectionViolations back in as a feature but try to omit City.

set.seed(7)
model = naiveBayes(Inspection_Results ~ Reason + RiskLevel + Type + SectionViolations, data=training)

# Predict using the model and the validation dataset
Predict <- predict(model,newdata = testing )

#Confusion matrix to check accuracy
cm <- table(Predict, testing$Inspection_Results )
sum(diag(cm)) / sum(cm) # Accuracy
[1] 0.7664198
rm(model, Predict, cm)

It appears City has no impact on the model. Now I will try the model without Type

set.seed(7)
model = naiveBayes(Inspection_Results ~ Reason + RiskLevel + SectionViolations, data=training)

# Predict using the model and the validation dataset
Predict <- predict(model,newdata = testing )

#Confusion matrix to check accuracy
cm <- table(Predict, testing$Inspection_Results )
sum(diag(cm)) / sum(cm) # Accuracy
[1] 0.7748026
rm(model, Predict, cm)

Only a marginal gain, but this model is the best so far. Maybe I need to do something with the NA values in Type. Try the model with Type, excluding the NA's

set.seed(7)
model = training %>%
  filter(!is.na(Type)) %>%
  naiveBayes(Inspection_Results ~ Reason + RiskLevel + SectionViolations + Type, data=.)
  
# Predict using the model and the validation dataset
Predict <- predict(model,newdata = testing )

#Confusion matrix to check accuracy
cm <- table(Predict, testing$Inspection_Results )
sum(diag(cm)) / sum(cm) # Accuracy
[1] 0.7483519
rm(model, Predict, cm)

As I suspected earlier, we lose something when the Type NA's are omitted. I guess the NA's are predictive, but Type is still not helpful. From what I tried, the best Naive Bayes model gets around 77% accuracy, using naiveBayes(Inspection_Results ~ Reason + RiskLevel + SectionViolations, data=training).

Just an aside, this “challenge” was on MachineHack but had ended before I found it. Of course it does not really matter since my results are not as good as the top models submitted to the challenge.


A Python Program for a Stock Watchlist

7 April ’20

Simple Python Program for a Raspberry Pi

A little diversion from the usual data science posts. Since I usually post an R script that does some data analysis, I thought it could not hurt to show that I know some Python also.

StockWatch

StockWatch on Raspberry Pi Zero W

I created a Python program for my Raspberry Pi Zero W that could download stock data every night for a specific set of stocks I wanted to watch. For each stock ticker symbol, the program determines the company name, the most recent closing price, the closing price from the previous day the market was open, and how far the current price sits below the 52-week high, scaled by the 52-week range. The data is stored in a CSV file. In the actual program, the CSV file is emailed to my phone, but I do not want to reveal my email information. I figured out how to email with the help of a Google search, and I did not do anything fancy.

In [2]:
# Load the required Python libraries
import pandas as pd # for Excel file
import pickle  # for storing data - in a pickle named 'wsj_financials'
from datetime import date
from pandas_datareader import data

Function prices(sym, today)

A function called prices takes the ticker symbol and the current date. It uses DataReader to access Yahoo! for the stock price data. I do not like using Yahoo! because they have changed things in the past, which broke my programs, but I do not know a better option at this time for what I am doing.

The output of this function is a Tuple that consists of the stock's current closing price, the closing price of the previous market session, and the computed percentage that the current price is off the 52-week-high.

In [3]:
def prices(sym, today):
    """Get a year of closing prices for a stock.
       Return a tuple of the latest close, the previous session's close,
       and the (high - price)/(high - low) ratio off the 52-week high.
    """
    # Get price data for the most recent year
    lastYear = date.fromordinal(today.toordinal() - 365)

    try:
        price = data.DataReader(sym, start=lastYear,
                                end=today, data_source='yahoo')
    except Exception:
        print("**** " + sym + " error!")
        with open('./stock_errors.log', 'a') as out_file:
            out_file.write(sym
                           + " dropped due to price error! "
                           + str(today) + "\n")
        raise  # re-raise so the caller's try/except skips this ticker

    p = price['Close']    # 52 weeks (one year) of closing prices
    high52 = p.max()      # the 52-week high closing price
    low52 = p.min()       # the 52-week low closing price
    latest = p[-1]        # the most recent closing price
    yesterdays = p[-2]    # the closing price on the previous market session
    try:
        priceIndex = ((float(high52) - float(latest))
                      / (float(high52) - float(low52))) # Compute the percentage off the 52-week-high
    except:          # if some stocks do not return sufficient data, save it in an error log
        with open('./stock_errors.log', 'a') as out_file:
            out_file.write(sym 
                           + " has insufficient data for 52-week range - price set to 0.0 "
                           + str(today) + "\n")
        priceIndex = 0.0
    priceTuple = (latest, yesterdays, priceIndex) # Return the desired data
    return priceTuple
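With made-up numbers, the percentage-off-high calculation works like this (the prices here are hypothetical, not real quotes):

```python
# Hypothetical 52-week figures for illustration
high52, low52, latest = 150.0, 50.0, 125.0

# Fraction of the 52-week range that the latest close sits below the high
price_index = (high52 - latest) / (high52 - low52)
print(round(100.0 * price_index, 2))  # 25.0
```

A stock at its 52-week high scores 0%, and one at its 52-week low scores 100%.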

The Main Section

The list of stock ticker symbols is saved in an Excel file. This Excel file is read to create an array of ticker symbols, which is named wsj_stocks. Just as an aside, I named it wsj_stocks because originally I was going to look at the thousand or so stocks that the Wall Street Journal covers. Not only was that too many stocks for me to "glance" through, I realized I care about only 50 stocks or so, whether or not the Wall Street Journal covers them. So I should rename the array "clif_stocks", but I just kept the original name.

In [4]:
# Look up the stock value of tickers as of today
today = date.today()

# Identify the file to read as an Excel file
xlsx = pd.ExcelFile('special 20200314.xlsx')

# Read the Excel file
wsj_stocks = pd.read_excel(xlsx, 'Sheet1', usecols=[0,1], header=0, skipinitialspace=True)
In [5]:
wsj_stocks  # The list of stock tickers with their company names that will be included in the report.
Out[5]:
COMPANY SYM
0 Activision Inc ATVI
1 Alibaba BABA
2 Alphabet Inc Class A GOOGL
3 Alphabet Inc Class C GOOG
4 Amazon.com AMZN
5 American Airlines Group AAL
6 American Express AXP
7 Amgen AMGN
8 Apple Computer AAPL
9 Applied Materials AMAT
10 Biogen Idec BIIB
11 Canadian National Railway Co. CNI
12 Carnival Corp. CCL
13 Chewy CHWY
14 Costco Wholesale Corp. COST
15 Delta Airlines DAL
16 Electronic Arts Inc EA
17 Energy Transfer ET
18 Exxon Mobil Corp. XOM
19 Fortinet FTNT
20 General Electric GE
21 Johnson & Johnson JNJ
22 Kimberly-Clark KMB
23 Royal Dutch Shell RDS-B
24 Marathon Petroleum MPC
25 Merck & Co. MRK
26 Norfolk Southern Corp. NSC
27 Norweigian Cruise Line Holdings NCLH
28 Nvidia NVDA
29 Palo Alto Networks PANW
30 Regeneron Pharmaceuticals REGN
31 Royal Carribean RCL
32 Southwest Airlines LUV
33 Take Two Interactive TTWO
34 Tesla Motors TSLA
35 Twitter Inc TWTR
36 Visa V
37 Western Midstream WES
38 MasterCard Inc MA
39 Walt Disney Corp DIS
40 CSX Corp CSX
41 Gilead Sciences GILD
42 Lowes Corporation L
43 Home Depot HD
44 T. Rowe Price Group TROW
45 Bristol-Myers Squibb BMY
46 Eli Lilly LLY
47 Tencent (Not a Henry recommendation) TCEHY

For each stock ticker, the program calls the prices() function and then saves the data in a dictionary. Once in a dictionary, the ticker symbols can be sorted.

In [6]:
wsj = {} # dictionary to store the retrieved data
for index, row in wsj_stocks.iterrows(): 
    ticker = row["SYM"].strip()
    company = row["COMPANY"].strip()

    try:  # Attempt to get current price
        priceTuple = prices(ticker, today) # Get the current price information of this stock
        # priceTuple = (latest, yesterdays, priceIndex)
        price = round(priceTuple[0],2)
        yesterdays = round(priceTuple[1],2)
        priceIndex = round(100.0 * priceTuple[2],2)
        # Only save if the price was acquired
        wsj[ticker] = {"PRICE": price, 
                       "YESTERDAY": yesterdays, 
                       "OFFHIGH": priceIndex,
                       "COMPANY": company}
    except:
        continue
# End for loop keys in wsj

A large string is constructed so that the information can be sent in an email or, in this case, just printed to the screen.

In [9]:
# Compose the body of the email to be sent
body = "Select stocks to watch for " + str(today) + " compiled on a Raspberry Pi Zero W.\n"
body = body + "In alphabetical order by ticker symbol\n\n"
body = body + "Numbered, Ticker, Company, Closing Price, Yesterday's Close, Percentage Off High \n"

rating = 0
# Populate the body of the email message in alphabetical order
for sym in sorted(wsj.keys()):
    rating += 1
    thisStock = wsj[sym]
    body = body + str(rating) + ": " + sym + ", " + \
           str(thisStock['COMPANY']) + ", $" + \
           str(thisStock['PRICE']) +  ", $" + \
           str(thisStock['YESTERDAY']) + ", " + \
           str(thisStock['OFFHIGH']) + "% \n"
    

body = body + "\n\nDISCLAIMER: Code has been validated, but the potential for errors still exists!\n\n"
In [10]:
print(body)
Select stocks to watch for 2020-04-07 compiled on a Raspberry Pi Zero W.
In alphabetical order by ticker symbol

Numbered, Ticker, Company, Closing Price, Yesterday's Close, Percentage Off High 
1: AAL, American Airlines Group, $10.22, $9.5, 96.73% 
2: AAPL, Apple Computer, $259.43, $262.47, 44.04% 
3: AMAT, Applied Materials, $47.56, $46.15, 67.47% 
4: AMGN, Amgen, $208.78, $211.58, 44.99% 
5: AMZN, Amazon.com, $2011.6, $1997.59, 32.13% 
6: ATVI, Activision Inc, $59.87, $61.6, 20.29% 
7: AXP, American Express, $87.58, $83.87, 72.61% 
8: BABA, Alibaba, $198.0, $196.45, 39.99% 
9: BIIB, Biogen Idec, $301.21, $311.39, 32.22% 
10: BMY, Bristol-Myers Squibb, $56.9, $57.87, 42.7% 
11: CCL, Carnival Corp., $11.3, $10.21, 93.04% 
12: CHWY, Chewy, $33.31, $33.22, 29.55% 
13: CNI, Canadian National Railway Co., $80.03, $79.98, 56.68% 
14: COST, Costco Wholesale Corp., $303.63, $305.12, 24.2% 
15: CSX, CSX Corp, $61.15, $61.27, 58.95% 
16: DAL, Delta Airlines, $22.25, $22.32, 97.85% 
17: DIS, Walt Disney Corp, $101.24, $99.58, 76.5% 
18: EA, Electronic Arts Inc, $106.32, $106.51, 26.98% 
19: ET, Energy Transfer, $5.58, $5.54, 90.63% 
20: FTNT, Fortinet, $106.69, $107.88, 28.33% 
21: GE, General Electric, $7.03, $7.23, 86.95% 
22: GILD, Gilead Sciences, $74.67, $77.73, 29.84% 
23: GOOG, Alphabet Inc Class C, $1186.51, $1186.92, 69.36% 
24: GOOGL, Alphabet Inc Class A, $1182.56, $1183.19, 70.42% 
25: HD, Home Depot, $192.29, $191.33, 57.69% 
26: JNJ, Johnson & Johnson, $137.48, $139.76, 38.53% 
27: KMB, Kimberly-Clark, $131.39, $133.15, 46.95% 
28: L, Lowes Corporation, $36.27, $35.7, 75.28% 
29: LLY, Eli Lilly, $141.88, $141.61, 13.49% 
30: LUV, Southwest Airlines, $32.77, $30.7, 92.04% 
31: MA, MasterCard Inc, $259.08, $265.94, 60.51% 
32: MPC, Marathon Petroleum, $22.51, $23.09, 88.54% 
33: MRK, Merck & Co., $78.56, $80.31, 52.57% 
34: NCLH, Norweigian Cruise Line Holdings, $11.01, $10.01, 93.75% 
35: NSC, Norfolk Southern Corp., $153.53, $154.75, 63.89% 
36: NVDA, Nvidia, $259.03, $268.4, 30.77% 
37: PANW, Palo Alto Networks, $172.81, $171.34, 65.96% 
38: RCL, Royal Carribean, $33.55, $29.61, 90.05% 
39: RDS-B, Royal Dutch Shell, $35.65, $35.71, 66.09% 
40: REGN, Regeneron Pharmaceuticals, $501.51, $504.27, 1.2% 
41: TCEHY, Tencent (Not a Henry recommendation), $49.47, $49.94, 30.29% 
42: TROW, T. Rowe Price Group, $101.9, $102.7, 69.28% 
43: TSLA, Tesla Motors, $545.45, $516.24, 50.37% 
44: TTWO, Take Two Interactive, $120.06, $121.29, 31.82% 
45: TWTR, Twitter Inc, $25.61, $24.93, 84.59% 
46: V, Visa, $168.59, $169.44, 57.65% 
47: WES, Western Midstream, $4.62, $3.9, 95.13% 
48: XOM, Exxon Mobil Corp., $41.24, $40.47, 81.15% 


DISCLAIMER: Code has been validated, but the potential for errors still exists!


The comma-delimited format of the output allows me to save it as a CSV file also and import it into R or Excel for further analysis at a later time.

  • Dr. Clifton Baldwin
  • April 7, 2020

Modeling Covid-19 in the US

3 April ’20

Modeling the Cases of Covid-19 in the United States

Using public data obtained from Johns Hopkins, I modeled the spread of CoVid-19 up to April 3, 2020. As usual, it will be presented from an R Notebook.

<!DOCTYPE html>

Exponential Model of CoVid-19 in the US

Confirmed Cases in the US

While staying at home to mitigate the spread of this virus, I decided to model the spread of CoVid-19, mostly for personal therapeutic reasons. Modeling the data makes me feel like I am doing something other than anxiously watching it. No one should trust my predictions! My model is not very sophisticated but appears to have a good fit. Of course the steps being taken to mitigate this virus as well as the potential for new therapies could impact the predictions.

CoVid-19 Data

The data was downloaded from https://github.com/CSSEGISandData/COVID-19 on April 2, 2020. Johns Hopkins University updates the data on this Github site daily.

Once the data is downloaded locally, I use readr from the tidyverse to load the data into R.

covid19 <- read_csv("time_series_covid19_confirmed_global.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Province/State` = col_character(),
##   `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.

Since each country is handling the pandemic a bit differently, I single out the US.

We have heard that China may be providing false statistics. It has been reported that Italy does not have enough tests and can only test the very sick. Logically, if they test only the very sick, they are going to have a lower Confirmed Cases count and a higher percentage of mortality. Although the US Confirmed Case count arguably has problems of its own, I am interested primarily in the US situation. Of course I care about the rest of the world, but the rest of the world does not directly impact my day-to-day activities. Therefore this analysis will use only the US ConfirmedCases and Fatalities.

Assumptions

The main assumption of this prediction is that the trend will continue on its current trajectory. If social distancing, stay-at-home orders, and any other policies are at all effective, this prediction will miss the mark. Of course any new treatments and therapies could also alter the prediction.

A limiting factor is that only people with positive Coronavirus tests are counted as Confirmed Cases, while many people who may have a milder case of the virus are never tested. Once a person is sick, it may be a day or more before that person gets tested, and then it takes a few days before results are known. Therefore the Confirmed Cases may lag by almost a week.

Clean the Data

Using the piping feature from the tidyverse, the US data can be extracted and cleaned in one piped statement. The US row is extracted, the wide date columns are reshaped into rows, the appropriate rows are kept, and the date and confirmed-case columns are converted to the correct types.

covid19 <- subset(covid19, `Country/Region` == 'US') %>% gather() %>% slice(5:n()) %>% 
  mutate(key=as.Date(key, format = "%m/%d/%y"), value=as.integer(value))

With the date in the appropriate format, a graph can be produced of the confirmed cases over time.

covid19 %>% 
  ggplot(aes(x=key, y= value)) + geom_line() + ggtitle("US Confirmed Cases") +
  xlab('Dates') + ylab('Count')

Looking at this graph shows an exponential curve. The first confirmed case is recorded as January 22, and the case count did not increase by much for several weeks. It makes sense that the cases went from 1 to 2 and then to 5 over the initial days. In any case, I want to model the data against the exponential curve in order to forecast forward.

I model the data without the use of R machine learning libraries. First, compute lambda for confirmed cases by taking the base-2 log of the latest value and subtracting the base-2 log of the first value. Since the first value is 1 and log2(1) = 0, I can use just the log of the latest value. The result is divided by the number of days, which here equals the number of observations. This calculation is implemented in R.

lambda = covid19 %>% select(value) %>% tail(n=1) %>% pull() %>% as.integer() %>% log2() 
x2 = covid19 %>% nrow()
Lambda = lambda / x2
rm(lambda, x2)

The computed exponent is 0.249, which indicates the cases double every 4 days or so.

With the exponent, lambda, I can forecast the confirmed cases for the next two weeks. I use the equation 2 ^ (lambda * x), where x is the day number starting from day 1. I use a base of 2 so that it is easy to determine how frequently the confirmed cases double: the doubling time is simply 1/lambda days.

dates = c(covid19$key[1:14], covid19$key + 14L) # Extend out two weeks (14 days)
x <- 1:length(dates)
predictions <- data.frame(x=x, key=dates, value=2 ^ (Lambda * x))
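As a sanity check, the same arithmetic can be run in a few lines of Python (the day count and case count here are made up for illustration, not the actual Johns Hopkins figures):

```python
import math

days = 72        # hypothetical number of days observed
cases = 240_000  # hypothetical latest cumulative count (first count assumed 1)

lam = math.log2(cases) / days        # growth exponent, base 2
doubling_time = 1.0 / lam            # days for the count to double
forecast = 2 ** (lam * (days + 14))  # projected count two weeks out

print(round(lam, 3), round(doubling_time, 1))  # 0.248 4.0
```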

Graph the confirmed cases and the predicted cases together.

covid19 %>%  mutate(Type = 'Actual_US') %>% bind_rows(predictions %>% mutate(Type = 'Prediction')) %>% 
  ggplot(aes(x = key, y = value, color = Type)) + 
  geom_line() + ggtitle("US Actual vs Predicted Cases") +   xlab('Dates') + ylab('Count')

As one can see, the prediction follows very closely up to today. I guess we will see if the predictions hold in two weeks.

Fatalities in the US

Let us do the exact same thing with the fatalities data in the US. Since the code is exactly the same, I will do it in one block.

rm(covid19, predictions)
covid19 <- read_csv("time_series_covid19_deaths_global.csv") # Fatality data
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Province/State` = col_character(),
##   `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.
covid19 <- subset(covid19, `Country/Region` == 'US') %>% gather() %>% slice(5:n()) %>% 
  mutate(key=as.Date(key, format = "%m/%d/%y"), value=as.integer(value))

# Code the exponential model
lambda = covid19 %>% select(value) %>% tail(n=1) %>% pull() %>% as.integer() %>% log2() 
x2 = covid19 %>% nrow()
mortality = lambda / x2

dates = c(covid19$key[1:14], covid19$key + 14L)
x <- 1:length(dates)
predictions <- data.frame(x=x, key=dates, value=2 ^ (mortality * x))

rm(x, x2, dates, lambda)

The computed exponent for fatalities is 0.174, which indicates the count doubles every 5 or 6 days. At least at this time, the fatality growth rate is not as high as the growth rate of confirmed cases, which is curious. Perhaps there is a lag, although I hope not. In any case, the predictions in this post are just the result of my assumptions and calculations, and I hope they are wrong.

The graph of actual versus forecasted fatalities.

covid19 %>%  mutate(Type = 'Actual_US') %>% bind_rows(predictions %>% mutate(Type = 'Prediction')) %>% 
  ggplot(aes(x = key, y = value, color = Type)) + 
  geom_line() + ggtitle("US Actual vs Predicted Fatalities") +   xlab('Dates') + ylab('Count')

The prediction of fatalities is not as good a fit as the one for confirmed cases. Hopefully the prediction is incorrect because the mortality rate drops soon. But another possibility is the stage of the pandemic: when I ran this code a few weeks ago, the confirmed cases graph looked similar, and now that the disease has spread, that model fits the data better. Although I hope my forecast is wrong, it is possible the fatality data will fit better in a few weeks.

PCA and Logistic Regression of Breast Cancer Data

1 April ’20

A Principal Component Analysis and Logistic Regression of Breast Cancer Data

As usual, it will be presented from an R Notebook.

<!DOCTYPE html>

Logistic Regression with PCA Example

April 2020

This script reads in breast cancer data, performs a PCA to identify significant features, and then runs logistic regressions using the significant features from the first few principal components.

library(tidyverse) # For read_csv and ggplot
library(factoextra) # For fviz_screeplot and fviz_contrib

Cancer data

I want to predict Diagnosis, which requires supervised learning, but I do not know what features to include in my model. I need to determine which columns in the dataset are appropriate for analysis; for example, I want to separate genuine measurements from identifiers like IDs or names. Then I can use PCA to determine the features for a logistic regression supervised learning model. I will compute the principal components and then select a subset of original features from the weights of the principal components. Using the original variables, and then again with the principal components, I will run a logistic regression. Finally I will look at the statistics to see how well each model does.

Prep the data using “cleaned data”

cancer <- read_csv("wdbc.csv", col_names = TRUE)
Parsed with column specification:
cols(
  .default = col_double(),
  Diagnosis = col_character()
)
See spec(...) for full column specifications.
# Let us see the data
glimpse(cancer)
Observations: 569
Variables: 32
$ ID                      <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786, 844…
$ Diagnosis               <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M…
$ mean_radius             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450, 18.250, 1…
$ mean_texture            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.98, 20.83, 2…
$ mean_perimeter          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, 119.60, 90.…
$ mean_area               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, 1040.0, 577…
$ mean_smoothness         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0.12780, 0.0…
$ mean_compactness        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0.17000, 0.1…
$ mean_concavity          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0.15780, 0.1…
$ mean_concave_points     <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0.08089, 0.0…
$ mean_symmetry           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087, 0.1794, 0…
$ mean_fractal_dimension  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0.07613, 0.0…
$ se_radius               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345, 0.4467, 0…
$ se_texture              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902, 0.7732, 1…
$ se_perimeter            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.180, 3.856, 2…
$ se_area                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.91, 50.96, …
$ se_smoothness           <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.011490, 0.00751…
$ se_compactness          <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.024610, 0.03345…
$ se_concavity            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0.03672, 0.0…
$ se_concave_points       <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.018850, 0.01137…
$ se_symmetry             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0.02165, 0.0…
$ se_fractal_dimension    <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.005115, 0.00508…
$ worst_radius            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.88, 17.06, 1…
$ worst_texture           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.66, 28.14, 3…
$ worst_perimeter         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40, 153.20, 11…
$ worst_area              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, 1606.0, 897…
$ worst_smoothness        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791, 0.1442, 0…
$ worst_compactness       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249, 0.2576, 0…
$ worst_concavity         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0.53550, 0.3…
$ worst_concave_points    <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0.17410, 0.1…
$ worst_symmetry          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985, 0.3063, 0…
$ worst_fractal_dimension <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0.12440, 0.0…

Notice that Diagnosis is a character with either B (benign) or M (malignant). Since there are only two possible values for this feature, I will convert it to a factor.

cancer$Diagnosis <- factor(cancer$Diagnosis)

Also note that the first column is ID. I do not think it will be useful in any analysis. I will use the remaining columns as features in a PCA.

pcaCancer <- prcomp(cancer[3:32], center = TRUE, scale. = TRUE)

Relative contribution of each variable to the 1st principal component (computed from the loadings)

vars <- pcaCancer$rotation[,1] / sum(pcaCancer$rotation[,1])
head(vars[order(-vars)])
 mean_concave_points       mean_concavity worst_concave_points     mean_compactness 
          0.05169734           0.05121114           0.04972187           0.04742280 
     worst_perimeter      worst_concavity 
          0.04689847           0.04533833 

mean_concave_points is the strongest (not by much) contributor.

Relative contribution of each variable to the 2nd principal component

vars <- pcaCancer$rotation[,2] / sum(pcaCancer$rotation[,2])
head(vars[order(-vars)])
 mean_fractal_dimension    se_fractal_dimension worst_fractal_dimension 
              0.2841535               0.2171152               0.2134313 
         se_compactness           se_smoothness            se_concavity 
              0.1803913               0.1584657               0.1528666 

mean_fractal_dimension is the strongest contributor.

Relative contribution of each variable to the 3rd principal component

vars <- pcaCancer$rotation[,3] / sum(pcaCancer$rotation[,3])
head(vars[order(-vars)])
       se_texture     se_smoothness       se_symmetry         se_radius      se_perimeter 
        0.4634608         0.3820659         0.3570088         0.3321394         0.3298680 
se_concave_points 
        0.2779248 

se_texture is the strongest contributor.

Relative contribution of each variable to the 4th principal component

vars <- pcaCancer$rotation[,4] / sum(pcaCancer$rotation[,4])
head(vars[order(-vars)])
          worst_texture            mean_texture              se_texture 
             0.77147824              0.73519936              0.43871247 
      worst_compactness worst_fractal_dimension         worst_concavity 
             0.11134167              0.09393858              0.09015647 

worst_texture is the strongest contributor.

We could graph the contributions to each principal component

fviz_contrib(pcaCancer, choice = "var", sort.val = c("desc"), axes = 1, top = 10, title = "1st Principal Component")

There does not appear to be a clear winner. Maybe we will have to use the principal components themselves instead of the contributing variables.

fviz_contrib(pcaCancer, choice = "var", sort.val = c("desc"), axes = 2, top = 5, title = "2nd Principal Component")

mean_fractal_dimension is clearly the strongest contributor

I can get these four columns in this case using cancer[,c(10,12,14,24)] (I had to figure that out by looking at names(cancer))

names(cancer[,c(10,12,14,24)])
[1] "mean_concave_points"    "mean_fractal_dimension" "se_texture"            
[4] "worst_texture"         

I want to make sure se_texture and worst_texture are NOT correlated.

plot(cancer$se_texture, cancer$worst_texture, main = "se_texture vs. worst_texture")

Or using ggplot so I can add a regression line

ggplot(cancer, aes(x=se_texture, y=worst_texture)) +
  geom_point(size=2, shape=23) +
  geom_smooth(method='lm') + 
  labs(title="se_texture vs. worst_texture")

There does not appear to be a strong correlation, but we can double-check with statistics.

cor(cancer$se_texture, cancer$worst_texture)
[1] 0.4090028

Hmm, those two features are at least moderately correlated (r ≈ 0.41), but I will keep them anyway. However, I have to keep that correlation in mind; I may end up not using both.

Let us separate training and test data.

set.seed(23)
indexes <- sample(1:nrow(cancer), size=0.75*nrow(cancer))
training <- cancer[indexes,] 
testing <- cancer[-indexes,] 
rm(indexes)

Now run a logistic regression using the four determined features.

diagnose = glm(Diagnosis ~ mean_concave_points + mean_fractal_dimension + se_texture + worst_texture, data=training, family=binomial)

Look at the results of the logistic regression.

summary(diagnose)

Call:
glm(formula = Diagnosis ~ mean_concave_points + mean_fractal_dimension + 
    se_texture + worst_texture, family = binomial, data = training)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9869  -0.1365  -0.0258   0.0372   3.3857  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -4.97074    2.79201  -1.780 0.075019 .  
mean_concave_points     150.02211   19.76962   7.589 3.24e-14 ***
mean_fractal_dimension -162.51587   45.33638  -3.585 0.000338 ***
se_texture               -2.09973    0.71709  -2.928 0.003410 ** 
worst_texture             0.36795    0.06359   5.786 7.19e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 564.91  on 425  degrees of freedom
Residual deviance: 102.04  on 421  degrees of freedom
AIC: 112.04

Number of Fisher Scoring iterations: 8

Using the training data, let us look at the confusion matrix. I will use 0.45 instead of 0.5 as the boundary because I want to err on the side of catching a cancer.

predictTrain = predict(diagnose, type="response")
table(training$Diagnosis, predictTrain >= 0.45)
   
    FALSE TRUE
  B   256    9
  M    12  149

The sensitivity

t1 = table(training$Diagnosis, predictTrain >= 0.45)
t1[4]/(t1[2] + t1[4]) # sensitivity
[1] 0.9254658

And the specificity

t1[1]/(t1[1] + t1[3]) # specificity
[1] 0.9660377

And the accuracy

(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
[1] 0.9507042
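R fills the 2x2 table column-major, so t1[1] is actual-B/predicted-FALSE (true negatives), t1[2] is actual-M/FALSE (false negatives), t1[3] is actual-B/TRUE (false positives), and t1[4] is actual-M/TRUE (true positives). Here is the same arithmetic as a Python sketch, using the training counts from the table above and naming each cell explicitly:

```python
# Counts copied from the training confusion matrix above
tn, fp = 256, 9    # actual B: predicted FALSE / TRUE
fn, tp = 12, 149   # actual M: predicted FALSE / TRUE

sensitivity = tp / (tp + fn)                # t1[4] / (t1[2] + t1[4])
specificity = tn / (tn + fp)                # t1[1] / (t1[1] + t1[3])
accuracy = (tn + tp) / (tn + fp + fn + tp)  # correct over total

print(round(sensitivity, 7), round(specificity, 7), round(accuracy, 7))
# 0.9254658 0.9660377 0.9507042
```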

Really good! But that was using the training data. I really need to get the confusion matrix using the testing data.

predictTest = predict(diagnose, type="response", newdata=testing)
table(testing$Diagnosis, predictTest >= 0.45)
   
    FALSE TRUE
  B    88    4
  M     4   47

The metrics are

t1 = table(testing$Diagnosis, predictTest >= 0.45)
t1[4]/(t1[2] + t1[4]) # sensitivity
[1] 0.9215686
t1[1]/(t1[1] + t1[3]) # specificity
[1] 0.9565217

And the accuracy is

(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
[1] 0.9440559

94%! That is really good.

Maybe we can do even better by dropping one of the possibly correlated features.

diagnose = glm(Diagnosis ~ mean_concave_points + mean_fractal_dimension + se_texture, data=training, family=binomial)
predictTest = predict(diagnose, type="response", newdata=testing)

The accuracy is computed as follows:

t1 = table(testing$Diagnosis, predictTest >= 0.45)
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
[1] 0.8951049
rm(t1)

A bit worse. Let me try a different combination of features.

diagnose = glm(Diagnosis ~ mean_concave_points + mean_fractal_dimension + worst_texture, data=training, family=binomial)
predictTest = predict(diagnose, type="response", newdata=testing)
t1 = table(testing$Diagnosis, predictTest >= 0.45)
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
[1] 0.9440559
rm(t1)

That matches the best result so far, with one fewer feature! I am going with mean_concave_points + mean_fractal_dimension + worst_texture. I could try other features (I did not look at the 5th principal component, for instance, but I could choose any).

Now let us try again but this time with the principal components. Create a graph to determine how many principal components I should use.

fviz_screeplot(pcaCancer, ncp=5)

Looks like two or maybe three principal components would be enough.

Again, I need training and test data

set.seed(23)
indexes <- sample(1:nrow(cancer), size=0.75*nrow(cancer))

# Because we are using the principal components, from pcaCancer, we need to include the target variable
Diagnosis <- cancer[indexes,]$Diagnosis
training <- data.frame(Diagnosis, pcaCancer$x[indexes,] )

Diagnosis <- cancer[-indexes,]$Diagnosis
testing <- data.frame(Diagnosis, pcaCancer$x[-indexes,] )
rm(Diagnosis)

Now run the logistic regression using the first three principal components

diagnose = glm(Diagnosis ~ PC1 + PC2 + PC3, data = training, family=binomial)
predictTest = predict(diagnose, type="response", newdata=testing)
t1 = table(testing$Diagnosis, predictTest >= 0.45)
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
[1] 0.9440559
rm(t1)

Maybe we can do better with only two principal components

diagnose = glm(Diagnosis ~ PC1 + PC2, data = training, family=binomial)
predictTest = predict(diagnose, type="response", newdata=testing)
t1 = table(testing$Diagnosis, predictTest >= 0.45)
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
[1] 0.9440559
rm(t1)

WOW! 94% accuracy - with only two principal components!

I suspect that the lack of a clear contributor to the 1st principal component should have convinced us to use the principal components anyway.
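The whole PCA-then-logistic-regression recipe can also be sketched in Python with scikit-learn. This is my own parallel sketch, not the notebook's code: I am assuming scikit-learn is available and using its bundled Wisconsin breast cancer dataset as a stand-in for wdbc.csv.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The same 30 measurement features as wdbc.csv, with a 0/1 diagnosis target
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=23)

# Standardize (prcomp's center/scale), keep two components, then classify
model = make_pipeline(StandardScaler(), PCA(n_components=2),
                      LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # test-set accuracy on the held-out quarter
```

As in the R version, two standardized principal components already carry enough signal for a strong classifier.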

---
title: "Logistic Regression with PCA Example"
author: "Dr. Clifton Baldwin"
output: html_notebook
---

April 2020

This script reads in breast cancer data, does a PCA to determine significant features, and then does some logistic regressions using the significant features from the first few principal components

```{r message=FALSE}
library(tidyverse) # For read_csv and ggplot
library(factoextra) # For fviz_screeplot and fviz_contrib
```

Cancer data

- https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
- https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21
- http://jotterbach.github.io/2016/03/24/Principal_Component_Analysis/

I want to determine Diagnosis, which requires supervised learning, but I do not know what features to include in my model.
I need to determine what columns in the dataset are appropriate for analysis, for example, I want to identify possible data from identifiers like IDs or Names.
Then I can use PCA to determine the features for a logistic regression supervised learning model.
I will compute the principal components. Then I will select a subset of original features from the weights of the principal components.
Using the original variables and then again with the principal components, I will run a logistic regression.
Finally I will look at the statistics to see how good each model does.


Prep the data using "cleaned data"
```{r}
cancer <- read_csv("wdbc.csv", col_names = TRUE)
# Let us see the data
glimpse(cancer)
```

Notice that Diagnosis is a character with either B (benign) or M (malignant). Since there is a finite (only 2) choices for this feature, I will convert it to a factor
```{r}
cancer$Diagnosis <- factor(cancer$Diagnosis)
```


Also note that the first column is ID. I do not think it will be useful in any analysis. I will use the remaining columns as features in a PCA.
```{r}
pcaCancer <- prcomp(cancer[3:32], center = TRUE, scale. = TRUE)

```

Proportion of variance explained for 1st principal component
```{r}
vars <- pcaCancer$rotation[,1] / sum(pcaCancer$rotation[,1])
head(vars[order(-vars)])

```

mean_concave_points is the strongest contributor, though not by much.

Relative contribution of each variable to the 2nd principal component
```{r}
vars <- pcaCancer$rotation[,2] / sum(pcaCancer$rotation[,2])
head(vars[order(-vars)])
```

mean_fractal_dimension is the strongest contributor.

Relative contribution of each variable to the 3rd principal component
```{r}
vars <- pcaCancer$rotation[,3] / sum(pcaCancer$rotation[,3])
head(vars[order(-vars)])
```

se_texture is the strongest contributor.

Relative contribution of each variable to the 4th principal component
```{r}
vars <- pcaCancer$rotation[,4] / sum(pcaCancer$rotation[,4])
head(vars[order(-vars)])
```

worst_texture is the strongest contributor.

We could graph the contributions to each principal component
```{r}
fviz_contrib(pcaCancer, choice = "var", sort.val = c("desc"), axes = 1, top = 10, title = "1st Principal Component")
```

There is no clear winner. Perhaps we will have to use the principal components themselves instead of individual contributing variables.


```{r}
fviz_contrib(pcaCancer, choice = "var", sort.val = c("desc"), axes = 2, top = 5, title = "2nd Principal Component")
```

mean_fractal_dimension is clearly the strongest contributor

I can get these four columns in this case using cancer[,c(10,12,14,24)] (I had to figure that out by looking at names(cancer))
```{r}
names(cancer[,c(10,12,14,24)])
```

I want to make sure se_texture and worst_texture are NOT correlated.
```{r}
plot(cancer$se_texture, cancer$worst_texture, main = "se_texture vs. worst_texture")
```

Or using ggplot so I can add a regression line
```{r}
ggplot(cancer, aes(x=se_texture, y=worst_texture)) +
  geom_point(size=2, shape=23) +
  geom_smooth(method='lm') + 
  labs(title="se_texture vs. worst_texture")
```

There does not appear to be a strong correlation, but we can double-check with statistics.

```{r}
cor(cancer$se_texture, cancer$worst_texture)
```

Hmm, those two features might be weakly correlated, but I will keep both for now. If they turn out to be correlated, I may end up dropping one of them.
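To quantify that impression, a correlation test gives a p-value alongside the coefficient (a quick aside, not part of the original model-building):

```{r}
# Test whether the correlation between the two texture features is significant
cor.test(cancer$se_texture, cancer$worst_texture)
```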

Let us separate training and test data.
```{r}
set.seed(23)
indexes <- sample(1:nrow(cancer), size=0.75*nrow(cancer))
training <- cancer[indexes,] 
testing <- cancer[-indexes,] 
rm(indexes)

```


Now run a logistic regression using the four determined features.
```{r}
diagnose = glm(Diagnosis ~ mean_concave_points + mean_fractal_dimension + se_texture + worst_texture, data=training, family=binomial)

```

Look at the results of the logistic regression.
```{r}
summary(diagnose)
```

Using the training data, let us look at the confusion matrix. I will use 0.45 instead of 0.5 as the threshold because I want to err on the side of catching a cancer.

```{r}
predictTrain = predict(diagnose, type="response")
table(training$Diagnosis, predictTrain >= 0.45)

```

The sensitivity (true positive rate)
```{r}
t1 = table(training$Diagnosis, predictTrain >= 0.45)
t1[4]/(t1[2] + t1[4]) # sensitivity

```

And the specificity
```{r}
t1[1]/(t1[1] + t1[3]) # specificity
```

And the accuracy
```{r}
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
```

Really good! But that was using the training data. I really need to get the confusion matrix using the testing data.

```{r}
predictTest = predict(diagnose, type="response", newdata=testing)
table(testing$Diagnosis, predictTest >= 0.45)

```

The metrics are
```{r}
t1 = table(testing$Diagnosis, predictTest >= 0.45)
t1[4]/(t1[2] + t1[4]) # sensitivity

```

```{r}
t1[1]/(t1[1] + t1[3]) # specificity
```

And the accuracy is 
```{r}
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
```

94%! That is really good.
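As an aside, the repeated cell indexing above is easy to get wrong, so the three metrics can be wrapped in a small helper (a sketch; `confusion_metrics` is a name I introduce here, assuming the table is built as `table(actual, predicted)` with the positive class in row/column 2, as it is for B/M above):

```{r}
# Compute sensitivity, specificity, and accuracy from a 2x2 confusion matrix
# laid out as table(actual, predicted) with the positive class second
confusion_metrics <- function(t) {
  c(sensitivity = t[2, 2] / (t[2, 1] + t[2, 2]),
    specificity = t[1, 1] / (t[1, 1] + t[1, 2]),
    accuracy    = (t[1, 1] + t[2, 2]) / sum(t))
}
confusion_metrics(t1)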


Maybe we can do even better by dropping one of the possibly correlated features.
```{r}
diagnose = glm(Diagnosis ~ mean_concave_points + mean_fractal_dimension + se_texture, data=training, family=binomial)
predictTest = predict(diagnose, type="response", newdata=testing)

```

The accuracy is computed as follows:
```{r}
t1 = table(testing$Diagnosis, predictTest >= 0.45)
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
rm(t1)

```

A bit worse, but let us try a different feature combination.
```{r}
diagnose = glm(Diagnosis ~ mean_concave_points + mean_fractal_dimension + worst_texture, data=training, family=binomial)
predictTest = predict(diagnose, type="response", newdata=testing)
t1 = table(testing$Diagnosis, predictTest >= 0.45)
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
rm(t1)

```

That was the best so far! I will go with mean_concave_points + mean_fractal_dimension + worst_texture. I could try other feature combinations (I did not look at the 5th principal component, but I could choose any).
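Rather than hand-picking combinations, an alternative worth noting is AIC-based stepwise selection with R's `step()`, starting from a model with all four candidate features (a sketch of the approach, not what I did above):

```{r}
# Let step() drop features that do not improve AIC
full <- glm(Diagnosis ~ mean_concave_points + mean_fractal_dimension +
              se_texture + worst_texture,
            data = training, family = binomial)
step(full, direction = "backward", trace = 0)
```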

Now let us try again but this time with the principal components. 
Create a graph to determine how many principal components I should use.

```{r}
fviz_screeplot(pcaCancer, ncp=5)
```

Looks like two or maybe three principal components would be enough.
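That impression from the scree plot can be confirmed numerically with the cumulative proportion of variance explained:

```{r}
# Cumulative proportion of variance explained by the first five components
round(cumsum(pcaCancer$sdev^2 / sum(pcaCancer$sdev^2))[1:5], 3)
```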

Again, I need training and test data
```{r}
set.seed(23)
indexes <- sample(1:nrow(cancer), size=0.75*nrow(cancer))

# Because we are using the principal components from pcaCancer, we need to attach the target variable
Diagnosis <- cancer[indexes,]$Diagnosis
training <- data.frame(Diagnosis, pcaCancer$x[indexes,] )

Diagnosis <- cancer[-indexes,]$Diagnosis
testing <- data.frame(Diagnosis, pcaCancer$x[-indexes,] )
rm(Diagnosis)

```

Now run the logistic regression using the first three principal components
```{r warning=FALSE}
diagnose = glm(Diagnosis ~ PC1 + PC2 + PC3, data = training, family=binomial)
predictTest = predict(diagnose, type="response", newdata=testing)
t1 = table(testing$Diagnosis, predictTest >= 0.45)
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
rm(t1)

```


Maybe we can do better with only two principal components
```{r warning=FALSE}
diagnose = glm(Diagnosis ~ PC1 + PC2, data = training, family=binomial)
predictTest = predict(diagnose, type="response", newdata=testing)
t1 = table(testing$Diagnosis, predictTest >= 0.45)
(t1[1] + t1[4])/(t1[1] + t1[2] + t1[3] + t1[4]) # accuracy
rm(t1)

```

WOW! 94% accuracy - with only two principal components! 

In hindsight, the lack of a clear contributor to the 1st principal component should have convinced us to use the principal components from the start.

Analyzing Twitter about the Philadelphia Flyers Part 3

12 April ’18

Twitter Analysis of Philly Flyers

This post is a continuation of my previous two posts analyzing Twitter activity about the Philadelphia Flyers during most of March 2018. Again, it is presented as an R Notebook.

Here is part 3 of the Philly Flyers R Notebook analysis:

Flyers Twitter Users

Part 3 of the Philadelphia Flyers Twitter Activity

The previous two posts examined the Twitter users that tweet about the Philly Flyers and looked at the tweets themselves. This analysis will be somewhat shorter than the other two in that it will just look at how the users described themselves on their Twitter account.

Specifically I ask:

6. Is there anything that people tweet about the Flyers that can be used for marketing?
7. Is there any characteristic of the Flyers' Twitter users that can be used to target advertising?

As this is an R Notebook, all the code is in R, version 3.4.4 (2018-03-15) to be exact.

In March 2018, I scraped Twitter several times in order to gather all tweets that had the hashtags #Flyers, #FlyersNation, or #LETSGOFLYERS. The dates of the collections were March 11, March 20, and March 26, 2018. See the first of the three posts on this topic for the code I used to scrape Twitter.

First, several R libraries are needed. Note, I tried to use only high quality libraries, such as those developed by the RStudio group.

```{r message=FALSE}
library(rtweet) # for users_data()
library(tidyverse) # Instead of just ggplot2 and dplyr
library(tidytext)  # For Twitter text manipulation
library(wordcloud)
library("RColorBrewer") # Because I want to print with Flyers colors!
data(stop_words)
```

Read the three datasets into memory and combine into one master dataset. Then clean the datasets. For more information on the data preparation, see the previous posts.

```{r}
# Load the RData files that were saved after scraping Twitter
load(file="rtweets20180311.RData")
tw11 <- rstats_tweets
users11 <- users_data(rstats_tweets)
load(file="rtweets20180320.RData")
tw20 <- rstats_tweets
users20 <- users_data(rstats_tweets)
load(file="rtweets20180326.RData")

# Combine the three datasets
tw <- bind_rows(tw11, tw20, rstats_tweets)
users <- users_data(rstats_tweets)
users <- bind_rows(users11, users20, users)
rm(tw11, users11, tw20, users20, rstats_tweets)

# Remove duplicates, due to overlapping dates in the individual datasets.
tw <- unique(tw)
users <- unique(users)

# Remove users that do not (or should not) contribute value to this study.
users <- users[!(users$user_id %in% c("19618527", "471268712", "154699499", "426029765", "19276719", "493658381", "938072969552826368", "321035743")),]

# Only analyze "local" tweeters - location identified as PA, NJ, or DE
select <- grepl("Phil", users$location, ignore.case = TRUE) | grepl("PA", users$location, ignore.case = FALSE) | grepl("NJ", users$location, ignore.case = FALSE) | grepl("DE", users$location, ignore.case = FALSE)
users <- users[select,]
rm(select)

# Verified accounts include professional radio, TV, and news stations (e.g. NBC), and some celebrity names
users <- users[!users$verified,] # Save only nonverified accounts

# Now select only the tweets that belong to these user_ids
tw <- tw[tw$user_id %in% users$user_id,]

# Save only the tweets that are in English (at least for now)
tw <- tw[tw$lang=="en",]
```

Let us look at the Twitter users' self-reported descriptions. To do so, we need to clean up the text: remove references to screen names, hashtags, extra spaces, numbers, punctuation, and URLs. Because some users used non-native characters in their descriptions, we also need to replace non-native characters with native equivalents.

```{r}
# Clean the user descriptions: strip newlines, URLs, retweet prefixes,
# hashtags, mentions, and digits, then transliterate to ASCII.
# Note the character classes are written [A-Za-z0-9]; the original [a-z,A-Z,0-9]
# form also matched literal commas.
clean_tweet <- gsub('\\n', '', users$description) %>% 
  str_replace_all("http\\S+\\s*","") %>%
  str_replace("RT @[A-Za-z0-9]*: ","") %>%
  str_replace_all("#[A-Za-z]*","") %>%
  str_replace_all("@[A-Za-z]*","") %>%
  str_replace_all("[0-9]","") %>%
  str_replace_all(" "," ") %>% stringi::stri_trans_general("latin-ascii")
# stringi::stri_trans_general("latin-ascii") is needed to replace non-native characters because they cause trouble in output
```

6. Is there anything that people tweet about the Flyers that can be used for marketing?

Now that the text has been cleaned, we will look at the individual words used in the descriptions, and then we will consider the descriptions as a whole.

```{r message=FALSE}
tweets <- data_frame(text=clean_tweet) %>% unnest_tokens(word, text)
tweets <- tweets %>% anti_join(stop_words)
tweets %>% count(word, sort = TRUE)
```

The fact that "flyers" is the number 2 most occurring word makes me question whether all of these users are regular fans (as opposed to agents of the Philly Flyers). I tried to eliminate professional Flyers support organizations, but maybe I missed a few.
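If those team-name tokens dominate the counts, they can be filtered out before counting (an optional step; this simply drops the words from the search hashtags themselves):

```{r}
# Drop the search-hashtag words themselves before counting
tweets %>%
  filter(!word %in% c("flyers", "flyersnation", "letsgoflyers")) %>%
  count(word, sort = TRUE)
```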

Graph the words that occur at least 100 times.

```{r}
tweets %>%
  count(word, sort = TRUE) %>%
  filter(n > 100) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col(aes(fill = n)) +
  scale_fill_distiller(palette="Oranges") +
  theme(legend.position = "none") +
  xlab("Popular Words") + ylab("Number of Occurrences") +
  labs(title="Most Popular Words in Twitter Users' Descriptions") +
  coord_flip()
```

It is interesting that many Twitter users mention the Eagles or Sixers as well. Perhaps the Flyers could increase the number of tweets directed at Sixers and Eagles fans in order to increase attendance at Flyers games? A future study could look at Twitter users who tweet about these other Philadelphia sports teams. Maybe there is something to be gained from analyzing their tweets? Maybe there is a commonality among Philadelphia sports fans that could be exploited to promote Flyers games?

Also, I find it interesting that "music" appears quite often. There might be something to explore related to music and Twitter as well.

7. Is there any characteristic of the Flyers' Twitter users that can be used to target advertising?

Let us look at the popular words in the users' descriptions in the form of a word cloud. Due to the way a word cloud presents, we can take in more words more easily than with the bar graph. For this word cloud, I cap the display at 50 words, since any more triggers errors about words that cannot be displayed.

```{r}
wordcloud(tweets$word, max.words = 50, random.order = FALSE)
```

Looking past the words we saw in the previous graphs, we see several references to family, such as “father,” “dad,” and “husband.” Also we see “university,” “college,” “temple,” “penn,” and “psu.” These two areas of family and college students may be additional areas that could be focused on to increase Flyers game attendance.

I think I have analyzed the Twitter data as much as I can, or at least as much as I want to do. I guess it is time for me to find my next project.
