Predicting Food Inspection Results
16 April ’20
I did not get a chance to submit this exploration and model to the MachineHack challenge, but it makes for a good post.
Dr. Clifton Baldwin
Machine Hack Challenge for Food Inspections
March - April 2020
Objective:
Given data from past inspections, build a predictive model that is capable of predicting the outcome of an inspection conducted in a facility based on the given set of features
The Data
Before proceeding, load the required R libraries
library(tidyverse)
Registered S3 method overwritten by 'dplyr':
  method           from
  print.rowwise_df

── Attaching packages ──────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(readxl) # used to read the given Excel file
library(lubridate) # used to handle the Date column
Attaching package: ‘lubridate’
The following object is masked from ‘package:base’:
date
From the provided documentation:

Features:
ID: A unique id for each inspection
Date: The date at which the inspection was done in a particular facility
LicenseNo: De-identified license number for a particular facility
FacilityID: De-identified unique facility id for a facility
FacilityName: The encoded name of a facility
Type: The type of the facility being inspected
Street: The encoded street where the facility is located
City: The encoded city where the facility is located
State: The encoded state where the facility is located
LocationID: An encoded location feature
Reason: The primary reason for the inspection
SectionViolations: Laws violated by the facility
RiskLevel: The level of risk the facility possesses to the consumers
Geo_Loc: De-identified geo location of the facility

Target:
Inspection_Results: The result of the inspection. The inspection results can have any of the following values:
0: ’FACILITY CHANGED’
1: ’FAIL’
2: ’FURTHER INSPECTION REQUIRED’
3: ’INSPECTION OVERRULED’
4: ’PASS’
5: ’PASS(CONDITIONAL)’
6: ’SHUT-DOWN’
Now load the training dataset as a data frame and take a look
df <- read_excel("Data_Train.xlsx", sheet = 1)
glimpse(df)
Observations: 147,443
Variables: 15
$ ID                 <dbl> 3.110349e+13, 1.008900e+13, 4.014897e+13, 3.715771e+13, 4.747805e+13, 2.523429e+13, 1.764017e+13,…
$ Date               <chr> "26-04-2010", "21-06-2009", "01-05-2013", "28-09-2015", "09-12-2015", "07-02-2016", "04-04-2011",…
$ LicenseNo          <dbl> 4744, 2973, 18223, 20825, 2136, 13009, 9621, 32170, 27027, 7131, 26522, 3350, 15323, 34228, 37390…
$ FacilityID         <dbl> 8123, 12268, 1112, 20007, 16867, 7942, 3791, 24054, 24785, 2892, 5574, 12919, 2017, 24693, 26965,…
$ FacilityName       <dbl> 7715, 11664, 969, 19115, 10409, 7547, 3576, 8766, 23652, 2707, 5307, 12263, 1829, 23563, 5722, 11…
$ Type               <chr> "RESTAURANT", "GROCERY STORE", "RESTAURANT", "RESTAURANT", "RESTAURANT", "RESTAURANT", "RESTAURAN…
$ Street             <dbl> 15522, 3057, 14988, 3661, 7876, 12275, 2133, 5006, 4554, 5899, 16017, 4538, 17881, 14317, 4306, 1…
$ City               <chr> "id-11235901", "id-11235901", "id-11235901", "id-11235901", "id-11235901", "id-11235901", "id-112…
$ State              <chr> "id_1890134", "id_1890134", "id_1890134", "id_1890134", "id_1890134", "id_1890134", "id_1890134",…
$ LocationID         <dbl> 81876, 81862, 81883, 81859, 81886, 81877, 81887, 81851, 81892, 81877, 81864, 81867, 81862, 81905,…
$ Reason             <chr> "CANVASS", "COMPLAINT", "CANVASS", "CANVASS RE-INSPECTION", "COMPLAINT", "CANVASS", "CANVASS RE-I…
$ SectionViolations  <dbl> 33, 33, NA, 31, 30, 18, 32, NA, NA, 32, NA, 33, NA, 16, 2, 32, 33, 18, 21, NA, 32, 32, 33, NA, 32…
$ RiskLevel          <chr> "High", "High", "High", "Medium", "High", "High", "High", "Low", "High", "High", "Medium", "High"…
$ Geo_Loc            <chr> "locid16406", "locid878", "locid3368", "locid11839", "locid12264", "locid3935", "locid8876", "loc…
$ Inspection_Results <chr> "4", "4", "6", "4", "4", "1", "4", "1", "4", "4", "6", "4", "3", "1", "5", "4", "4", "1", "5", "1…
Most of the columns are not in the appropriate format for this analysis. Although I might not need to use ID, I know it should be a factor instead of a number.
# Convert ID to type factor
df$ID <- factor(df$ID)
Change Date from character to a date type. The conversion fails with parsing errors, so investigate…
x <- dmy(df$Date)
53 failed to parse.
ids <- df[is.na(x),"ID"]
ids <- ids %>% pull()
df %>% filter(ID %in% ids) %>% select(Date)
There are 53 February 29 dates during non-leap years! Since I cannot ask anyone what these dates should be, I guess I will make them March 1.
df <- df %>%
mutate(Date = case_when(ID %in% ids ~ str_replace(Date, "29-02","01-03"), TRUE ~ Date))
Now Date will convert from character to date
df$Date <- dmy(df$Date)
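As a sanity check, dmy() returns NA for anything that still fails to parse, so counting NAs verifies the repair. A minimal sketch on a toy vector (hypothetical dates, not the real df):

```r
library(lubridate)

# Toy vector: one invalid leap-day date mimicking the problem above
dates <- c("26-04-2010", "29-02-2011", "01-05-2013")
sum(is.na(dmy(dates)))   # one failure before the fix

# Recode the invalid 29-02 to 01-03, as done for df, and re-check
fixed <- sub("^29-02", "01-03", dates)
sum(is.na(dmy(fixed)))   # 0
```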
There are many other categorical variables that are stored as double or character. Convert these remaining appropriate (categorical) features to factors
df$LicenseNo <- factor(df$LicenseNo)
df$FacilityID <- factor(df$FacilityID)
df$FacilityName <- factor(df$FacilityName)
df$Type <- factor(df$Type)
df$City <- factor(df$City)
df$State <- factor(df$State) # State should be unique, but Street may not (same street name in different states?)
df$LocationID <- factor(df$LocationID)
df$Reason <- factor(df$Reason)
df$SectionViolations <- factor(df$SectionViolations)
df$RiskLevel <- factor(df$RiskLevel)
df$Geo_Loc <- factor(df$Geo_Loc)
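As an aside, the repeated factor() calls above can be collapsed into a single step. This sketch uses dplyr's across(), which requires dplyr 1.0 or later (the dplyr 0.8.3 loaded above would use mutate_at() instead), on a toy frame standing in for df:

```r
library(dplyr)

# Toy frame with the same kinds of columns as df (hypothetical values)
toy <- tibble(LicenseNo = c(4744, 2973),
              Type = c("RESTAURANT", "GROCERY STORE"),
              RiskLevel = c("High", "Low"))

# Convert several columns to factors in one call (dplyr 1.0+)
toy <- toy %>% mutate(across(c(LicenseNo, Type, RiskLevel), factor))

# dplyr 0.8.x equivalent: toy %>% mutate_at(vars(LicenseNo, Type, RiskLevel), factor)
sapply(toy, class)
```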
Inspection_Results needs to be a factor also, but I want to make it an Integer for some histograms first.
Analysis
I wonder if inspections take place throughout the year, and if so, whether there is a seasonal effect. A count of observations per month should show if any months are lacking.
sapply(1:12, function(x) { length(which(month(df$Date)==x)) })
[1] 11267 10543 12863 12455 13708 13596 11073 13372 13708 13522 11471 9865
Ok, at least 10 thousand observations per month. Well, December has 9,865, but that is close enough, since it is not unusual that there are fewer inspections around the holidays. Although inspections do not appear seasonal in the dataset, perhaps the time of year has an impact on inspection results. I will create a variable to distinguish cold months from warm months.
df$winter <- ifelse((month(df$Date) %in% c(10,11,12,1,2,3)), 1L, 0L)
df$winter <- factor(df$winter)
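A quick check of the winter mapping on toy dates (hypothetical, not from df): months 10 through 3 should map to 1, months 4 through 9 to 0.

```r
library(lubridate)

# One cold-month, one warm-month, one cold-month date
d <- dmy(c("15-01-2016", "15-07-2016", "15-11-2016"))
ifelse(month(d) %in% c(10, 11, 12, 1, 2, 3), 1L, 0L)  # 1 0 1
```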
How many different Type are there
length(levels(df$Type))
[1] 409
Hmm, 409 different factors. How many are NA?
sum(is.na(df$Type))
[1] 3485
There are 3485 observations without a Type. Investigate what is happening when there is no Type
nas <- df[is.na(df$Type),]
sum(is.na(nas$SectionViolations))
[1] 3401
Almost all (3,401 out of 3,485) of the observations with no Type also have no SectionViolations. Although interesting, do these NA values have any impact on Inspection_Results, the target variable? Let us look at a histogram of all Inspection_Results
hist(as.integer(df$Inspection_Results), main = "Histogram of All Inspection_Results", xlab = "Inspection Results")
Now let us compare the histogram with Type observations of only NA and SectionViolations of NA
hist(as.integer(nas[is.na(nas$SectionViolations),]$Inspection_Results), main = "Inspection_Results with Type NA and SectionViolations NA", xlab = "Inspection Results")
Interesting! Why are most of the missing SectionViolations the ones that are SHUT-DOWN?? Note that sum(which(...)) adds up row indices rather than counting rows, so the counts should be computed with length(which(...)):
length(which(nas[is.na(nas$SectionViolations),]$Inspection_Results != "6"))
length(which(nas[is.na(nas$SectionViolations),]$Inspection_Results == "6"))
The histogram already shows that most of these observations are "SHUT-DOWN" (6). I am thinking I cannot remove these NA observations without introducing bias, but I should look at the data without those observations first.
nas <- df[!is.na(df$Type),] # note: reusing the name nas, now holding the rows with a non-NA Type
hist(as.integer(nas$Inspection_Results), main = "Histogram of Inspection_Results without Type = NA", xlab = "Inspection Results")
What if we just remove the SectionViolations that are NA
hist(as.integer(nas[is.na(nas$SectionViolations),]$Inspection_Results), main = "Inspection_Results without Type = NA and with SectionViolations NA", xlab = "Inspection Results")
It appears we lose many of the "5" results (PASS(CONDITIONAL)). So I am not going to remove the NA values at this time. Of the non-NA Types, how many are unique (i.e. only one observation)
df %>%
filter(!is.na(Type)) %>%
group_by(Type) %>%
summarise(n = n()) %>%
filter(n < 2) %>% nrow()
[1] 71
71 Types appear only once. How many more have fewer than 10 observations
df %>%
filter(!is.na(Type)) %>%
group_by(Type) %>%
summarise(n = n()) %>%
filter(n < 10 & n > 1) %>% nrow()
[1] 245
Another 245 Types, which means 316 Types have fewer than 10 observations. Going to the other extreme, how many have more than 20 observations per Type
df %>%
filter(!is.na(Type)) %>%
group_by(Type) %>%
summarise(n = n()) %>%
filter(n > 20) %>% nrow()
[1] 53
53 Types have more than 20 observations. How many total observations of data is that
df %>%
filter(!is.na(Type)) %>%
group_by(Type) %>%
summarise(n = n()) %>%
filter(n > 20) %>% summarise(sum(n))
# A tibble: 1 x 1
  `sum(n)`
     <int>
1   142350
There are 142,350 observations with more than 20 observations per Type. So do I keep or exclude the NA’s and the rare Types?? For now, I am keeping all the observations. Moving on to City,
levels(df$City)
[1] "id-11235901" "id-11275913"
Only two cities in the dataset. Are they from different states?
levels(df$State)
[1] "id_1890134" "id_1890135"
Also two factors. Although I am sure there is one city per state, I should verify it.
length(which(df$State[which(df$City == "id-11235901")] == 'id_1890135'))
[1] 19
length(which(df$State[which(df$City == "id-11275913")] == 'id_1890135'))
[1] 22
Wait, what?? What about the two cities in the other state
length(which(df$State[which(df$City == "id-11235901")] == 'id_1890134'))
[1] 147177
length(which(df$State[which(df$City == "id-11275913")] == 'id_1890134'))
[1] 225
It appears there are a total of four, not two, combinations of City and State. Running a chi-square test to determine independence
chisq.test(df$City, df$State, correct = FALSE)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: df$City and df$State
X-squared = 7016.5, df = 1, p-value < 2.2e-16
The tiny p-value rejects the null hypothesis of independence, so City and State are not independent after all! How about Reason
levels(df$Reason)
[1] "CANVASS" "CANVASS RE-INSPECTION"
[3] "COMPLAINT" "COMPLAINT RE-INSPECTION"
[5] "COMPLAINT-FIRE" "CONSULTATION"
[7] "LICENSE" "LICENSE RE-INSPECTION"
[9] "LICENSE-TASK FORCE" "OUT OF BUSINESS"
[11] "RECENT INSPECTION" "SHORT FORM COMPLAINT"
[13] "SHORT FORM FIRE-COMPLAINT" "SUSPECTED FOOD POISONING"
[15] "SUSPECTED FOOD POISONING RE-INSPECTION" "TAG REMOVAL"
[17] "TASK FORCE LIQUOR 1475"
17 levels. How many are NA
sum(is.na(df$Reason))
[1] 0
And no NA for Reason. Next, SectionViolations
length(levels(df$SectionViolations))
[1] 61
61 levels of SectionViolations. How many are NA
sum(is.na(df$SectionViolations))
[1] 39068
39,068 are NA (about 26%). Let us look at the histogram with the NA SectionViolations removed
hist(as.integer(df[!is.na(df$SectionViolations),]$Inspection_Results), main = "Inspection_Results without SectionViolations of NA", xlab = "Inspection Results")
We lose the 2, 6, and most of the 3’s. So the presence of a SectionViolations value indicates the result is probably 1 (FAIL), 4 (PASS), or 5 (PASS(CONDITIONAL)). So I guess we cannot exclude them.
Look at the RiskLevel
levels(df$RiskLevel)
[1] "High" "Low" "Medium" "Uncertain"
Note 4 levels with one being “Uncertain”. How many are NA
sum(is.na(df$RiskLevel))
[1] 0
No RiskLevel are NA. How many are “Uncertain” though
length(which(df$RiskLevel=="Uncertain"))
[1] 21
Only 21 are “Uncertain”
Finally, let us look at Geo_Loc
length(levels(df$Geo_Loc))
[1] 16316
No! Over 16k levels! I think I will use City, Reason, and RiskLevel, and try also Type and SectionViolations.
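As an alternative to keeping every level, the rare Types (or even Geo_Loc's 16k levels) could be collapsed into a single catch-all level with forcats. A minimal sketch on hypothetical toy data: fct_lump_min() lumps together every level with fewer than min observations.

```r
library(forcats)

# Toy Type vector: two common levels and two singletons (hypothetical data)
types <- factor(c(rep("RESTAURANT", 25), rep("GROCERY STORE", 22), "BAKERY", "TAVERN"))

# Levels with fewer than 10 observations get lumped into "OTHER"
lumped <- fct_lump_min(types, min = 10, other_level = "OTHER")
table(lumped)
```

This keeps the common levels intact while shrinking the factor to something a model can digest.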
I need to make sure the target variable is clean
df$Inspection_Results <- factor(df$Inspection_Results)
levels(df$Inspection_Results)
[1] "0" "1" "2" "3" "4" "5" "6"
Validation dataset
I need to split the given train dataset into training and validation datasets
# Randomly assign 75% of the rows to training and the remaining 25% to validation
indexes = sample(1:nrow(df), size=0.75*nrow(df))
training = df[indexes,]
testing = df[-indexes,]
rm(indexes)
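For reproducibility, the split itself could also be seeded, and a stratified split keeps the class proportions of the target similar in both pieces. A sketch on hypothetical toy data (the simple unseeded split above is what was actually used for the results below):

```r
set.seed(7)

# Toy target standing in for Inspection_Results: an 80/20 class mix
toy <- data.frame(y = factor(rep(c("PASS", "FAIL"), times = c(80, 20))))

# Sample 75% within each class so both pieces keep the class balance
idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$y),
                     function(i) sample(i, size = floor(0.75 * length(i)))))
train <- toy[idx, , drop = FALSE]
valid <- toy[-idx, , drop = FALSE]
table(train$y)  # FAIL 15, PASS 60
```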
Modeling
Of course there are many options, but an appropriate model needs to be a classification model that accepts categorical features. So I cannot select features using Principal Component Analysis, since that requires numerical data. My choice is a Naive Bayes classification model.
Load the R library that has a Naive Bayes function
library(e1071)
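As a quick illustration of how naiveBayes() handles purely categorical features, here is a toy example (hypothetical data, not the inspection dataset): the model estimates a prior for each class and a conditional probability table per feature, then a prediction just multiplies prior by conditionals for each class.

```r
library(e1071)

# Tiny hypothetical dataset with two categorical features
toy <- data.frame(
  outcome = factor(c("PASS", "PASS", "FAIL", "FAIL")),
  reason  = factor(c("CANVASS", "CANVASS", "COMPLAINT", "COMPLAINT")),
  risk    = factor(c("Low", "Medium", "High", "High"))
)

m <- naiveBayes(outcome ~ reason + risk, data = toy)
m$apriori        # class priors P(outcome)
m$tables$reason  # conditional probabilities P(reason | outcome)

# A COMPLAINT at a High-risk facility should be classified FAIL
predict(m, data.frame(reason = factor("COMPLAINT", levels = levels(toy$reason)),
                      risk   = factor("High", levels = levels(toy$risk))))
```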
I will use set.seed so that each model run starts off “the same.” Therefore the only change should be the features input to the model. For the first Naive Bayes model, I will use City + Reason + RiskLevel + Type + SectionViolations + winter as the features.
set.seed(7)
model = naiveBayes(Inspection_Results ~ City + Reason + RiskLevel + Type + SectionViolations + winter, data=training)
Using the test dataset, predict using the model
Predict <- predict(model,newdata = testing )
Create a confusion matrix to check accuracy
cm <- table(Predict, testing$Inspection_Results )
sum(diag(cm)) / sum(cm) # Accuracy
[1] 0.7669352
rm(model, Predict, cm)
76% accurate. I am not sure whether we prefer higher sensitivity or higher specificity, where sensitivity is the true positive rate (the probability of detection) and specificity is the true negative rate (the proportion of actual negatives correctly identified as such). Furthermore, I am not sure whether we care more about certain results than others. Therefore I will just use accuracy as the measure of the model.
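For reference, per-class sensitivity (recall) can be read straight off a confusion matrix laid out like the one above (rows = predicted, columns = actual): the diagonal entry divided by the column total for that class. A sketch on a hypothetical two-class matrix:

```r
# Toy confusion matrix (hypothetical counts): rows = predicted, columns = actual
cm <- matrix(c(50, 10,
                5, 35),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("1", "4"), Actual = c("1", "4")))

diag(cm) / colSums(cm)  # sensitivity (recall) per class
diag(cm) / rowSums(cm)  # precision per class
```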
Let us try the model without the winter feature
set.seed(7)
model = naiveBayes(Inspection_Results ~ City + Reason + RiskLevel + Type + SectionViolations, data=training)
# Predict using the model and the validation dataset
Predict <- predict(model,newdata = testing )
#Confusion matrix to check accuracy
cm <- table(Predict, testing$Inspection_Results )
sum(diag(cm)) / sum(cm) # Accuracy
[1] 0.7663113
rm(model, Predict, cm)
Almost no difference in accuracy. Assuming a simpler model is a better model, I will omit winter. I am skeptical of SectionViolations and will try the model without it.
set.seed(7)
model = naiveBayes(Inspection_Results ~ City + Reason + RiskLevel + Type, data=training)
# Predict using the model and the validation dataset
Predict <- predict(model,newdata = testing )
#Confusion matrix to check accuracy
cm <- table(Predict, testing$Inspection_Results )
(sum(cm) - sum(diag(cm))) / sum(cm) # Error rate, i.e. 1 - Accuracy
[1] 0.4508559
rm(model, Predict, cm)
Note that the last line computed the error rate, so the accuracy is only about 1 - 0.451 = 0.549. I was wrong!! SectionViolations is important! I will put SectionViolations back in as a feature but try to omit City.
set.seed(7)
model = naiveBayes(Inspection_Results ~ Reason + RiskLevel + Type + SectionViolations, data=training)
# Predict using the model and the validation dataset
Predict <- predict(model,newdata = testing )
#Confusion matrix to check accuracy
cm <- table(Predict, testing$Inspection_Results )
sum(diag(cm)) / sum(cm) # Accuracy
[1] 0.7664198
rm(model, Predict, cm)
It appears City has no impact on the model. Now I will try the model without Type
set.seed(7)
model = naiveBayes(Inspection_Results ~ Reason + RiskLevel + SectionViolations, data=training)
# Predict using the model and the validation dataset
Predict <- predict(model,newdata = testing )
#Confusion matrix to check accuracy
cm <- table(Predict, testing$Inspection_Results )
sum(diag(cm)) / sum(cm) # Accuracy
[1] 0.7748026
rm(model, Predict, cm)
Only marginal, but this model is the best so far. Maybe I need to do something with the NA values of Type. Try the model with Type, but with the NA rows removed from training (note that the validation set still contains them)
set.seed(7)
model = training %>%
filter(!is.na(Type)) %>%
naiveBayes(Inspection_Results ~ Reason + RiskLevel + SectionViolations + Type, data=.)
# Predict using the model and the validation dataset
Predict <- predict(model,newdata = testing )
#Confusion matrix to check accuracy
cm <- table(Predict, testing$Inspection_Results )
sum(diag(cm)) / sum(cm) # Accuracy
[1] 0.7483519
rm(model, Predict, cm)
As I suspected earlier, we lose something when Type of NA is omitted. I guess the NA’s are predictive, but Type is still not helpful. From what I tried, the best Naive Bayes model gets around 77% accuracy: model = naiveBayes(Inspection_Results ~ Reason + RiskLevel + SectionViolations, data=training).
Just an aside, this “challenge” was on MachineHack but had ended before I found it. Of course it does not really matter since my results are not as good as the top models submitted to the challenge.