Modeling Covid-19 in the US
3 April ’20
Using public data obtained from Johns Hopkins, I modeled the spread of COVID-19 through April 3, 2020. As usual, the analysis is presented as an R Notebook.
Exponential Model of COVID-19 in the US
Dr. Clifton Baldwin
Confirmed Cases in the US
While staying at home to help mitigate the spread of this virus, I decided to model the spread of COVID-19, mostly for personal therapeutic reasons. Modeling the data makes me feel like I am doing something other than anxiously watching it. No one should trust my predictions! My model is not very sophisticated but appears to have a good fit. Of course, the steps being taken to mitigate this virus, as well as the potential for new therapies, could affect the predictions.
COVID-19 Data
The data was downloaded from https://github.com/CSSEGISandData/COVID-19 on April 2, 2020. Johns Hopkins University updates the data on this GitHub site daily.
Once the data is downloaded locally, I use readr from the tidyverse to load the data into R.
covid19 <- read_csv("time_series_covid19_confirmed_global.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.
Since each country is handling the pandemic a bit differently, I single out the US.
We have heard that China may be providing false statistics, and it has been reported that Italy does not have enough tests and can only test the very sick. Logically, a country that tests only the very sick will report a lower Confirmed Cases count and a higher percentage of mortality. Although the US Confirmed Case count arguably has problems of its own, I am interested primarily in the US situation. Of course I care about the rest of the world, but the rest of the world does not directly affect my day-to-day activities. Therefore this analysis will use only the US Confirmed Cases and Fatalities.
Assumptions
The main assumption of this prediction is that the trend will continue on its current trajectory. If social distancing, stay-at-home orders, or any other policies are effective, this prediction will miss the mark. Of course, any new treatments and therapies could also alter the prediction.
A limiting factor is that only people with positive coronavirus tests are counted as Confirmed Cases, yet many people with milder cases of the virus are never tested. Once a person is sick, it may be a day or more before that person gets tested, and then it takes a few days before results are known. Therefore the Confirmed Cases count may lag by almost a week.
Clean the Data
Using the piping feature of the tidyverse, the US data can be extracted and cleaned in one pipeline: the US row is extracted, the dataset is reshaped from wide columns into long rows, the appropriate rows are kept, and the date and confirmed-case columns are converted to the correct types.
covid19 <- subset(covid19, `Country/Region` == 'US') %>% gather() %>% slice(5:n()) %>%
mutate(key=as.Date(key, format = "%m/%d/%y"), value=as.integer(value))
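To make the reshape above concrete, here is a minimal base-R sketch of the same wide-to-long transformation on a toy one-row table shaped like the Johns Hopkins file (the coordinates and counts are made up for illustration; dropping the first four metadata columns mirrors the slice(5:n()) step):

```r
# One wide row, like the US row in the Johns Hopkins file
wide <- data.frame(`Province/State` = NA, `Country/Region` = "US",
                   Lat = 0, Long = 0,
                   `1/22/20` = 1L, `1/23/20` = 1L, `1/24/20` = 2L,
                   check.names = FALSE)

# Pivot the date columns into key/value rows, skipping the 4 metadata columns
long <- data.frame(key   = names(wide)[-(1:4)],
                   value = as.integer(unlist(wide[1, -(1:4)])))
long$key <- as.Date(long$key, format = "%m/%d/%y")
long  # one row per date, with a proper Date column
```

The tidyverse pipeline in the post produces the same two-column (date, count) structure for the real data.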
With the date in the appropriate format, a graph can be produced of the confirmed cases over time.
covid19 %>%
ggplot(aes(x=key, y= value)) + geom_line() + ggtitle("US Confirmed Cases") +
xlab('Dates') + ylab('Count')
The graph shows an exponential curve. The first confirmed case is recorded on January 22, and the case count did not increase by much for several weeks. It makes sense that the cases went from 1 to 2 and then to 5 over the initial days. In any case, I want to fit the data to an exponential curve in order to forecast forward.
I model the data without the use of R machine learning libraries. First, compute lambda for confirmed cases by taking the log (base 2) of the latest value and subtracting the log of the first value. Since the first value is 1 and log(1) = 0, only the log of the latest value is needed. The result is divided by the number of days, which here equals the number of observations. This calculation is implemented in R.
lambda = covid19 %>% select(value) %>% tail(n=1) %>% pull() %>% as.integer() %>% log2()
x2 = covid19 %>% nrow()
Lambda = lambda / x2
rm(lambda, x2)
The computed exponent is 0.249, which indicates the cases double every 4 days or so.
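Because the model uses base 2, the doubling time is simply 1 / Lambda. A quick check, plugging in the exponent computed above:

```r
Lambda <- 0.249                # exponent computed above
doubling_days <- 1 / Lambda    # base-2 model doubles every 1/Lambda days
round(doubling_days, 1)        # about 4 days

# Sanity check: advancing x by 1/Lambda doubles 2^(Lambda * x)
x <- 10
stopifnot(all.equal(2^(Lambda * (x + doubling_days)),
                    2 * 2^(Lambda * x)))
```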
With the exponent, lambda, I can forecast the confirmed cases for the next two weeks using the equation 2 ^ (lambda * x), where x is the day number starting from day 1. I use a base of 2 so that it is easy to determine how frequently the confirmed cases double.
dates = c(covid19$key[1:14], covid19$key + 14L) # first 14 dates plus every date shifted by 14 days: the original range extended two weeks
x <- 1:length(dates)
predictions <- data.frame(x=x, key=dates, value=2 ^ (Lambda * x))
Graph the confirmed cases and the predicted cases together.
covid19 %>% mutate(Type = 'Actual_US') %>% bind_rows(predictions %>% mutate(Type = 'Prediction')) %>%
ggplot(aes(x = key, y = value, color = Type)) +
geom_line() + ggtitle("US Actual vs Predicted Cases") + xlab('Dates') + ylab('Count')
As one can see, the prediction follows the actual counts very closely up to today. I guess we will see whether the predictions hold in two weeks.
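"Follows very closely" can be quantified. A hedged sketch of one option, a mean absolute percentage error matched by date (the helper function and toy data below are my own illustration, not part of the original analysis):

```r
# Quantify the fit: mean absolute percentage error between actual and
# predicted counts, matched by date (zero-count days are skipped).
mape <- function(actual, pred) {
  joined <- merge(actual, pred, by = "key", suffixes = c("_a", "_p"))
  ok <- joined$value_a > 0
  mean(abs(joined$value_p[ok] - joined$value_a[ok]) / joined$value_a[ok]) * 100
}

# Toy check: one day off by 10%, one day exact
a <- data.frame(key = as.Date("2020-03-01") + 0:1, value = c(100, 120))
p <- data.frame(key = as.Date("2020-03-01") + 0:1, value = c(110, 120))
round(mape(a, p), 1)  # 5
```

Assuming the covid19 and predictions objects as constructed above (both have key and value columns), mape(covid19, predictions) would score the exponential curve against the actual counts.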
Fatalities in the US
Let us do the exact same thing with the fatalities data in the US. Since the code is exactly the same, I will do it in one block.
rm(covid19, predictions)
covid19 <- read_csv("time_series_covid19_deaths_global.csv") # Fatality data
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.
covid19 <- subset(covid19, `Country/Region` == 'US') %>% gather() %>% slice(5:n()) %>%
mutate(key=as.Date(key, format = "%m/%d/%y"), value=as.integer(value))
# Code the exponential model
lambda = covid19 %>% select(value) %>% tail(n=1) %>% pull() %>% as.integer() %>% log2()
x2 = covid19 %>% nrow()
mortality = lambda / x2
dates = c(covid19$key[1:14], covid19$key + 14L)
x <- 1:length(dates)
predictions <- data.frame(x=x, key=dates, value=2 ^ (mortality * x))
rm(x, x2, dates, lambda)
The computed exponent for fatalities is 0.174, which indicates fatalities double every 5 or 6 days. At least at this time, fatalities are not growing as fast as confirmed cases, which is curious. Perhaps there is a lag, although I hope not. In any case, the predictions in this post are just the result of my assumptions and calculations, and I hope they are wrong.
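The two exponents computed above translate directly into the doubling times quoted in the text:

```r
# Doubling time is 1 / lambda for a base-2 exponential model
cases_lambda    <- 0.249   # exponent for confirmed cases (computed earlier)
fatality_lambda <- 0.174   # exponent for fatalities (computed above)
round(1 / cases_lambda, 1)     # 4.0 days per doubling
round(1 / fatality_lambda, 1)  # 5.7 days per doubling
```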
The graph of actual versus forecasted fatalities.
covid19 %>% mutate(Type = 'Actual_US') %>% bind_rows(predictions %>% mutate(Type = 'Prediction')) %>%
ggplot(aes(x = key, y = value, color = Type)) +
geom_line() + ggtitle("US Actual vs Predicted Fatalities") + xlab('Dates') + ylab('Count')
The prediction of fatalities is not as good a fit as the one for confirmed cases. Hopefully the prediction is incorrect because the mortality rate drops soon, but another possibility is the stage of the pandemic. When I ran this code a few weeks ago, the confirmed cases graph looked similarly rough; now that the disease has spread, the model fits that data better. Although I hope my forecast is wrong, it is possible the fatalities data will fit better in a few weeks.