Clifton_Baldwin

Data Science Journal of Clif Baldwin

Analyzing Tweets about the Philadelphia Flyers Part 2

8 April ’18

Twitter Analysis of Philly Flyers

This post continues my previous post on Twitter analysis of the Philadelphia Flyers during most of March 2018. Again, it is presented from an R Notebook.

Here is part 2 for the Philly Flyers R Notebook Analysis:

Flyers Tweets - The Tweets

Part 2 of My Analysis of the Philadelphia Flyers Twitter Activity

The previous post examined the Twitter users that tweet about the Philly Flyers. In this post, we will look at the tweets themselves.

Through my position at Stockton University, I have heard that the Philadelphia Flyers are looking for ways to increase ticket sales for games. I determined several data research questions related to how word can be spread for the Philadelphia Flyers.

I approached the study in four steps. First, I asked five questions that I want to answer (or attempt to answer); the first three were addressed in the previous post. Second, I determined what Twitter data to analyze in order to address those questions: tweets with the hashtags #Flyers, #FlyersNation, and #LETSGOFLYERS. Of course I could have added more hashtags, such as #PhillyFlyers, but the chosen hashtags yielded sufficient data for an initial analysis. Third, I determined what I wanted to measure in that data and divided my questions into two groups. The first group is concerned mostly with the Twitter users and was the main subject of the previous post; this post focuses on the text of the tweets. Lastly, I will present the results of the analysis in an attempt to learn something about using Twitter to attract fans who may then attend the games.

For this post, I asked the following questions:

  4. How do events (wins vs losses, opponents) impact the number of tweets?
  5. What are the common characteristics of highly retweeted tweets?

I believe answering these questions will provide insight into when and what to tweet about the Flyers. If I were looking for ways to promote the Flyers, I might try to use what I learn to improve the use of social media.

As this is an R Notebook, all the code is in R, version 3.4.4 (2018-03-15) to be exact.

In March 2018, I scraped Twitter several times in order to gather all tweets that had the hashtags #Flyers, #FlyersNation, or #LETSGOFLYERS. The dates of the collections were March 11, March 20, and March 26, 2018. See the previous post for the code I used to scrape Twitter.

First, several R libraries are needed. Note that I tried to use only high-quality libraries, such as those developed by the RStudio group.

library(rtweet) # for users_data()
library(tidyverse) # Instead of just ggplot2 and dplyr
library(tidytext)  # For Twitter text manipulation
library(lubridate)  # for date manipulation
library(reshape2)  # for melt() and dcast(); note that mutate() comes from dplyr
library(scales) # for date_breaks() in the ggplot - scale_x_date()
library(stringr) # for string manipulations
library("RColorBrewer") # Because I want to print with Flyers colors!

Read the three datasets into memory and combine into one master dataset. Then clean the datasets. For more information on the data preparation, see the previous post.

# Load the RData files that were saved after scraping Twitter
load(file="rtweets20180311.RData")
tw11 <- rstats_tweets
users11 <- users_data(rstats_tweets)
load(file="rtweets20180320.RData")
tw20 <- rstats_tweets
users20 <- users_data(rstats_tweets)
load(file="rtweets20180326.RData")
# Combine the three datasets
tw <- bind_rows(tw11, tw20, rstats_tweets)
users <- users_data(rstats_tweets)
users <- bind_rows(users11, users20, users)
rm(tw11, users11, tw20, users20, rstats_tweets)
# Remove duplicates, due to overlapping dates in the individual datasets.
tw <- unique(tw)
users <- unique(users)
### Clean up the data
# Remove users that do not (or should not) contribute value to this study.
users <- users[!(users$user_id %in% c("19618527", "471268712", "154699499", "426029765", "19276719", "493658381", "938072969552826368", "321035743")),]
# Only analyze "local" tweeters - location identified as PA, NJ, or DE
select <- grepl("Phil", users$location, ignore.case = TRUE) | grepl("PA", users$location, ignore.case = FALSE) | grepl("NJ", users$location, ignore.case = FALSE) | grepl("DE", users$location, ignore.case = FALSE)
users <- users[select,]
rm(select)
# Verified accounts include professional radio, TV, and news stations (e.g. NBC), and some individuals (a spot check identifies them as broadcasters and reporters)
users <- users[!users$verified,] # Save only nonverified accounts
# Now select only the tweets that belong to these user_ids
tw <- tw[tw$user_id %in% users$user_id,]
# Save only the tweets that are in English (at least for now)
tw <- tw[tw$lang=="en",]
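
The location filter above matches substrings, so a two-letter state code can also match inside a longer word. The following is a minimal sketch of that effect on made-up location strings (they are not from the scraped data), together with a stricter word-boundary variant that one could consider. The stricter pattern is my assumption about a possible refinement, not what the notebook ran.

```r
# Hypothetical location strings, not taken from the scraped data
locations <- c("Philadelphia, PA", "Cherry Hill NJ", "Wilmington, DE",
               "DENVER", "PARIS", "south philly")

# Substring match, as in the filter above: "DE" also matches "DENVER"
loose <- grepl("Phil", locations, ignore.case = TRUE) |
  grepl("PA|NJ|DE", locations)

# Stricter variant: the state code must stand alone as a word
strict <- grepl("Phil", locations, ignore.case = TRUE) |
  grepl("\\b(PA|NJ|DE)\\b", locations, perl = TRUE)

data.frame(location = locations, loose = loose, strict = strict)
```

Whether the stricter match is worth it depends on how many out-of-area accounts slip through; a quick look at the distinct values of users$location would tell.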

Prepare a working dataset that groups the tweets by the hour, that is, a count of how many tweets were posted in each hour of the time period.

twperhr <- tw %>%
  group_by(Group.1=format(tw$created_at, "%Y-%m-%d %H")) %>%
  summarise(x=n()) 
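
One caveat with grouping this way is that an hour with no tweets produces no row at all, so row positions in twperhr equal elapsed hours only if every hour has at least one tweet. The sketch below, on made-up timestamps, shows one way to build a table with a row for every hour. The names hour_key, all_hours, and twperhr_full are hypothetical and used only for illustration.

```r
# Hypothetical timestamps: three tweets in hour 19, none in hour 20, one in hour 21
created_at <- as.POSIXct(c("2018-03-07 19:05:00", "2018-03-07 19:40:00",
                           "2018-03-07 19:59:00", "2018-03-07 21:10:00"),
                         tz = "UTC")

# Same hourly key as in the notebook
hour_key <- format(created_at, "%Y-%m-%d %H")

# Every hour between the first and last tweet, including empty ones
start <- as.POSIXct(trunc(min(created_at), "hours"))
end   <- as.POSIXct(trunc(max(created_at), "hours"))
all_hours <- format(seq(start, end, by = "hour"), "%Y-%m-%d %H")

# Counting against a factor with all hours as levels keeps the zero rows
twperhr_full <- data.frame(
  Group.1 = all_hours,
  x = as.vector(table(factor(hour_key, levels = all_hours)))
)
twperhr_full
```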

The data extends over the time period from 2018-03-04 20:56:42 to 2018-03-26 13:00:21. During that time, the Flyers played ten games, but we may also want to consider the game just before the time period as well as what was scheduled at the end of it.

  • Flyers lost to the Panthers (1-4) on March 4 3:00pm
  • Flyers lost to the Penguins (2-5) on March 7 8:00pm
  • Flyers lost to the Bruins (2-3) on March 8 7:00pm
  • Flyers beat the Jets (2-1) on March 10 1:00pm
  • Flyers lost to the Golden Knights (2-3) on March 12 7:00pm
  • Flyers lost to the Blue Jackets (3-5) on March 15 7:00pm
  • Flyers beat the Hurricanes (4-2) on March 17 7:00pm
  • Flyers beat the Capitals (6-3) on March 18 5:00pm
  • Flyers lost to the Red Wings (4-5) on March 20 7:30pm
  • Flyers beat the Rangers (4-3) on March 22 7:00pm
  • Flyers lost to the Penguins (4-5) on March 25 12:30pm
  • Nothing scheduled for March 26
  • Next scheduled game against Dallas on March 27 8:30pm

Load this information into data vectors.

schedule = c("2018-03-07 20", "2018-03-08 19", "2018-03-10 13", "2018-03-12 19", "2018-03-15 19", "2018-03-17 19", "2018-03-18 17", "2018-03-20 19", "2018-03-22 19", "2018-03-25 12")
result <- c("Loss", "Loss", "Win", "Loss", "Loss", "Win", "Win", "Loss", "Win", "Loss")
opponent <- c(" \nPenguins", "Bruins", "Jets", "Golden\nKnights", "Blue\nJackets", "Hurricanes", " \nCapitals", "Red\nWings", "Rangers", "Penguins")
d1 <- format(as.Date(min(tw$created_at),format="%Y-%m-%d"), "%m-%d")
d2 <- format(as.Date(max(tw$created_at),format="%Y-%m-%d"), "%m-%d")

Printing the data as a table

date1 <- data.frame(Date = substr(schedule, 1, 10), Opponent = trimws(sub("\n", " ", opponent)), Result = result)
date1

Using the information gathered, we can create a graph of the time period.

4. How do events (wins vs losses, opponents) impact the number of tweets?

# Determine the vector locations for game times
matches <- grep(paste(schedule,collapse="|"), twperhr$Group.1)
# Determine the average offset, in hours, from a game's official start time to the peak in tweets.
# This assumes one row per hour; which.max() returns a single index even if two hours tie.
avPeak <- paste("Average time from game start when tweets peak,", 
  mean(sapply(matches, function(i) 
    which.max(twperhr$x[(i-12):(i+12)]) - 13)), 
  "hours", sep=" ")
  
# Create a graph of tweets over time and indicate when games occurred
ggplot(data=twperhr, aes(x=seq_len(nrow(twperhr)), y=x)) + geom_line() +
    scale_x_continuous(breaks = grep(" 00", twperhr$Group.1), 
      labels = substr(twperhr$Group.1[grepl(" 00", twperhr$Group.1)], 6, 10) ) +
    theme(axis.text.x = element_text(color="darkorange", angle=45), 
          panel.background = element_rect(fill = "white", colour = "orange"),
          panel.grid.minor = element_blank()) +
    annotate("text", x = matches, y = twperhr[matches+6,]$x, 
             label = paste(opponent, "\n(", substr(result,1,1),")",sep=""), 
             colour = gsub("Loss", "red", gsub("Win", "darkblue", result))) +
    labs(title="Tweets Mentioning the Flyers", subtitle=paste(d1,"to",d2,sep=" "), 
         caption= avPeak, x = "Date", y = "Tweets")

# Clean up variables that are no longer needed
rm(matches, avPeak) 
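
The "- 13" in the peak calculation comes from the window size: the window runs from 12 rows before the game to 12 rows after it, so position 13 in the window is the game's start hour. Here is a minimal sketch of the same arithmetic on made-up hourly counts with a smaller window. The variable names are hypothetical.

```r
# Hypothetical hourly counts, one row per hour, with the game starting in row 4
x <- c(2, 3, 5, 8, 20, 45, 30, 10, 4)
game_row <- 4
half_window <- 3   # the notebook uses 12

# Rows from half_window before the game to half_window after it
window <- x[(game_row - half_window):(game_row + half_window)]

# Position half_window + 1 in the window is the game hour itself,
# so subtracting it turns the peak position into an offset in hours
offset <- which.max(window) - (half_window + 1)
offset
```

A positive offset means the peak falls after the official start time, which is the pattern discussed below.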

Over the specified time, it does not appear that a win or a loss regularly impacts the number of tweets. The fact that a game is played has a huge impact, but the outcome does not. And since the number of tweets peaks after the conclusion of the game, the peaks are not the result of anticipation.

While the previous graph was by the hour, perhaps there is something to be gained by looking at the number of tweets per day.

ggplot(data = tw, aes(x = day(created_at))) +
  geom_bar(aes(fill = ..count..)) +
  theme(legend.position = "none") +
#  xlab("March") + ylab("Number of tweets") + 
  labs(title="Tweets Mentioning the Flyers", subtitle=paste(d1,"to",d2,sep=" "), 
         x = "March", y = "Number of Tweets") +
  scale_fill_gradient(low = "orange", high = "orangered2")

There may be something else going on that impacts the number of tweets. Perhaps looking at the data by day of the week, as well as by when games are played, would reveal a pattern.
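
As a starting point for the day-of-week idea, here is a minimal sketch on made-up timestamps (not the scraped data). It uses base R so it does not depend on the locale; in the notebook itself, lubridate's wday(created_at, label = TRUE) would be a natural alternative.

```r
# Hypothetical timestamps, not taken from the scraped data
created_at <- as.POSIXct(c("2018-03-10 13:30:00", "2018-03-10 15:00:00",
                           "2018-03-12 19:45:00", "2018-03-17 20:10:00"),
                         tz = "UTC")

# as.POSIXlt()$wday is locale-independent: 0 = Sunday ... 6 = Saturday
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
weekday <- factor(day_names[as.POSIXlt(created_at)$wday + 1], levels = day_names)

# Number of tweets per day of the week, including days with none
per_weekday <- table(weekday)
per_weekday
```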

5. What are the common characteristics of retweeted tweets?

Let us look at the tweets themselves. To do so, we need to clean them up. By that I mean removing references to screen names, hashtags, extra spaces, numbers, punctuation, and URLs.

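clean_tweet <- gsub('\\n', ' ', tw$text) %>%   # line breaks become a space so words do not run together
  str_replace_all("http\\S+\\s*","") %>%       # URLs
  str_replace("RT @\\w+: ","") %>%             # retweet prefix; \\w covers letters, digits, and underscore
  str_replace_all("#\\w+","") %>%              # hashtags
  str_replace_all("@\\w+","") %>%              # screen names
  str_replace_all("[0-9]","") %>%              # numbers
  str_replace_all("\\s+"," ")                  # collapse runs of whitespace into one space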
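
To see what the cleaning steps do, they can be tried on a single string. This is a minimal sketch with a made-up tweet, using base R sub() and gsub() in place of the stringr calls. The use of \\w for handles and hashtags, so that digits and underscores are removed along with the name, is my assumption about the intended behavior.

```r
# Hypothetical tweet text, not taken from the scraped data
tweet <- "RT @Fan_123: Great win tonight!\n#Flyers beat the Jets 2-1 https://t.co/abc123 @NHLFlyers"

clean <- gsub("\n", " ", tweet, fixed = TRUE)            # line breaks -> space
clean <- gsub("http\\S+\\s*", "", clean, perl = TRUE)    # URLs
clean <- sub("RT @\\w+: ", "", clean, perl = TRUE)       # retweet prefix
clean <- gsub("#\\w+", "", clean, perl = TRUE)           # hashtags
clean <- gsub("@\\w+", "", clean, perl = TRUE)           # screen names
clean <- gsub("[0-9]", "", clean)                        # digits
clean <- trimws(gsub("\\s+", " ", clean, perl = TRUE))   # collapse whitespace
clean

# A character class without digits or underscore leaves part of the handle behind
leftover <- gsub("@[a-zA-Z]*", "", "@Fan_123")
leftover
```

The stray hyphen that remains from the score is harmless here, because tokenizing into words drops punctuation.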

First we will look at the words used in tweets, and then we will consider the tweets as a whole.

tweets <- tibble(text=clean_tweet) %>% unnest_tokens(word, text)
data(stop_words)
tweets <- tweets %>% anti_join(stop_words, by = "word")
tweets %>% count(word, sort = TRUE) 

Graph the words that occur more than 200 times.

tweets %>%
  count(word, sort = TRUE) %>%
  filter(n > 200) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col(aes(fill = n)) +
  scale_fill_distiller(palette="Oranges") +
  theme(legend.position = "none") +
  xlab("Popular Words") + ylab("Number of Occurrences") +  
  labs(title="Most Popular Words of Tweets") +
  coord_flip()

Using a sentiment score, analyze each tweet as a whole. To accomplish this, I determine the sentiment of each word in the tweet, with positive words getting a positive score and negative words getting a negative score, and then sum the scores for each tweet. Tweets of the form “that is not good” will get a sentiment score of 0, since “not” is negative and “good” is positive, but at least that should not sway the analysis. If “not good” received a positive score because of the “good”, that would be bad. The sentiment scores are then totaled by hour so we can see how the games impact the sentiments of the tweets.
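
The scoring idea can be illustrated with a toy example. The sketch below uses a small hand-made lexicon, which is purely hypothetical and much smaller than the NRC lexicon used in the notebook, and a hypothetical helper score_tweet(). It treats “not” as a negative word to mirror the description above.

```r
# Hypothetical mini-lexicon for illustration only; the notebook uses the NRC lexicon
lexicon <- c(good = 1, win = 1, great = 1, not = -1, loss = -1, bad = -1)

# Sum the scores of the words in one tweet
score_tweet <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # Words missing from the lexicon give NA and are dropped from the sum
  sum(lexicon[words], na.rm = TRUE)
}

scores <- vapply(c("that is not good", "great win", "bad loss tonight"),
                 score_tweet, numeric(1))
scores
```

With this lexicon, “that is not good” nets out to zero rather than counting as positive, which is the behavior the paragraph above relies on.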

#Determine the sentiments of the tweets
sentiment <- tibble(index = 1:nrow(tw),
                    created = tw$created_at,
                    text = clean_tweet) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(index, created) %>%
  count(index, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  group_by(Time = round_date(created, unit="hours")) %>%
  mutate(score = positive - negative) %>%
  summarise(score = sum(score))
# Label each hour by the sign of its total score (a score of exactly 0 is labeled "Negative")
sentiment$sentiment <- factor(ifelse(sentiment$score > 0, "Positive", "Negative"), levels=c("Positive", "Negative"))

Now we can graph the sentiment scores.

date2 <- format(sentiment$Time, "%Y-%m-%d %H")
for (i in seq_len(10)) { date2 <- sub(schedule[i], opponent[i], date2, ignore.case = TRUE) }
date2[grep("2018-03-", date2, ignore.case = TRUE)] <- ""
ggplot(sentiment, aes(x=as.Date(Time), y=score)) +
  geom_line(size = 1.5, alpha = 0.7, aes(colour = sentiment)) +
  labs(title="Sentiments of Tweets", subtitle=paste(d1,"to",d2,sep=" "), 
       x = "March", y = "Sentiment Score") +
  geom_text(y=rep(c(50, 75, 100), len = nrow(sentiment)), label=date2) +
  theme(legend.position="none")

rm(date2,i)  

The graph may be a little misleading since it appears the Bruins game causes a large spike in tweets. When I manually inspect the data, it appears that the large spike occurs after the conclusion of the March 7 Penguins game and just prior to the start of the March 8 Bruins game.

Moving on, let us now look at the characteristics of retweets. Since retweets, by definition, have multiple occurrences, we want just one tweet to represent each retweet set.

clean_tweet <- tw[tw$status_id %in% unique(tw$retweet_status_id),]$text
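
The selection above keeps a tweet only when its own status_id appears as some other row's retweet_status_id. A minimal sketch on a made-up table (the rows are hypothetical) shows the effect, including the case where the original tweet was never captured.

```r
# Hypothetical tweets: rows 2 and 3 retweet row 1; row 5 retweets a tweet not in the data
tw_demo <- data.frame(
  status_id         = c("1", "2", "3", "4", "5"),
  retweet_status_id = c(NA, "1", "1", NA, "99"),
  text              = c("Great goal!", "RT: Great goal!", "RT: Great goal!",
                        "Tough loss", "RT: Let's go"),
  stringsAsFactors  = FALSE
)

# Keep one representative per retweet set: the original tweet itself
originals <- tw_demo[tw_demo$status_id %in% unique(tw_demo$retweet_status_id), ]
originals
```

Only the original of a retweet set survives, and a set whose original falls outside the collected data (status "99" here) is not represented at all.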

Clean the text again.

# Remove retweet prefixes, references to screen names, hashtags, extra spaces, numbers, and URLs.
clean_tweet <- gsub('\\n', ' ', clean_tweet) %>%   # line breaks become a space
  str_replace_all("http\\S+\\s*","") %>%           # URLs
  str_replace("RT @\\w+: ","") %>%                 # retweet prefix
  str_replace_all("#\\w+","") %>%                  # hashtags
  str_replace_all("@\\w+","") %>%                  # screen names
  str_replace_all("[0-9]","") %>%                  # numbers
  str_replace_all("\\s+"," ")                      # collapse runs of whitespace into one space

Determine the sentiment of each tweet and graph. The top line graphs the positive sentiment, with higher numbers indicating a higher positive sentiment. The lower line graphs the negative sentiment tweets, with lower numbers indicating an increase in negative sentiments.

sentiment <- tibble(index = 1:length(clean_tweet),
  created = tw[tw$status_id %in% unique(tw$retweet_status_id),]$created_at,
              text = clean_tweet) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(index, created) %>%
  count(index, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  group_by(Time = round_date(created, unit="hours")) %>%
  mutate(score = positive - negative) %>%
  summarise(score = sum(score))
  
# Label each hour by the sign of its total score (a score of exactly 0 is labeled "Negative")
sentiment$sentiment <- factor(ifelse(sentiment$score > 0, "Positive", "Negative"), levels=c("Positive", "Negative"))
ggplot(sentiment, aes(x=as.Date(Time), y=score)) +
  geom_line(size = 1.5, alpha = 0.7, aes(colour = sentiment)) +
  labs(title="Sentiments of Tweets", subtitle=paste(d1,"to",d2,sep=" "), 
       x = "March", y = "Sentiment Score") +
  theme(legend.position="none")

date1

Looking at the graph, it appears the loss to the Golden Knights on March 12 was followed by a large spike in positive sentiment with a reduced trend of negative sentiment. Did the loss of the one game cause a high level of encouragement for the upcoming Blue Jackets game? However, after losing to the Blue Jackets, there is a jump in negative sentiment.

I admit that these results do little to address the 5th question, and a more thorough analysis is needed. This further analysis would require more of the same, just in more detail. Maybe looking at what other events occurred during the time? Definitely expanding the list of words, since the top results are somewhat expected (e.g. game, flyers, win, goal). Another interesting question is whether the sentiment leads to more or fewer retweets. That would be useful information if we wanted to spread the word better.
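
One possible first step toward the sentiment-versus-retweets question would be a rank correlation between each tweet's sentiment score and its retweet count. The sketch below uses made-up numbers, not results from this study. The column names score and retweet_count are assumptions about how such a table might be laid out.

```r
# Hypothetical per-tweet values, not results from the study
demo <- data.frame(
  score         = c(-2, -1, 0, 1, 2, 3),
  retweet_count = c(1, 0, 2, 4, 3, 9)
)

# Spearman rank correlation: robust to a few very heavily retweeted tweets
rho <- cor(demo$score, demo$retweet_count, method = "spearman")
rho
```

A correlation like this would only show association; it would not by itself show that sentiment causes retweets.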

There is much more I could analyze from this data. For one thing, I should expand on my conclusions from analyzing the data. Since this study was “for fun,” I may or may not return to write a more thorough conclusion, but the reader is free to look at the analysis and form their own conclusions. I guess we will see if I continue with this study or get distracted by another “for fun” project in the next post.

Analyzing Tweets about the Philadelphia Flyers

7 April ’18

Twitter Analysis of Philly Flyers

I decided to analyze the tweets that mention the Philadelphia Flyers during most of March 2018. At the moment, I have two R Notebooks. Well, actually one completed, which I am posting here, and one R Notebook I am working on, which I will post later.

Here is part 1 for the Philly Flyers R Notebook Analysis:

Flyers Tweets - Tweeters

Through my position at Stockton University, I have heard that the Philadelphia Flyers are looking for ways to increase ticket sales for games. I determined several data research questions related to how word can be spread for the Philadelphia Flyers. As a start, I asked the following questions:

  1. Who tweeted the most texts over the data collection time period?
  2. Of those who tweeted, who had the most followers?
  3. Of those who tweeted, who had the most retweeted texts?

I believe answering these questions will identify the Twitter users who currently promote the Flyers and which of them have the potential to influence other Twitter users. If I were looking for ways to promote the Flyers, I might try to persuade these users to make further use of their social media contacts.

In a follow-on study, I plan to ask questions related to the textual content of the tweets.

In March 2018, I scraped Twitter several times in order to gather all tweets that had the hashtags #Flyers, #FlyersNation, or #LETSGOFLYERS. I chose these hashtags based on anecdotal evidence only. Future studies may want to consider additional hashtags if there is any doubt that these three are representative. Twitter limits the number of tweets that can be obtained at any one time, which is part of the reason I scraped the data over three time periods. Although there are workarounds, these three hashtags returned just shy of the limit of available tweets (15,000 tweets). Nonetheless, it appears the data is sufficient for this initial study.

The dates of the collections were March 11, March 20, and March 26, 2018. The following is the code I used each time.

# load twitter library - the rtweet library is recommended now over twitteR
library(rtweet)

# Enter your Twitter credentials
appname <- "Flyers_Data"  # name I assigned my app in Twitter
key <- "XXXXXXXXX"  # I am not sharing my actual credentials
secret <- "XXXXXXXXXXXXXXXXXXX" # Anyone wishing to duplicate this code can sign up for their own Twitter app - it is free

# create token named "twitter_token"
twitter_token <- create_token(
  app = appname,
  consumer_key = key,
  consumer_secret = secret)
saveRDS(twitter_token, "~/.rtweet-oauth.rds")

# Scrape tweets
rstats_tweets <- search_tweets2(q = "#Flyers OR #FlyersNation OR #LETSGOFLYERS", n = 15000, parse = TRUE, type="mixed")

# Save tweets to a RData file
save(rstats_tweets, file="rtweets2018MMDD.RData")
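
The file name above uses `MMDD` as a placeholder for the collection date. A minimal sketch of one way to build that name from a date, assuming the same `rtweets<YYYYMMDD>.RData` pattern; the helper name `tweet_file()` is hypothetical and is not part of rtweet:

```r
# Hypothetical helper: build a date-stamped file name for one scraping run.
tweet_file <- function(date = Sys.Date()) {
  sprintf("rtweets%s.RData", format(as.Date(date), "%Y%m%d"))
}

tweet_file("2018-03-11")  # "rtweets20180311.RData"
```

Calling `save(rstats_tweets, file = tweet_file())` on the day of a run would then give each collection its own file without editing the code.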

After all the data was scraped and saved to R dataset files, we can start working with the data. First, several R libraries are needed. Note, I tried to use only high quality libraries, such as those developed by the RStudio group.

library(rtweet) # for users_data()
library(tidyverse) # Instead of just ggplot2 and dplyr
── Attaching packages ──────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.4.2     ✔ dplyr   0.7.4
✔ tidyr   0.8.0     ✔ stringr 1.3.0
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ─────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(tidytext)  # For Twitter text manipulation
library("RColorBrewer") # Because I want to print with Flyers colors!

Read the three datasets back into memory and combine into one master dataset.

# Load the RData files that were saved after scraping Twitter
load(file="rtweets20180311.RData")
tw11 <- rstats_tweets
users11 <- users_data(rstats_tweets)
load(file="rtweets20180320.RData")
tw20 <- rstats_tweets
users20 <- users_data(rstats_tweets)
load(file="rtweets20180326.RData")
# Combine the three datasets
tw <- bind_rows(tw11, tw20, rstats_tweets)
users <- users_data(rstats_tweets)
users <- bind_rows(users11, users20, users)
# Delete the individual datasets since we now have the master file
rm(tw11, users11, tw20, users20, rstats_tweets)
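
Because every saved file stores its data under the same name, `rstats_tweets`, each `load()` overwrites the previous object, which is why the code above copies it after every call. A sketch of an alternative that loads each file into its own environment follows. The helper `load_scrape()` and the toy files are assumptions for illustration only, and base `rbind()` stands in for `bind_rows()` so the example runs without extra packages:

```r
# Hypothetical helper: load one saved scrape without touching the global
# environment, then return the object stored under `name`.
load_scrape <- function(path, name = "rstats_tweets") {
  env <- new.env()
  load(path, envir = env)
  get(name, envir = env)
}

# Toy stand-ins for the three saved files (not real Twitter data).
paths <- file.path(tempdir(), sprintf("toy_scrape_%d.RData", 1:3))
for (i in seq_along(paths)) {
  rstats_tweets <- data.frame(status_id = as.character(c(i, i + 1)),
                              stringsAsFactors = FALSE)
  save(rstats_tweets, file = paths[i])
}
rm(rstats_tweets, i)

# Read every file and stack the rows into one master data frame.
tw_toy <- do.call(rbind, lapply(paths, load_scrape))
nrow(tw_toy)  # 6 rows: two per toy file, overlapping ids still included
```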

Due to the times when the data was scraped, there are some overlapping time periods in the data. For example, the scraping run of March 20 collected some tweets that had already been scraped on March 11. We do not want these duplicates.

users <- unique(users)

As I worked with the data, I found some users that I would rather exclude from my analysis.

# user_id = "19618527" is the Philadelphia Flyers. We know the Flyers tweet for the team.
# user_id = "471268712" is PHLFlyersNation. We know PHLFlyersNation tweets for the team.
# user_id = "154699499" is sportstalkphl. We know sportstalkphl tweets for the team.
# user_id = "426029765" is XFINITYLive. We know XFINITYLive tweets for the team.
# user_id = "19276719" and "321035743" are not real people and direct to naughty websites
# user_id = "493658381" is FlyersNation. We know the Flyers tweet for the team.
# user_id = "938072969552826368" is the Philly Sports Network Flyers.
users <- users[!(users$user_id %in% c("19618527", "471268712", "154699499", "426029765", "19276719", "493658381", "938072969552826368", "321035743")),]

There are many people who tweet about the Flyers who do not live locally. I do not doubt they are fans, but they are probably not attending many games (due to their geographic locations). Perhaps they should be included in further analysis, but for now I am only keeping tweets by users who identify their location as Pennsylvania, New Jersey, or Delaware. This restriction may lose a few local people who do not identify their location as local, but that is unavoidable.

# Only analyze "local" tweeters - location identified as PA, NJ, or DE
select <- grepl("Phil", users$location, ignore.case = TRUE) | grepl("PA", users$location, ignore.case = FALSE) | grepl("NJ", users$location, ignore.case = FALSE) | grepl("DE", users$location, ignore.case = FALSE)
users <- users[select,]
rm(select)
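
One caveat with the filter above is that it matches substrings, so an upper-case "DE" or "PA" inside a longer word, or "Phil" inside "Philippines", also counts as local. The sketch below compares it with a stricter variant on invented locations. The stricter patterns are my assumption of what "local" should match; they are not the filter used for the results in this post, and applying them would change the user counts reported later:

```r
# Toy locations (invented for illustration, not taken from the data).
loc <- c("Philadelphia, PA", "Cherry Hill, NJ", "Wilmington, DE",
         "DENVER", "Manila, Philippines", "PARIS")

# The filter used above: plain substring matches.
loose <- grepl("Phil", loc, ignore.case = TRUE) |
  grepl("PA", loc) | grepl("NJ", loc) | grepl("DE", loc)

# Assumed stricter variant: state abbreviations must stand alone as words,
# and place names are spelled out.
strict <- grepl("\\b(PA|NJ|DE)\\b", loc, perl = TRUE) |
  grepl("Philadelphia|Philly|Pennsylvania|New Jersey|Delaware", loc,
        ignore.case = TRUE)

data.frame(loc, loose, strict)
```

On these six toy strings the loose filter keeps all of them, while the stricter one keeps only the first three.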

Twitter allows certain users to be verified. Verified accounts include professional radio, TV, and news stations (e.g. NBC) and some celebrities (a spot check identifies the verified accounts in this dataset as broadcasters and reporters). I am sure they are quite helpful in boosting Flyers ticket sales, but there is no reason to keep them in the analysis, since we know they are already working on boosting sales. For the purposes of this study, the verified accounts will not be considered.

users <- users[!users$verified,] # Save only nonverified accounts

I am sure the remaining users include a few bots that I might have missed, but the resulting group is a start for this study.

Twitter data is formatted so that it can be saved in two datasets: the users data and the tweets data. The users data includes the following features: user_id, name, screen_name, location, description, url, protected, followers_count, friends_count, listed_count, statuses_count, favourites_count, account_created_at, verified, profile_url, profile_expanded_url, account_lang, profile_banner_url, profile_background_url, profile_image_url. We need to select the tweets from only the users that remain in the users dataset.

# There are a few tweets repeated due to overlapping dates. Save only one instance of each.
tw <- unique(tw)
# Now select only the tweets that belong to the remaining user_ids
tw <- tw[tw$user_id %in% users$user_id,]
# Save only the tweets that are in English - at this time
tw <- tw[tw$lang=="en",]
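
A note on `unique(tw)`: it drops only rows that are identical in every column. If the same tweet was captured in two runs with a different `retweet_count` or `favorite_count`, both copies survive. A minimal sketch of deduplicating on `status_id` instead, using invented data and the assumption that the copy with the highest retweet count is the most recent one (the same idea applies to the users data with `user_id`):

```r
# Toy tweets: status "a" was captured twice with different retweet counts.
tw_toy <- data.frame(
  status_id     = c("a", "b", "a", "c"),
  retweet_count = c(3L, 0L, 7L, 1L),
  stringsAsFactors = FALSE
)

nrow(unique(tw_toy))  # 4: the two "a" rows differ, so both survive

# Keep one row per status_id, preferring the highest retweet count
# (assumed here to be the most recent capture).
by_count <- tw_toy[order(tw_toy$retweet_count, decreasing = TRUE), ]
tw_dedup <- by_count[!duplicated(by_count$status_id), ]
tw_dedup
```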

The tweets dataset has the following features: status_id, created_at, user_id, screen_name, text, source, reply_to_status_id, reply_to_user_id, reply_to_screen_name, is_quote, is_retweet, favorite_count, retweet_count, hashtags, symbols, urls_url, urls_t.co, urls_expanded_url, media_url, media_t.co, media_expanded_url, media_type, ext_media_url, ext_media_t.co, ext_media_expanded_url, ext_media_type, mentions_user_id, mentions_screen_name, lang, quoted_status_id, quoted_text, retweet_status_id, retweet_text, place_url, place_name, place_full_name, place_type, country, country_code, geo_coords, coords_coords, bbox_coords, query.

The collected tweets span the time from 2018-03-04 20:56:42 to 2018-03-26 13:00:21. The resulting dataset contains 7359 tweets from 1957 users, although many of the tweets are retweets.

Now we can address the first data research question.

## 1. What Twitter user had the most tweets over the selected time period?

tw %>%
  count(user_id, sort = TRUE) %>%
  filter(n > 60) %>%
  inner_join(distinct(users[, c("user_id", "name", "screen_name")], user_id, .keep_all = TRUE), by = "user_id") %>%
#  ggplot(aes(x = reorder(name, n), y = n)) + # If I wanted to show actual names
  ggplot(aes(x = reorder(user_id, n), y = n)) +
  geom_col(aes(fill = n)) +
  scale_fill_distiller(palette="Oranges") +
  theme(legend.position = "none") +
  xlab("User_Ids") + ylab("Number of Tweets") +
  labs(title="Users with Most Tweets", subtitle="In Shades of Orange!") +
  coord_flip()

If I were doing this to identify the actual Twitter users, I could have listed the user names, which are available in the dataset. Since this exercise is only academic, I print the user_id numbers to maintain some anonymity.

Next let us look at the number of followers the users have. Viewing all users with their followers overloads the graph, and I do not want just a list of names, so I apply a cutoff. A spot check shows that 20 users have more than 20,000 followers. If it turns out that these users are not real people, we could remove them from the users dataset or look at users with followers between 100 and 500, or whatever span we think appropriate.

## 2. Of those who tweeted, who had the most followers?

unique(users[, c("user_id", "name", "screen_name", "followers_count")]) %>%
  filter(followers_count > 20000) %>%
  ggplot(aes(x = reorder(user_id, followers_count), y = followers_count)) +
  geom_col(aes(fill = followers_count)) +
  scale_fill_distiller(palette="Oranges") +
  theme(legend.position = "none") +
  xlab("User_Ids") + ylab("Number of Followers") +
  labs(title="Users who Tweeted #Flyers with Most Followers", subtitle="More than 20,000 followers") +
  coord_flip()

Again, I could have printed the graph with user names instead of user ids, but I am protecting identities for this initial study.

Finally, let us find the originators of tweets that were highly retweeted. It is possible that others retweeted the retweeted texts, but the following chart considers only the original authors of the tweets.

## 3. Of those who tweeted, who had the most retweeted text?

tw[!tw$is_retweet, c("user_id", "retweet_count")] %>%
  group_by(user_id) %>%
  summarise(n=max(retweet_count)) %>%
  filter(n > 10) %>%
  inner_join(distinct(users[, c("user_id", "name", "screen_name", "followers_count", "account_created_at")], user_id, .keep_all = TRUE), by = "user_id") %>%
  ggplot(aes(x = reorder(user_id, n), y = n)) +
  geom_col(aes(fill = n)) +
  scale_fill_distiller(palette="Oranges") +
  theme(legend.position = "none") +
  xlab("User_Ids") + ylab("Retweet Count") +
  labs(title="Users who had the Most Retweets") +
  coord_flip()
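
Note that `max(retweet_count)` ranks each user by their single most retweeted tweet. A user with many moderately retweeted tweets could rank differently if the counts were summed instead. The sketch below shows the two measures side by side on invented data, using base R `aggregate()` rather than the dplyr pipeline above:

```r
# Toy original tweets (invented values).
orig <- data.frame(
  user_id       = c("u1", "u1", "u1", "u2"),
  retweet_count = c(4L, 5L, 6L, 12L),
  stringsAsFactors = FALSE
)

# Best single tweet per user, as in the chart above.
best  <- aggregate(retweet_count ~ user_id, data = orig, FUN = max)
# Total retweets per user across all of their original tweets.
total <- aggregate(retweet_count ~ user_id, data = orig, FUN = sum)

compare <- merge(best, total, by = "user_id", suffixes = c("_max", "_sum"))
compare
```

Here `u2` leads on the best single tweet (12 versus 6), while `u1` leads on the total (15 versus 12). Which measure is more useful depends on whether we care about one viral tweet or sustained reach.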

This post has examined the Twitter users that tweet about the Philly Flyers. In the next post, we will look at the tweets themselves. I will post again soon.


Baseball's Pythagorean Expectation for Curling

11 March ’18

Baseball’s Pythagorean Expectation

I have been reading Mathletics by Wayne L. Winston and watching the 2018 Winter Olympics. I thought it might be fun to apply the Pythagorean Expectation, developed for baseball but since applied to various other sports, to the game of curling. I wrote the code and ran the program prior to the Women’s medal matches. Spoiler alert: Sweden won the Gold medal (with Korea getting the Silver) and Japan won the Bronze medal.

Here is my R Notebook:

R Notebook on Women’s Curling 2018 Olympics

An analysis of Women’s Curling at the 2018 Winter Olympics in South Korea. The technique to predict future matches is based on the Baseball Pythagorean Theorem. Exponents for the Pythagorean Expectation from 1.9 to 4.1 (in steps of 0.1) were tested against the games played through February 21, 2018. Empirically, 3.4 had the lowest mean squared error of that set.
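
For reference, the Pythagorean Expectation in its usual form estimates a team's winning percentage from the points it has scored and allowed. The symbols $S$ (points scored), $A$ (points allowed), and $x$ (the exponent, 3.4 in this notebook) are shorthand introduced here, not notation from the notebook:

```latex
\[
  \widehat{W} = \frac{S^{x}}{S^{x} + A^{x}}
\]
```

A team that has scored exactly as many points as it has allowed gets $\widehat{W} = 0.5$ for any exponent. A larger exponent pushes the estimate further from 0.5 for the same scoring ratio.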

library('rvest')
participants <- "Women" # Set for output reasons
#Specifying the url for desired website to be scraped
url <- 'http://results.worldcurling.org/Championship/DisplayResults?tournamentId=561&associationId=0&teamNumber=0&drawNumber=0'
exponent <- 3.4 # The exponent to use for the Pythagorean Expectation, determined empirically 
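
The notebook does not show how the exponent was chosen, so the following is only a sketch of one way such a search could be done. It assumes the error is measured between each team's Pythagorean Expectation and its actual share of games won. The function name `pythag()` and all of the toy numbers are invented for illustration:

```r
# Pythagorean Expectation for points scored and allowed.
pythag <- function(scored, allowed, exponent) {
  scored^exponent / (scored^exponent + allowed^exponent)
}

# Toy season totals (invented, not the Olympic results).
toy <- data.frame(
  scored  = c(70, 62, 55, 48),
  allowed = c(50, 58, 57, 66),
  wins    = c(7, 5, 4, 2),
  losses  = c(2, 4, 5, 7)
)
actual <- toy$wins / (toy$wins + toy$losses)

# Try every exponent from 1.9 to 4.1 in steps of 0.1 and score each by MSE.
exponents <- seq(1.9, 4.1, by = 0.1)
mse <- sapply(exponents, function(x) {
  mean((pythag(toy$scored, toy$allowed, x) - actual)^2)
})

# The exponent with the lowest mean squared error on the toy data.
best_exponent <- exponents[which.min(mse)]
best_exponent
```

With only a handful of teams and games, the best exponent can shift noticeably from one day to the next, so a value found this way is best treated as a rough fit.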

The website www.worldcurling.org is accessed for game data.

#Reading the HTML code from the website
webpage <- read_html(url)
#Using CSS selectors to scrape the rankings section
rank_data_html <- html_nodes(webpage,'.game-table')
num_games = length(rank_data_html)

The downloaded webpage is scraped for the appropriate data.

games <- matrix(nrow = num_games, ncol = 4)
# Scrape the data from the webpage
for (i in seq_len(num_games)) {
  this_result <- rank_data_html[i]
  teams <- html_text(html_nodes(this_result,".game-team"))
  scores <- html_text(html_nodes(this_result,".game-total"))
  games[i,c(1,2)] <- trimws(gsub('\r\n', '', teams))
  games[i,c(3,4)] <- strtoi(trimws(gsub('\r\n', '', scores)))
}
rm(i, scores, teams, this_result)

At this point, data for all Women’s Curling games at the Olympics has been retrieved and stored in a matrix named games. We will split this matrix into a matrix of only the games that have already been completed and a matrix of the games still scheduled to be held (as of February 22, 2018).

completeGames <- games[!is.na(games)[,3],]
# Get a list of the teams that are playing
teams <- unique(games[,1])
teams <- teams[!grepl("To Be", teams)] # Drop placeholder entries for teams not yet decided
# Take a subset of the games that are scheduled to come
scheduledGames <- matrix(nrow=nrow(games[is.na(games)[,3],]), ncol = 6)
scheduledGames[,1] <- games[is.na(games)[,3],1]
scheduledGames[,4] <- games[is.na(games)[,3],2]
scheduledGames <- scheduledGames[!grepl("To Be", scheduledGames[,1]),]
# Get a list of the teams that are scheduled to play
upcomingTeams <- unique(c(scheduledGames[,1],scheduledGames[,4]))

Based on each team’s record, determine the percentage of games they have won.

standings = matrix(nrow=length(teams), ncol = 5)
count <- 1L
for(team in teams) {
  wins <- sum(strtoi(completeGames[grepl(team, completeGames[,1]),3]) > strtoi(completeGames[grepl(team, completeGames[,1]),4])) +
    sum(strtoi(completeGames[grepl(team, completeGames[,2]),4]) > strtoi(completeGames[grepl(team, completeGames[,2]),3]))
  losses <- sum(strtoi(completeGames[grepl(team, completeGames[,1]),3]) < strtoi(completeGames[grepl(team, completeGames[,1]),4])) +
    sum(strtoi(completeGames[grepl(team, completeGames[,2]),4]) < strtoi(completeGames[grepl(team, completeGames[,2]),3]))
  standings[count,1] <- team
  standings[count,2] <- wins
  standings[count,3] <- losses
#  standings[count,4] <- wins / (wins + losses)
  standings[count,4] <- paste0(formatC(100 * (wins / (wins + losses)), format = "f", digits = 2), "%")
  count <- count + 1L
  print(paste(team, wins, losses, sep = ","))
}
[1] "Japan,5,4"
[1] "Olympic Athlete From Russia,2,7"
[1] "Denmark,1,8"
[1] "Switzerland,4,5"
[1] "Canada,4,5"
[1] "China,4,5"
[1] "Great Britain,6,3"
[1] "United States of America,4,5"
[1] "Korea,8,1"
[1] "Sweden,7,2"
# Clean up
rm(count, wins, losses, team)

Now we can compute the Pythagorean Expectation for each team.

count <- 0L
for (team in teams) {
    count <- count + 1L
    scored <- sum(strtoi(completeGames[grepl(team, completeGames[,1]),3])) +
        sum(strtoi(completeGames[grepl(team, completeGames[,2]),4]))
    allowed <- sum(strtoi(completeGames[grepl(team, completeGames[,1]),4])) +
        sum(strtoi(completeGames[grepl(team, completeGames[,2]),3]))
    pythagorean <- (scored^exponent) / ((scored^exponent) + (allowed^exponent))
    pythagorean <- paste0(formatC(100 * pythagorean, format = "f", digits = 2), "%")
    standings[count,5] <- pythagorean
    print(paste(team, pythagorean, sep=" = "))
}
[1] "Japan = 55.94%"
[1] "Olympic Athlete From Russia = 14.41%"
[1] "Denmark = 22.45%"
[1] "Switzerland = 57.34%"
[1] "Canada = 61.84%"
[1] "China = 39.02%"
[1] "Great Britain = 57.22%"
[1] "United States of America = 37.60%"
[1] "Korea = 85.97%"
[1] "Sweden = 72.67%"
standings <- as.data.frame(standings)
names(standings) <- c("Team", "Wins", "Losses", "Standing", "Pythagorean")
print(standings)
rm(count, pythagorean, allowed, scored, team)
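
The two loops above look up each team with `grepl()`, which treats the team name as a pattern and would also match any longer name that contains it. The sketch below does the same bookkeeping with exact comparisons on a small data frame with numeric scores. The column names, the helper `team_summary()`, and the scores are all hypothetical:

```r
# Toy completed games (invented scores); one row per game.
toy_games <- data.frame(
  team1  = c("A", "B", "A"),
  team2  = c("B", "C", "C"),
  score1 = c(8, 5, 6),
  score2 = c(4, 7, 9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: record and Pythagorean Expectation for one team.
team_summary <- function(team, games, exponent = 3.4) {
  first  <- games$team1 == team   # games where the team is listed first
  second <- games$team2 == team   # games where the team is listed second
  scored  <- sum(games$score1[first]) + sum(games$score2[second])
  allowed <- sum(games$score2[first]) + sum(games$score1[second])
  wins <- sum(games$score1[first] > games$score2[first]) +
    sum(games$score2[second] > games$score1[second])
  losses <- sum(games$score1[first] < games$score2[first]) +
    sum(games$score2[second] < games$score1[second])
  data.frame(team = team, wins = wins, losses = losses,
             win_pct = wins / (wins + losses),
             pythagorean = scored^exponent /
               (scored^exponent + allowed^exponent),
             stringsAsFactors = FALSE)
}

# Build one row per team and stack the rows into a standings table.
toy_teams <- unique(c(toy_games$team1, toy_games$team2))
toy_standings <- do.call(rbind,
                         lapply(toy_teams, team_summary, games = toy_games))
toy_standings
```

Keeping the percentages numeric until the final print step also makes it possible to sort or compare teams directly, instead of working with formatted strings.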

Finally, prepare and list the upcoming games with each team’s standings and Pythagorean Expectation.

for (team in upcomingTeams) {
    scheduledGames[grep(team, scheduledGames[,1]),2] <- as.character(standings[grep(team, standings[,1]),4])
    scheduledGames[grep(team, scheduledGames[,1]),3] <- as.character(standings[grep(team, standings[,1]),5])
    scheduledGames[grep(team, scheduledGames[,4]),5] <- as.character(standings[grep(team, standings[,1]),4])
    scheduledGames[grep(team, scheduledGames[,4]),6] <- as.character(standings[grep(team, standings[,1]),5])
}
rm(team)
scheduledGames <- as.data.frame(scheduledGames)
names(scheduledGames) <- c("Team1", "Standings1", "Pythagorean1", "Team2", "Standings2", "Pythagorean2")
print(scheduledGames)

Interpreting this table, Sweden has been playing at a 72.67% level while winning 77.78% of their games. The other team, Great Britain, has been playing at a 57.22% level. These percentages are thought to be a better indication of a team's strength than its actual record because the Pythagorean Expectation takes the score into account: a team that wins by several points is playing better than a team that barely wins by one point.


Introduction to my Data Science Blog

9 March ’18

Introduction to my Blog

This data science blog is built on the Jekyll Travelog theme (both elegant and downright simple). The theme source code can be found on GitHub here.

Introduction

I am a systems engineer and an adjunct professor of data science. As I cover concepts in data science (or learn new ones), I have been writing code, mostly in R but sometimes in Python, to demonstrate techniques to my classes or to learn them better myself. I am establishing this blog so I can post that code and related material.

Bio

Dr. Clifton Baldwin is an adjunct professor for the Data Science and Strategic Analytics Master's degree at Stockton University, Galloway, NJ, a program he helped establish. In addition, Dr. Baldwin is the Southern New Jersey Regional Director for the Delaware Valley Chapter of INCOSE. He has over 26 years of experience working in software and systems engineering, with the first ten years of his career spent as a “systems analyst” (what we would now call a “data scientist”) at the U.S. Bureau of Economic Analysis (BEA, https://www.bea.gov). His research interests include system of systems and complex systems modeling. He holds a BA degree in Mathematics from Rutgers University, an MS degree in Information Systems from Johns Hopkins University, and a PhD in Systems Engineering from Stevens Institute of Technology.

Constructing my Data Science Blog

8 March ’18

This data science blog is built on the Jekyll Travelog theme (both elegant and downright simple).

The theme source code can be found on GitHub here.

I am just getting it set up, so for the moment it is Under Construction!