Data Exploration Collected at a Local Cemetery by the DSSA Class
22 November ’17
The Stockton DSSA class walked to the local cemetery to collect data on October 31, 2017. Each team of students uploaded their data, in whatever format they used. An optional project was to read in the various datasets, clean them as needed, and analyze the data. Not many of the students undertook this project, but I decided it was a good opportunity for me to play with the data and use my findings as an example for the class.
Although I am posting this blog post in March 2018, I dated it November 2017, when I wrote the code and did the analysis.
Here is my R Notebook:
R Notebook on Halloween Data
Dr. Clifton Baldwin
November 17, 2017
Hammonton Cemetery Data
Collected October 31, 2017
Revised Report (corrections) November 22, 2017
- First, determine what question we want to ask of the data. The question will help us decide which variables/columns we want to preserve.
- Read all the applicable datasets into R.
- For the desired columns, standardize their names.
- Standardize the data type for each column (e.g., every dataset's DOB should be a date field)
- Do the analysis.
- State any limitations.
- Make conclusions that are supported by the analysis.
Data Question
My question is, are lifespans (ages) increasing in recent years? In other words, do people live longer now than previously? To answer my question, I need the date of birth (DOB), and the final age, which can be computed using the Date of Death (DOD).
Data Cleaning
There are 11 datasets available. The first dataset is the data I collected, and it should match the data that Dr. Manson collected. So of those two I will read only my dataset, which makes a total of ten datasets to import.
data1 <- read_csv("Halloween data.csv", skip=6)
Parsed with column specification:
cols(
First_Name = col_character(),
Middle_Name = col_character(),
Last_Name = col_character(),
DoB = col_character(),
DoD = col_character(),
Age = col_integer()
)
To address my data question, I believe I need only FirstName, LastName, DOB (Date of Birth), DOD (Date of Death), and Age. Actually, I may need only DOB and DOD, since age can be computed from them.
I will want to combine all the datasets into one master dataset for analysis. To standardize my data frames, I will rename the columns with special attention to the FirstName, LastName, DOB, DOD, and Age.
names(data1) <- c("FirstName","MiddleName","LastName","DOB","DOD","Age")
I want the DOB and DOD to be dates instead of strings. Using the tidyverse package lubridate, I can parse the columns into dates. I have to load lubridate separately, since it does not load automatically with the tidyverse. Since some observations are in m/d/yyyy format and some are just a 4-digit year, I need to tell the parser both formats.
data1$DOB <- parse_date_time(data1$DOB, c("mdy", "y"))
data1$DOD <- parse_date_time(data1$DOD, c("mdy", "y"))
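As a quick standalone check of this two-format fallback (the dates here are made up for illustration):

```r
library(lubridate)

# parse_date_time() tries "mdy" first, then falls back to a bare 4-digit year
x <- parse_date_time(c("3/14/1902", "1885"), c("mdy", "y"))
year(x)   # 1902 1885
month(x)  # a bare year parses as January 1 of that year
```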
For a few observations, I have only the date of death and the age. I can compute the year of birth from that information.
data1$DOB <- if_else(is.na(data1$DOB), parse_date_time(year(data1$DOD - years(data1$Age)), c("y")), data1$DOB)
Finally, I want to save only the four columns FirstName, LastName, DOB, DOD. I can calculate age later.
data1 <- data1[,c(1,3,4,5)]
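As an aside, selecting columns by position is fragile if a file's column order ever changes; dplyr's select() (loaded with the tidyverse) does the same thing by name. A sketch:

```r
library(dplyr)

# Equivalent to data1[, c(1, 3, 4, 5)], but robust to column reordering
data1 <- select(data1, FirstName, LastName, DOB, DOD)
```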
Each of the ten datasets has its own quirks since there was no standard format provided to the data collectors. The next dataset has some type of metadata stored in columns after column 7. I will save only columns 1 through 4, since those columns contain my desired data, and then parse the desired dates.
data2 <- read_delim("Cemetary Data 103117 again", delim="\t")
Missing column names filled in: 'X8' [8], 'X9' [9], 'X11' [11], 'X12' [12]
Parsed with column specification:
cols(
FirstName = col_character(),
LastName = col_character(),
DOB = col_character(),
DOD = col_character(),
MF = col_character(),
MultiTomb = col_character(),
Epitaph = col_character(),
X8 = col_character(),
X9 = col_character(),
`Data Collectors: Melissa Laurino, Louis Discenza` = col_character(),
X11 = col_character(),
X12 = col_character()
)
data2 <- data2[,c(1:4)] # Remove the metadata and unwanted data.
data2$DOB <- parse_date_time(data2$DOB, c("mdy", "y"))
1 failed to parse.
data2$DOD <- parse_date_time(data2$DOD, c("mdy", "y"))
1 failed to parse.
After examining the raw data, there is one DOB listed as “02/14/000” and one DOD listed as “11/00/0000” (not in the same observation). When they are parsed, the failures turn them into NA, which is appropriate here.
The next dataset does not have DOB or DOD but separates the month, day, and year in the data. I need to paste together the full dates for the DOB and DOD columns.
data3 <- read_csv("CemeteryData.csv")
Parsed with column specification:
cols(
`Stone#` = col_integer(),
FirstName = col_character(),
MiddleName = col_character(),
LastName = col_character(),
Suffix = col_character(),
FromDD = col_character(),
FromMM = col_character(),
FromYYYY = col_integer(),
ToDD = col_integer(),
ToMM = col_character(),
ToYYYY = col_integer(),
StoneSize = col_character(),
NumOfIndividuals = col_integer(),
OtherInfo = col_character()
)
data3$DOB <- ifelse(is.na(data3$FromMM), data3$FromYYYY, paste(data3$FromMM,data3$FromDD,data3$FromYYYY, sep = "/"))
data3$DOD <- ifelse(is.na(data3$ToMM), data3$ToYYYY, paste(data3$ToMM,data3$ToDD,data3$ToYYYY, sep = "/"))
data3$DOB <- parse_date_time(data3$DOB, c("mdy", "y"))
data3$DOD <- parse_date_time(data3$DOD, c("mdy", "y"))
data3 <- data3[,c(2,4,15,16)]
The next three datasets are similar enough that I can run the same set of commands on them.
data4 <- read_csv("Cemetery Data 10.31.2017 - Sheet1.csv", skip=4)
Parsed with column specification:
cols(
`Last Name` = col_character(),
`First Name` = col_character(),
`Male/Female` = col_character(),
`Date of Birth` = col_character(),
`Date of Death` = col_character(),
`Military?` = col_character(),
Notes = col_character()
)
names(data4) <- c("LastName","FirstName","Male/Female","DOB","DOD","Military","Notes")
data4$DOB <- parse_date_time(data4$DOB, c("mdy", "y"))
data4$DOD <- parse_date_time(data4$DOD, c("mdy", "y"))
data4 <- data4[,c(1,2,4,5)]
data5 <- read_csv("cemeterydata2.csv")
Parsed with column specification:
cols(
Obs = col_integer(),
Last = col_character(),
First = col_character(),
Dob = col_integer(),
Dod = col_integer(),
Time = col_time(format = ""),
military = col_character(),
Branch = col_character(),
War = col_character()
)
names(data5) <- c("Obs","LastName","FirstName","DOB","DOD","Timestamp","Military","Branch","War")
data5$DOB <- parse_date_time(data5$DOB, c("mdy", "y"))
data5$DOD <- parse_date_time(data5$DOD, c("mdy", "y"))
data5 <- data5[,c(2,3,4,5)]
data6 <- read_csv("cemeterydataxlsx.csv") # cemeterydata.xlsx converted to CSV file
Parsed with column specification:
cols(
Obs = col_integer(),
Last = col_character(),
First = col_character(),
Dob = col_integer(),
Dod = col_integer(),
Time = col_time(format = ""),
military = col_character(),
Branch = col_character(),
War = col_character()
)
names(data6) <- c("Obs","LastName","FirstName","DOB","DOD","Timestamp","Military","Branch","War")
data6$DOB <- parse_date_time(data6$DOB, c("mdy", "y"))
data6$DOD <- parse_date_time(data6$DOD, c("mdy", "y"))
data6 <- data6[,c(2,3,4,5)]
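Since several of these datasets repeat the same rename-parse-select steps, they could be wrapped in a small helper function. A sketch, where the column positions are an argument since they differ from file to file:

```r
library(lubridate)

# Keep the name/date columns (positions given by `cols`, in the order
# FirstName, LastName, DOB, DOD), rename them, and parse the dates.
clean_cemetery <- function(df, cols) {
  df <- df[, cols]
  names(df) <- c("FirstName", "LastName", "DOB", "DOD")
  df$DOB <- parse_date_time(df$DOB, c("mdy", "y"))
  df$DOD <- parse_date_time(df$DOD, c("mdy", "y"))
  df
}
# e.g. data5 <- clean_cemetery(read_csv("cemeterydata2.csv"), c(3, 2, 4, 5))
```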
The next dataset has a problem with “N/A” instead of the value NA. “N/A” is not recognized by R as a missing value, so it is read in as a string. First I will convert “N/A” to NA, where applicable, and then address the dates as I have been doing.
data7 <- read_csv("GraveyardData.csv")
Parsed with column specification:
cols(
`First Name` = col_character(),
`Last Name` = col_character(),
YOB = col_character(),
YOD = col_character(),
Relationship = col_character(),
`Group#` = col_integer(),
`Gravestone Grouping Type` = col_character(),
`Affliation/Religion` = col_character()
)
names(data7) <- c("FirstName", "LastName", "DOB","DOD","Relationship","Group#","Gravestone Grouping Type", "Affliation/Religion")
data7$DOB <- if_else(data7$DOB == "N/A", NA_character_, data7$DOB)
data7$DOB <- parse_date_time(data7$DOB, c("mdy", "y"))
data7$DOD <- if_else(data7$DOD == "N/A", NA_character_, data7$DOD)
data7$DOD <- parse_date_time(data7$DOD, c("mdy", "y"))
data7 <- data7[,c(1:4)]
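Alternatively, readr can be told at read time which strings mean missing, via the na argument, which would make the two if_else() calls unnecessary. A sketch, demonstrated on an inline string rather than the actual file:

```r
library(readr)

# na = c("", "NA", "N/A") converts "N/A" to a true NA on import,
# e.g. read_csv("GraveyardData.csv", na = c("", "NA", "N/A"))
demo <- read_csv("YOB,YOD\nN/A,1900\n", na = c("", "NA", "N/A"))
is.na(demo$YOB[1])  # TRUE
```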
The next dataset was an Excel file. I could use a special package in R to handle Excel files, but it was MUCH easier to open it in Excel and save it as a CSV. Since I had to do this only once, I preferred the manual approach. Also, similar to the last dataset, the missing values are represented with a “-”.
data8 <- read_csv("Cemetary Data 103117.csv") # Saved with a csv extension but in Excel format - corrected using Excel
Parsed with column specification:
cols(
First = col_character(),
Last = col_character(),
DOB = col_character(),
DOD = col_character(),
`Male/Female` = col_character(),
`Single/Married/Children` = col_character(),
`Tombstone S/M/L/XL` = col_character()
)
names(data8) <- c("FirstName","LastName","DOB","DOD","Male/Female","Single/Married/Children","Tombstone S/M/L/XL")
data8$DOB <- if_else(data8$DOB == "-", NA_character_, data8$DOB)
data8$DOB <- parse_date_time(data8$DOB, c("mdy", "y"))
data8$DOD <- if_else(data8$DOD == "-", NA_character_, data8$DOD)
data8$DOD <- parse_date_time(data8$DOD, c("mdy", "y"))
data8 <- data8[,c(1:4)]
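For the record, the “special package” route is also only a couple of lines: the readxl package (installed with the tidyverse, though loaded separately) reads .xlsx files directly. A sketch, where the workbook filename is my assumption:

```r
library(readxl)

# Read the first sheet of the original workbook directly
# (the .xlsx filename here is an assumption)
data8 <- read_excel("Cemetary Data 103117.xlsx")
```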
Nothing too unusual about the next dataset.
data9 <- read_csv("Hammonton-Graveyard-10-31-2017.csv")
Missing column names filled in: 'X1' [1], 'X2' [2], 'X3' [3], 'X4' [4], 'X5' [5], 'X6' [6], 'X7' [7], 'X8' [8]
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_character(),
X3 = col_character(),
X4 = col_integer(),
X5 = col_integer(),
X6 = col_character(),
X7 = col_character(),
X8 = col_character()
)
data9 <- data9[,c(2:7)]
names(data9) <- c("FirstName","LastName","DOB","DOD","Marker", "Military")
data9$DOB <- parse_date_time(data9$DOB, c("mdy", "y"))
data9$DOD <- parse_date_time(data9$DOD, c("mdy", "y"))
data9 <- data9[,c(1:4)]
Finally, I wrote a Python script to read Kancy-Cemetery-data (I wrote one for Manson-data also, but I do not need that data since I collected the same observations). The Python script Kancy-Cemetery-data.py saves the output as Kancy-Cemetery-data.csv. Since I wrote the Python script, I know the columns are named the way I want, but the dates still need to be parsed from strings into date fields.
data10 <- read_csv("Kancy-Cemetery-data.csv")
Parsed with column specification:
cols(
FirstName = col_character(),
LastName = col_character(),
DOB = col_character(),
DOD = col_character()
)
data10$DOB <- parse_date_time(data10$DOB, c("mdy", "y"))
data10$DOD <- parse_date_time(data10$DOD, c("mdy", "y"))
Now I will combine all the datasets into one large dataset and discard the individual datasets.
dataAll <- bind_rows(data1,data2,data3,data4,data5,data6,data7,data8,data9,data10)
rm(data1, data2, data3, data4, data5, data6, data7, data8, data9, data10)
glimpse(dataAll)
Observations: 495
Variables: 4
$ FirstName <chr> "John", "Mary", "Pierino", "Frances", "Ann", "F...
$ LastName <chr> "Harris", "Smith", "Gallucci", "Blanc", "Miller...
$ DOB <dttm> 1829-01-01, 1824-07-22, 1919-01-29, 1800-06-20...
$ DOD <dttm> 1867-07-08, 1895-07-12, 1990-12-22, 1879-03-05...
Now, I am ready to explore!
Analysis
As an exploratory move, I compute years lived, also known as age.
dataAll$age <- interval(dataAll$DOB, dataAll$DOD) / years(1)
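A quick illustration of what interval()/years(1) returns, on made-up dates:

```r
library(lubridate)

# Dividing an interval by years(1) yields fractional years lived
a <- interval(ymd("1900-01-01"), ymd("1970-07-01")) / years(1)
floor(a)  # 70
```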
At least one observation, row 475, has a negative value for age.
dataAll[475,]
People assume that time is a strict progression of cause to effect, but actually, from a nonlinear, non-subjective viewpoint, it’s more like a big ball of wibbly-wobbly, timey-wimey… stuff. In other words, this observation 475 appears to have a problem. I do not want to correct the data, since my “correction” might be wrong. I know the person could not have been born in 1985 and died in 1917, but maybe the DOB and DOD are switched? Or possibly this person was born in 1885 and not 1985? In any case, I should discard this observation.
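An explicit filter could look like the following sketch; I leave it unapplied here, since the plots below deliberately revisit this outlier before filtering on age.

```r
# One way to discard rows with a logically impossible (negative) age,
# while keeping rows where age could not be computed (NA)
dataClean <- subset(dataAll, is.na(age) | age >= 0)
```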
Now I can start looking into the question about age. I want to take the average of the non-NA ages where the age is more than 0.
mean(subset(dataAll[complete.cases(dataAll$age),], age > 0)$age)
[1] 69.44706
That code can be simplified, and I should include those with age 0 (in other words, less than a year old).
mean(subset(dataAll, age >= 0)$age, na.rm = TRUE)
[1] 68.99415
These statistics are not getting me closer to answering my question. Let us see the mean age by decade.
dataAll$decade <- (year(dataAll$DOB) %/% 10) * 10
decades <- subset(dataAll[complete.cases(dataAll$age),], age >= 0)
# Show average age by decade
ageByDecade <- aggregate(decades$age, by=list(decades$decade), FUN=mean, na.rm=TRUE)
kable(ageByDecade, digits=2, row.names = FALSE, col.names = c("Decade", "Mean_Age"))
Decade | Mean_Age |
---|---|
1800 | 78.71 |
1810 | 87.00 |
1820 | 68.87 |
1830 | 71.85 |
1840 | 72.30 |
1850 | 74.50 |
1860 | 62.52 |
1870 | 70.03 |
1880 | 70.14 |
1890 | 72.07 |
1900 | 74.92 |
1910 | 70.50 |
1920 | 73.38 |
1930 | 58.38 |
1940 | 48.80 |
1950 | 35.18 |
1960 | 33.89 |
1970 | 24.90 |
1980 | 28.88 |
1990 | 7.85 |
rm(decades, ageByDecade)
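For comparison, the same table could be produced in the tidyverse style with dplyr, which reads a bit more naturally than aggregate(); a sketch:

```r
library(dplyr)

# Mean age by decade of birth, tidyverse style
dataAll %>%
  filter(!is.na(age), age >= 0) %>%
  group_by(decade) %>%
  summarise(Mean_Age = mean(age))
```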
These statistics may be appropriate, but I see there is only data from 1800 to 1990. It makes sense that the recent decades show shorter lifespans, since many people born from the 1930’s onward are still alive. Also, how many observations have a DOB in the 1830’s or earlier?
nrow(subset(dataAll, year(DOB) <= 1830))
[1] 7
Seven is not a large enough sample. So really I have data only for the 1840’s through the 1920’s. In other words, I do not think we have enough data (a long enough timespan) to address the data question.
To look at the number of observations per decade, display a histogram.
hist(dataAll$decade, breaks=20, xlab="decade", main="Count of Data Observations per Decade")
The histogram confirms that the majority of the DOBs fall between the 1840’s and the 1930’s, so I get one more decade than I thought. But no one born in the 1930’s could have been older than 87 at death, because otherwise they would still be alive. That would bias that decade downward, so I prefer to keep only the 1840’s through the 1920’s. That still excludes anyone who lived past 97 years, but I will accept that.
To visualize any useful results, I will explore the data with some graphs. First using the base R function for graphing, plot DOB against age.
plot( dataAll$DOB, dataAll$age, main="Years Lived by DOB", xlab="Date of Birth", ylab="Age" )
To get a prettier graph, repeat the plot using the qplot function from the ggplot2 package.
qplot( DOB, age, data=dataAll, color=age, main="Years Lived by DOB", xlab="Date of Birth", ylab="Age" )
These graphs included rows with incomplete observations. Let us look at it again with all NAs removed.
qplot( DOB, age, data=dataAll[complete.cases(dataAll),], color=age, main="Years Lived by DOB", xlab="Date of Birth", ylab="Age" )
Apparently the missing values are not the problem. It is the negative age again. So I will remove the NAs and any age less than zero.
qplot( DOB, age, data=subset(dataAll, age>=0), color=age, main="Years Lived by DOB", xlab="Date of Birth", ylab="Age" )
As previously discussed, we should look just at the span of people born in the 1840’s to 1920’s. To make the chart interesting, I will color code the chart by age. To do so, create a function to generate a continuous color palette.
rbPal <- colorRampPalette(c('red','blue'))
Next, add a column of color values based on the y values (i.e. age).
dataAll$Col <- rbPal(10)[as.numeric(cut(dataAll$age,breaks = 10))]
rm(rbPal) # Remove rbPal since it has served its purpose.
In order to avoid a useless (in this case) legend, I need to use ggplot. Here is the graph of observations where DOB was between 1830 and 1930 and age is greater than or equal to zero.
ggplot( data=subset(dataAll, age>=0 & year(DOB) < 1930 & year(DOB) > 1830),
aes(x=DOB, y=age, color=Col)) +
geom_point( ) +
labs(title="Years Lived by DOB", x="Date of Birth", y="Age" ) +
theme(legend.position="none")
The colors help visualize that the ages are spread (mostly) evenly across the decades.
Examining the graph does not show any trend in age versus date of birth, but I can see there are many more observations around the turn of the 20th century than earlier in the 19th century or later in the 20th century. Is that a result of my sampling technique (a biased sample)? Or is that the distribution of graves in this cemetery? Yet another lesson on data collection: if we had had a plan for collecting the data, it should have included sampling randomly from the entire cemetery.
I will use ggplot() again for DOD vs. age. Unlike DOB, there should be no “year” restriction when using DOD, since obviously no one is still alive after their DOD. Using DOD potentially gives us a longer span in which to see a trend. In this case I could have used qplot(), but ggplot() provides many more options to make the graph look better.
# There is an easier way (I realize) to get the colors
# where I do not have to create a function to generate a continuous color palette
# I can just use scale_colour_gradient()
ggplot(subset(dataAll, age>=0), aes(x=DOD, y=age, color=age)) +
geom_point(size=2, shape=23) +
labs(title="Years Lived by DOD", x="Date of Death", y = "Age") +
theme_classic() +
scale_colour_gradientn(colours=rainbow(10)) +
theme(legend.position="none")
Let us include only those people who made it to their first birthday, at least, and add a regression line (and remove the colors).
ggplot(subset(dataAll, age>0), aes(x=DOD, y=age)) +
geom_point(size=2, shape=23) +
geom_smooth(method='lm') +
labs(title="Years Lived by DOD", x="Date of Death", y = "Age") +
theme_classic() + theme(legend.position="none")
Note: Regression will be covered in the Machine Learning class in the Spring!
It does appear there may be more deaths at older ages in recent years. Prior to 1920, there are no deaths of people older than 90 years. After 1925, 43 of the 413 observations are deaths past age 90.
pct <- nrow(subset(dataAll, age>90 & year(DOD) > 1925)) / nrow(subset(dataAll, age>=0 & year(DOD) > 1925))
Or about 10.4% living past 90 years after 1925, compared to none before 1920. Do these results indicate it is not as important when one was born but rather when they passed away? I am not sure that makes sense, except that we get a longer span of useful observations when we look at the Date of Death than when we look at the Date of Birth.
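To probe statistical significance, one could compare the mean age at death for deaths through 1925 versus deaths after 1925; a sketch using Welch’s two-sample t-test (I have not run this on the data, so I make no claim about the result):

```r
library(lubridate)

# Welch two-sample t-test: does mean age at death differ
# between deaths in or before 1925 and deaths after 1925?
valid <- subset(dataAll, age >= 0 & !is.na(DOD))
t.test(age ~ (year(DOD) > 1925), data = valid)
```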
Limitations
Is this enough evidence to state (with any statistical significance) that people are living longer in recent years? A sample of 413 is a sufficient size, but our collection method (i.e., no real plan!) may have inadvertently introduced bias. For one thing, we know our results are indicative only of those in southern New Jersey. Actually, they are indicative only of those laid to rest in the Hammonton cemetery. Does this cemetery cost more (or less) than others in the area? After 1920, was it more likely for healthier people to buy plots in the Hammonton cemetery? Does one have to have been a resident of Hammonton, and if so, did something happen in Hammonton after 1920 to cause residents to live longer? Perhaps one had to be a member of a certain local church, and the camaraderie of membership helped extend its members’ lives? Did the rules for the cemetery change at some point, perhaps due to an expansion of the available land?

Thinking further, are these results indicative of the cemetery as a whole, or is there a bias within the cemetery? Did the data collectors concentrate in certain areas, perhaps due to available lighting or access to a path or road? Did they inadvertently focus on older (or newer) grave markers? It appeared that families were grouped together; is there some other grouping of which we are not aware? There are many things to consider if we want a random sample from even this one cemetery.
Conclusion
I conclude that the results indicate, from the available observations of those laid to rest at the Hammonton cemetery, that more people have been living longer in recent decades than in earlier ones, but more research is needed before extending this trend to any more general group of people, such as New Jersey residents or Americans. That “more research” would include a well-defined question and a plan to collect a random sample of data to address it. If the question concerns New Jersey residents or Americans in general, the plan would have to sample from many more cemeteries than one in New Jersey.