Creating data visualisation beyond default.
In this Take-home exercise, we will apply the skills learnt in Visual Analytics Lesson 1 to reveal the demographics of the City of Engagement, Ohio USA by using appropriate static statistical graphics methods. We will process the data using tidyverse family of packages. We will then use ggplot2 and its extensions to prepare the statistical graphics to understand the demographics of the city of Engagement, Ohio USA.
The dataset used is available for download here.
Following are the steps taken to undesrtand the demographics of the region.
We will first load the required packages using the below code chunk
packages = c('tidyverse','ggdist','gghalves','ggthemes','hrbrthemes','ggridges','patchwork')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The code chunk below imports Participants.csv from the data
folder by using read_csv()
function of readr
package into R and save it as a tibble dataframe called
participants_data.
participants_data <- read_csv("data/Participants.csv")
Lets take a look at the data. Brief glance at the columns available in the Participants data shows that it contains household size, age, education level, have kids and joviality. All these are useful attributes to understand the demographics of a region. Let us look at each of these attributes individually to analyse their usefulness further.
head(participants_data,10,7)
# A tibble: 10 x 7
participantId householdSize haveKids age educationLevel
<dbl> <dbl> <lgl> <dbl> <chr>
1 0 3 TRUE 36 HighSchoolOrCollege
2 1 3 TRUE 25 HighSchoolOrCollege
3 2 3 TRUE 35 HighSchoolOrCollege
4 3 3 TRUE 21 HighSchoolOrCollege
5 4 3 TRUE 43 Bachelors
6 5 3 TRUE 32 HighSchoolOrCollege
7 6 3 TRUE 26 HighSchoolOrCollege
8 7 3 TRUE 27 Bachelors
9 8 3 TRUE 20 Bachelors
10 9 3 TRUE 35 Bachelors
# ... with 2 more variables: interestGroup <chr>, joviality <dbl>
The code chunk below creates a histogram of age by using geom_histogram() of ggplot2.From the visualization we can see that the age of the participants ranges from 15 to 60 approximately.
ggplot(data=participants_data,
aes(x = age)) +
geom_histogram(bins=20,
color = "grey25",
fill = "grey90")+
ggtitle("Distribution of Age")+
ylab("No. of Participants") +
theme_ipsum(grid ="Y")
To get the definite minimum and maximum age values, let us use a boxplot.
ggplot(data = participants_data,
aes(y = "", x = age)) +
geom_boxplot() +
stat_summary(geom = "point",
fun = mean,
colour = "red",
size = 4) +
ylab("")
From the boxplot, it is clear that the dataset contains participants in the range 10 to 60 which is understandably the desired range to consider for city planning. This means that we do not get an idea about the proportion of kids below the age of 10 and senior residents above the age of 60 from this dataset.
The mean and median of the data also coincide indicating that the Participants in the age range 10-60 are symmetrically distributed.
Let us further update the age histogram to show the median and mean and use it as the final visualization for age demographic.
p1 <- ggplot(data=participants_data,
aes(x = age)) +
geom_histogram(bins=20,
color = "grey25",
fill = "light blue")+
ggtitle("Distribution of Age")+
ylab("No. of\nParticipants") +
theme_ipsum(axis_title_size = 12,
grid ="Y") +
theme(axis.title.y = element_text(angle=0))+
geom_vline(aes(xintercept=mean(age,na.rm=T)),
color="red",
linetype="dashed",
size=1) +
geom_vline(aes(xintercept=median(age,na.rm=T)),
color="grey30",
linetype="dashed",
size=1)+
geom_text((aes(x= mean(age,na.rm=T)+1,
label="mean",
y = 75,
angle =90)),
color= "red")+
geom_text((aes(x= median(age,na.rm=T)-1,
label="median",
y = 75,
angle =90)),
color = "grey30")
print(p1)
Let us look at the education level of participants using a bar graph. The code chunk below generates a bar graph of the education levels of the participants.
ggplot(data=participants_data,
aes(x=reorder(educationLevel,educationLevel,function(x)length(x))))+
geom_bar() +
coord_flip() +
ylab("No of Participants") +
xlab("Education Level") +
theme(axis.title.y=element_text(angle =0,
margin=margin(r=-70)),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()) +
geom_text(stat="count",
aes(label=paste0(..count..,", ",
round(..count../sum(..count..)*100,
1),"%")),
hjust=1,
color = "white")
Though the graph provides an idea aboout the distribution of Education Level among the participants, the Education Levels can be considered ordinal with Low as the lowest level and graduate as the highest level. Let us view the above graph after fixing the order of the Education Level. We will also change the colors to go with the Age distribution histogram.
level_order <- c('Low','HighSchoolOrCollege','Bachelors','Graduate')
p2 <- ggplot(data=participants_data,
aes(x=factor(educationLevel,levels = level_order)))+
geom_bar(color = "grey25",
fill="light blue") +
coord_flip() +
ylab("No of Participants") +
xlab("Education Level") +
theme_ipsum(axis_title_size = 12, grid ="X")+
theme(axis.title.y=element_text(angle =0,
margin=margin(r=-70)),
axis.line.x = element_line(color="grey25", size = 0.02)) +
geom_text(stat="count",
aes(label=paste0(..count..,", ",
round(..count../sum(..count..)*100,
1),"%")),
hjust=1,
color = "grey25",
size=3.5)+
ggtitle("Distribution of Education")
print(p2)
The below code chunk generates a bar graph indicating the distribution of household sizes among the participants.
From the bar graph we can see that majority of the participants come from households with 2 members. Closely behind are households with 1 and 2 members.
p3 <- ggplot(data=participants_data,
aes(x=householdSize))+
geom_bar(color = "grey25",
fill="light blue") +
ylab("No of\nParticipants") +
xlab("Household Size") +
ylim(0,400) +
theme_ipsum(axis_title_size = 12, grid ="Y")+
theme(axis.title.y=element_text(angle =0,
margin=margin(r=20))) +
geom_text(stat="count",
aes(label=paste0(..count..,", ",
round(..count../sum(..count..)*100,
1),"%")),
vjust=-0.5,
color = "grey25",
size=3.5)+
ggtitle("Distribution of Household Size")
print(p3)
p4 <- ggplot(data=participants_data,
aes(x=haveKids))+
geom_bar(color = "grey25",
fill="light blue") +
ylab("No of\nParticipants") +
xlab("Have Kids") +
ylim(0,750) +
theme_ipsum(axis_title_size = 12, grid ="Y")+
theme(axis.title.y=element_text(angle =0,
margin=margin(r=20))) +
geom_text(stat="count",
aes(label=paste0(..count..,", ",
round(..count../sum(..count..)*100,
1),"%")),
vjust=-0.5,
color = "grey25",
size=3.5)+
ggtitle("Participants with and without kids")
print(p4)
Let us understand the joviality distribution of participants using a density plot
p5 <- ggplot(participants_data,
aes(x=joviality))+
geom_density(color = "grey25",
fill="light Blue")+
ggtitle("Distribution of Joviality")+
theme_ipsum(axis_title_size = 12, grid="Y")+
theme(axis.title.y = element_text(angle=0))+
scale_y_continuous(expand = c(0,0),limits = c(0,1.5))
print(p5)
From the Density plot below, we can see that more participants with kids have higher joviality than those without kids.
ggplot(participants_data,
aes(x = joviality, y = haveKids)) +
geom_density_ridges(rel_min_height =0.01, scale = 0.95,
color = "grey25",
fill = "light blue")+
ylab("Have Kids") +
xlim(0,1)+
theme_ipsum(axis_title_size = 12,
grid = "X")+
theme(axis.title.y=element_text(angle =0,
margin=margin(r=-30)),
axis.line.x = element_line(color="grey25", size = 0.02))+
ggtitle("Joviality by kids")
From the Density plot below, we can see that more participants with 3 household members have higher joviality than those 2 or 1 members.
ggplot(participants_data,
aes(x = joviality, y = factor(householdSize))) +
geom_density_ridges(rel_min_height =0.01, scale = 0.95,
color = "grey25",
fill = "light blue")+
ylab("Household Size") +
xlim(0,1)+
theme_ipsum(axis_title_size = 12,
grid = "X")+
theme(axis.title.y=element_text(angle =0,
margin=margin(r=-10)),
axis.line.x = element_line(color="grey25", size = 0.02))+
ggtitle("Joviality by Household Size")
Let us see if a violin plot gives more insights about joviality of participants from different household sizes. It is more obvious from violin plot that more participants with 3 household members have a higher joviality index.
p6 <- ggplot(data = participants_data,
aes(y = joviality,
x= factor(householdSize))) +
geom_violin(color ="grey25",
fill="light blue") +
xlab("Household Size")+
ylab("Joviality")+
theme_ipsum(axis_title_size=12, grid ="Y")+
theme(axis.title.y = element_text(angle=0))+
ggtitle("Joviality index by Household size")
print(p6)
We will use a bar chart to see if the Interest Group of Participants depends on whether they have kids or not. Plotting the no. of participants does not give us a true picture because from the previous visualizations, we have seen that almost 2/3rd of the participants do not have kids. Let us take a look at the same graph using percentage.
ggplot(data=participants_data,
aes(x=interestGroup, fill=haveKids))+
geom_bar(position="dodge")+
theme_ipsum(grid = "Y",axis_title_size =12)+
ylab("No. of\nParticipants")+
xlab("Interest Group")+
theme(axis.title.y = element_text(angle=0))
We can draw similar insights from the percent bar graph, that participants are symmetrically distributed amoong the interest groups and the fact whether they have kids or not does not seem to influence much.
ggplot(data=participants_data,
aes(x=interestGroup, fill=haveKids))+
geom_bar(position="fill")+
theme_ipsum(grid = "Y",axis_title_size =12)+
ylab("Percent of\nParticipants")+
xlab("Interest Group")+
theme(axis.title.y = element_text(angle=0))
Selecting the graphs from above which better depict the demographic of Engagement, Ohio USA, let us combine them into a single Visual which can be used to represent the city. From the below visualizations, we can infer the following
patchwork <- (p1 + p2)/(p3 + p4)/(p5 + p6)
patchwork + plot_annotation(
title = "Demographics of Engagement, Ohio, USA",
subtitle = "The below plots reveal the Demographics of the city",
theme = theme(plot.title = element_text(size = 22, face="bold"),
plot.background = element_rect(fill = NA,
colour = 'black',
size = 1),
plot.margin = margin(2,2,1,1,"cm"))
)
Population of Engagement Ohio USA is symmetrically distributed within the range of 10 - 60, and we do not have data about the population outside this range.
More than 50% of the population has high school level education with less than 10% having lower educational qualification.
Households have size of 1,2 or 3 with 2 being the most common.
70% of the population do not have kids.
Joviality index of the residents is on the lower side with higher joviality seen in people from larger households or households with kids.