Overview

In this Take-home exercise, we will apply the skills learnt in Visual Analytics Lesson 1 to reveal the demographics of the City of Engagement, Ohio USA by using appropriate static statistical graphics methods. We will process the data using tidyverse family of packages. We will then use ggplot2 and its extensions to prepare the statistical graphics to understand the demographics of the city of Engagement, Ohio USA.

Dataset

The dataset used is available for download here.

Step by Step Description

Following are the steps taken to undesrtand the demographics of the region.

Getting Started

We will first load the required packages using the below code chunk

packages = c('tidyverse','ggdist','gghalves','ggthemes','hrbrthemes','ggridges','patchwork')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Importing Data

The code chunk below imports Participants.csv from the data folder by using read_csv() function of readr package into R and save it as a tibble dataframe called participants_data.

participants_data <- read_csv("data/Participants.csv")

Overview of the data

Lets take a look at the data. Brief glance at the columns available in the Participants data shows that it contains household size, age, education level, have kids and joviality. All these are useful attributes to understand the demographics of a region. Let us look at each of these attributes individually to analyse their usefulness further.

head(participants_data,10,7)

# A tibble: 10 x 7
   participantId householdSize haveKids   age educationLevel     
           <dbl>         <dbl> <lgl>    <dbl> <chr>              
 1             0             3 TRUE        36 HighSchoolOrCollege
 2             1             3 TRUE        25 HighSchoolOrCollege
 3             2             3 TRUE        35 HighSchoolOrCollege
 4             3             3 TRUE        21 HighSchoolOrCollege
 5             4             3 TRUE        43 Bachelors          
 6             5             3 TRUE        32 HighSchoolOrCollege
 7             6             3 TRUE        26 HighSchoolOrCollege
 8             7             3 TRUE        27 Bachelors          
 9             8             3 TRUE        20 Bachelors          
10             9             3 TRUE        35 Bachelors          
# ... with 2 more variables: interestGroup <chr>, joviality <dbl>

Understanding the distribution of age

The code chunk below creates a histogram of age by using geom_histogram() of ggplot2.From the visualization we can see that the age of the participants ranges from 15 to 60 approximately.

ggplot(data=participants_data,
       aes(x = age)) +
  geom_histogram(bins=20,
                 color = "grey25",
                 fill = "grey90")+
  ggtitle("Distribution of Age")+
  ylab("No. of Participants") +
  theme_ipsum(grid ="Y")

To get the definite minimum and maximum age values, let us use a boxplot.

ggplot(data = participants_data,
      aes(y = "", x = age)) +
  geom_boxplot() +
  stat_summary(geom = "point",
              fun = mean,
              colour = "red",
              size = 4) +
  ylab("")

From the boxplot, it is clear that the dataset contains participants in the range 10 to 60 which is understandably the desired range to consider for city planning. This means that we do not get an idea about the proportion of kids below the age of 10 and senior residents above the age of 60 from this dataset.

The mean and median of the data also coincide indicating that the Participants in the age range 10-60 are symmetrically distributed.

Let us further update the age histogram to show the median and mean and use it as the final visualization for age demographic.

p1 <- ggplot(data=participants_data,
       aes(x = age)) +
  geom_histogram(bins=20,
                 color = "grey25",
                 fill = "light blue")+
  ggtitle("Distribution of Age")+
  ylab("No. of\nParticipants") +
  theme_ipsum(axis_title_size = 12,
              grid ="Y") +
  theme(axis.title.y = element_text(angle=0))+
  geom_vline(aes(xintercept=mean(age,na.rm=T)),
              color="red",
              linetype="dashed",
              size=1) +
  geom_vline(aes(xintercept=median(age,na.rm=T)),
              color="grey30",
              linetype="dashed",
              size=1)+
  geom_text((aes(x= mean(age,na.rm=T)+1,
                 label="mean",
                 y = 75,
                 angle =90)),
            color= "red")+
  geom_text((aes(x= median(age,na.rm=T)-1,
                 label="median",
                 y = 75,
                 angle =90)),
            color = "grey30")
print(p1)

Understanding the distribution of Education Level

Let us look at the education level of participants using a bar graph. The code chunk below generates a bar graph of the education levels of the participants.

ggplot(data=participants_data,
      aes(x=reorder(educationLevel,educationLevel,function(x)length(x))))+
  geom_bar() +
  coord_flip() +
  ylab("No of Participants") +
  xlab("Education Level") +
  theme(axis.title.y=element_text(angle =0,
                                  margin=margin(r=-70)),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank()) +
  geom_text(stat="count",
    aes(label=paste0(..count..,", ",
    round(..count../sum(..count..)*100,
      1),"%")),
    hjust=1,
    color = "white")

Though the graph provides an idea aboout the distribution of Education Level among the participants, the Education Levels can be considered ordinal with Low as the lowest level and graduate as the highest level. Let us view the above graph after fixing the order of the Education Level. We will also change the colors to go with the Age distribution histogram.

level_order <- c('Low','HighSchoolOrCollege','Bachelors','Graduate')
p2 <- ggplot(data=participants_data,
      aes(x=factor(educationLevel,levels = level_order)))+
  geom_bar(color = "grey25",
           fill="light blue") +
  coord_flip() +
  ylab("No of Participants") +
  xlab("Education Level") +
  theme_ipsum(axis_title_size = 12, grid ="X")+
  theme(axis.title.y=element_text(angle =0,
                                  margin=margin(r=-70)),
        axis.line.x = element_line(color="grey25", size = 0.02)) +
  geom_text(stat="count",
    aes(label=paste0(..count..,", ",
    round(..count../sum(..count..)*100,
      1),"%")),
    hjust=1,
    color = "grey25",
    size=3.5)+
  ggtitle("Distribution of Education")

print(p2)

Understanding the household size of participants

The below code chunk generates a bar graph indicating the distribution of household sizes among the participants.

From the bar graph we can see that majority of the participants come from households with 2 members. Closely behind are households with 1 and 2 members.

p3 <- ggplot(data=participants_data,
      aes(x=householdSize))+
  geom_bar(color = "grey25",
           fill="light blue") +
  ylab("No of\nParticipants") +
  xlab("Household Size") +
  ylim(0,400) +
  theme_ipsum(axis_title_size = 12, grid ="Y")+
  theme(axis.title.y=element_text(angle =0,
                                  margin=margin(r=20))) +
  geom_text(stat="count",
    aes(label=paste0(..count..,", ",
    round(..count../sum(..count..)*100,
      1),"%")),
    vjust=-0.5,
    color = "grey25",
    size=3.5)+
  ggtitle("Distribution of Household Size")

print(p3)

Understanding the participants with and without kids

p4 <- ggplot(data=participants_data,
      aes(x=haveKids))+
  geom_bar(color = "grey25",
           fill="light blue") +
  ylab("No of\nParticipants") +
  xlab("Have Kids") +
  ylim(0,750) +
  theme_ipsum(axis_title_size = 12, grid ="Y")+
  theme(axis.title.y=element_text(angle =0,
                                  margin=margin(r=20))) +
  geom_text(stat="count",
    aes(label=paste0(..count..,", ",
    round(..count../sum(..count..)*100,
      1),"%")),
    vjust=-0.5,
    color = "grey25",
    size=3.5)+
  ggtitle("Participants with and without kids")

print(p4)

Understanding the joviality of participants

Let us understand the joviality distribution of participants using a density plot

p5 <- ggplot(participants_data,
       aes(x=joviality))+
  geom_density(color = "grey25",
               fill="light Blue")+
  ggtitle("Distribution of Joviality")+
  theme_ipsum(axis_title_size = 12, grid="Y")+
  theme(axis.title.y = element_text(angle=0))+
  scale_y_continuous(expand = c(0,0),limits = c(0,1.5))

print(p5)

Understanding the joviality of participants with and without kids

From the Density plot below, we can see that more participants with kids have higher joviality than those without kids.

ggplot(participants_data,
       aes(x = joviality, y = haveKids)) +
  geom_density_ridges(rel_min_height =0.01, scale = 0.95,
                    color = "grey25",
                    fill = "light blue")+
  ylab("Have Kids") +
  xlim(0,1)+
  theme_ipsum(axis_title_size = 12,
              grid = "X")+
  theme(axis.title.y=element_text(angle =0,
                                  margin=margin(r=-30)),
        axis.line.x = element_line(color="grey25", size = 0.02))+
  ggtitle("Joviality by kids")

Understanding the joviality of participants by household size

From the Density plot below, we can see that more participants with 3 household members have higher joviality than those 2 or 1 members.

ggplot(participants_data,
       aes(x = joviality, y = factor(householdSize))) +
  geom_density_ridges(rel_min_height =0.01, scale = 0.95,
                    color = "grey25",
                    fill = "light blue")+
  ylab("Household Size") +
  xlim(0,1)+
  theme_ipsum(axis_title_size = 12,
              grid = "X")+
  theme(axis.title.y=element_text(angle =0,
                                  margin=margin(r=-10)),
        axis.line.x = element_line(color="grey25", size = 0.02))+
  ggtitle("Joviality by Household Size")

Let us see if a violin plot gives more insights about joviality of participants from different household sizes. It is more obvious from violin plot that more participants with 3 household members have a higher joviality index.

p6 <- ggplot(data = participants_data,
  aes(y = joviality,
  x= factor(householdSize))) +
  geom_violin(color ="grey25",
              fill="light blue") +
  xlab("Household Size")+
  ylab("Joviality")+
  theme_ipsum(axis_title_size=12, grid ="Y")+
  theme(axis.title.y = element_text(angle=0))+
  ggtitle("Joviality index by Household size")
print(p6)

Understanding the Interest Groups of Participants

We will use a bar chart to see if the Interest Group of Participants depends on whether they have kids or not. Plotting the no. of participants does not give us a true picture because from the previous visualizations, we have seen that almost 2/3rd of the participants do not have kids. Let us take a look at the same graph using percentage.

ggplot(data=participants_data,
      aes(x=interestGroup, fill=haveKids))+
  geom_bar(position="dodge")+
  theme_ipsum(grid = "Y",axis_title_size =12)+
  ylab("No. of\nParticipants")+
  xlab("Interest Group")+
  theme(axis.title.y = element_text(angle=0))

We can draw similar insights from the percent bar graph, that participants are symmetrically distributed amoong the interest groups and the fact whether they have kids or not does not seem to influence much.

ggplot(data=participants_data,
      aes(x=interestGroup, fill=haveKids))+
  geom_bar(position="fill")+
  theme_ipsum(grid = "Y",axis_title_size =12)+
  ylab("Percent of\nParticipants")+
  xlab("Interest Group")+
  theme(axis.title.y = element_text(angle=0))

Conclusion

Selecting the graphs from above which better depict the demographic of Engagement, Ohio USA, let us combine them into a single Visual which can be used to represent the city. From the below visualizations, we can infer the following

patchwork <- (p1 + p2)/(p3 + p4)/(p5 + p6)
patchwork + plot_annotation(
  title = "Demographics of Engagement, Ohio, USA",
  subtitle = "The below plots reveal the Demographics of the city",
  theme = theme(plot.title = element_text(size = 22, face="bold"),
                plot.background = element_rect(fill = NA,
                                               colour = 'black',
                                               size = 1),
                plot.margin = margin(2,2,1,1,"cm"))
)

Population of Engagement Ohio USA is symmetrically distributed within the range of 10 - 60, and we do not have data about the population outside this range.
More than 50% of the population has high school level education with less than 10% having lower educational qualification.
Households have size of 1,2 or 3 with 2 being the most common.
70% of the population do not have kids.
Joviality index of the residents is on the lower side with higher joviality seen in people from larger households or households with kids.

Take-home Exercise 01