In this computing club session, we will cover visualizations in R. We will explore a variety of graphing methods in ggplot as well as learn about alternative graphing options in the ggpubr and survminer packages.
Some key points:
+
(we will cover this)
This is a sample selection of some of the most common components that I find myself using in practice. For a more comprehensive list please refer to: https://ggplot2.tidyverse.org/reference/
+ (%+%): add components to a plot
Now we will walk through a series of graphs using the ggplot syntax. We will first begin by building a graph one step at a time, drawing on the inherent layering options within the package.
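As a minimal sketch of what "adding components" means (the object names `base` and `with_points` are illustrative, not from the session), a plot can be stored in a variable and extended with `+` one component at a time:

```r
library(ggplot2)

# Start with data and aesthetic mappings only; no geoms are drawn yet
base <- ggplot(midwest) +
  aes(x = state, y = percprof)

# Each `+` returns a new plot object with one more component attached
with_points <- base + geom_point()

# Printing the object draws the plot
with_points
```

Because `base` is left unchanged, the same starting point can be reused to try out different geoms.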
Install either the tidyverse or the ggplot2 package:
#install.packages("tidyverse") OR install.packages("ggplot2")
library(ggplot2)
library(rmarkdown)
setwd("/Users/Vicky/Desktop") # change this to your own working directory
Let’s use the midwest dataset
The midwest dataset describes the demographic information of midwest counties (437 rows and 28 variables):
data("midwest")
head(midwest)
midwestdat <- midwest
Let’s build a boxplot one step at a time. We are interested in visualizing the distribution of the percent of professionals across counties within each of the midwest states.
Call the dataset
ggplot(midwestdat)
Designate the variables for the x and y axes
ggplot(midwestdat) +
aes(x = state) +
aes(y = percprof)
Add the data points in point form, using jitter to increase the spread
ggplot(midwestdat) +
aes(x = state) +
aes(y = percprof) +
geom_jitter(alpha = .5, height = 0, width = .25)
Add a different color for the data belonging to each of the states
ggplot(midwestdat) +
aes(x = state) +
aes(y = percprof) +
geom_jitter(alpha = .5, height = 0, width = .25) +
aes(col = state)
Overlay boxplots on each of the states
ggplot(midwestdat) +
aes(x = state) +
aes(y = percprof) +
geom_jitter(alpha = .5, height = 0, width = .25) +
aes(col = state) +
geom_boxplot(alpha = .25)
Add title, label axes, and change theme
ggplot(midwestdat) +
aes(x = state) +
aes(y = percprof) +
geom_jitter(alpha = .5, height = 0, width = .25) +
aes(col = state) +
geom_boxplot(alpha = .25)+
theme_bw() +
xlab("State") +
ylab("Percent Professionals") +
labs(colour = "State") +
ggtitle("Distribution of the Percentage of Professionals by Midwest State")
CENTER the title!
ggplot(midwestdat) +
aes(x = state) +
aes(y = percprof) +
geom_jitter(alpha = .5, height = 0, width = .25) +
aes(col = state) +
geom_boxplot(alpha = .25)+
theme_bw() +
xlab("State") +
ylab("Percent Professionals") +
labs(colour = "State") +
ggtitle("Distribution of the Percentage of Professionals by Midwest State") +
theme(plot.title = element_text(hjust = 0.5))
Different color scheme
ggplot(midwestdat) +
aes(x = state) +
aes(y = percprof) +
geom_jitter(alpha = .5, height = 0, width = .25) +
aes(col = state) +
geom_boxplot(alpha = .25)+
theme_bw() +
xlab("State") +
ylab("Percent Professionals") +
labs(colour = "State") +
ggtitle("Distribution of the Percentage of Professionals by Midwest State") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_brewer(palette="Dark2")
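The separate `aes()` calls above are handy for building a plot step by step. Once the plot is settled, the same mappings can be collected in one `aes()` call. The sketch below is one possible compact version, not the session's own code. The object name `compact` and the choice to hide the boxplot's outlier markers are assumptions of mine; since the jitter layer already draws every county, the outliers would otherwise appear twice.

```r
library(ggplot2)

# Same mappings as the step-by-step version, collected in a single aes() call
compact <- ggplot(midwest, aes(x = state, y = percprof, colour = state)) +
  geom_jitter(alpha = .5, height = 0, width = .25) +
  # outlier.shape = NA hides the boxplot's own outlier markers (optional choice)
  geom_boxplot(alpha = .25, outlier.shape = NA) +
  theme_bw() +
  labs(x = "State", y = "Percent Professionals", colour = "State",
       title = "Distribution of the Percentage of Professionals by Midwest State") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_colour_brewer(palette = "Dark2")

compact
```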
Next we want to visualize the population densities of the midwest states.
Call the dataset
ggplot(data = midwestdat)
Designate the variables for the x and y axes
ggplot(data = midwestdat) +
aes(x = state) +
aes(y = popdensity)
Introduce the variable to fill the bars with
ggplot(data = midwestdat) +
aes(x = state, y = popdensity) +
aes(fill = state)
Add in bars using geom_col()
ggplot(data = midwestdat) +
aes(x = state, y = popdensity) +
aes(fill = state) +
geom_col()
But what if we wanted the states listed horizontally on the y-axis? Use coord_flip()!
ggplot(data = midwestdat) +
aes(x = state, y = popdensity) +
aes(fill = state) +
geom_col() +
coord_flip()
Change title, axes, and legend
ggplot(data = midwestdat) +
aes(x = state, y = popdensity) +
aes(fill = state) +
geom_col() +
coord_flip() +
xlab("State") +
ylab("Population Density") +
scale_fill_discrete(name = "State") +
ggtitle("Population Densities of Midwest States") +
theme(plot.title = element_text(hjust = 0.5))
Almost there: we just need to change the colors and get rid of the unappealing scientific notation on the population density axis.
# discourage scientific notation in axis labels
options(scipen=10000)
ggplot(data = midwestdat) +
aes(x = state, y = popdensity) +
aes(fill = state) +
geom_col() +
coord_flip() +
xlab("State") +
ylab("Population Density") +
ggtitle("Population Densities of Midwest States") +
theme(plot.title = element_text(hjust = 0.5)) +
# a plot keeps only one fill scale, so the legend title is set here
scale_fill_brewer(name = "State", palette="YlGnBu")
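One thing to keep in mind when reading this chart: the midwest data has one row per county, and `geom_col()` stacks those rows, so each bar's length is the sum of the county-level densities. The sketch below first reproduces those sums, then shows one possible state-level alternative: total population divided by total area. The alternative is my assumption about what a state-level density could mean, not something from the session, and the names `bar_lengths`, `state_totals`, and `state_plot` are illustrative.

```r
library(ggplot2)

# What the bars above show: the sum of county-level densities per state
bar_lengths <- aggregate(popdensity ~ state, data = midwest, FUN = sum)

# Assumed alternative: total population divided by total area per state
state_totals <- aggregate(cbind(poptotal, area) ~ state, data = midwest, FUN = sum)
state_totals$density <- state_totals$poptotal / state_totals$area

state_plot <- ggplot(state_totals) +
  aes(x = state, y = density, fill = state) +
  geom_col() +
  coord_flip() +
  labs(x = "State", y = "Total population / total area", fill = "State")

state_plot
```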
Let’s use the mpg dataset
The mpg dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car (234 rows and 11 variables):
data("mpg")
head(mpg)
mpgdat <- mpg
We want to construct a grid of scatterplots of city miles per gallon vs. highway miles per gallon stratified by cylinder type and clustered by drive type.
Call the dataset
ggplot(mpgdat)
Designate the variables for the x and y axes
library(forcats)
ggplot(mpgdat) +
aes(x = hwy) +
aes(y = cty)
Specify the variable to stratify by or facet and impose free y axis scale
ggplot(mpgdat) +
aes(x = hwy) +
aes(y = cty) +
facet_wrap(~ cyl, scales = "free_y", nrow = 2)
Add in the data points and color/cluster by drive type
ggplot(mpgdat) +
aes(x = hwy) +
aes(y = cty) +
facet_wrap(~ cyl, scales = "free_y", nrow = 2) +
geom_jitter(size = 1, mapping = aes(col = fct_inorder(drv)), width = 1, height = .5)
Let’s customize the legend labels, add axis labels and a title, and change the colors.
mpgdat$drv <- as.factor(mpgdat$drv)
levels(mpgdat$drv) <- c("4wd", "Front-wheel drive", "Rear-wheel drive")
ggplot(mpgdat) +
aes(x = hwy) +
aes(y = cty) +
facet_wrap(~ cyl, scales = "free_y", nrow = 2) +
geom_jitter(size = 1, mapping = aes(col = drv), width = 1, height = .5) +
xlab("Highway MPG") +
ylab("City MPG") +
ggtitle("City miles per gallon vs. highway miles per gallon, stratified by cylinder type and clustered by drive type") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(color=guide_legend("Drive Type")) +
scale_colour_brewer(palette="Paired")
The title is not quite right. I like to use "\n" to force the title onto two lines.
mpgdat$drv <- as.factor(mpgdat$drv)
levels(mpgdat$drv) <- c("4wd", "Front-wheel drive", "Rear-wheel drive")
ggplot(mpgdat) +
aes(x = hwy) +
aes(y = cty) +
facet_wrap(~ cyl, scales = "free_y", nrow = 2) +
geom_jitter(size = 1, mapping = aes(col = drv), width = 1, height = .5) +
xlab("Highway MPG") +
ylab("City MPG") +
ggtitle("City miles per gallon vs. highway miles per gallon, \n stratified by cylinder type and clustered by drive type") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(color=guide_legend("Drive Type")) +
scale_colour_brewer(palette="Paired")
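The facet strips in these plots still show the raw cyl values (4, 5, 6, 8). One way to relabel them is the `labeller` argument of `facet_wrap()`. This is a sketch only: the label text and the names `cyl_labels` and `facet_plot` are made up for illustration.

```r
library(ggplot2)

# Hypothetical label text, keyed by the values of cyl in mpg (4, 5, 6, 8)
cyl_labels <- c("4" = "4 cylinders", "5" = "5 cylinders",
                "6" = "6 cylinders", "8" = "8 cylinders")

facet_plot <- ggplot(mpg) +
  aes(x = hwy, y = cty) +
  geom_jitter(size = 1, width = 1, height = .5) +
  # labeller() maps each facet value to the text shown in its strip
  facet_wrap(~ cyl, scales = "free_y", nrow = 2,
             labeller = labeller(cyl = cyl_labels))

facet_plot
```

For a quicker option, `labeller = label_both` shows the variable name together with its value in each strip.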
Now let’s build two more plots step by step.
Let’s use the diamonds dataset
The diamonds dataset describes prices and other attributes of almost 54,000 round cut diamonds (53940 rows and 10 variables):
data(diamonds)
head(diamonds)
diamonddat <- diamonds
Which layers do we need to plot price vs. carat and color by clarity?
ggplot(diamonddat) +
aes(x = carat) +
aes(y = price) +
aes(col = clarity) +
geom_point()
Change the axes and plot titles?
ggplot(diamonddat) +
aes(x = carat) +
aes(y = price) +
aes(col = clarity) +
geom_point() +
ylab("Price (USD)") +
xlab("Carat (diamond weight)") +
ggtitle("Relationship of Price vs. Carat by Diamond Clarity") +
theme(plot.title = element_text(hjust = 0.5))
Change the legend title and color scheme?
ggplot(diamonddat) +
aes(x = carat) +
aes(y = price) +
aes(col = clarity) +
geom_point() +
ylab("Price (USD)") +
xlab("Carat (diamond weight)") +
ggtitle("Relationship of Price vs. Carat by Diamond Clarity") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(color=guide_legend("Clarity")) +
scale_colour_brewer(palette="RdPu")
Add a smoother to visualize the trend in the scatterplot
ggplot(diamonddat) +
aes(x = carat) +
aes(y = price) +
geom_point() +
ylab("Price (USD)") +
xlab("Carat (diamond weight)") +
ggtitle("Relationship of Price vs. Carat by Diamond Clarity") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_colour_brewer(palette="RdPu") +
geom_smooth()
Add smoother for each clarity group
ggplot(diamonddat) +
aes(x = carat) +
aes(y = price) +
aes(col = clarity) +
geom_point() +
ylab("Price (USD)") +
xlab("Carat (diamond weight)") +
ggtitle("Relationship of Price vs. Carat by Diamond Clarity") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(color=guide_legend("Clarity")) +
scale_colour_brewer(palette="RdPu") +
geom_smooth()
Remove the standard error (SE) bands around the smoothers
ggplot(diamonddat) +
aes(x = carat) +
aes(y = price) +
aes(col = clarity) +
geom_point() +
ylab("Price (USD)") +
xlab("Carat (diamond weight)") +
ggtitle("Relationship of Price vs. Carat by Diamond Clarity") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(color=guide_legend("Clarity")) +
scale_colour_brewer(palette="RdPu") +
geom_smooth(se = FALSE)
Display smoothing curves without the data points
ggplot(diamonddat) +
aes(x = carat) +
aes(y = price) +
aes(col = clarity) +
ylab("Price (USD)") +
xlab("Carat (diamond weight)") +
ggtitle("Relationship of Price vs. Carat by Diamond Clarity") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(color=guide_legend("Clarity")) +
scale_colour_brewer(palette="RdPu") +
geom_smooth(se = FALSE)
ggplot(diamonddat, aes(x=price)) + geom_density()
ggplot(diamonddat, aes(x=price, color=clarity)) + geom_density() + scale_color_brewer(palette = "Spectral")
ggplot(diamonddat, aes(x=color, y=price, fill = color)) + geom_boxplot() + scale_fill_brewer(palette = "YlGnBu")
ggplot(diamonddat, aes(x=color, y=price, fill = color)) + geom_boxplot() + scale_fill_brewer(palette = "YlGnBu") + scale_y_log10()
ggplot(diamonddat, aes(x=color, y=price, fill = color)) + geom_violin() + scale_fill_brewer(palette = "YlGnBu") + scale_y_log10()
ggplot(diamonddat, aes(x=color, y=price, fill = color)) + geom_violin(trim = FALSE) + geom_boxplot(width = 0.1, fill = "white") + scale_fill_brewer(palette = "YlGnBu") + scale_y_log10()
Using the diamonds dataset, create this graph:
ggplot(diamonddat, aes(x=price, fill = clarity)) + geom_histogram(binwidth=200) +
facet_wrap(~ clarity, scales="free_y") +
xlab("Price (USD)") +
ylab("Frequency") +
ggtitle("Frequency of round cut diamond prices, stratified by clarity") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(name = "Clarity", palette = "Spectral")
data("WorldPhones")
head(WorldPhones)
phonedat <- WorldPhones
library(reshape2)
phonedat <- melt(phonedat)
colnames(phonedat) <- c("Year", "Continent", "Phones")
ggplot(phonedat, aes(x=Year, y=Phones, color=Continent)) + geom_line()
ggplot(phonedat, aes(x=Year, y=Phones, color=Continent)) + geom_line() + scale_y_log10() + scale_x_continuous(breaks=seq(1951,1961,1))
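WorldPhones is a matrix with one row per year and one column per region, so it has to be reshaped into long format (one row per year and region combination) before ggplot can map Year, Continent, and Phones. The session does this with `melt()`. The sketch below shows a base R alternative for anyone without reshape2 installed; the name `phones_long` is illustrative.

```r
data("WorldPhones")

# as.table() + as.data.frame() turns the matrix into one row per cell
phones_long <- as.data.frame(as.table(WorldPhones), stringsAsFactors = FALSE)
names(phones_long) <- c("Year", "Continent", "Phones")

# The years come from the matrix row names, so convert them from text to integers
phones_long$Year <- as.integer(phones_long$Year)

head(phones_long)
```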
Let’s use the mtcars dataset:
This dataset was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). It contains 32 observations and 11 variables.
library(ggcorrplot)
data("mtcars")
head(mtcars)
mtdat <- mtcars
corr <- round(cor(mtcars), 1)
ggcorrplot(corr, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="square",
colors = c("palevioletred2", "thistle2", "darkolivegreen4"),
title="Correlogram of the variables of the mtcars dataset",
ggtheme=theme_bw) +
theme(plot.title = element_text(hjust = 0.5))
(Similar to the earlier example, but we will add p-values here)
library(ggpubr)
data("ToothGrowth")
head(ToothGrowth)
toothdat <- ToothGrowth
rownames(mtdat)
ggboxplot(toothdat, x = "dose", y = "len",
color = "dose", palette =c("palevioletred2", "steelblue2", "olivedrab"),
add = "jitter", shape = "dose")
selectedcomp <- list(c("0.5", "1"), c("1", "2"), c("0.5", "2"))
ggboxplot(toothdat, x = "dose", y = "len",
color = "dose", palette =c("palevioletred2", "steelblue2", "olivedrab"),
add = "jitter", shape = "dose") +
stat_compare_means(comparisons = selectedcomp) + stat_compare_means(label.y = 50)
ggviolin(toothdat, x = "dose", y = "len", fill = "dose",
palette = c("palevioletred2", "steelblue2", "olivedrab"),
add = "boxplot", add.params = list(fill = "white")) +
stat_compare_means(comparisons = selectedcomp, label = "p.signif") +
stat_compare_means(label.y = 50)
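To see where the p-values on these plots come from: by default, `stat_compare_means()` uses a Wilcoxon test for pairwise comparisons and a Kruskal-Wallis test for the overall comparison across groups. The base R sketch below runs the corresponding tests directly. The object names are illustrative, and the results should correspond to the plot annotations, though I have not matched them digit for digit.

```r
data("ToothGrowth")

# Overall comparison across all three doses (Kruskal-Wallis)
global_test <- kruskal.test(len ~ dose, data = ToothGrowth)

# One pairwise comparison, dose 0.5 vs. 1 (Wilcoxon rank-sum);
# exact = FALSE avoids a warning caused by tied values of len
low_vs_mid <- wilcox.test(len ~ dose,
                          data = subset(ToothGrowth, dose %in% c(0.5, 1)),
                          exact = FALSE)

global_test$p.value
low_vs_mid$p.value
```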
mtdat$cyl <- as.factor(mtdat$cyl)
mtdat$type <- rownames(mtdat)
ggbarplot(mtdat, x = "type", y = "mpg",
fill = "cyl",
color = "white",
palette = c("palevioletred2", "steelblue2", "olivedrab"),
sort.val = "desc",
sort.by.groups = FALSE,
x.text.angle = 90
)
ggbarplot(mtdat, x = "type", y = "mpg",
fill = "cyl",
color = "white",
palette = c("palevioletred2", "steelblue2", "olivedrab"),
sort.val = "asc",
sort.by.groups = TRUE,
x.text.angle = 90
)
ggdotchart(mtdat, x = "type", y = "mpg",
color = "cyl",
palette = c("palevioletred2", "steelblue2", "olivedrab"),
sorting = "ascending",
add = "segments",
ggtheme = theme_pubr()
)
ggdotchart(mtdat, x = "type", y = "mpg",
color = "cyl",
palette =c("palevioletred2", "steelblue2", "olivedrab"),
sorting = "descending",
add = "segments",
rotate = TRUE,
group = "cyl",
dot.size = 6,
label = round(mtdat$mpg),
font.label = list(color = "white", size = 9,
vjust = 0.5),
ggtheme = theme_pubr()
)
ggdotchart(mtdat, x = "type", y = "mpg",
color = "cyl",
palette = c("palevioletred2", "steelblue2", "olivedrab"),
sorting = "descending",
rotate = TRUE,
dot.size = 2,
y.text.col = TRUE,
ggtheme = theme_pubr()) + theme_cleveland()
ggscatter(mtdat, x = "wt", y = "mpg",
add = "reg.line",
conf.int = TRUE,
color = "cyl", palette = c("palevioletred2", "steelblue2", "olivedrab"),
shape = "cyl") + stat_cor(aes(color = cyl), label.x = 3)
ggscatter(mtdat, x = "wt", y = "mpg",
add = "reg.line",
color = "cyl", palette = c("palevioletred2", "steelblue2", "olivedrab"),
shape = "cyl",
fullrange = TRUE,
rug = TRUE) +
stat_cor(aes(color = cyl), label.x = 3)
#ellipse = TRUE: Draw ellipses around groups
#ellipse.level: The size of the concentration ellipse in normal probability. Default is 0.95
#ellipse.type: Ellipse types
ggscatter(mtdat, x = "wt", y = "mpg",
color = "cyl", palette = c("palevioletred2", "steelblue2", "olivedrab"),
shape = "cyl",
ellipse = TRUE)
ggscatter(mtdat, x = "wt", y = "mpg",
color = "cyl", palette = c("palevioletred2", "steelblue2", "olivedrab"),
shape = "cyl",
ellipse = TRUE,
mean.point = TRUE,
star.plot = TRUE)
#label: the name of the column containing point labels
#font.label: a list which can contain the combination of the following elements: the size (e.g.: 14), the style (e.g.: “plain”, “bold”, “italic”, “bold.italic”) and the color (e.g.: “red”) of labels
#label.select: character vector specifying some labels to show
#repel = TRUE: avoid label overlapping
ggscatter(mtdat, x = "wt", y = "mpg",
color = "cyl", palette = c("palevioletred2", "steelblue2", "olivedrab"),
label = "type", repel = TRUE)
library(survival)
library(survminer)
survival <- survfit(Surv(time, status) ~ adhere, data = colon)
ggsurvplot(survival, data = colon,
palette = c("steelblue2", "olivedrab"),
pval = TRUE, pval.coord = c(500, 0.4),
risk.table = TRUE)
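The p-value that `pval = TRUE` prints on the survival plot comes from a log-rank test comparing the groups. As a sketch, the same test can be run directly with `survdiff()` from the survival package; the name `logrank` is illustrative.

```r
library(survival)

# Log-rank test comparing the two adhere groups
logrank <- survdiff(Surv(time, status) ~ adhere, data = colon)
logrank

# p-value from the chi-squared statistic (1 degree of freedom for two groups)
pchisq(logrank$chisq, df = 1, lower.tail = FALSE)
```

Note that colon holds two records per patient (one for recurrence, one for death, distinguished by etype), so subsetting to a single event type before fitting may be worth considering.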
Further ggpubr and Plotly!!
https://timogrossenbacher.ch/2016/12/beautiful-thematic-maps-with-ggplot2-only/
https://evamaerey.github.io/ggplot_flipbook/ggplot_flipbook_xaringan.html#1
http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/
http://varianceexplained.org/RData/code/code_lesson2/
http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Animated%20Bubble%20Plot