This vignette illustrates how to use a MAD-FC transform to visualize fold change measurements in boxplots and violin plots. MAD-FC is a new transform for fold changes that distorts linear fold change measurements to match the scale of positive fold change measurements. For more information, please refer to the publication.
B. A. Corliss, Y. Wang, F. P. Driscoll, H. Shakeri, P. E. Bourne, MAD-FC: A Fold Change Visualization with Readability, Proportionality, and Symmetry (2023), doi:10.48550/arXiv.2303.10829.
Dataset Availability: This vignette uses a dataset from the RTCGA and RTCGA.mRNA packages that is included with DESeq2.
Code Reference: This tutorial uses code from the introductory vignette for DESeq2. “Analyzing RNA-seq data with DESeq2” by Michael I. Love, Simon Anders, and Wolfgang Huber.
Primary Packages: tidyverse, cowplot, BiocManager, ggpubr
Supporting Packages: RTCGA, RTCGA.mRNA
The MAD-FC source file is also required (mirrored_axis_distortion.R)
# Install required Base Packages
base_packages <- c("tidyverse", "cowplot","BiocManager","ggpubr")
install.packages(setdiff(base_packages, rownames(installed.packages())))
# Install required Bioconductor Packages
biocm_packages <- c("RTCGA", "RTCGA.mRNA")
bioc_installs <- setdiff(biocm_packages, rownames(installed.packages()))
if (length(bioc_installs)) {BiocManager::install(bioc_installs) }
# Load base packages
lapply(base_packages, library, character.only = TRUE)
# Load Bioconductor packages packages
lapply(biocm_packages, library, character.only = TRUE)
source(paste0(dirname(getwd()), "/R/mirrored_axis_distortion.R"))We processing the data with the following steps:
# Extract expression for 5 genes: GATA3, PTEN, XBP1, ESR1 and MUC1
# From the BRCA.mRNA, OV.mRNA, LUSC.mRNA datasets
expr <- expressionsTCGA(BRCA.mRNA, OV.mRNA, LUSC.mRNA,
extract.cols = c("GATA3", "PTEN", "XBP1","ESR1", "MUC1"))
expr$dataset <- gsub(pattern = ".mRNA", replacement = "", expr$dataset)
# Convert dataframe of RNA expresison from wide to long format
df_exp <- expr %>% pivot_longer(cols = !c(bcr_patient_barcode, dataset), names_to = "Gene", values_to = "log2_fc")
df_exp$fc <- 2^df_exp$log2_fc
df_exp$mad_fc <- contract1(fc_to_mfc(df_exp$fc))
head(df_exp)## # A tibble: 6 × 6
## bcr_patient_barcode dataset Gene log2_fc fc mad_fc
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 TCGA-A1-A0SD-01A-11R-A115-07 BRCA GATA3 2.87 7.31 6.31
## 2 TCGA-A1-A0SD-01A-11R-A115-07 BRCA PTEN 1.36 2.57 1.57
## 3 TCGA-A1-A0SD-01A-11R-A115-07 BRCA XBP1 2.98 7.91 6.91
## 4 TCGA-A1-A0SD-01A-11R-A115-07 BRCA ESR1 3.08 8.48 7.48
## 5 TCGA-A1-A0SD-01A-11R-A115-07 BRCA MUC1 1.65 3.14 2.14
## 6 TCGA-A1-A0SE-01A-11R-A084-07 BRCA GATA3 2.17 4.49 3.49
We now will visualize the same dataset with a boxplot where fold changes are linear, log2, and MAD-FC transformed. Each type of transform has unique characteristics for how it displays the fold change values.
We first produce a boxplot where fold changes are mapped to a linear scale (raw fold change units, y-axis). Notice that with this plot, negative fold changes are compressed between [0,1) and positive fold changes are between (1,inf). This limits the linear plot’s usefulness in comparing the magnitude of the positive and negative fold change values.
# Fold Change Linear Plot
# Linear FC Plot
g1 <- ggplot(data = df_exp, aes(x = Gene, y = fc, color = dataset)) +
geom_hline(yintercept = 1, size = .3, color = "grey70") +
geom_boxplot(size =1, outlier.size = .3,outlier.alpha = 0.5) +
ylab("FC") + xlab("Gene Name") +
theme_classic(base_size = 16) +
theme(panel.grid.minor = element_blank(),legend.position = "none")
g1Next we produce a boxplot of the same data where fold changes are mapped to a log2 scale. Log plots exhibit symmetry between positive and negative fold changes, allowing them to be compared based on their relative position to zero on the log scale (the point of no change). However, the non-linearity of the log transform makes it difficult to compare the magnitude among collections of positive fold change data points and separately, negative fold changes. It is also difficult to recover the original fold change value before the log transform.
# Log FC Plot
g2 <- ggplot(data = df_exp, aes(x = Gene, y = log2_fc, color = dataset)) +
geom_hline(yintercept = 0, size = .3, color = "grey70") +
geom_boxplot(size =1, outlier.size = .3,outlier.alpha = 0.5) +
ylab(expression(log[2]~(FC))) + xlab("Gene Name") +
theme_classic(base_size = 16) +
theme(panel.grid.minor = element_blank(),legend.position = "none")
g2Finally we produce a boxplot of the same data where fold changes are mapped to a MAD-FC scale (raw fold change units with negative fold change scale distorted to match the positive fold change scale). MAD plots are symmetrical and linear by design, making it easier to compare datapoints with different fold change direction, datapoints with same direction. The linear mapping makes it easy to recover the original datapoint position from the plot.
To create a MAD plot, first create a linear plot like above, and then call the function. gg_rev_axis_mfc(gg, axes, num_format) Where gg is the ggplot object, axis is the axis to be transformed, and num_format is the format used for the plot tick labels. This function extracts the axis labels from the gg plot object and reverses the MAD-FC transform to display the original FC units.
# MAD FC Plot
g3 <- ggplot(data = df_exp, aes(x = Gene, y = mad_fc, color = dataset)) +
geom_hline(yintercept = 1, size = .3, color = "grey70") +
geom_boxplot(size =1, outlier.size = .3,outlier.alpha = 0.5) +
scale_y_continuous(breaks = seq(-70,70,10),
labels = seq(-70,70,10)) +
ylab("FC") + xlab("Gene Name") +
theme_classic(base_size = 16) +
theme(panel.grid.minor = element_blank(),legend.position = "none")
g3 <- gg_revaxis_mfc(g2,'y', num_format = "fraction")
g3We now will visualize the same dataset with a violin plot where fold changes are linear, log2, and MAD-FC transformed. Each type of transform has unique characteristics for how it displays the fold change values.
We first produce a violin plot where fold changes are mapped to a linear scale (raw fold change units, y-axis). Notice that with this plot, negative fold changes are compressed between [0,1) and positive fold changes are between (1,inf). This limits the linear plot’s usefulness in comparing the magnitude of the positive and negative fold change values.
# Linear Plot
g1 <- ggplot(data = df_exp, aes(x = Gene, y = fc, color = dataset)) +
geom_hline(yintercept = 1, color = "grey70") +
geom_violin(scale = "width", size = 1) +
ylab("FC") + xlab("Gene Name") +
theme_classic(base_size = 16) +
theme(panel.grid.minor = element_blank(),legend.position = "none")
g1Next we produce a violin plot of the same data where fold changes are mapped to a log2 scale (raw fold change units). Log plots exhibit symmetry with positive and negative fold changes, allowing them to be compared based on their relative position to zero on the log scale (the point of no change). However, the non-linearity of the log transform makes it difficult to compare the magnitude among collections of positive fold change data points and separately, negative fold changes. It is also difficult to recover the original fold change value before the log transform.
# Log Plot
g2 <- ggplot(data = df_exp, aes(x = Gene, y = log2_fc, color = dataset)) +
geom_hline(yintercept = 0, size = .3, color = "grey70") +
geom_violin(scale = "width", size =1) +
ylab(expression(log[2]~(FC))) + xlab("Gene Name") +
theme_classic(base_size = 16) +
theme(panel.grid.minor = element_blank(),legend.position = "none")
g2Finally we produce a violin plot of the same data where fold changes are mapped to a MAD-FC scale (raw fold change units with negative fold change scale distorted to match the positive fold change scale). MAD plots are symmetrical and linear by design, making it easier to compare datapoints with different fold change direction, datapoints with same direction. The linear mapping makes it easy to recover the original datapoint position from the plot.
# MAD-FC Plot
g3 <- ggplot(data = df_exp, aes(x = Gene, y = mad_fc, color = dataset)) +
geom_hline(yintercept = 1, color = "grey70") +
geom_violin(scale = "width", size = 1) +
scale_y_continuous(breaks = seq(-60,60,20),
labels = seq(-60,60,20)) +
ylab("FC") + xlab("Gene Name") +
theme_classic(base_size = 8) +
theme(panel.grid.minor = element_blank(),legend.position = "none")
g3 <- gg_revaxis_mfc(g2,'y', num_format = "fraction")
g3To create a MAD plot, first create a linear plot like above, and then
call the function. gg_rev_axis_mfc(gg, axes, num_format) Where
gg is the ggplot object, axis is the axis to be
transformed, and num_format is the format used for the plot
tick labels. This function extracts the axis labels from the gg plot
object and reverses the MAD-FC transform to display the original FC
units. TEST
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RTCGA.mRNA_1.28.0 RTCGA_1.30.0 ggpubr_0.6.0
## [4] BiocManager_1.30.22 cowplot_1.1.1 lubridate_1.9.2
## [7] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3
## [10] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
## [13] tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.4 xfun_0.40 bslib_0.5.1 rstatix_0.7.2
## [5] lattice_0.21-8 tzdb_0.4.0 bitops_1.0-7 vctrs_0.6.3
## [9] tools_4.3.1 generics_0.1.3 fansi_1.0.4 pkgconfig_2.0.3
## [13] Matrix_1.6-1.1 data.table_1.14.8 assertthat_0.2.1 lifecycle_1.0.3
## [17] farver_2.1.1 compiler_4.3.1 munsell_0.5.0 carData_3.0-5
## [21] htmltools_0.5.6.1 sass_0.4.7 RCurl_1.98-1.12 yaml_2.3.7
## [25] pillar_1.9.0 car_3.1-2 jquerylib_0.1.4 cachem_1.0.8
## [29] survminer_0.4.9 viridis_0.6.4 abind_1.4-5 km.ci_0.5-6
## [33] rvest_1.0.3 tidyselect_1.2.0 digest_0.6.33 stringi_1.7.12
## [37] labeling_0.4.3 ggthemes_4.2.4 splines_4.3.1 fastmap_1.1.1
## [41] grid_4.3.1 colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
## [45] XML_3.99-0.14 survival_3.5-5 utf8_1.2.3 broom_1.0.5
## [49] withr_2.5.1 scales_1.2.1 backports_1.4.1 timechange_0.2.0
## [53] httr_1.4.7 rmarkdown_2.25 gridExtra_2.3 ggsignif_0.6.4
## [57] zoo_1.8-12 hms_1.1.3 evaluate_0.22 knitr_1.44
## [61] KMsurv_0.1-5 viridisLite_0.4.2 survMisc_0.5.6 rlang_1.1.1
## [65] xtable_1.8-4 glue_1.6.2 xml2_1.3.5 rstudioapi_0.15.0
## [69] jsonlite_1.8.7 R6_2.5.1
Bruce Corliss, 10/16/2023