Overview
In Noori et al. (2020), we performed a systematic review and meta-analysis of human CNS transcriptomic datasets in the public Gene Expression Omnibus (GEO) and ArrayExpress repositories across Alzheimer’s disease (AD), Lewy body diseases (LBD), and the amyotrophic lateral sclerosis and frontotemporal dementia (ALS-FTD) disease spectrum. Here, we provide the source code to generate the data quality reports and perform differential expression analyses for these datasets. Please see our accompanying publication in Data In Brief for additional information.
Setup
Install R to run our code. Requisite packages are loaded.
# base functions for R/Bioconductor
library(Biobase)
# query Gene Expression Omnibus or ArrayExpress
library(GEOquery)
library(ArrayExpress)
# data pre-processing
library(oligo)
library(arrayQualityMetrics)
library(sva)
# differential expression analysis
library(limma)
# probe annotation
library(org.Hs.eg.db)
library(AnnotationDbi)
# data visualization
library(ggplot2)
library(EnhancedVolcano)
Query Repository
Specified transcriptomics datasets are retrieved by querying either the Gene Expression Omnibus or ArrayExpress repositories. The following script performs differential expression analysis on GSE1297, however, this analysis is consistent across all included datasets.
# load series and platform data from GEO
getGEO("GSE1297", GSEMatrix =TRUE, AnnotGPL=TRUE)
gset =if (length(gset) > 1) idx = grep("GPL96", attr(gset, "names")) else idx = 1
gset[[idx]]
gset =
# clean column names
fvarLabels(gset) = make.names(fvarLabels(gset))
# sample groups generated by GEO2R utility
"1XX111XXX11X0X00000XXXX0XX1X00X"
gsms = c()
sml =for (i in 1:nchar(gsms)) { sml[i] = substr(gsms,i,i) }
# eliminate samples marked as "X"
which(sml != "X")
sel = sml[sel]
sml = gset[ ,sel] gset =
RMA Normalization
Missing values are removed.
apply(gset, 1, function(x) sum(is.na(x)))
sumNA = which(sumNA > 0)
idxNA =if(length(idxNA) > 0) {gset = gset[-idxNA, ]}
Robust Multichip Average (RMA) normalization is performed using the oligo
package (Carvalho and Irizarry, 2010; Irizarry et al., 2003).
# subset expression matrix from ExpressionSet
exprs(gset)
ex =
# detect if normalization is required
as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
qx = (qx[5] > 100) ||
LogC = (qx[6]-qx[1] > 50 && qx[2] > 0) ||
(qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
# perform RMA normalization
if (LogC) {
apply(ex, 2, as.numeric)
ex = oligo::basicRMA(ex, rownames(gset))
exRMA =exprs(gset) = exRMA
}
## Background correcting
## Normalizing
## Calculating Expression
Disease labels are generated and the ExpressionSet
is reordered.
# map group label to AD vs. CTRL
sml; dis[dis == 1] = "AD"; dis[dis == 0] = "CTRL"
dis = factor(dis, levels=c("CTRL", "AD"))
dis = order(dis)
rel =
# reorder ExpressionSet and label
gset[, rel]; sml = sml[rel]; dis = dis[rel]
gset =pData(gset)$Condition = dis
Quality Control
The arrayQualityMetrics
package is used to generate data quality reports. Outliers are detected via boxplots, MA plots, and inter-array distance comparison (Kauffmann et al., 2009; Kauffmann and Huber, 2010). The full data quality report for GSE1297 is available here.
# generate quality control report
arrayQualityMetrics(gset, outdir = "QC", reporttitle ="GSE1297 QC - CA1, GPL96", force = TRUE)
qc =
# identify outliers
qc$arrayTable[, 3:5]
array = apply(array == "", 1, all)
keep = rownames(pData(gset))[!keep] remove =
Following the above normalization, the expression matrix is visualized in a boxplot which indicates aberrant samples (Wickham, 2016).
# parse data for boxplot
exprs(gset)
ex = rep(rownames(pData(gset)), each = nrow(gset))
samples = factor(samples, levels = unique(samples))
samples = rep(as.character(dis), each = nrow(gset))
pheno = data.frame(samples, c(ex), pheno)
bp ="outlier"]] = bp$samples %in% remove
bp[[
# create boxplot
ggplot(bp, aes(x = samples, y = ex, fill = pheno, linetype = outlier)) +
box = geom_boxplot(outlier.alpha = 0.1, na.rm = TRUE) +
scale_fill_manual(values = c("#FE7B72", "#84B2D7")) +
scale_linetype_manual(values = c("solid", "dashed")) +
labs(title = "GSE1297 Box Plot - CA1, GPL96",
x = "Samples", y = "Probe Intensity", fill = "Condition", linetype = "Outlier?") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face="bold"),
axis.title.x = element_text(size=12, face="bold"),
axis.title.y = element_text(size=12, face="bold"),
legend.title = element_text(size=10, face="bold"))
Samples which fail to pass one or more of the three outlier detection steps in arrayQualityMetrics
are omitted.
gset[, keep]
gset = sml[keep]
sml = dis[keep] dis =
Probes with low expression are capped at the 20th percentile. This threshold is represented on the final boxplot.
# cap probes with low-expression
exprs(gset) # with outliers removed
filter = quantile(c(filter), 0.20)
thresh =which(filter < thresh)] = thresh
filter[exprs(gset) = filter
# boxplot with low-expression threshold
+ geom_hline(yintercept=thresh, linetype="dashed", color = "red", size = 0.5) box
Surrogate Variable Analysis
Model matrices are created.
# set group names
paste("G", sml, sep="")
sml = as.factor(sml)
fl =$description = fl
gset
# create full model matrix
model.matrix(~ description, gset)
mod =
# create null model matrix
model.matrix(~1, gset) mod0 =
The number of significant surrogate variables (\(\le\) 4) is estimated using the asymptotic approach proposed by Leek (2011), then SVA is performed (Leek et al., 2012; Leek and Storey, 2007).
# estimate # of surrogate variables
num.sv(exprs(gset), mod, method="leek")
n.sv =if(n.sv > 4) { n.sv = 4 }
# perform SVA
sva(exprs(gset), mod, mod0, n.sv=n.sv) svobj =
## Number of significant surrogate variables is: 2
## Iteration (out of 5 ):1 2 3 4 5
cbind(mod, svobj$sv) modSv =
Differential Expression Analysis
Differential expression analysis is performed using the limma
package (Phipson et al., 2016; Ritchie et al., 2015).
# generate linear model fit
lmFit(gset, modSv)
fit =
# compute contrasts
cbind("C1"=c(0,1,rep(0,svobj$n.sv)))
cont.matrix = contrasts.fit(fit, cont.matrix)
fit2 =
# empirical Bayes statistics for DE
eBayes(fit2, 0.01) fit2 =
The ranked table of all differentially-expressed genes (DEGs) with confidence intervals included is created. Then, the interquartile range (IQR) is calculated from the raw expression matrix.
# get DE genes
topTable(fit2, adjust="fdr", sort.by="B", number=nrow(fit2), confint=TRUE)
tT =
# calculate IQR
exprs(gset)[rownames(tT), ]
ex = apply(ex, 1, function(x) IQR(x, na.rm = TRUE))
IQR =$IQR = IQR tT
Probe Annotation
The annotation package specified below corresponds to Affymetrix Human Genome U133A Array (i.e., GPL96) for GSE1297. Annotation packages for other microarrays are available from R/Bioconductor. Probes wihtout Entrez ID mapping are removed (Maglott et al., 2007).
# load annotation package
library(hgu133a.db)
# perform mapping and remove duplicate probes
AnnotationDbi::select(hgu133a.db, tT$ID, c("SYMBOL", "ENTREZID", "GENENAME"))
annot = annot[!duplicated(annot$PROBEID), ]
annot =
# join DE genes with annotation
cbind(annot[, c("PROBEID", "ENTREZID", "SYMBOL")], tT[, c("P.Value", "logFC", "CI.L", "CI.R", "AveExpr", "IQR")])
dat =colnames(dat) = c("Probe", "Entrez", "Symbol", "pVal", "logFC", "CI.L", "CI.R", "AveExpr", "IQR")
# remove probes without Entrez ID mapping
dat[!is.na(dat$Entrez), ] dat =
In the event that multiple probes map to the same gene, the single probe with the greatest interquartile range (IQR) is retained (Walsh et al., 2015; Wang et al., 2012).
# group by ENTREZ ID, then order by IQR (greatest to least)
dat[order(dat$Entrez, -dat$IQR), ]
dat =
# retain probe with the greatest IQR only
dat[!duplicated(dat$Entrez), ] dat =
For each gene, the z-score is computed as \(\frac{x - \mu}{\sigma}\). Genes are subsequently ordered by absolute z-score.
# compute z-scores
(dat$CI.R - dat$CI.L)/(1.96*2)
SE =$Z = dat$logFC/SE
dat
# order by absolute z-score
dat[order(-abs(dat$Z)), ] dat =
The final table of DEGs is excerpted below.
Probe | Entrez | Symbol | pVal | logFC | CI.L | CI.R | AveExpr | IQR | Z |
---|---|---|---|---|---|---|---|---|---|
220528_at | 55350 | VNN3 | 0.0000054 | 0.6364431 | 0.4762711 | 0.7966151 | 5.316762 | 0.2853320 | 7.788054 |
217901_at | 1829 | DSG2 | 0.0000121 | 1.2591064 | 0.9113181 | 1.6068947 | 5.660072 | 1.3211326 | 7.095836 |
221774_x_at | 55578 | SUPT20H | 0.0000138 | -0.7651777 | -0.9796790 | -0.5506764 | 9.226198 | 0.8668349 | -6.991790 |
214607_at | 5063 | PAK3 | 0.0000216 | -0.7971482 | -1.0326670 | -0.5616293 | 11.381266 | 0.3430353 | -6.633909 |
204169_at | 3614 | IMPDH1 | 0.0000349 | -0.4899134 | -0.6431766 | -0.3366502 | 9.583001 | 0.3443415 | -6.265237 |
220166_at | 26507 | CNNM1 | 0.0000437 | -1.0752016 | -1.4208177 | -0.7295856 | 8.875539 | 0.8941334 | -6.097504 |
220508_at | 150160 | CCT8L2 | 0.0000440 | 0.9614237 | 0.6521368 | 1.2707106 | 5.749132 | 0.9516106 | 6.092694 |
203545_at | 79053 | ALG8 | 0.0000593 | -0.3769096 | -0.5026366 | -0.2511825 | 9.330563 | 0.3124001 | -5.875766 |
216340_s_at | 1550 | CYP2A7P1 | 0.0000602 | -1.2374568 | -1.6510244 | -0.8238891 | 6.683584 | 0.6043895 | -5.864616 |
209897_s_at | 9353 | SLIT2 | 0.0000650 | -1.2942462 | -1.7308222 | -0.8576702 | 9.105332 | 0.6278201 | -5.810495 |
209443_at | 5104 | SERPINA5 | 0.0000803 | 2.2466704 | 1.4688434 | 3.0244974 | 6.084860 | 1.9962309 | 5.661251 |
218196_at | 28962 | OSTM1 | 0.0000981 | 0.6158905 | 0.3972679 | 0.8345132 | 9.068102 | 0.7831332 | 5.521594 |
203831_at | 22864 | R3HDM2 | 0.0001170 | -0.6973039 | -0.9503296 | -0.4442782 | 11.158465 | 0.3451135 | -5.401490 |
213666_at | 23157 | SEPTIN6 | 0.0001178 | -0.8683941 | -1.1837738 | -0.5530143 | 9.431468 | 0.7494847 | -5.396835 |
208031_s_at | 5990 | RFX2 | 0.0001323 | 0.6163767 | 0.3892203 | 0.8435331 | 5.254820 | 0.0000000 | 5.318355 |
Top DEGs are visualized via a volcano plot (Blighe, 2019).
EnhancedVolcano(dat,
lab = dat$Symbol,
x = "logFC",
y = "pVal",
ylim = c(0, max(-log10(dat[, "pVal"]), na.rm=TRUE)),
title = "GSE1297 Volcano Plot - CA1, GPL96",
subtitle = "p-value Threshold = 0.05, FC Threshold = 2",
caption = NULL,
pCutoff = 0.05,
FCcutoff = log2(2))
Citation
Research Paper: Noori, A.,1,2,3,4, Mezlini, A.M.,2,3,4,5, Hyman, B.T.,2,4,5, Serrano-Pozo, A.,2,4,5*, Das, S.,2,3,4,5. 2020. Systematic review and meta-analysis of human transcriptomics reveals neuroinflammation, deficient energy metabolism, and proteostasis failure across neurodegeneration. Neurobiology of Disease 149, 105225. https://doi.org/10.1016/j.nbd.2020.105225
Methods Paper: Noori, A.,1,2,3,4, Mezlini, A.M.,2,3,4,5, Hyman, B.T.,2,4,5, Serrano-Pozo, A.,2,4,5*, Das, S.,2,3,4,5. 2021. Differential gene expression data from human central nervous system across Alzheimer’s disease, Lewy body diseases, and amyotrophic lateral sclerosis and frontotemporal dementia spectrum. In press.
Affiliations
- Harvard College, Cambridge, MA 02138, United States of America
- Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, United States of America
- MIND Data Science Lab, Cambridge, MA 02139, United States of America
- MassGeneral Institute for Neurodegenerative Disease, Charlestown, MA 02129, United States of America
- Harvard Medical School, Boston, MA 02115, United States of America
* Co-Corresponding Authors
References
Blighe, K., 2019. EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling.
Carvalho, B.S., Irizarry, R.A., 2010. A framework for oligonucleotide microarray preprocessing. Bioinformatics 26, 2363–2367. https://doi.org/10.1093/bioinformatics/btq431
Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., Speed, T.P., 2003. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (Oxford, England) 4, 249–264. https://doi.org/10.1093/biostatistics/4.2.249
Kauffmann, A., Gentleman, R., Huber, W., 2009. arrayQualityMetrics - a Bioconductor Package for Quality Assessment of Microarray Data. Bioinformatics 25, 415–416. https://doi.org/10.1093/bioinformatics/btn647
Kauffmann, A., Huber, W., 2010. Microarray data quality control improves the detection of differentially expressed genes. Genomics 95, 138–142. https://doi.org/10.1016/j.ygeno.2010.01.003
Leek, J.T., 2011. Asymptotic conditional singular value decomposition for high-dimensional genomic data. Biometrics 67, 344–352. https://doi.org/10.1111/j.1541-0420.2010.01455.x
Leek, J.T., Johnson, W.E., Parker, H.S., Jaffe, A.E., Storey, J.D., 2012. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883. https://doi.org/10.1093/bioinformatics/bts034
Leek, J.T., Storey, J.D., 2007. Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis. PLOS Genetics 3, e161. https://doi.org/10.1371/journal.pgen.0030161
Maglott, D., Ostell, J., Pruitt, K.D., Tatusova, T., 2007. Entrez Gene: Gene-centered information at NCBI. Nucleic Acids Research 35, D26–D31. https://doi.org/10.1093/nar/gkl993
Phipson, B., Lee, S., Majewski, I.J., Alexander, W.S., Smyth, G.K., 2016. Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. The annals of applied statistics 10, 946–963. https://doi.org/10.1214/16-AOAS920
Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., Smyth, G.K., 2015. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47. https://doi.org/10.1093/nar/gkv007
Walsh, C.J., Hu, P., Batt, J., Santos, C.C.D., 2015. Microarray Meta-Analysis and Cross-Platform Normalization: Integrative Genomics for Robust Biomarker Discovery. Microarrays (Basel, Switzerland) 4, 389–406. https://doi.org/10.3390/microarrays4030389
Wang, X., Lin, Y., Song, C., Sibille, E., Tseng, G.C., 2012. Detecting disease-associated genes with confounding variable adjustment and the impact on genomic meta-analysis: With application to major depressive disorder. BMC Bioinformatics 13, 52. https://doi.org/10.1186/1471-2105-13-52
Wickham, H., 2016. Ggplot2: Elegant Graphics for Data Analysis, 2nd ed. 2016. ed, Use R! Springer International Publishing : Imprint: Springer, Cham. https://doi.org/10.1007/978-3-319-24277-4