Overview

In Noori et al. (2020), we performed a systematic review and meta-analysis of human CNS transcriptomic datasets in the public Gene Expression Omnibus (GEO) and ArrayExpress repositories across Alzheimer’s disease (AD), Lewy body diseases (LBD), and the amyotrophic lateral sclerosis and frontotemporal dementia (ALS-FTD) disease spectrum. Here, we provide the source code to generate the data quality reports and perform differential expression analyses for these datasets. Please see our accompanying publication in Data In Brief for additional information.

Setup

Install R to run our code. Requisite packages are loaded.

# base functions for R/Bioconductor
library(Biobase)

# query Gene Expression Omnibus or ArrayExpress
library(GEOquery)
library(ArrayExpress)

# data pre-processing
library(oligo)
library(arrayQualityMetrics)
library(sva)

# differential expression analysis
library(limma)

# probe annotation
library(org.Hs.eg.db)
library(AnnotationDbi)

# data visualization
library(ggplot2)
library(EnhancedVolcano)

Query Repository

Specified transcriptomics datasets are retrieved by querying either the Gene Expression Omnibus or ArrayExpress repositories. The following script performs differential expression analysis on GSE1297, however, this analysis is consistent across all included datasets.

# load series and platform data from GEO
gset = getGEO("GSE1297", GSEMatrix =TRUE, AnnotGPL=TRUE)
if (length(gset) > 1) idx = grep("GPL96", attr(gset, "names")) else idx = 1
gset = gset[[idx]]

# clean column names
fvarLabels(gset) = make.names(fvarLabels(gset))

# sample groups generated by GEO2R utility
gsms = "1XX111XXX11X0X00000XXXX0XX1X00X"
sml = c()
for (i in 1:nchar(gsms)) { sml[i] = substr(gsms,i,i) }

# eliminate samples marked as "X"
sel = which(sml != "X")
sml = sml[sel]
gset = gset[ ,sel]

RMA Normalization

Missing values are removed.

sumNA = apply(gset, 1, function(x) sum(is.na(x)))
idxNA = which(sumNA > 0)
if(length(idxNA) > 0) {gset = gset[-idxNA, ]}

Robust Multichip Average (RMA) normalization is performed using the oligo package (Carvalho and Irizarry, 2010; Irizarry et al., 2003).

# subset expression matrix from ExpressionSet
ex = exprs(gset)

# detect if normalization is required
qx = as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC = (qx[5] > 100) ||
  (qx[6]-qx[1] > 50 && qx[2] > 0) ||
  (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)

# perform RMA normalization
if (LogC) {
  ex = apply(ex, 2, as.numeric)
  exRMA = oligo::basicRMA(ex, rownames(gset))
  exprs(gset) = exRMA
}

## Background correcting
## Normalizing
## Calculating Expression

Disease labels are generated and the ExpressionSet is reordered.

# map group label to AD vs. CTRL
dis = sml; dis[dis == 1] = "AD"; dis[dis == 0] = "CTRL"
dis = factor(dis, levels=c("CTRL", "AD"))
rel = order(dis)

# reorder ExpressionSet and label
gset = gset[, rel]; sml = sml[rel]; dis = dis[rel]
pData(gset)$Condition = dis

Quality Control

The arrayQualityMetrics package is used to generate data quality reports. Outliers are detected via boxplots, MA plots, and inter-array distance comparison (Kauffmann et al., 2009; Kauffmann and Huber, 2010). The full data quality report for GSE1297 is available here.

# generate quality control report
qc = arrayQualityMetrics(gset, outdir = "QC", reporttitle ="GSE1297 QC - CA1, GPL96", force = TRUE)

# identify outliers
array = qc$arrayTable[, 3:5]
keep = apply(array == "", 1, all)
remove = rownames(pData(gset))[!keep]

Following the above normalization, the expression matrix is visualized in a boxplot which indicates aberrant samples (Wickham, 2016).

# parse data for boxplot
ex = exprs(gset)
samples = rep(rownames(pData(gset)), each = nrow(gset))
samples = factor(samples, levels = unique(samples))
pheno = rep(as.character(dis), each = nrow(gset))
bp = data.frame(samples, c(ex), pheno)
bp[["outlier"]] = bp$samples %in% remove

# create boxplot
box = ggplot(bp, aes(x = samples, y = ex, fill = pheno, linetype = outlier)) +
  geom_boxplot(outlier.alpha = 0.1, na.rm = TRUE) + 
  scale_fill_manual(values =  c("#FE7B72", "#84B2D7")) +
  scale_linetype_manual(values = c("solid", "dashed")) + 
  labs(title = "GSE1297 Box Plot - CA1, GPL96",
       x = "Samples", y = "Probe Intensity", fill = "Condition", linetype = "Outlier?") +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face="bold"),
        axis.title.x = element_text(size=12, face="bold"),
        axis.title.y = element_text(size=12, face="bold"),
        legend.title = element_text(size=10, face="bold"))

Samples which fail to pass one or more of the three outlier detection steps in arrayQualityMetrics are omitted.

gset = gset[, keep]
sml = sml[keep]
dis = dis[keep]

Probes with low expression are capped at the 20th percentile. This threshold is represented on the final boxplot.

# cap probes with low-expression
filter = exprs(gset) # with outliers removed
thresh = quantile(c(filter), 0.20)
filter[which(filter < thresh)] = thresh
exprs(gset) = filter

# boxplot with low-expression threshold
box + geom_hline(yintercept=thresh, linetype="dashed", color = "red", size = 0.5)

Surrogate Variable Analysis

Model matrices are created.

# set group names
sml = paste("G", sml, sep="")
fl = as.factor(sml)
gset$description = fl

# create full model matrix
mod = model.matrix(~ description, gset)

# create null model matrix
mod0 = model.matrix(~1, gset)

The number of significant surrogate variables (\(\le\) 4) is estimated using the asymptotic approach proposed by Leek (2011), then SVA is performed (Leek et al., 2012; Leek and Storey, 2007).

# estimate # of surrogate variables
n.sv = num.sv(exprs(gset), mod, method="leek")
if(n.sv > 4) { n.sv = 4 }

# perform SVA
svobj = sva(exprs(gset), mod, mod0, n.sv=n.sv)

## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5

modSv = cbind(mod, svobj$sv)

Differential Expression Analysis

Differential expression analysis is performed using the limma package (Phipson et al., 2016; Ritchie et al., 2015).

# generate linear model fit
fit = lmFit(gset, modSv)

# compute contrasts
cont.matrix = cbind("C1"=c(0,1,rep(0,svobj$n.sv)))
fit2 = contrasts.fit(fit, cont.matrix)

# empirical Bayes statistics for DE
fit2 = eBayes(fit2, 0.01)

The ranked table of all differentially-expressed genes (DEGs) with confidence intervals included is created. Then, the interquartile range (IQR) is calculated from the raw expression matrix.

# get DE genes
tT = topTable(fit2, adjust="fdr", sort.by="B", number=nrow(fit2), confint=TRUE)

# calculate IQR
ex = exprs(gset)[rownames(tT), ]
IQR = apply(ex, 1, function(x) IQR(x, na.rm = TRUE))
tT$IQR = IQR

Probe Annotation

The annotation package specified below corresponds to Affymetrix Human Genome U133A Array (i.e., GPL96) for GSE1297. Annotation packages for other microarrays are available from R/Bioconductor. Probes wihtout Entrez ID mapping are removed (Maglott et al., 2007).

# load annotation package
library(hgu133a.db)

# perform mapping and remove duplicate probes
annot = AnnotationDbi::select(hgu133a.db, tT$ID, c("SYMBOL", "ENTREZID", "GENENAME"))
annot = annot[!duplicated(annot$PROBEID), ]

# join DE genes with annotation
dat = cbind(annot[, c("PROBEID", "ENTREZID", "SYMBOL")], tT[, c("P.Value", "logFC", "CI.L", "CI.R", "AveExpr", "IQR")])
colnames(dat) = c("Probe", "Entrez", "Symbol", "pVal", "logFC", "CI.L", "CI.R", "AveExpr", "IQR")

# remove probes without Entrez ID mapping
dat = dat[!is.na(dat$Entrez), ]

In the event that multiple probes map to the same gene, the single probe with the greatest interquartile range (IQR) is retained (Walsh et al., 2015; Wang et al., 2012).

# group by ENTREZ ID, then order by IQR (greatest to least)
dat = dat[order(dat$Entrez, -dat$IQR), ]

# retain probe with the greatest IQR only
dat = dat[!duplicated(dat$Entrez), ]

For each gene, the z-score is computed as \(\frac{x - \mu}{\sigma}\). Genes are subsequently ordered by absolute z-score.

# compute z-scores
SE = (dat$CI.R - dat$CI.L)/(1.96*2)
dat$Z = dat$logFC/SE

# order by absolute z-score
dat = dat[order(-abs(dat$Z)), ]

The final table of DEGs is excerpted below.

Probe	Entrez	Symbol	pVal	logFC	CI.L	CI.R	AveExpr	IQR	Z
220528_at	55350	VNN3	0.0000054	0.6364431	0.4762711	0.7966151	5.316762	0.2853320	7.788054
217901_at	1829	DSG2	0.0000121	1.2591064	0.9113181	1.6068947	5.660072	1.3211326	7.095836
221774_x_at	55578	SUPT20H	0.0000138	-0.7651777	-0.9796790	-0.5506764	9.226198	0.8668349	-6.991790
214607_at	5063	PAK3	0.0000216	-0.7971482	-1.0326670	-0.5616293	11.381266	0.3430353	-6.633909
204169_at	3614	IMPDH1	0.0000349	-0.4899134	-0.6431766	-0.3366502	9.583001	0.3443415	-6.265237
220166_at	26507	CNNM1	0.0000437	-1.0752016	-1.4208177	-0.7295856	8.875539	0.8941334	-6.097504
220508_at	150160	CCT8L2	0.0000440	0.9614237	0.6521368	1.2707106	5.749132	0.9516106	6.092694
203545_at	79053	ALG8	0.0000593	-0.3769096	-0.5026366	-0.2511825	9.330563	0.3124001	-5.875766
216340_s_at	1550	CYP2A7P1	0.0000602	-1.2374568	-1.6510244	-0.8238891	6.683584	0.6043895	-5.864616
209897_s_at	9353	SLIT2	0.0000650	-1.2942462	-1.7308222	-0.8576702	9.105332	0.6278201	-5.810495
209443_at	5104	SERPINA5	0.0000803	2.2466704	1.4688434	3.0244974	6.084860	1.9962309	5.661251
218196_at	28962	OSTM1	0.0000981	0.6158905	0.3972679	0.8345132	9.068102	0.7831332	5.521594
203831_at	22864	R3HDM2	0.0001170	-0.6973039	-0.9503296	-0.4442782	11.158465	0.3451135	-5.401490
213666_at	23157	SEPTIN6	0.0001178	-0.8683941	-1.1837738	-0.5530143	9.431468	0.7494847	-5.396835
208031_s_at	5990	RFX2	0.0001323	0.6163767	0.3892203	0.8435331	5.254820	0.0000000	5.318355

Top DEGs are visualized via a volcano plot (Blighe, 2019).

EnhancedVolcano(dat,
                lab = dat$Symbol,
                x = "logFC",
                y = "pVal",
                ylim = c(0, max(-log10(dat[, "pVal"]), na.rm=TRUE)),
                title = "GSE1297 Volcano Plot - CA1, GPL96",
                subtitle = "p-value Threshold = 0.05, FC Threshold = 2",
                caption = NULL,
                pCutoff = 0.05,
                FCcutoff = log2(2))

Citation

Research Paper: Noori, A.,^1,2,3,4, Mezlini, A.M.,^2,3,4,5, Hyman, B.T.,^2,4,5, Serrano-Pozo, A.,^2,4,5*, Das, S.,^2,3,4,5. 2020. Systematic review and meta-analysis of human transcriptomics reveals neuroinflammation, deficient energy metabolism, and proteostasis failure across neurodegeneration. Neurobiology of Disease 149, 105225. https://doi.org/10.1016/j.nbd.2020.105225

Methods Paper: Noori, A.,^1,2,3,4, Mezlini, A.M.,^2,3,4,5, Hyman, B.T.,^2,4,5, Serrano-Pozo, A.,^2,4,5*, Das, S.,^2,3,4,5. 2021. Differential gene expression data from human central nervous system across Alzheimer’s disease, Lewy body diseases, and amyotrophic lateral sclerosis and frontotemporal dementia spectrum. In press.

Affiliations

Harvard College, Cambridge, MA 02138, United States of America
Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, United States of America
MIND Data Science Lab, Cambridge, MA 02139, United States of America
MassGeneral Institute for Neurodegenerative Disease, Charlestown, MA 02129, United States of America
Harvard Medical School, Boston, MA 02115, United States of America

* Co-Corresponding Authors

References

Blighe, K., 2019. EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling.

Carvalho, B.S., Irizarry, R.A., 2010. A framework for oligonucleotide microarray preprocessing. Bioinformatics 26, 2363–2367. https://doi.org/10.1093/bioinformatics/btq431

Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., Speed, T.P., 2003. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (Oxford, England) 4, 249–264. https://doi.org/10.1093/biostatistics/4.2.249

Kauffmann, A., Gentleman, R., Huber, W., 2009. arrayQualityMetrics - a Bioconductor Package for Quality Assessment of Microarray Data. Bioinformatics 25, 415–416. https://doi.org/10.1093/bioinformatics/btn647

Kauffmann, A., Huber, W., 2010. Microarray data quality control improves the detection of differentially expressed genes. Genomics 95, 138–142. https://doi.org/10.1016/j.ygeno.2010.01.003

Leek, J.T., 2011. Asymptotic conditional singular value decomposition for high-dimensional genomic data. Biometrics 67, 344–352. https://doi.org/10.1111/j.1541-0420.2010.01455.x

Leek, J.T., Johnson, W.E., Parker, H.S., Jaffe, A.E., Storey, J.D., 2012. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883. https://doi.org/10.1093/bioinformatics/bts034

Leek, J.T., Storey, J.D., 2007. Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis. PLOS Genetics 3, e161. https://doi.org/10.1371/journal.pgen.0030161

Maglott, D., Ostell, J., Pruitt, K.D., Tatusova, T., 2007. Entrez Gene: Gene-centered information at NCBI. Nucleic Acids Research 35, D26–D31. https://doi.org/10.1093/nar/gkl993

Phipson, B., Lee, S., Majewski, I.J., Alexander, W.S., Smyth, G.K., 2016. Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. The annals of applied statistics 10, 946–963. https://doi.org/10.1214/16-AOAS920

Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., Smyth, G.K., 2015. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47. https://doi.org/10.1093/nar/gkv007

Walsh, C.J., Hu, P., Batt, J., Santos, C.C.D., 2015. Microarray Meta-Analysis and Cross-Platform Normalization: Integrative Genomics for Robust Biomarker Discovery. Microarrays (Basel, Switzerland) 4, 389–406. https://doi.org/10.3390/microarrays4030389

Wang, X., Lin, Y., Song, C., Sibille, E., Tseng, G.C., 2012. Detecting disease-associated genes with confounding variable adjustment and the impact on genomic meta-analysis: With application to major depressive disorder. BMC Bioinformatics 13, 52. https://doi.org/10.1186/1471-2105-13-52

Wickham, H., 2016. Ggplot2: Elegant Graphics for Data Analysis, 2nd ed. 2016. ed, Use R! Springer International Publishing : Imprint: Springer, Cham. https://doi.org/10.1007/978-3-319-24277-4

Differential Gene Expression Data from the Human Central Nervous System across Alzheimer’s Disease, Lewy Body Diseases, and the Amyotrophic Lateral Sclerosis and Frontotemporal Dementia Spectrum

Ayush Noori, Aziz M. Mezlini, Bradley T. Hyman, Alberto Serrano-Pozo, Sudeshna Das