Novel Insight Into the Etiology of Autism Spectrum Disorder Gained by Integrating Expression Data With Genome-wide Association Statistics

Background A recent genome-wide association study (GWAS) of autism spectrum disorder (ASD) (ncases = 18,381, ncontrols = 27,969) has provided novel opportunities for investigating the etiology of ASD. Here, we integrate the ASD GWAS summary statistics with summary-level gene expression data to infer differential gene expression in ASD, an approach called transcriptome-wide association study (TWAS). Methods Using FUSION software, ASD GWAS summary statistics were integrated with predictors of gene expression from 16 human datasets, including adult and fetal brains. A novel adaptation of established statistical methods was then used to test for enrichment within candidate pathways and specific tissues and at different stages of brain development. The proportion of ASD heritability explained by predicted expression of genes in the TWAS was estimated using stratified linkage disequilibrium score regression. Results This study identified 14 genes as significantly differentially expressed in ASD, 13 of which were outside of known genome-wide significant loci (±500 kb). XRN2, a gene proximal to an ASD GWAS locus, was inferred to be significantly upregulated in ASD, providing insight into the functional consequence of this associated locus. One novel transcriptome-wide significant association from this study is the downregulation of PDIA6, which showed minimal evidence of association in the GWAS, and in gene-based analysis using MAGMA. Predicted gene expression in this study accounted for 13.0% of the total ASD single nucleotide polymorphism heritability. Conclusions This study has implicated several genes as significantly up/downregulated in ASD, providing novel and useful information for subsequent functional studies. This study also explores the utility of TWAS-based enrichment analysis and compares TWAS results with a functionally agnostic approach.

are available in the O'Brien fetal brain dataset, and features representing splice events are available in the CMC dorsolateral prefrontal cortex dataset.
From the processed genotype and gene expression data, TWAS predictors were computed using FUSION software (see URLs). In brief, the FUSION software first estimates the cis-SNP-heritability of each feature based on SNPs +/-500kb from the feature boundary using the AI-REML algorithm in GCTA (10). For features that have nominally significant cis-SNP-heritability (p < 0.01), predictive models are generated using BLUP, elastic net, LASSO and BSLMM (CMC, YFS, NTR, METSIM only) models. Five-fold cross validation is then used to evaluate the out-of-sample variance explained by each model with the best model being used in the TWAS.

Defining transcriptome-wide significance
Many genes are available in multiple SNP-weight sets and have highly correlated predicted expression.
Furthermore, genes near one another often have correlated expression and are therefore not independent. Therefore, a Bonferroni significance threshold for all features (gene/tissue combinations) would be highly conservative. We estimated transcriptome-wide significance using a permutation procedure, which accounts for the correlation between features, within and across SNP-weight sets.
Initially, expression levels for all features from all SNP-weight sets (N features = 38,157) were imputed into the 1000 genomes reference dataset used by FUSION (N individuals = 489) using the FUSION protocol which involves PLINK (11) (see URLs). Then, for each permutation, a random normally distributed phenotype was generated, linear regression was performed to derive a p-value of association for each feature, and the minimum p-value was stored. This procedure was repeated 1,000 times. The 5% quantile of the minimum p-values is the transcriptome-wide significance threshold with the features used in this study. Based on these permutations the transcriptome-wide significance threshold was estimated at p = 4.25×10 -6 (95% CI = 2.86×10 -6 -5.53×10 -6 ). This approach for estimating transcriptome-wide significance is an adaptation of the permutation procedure used, in part, to estimate the genome-wide significance threshold (12).

Calculating proportion of SNP association explained by predicted expression
The proportion of a SNP-level association accounted for by predicted expression in the TWAS was calculated as 1-(χ2 of conditioned GWAS association) / (χ2 of unconditioned GWAS association). This is the same method used by TWAS-hub to calculate '% variance explained' (http://twas-hub.org).

Similarities and differences between TWAS and MAGMA
MAGMA and TWAS both aggregate SNP associations within gene regions. However, a key difference is that MAGMA aggregates the association of SNPs within gene regions without taking into account of SNP effects on gene expression. MAGMA is therefore considered functionally agnostic. In contrast, TWAS aggregates the association of SNPs within gene regions weighted by their effect on gene expression, so comparison of TWAS to MAGMA highlights the effect of considering SNP-effects on gene expression. Another important difference is that MAGMA includes any gene region for which SNPs are available in the GWAS and linkage disequilibrium (LD) reference, whereas TWAS only includes genes with significantly heritable gene expression (based on SNPs within a 500kb window).

TWAS-based enrichment analysis
Analytical procedure Competitive enrichment was tested for by performing a linear regression for each gene-set, whereby gene Z-scores were predicted by membership of each gene-set, including covariates for gene length and the number SNPs within the gene region. Given the functional consequence of each genes up-or downregulation is unknown, Z-scores were calculated as probit transformed (1-p), resulting in an approximately normally distributed Z-score of non-zero association. To avoid potential bias due to outliers, Z-scores were truncated to be between -3 and 6. This regression approach for enrichment analysis can also be used to test for a correlation between TWAS associations and continuous gene annotations, termed gene property analysis, as is also implemented in MAGMA.
To avoid bias due to the correlation between genes we use lme4qtl (13) to fit a mixed model regression of TWAS Z-score on gene-set membership, accounting for the correlation in Z-scores between genes due to LD. The correlation matrix used was computed based on the same predicted gene expression values used when estimating the transcriptome wide-significance threshold. The correlation between genes that were more than 5Mb apart or on separate chromosomes were set to zero. Any gene-gene correlations with an R-squared less than 0.0001 were set to zero. The matrix was stored as a sparse matrix substantially reducing the memory requirements and duration of the analysis. The linear mixed model with the sparse matrix was performed using the lme4qtl package in R. The software used for this analysis (TWAS-GSEA) is publically available (see URLs).
We analysed TWAS association results from all 16 SNP-weight sets simultaneously to improve genome coverage and reduce the multiple testing burden. If multiple features represent the same gene, such as when a gene is captured in multiple SNP-weight sets, only the feature that gave the best prediction of expression (as measured by cross-validated R2) was retained.

Interpretation of S-LDSC TWAS-based heritability estimates
Estimates of variance explained by each SNP-weight set should not be interpreted as a measure of enrichment as they are highly correlated with the sample size of the dataset used to derive the SNPweights. SNP-weights are only available for a gene if the gene's expression has a statistically significant SNP-heritability. Larger samples often have greater power to detect significantly SNP-heritable expression, and therefore SNP-weight sets derived from larger samples typically include SNP-weights for more genes, which leads to an increased variance explained by the SNP-weight set.

Supplementary Figures
Supplementary Figure S1. Manhattan plot of all ASD TWAS associations. Each point represents a single gene tested, with physical position plotted on the x-axis and Z score of association between the gene and ASD plotted on the y-axis. Transcriptome-wide significant associations are highlighted as red points and are labelled with their ID. If more than one transcriptome-wide significant feature represents the same gene, only the most significant feature is highlighted in red and labelled. The blue horizontal line indicates transcriptome-wide significance (p < 4.25×10 -6 ). 7 Supplementary Figure S2. Regional association plot. The top panel shows all of the genes in the locus. Marginally TWAS associated genes are highlighted in green, jointly significant genes are highlighted in blue, non-significant genes are in red, and genes that were not assessed in the TWAS are in grey. The bottom panel shows a Manhattan plot of the GWAS data before (grey) and after (blue) conditioning on the green genes. Supplementary Figure S9. Comparison of gene-level Z scores derived using MAGMA and TWAS, and SNP-level Z score in the corresponding GWAS, containing either a significant TWAS or MAGMA association. The absolute TWAS Z score is used here. Genes within 250kb of genes significant in the either the TWAS of MAGMA analysis are included. Empty cells indicate the gene was not available in the TWAS or MAGMA analysis. Asterisks indicate the Z score surpassed the corresponding significance threshold (TWAS = p < 4.25x10 -6 ; MAGMA = Bonferroni p-value </= 0.05, GWAS = p <=5x10 -8 ).
Supplementary Figure S10. Comparison of gene-level Z scores derived using MAGMA and TWAS, and SNP-level Z score in the corresponding GWAS, containing either a significant TWAS or MAGMA association. The absolute TWAS Z score is used here. Genes within 250kb of genes significant in the either the TWAS of MAGMA analysis are included. Empty cells indicate the gene was not available in the TWAS or MAGMA analysis. Asterisks indicate the Z score surpassed the corresponding significance threshold (TWAS = p < 4.25x10 -6 ; MAGMA = Bonferroni p-value </= 0.05, GWAS = p <=5x10 -8 ).
Supplementary Figure S11. Comparison of gene-level Z scores derived using MAGMA and TWAS, and SNP-level Z score in the corresponding GWAS, containing either a significant TWAS or MAGMA association. The absolute TWAS Z score is used here. Genes within 250kb of genes significant in the either the TWAS of MAGMA analysis are included. Empty cells indicate the gene was not available in the TWAS or MAGMA analysis. Asterisks indicate the Z score surpassed the corresponding significance threshold (TWAS = p < 4.25x10 -6 ; MAGMA = Bonferroni p-value </= 0.05, GWAS = p <=5x10 -8 ).
Supplementary Figure S12. Comparison of gene-level Z scores derived using MAGMA and TWAS, and SNP-level Z score in the corresponding GWAS, containing either a significant TWAS or MAGMA association. The absolute TWAS Z score is used here. Genes within 250kb of genes significant in the either the TWAS of MAGMA analysis are included. Empty cells indicate the gene was not available in the TWAS or MAGMA analysis. Asterisks indicate the Z score surpassed the corresponding significance threshold (TWAS = p < 4.25x10 -6 ; MAGMA = Bonferroni p-value </= 0.05, GWAS = p <=5x10 -8 ).