Emerging Methods and Resources for Biological Interrogation of Neuropsychiatric Polygenic Signal

Most neuropsychiatric disorders are highly polygenic, implicating hundreds to thousands of causal genetic variants that span much of the genome. This widespread polygenicity complicates biological understanding because no single variant can explain disease etiology. A strategy to advance biological insight is to seek convergent functions among the large set of variants and map them to a smaller set of disease-relevant genes and pathways. Accordingly, functional genomic resources that provide data on intermediate molecular phenotypes, such as gene-expression and methylation status, can be leveraged to functionally annotate variants and map them to genes. Such molecular quantitative trait locus mappings can be integrated with genome-wide association studies to make sense of the polygenic signal that underlies complex disease. Other resources that provide data on the 3-dimensional structure of chromatin and functional importance of specific genomic regions can be integrated similarly. In addition, mapped genes can then be tested for convergence in biological function, tissue, cell type, or developmental stage. In this review, we provide an overview of functional genomic resources and methods that can be used to interpret results from genome-wide association studies, and we discuss current challenges for biological understanding and future requirements to overcome them.

Genome-wide association studies (GWASs) are yielding an ever-increasing number of robustly associated risk loci for many disorders, including neuropsychiatric ones (1). While this constitutes a breakthrough, it also poses new challenges. Neuropsychiatric disorders are complex traits and, in contrast to Mendelian traits, are influenced by a plethora of genetic variants with minuscule effect sizes. This apparent polygenicity complicates biological understanding (2) because typically no single variant can point toward a relevant molecular pathway.
The past 5 years have seen an unprecedented increase in the number and scale of available resources with information on the biological function of genetic variants. Variants can be linked to genes and functions at organism, tissue, or cellular resolution. At the same time, tools and statistical methods have been developed that address the polygenic nature of neuropsychiatric disorders and test for convergence of trait-associated variants on biologically relevant functions. Here, we aim to review freely accessible and large-scale biological resources and methods that can be used to interpret outcomes from GWASs in biological context and aid in formulating GWAS-based hypotheses on disease mechanisms that can be tested in functional experiments. We also address specific challenges related to neuropsychiatric disorders.

AVAILABLE RESOURCES TO ASSESS THE POTENTIAL IMPACT OF GENETIC VARIANTS ON GENE FUNCTION
The first step in interpreting GWAS results is to annotate traitassociated variants with their putative impact on gene function ( Figure 1). Generally, there are two routes by which a variant may influence disease risk. First, a variant may affect the structure of a protein and lead to disruption of normal functioning and subsequent disease. Here, nonsynonymous exonic variants are most easily interpretable, as the structure of a protein directly relates to its function (3). Variants in coding as well as noncoding regions that affect alternative splicing may have similar consequences by producing protein isoforms (4,5).
Second, a genetic variant may affect the expression level of a gene. Transcriptional control is believed to play a fundamental role in disease (6), and GWAS hits are found to be enriched in regulatory sequences (7). Variants that influence gene expression, known as expression quantitative trait loci (eQTLs), can be identified by integrating genotype with geneexpression data (8). The same approach can be applied to identify variants that affect other molecular phenotypes such as methylation, acetylation, chromatin-conformation features, or splicing. Splicing QTLs can affect expression levels of gene isoforms and, in some cases, show stronger enrichment for disease-associated variants than eQTLs (5,9). Collectively, these regulatory variants are referred to as molecular QTLs (molQTLs) (10). Below, we review the various resources that contain biological information relevant for annotating genetic variants (see also Table 1).

Positional Mapping
In the early 2000s, the Human Genome Project completed the sequencing of nearly all 3 billion base pairs that make up human DNA (11). As a follow-up, the Encyclopedia of DNA Elements (ENCODE) project aimed to annotate all functional sequences in the human genome using experimental and computational approaches (12) (Table 1). These large, international, and multidisciplinary projects gave rise to many resources and tools that allow annotating genetic variants and mapping them to genes based on their physical location. Many tools (13)(14)(15)(16)(17)(18)(19)(20) provide information about the predicted functional impact of variants (e.g., missense, nonsense, frameshift) and positional information (e.g., exonic, intronic, transcription factor binding site). Positional mapping is particularly useful for variants in coding regions, as it directly links them to genes, but the effects on intermediate molecular phenotypes, such as methylation status and expression levels, in noncoding regions may be missed. identify large numbers of associated variants for polygenic traits. These traits can be mapped to a prioritized set of genes through functional and positional annotation, as well as statistical fine-mapping. Gene-set analysis can test for convergence of prioritized genes in biological function, tissue, or cell type. Developmental stage (not depicted) is ideally taken into account. The aim is to generate genetically informed hypotheses about the likely causal mechanisms underlying the trait of interest. molQTL, molecular quantitative trait locus.

Transcriptional Landscapes
Several consortia have generated large microarray and RNA sequencing (RNAseq) datasets to profile gene expression in human tissues (Table 1). These datasets are rich and interesting in their own right, but integration with genotype data allows for subsequent eQTL mapping, in which association studies are performed on the expression levels of each gene (8). Most of these studies focus on the effects of cis-eQTLs that are associated with expression levels of nearby genes (,1 Mb), because they tend to have larger effects on gene expression than trans-eQTLs that are associated with the expression levels of distal genes (.5 Mb) (21). Additionally, focusing on a smaller subset of cis-eQTLs reduces the multiple-testing burden drastically.
The Genotype-Tissue Expression (GTEx) project has generated a wealth of data by applying RNAseq to postmortem tissue of 948 donors across 54 tissues, 13 of which are in the brain (22,23). In this resource alone, it was found that almost all protein-coding genes and approximately two thirds of all long noncoding RNA genes had corresponding cis-eQTLs (22). Many of these were tissue specific. Moreover, GWAS hits were significantly enriched for eQTLs. Thus, the integration of GTEx eQTL mappings with GWAS hits allows for the discovery of tissue-specific effects of trait-associated variants on geneexpression levels. However, because of ancestry-specific linkage disequilibrium (LD), top hits from eQTL analyses cannot simply be overlaid with GWAS hits (see Pinpointing the Most Likely Causal Variants).
While GTEx provides information on spatial transcriptomic regulation, resources such as BrainSpan provide corresponding temporal information (24,25). Gene expression is a stochastic phenomenon that varies across development (24), which is mostly due to shifts in maturity and cell-type fractions in tissues (26). Therefore, both spatial and temporal trajectories need to be considered for relevant eQTL identification. Brain-Span provides RNAseq data on a sample of 57 donors across the lifespan, for 17 (sub)cortical brain structures. BrainVar is a new resource that aims to generate a similar dataset, but with a larger sample comprising 176 donors, yielding more precise spatial and temporal profiling (27). These resources can also be used to impute gene-expression levels in independent datasets and conduct transcriptome-wide association studies (TWASs) on the imputed expression levels. Several tools are available for this purpose (28)(29)(30)(31)(32)(33).
Although most resources provide information about healthy tissue, the ultimate purpose of interpreting GWAS results is to delineate disease-underlying mechanisms. Other resources, such as PsychENCODE (34), provide RNAseq and genotype data on large samples of postmortem tissue from control subjects as well as donors with schizophrenia, major depressive disorder, or bipolar disorder. A reverse-genetics approach can be used when gene-expression data are present in cases and controls to identify differentially expressed genes (35).
Gene-expression levels in bulk tissue samples reflect the average expression levels of genes across all cell types included in a sample. Because of the development of singlecell RNAseq techniques (36), information on gene-expression levels at the cellular level is becoming rapidly available (37) and can further refine variant-expression associations. Current results from single-cell RNAseq analyses show that genetic variants often have their disease-conferring effects in a cell type-specific manner (37)(38)(39), although it is likely that approximately 50% of eQTLs are shared across tissues and cell types (22), and that while transcriptional signatures are clearly distinct between different classes of cell types, they are rather continuous within each class (40,41).
Epigenomic Landscapes eQTL mapping can indicate variants that likely regulate gene expression, but it cannot tell us how this regulation is accomplished. The genomic resource xQTL provides data from 494 donors on methylation and acetylation status, in addition to transcription levels (42) ( Table 1). This resource additionally allows for xQTL-weighted GWAS analysis, in which variants' p values are weighted by their molQTL effects. Using this resource, a mediation analysis suggested that 9% of identified eQTLs exert their effects via epigenetic modifications. Given that many more epigenetic modifications exist than are provided in this resource, this number is probably an underestimation. Thus, the complementation of RNAseq data with epigenetic information may provide insight into how eQTLs influence gene expression, and this, in turn, will allow for more detailed annotation of risk loci in neuropsychiatric disorders (43).
PsychENCODE is a resource that integrates much of the information in the preceding paragraphs (i.e., several molQTLs, as well as temporal and spatial data) by aggregating many existing datasets and generating new datasets (26,34,44,45). Their integrated model was able to improve the prediction of psychiatric-disease status approximately sixfold compared with simple additive polygenic-risk scores and to identify more than twice as many eGenes as the CommonMind Consortium, the next-largest eQTL resource (26,46).

Nuclear Landscapes
Our genome is often conceptualized as a 2-dimensional and 1directional string of base pairs, when in fact it is folded nonrandomly into a 3-dimensional structure (47). Contact points between chromatin regions that can be kilobases and megabases apart from each other may still structurally and functionally interact (48). Changes in chromatin organization can influence gene expression in subtle ways (49,50). The 4D Nucleome program aims to map the spatial and temporal genome organization in humans and mice (51) ( Table 1). Generated contact maps can be used to link genetic variants and genes, using information that goes beyond mere base-pair distance (52,53). Furthermore, the emerging picture is that the spatial organization of chromatin is highly variable (54) and that this variability directly relates to changes in gene expression (55).
Similar to other intermediate molecular phenotypes, features of chromatin conformation can also be the basis for QTL mapping. The first such studies have recently been carried out and led to the identification of chromatin QTLs that influence chromatin loop strength (56) and chromatin accessibility (57).

PINPOINTING THE MOST LIKELY CAUSAL VARIANTS
Functional annotation of trait-associated variants using a variety of resources can be easily conducted with the FUMA webtool (58) and aids in separating variants with currently no known effect on the expression level or structure of a gene's product from those with known effect. However, whether the annotated effect is also relevant to the trait still needs to be formally tested. Because of the presence of LD, which is the systematic association of alleles at different loci on the same chromosome (59), most trait-associated variants are likely to be correlated with a causal variant rather than being causally involved themselves.
Apart from using functional annotation, trait-associated variants can also be prioritized based on statistical fine-mapping. The simplest strategies are to select all significantly associated, LD-independent variants in a risk locus based on the lowest association p value and LD pattern in the region, or to conduct a stepwise conditional analysis, searching for the minimal set of variants that explain the association signal in a risk locus. However, the variant with the strongest association is not necessarily the causal variant, because of sample fluctuations in allele frequencies and LD, and because it may be in LD with two or more causal variants. More sophisticated statistical finemapping strategies rely on a Bayesian framework that incorporates the pattern of association, LD, and possibly functional information to calculate the posterior probability of being causal for each variant (60). Several tools for statistical fine-mapping are available (61)(62)(63)(64)(65)(66)(67)(68)(69)(70), of which FINEMAP (67,68) is the current standard. The recently developed method PolyFun (70) combines functional information on coding, conserved, regulatory, and LD-related annotations with statistical fine-mapping methods employed in FINEMAP (67,68) or SuSiE (69).
An additional strategy to interpret GWAS risk loci is to formally test whether the most likely causal variant is shared between the trait of interest and a second, typically molecular, trait. Such colocalization involves bivariate fine-mapping of a shared locus and can be used alongside TWAS approaches (28-33) that merely test for association with imputed transcription levels. Most methods for colocalization rely on a Bayesian model to calculate the posterior probability that the same variant is causal in both traits. The current state-of-the-art tools are COLOC (71) and eCAVIAR (65), which take summary-level statistics as input and are robust against different modes of inheritance (COLOC) or allelic heterogeneity (eCAVIAR). COLOC, however, assumes 1 causal variant per locus (interpreted as "at least 1"), whereas eCAVIAR allows up to 6 causal variants per locus. Enloc (72) additionally allows assessing enrichment of molQTLs, which improves the accuracy of the evaluation of colocalization. FOCUS (73) combines expression imputation (TWAS) with fine-mapping and colocalization and provides credible sets of genes conditional on the TWAS signal.
The accuracy of colocalization methods depends on the sample sizes of molQTL GWASs, which are currently relatively small. In addition, genetic variants that are not prioritized after colocalization can still be causally involved via other mechanisms and should not be dismissed. However, with increasing molQTL GWAS sample sizes, colocalization methods can aid in prioritizing variants and genes and can provide additional mechanistic insight.

GENE TO FUNCTION: LOOKING FOR CONVERGENCE
To provide meaningful starting points for functional follow-up experiments, the polygenic nature of neuropsychiatric disorders must be considered when interpreting the results of GWASs. One way is to look for convergence among the vast number of associated or most likely causal variants. Convergence can be tested using pathway-based association tools that rely on a collection of predefined gene sets and evaluate trait associations of the genes that are part of them (74). Table 2 provides a brief overview of resources that can be used to define gene sets, and this is explained more thoroughly in the Definitions of Biological Functions and Pathways in the Supplement. The interpretability and reliability of such an analysis critically depend on accurate gene annotations of biological functions and pathways, as well as sound statistical methods to evaluate the strength of statistical association above chance.

Tools for Systematic Evaluation of Convergence of Trait-Associated Variants in Biologically Relevant Functions
Lists of gene sets can be used to investigate whether traitassociated variants converge on specific gene sets, by conducting pathway or gene-set analysis (GSA). GSA entails evaluating whether genes that are part of the gene set show a generally stronger association with the trait of interest compared with the association of genes that are not part of the gene set. This analysis can be accomplished by using a binary measure of gene-trait association and conducting a 2 3 2 hypergeometric test or a quantitative measure of gene-trait association (e.g., transformed p values for association) and using a regression framework.
There are several tools available for GSA (75)(76)(77)(78)(79)(80)(81)(82)(83)(84)(85), some of which are reviewed in (74). The most widely used tools are stratified LDSC (86) and MAGMA (84). Although LDSC is not strictly a GSA tool, it does allow for the evaluation of enrichment of single nucleotide polymorphism-based heritability in functional categories of variants. MAGMA (84) relies on allocating variants to genes and uses a multiple regression model that evaluates whether the gene-based association signal inside a gene set is stronger than that outside the gene set. As with any other regression, inclusion of covariates is straightforward, and by default, MAGMA conditions on gene size, gene density, the inverse of the mean minor allele count in each gene, and the number of genes in a gene set. While the default is to assign variants to genes based on a window of 1 kb on each side of the transcription start and end sites, this window can be adjusted. However, there are no agreed-upon standards for window sizes around genes, as each gene may require a different window (87), reflecting an optimal balance between including relevant variants and excluding noise.
First, statistical power is only marginally related to the sample size of the GWAS and is mostly influenced by the genetic effect per gene (or variant) and the number of genes in a gene set, in addition to the ratio of the heritability due to the genes inside and outside the set. If the set-specific heritability is low (#5%), gene-set sizes of 100 genes still have adequate (80%) power, while gene-set sizes of 25 genes will have low (,10%) statistical power. However, if the gene-set heritability  (92). Second, competitive testing, which tests the null hypothesis that the trait association of genes in the gene set is not stronger than of genes outside it, is almost always preferred over often-biased self-contained testing, which tests the null hypothesis that genes in the gene set are not associated with the trait of interest. This approach is especially important for traits that are known to be polygenic, with nearly all genes showing at least some association with the trait of interest. Third, statistical power of GSA generally decreases as the trait heritability increases and it becomes less likely that most of the genetic variance resides in the gene set of interest. This power loss occurs because with higher trait heritability, the genetic signal is divided across many genes under a polygenic model, increasing the heritability due to genes that are not included in the set of interest. Fourth, the test for statistical association should accurately control for the number of variants, the number of genes, and the level of LD, and some tools have been shown to be more sensitive to these biases than others (92). If the number of genes in a gene set increases, the likelihood of observing a significant association in that gene set by chance also increases. Genes generally tend to have multiple functions and can be expressed in multiple tissues and cell types (22,23,93). In addition, classes of functional gene properties tend to be correlated across genes. For example, a set of genes involved in vesicle docking in the synapse will also show preferential expression in brain tissue. If this set of genes is enriched for strong statistical associations with a trait of interest, traditional GSA testing will result in significant associations for the "vesicle docking" and the "expressed in brain tissue" gene sets, unable to tell which of these is the most likely causal factor. Conditional GSA testing (94) addresses this issue by evaluating the evidence of association of one set (e.g., "vesicle docking") versus or in combination with another set (e.g., "expressed in brain tissue"). Such post hoc GSA testing has been shown to refine hypotheses on where convergence of polygenic signal resides (95), which is of crucial importance for designing meaningful functional follow-up experiments.

CHALLENGES SPECIFIC TO NEUROPSYCHIATRIC DISORDERS
Although GWASs for neuropsychiatric traits have yielded many replicable hits, several challenges specific to neuropsychiatric disorders may complicate post-GWAS analyses. First, the brain is the most relevant tissue for neuropsychiatric disorders, yet relatively few molQTL resources include extensive information on spatial and temporal characteristics of the brain, and even fewer are disease specific. Many neuropsychiatric disorders manifest in adulthood yet are believed to be developmental in nature, which further impedes access to relevant tissue. Second, even if the whole brain has been mapped in terms of regional and temporal expression of genes, this information will not immediately lead to mechanistic disease insight into neuropsychiatric disorders. GWASs and post-GWAS analyses will merely allow us to formulate hypotheses on convergence of trait-associated variants in specific brain regions, at a specific developmental stage, and potentially for specific cell types. However, functional follow-up experiments need a clear biomarker that can be evaluated in vivo in case-control groups or human brain tissue samples. Therefore, more extensive GWASinformed fundamental biological experiments are needed that map the temporal and spatial circuitry of specific cell types and how they are affected by genetic variation, while simultaneously dealing with the polygenic nature of neuropsychiatric disorders. Tools such as chemogenetic manipulation with DREADDs (designer receptors exclusively activated by designer drugs) (96) or induced pluripotent stem cell technology (97,98) that do not rely on manipulating single genes are particularly relevant in this regard. Third, neuropsychiatric disorders show extensive genetic correlations (99), which may be due to diagnostic (symptom) overlap, pleiotropic effects of causal genes, or causal relations between multiple traits. Thus, the study of a specific neuropsychiatric trait requires considering its pheno-and genotypic overlap with other neuropsychiatric traits. Genetic overlap can be quantified with LDHub (100), which automates and scales LDSC (85) genetic correlation analyses, or GCTA (101) when raw genotypic data are available. GenomicSEM (102) allows for more sophisticated modelling of relationships between multiple traits.

CONCLUSIONS
The widespread polygenicity of neuropsychiatric disorders precludes the identification of single disease-relevant genes and complicates the formulation of mechanistic hypotheses that can be tested in functional follow-up studies. To interpret outcomes from GWASs in biological context requires 1) functional annotation of trait-associated variants using external biological resources, 2) prioritization of likely causal variants and genes, and 3) evaluation of convergence at multiple biological levels using appropriate statistical methods. We have provided an overview of currently available functional genomic resources and methods that can be used in each of these 3 steps. There are several ways in which the currently available resources and methods can be optimized. First, most resources for annotation of the regulatory function of trait-associated variants relies on transcriptional data, while other intermediate molecular phenotypes may also be required to explain the regulatory consequences of genetic variation. More and larger studies are needed that map multiple molQTLs, preferably in an integrated manner, such as is done by the PsychENCODE (34) consortium. Second, current resources for single-variant annotation are mostly based on adult postmortem tissue, disregarding the temporal variation in molecular phenotypes. Temporal variation in gene function is especially important for many neuropsychiatric disorders, such as autism spectrum disorder and attention-deficit/hyperactivity disorder, that are developmental in nature (103). As a consequence, relevant biological mechanisms might only become noticeable at certain ages. Additionally, adult tissue may not contain the disease-causing cell types, which may be found exclusively in early developmental stages (104). Resources containing information at multiple developmental stages are thus needed. Third, complementing coarse bulk tissue data with fine-grained single-cell data constitutes another muchneeded advance. In the future, the Single-Cell eQTLgen Consortium (https://www.eqtlgen.org/single-cell.html) aims to identify the specific cell types in which eQTL effects emerge (21). The Allen Brain Institute has undertaken the Herculean task of defining all cell types in the adult human and mouse cerebral cortex according to their morphological, electrophysiological, and transcriptional properties (105)(106)(107). Human fetal neocortical cell types are now also being mapped comprehensively (108). Taking a step forward, the Human Cell Atlas (109) and HuBMAP (110) aim to define all human cell types. Once genetic variants have been mapped to genes, this information may allow for the delineation of disease mechanisms that are cell type specific. Fourth, relatively few resources provide data on donors with neuropsychiatric disorders. As the ultimate goal is to delineate disease mechanism, contrasting healthy and diseased tissue is likely to provide important insights. Fifth, some resources constitute a patchwork of small studies that impede easy integration with GWASs. In contrast, resources such as GTEx (23) and xQTL (42) provide single and large datasets that are easily accessible and readily integrated. Preferably larger, more homogeneous resources will be available in the future. Lastly, novel methods are needed that integrate information from multiple molQTLs to optimize prioritization of causal variants and genes, and that allow evaluating the effects of variants across the continuous allelic frequency spectrum (for a discussion of rare variants, see the Supplement).
Only by combining spatial and temporal trajectories, diseased and healthy tissue, and bulk and single-cell data and testing for convergence in well-defined gene sets will we be able to paint a Methods and Resources for Interpretation of GWAS Hits Biological Psychiatry January 1, 2021; 89:41-53 www.sobp.org/journal comprehensive picture of our 3-dimensional genomic and epigenomic landscape in the context of neuropsychiatric disorders. But this is only the first step toward true biological insight of disease. The approaches mentioned here are purely exploratory and will only be able to generate hypotheses, as opposed to confirm them. Computational approaches that predict how genetic mutations affect neuronal function at protein, cellular, or network level can be employed to formulate hypotheses that need to be validated in experimental studies. These computational approaches can include the annotated effects of hundreds of genetic variants taking into account the polygenic nature of mental disorders (111). Such computational approaches have successfully been applied in the context of cardiac traits, although they may prove to be more challenging to apply to neuropsychiatric disorders. Hypothesized biological mechanisms are not guaranteed to be immediately relevant for the trait of interest as, for example, the effects of variants may be mediated by environmental mechanisms or the relevant biological pathway may not have been tested. To prove causal involvement of biological mechanisms in neuropsychiatric disorders, we need functional in vitro and in vivo follow-up experiments. This approach requires close collaboration between human geneticists and neurobiologists, and the application of novel neurobiological tools that appropriately deal with the polygenic nature of neuropsychiatric disorders.