Genetic studies of neuropsychiatric disease strongly suggest an overlap in liability. There are growing efforts to characterize these diseases dimensionally rather than categorically, but the extent to which such dimensional models correspond to biology is unknown.
We applied a newly developed natural language processing method to extract five symptom dimensions based on the National Institute of Mental Health Research Domain Criteria definitions from narrative hospital discharge notes in a large biobank. We conducted a genome-wide association study to examine whether common variants were associated with each of these dimensions as quantitative traits.
Among 4687 individuals, loci in three of five domains exceeded a genome-wide threshold for statistical significance. These included a locus spanning the neocortical development genes RFPL3 and RFPL3S for arousal (p = 2.29 × 10⁻⁸) and one spanning the FPR3 gene for cognition (p = 3.22 × 10⁻⁸).
Natural language processing identifies dimensional phenotypes that may facilitate the discovery of common genetic variation that is relevant to psychopathology.
). Such overlap highlights the limitations of a nosologic system focused on categories of symptoms rather than dimensions. For this reason, recent initiatives emphasize the utility of identifying symptom domains that may better correspond to underlying neurobiology (
The rise of biobanks embedded in health care systems or national registries provides an opportunity to investigate the impact of genomic variation in a less biased fashion than traditional disease case-control designs. However, such biobanks typically capture primarily coded clinical data, i.e., categorical diagnoses. We have recently developed multiple methods to examine narrative clinical notes to extract symptom dimensions as a means of augmenting these coded data (
We hypothesized that symptom dimensions based on expert-curated terms capturing National Institute of Mental Health Research Domain Criteria (RDoC) domains would be associated with common genomic variation and could thereby implicate novel sets of genes related to psychopathology. As proof of concept, we therefore applied a newly described natural language processing (NLP) method for extracting dimensional phenotypes to hospital discharge summaries drawn from the genomic biobank of an academic medical center (
) and used standard genome-wide association studies to investigate these novel phenotypes as quantitative traits.
Methods and Materials
Overview and Data Set Generation
We drew on three waves of participants in the Partners Biobank from the Brigham and Women’s Hospital network and the Massachusetts General Hospital network, representing approximately the first 15,000 individuals genotyped as part of the Partners HealthCare Biobank initiative (
). Narrative discharge summaries were extracted from the longitudinal electronic health record of the Massachusetts General Hospital. We included any individuals 18 years of age or older who had at least one hospitalization between 2010 and 2015.
A datamart containing all clinical data was generated with i2b2 server software (version 1.6; i2b2, Boston, MA), a computational framework for managing human health data (
). The Partners HealthCare System Institutional Review Board approved both the study protocol and the release of biobank data, which were collected after acquiring written informed consent from participants and explicitly allowed identifiable data to be shared with qualified investigators.
Study Design and Analysis
Primary analyses used a cohort design with all patients admitted for any reason during the time period noted above. Discharge documentation was used to estimate dimensional psychopathology scores for one encounter per individual; when an individual was hospitalized on multiple occasions during the study period, a single hospitalization was selected at random to minimize bias resulting from other means of ascertainment. The derivation of dimensional psychopathology has been described elsewhere (
); in brief, it began with a set of seed terms for each of the five National Institute of Mental Health RDoC definitions drawn from National Institute of Mental Health workgroup statements, then expanded these term lists to include synonyms (
). This second expansion step is important because it reduces potential bias introduced by a given specialty or set of providers who may use specific terminology to characterize symptoms, yielding a broader set of terms that should better generalize across providers and hospitals. Each note is assigned a score corresponding to a simple count of term appearance. We have developed simple code to facilitate dimension extraction in other data sets (
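The selection and scoring steps described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the term lists, function names, and the restriction to single-word terms are all assumptions made for the example, and the curated RDoC-derived lists are not reproduced here.

```python
import random
import re
from collections import Counter, defaultdict

# Hypothetical placeholder terms; NOT the curated RDoC-derived term lists.
DOMAIN_TERMS = {
    "arousal": ["insomnia", "agitated", "restless"],
    "cognition": ["confused", "disoriented", "forgetful"],
}


def pick_one_note_per_patient(notes, seed=0):
    """Choose one discharge note per patient uniformly at random.

    `notes` is an iterable of (patient_id, text) pairs; the seed only
    makes the illustration reproducible.
    """
    by_patient = defaultdict(list)
    for patient_id, text in notes:
        by_patient[patient_id].append(text)
    rng = random.Random(seed)
    return {pid: rng.choice(texts) for pid, texts in by_patient.items()}


def score_note(text, domain_terms=DOMAIN_TERMS):
    """Score a note as a simple count of term appearances per domain.

    Only single-word terms are handled; multiword terms would need
    phrase matching, which this sketch omits.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        domain: sum(counts[term] for term in terms)
        for domain, terms in domain_terms.items()
    }


notes = [
    ("p1", "Patient was agitated and confused; remained agitated overnight."),
    ("p1", "No acute events."),
    ("p2", "Reports insomnia."),
]
selected = pick_one_note_per_patient(notes)
scores = {pid: score_note(text) for pid, text in selected.items()}
```

Each patient ends up with one score per domain, which is the quantity treated as a quantitative trait in the association analysis.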
DNA was extracted from buffy coat, and genotyping was done using three versions of the Illumina Multi-Ethnic Global (MEG) array (Illumina, Inc., San Diego, CA) (MEGA, n = 4927; MEGA EX, n = 5353; and MEG, n = 4784; mappable variants available for each were 1,411,334, 1,710,339, and 1,747,639, respectively). These common variant arrays all incorporate content from the 1000 Genomes Project Phase 3. Single nucleotide polymorphism (SNP) coordinates were remapped based on the TopGenomicSeq provided by Illumina; all reference SNP cluster IDs correspond to build 142 of the Single Nucleotide Polymorphism Database. To determine the forward strand of the SNP, we aligned both SNP sequences (alleles A and B) to hg19 using the BLAST-like alignment tool (BLAT) with default parameters set by the University of California Santa Cruz Genome Browser (
Each cohort was cleaned, imputed, and analyzed separately to avoid batch effects. In each batch we included subjects with genotyping call rates exceeding 99%; no related individuals based on identity by descent were included (
). From these individuals, any genotyped SNP with a call rate of at least 95% and a Hardy-Weinberg equilibrium p value of at least 1 × 10⁻⁶ was included. Imputation used the Michigan Imputation Server implementing Minimac3 (
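A per-SNP quality-control filter of this kind can be sketched as below. The sketch assumes the conventional reading in which variants showing strong departure from Hardy-Weinberg equilibrium (p below 1 × 10⁻⁶) are dropped, and it substitutes a 1-degree-of-freedom chi-square test for the exact test that genotype QC tools typically apply; the function names and the genotype-count interface are hypothetical.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium from genotype counts.

    This is an approximation swapped in for illustration; production
    pipelines generally use an exact test instead.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: no departure can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df.
    return math.erfc(math.sqrt(chisq / 2.0))


def passes_qc(n_aa, n_ab, n_bb, n_missing,
              min_call_rate=0.95, hwe_threshold=1e-6):
    """Keep a SNP if its call rate and HWE p value both meet the thresholds."""
    total = n_aa + n_ab + n_bb + n_missing
    if total == 0:
        return False
    call_rate = (n_aa + n_ab + n_bb) / total
    return (call_rate >= min_call_rate
            and hwe_chisq_pvalue(n_aa, n_ab, n_bb) >= hwe_threshold)
```

For example, `passes_qc(25, 50, 25, 0)` keeps a SNP whose genotype counts match Hardy-Weinberg proportions exactly, whereas a SNP with no heterozygotes among 100 genotyped individuals (`passes_qc(50, 0, 50, 0)`) is rejected.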
For each batch, we applied principal components analysis of a linkage-disequilibrium-pruned set of genotyped SNPs to characterize population structure, based on EIGENSTRAT as implemented in PLINK software version 1.9 (
). We then plotted these components with superimposition of HapMap samples to confirm location of Northern European individuals. The present analysis included only individuals of Northern European genomic ancestry to minimize the risk for confounding by ancestry (i.e., population stratification) and because the power to detect association in other ancestry groups would be limited (
We examined single-locus associations in each batch, then combined the results in an inverse-variance-weighted fixed-effects meta-analysis. In all analyses, only biallelic SNPs with minor allele frequencies of at least 1% in all batches were retained. Tests for association used linear regression assuming an additive allelic effect and examined each of the five dimensional measures as a quantitative trait, with a priori adjustment for the first 10 principal components. (In previous work, analyses incorporating five or 20 components did not yield meaningfully different results.) Association results are presented in terms of independent loci after pruning using the clump command in PLINK, with a 250-kb window and r² = 0.2. Locus plots were generated using LocusZoom software (
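For a single SNP, the inverse-variance-weighted fixed-effects combination of per-batch regression estimates can be sketched as follows. The function name and the inputs (per-batch effect sizes and standard errors) are illustrative assumptions; this is the textbook estimator, not a reproduction of the software used in the study.

```python
import math


def fixed_effects_meta(betas, ses):
    """Combine per-batch effect estimates with inverse-variance weights.

    betas: per-batch regression coefficients for one SNP
    ses:   their standard errors (all must be > 0)
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and equal length")
    weights = [1.0 / se ** 2 for se in ses]
    weight_sum = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / weight_sum
    pooled_se = math.sqrt(1.0 / weight_sum)
    z = pooled_beta / pooled_se
    # Two-sided normal p value; erfc keeps precision for large |z|.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p


# Three batches with identical estimates: the pooled effect is unchanged
# and the standard error shrinks by a factor of sqrt(3).
beta, se, z, p = fixed_effects_meta([0.1, 0.1, 0.1], [0.1, 0.1, 0.1])
```

Batches with smaller standard errors receive proportionally more weight, which is why each batch can be cleaned and analyzed separately and still contribute in proportion to its precision.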
Reported p values are not adjusted for lambda or linkage disequilibrium scores; in previous work, adjustment for lambda-1000 or linkage disequilibrium score regression intercept did not meaningfully change relative results. Lambdas ranged from 0.998 to 1.003 (
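The lambda referred to here is the genomic inflation factor. A minimal sketch of its usual definition, the median observed 1-degree-of-freedom chi-square statistic divided by the median expected under the null, is given below; the function name is hypothetical and the input is assumed to be a list of two-sided association p values strictly between 0 and 1 (a p value of exactly 1 is also accepted).

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Estimate the genomic inflation factor (lambda) from p values.

    Each two-sided p value is converted to a 1-df chi-square statistic,
    and the median is compared with the null median (about 0.455).
    """
    nd = NormalDist()
    # inv_cdf(p / 2) is the negative z score; squaring gives the chi-square.
    chisq = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values]
    # Median of chi-square(1 df): the square of the 75th percentile of N(0, 1).
    expected_median = nd.inv_cdf(0.75) ** 2
    return median(chisq) / expected_median


# Evenly spaced p values mimic the null distribution, so lambda is close to 1.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lam = genomic_inflation(uniform_p)
```

Values near 1, such as the range reported here, indicate little inflation of test statistics from population stratification or other confounding.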
Results
We examined 4687 individuals of Northern European ancestry across the three batches (wave 1, 1589; wave 2, 1547; wave 3, 1551), with meta-analysis of 893,900 SNPs with minor allele frequency of 0.01 or greater. The cohorts included 2363 females (50.4%), and the mean age was 64.3 years (SD, 14.9 years). Figure 1 shows Manhattan plots for each of the five dimensional phenotypes (Q-Q plots are shown in Supplemental Figure S1).
For each of the dimensions, the 10 independent loci with strongest evidence of association are described in Table 1. Overall, one locus was associated with arousal, two with social, and one with cognition at a standard genome-wide significance threshold (p < 5 × 10⁻⁸); these four regions are depicted in Figure 2. Notably, for arousal, the associated locus spans RFPL3 and RFPL3S; this family of proteins has been suggested to be important in primate neocortical evolution (
Discussion
In this analysis of 4687 individuals drawn from a biobank spanning two academic medical centers, we identified four loci associated with dimensional psychopathology at a standard genome-wide threshold based on natural language processing of narrative hospital discharge notes. Two of these span genes associated with neurodevelopment (RFPL3) or neurodegeneration (FPR3). While both are known to be brain expressed, neither has previously been strongly associated with neuropsychiatric disease, suggesting the potential utility of the approach we describe in understanding brain function in a manner that is unbiased by traditional nosology.
While not achieving a genome-wide threshold for significance, we also note the observed association between the calcium channel subunit CACNA2D3 and positive valence. This locus has previously been associated with pain sensitivity, which may impact reward responsiveness, suggesting convergent validity (i.e., assay sensitivity) (
). This family of subunits represents the target for multiple anticonvulsants used to treat neuropathic pain and has recently been shown to regulate accumulation of voltage-gated calcium channels and exocytosis at the synapse (
While these loci are promising as candidates for follow-up study, multiple limitations in this proof-of-concept study should be considered. First, while we exceed a standard threshold for genome-wide studies, replication will increase confidence in these results. (One could also argue that a more stringent experiment-wide threshold of 2 × 10⁻⁸, based on the correlation between these domains, would be appropriate.) We elected to meta-analyze all data available to us, rather than holding out a replication set, and present these results in the hope that they will encourage other hospital-linked biobanks to consider our approach. Second, as with any common variant study, none of these variants can be considered causal, and biological studies will be required to characterize their effect.
More broadly, it is entirely possible—indeed, likely—that other dimensional features or extraction methods, as well as incorporation of other data types, would lead to identification of other loci. We adopted a new method for identifying dimensional psychopathology from narrative clinical notes based on seed terms extracted from RDoC workgroup statements, which we have recently described in more detail along with initial validation (
). These scores do not yet address subdomains; sensitivity likely varies by domain, and indeed, as with RDoC itself, the presence of terms loading on a given domain does not necessarily represent psychopathology and may instead capture normal or subsyndromal variation. We note that the present study represents an example of transfer learning: a model trained in one type of cohort (psychiatric hospitalizations) is applied to distinguish features of another (all-cause hospitalizations), but further investigations of portability will be important. In particular, this approach complements rather than replaces analysis of more traditional curated phenotypes (
). Beyond investigating other strategies for concept extraction, it will be valuable to understand the extent to which incorporating other types of notes or integrating these data with coded clinical data improves the identification of dimensions of psychopathology [for further discussion of general methodologic considerations, please see (
With these caveats in mind, our results suggest an approach to identifying genes associated with psychopathology beyond traditional diagnostic categories, and they demonstrate the feasibility and potential utility of this broad class of approaches, aiming to be both transparent and portable. Narrative clinical notes may contain a wealth of clinical detail relevant to developing dimensional representations of brain diseases. With increasing availability of biobanks and registries as a resource for genomic discovery and translation, natural language processing represents a way to amplify their utility for investigating complex phenotypes that avoids the constraint of traditional psychiatric nosology.
Acknowledgments and Disclosures
This work was supported by National Human Genome Research Institute (NHGRI) Grant No. 1P50MH106933-04 and National Institute of Mental Health (NIMH) Grant No. 1R01MH106577-01A1 (to RHP) and the Broad Institute Stanley Center Fellowship and Brain and Behavior Foundation Grant No. 26489 (to THM). The sponsors had no role in study design, writing of the report, or data collection, analysis, or interpretation. The corresponding and senior authors had full access to all data and made the decision to submit for publication.
We thank the participants and administrators of the Partners HealthCare Biobank for their contribution to this work.
RHP serves on the scientific advisory board for Perfect Health, Genomind, and Psy Therapeutics and is a consultant for RID Ventures. The other authors report no biomedical financial interests or potential conflicts of interest.