Authors: Christina Leslie
Abstract: Dysregulated epigenetic programs are a feature of many cancers, and the diverse differentiation states of immune cells as well as their dysfunctional states in tumors are in part epigenetically encoded. We will present recent analysis work and computational methodologies from our lab to decode epigenetic programs from genome-wide data sets.
In a recent collaborative work, we characterized chromatin states governing CD8 T cell dysfunction in cancer and reported that tumor-specific T cells differentiate to dysfunction through two discrete chromatin states: an initial plastic state that can be functionally rescued (i.e. through immunotherapy) and a later fixed state that is resistant to therapeutic reprogramming. We now follow up on this work by presenting a computational framework to decipher transcriptional programs governing chromatin accessibility and gene expression in normal and dysfunctional T cell responses through a large-scale analysis of published data from mouse tumor and chronic viral infection models. This modeling shows that in all these systems, T cells commit to becoming dysfunctional early after an immune challenge, rather than first mounting and then losing an effector response. Through scRNA-seq analysis, we characterize the phenotypic diversity of this common trajectory from plastic to fixed dysfunction.
We will also present a recent collaboration with the Sawyers lab on FOXA1 mutants in prostate cancer, showing that somatic alterations in this pioneer transcription factor lead to altered differentiation programs, through analysis of ATAC-seq in mouse prostate organoid systems.
Finally, we will describe a novel machine learning approach called BindSpace to leverage massive in vitro TF binding data from SELEX-seq experiments through a joint embedding of DNA k-mers and TF labels, leading to improved prediction of TF binding.
Authors: Joel Saltz
Abstract: We have developed a rich set of Pathology informatics tools, methods and algorithms. This work includes pipelines to compute a variety of biologically significant Pathology features, including spatial maps of tumor infiltrating lymphocytes (TILs). These methods allow us to generate detailed maps of lymphocytes, tumor and necrotic areas. We have employed these methods to generate publicly available whole slide TIL maps and spatial statistics for 13 cancer types and 5,000 subjects; these are available on The Cancer Imaging Archive. This work is opening a path to in-depth studies that relate spatial TIL patterns to molecular characterizations. Of even greater importance is the rapid evolution of clinical and cancer surveillance studies that integrate Pathomics analyses with clinical and molecular analyses to predict outcome and response to treatment and to elucidate cancer population characteristics.
An Estrogen Regulated Feedback Loop Limits the Efficacy of Estrogen Receptor Targeted Breast Cancer Therapy
Authors: Tengfei Xiao, Wei Li, Shirley Liu and Myles Brown
Abstract: Endocrine therapy resistance invariably develops in advanced estrogen receptor positive (ER+) breast cancer, but the underlying mechanisms are largely unknown. Using genome-wide CRISPR/Cas9 knockout screens and the MAGeCK algorithm we previously developed, we have identified C-terminal SRC kinase (CSK) as a critical node in a previously unappreciated negative feedback loop that limits the efficacy of current ER targeted therapies. Estrogen directly drives CSK expression in ER+ breast cancer. At low CSK levels, as is the case in ER+ breast cancer patients resistant to endocrine therapy and with the poorest outcomes, the p21 protein-activated kinase 2 (PAK2) becomes activated and drives estrogen independent growth. PAK2 over-expression is also associated with endocrine therapy resistance and worse clinical outcome, and the combination of a PAK2 inhibitor with an ER antagonist synergistically suppressed breast tumor growth. Clinical approaches to endocrine therapy-resistant breast cancer must overcome the loss of this estrogen-induced negative feedback loop that normally constrains the growth of ER+ tumors.
Authors: Joo Sang Lee, Lital Adler, Hiren Karathia, Narin Carmel, Shiran Rabinovich, Noam Auslander, Rom Keshet, Arie Admon, David Wilson III, Yardena Samuels, Sridhar Hannenhalli, Eytan Ruppin and Ayelet Erez
Abstract: Although immune checkpoint therapy leads to durable clinical responses in many cancer patients, it fails in many others. To improve the efficacy of the treatment, it is highly important to identify predictive biomarkers. While tumor mutational burden and expression of the checkpoint targets have been associated with enhanced response to checkpoint immunotherapies, they so far provide only a modest predictive signal, and hence there is a need to identify additional predictive factors. Specifically, while there is growing evidence that metabolic alterations can affect the tumor and modulate the immune response, the potential effects of altered cancer metabolism on tumor mutagenesis and immunotherapy remain unexplored.
The urea cycle (UC) converts excess nitrogen derived from the breakdown of nitrogen-containing molecules (e.g., ammonia) to urea, a relatively non-toxic and disposable nitrogenous compound. We and others have shown that silencing of the UC enzyme ASS1 promotes cancer proliferation by diverting its substrate aspartate toward CAD enzyme, which mediates the first three reactions in the pyrimidine synthesis pathway. We now demonstrate, by analysis of the TCGA data, tumor samples and cancer cell line experiments, that UC dysregulation (UCD) is a much wider common metabolic phenomenon that maximizes nitrogen utilization in cancer, favoring pyrimidine synthesis over urea disposal. Remarkably, we find that the UCD changes the 1:1 purine (R)-to-pyrimidine (Y) ratio in favor of pyrimidine in cancer cells. Moreover, by analyzing whole-exome sequencing (WES), RNA sequencing (RNAseq) and proteomics data of the same patients in TCGA, we find that: (a) UCD is significantly associated with a novel and unique pattern of purine-to-pyrimidine transversion mutational bias across many cancer types at the DNA coding (sense) strand, and (b) this trend becomes stronger and more significant at both the mRNA and protein levels, testifying to its functional implications. Notably, the overall mutational load in cancer is negatively correlated with UCD, testifying to their independence.
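As a minimal illustration of the purine-to-pyrimidine (R->Y) transversion bias described above, the following sketch measures what share of the transversions in a list of single-base substitutions go from a purine to a pyrimidine. The function name and the input format (pairs of reference and alternate bases on the coding strand) are assumptions made for illustration; this is not the authors' analysis pipeline.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def r_to_y_fraction(substitutions):
    """Return the fraction of transversions that are purine -> pyrimidine.

    `substitutions` is an iterable of (ref_base, alt_base) pairs.
    Transitions (A<->G, C<->T) are ignored. Returns NaN when the input
    contains no transversions.
    """
    r_to_y = 0
    y_to_r = 0
    for ref, alt in substitutions:
        ref, alt = ref.upper(), alt.upper()
        if ref in PURINES and alt in PYRIMIDINES:
            r_to_y += 1
        elif ref in PYRIMIDINES and alt in PURINES:
            y_to_r += 1
    total = r_to_y + y_to_r
    return r_to_y / total if total else float("nan")
```

A value near 0.5 would indicate no directional bias among transversions, while values above 0.5 indicate an excess of R->Y changes.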
To test whether such mutational bias is associated with better immunotherapy response, we analyze published data of large melanoma cohorts. We find that responders of both anti-PD1 and anti-CTLA4 therapy exhibit significantly higher UCD and R->Y mutational bias than non-responders. We further observe that the peptides carrying transverse R->Y mutations are preferentially presented as neo-peptides in responders independent of mutational load through the computational prediction of MHC-I binding, and this trend becomes significant for more clonal neo-peptides, promoting UCD as a potential biomarker for the success of immunotherapy.
Collectively, these results support our hypothesis that UCD is a prevalent metabolic phenomenon in cancer, generating mutationally biased neo-peptides, worsening patients’ prognosis and yet enhancing the response to immune therapy independent of tumor mutational burden. Taken together, our findings point to the important role of UCD in a broad spectrum of cancers and to the role of UCD in predicting response to immune checkpoint therapy. Broadly, our results suggest future therapeutic interventions aiming to increase UCD levels to enhance the coverage and efficiency of cancer immunotherapy.
Aberrant ERBB4-SRC Signaling as a Hallmark of Group 4 Medulloblastoma Revealed by Integrative Phosphoproteomic Profiling
Authors: A Forget, L Martignetti, S Puget, L Calzone, S Brabetz, D Picard, A Montagud, S Liva, A Sta, F Dingli, G Arras, J Rivera, D Loew, A Besnard, J Lacombe, M Pagès, P Varlet, C Dufour, H Yu, AL Mercier, E Indersie, A Chivet, S Leboucher, L Sieber, K Beccaria, M Gombert, FD Meyer, N Qin, J Bartl, L Chavez, K Okonechnikov, T Sharma, V Thatikonda, F Bourdeaut, C Pouponnot, V Ramaswamy, A Korshunov, A Borkhardt, G Reifenberger, P Poullet, MD Taylor, M Kool, SM Pfister, D Kawauchi, Emmanuel Barillot, M Remke and O Ayrault
Abstract: The current consensus recognizes four main medulloblastoma subgroups (wingless, Sonic hedgehog, group 3 and group 4). While medulloblastoma subgroups have been characterized extensively at the (epi-)genomic and transcriptomic levels, the proteome and phosphoproteome landscape remain to be comprehensively elucidated. Using quantitative (phospho)-proteomics in primary human medulloblastomas, we unravel distinct posttranscriptional regulation leading to highly divergent oncogenic signaling and kinase activity profiles in groups 3 and 4 medulloblastomas. Specifically, proteomic and phosphoproteomic analyses identify aberrant ERBB4-SRC signaling in group 4. Hence, enforced expression of an activated SRC combined with p53 inactivation induces murine tumors that resemble group 4 medulloblastoma. Therefore, our integrative proteogenomics approach unveils an oncogenic pathway and potential therapeutic vulnerability in the most common medulloblastoma subgroup.
Inference of single-cell genotypes characterizes the genetic heterogeneity in pancreatic cancer precursor lesions
Authors: Violeta Beleva Guthrie, Lily Zheng, Cathy Guerra, Yuko Kuboki, Laura Wood and Rachel Karchin
Abstract: Intra-tumor heterogeneity (ITH) has critical implications for cancer diagnosis and treatments. Single-cell DNA sequencing (SCS) provides maximal resolution of ITH, and thus a powerful framework to understand in detail the dynamics of cancer evolution. However, SCS data are prone to technical errors consisting of both false positives and false negatives due to amplification biases and allelic dropout. Here, we present a computational tool to determine single-cell genotypes that leverages information jointly from multiple single cells, as well as from multi-region bulk sequencing. Our method reduces errors due to amplification bias by combining existing variant calling tools with a clustering and imputation algorithm. We apply our method to characterize the genetic heterogeneity of pancreatic cancer precursor lesions and show that different mutations in the same early cancer driver gene often occur in unique tumor clones within the same lesion. This suggests the possibility of polyclonal origin or an unidentified initiating event preceding this critical mutation. Multiple mutations in later-occurring driver genes are also frequently localized to unique tumor clones, raising the possibility of convergent evolution of these genetic events in pancreatic tumorigenesis. Overall, this analysis provides the first insights into genetic heterogeneity of pancreatic cancer precursors at the single-cell level and has implications for the development of early detection approaches as well as prediction of response to targeted therapy.
Authors: Johannes G. Reiter, Alvin P. Makohon-Moore, Jeffrey M. Gerold, Alexander Heyde, Christine A. Iacobuzio-Donahue, Bert Vogelstein and Martin A. Nowak
Abstract: Therapy selection and treatment success of patients increasingly depend on the identification of genetic alterations. Recent studies reported that some of the identified genetic alterations were only present in subpopulations of tumor cells and hence pose a barrier to the success of this precision medicine approach. Genetic intratumoral heterogeneity is a natural consequence of imperfect DNA replication. Any two randomly selected cells, whether normal or cancerous, are therefore genetically different. While genomic heterogeneity within primary tumors is associated with relapse, heterogeneity among treatment-naïve metastases has not been comprehensively assessed. Moreover, as a result of the different forms of tumor heterogeneity and the recent focus on subclonal heterogeneity, some discrepancies have arisen between the interpretations of observed heterogeneity and its clinical implications. Other discrepancies arise from loose distinctions between functional driver gene mutations and passenger mutations because not every mutation within a bona fide driver gene actually drives tumorigenesis.
We evaluated the extent of genetic heterogeneity within untreated cancers with particular regard to its clinical relevance. We analyzed sequencing data for 76 untreated metastases from 20 patients and inferred cancer phylogenies for breast, colorectal, endometrial, gastric, lung, melanoma, pancreatic, and prostate cancers. We found that within individual patients a large majority of driver gene mutations are common to all metastases. To determine whether mutations in putative driver genes were likely to be functional, we pooled the information of various databases and bioinformatic methods to predict their functional consequences. We observed that the driver gene mutations that were not shared by all metastases are unlikely to have functional consequences. With a single biopsy of a primary tumor in 14 patients, the likelihood of missing a functional driver gene mutation that was present in all metastases was 2.6%. Furthermore, all functional driver gene mutations detected in the primary tumor were present among all metastases of a patient. To identify the evolutionary determinants of intermetastatic heterogeneity, we developed a mathematical framework to assess how rates of growth, mutation, and dissemination give rise to driver gene mutation heterogeneity. We find that the original founding clone of the primary tumor most likely seeds all detectable metastases. The increased growth rate conferred by a new driver mutation is insufficient to compensate for the time spent waiting for the driver mutation to occur. The model reveals that the probability of observing intermetastatic driver heterogeneity increases when the primary tumor grows very slowly before metastases are seeded, the average growth advantage of additional driver mutations is very large, and the driver gene mutation rate is high.
These data indicate that the cells within the primary tumors that gave rise to metastases are genetically homogeneous with respect to functional driver gene mutations. Thus, single biopsies capture most of the functionally important mutations in metastases and therefore provide essential information for therapeutic decision making.
Authors: Fang Wang, Shaojun Zhang, Tae-Beom Kim, Yu-Yu Lin, Ramiz Iqbal, Zixing Wang, Vakul Mohanty, Kanishka Sircar, Jose Karam, Michael Wendl, Funda Meric-Bernstam, John Weinstein, Li Ding, Gordon Mills and Ken Chen
Abstract: Achieving precision and individualization is the key to further advancing the understanding and treatment of cancer. Molecular profiling of tissue samples (e.g., tumor) using bulk DNA sequencing is of limited power and precision. Multi-omics profiling such as parallel genome and transcriptome sequencing promises a comprehensive, functional readout of a tissue sample. However, novel, systematic approaches are required to leverage the increased data dimensionality, heterogeneity and complexity.
To address the issues of intra-tissue heterogeneity and integration of multi-omics in cancer, we developed a statistical approach called Texomer that enables a joint analysis of bulk DNA and RNA sequencing data to perform allele-specific deconvolution and quantify tumor purity and heterogeneity. Texomer first estimates tumor DNA purity, intratumor heterogeneity (ITH) and allele-specific copy numbers (ASCNs) using germline single nucleotide polymorphisms (SNPs) and somatic single nucleotide variants (SNVs) sites from the bulk whole exome (WES) data, through iterative integerization of the tumor ASCNs. It then estimates tumor RNA purity and allele-specific expression levels (ASELs) through maximizing the likelihood of observing the RNA read counts in the autologous whole transcriptome sequencing (WTS) data, given the estimated DNA purity and ASCNs. Texomer further probabilistically classifies each SNP and SNV site as copy number concordant or discordant, which can be used to reveal functionally selected variants in copy number altered regions. Finally, Texomer estimates a differential allelic cis-regulatory effect (DACRE) score to quantify the tumorigenic potential of a variant allele relative to its wildtype.
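To make the link between tumor purity, allele-specific copy number and observed allele fractions concrete, the sketch below computes the expected variant allele fraction (VAF) of a clonal somatic variant in a bulk sample under the standard two-population mixture model (tumor cells plus diploid normal cells). It is only a simplified illustration of the quantities named above, not Texomer's statistical model; the function and parameter names are hypothetical.

```python
def expected_vaf(purity, tumor_total_cn, mutant_copies, normal_cn=2):
    """Expected VAF of a clonal somatic variant in a bulk tumor sample.

    purity         -- fraction of tumor cells in the sample, in [0, 1]
    tumor_total_cn -- total copy number of the locus in tumor cells
    mutant_copies  -- number of those copies that carry the variant
    normal_cn      -- copy number in admixed normal cells (assumed 2)
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must lie in [0, 1]")
    if not 0 <= mutant_copies <= tumor_total_cn:
        raise ValueError("mutant_copies must lie in [0, tumor_total_cn]")
    # Average number of mutant and total copies per cell in the mixture.
    mutant = purity * mutant_copies
    total = purity * tumor_total_cn + (1.0 - purity) * normal_cn
    if total == 0:
        raise ValueError("locus has zero copies in the sample")
    return mutant / total
```

For example, a heterozygous variant in a copy-neutral region of a sample with 60% purity has an expected VAF of 0.3 under this model.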
Evaluation using simulated data and multiple datasets from The Cancer Genome Atlas (TCGA) indicated that Texomer achieved desirable technical accuracy and outperformed existing tools such as ASCAT, TITAN, Sequenza and FACETS. We found more accurate genotype-phenotype association based on Texomer-transformed profiles than on bulk WES/WTS data in the tumors. Moreover, breast cancer categorization based on Texomer-transformed profiles achieved improved accuracy, resulting in more clusters with homogeneous profiles and distinct biological properties than categorization based on the bulk data. In addition, the improved power achieved by Texomer manifested in significantly improved accuracy for functional variant prioritization. Applying the DACRE score as a filter doubled the precision in predicting the known functional variants. Our study clearly revealed the analytical challenges involved in performing joint WES and WTS profiling of patient samples, and delivered a statistically robust solution, Texomer, to realize the benefits of multi-omics profiling towards personalized genomic medicine.
Late Breaking Research Talks
Authors: Nuraini Aguse, Yuanyuan Qi and Mohammed El-Kebir
Abstract: Cancer phylogenies are key to studying tumorigenesis and have clinical implications. Due to the heterogeneous nature of cancer and limitations in current sequencing technology, current cancer phylogeny inference methods identify a large solution space of plausible phylogenies. To facilitate further downstream analyses, methods that accurately summarize such a set T of cancer phylogenies are imperative, so as to remove inference errors and identify common dependencies between mutations in the input trees. However, current summary methods are limited to a single consensus tree or graph and may miss important topological features that are present in different subsets of candidate trees.
We introduce the MULTIPLE CONSENSUS TREE (MCT) problem to simultaneously cluster T and infer a consensus tree for each cluster. We show that MCT is NP-hard, and present an exact algorithm based on mixed integer linear programming (MILP). In addition, we introduce a heuristic algorithm that efficiently identifies high-quality consensus trees, recovering all optimal solutions identified by the MILP in simulated data in a fraction of the time. We demonstrate the applicability of our methods on both simulated and real data, showing that our approach selects the number of clusters depending on the complexity of the solution space T.
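The idea of clustering a set of candidate trees while choosing one representative tree per cluster can be sketched on tiny inputs as follows. This is a toy illustration under several assumptions that the abstract does not specify: trees are represented as sets of (parent, child) mutation edges, the distance between two trees is the size of the symmetric difference of their edge sets, each cluster is represented by its medoid input tree rather than an inferred consensus tree, and the search is exhaustive rather than MILP-based. All function names are hypothetical.

```python
from itertools import product


def parent_child_distance(tree_a, tree_b):
    """Number of (parent, child) edges present in exactly one of the trees."""
    return len(tree_a ^ tree_b)


def medoid(trees):
    """Input tree with the smallest total distance to the other trees."""
    return min(
        trees,
        key=lambda cand: sum(parent_child_distance(cand, t) for t in trees),
    )


def best_clustering(trees, k):
    """Try every assignment of trees to k non-empty clusters (tiny inputs only).

    Returns (total_cost, assignment), where assignment[i] is the cluster
    index of trees[i] and the cost sums each tree's distance to the medoid
    of its cluster.
    """
    best_cost, best_assignment = None, None
    for assignment in product(range(k), repeat=len(trees)):
        if len(set(assignment)) != k:
            continue  # skip assignments that leave a cluster empty
        cost = 0
        for cluster in range(k):
            members = [t for t, a in zip(trees, assignment) if a == cluster]
            representative = medoid(members)
            cost += sum(parent_child_distance(representative, t) for t in members)
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_cost, best_assignment
```

Comparing the optimal cost for k = 1, 2, ... on such inputs shows how a single representative can hide structure that several cluster-specific representatives capture.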
Authors: Hatice Osmanbeyoglu, Fumiko Shimizu, Angela Rynne-Vidal, Tsz-Lun Yeung, Hannah Wen, Petar Jelinic, Samuel Mok, Gabriela Chiosis, Douglas Levine and Christina Leslie
Abstract: Epigenomic data on transcription factor occupancy and chromatin accessibility can elucidate the developmental origin of cancer cells and reveal the enhancer landscape of key oncogenic transcriptional regulators. We develop a computational strategy called PSIONIC (patient-specific inference of networks informed by chromatin) to combine cell line chromatin accessibility data with large tumor expression data sets and model the effect of enhancers on transcriptional programs in multiple cancers. We generated a new ATAC-seq data set profiling chromatin accessibility in gynecologic and basal breast cancer cell lines and applied PSIONIC to 723 patient and 96 cell line RNA-seq profiles from ovarian, uterine, and basal breast cancers. Our computational framework enables us to share information across tumors to learn patient-specific TF activities, revealing regulatory differences between and within tumor types. Moreover, PSIONIC-predicted activity for MTF1 in cell line models correlated with sensitivity to MTF1 inhibition, showing the potential of our approach for personalized therapy. Many of the identified TFs were significantly associated with survival outcome in basal breast, uterine serous and endometrioid carcinomas. To validate PSIONIC-derived prognostic TFs, we performed immunohistochemical analyses in 31 uterine serous tumors for ETV6 and 45 basal breast tumors for MITF and confirmed that the corresponding protein expression patterns were also significantly associated with prognosis.
Authors: Gryte Satas, Simone Zaccaria, Geoffrey Mon and Ben Raphael
Abstract: Background: Cancer is characterized by intratumor heterogeneity, where tumor cells contain different collections of somatic mutations, including single-nucleotide variants (SNVs) and copy-number aberrations (CNAs). Single-cell DNA sequencing has emerged as a promising technology to measure this heterogeneity at the level of individual cells and reconstruct the evolutionary process of cancer. Current studies infer single-cell phylogenies using either SNVs or CNAs alone. However, SNVs often overlap with CNAs, and a CNA may delete an SNV later during tumor evolution. Not accounting for SNV losses may lead to the incorrect inference of single-cell phylogenies for tumors.
Results: We describe a loss-supported model for single-cell phylogenies that uses constraints from observed CNAs to inform phylogenetic inference. We use this model to develop a new algorithm, SCARLET (Single-Cell Algorithm for Reconstructing Loss-supported Evolution of Tumors). On simulated data, SCARLET outperforms existing methods in reconstructing the structure of single-cell phylogenies and inferring the presence of mutations. We then use SCARLET to analyze a single-cell dataset from a colorectal cancer patient and show that the loss-supported model yields more plausible phylogenies than existing methods.
Authors: Welles Robinson, Roded Sharan and Mark DM Leiserson
Abstract: Somatic mutations are caused by endogenous and exogenous DNA damage as well as DNA replication errors that are not fixed by the various DNA repair pathways. Previous computational approaches have revealed 30 distinct signatures of mutational processes active in human cancers, where each signature is a pattern of single base substitutions. However, half of these signatures have no known etiology, and some signatures are very similar, making the interpretation of mutation signature activity challenging. Existing mutation signature detection methods do not consider tumor-level clinical/demographic (e.g., smoking history) or molecular features (e.g., inactivations to DNA repair genes).
To begin to address these challenges, we present the Tumor Covariate Signature Model (TCSM), the first method for mutation signature extraction that directly models the effect of observed tumor-level covariates. Borrowing from the field of Bayesian topic modeling, TCSM changes the prior expectation of signature exposure based on the observed covariates of a tumor. On both simulated and real data, TCSM outperforms both non-negative matrix factorization and other topic modeling-based approaches, particularly in recovering the ground truth exposure to similar signatures that are associated with different covariates. We use TCSM to identify five mutation signatures in breast cancer and predict homologous recombination repair deficiency in held-out tumors. In a combined melanoma and lung cancer cohort, we use TCSM, including cancer type as one of the covariates, to identify four signatures and provide statistical evidence to support earlier claims that three lung cancers from TCGA are misdiagnosed metastatic melanomas.
Authors: Yoo-Ah Kim, Rebecca Sarto Basso, Damian Wójtowicz, Dorit S. Hochbaum, Fabio Vandin and Teresa Przytycka
Abstract: Phenotypic heterogeneity in cancer is often caused by different patterns of genetic alterations. Understanding such phenotype-genotype relationships is fundamental for the advance of personalized medicine. One of the important challenges in the area is to predict drug response on a personalized level. Recent projects have characterized drug sensitivity for a large number of drugs in hundreds of cancer cell lines, and the drug response data, together with information about the genetic alterations in these cells, can be used to understand how genomic alterations impact drug sensitivity. Phenotype-genotype relationships in cancer can be better interpreted in a pathway-centric view, in which genetic alterations in the disease are considered from the context of dysregulated pathways rather than from the perspective of mutations in individual genes. However, most pathway identification methods in cancer focus on finding subnetworks that include general cancer drivers or are associated with discrete features such as cancer subtypes, and hence cannot be applied directly to the analysis of continuous features like drug response. On the other hand, existing genome-wide association approaches do not fully utilize the complex properties of cancer mutations, and are not designed to zoom in on subnetworks that are specific enough to help understand drug action.
To address these challenges, we propose a computational method, named NETPHLIX (NETwork-to-PHenotype mapping LeveragIng eXclusivity), which aims to identify subnetworks of mutated genes that are collectively associated with continuous cancer phenotypes. Utilizing properties such as mutual exclusivity and functional interactions among genes, we formulate the problem as an integer linear program and solve it optimally to obtain a connected set of mutated genes maximizing the association. Once we obtain the optimal gene modules, we assess both the significance and robustness of the identified modules by performing permutation tests and bootstrapping. We evaluated NETPHLIX and other related methods using simulations and demonstrated that NETPHLIX outperforms competing methods. In particular, using network information helps NETPHLIX identify coherent modules without adding spurious genes, compared to previous approaches.
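The permutation-testing step mentioned above can be illustrated with a generic sketch: given a continuous phenotype (for example, a drug response value per cell line) and a flag saying whether each sample carries a mutation in a candidate module, shuffle the flags to estimate how often a difference in mean phenotype at least as large arises by chance. The test statistic, the function name and the parameters are assumptions chosen for illustration; the abstract does not describe NETPHLIX's exact procedure.

```python
import random


def permutation_pvalue(phenotype, mutated, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in mean phenotype.

    phenotype -- list of continuous values, one per sample
    mutated   -- list of booleans, True if the sample has a module mutation
    """
    if len(phenotype) != len(mutated):
        raise ValueError("phenotype and mutated must have the same length")
    if all(mutated) or not any(mutated):
        raise ValueError("both mutated and unmutated samples are required")

    def mean_difference(labels):
        mut = [p for p, m in zip(phenotype, labels) if m]
        wt = [p for p, m in zip(phenotype, labels) if not m]
        return abs(sum(mut) / len(mut) - sum(wt) / len(wt))

    rng = random.Random(seed)  # fixed seed for reproducible results
    observed = mean_difference(mutated)
    labels = list(mutated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break the phenotype-genotype link
        if mean_difference(labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Shuffling preserves the number of mutated samples, so every permuted label set keeps both groups non-empty.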
Analyzing a cell line drug response dataset with NETPHLIX, we identified sensitivity-associated subnetworks for a large set of drugs, including interesting response modules to MEK1/2 inhibitors in both directions (increased and decreased sensitivity to the drug) that the previous method, which does not utilize network information, failed to identify. The genes belong to two distinct modules that are related to MAPK/ERK signaling but associated with opposite response to the drug targeting MEK1/2 and ERK2 genes. Effective computational methods to discover these associations will improve our understanding of the molecular mechanism of drug sensitivity, help to identify potential drug combinations, and have profound impacts on genome-driven, personalized drug therapy. Furthermore, NETPHLIX can be used to identify subnetworks of mutated genes that are associated with any continuous phenotype beyond drug response data.
NETPHLIX is available at https://www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/index.cgi#netphlix
Authors: Xiaoshan Melody Shao, Justin Huang, Ashok Sivakumar, Kymberleigh Pagel and Rachel Karchin
Abstract: Whole-exome sequencing of tumor samples is now commonly used to predict neoantigen peptides, which may have clinical utility as immunotherapy biomarkers of therapeutic response and in peptide vaccine design. We have developed a peptide-MHC binding deep neural network method, MHCnuggets, for high-throughput pipelines, designed to be applied to tens of thousands of patient samples. Our long short-term memory (LSTM) neural network architecture (Hochreiter, 1995) is more flexible and much faster than currently available computational tools designed for this purpose. It handles peptides that bind to common or rare alleles of MHC Class I or Class II, while other tools require separate network architectures and software packages for each of these tasks. The networks can be trained on binding affinities from biochemical assays and/or results of ligand elution/mass spectrometry. In benchmark experiments, the LSTM has similar positive predictive value to other methods as a binding affinity predictor, but improves positive predictive value fourfold in predicting endogenous bound peptides identified through mass spectrometry. We show advantages of the LSTM neural network architecture in preserving natural amino-acid residue ordering in peptides, rather than considering only abundant amino acids at peptide anchor residues. We also demonstrate the feasibility of applying our method to very large data sets of patients. Neoantigen prediction, including binding affinity calculation for each somatically mutated peptide and its unmutated equivalent, for 6981 HLA haplotyped samples in 28 Cancer Genome Atlas tumor types runs in 8 hours on a single GPU node.
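The pairing of each somatically mutated peptide with its unmutated equivalent can be sketched as a simple windowing step that precedes any binding prediction. The function below enumerates every fixed-length peptide window that overlaps a single amino-acid substitution and returns mutant and wild-type peptides side by side. The function name, the 0-based position convention and the default length of 9 residues are assumptions for illustration; this is not part of the MHCnuggets software.

```python
def peptide_windows(protein, pos, alt, length=9):
    """Return (mutant_peptide, wildtype_peptide) pairs covering a substitution.

    protein -- wild-type protein sequence (one-letter amino-acid codes)
    pos     -- 0-based index of the substituted residue
    alt     -- substituted amino acid
    length  -- peptide length to enumerate
    """
    if not 0 <= pos < len(protein):
        raise ValueError("pos is outside the protein sequence")
    mutant = protein[:pos] + alt + protein[pos + 1:]
    # A window starting at `start` covers pos when start <= pos < start + length.
    first_start = max(0, pos - length + 1)
    last_start = min(pos, len(protein) - length)
    pairs = []
    for start in range(first_start, last_start + 1):
        end = start + length
        pairs.append((mutant[start:end], protein[start:end]))
    return pairs
```

Each returned pair could then be scored by a binding predictor for the patient's HLA alleles; proteins shorter than the requested length yield no windows.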
Harnessing synthetic lethality to predict the response to cancer treatment
Authors: Joo Sang Lee, Avinash Das, Livnat Jerby-Arnon, Rand Arafeh, Noam Auslander, Matthew Davidson, Lynn McGarry, Daniel James, Arnaud Amzallag, Seung Gu Park, Kuoyuan Cheng, Welles Robinson, Dikla Atias, Chani Stossel, Ella Buzhor, Gidi Stein, Joshua Waterfall, Paul Meltzer, Talia Golan, Sridhar Hannenhalli, Eyal Gottlieb, Cyril Benes, Yardena Samuels, Emma Shanks and Eytan Ruppin
Abstract: Significance: The identification of Synthetic Lethal interactions (SLi) has long been considered a foundation for the advancement of cancer treatment. The rapidly accumulating large-scale patient data now provide a golden opportunity to infer SLi directly from patient samples. Here we present a new data-driven approach termed ISLE for identifying SLi, which is then shown, for the first time, to be predictive of clinical outcomes of cancer treatment in an unsupervised manner.
Methods: ISLE consists of four inference steps, analyzing tumor, cell line and gene evolutionary data. First, it identifies putative SL gene pairs whose co-inactivation is underrepresented in tumors, testifying that they are selected against. Second, it prioritizes candidate SL pairs whose co-inactivation is associated with better prognosis in patients, testifying that they may hamper tumor progression. Third, it eliminates false positive SLi using gene essentiality screens (testifying to causal SLi relations). Finally, it prioritizes SLi whose paired genes have similar evolutionary phylogenetic profiles.
Results: We applied ISLE to analyze the TCGA tumor collection and generated the first clinically-derived pan-cancer SL-network, composed of SLi common across many cancer types. We validated that these SLi match the known, experimentally identified SLi (AUC=0.87), and show that the SL-network is predictive of patient survival in an independent breast cancer dataset (METABRIC). Based on the predicted SLi, we predicted drug response in a wide variety of in vitro, mouse xenograft and patient data, altogether encompassing >700 single drugs and >5,000 drug combinations in >1,000 cell lines, 375 xenograft models and >5,000 patient samples. Importantly, these predictions were performed in an unsupervised manner, reducing the known risk of over-fitting the data commonly associated with supervised prediction methods. SL-derived predictions are based on computing an SL-score that estimates the efficacy of a given drug in a given tumor based on the latter’s omics data. The SL-score counts the number of inactive SL-partners of a given drug target(s) in the given tumor, reflecting the notion that a drug is likely to be more effective in tumors where many of its targets’ SL-partners are inactive. The predicted SL-scores show significant correlations (R > 0.4) with large-scale in vitro and in vivo drug response screens for the majority of drugs tested. Based on the conjecture that synergism between drugs may be mediated by underlying SLi between their targets, we additionally provide accurate predictions of drug synergism for both in vitro and in vivo drug combination screens (AUC~0.8). Most importantly, we demonstrate for the first time that an SL-network can successfully predict the treatment outcome in cancer patients in multiple large-scale patient datasets including the TCGA, where SLis successfully predict patients’ response for 75% of cancer drugs.
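The SL-score described above, the number of inactive SL-partners of a drug's targets in a given tumor, can be sketched in a few lines. The data structures (a dictionary mapping each gene to its SL-partners and a set of genes called inactive in the tumor) and all gene names are hypothetical placeholders; how ISLE defines gene inactivity from omics data is not covered here.

```python
def sl_score(drug_targets, sl_partners, inactive_genes):
    """Count inactive SL-partners of a drug's targets in one tumor.

    drug_targets   -- iterable of genes targeted by the drug
    sl_partners    -- dict mapping a gene to an iterable of its SL-partners
    inactive_genes -- iterable of genes considered inactive in the tumor
    """
    partners = set()
    for target in drug_targets:
        partners.update(sl_partners.get(target, ()))
    # Each partner is counted once, even if shared by several targets.
    return len(partners & set(inactive_genes))


# Hypothetical example: placeholder gene names, not real interactions.
network = {
    "TARGET_1": ["GENE_A", "GENE_B", "GENE_C"],
    "TARGET_2": ["GENE_C", "GENE_D"],
}
tumor_inactive = {"GENE_A", "GENE_C", "GENE_X"}
score = sl_score(["TARGET_1", "TARGET_2"], network, tumor_inactive)
```

In this toy example the score is 2 (GENE_A and GENE_C), and a higher score would suggest that the drug is more likely to be effective in that tumor.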
Conclusions: ISLE is predictive of the patients’ response for the majority of current cancer drugs. Of paramount importance, the predictions of ISLE are based on SLi between (potentially) all genes in the cancer genome, thus prioritizing treatments for patients whose tumors do not bear specific actionable mutations in cancer driver genes, offering a novel approach to precision-based cancer therapy. The predictive performance of ISLE is likely to further improve with the expected rapid accumulation of additional cancer omics and clinical phenotypic data.
Detection of Fraud in Large Omics Datasets
Authors: Michael Bradshaw and Samuel Payne
Abstract: Fraud is a universal problem and affects our society through events like identity theft, money laundering, fixed elections and fabricated clinical trial results. With the rise of electronic crimes, specific criminal justice and regulatory bodies have been formed. The scientific community is not exempt from fraud; about 1% of authors admit to data fabrication, meaning thousands of articles are published each year with manipulated data. Some of these false research products could be the basic science motivating drug development, while others may be clinical investigations. Current measures to prevent and deter scientific misconduct come in the form of the peer-review process and clinical trial auditors.
Recent advances in high-throughput omics technologies have moved biology into the realm of big-data. Large-scale characterization of cell lines and drugs like Cancer Cell Line Encyclopedia and the Connectivity Map point to future clinical trial requirements including omics data collection. This new quantity of data requires methods of computational fraud detection capable of identifying patterns hidden in big-data.
We explore methods of data fabrication and detection in cancer proteogenomic experiments using supervised machine learning and Benford’s law-like digit preferences. We use data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) cohort for endometrial carcinoma, specifically the copy number alteration (CNA), proteomic and transcriptomic datasets. Data from 100 tumor samples was supplemented with 50 additional fake samples. Three different methods of varying sophistication are used for fabrication: random number generation, resampling with replacement and imputation.
Data Generation: Random Number – Fake samples are generated by randomly picking numbers between the maximum and minimum values observed in the original data. Resampling with Replacement – Fake samples are generated by creating lists of all observed values across the cohort for a gene. Values for the fabricated data are then chosen by randomly sampling from these lists with replacement. Imputation – A fake sample is generated by first copying a real sample. We then iteratively introduce NAs and impute them with missForest until every value has been imputed.
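The first two fabrication strategies can be sketched in a few lines of Python. This is only an illustration: it assumes samples are stored as equal-length lists of per-gene values, it takes the minimum and maximum over the whole dataset (the text does not say whether the range is global or per gene), and the function names are hypothetical. The imputation strategy is omitted because it relies on the R package missForest.

```python
import random


def fabricate_random(real_samples, rng):
    """Fake sample drawn uniformly between the global min and max (assumption)."""
    values = [v for sample in real_samples for v in sample]
    low, high = min(values), max(values)
    n_genes = len(real_samples[0])
    return [rng.uniform(low, high) for _ in range(n_genes)]


def fabricate_resampled(real_samples, rng):
    """Fake sample built gene by gene from values observed in the cohort."""
    n_genes = len(real_samples[0])
    fake = []
    for g in range(n_genes):
        observed = [sample[g] for sample in real_samples]
        fake.append(rng.choice(observed))  # sampling with replacement
    return fake


rng = random.Random(0)
# Three toy "real" samples with three genes each (made-up numbers).
real = [[0.12, 1.50, -0.30], [0.45, 1.10, -0.70], [0.33, 1.90, -0.10]]
fake_random = fabricate_random(real, rng)
fake_resampled = fabricate_resampled(real, rng)
```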
Results: Our machine learning model was inspired by Benford’s law, which is extensively used to detect financial fraud. Thus, instead of using the high dimensional matrices, the omics data measurements of proteins or genes were converted into 20 features: the percent occurrence of the digits 0-9 in the first two digits after the decimal place. We train 6 machine learning models on digit preference of real and fake data. Final models were tested on 100 iterations of data fabrication and a mean accuracy calculated. On data fabricated with random number generation or resampling, several machine learning models were able to achieve 100% accuracy. Detecting fabrication in imputation of the CNA data was also surprisingly successful, with many models achieving >98% accuracy.
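As a rough illustration of the 20 digit-preference features, the sketch below computes, for one sample, the percent occurrence of each digit 0-9 at the first and at the second position after the decimal point. The function name `digit_features` is hypothetical, and the sketch assumes finite numeric values; formatting to ten decimal places is a pragmatic choice to sidestep floating-point representation noise, not something stated in the abstract.

```python
def digit_features(values):
    """Return 20 features: percent of digits 0-9 at decimal positions 1 and 2."""
    counts = [[0] * 10, [0] * 10]
    n = 0
    for v in values:
        # Fixed-point formatting keeps trailing zeros, e.g. 0.2 -> "2000000000".
        decimals = format(abs(v), ".10f").split(".")[1]
        counts[0][int(decimals[0])] += 1
        counts[1][int(decimals[1])] += 1
        n += 1
    if n == 0:
        return [0.0] * 20
    # Features 0-9: first decimal digit; features 10-19: second decimal digit.
    return [100.0 * c / n for position in counts for c in position]


features = digit_features([1.23, 4.25, 0.20])
```

Each sample's feature vector, labelled real or fake, could then be passed to any standard classifier.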
CHASMplus reveals the scope of somatic missense mutations driving human cancers
Authors: Collin Tokheim and Rachel Karchin
Abstract: Large-scale cancer sequencing studies of patient cohorts have statistically implicated many genes driving cancer growth and progression, and their identification has yielded substantial translational impact. However, a remaining challenge is to increase the resolution of driver prediction from the level of genes to mutations, because mutation-level predictions are more closely aligned with the goal of precision cancer medicine. Here we present CHASMplus, a computational method that is uniquely capable of identifying driver missense mutations, including those specific to a cancer type, as evidenced by significantly superior performance on diverse benchmarks. Applied to 8,657 tumor samples across 32 cancer types in The Cancer Genome Atlas, CHASMplus identifies over 4,000 unique driver missense mutations in 240 genes, supporting a prominent role for rare driver mutations. We show which TCGA cancer types are likely to yield discovery of new driver missense mutations by additional sequencing, which has important implications for public policy.
Atlas of Cancer Signaling Network: a resource of multi-scale biological maps to study disease mechanisms
Authors: Inna Kuperstein
Abstract: Summary: Cancerogenesis is associated with aberrant functioning of a complex network of molecular interactions, simultaneously affecting multiple cellular functions. Therefore, the successful application of bioinformatics and systems biology methods for analysis of high-throughput data in cancer research heavily depends on the availability of global and detailed reconstructions of cancer-specific molecular networks amenable to computational analysis. We present here the second edition of the Atlas of Cancer Signaling Network (ACSN), a pathway database and an interactive comprehensive network map of molecular mechanisms deregulated in cancer cells and the tumor microenvironment. The resource includes tools for map navigation, visualization and analysis of molecular data in the context of signaling network maps. Constructing and updating ACSN involves careful manual curation of molecular biology literature and participation of experts in the corresponding fields. The cancer-oriented content of ACSN comprehensively covers the hallmarks of cancer and is completely original.
Description: ACSN (https://acsn.curie.fr) is a web-based resource of multi-scale biological maps depicting molecular processes in cancer cells and the tumor microenvironment. The core of the Atlas is a set of interconnected cancer-related signaling and metabolic network maps. Molecular mechanisms are depicted on the maps at the level of biochemical interactions, forming a large seamless network of more than 8000 reactions covering close to 3000 proteins and 800 genes and based on more than 4500 scientific publications. The Atlas is a “geographic-like” interactive “world map” of molecular interactions leading to the hallmarks of cancer as described by Hanahan and Weinberg. The Atlas is created with the use of systems biology standards and is amenable to computational analysis. ACSN is composed of 13 comprehensive maps of molecular interactions. Six maps cover signalling processes in the cancer cell and four maps describe the tumor microenvironment. In addition, there are three cell type-specific maps describing signaling within different cell types that frequently surround and interact with cancer cells. This feature of ACSN2.0 reflects the complexity of the tumor microenvironment. The maps of ACSN are interconnected, and the regulatory loops within the cancer cell and between the cancer cell and the tumor microenvironment are systematically depicted.
The cross-talk between signaling mechanisms and metabolic processes in cancer cells is explicitly depicted thanks to a new feature of the Atlas: ACSN is now connected to the RECON metabolic network, the largest graphical representation of human metabolism. The maps of ACSN are organized in a hierarchical manner. They are decomposed into functional modules with meaningful network layout. Navigation of ACSN2.0 is intuitive thanks to the Google Maps-like features of the NaviCell web platform. Exploration of the Atlas is simplified by a semantic zooming feature, which allows the user to visualise the seamless Atlas and individual maps from the ACSN collection at different levels of detail.
Cross-referencing with other databases and links to the scientific papers that were used for creating the Atlas allow the user to study in depth the knowledge represented in ACSN2.0. The ACSN content is continually expanded with new signaling network maps and updated with the latest discoveries in the field of cancer-related cell signaling. In addition, ACSN is not only a cancer-oriented database: ACSN maps depict fundamental cell signaling processes and can be used in many domains of molecular biology.
The NaviCell web-based data analysis toolbox integrated in ACSN allows importing and visualizing heterogeneous omics data on top of the ACSN maps and performing standard functional analysis. In addition, the NaviCom tool associated with the ACSN2.0 environment can automatically generate ACSN-based molecular portraits of different cancer types using multi-level omics data from large-scale cancer omics data resources such as cBioPortal.
An additional breakthrough is that the signaling mechanisms represented in the ACSN2.0 maps are cross-linked to the metabolic processes depicted in the RECON map from the Virtual Metabolic Human initiative (Noronha et al, 2018). This allows studying the involvement of metabolism in cancer development and understanding the regulation of metabolic circuits by cell signaling mechanisms in cancer and vice versa.
To our knowledge, this is the only resource in the field that systematically gathers together unique and up-to-date information about mechanisms deregulated in cancer cells and the tumor microenvironment. The entire content of ACSN2.0 is available at the website and downloadable in various formats. Essential scripts and documentation regarding the map construction process, the integration of maps into the NaviCell platform and the NaviCell visualisation tool are provided through a GitHub repository and are accessible from the ACSN website, which makes ACSN compliant with FAIR principles.
Comprehensive map of regulated cell death signalling network: a powerful analytical tool for studying diseases
Authors: Inna Kuperstein
Abstract: The cell death process draws special attention, due to frequent perturbations of its machinery in various diseases. It can be described as three highly interconnected steps: initiation, signalling and execution of death signals. The different regulated cell death modes are coordinated by common inputs and the interplay between the cell death mechanisms is complex. There is often a combination of cues, such as cell energy status, external signals and intracellular damage state, that together can lead either to cell survival or to cell death scenarios. All the regulations are dynamic: most activating and inhibiting actions can be reversed until the late execution phases are triggered, which are irreversible. The existing linear and disconnected representations of regulated cell death mechanisms are far from satisfactory. In this work we present the first attempt to gather together all available information about various regulated cell death mechanisms and to represent it in a structured and computer-readable manner. The comprehensive map of regulated cell death covers the initiating phases, the signalling phase where the mode of cell death is chosen and the execution phase resulting in cell elimination.
We used a systems biology approach to gather information about all known modes of regulated cell death. Based on the experimental data retrieved from literature by manual curation, we graphically represent the biological processes in the form of a seamless comprehensive signalling network map of regulated cell death. The molecular mechanism of each regulated cell death mode is represented in detail. The Regulated Cell Death (RCD) map is divided into 27 functional modules that can be visualized contextually in the whole map, or as individual diagrams. The map contains more than 1200 proteins and genes and 2020 biochemical reactions, and is based on 600 scientific papers. The resource is open source and accessible via several web platforms for map navigation, data integration and analysis. The RCD map was used for functional interpretation of the differences in cell death regulation between Alzheimer’s disease and lung cancer using expression data, which highlighted mechanisms responsible for the inverse comorbidity between the two disorders. In addition, integration and analysis of expression and genomic data from ovarian cancer provided distinctive signatures of four major sub-groups in this disease.
The RCD resource is topic-specific and covers all known modes of cell death and their crosstalk. The thoughtful layout and visual organisation of the biological knowledge on the map make it a distinctive resource for data analysis and interpretation.
Systematic Discovery of the Functional Impact of Somatic Genome Alterations in Individual Tumors through Tumor-specific Causal Inference
Authors: Chunhui Cai, Gregory Cooper, Kevin Lu, Xiaojun Ma, Shuping Xu, Zhenlong Zhao, Xueer Chen, Yifan Xue, Adrian Lee, Nathan Clark, Vicky Chen, Songjian Lu, Lujia Chen, Liyue Yu, Harry Hochheiser, Xia Jiang, Jane Wang and Xinghua Lu
Abstract: Cancer is mainly caused by somatic genome alterations (SGAs). Precision oncology involves identifying and targeting tumor-specific aberrations resulting from causative SGAs. We report a tumor-specific causal inference (TCI) framework, which estimates causative SGAs by modeling causal relationships between SGAs and molecular phenotypes (e.g., transcriptomic, proteomic, or metabolomic changes) within an individual tumor. We applied the TCI algorithm to tumors from The Cancer Genome Atlas (TCGA) and estimated for each tumor the SGAs that causally regulate the differentially expressed genes (DEGs) in that tumor. Overall, TCI identified 634 SGAs that are predicted to cause cancer-related DEGs in a significant number of tumors, including most of the previously known drivers and many novel candidate cancer drivers. The inferred causal relationships are statistically robust and biologically sensible, and multiple lines of experimental evidence support the predicted functional impact of both the well-known and the novel candidate drivers that are predicted by TCI. TCI provides a unified framework that integrates multiple types of SGAs and molecular phenotypes to estimate which genome perturbations are causally influencing one or more molecular/cellular phenotypes in an individual tumor. By identifying major candidate drivers and revealing their functional impact in an individual tumor, TCI sheds light on the disease mechanisms of that tumor, which can serve to advance our basic knowledge of cancer biology and to support precision oncology that provides tailored treatment of individual tumors.
A Topology-Aware Edit Distance Measure For Cancer Evolutionary Trees
Authors: Yangqiaoyu Zhou and Layla Oesper
Abstract: According to the clonal evolution theory of cancer, the evolutionary history of a tumor can be described as a type of tree. These trees are directed graphs with vertices that represent tumor populations and are labelled by a set of mutations; each label indicates when a mutation first arose and is inherited by all descending populations. There are a number of important applications that require distance measures between these trees. For example, one recent study uses such distance measures to find a consensus tree among multiple possible trees. However, existing distance measures for cancer evolutionary trees are limited as they don’t explicitly consider differences between the tree topologies, and some existing distance measures don’t fully consider ancestral/clonal relationships between mutations.
In this work, we modify an edit-based distance (SuMoTED) for single-labelled trees proposed by McVicar et al. We create a novel normalization scheme for SuMoTED in contrast to the original normalization. Since the edit-based distance focuses only on ancestor-descendant relationships, we extend this measure by combining it with a shape measure we proposed called EDBush that allows for measurement of different cancer evolution models. We also extend the distance measure to work on multi-labelled trees.
We compare our normalization scheme with the original SuMoTED normalization on simulated data. We find that our normalization preserves the relative distances between trees, while McVicar et al.’s skews the distances for trees with different branching factors. We apply our method on single-labelled simulated data, and we discover that our distance metric outperforms SuMoTED alone or EDBush alone. We also apply our distance on multi-labelled trees created using different tree inference methods for a triple negative breast cancer dataset and an acute lymphoblastic leukemia dataset. We find that the results align with our intuition on which trees are more similar.
Accurate Quantification of Copy-number Aberrations and Whole-genome Duplications in Multi-sample Tumor Sequencing Data
Authors: Simone Zaccaria and Ben Raphael
Abstract: Copy-number aberrations (CNAs) and whole-genome duplications (WGDs) are frequent somatic mutations in cancer. Accurate quantification of these mutations from DNA sequencing of bulk tumor samples is complicated by tumor purity, admixture of multiple tumor clones with distinct mutations, and high aneuploidy. Standard methods for CNA inference analyze tumor samples individually, but recently DNA sequencing of multiple samples from a cancer patient – e.g. from multiple regions of a primary tumor, matched primary/metastases, or multiple time points – has become common.
We introduce Holistic Allele-specific Tumor Copy-number Heterogeneity (HATCHet), an algorithm that infers allele- and clone-specific CNAs and WGDs jointly across multiple tumor samples (e.g. multiple regions or time points) from the same patient. HATCHet provides a fresh perspective on CNA inference and includes several algorithmic innovations that overcome limitations of existing methods. On 49 samples from 10 prostate cancer patients, HATCHet identifies subclonal CNAs in only 29 samples, while explaining the data better than published analyses, which report subclonal CNAs in all samples. In contrast, on 35 samples from 4 pancreas cancer patients, HATCHet identifies subclonal CNAs and WGDs that were missed in published analyses. HATCHet’s inferred CNAs are also more consistent with the reports of polyclonal origin of metastasis in a subset of patients and with the somatic SNVs identified across all patients. HATCHet substantially improves the analysis of CNAs and WGDs, leading to more reliable studies of tumor evolution in primary tumors and metastases.
Effective clustering for single cell sequencing cancer data
Authors: Simone Ciccolella, Murray Patterson, Paola Bonizzoni and Gianluca Della Vedova
Abstract: Single cell sequencing (SCS) technologies provide a level of resolution that makes them well suited to inferring, from a sequenced tumor, evolutionary trees or phylogenies representing the accumulation of cancerous mutations. A drawback of SCS is its elevated false negative and missing value rates, which result in a large space of possible solutions and in turn make some approaches and tools infeasible. While this has not inhibited the development of methods for inferring phylogenies from SCS data, the continuing increase in the size and resolution of these data is beginning to put a strain on such methods.
One possible solution is to reduce the size of an SCS instance — usually represented as a matrix of presence, absence and missing values of the mutations found in the different sequenced cells, possibly along with additional information such as copy number, etc. This line of attack makes sense, since the hundreds of tumor cells sequenced typically originate from a few dozen clones — hence a method which reliably groups mutations (or cells) that originate from the same clone can be very effective. Previous approaches have used k-means to this end, clustering groups of mutations and/or cells, and using these means as the reduced instance. Such an approach typically uses the Euclidean distance for computing these means. However, since the values in these matrices are of a categorical nature (presence, absence, number of copies, etc.), we explore techniques for clustering categorical data — which are commonly used in data mining and machine learning — to SCS data, with this same goal in mind.
In this work, we explore such categorical clustering techniques, and show on a study of simulated cancer phylogenies that the k-modes technique is the most effective strategy, when coupled with our novel dissimilarity measure — which we term the conflict dissimilarity measure — for computing the resulting modes, or centroids. We demonstrate this by showing that k-modes clusters mutations with high precision: never pairing too many mutations which are unrelated in the ground truth, but also obtains accurate results in terms of the tree inferred downstream from a tumor phylogeny inference method, when applied to the resulting reduced instance produced by k-modes. Finally, we apply the entire pipeline (clustering + inference method) to a real dataset which was previously too large for the inference method alone, showing that our clustering procedure is effective in reducing the running time, hence raising the bar considerably on the instance size that can be solved.
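The idea of k-modes clustering with a conflict-style dissimilarity can be sketched as follows. The abstract does not define the conflict dissimilarity, so this sketch assumes it counts positions where both values are known and differ, with missing entries never counting as a conflict. The function names, the use of `None` for missing values and the fixed initial modes are all illustrative choices, not the Celluloid implementation.

```python
from collections import Counter

MISSING = None  # placeholder for a missing observation


def conflict(a, b):
    """Number of positions where both values are known and disagree (assumed definition)."""
    return sum(1 for x, y in zip(a, b)
               if x is not MISSING and y is not MISSING and x != y)


def column_mode(rows, col):
    """Most frequent known value in a column, or MISSING if none is known."""
    known = [r[col] for r in rows if r[col] is not MISSING]
    return Counter(known).most_common(1)[0][0] if known else MISSING


def k_modes(rows, initial_modes, n_iter=10):
    """Alternate between assigning rows to the nearest mode and updating modes."""
    modes = [list(m) for m in initial_modes]
    labels = []
    for _ in range(n_iter):
        labels = [min(range(len(modes)), key=lambda k: conflict(row, modes[k]))
                  for row in rows]
        new_modes = []
        for k in range(len(modes)):
            members = [row for row, lab in zip(rows, labels) if lab == k]
            if members:
                new_modes.append([column_mode(members, c)
                                  for c in range(len(rows[0]))])
            else:
                new_modes.append(modes[k])  # keep the mode of an empty cluster
        if new_modes == modes:
            break
        modes = new_modes
    return labels, modes


# Toy matrix: rows are mutations, columns are cells (1 present, 0 absent).
rows = [[1, 1, 0, 0], [1, MISSING, 0, 0], [0, 0, 1, 1], [0, 0, MISSING, 1]]
labels, modes = k_modes(rows, initial_modes=[rows[0], rows[2]])
```

The resulting modes would serve as the reduced instance passed to a downstream phylogeny inference method.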
Availability: Our approach, Celluloid: clustering single cell sequencing data around centroids, which uses k-modes along with our conflict dissimilarity measure, is available at https://github.com/AlgoLab/celluloid/ under an MIT license.
A full version of this article can be found on BioRxiv at https://doi.org/10.1101/586545.
Descendant Cell Fraction: Copy-aware Inference of Clonal Composition and Evolution in Cancer
Authors: Mohammed El-Kebir, Simone Zaccaria and Ben Raphael
Abstract: A tumor results from an evolutionary process where somatic mutations accumulate in a population of cells. This process gives rise to a tumor that is a mixture of distinct clones, distinguished by somatic mutations including single-nucleotide variants (SNVs), copy-number aberrations (CNAs), and other changes. The standard approach to identify such clones is to cluster SNVs that have similar cancer cell fractions (CCFs), defined as the proportion of tumor cells harboring the mutation. The key assumption of this approach is that SNVs with similar CCFs have occurred on the same phylogenetic branch. There are, however, two key deficiencies: (1) the CCF cannot be unambiguously inferred from DNA sequencing data; (2) the CCF does not account for loss of mutations, which is common in tumors with CNAs. Thus, the standard approach might lead to incorrect reconstructions of tumor clonal architectures, which in turn might lead to incorrect conclusions in downstream analyses.
Here, we define a novel quantity, the descendant cell fraction (DCF), that addresses these deficiencies in a rigorous manner, providing a summary statistic for both the prevalence and the evolutionary history of an SNV. That is, SNVs with the same DCF are likely to have occurred on the same branch of the phylogenetic tree describing the evolution of the tumor. We introduce DeCiFer, an algorithm to simultaneously infer evolutionary histories of individual SNVs and cluster SNVs by their corresponding DCFs under the principle of parsimony. Underpinning DeCiFer is an elegant embedding of the high-dimensional space of evolutionary histories of SNVs onto the low-dimensional DCF space. On simulated data, we show that DeCiFer more accurately clusters SNVs than existing methods and infers evolutionary histories with high recall. On a metastatic prostate cancer dataset, we show that DeCiFer’s use of the DCF to cluster SNVs results in more parsimonious evolutionary and migration histories of these metastatic cancers. Thus, DeCiFer enables more accurate quantification of intra-tumor heterogeneity and improves inference of tumor evolution.
Ab initio Spillover Compensation in CyTOF Data
Authors: Qi Miao, Fang Wang, Rafet Basar, Muharrem Muftuoglu, Li Li, Katy Rezvani and Ken Chen
Abstract: The mass cytometry (CyTOF) technology, which employs metal-isotope-tagged monoclonal antibodies to measure the expressions of the surface proteomic markers and/or intracellular signaling molecules in single cells, has been widely used to characterize cellular heterogeneity in tumor samples. The technology is particularly important to study immune responses to cancer treatment, as it can measure up to 40 parameters (channels) from around 100,000 cells in a single assay. Due to technological limitations, the intensity measured in a channel can be affected by neighboring channels. Although generally minor, such spillover effects can substantially limit the accuracy of cell type identification and lineage tracing. It is possible to alleviate the effects by selecting high purity isotopes and redesigning the metal isotopes. However, that is often complicated, time-consuming and costly. The Catalyst approach (PMID:29605184) can reduce spillover, but it requires the use of additional control beads, which increases the cost.
Here, we present a novel computational method that can autonomously compensate the spillover effects in a CyTOF dataset without using any control beads. Our method utilizes knowledge-guided modeling and statistical algorithms to infer the optimal compensation matrix and perform correction.
In CyTOF data, spillover usually results from three sources: signals in neighboring channels, isotope impurity resulting from unspecific metal-antibody conjugation, and oxidized metals from other channels. When a channel is affected, its intensity distribution is often bimodal, with the lower intensity component corresponding to the spillover noise. We define the noise components as Y (an N×M matrix) and the signals as X (an N×M matrix), where N is the number of cells and M is the number of channels. Our goal is to identify a compensation matrix A (M×M) that best predicts Y from X: Y ≈ XA. The computational challenge is that Y and X are convoluted in the data and need to be deconvoluted from the admixed intensity measures. We hypothesize that it is possible to perform such deconvolution based on factor analysis of channel density distributions and on prior knowledge about the error sources and the biology of the cells.
To perform deconvolution, we first used Hartigans’ dip test to identify channels having bimodal intensity distributions. For those channels, we used a maximum likelihood approach to deconvolute the data into a Gaussian (noise) and a log-Gaussian (signal) component. For the rest of the channels, we selected an empirical cutoff (first quantile). Given X and Y, we then used sequential quadratic programming to estimate the compensation matrix A. Multiple constraints were introduced to regularize the optimization: 1) negative intensity values should not be observed in CyTOF data, 2) spillover tends to affect neighboring channels and 3) oxidation should affect particular channels but not others (due to predictable changes in metal mass). Finally, we performed correction by subtracting the noise estimated by the model. We examined our method using three datasets, obtained respectively from C57BL/6J mouse bone marrow, healthy human bone marrow, and chronic lymphocytic leukemia patient samples. In the C57BL/6J mouse bone marrow and chronic lymphocytic leukemia patient data, our method led to improved clustering as judged against manual gating results, with adjusted Rand indices (ARIs) increasing from 0.786 to 0.903 and from 0.343 to 0.370, respectively. However, in the human bone marrow data, the ARI was lowered from 0.847 to 0.770. Upon careful examination using t-SNE and PhenoGraph, we found that our method discovered novel, meaningful clusters designated by cluster-specific CD7 and CD20 expression, which explained the drop in the ARI.
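The estimation of A from Y ≈ XA can be illustrated with a small constrained least-squares sketch. Note the substitutions: instead of sequential quadratic programming, this sketch uses plain projected gradient descent, and it encodes only two simple constraints, non-negative entries and an `allowed` mask of channel pairs (by default every off-diagonal pair), as stand-ins for the constraints listed above. The function name, the mask and the toy data are all hypothetical.

```python
import random


def estimate_compensation(X, Y, allowed=None, n_steps=2000):
    """Minimise ||Y - X A||^2 over A >= 0, with A[i][j] = 0 where not allowed."""
    n_cells, n_ch = len(X), len(X[0])
    if allowed is None:
        # Assumption: a channel does not spill into itself.
        allowed = [[i != j for j in range(n_ch)] for i in range(n_ch)]
    # Precompute G = X^T X and C = X^T Y.
    G = [[sum(X[r][i] * X[r][j] for r in range(n_cells)) for j in range(n_ch)]
         for i in range(n_ch)]
    C = [[sum(X[r][i] * Y[r][j] for r in range(n_cells)) for j in range(n_ch)]
         for i in range(n_ch)]
    # trace(G) bounds the largest eigenvalue of G, so this step size is safe.
    step = 1.0 / sum(G[i][i] for i in range(n_ch))
    A = [[0.0] * n_ch for _ in range(n_ch)]
    for _ in range(n_steps):
        # Gradient of 0.5 * ||Y - X A||^2 with respect to A is G A - C.
        grad = [[sum(G[i][k] * A[k][j] for k in range(n_ch)) - C[i][j]
                 for j in range(n_ch)] for i in range(n_ch)]
        for i in range(n_ch):
            for j in range(n_ch):
                if allowed[i][j]:
                    A[i][j] = max(0.0, A[i][j] - step * grad[i][j])  # project onto A >= 0
                else:
                    A[i][j] = 0.0
    return A


# Synthetic check: simulate signals X, a known A, and noise Y = X A.
rng = random.Random(1)
X = [[rng.uniform(0, 10) for _ in range(3)] for _ in range(60)]
A_true = [[0.0, 0.05, 0.0], [0.02, 0.0, 0.03], [0.0, 0.04, 0.0]]
Y = [[sum(x[i] * A_true[i][j] for i in range(3)) for j in range(3)] for x in X]
A_hat = estimate_compensation(X, Y)
```

On this noise-free toy data the estimate should approach `A_true`; with real data, X and Y would first have to be separated by the deconvolution step described above.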
In summary, we developed a new method that could alleviate spillover effects in CyTOF data without relying on additional control data. Our method is implemented in R and can be widely distributed. We expect that our method will significantly benefit the cancer research community in studying tumor microenvironment and developing novel immunotherapy.
Toward understanding mutagenesis in transcription factor binding sites
Authors: Harshit Sahay, Ariel Afek and Raluca Gordan
Abstract: The ability to sequence thousands of whole cancer genomes has led to the finding that the vast majority of somatic variants in cancer occur in non-coding regions, including in genomic sites bound by transcription factor (TF) proteins, which regulate gene expression. Finding regulatory non-coding drivers (i.e. drivers that occur in TF binding sites) is an important line of research, and has been the focus of several recent studies. However, only a few recurrently mutated regulatory regions show significant association with gene expression, and the number of putative regulatory drivers identified remains much smaller than the number of coding drivers. This suggests that a large number of regulatory somatic mutations may be recurrent due to underlying mutational processes rather than selection in cancer cells. Specifically, the high enrichment of cancer somatic mutations in TF binding sites is currently thought to come from TFs bound to DNA lesions, acting as roadblocks for DNA repair and replication. However, it is unclear how TFs interact with lesioned DNA, and if this binding is indeed strong enough to outcompete repair enzymes for recognition of the lesions. DNA mismatches (i.e. non-complementary base pairs) are an example of such lesions. Mismatches act as precursors to mutations, and can arise from errors during DNA replication, homologous recombination or frequent spontaneous DNA deamination. By changing both the sequence and the structure of the DNA, mismatches are likely to impact TF binding in ways that are not currently understood.
We developed Saturation Mismatch Binding Assay (SaMBA), the first high-throughput assay to characterize binding of TFs to mismatched DNA. By applying this assay to 21 different TFs across 14 different structural families, we observe a complex landscape of TF binding to mismatched DNA. For all TFs tested, we found mismatches that increase binding compared to the wild type sequences. About 30% of mismatches in TF binding sites preserve or increase the binding affinity, and the base pair mutations corresponding to these mismatches do not have the same effects. Thus, sequence alone cannot capture the effects of mismatches, and existing TF-DNA binding models that rely on Watson-Crick sequences cannot predict the binding of TFs to mismatched DNA. We propose expanding current binding models to include the 16-letter alphabet of all possible matched and mismatched base pairs, instead of the 4-letter alphabet that only accounts for Watson-Crick base pairs. We employ k-mer support vector regression (SVR) models that include all 1-mer and 2-mer features over this 16-letter alphabet, and achieve high prediction accuracy on held-out mismatch binding data (R² = 0.89). Our new models shed light on the biophysical mechanism of TF-DNA binding, and can be used to generalize the mismatch effects to the entire genome.
We also tested the hypothesis that TF binding to mismatched DNA can lead to an enrichment of mutations in TF binding sites. For this, we focused on a T-G mismatch in the c-Myc binding site CACGCG (generated by changing the third C to a T). The mismatch increases c-Myc binding by 30-fold, leading to a binding affinity that is stronger than that of the canonical sequences recognized by the TF. Such strong-affinity c-Myc binding could interfere with repair, which would eventually lead to an enrichment of CACGCG > CACGTG mutations in cancer genomes. We tested this hypothesis in two sets of cancer genomes: those without micro-satellite instability (MSS, or micro-satellite stable), and those with micro-satellite instability (MSI). MSI tumors are characterized by deficiencies in DNA mismatch repair mechanisms. Our hypothesis would suggest an enrichment for the aforementioned mutation in genomes with functional repair (MSS) due to competition for the mismatched site between c-Myc and repair enzymes. However, for MSI tumors with deficient repair, we expect this enrichment to be reduced or absent, since mismatches are poorly repaired regardless of their TF binding status. Our results support this hypothesis, as we see a strong enrichment of CACGCG > CACGTG mutations (compared to NNCGCG > NNCGTG) in MSS samples (p < 10^-22, Fisher’s exact test) but no enrichment in MSI tumors (p = 0.44).
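To make the 16-letter alphabet concrete, the sketch below encodes a DNA duplex as a sequence of (top base, opposite base) pairs and derives 1-mer and 2-mer features from it. It is an illustration under assumptions: the features are taken to be position-specific, the bottom strand is written so that character i is the base opposite top-strand position i, and the names `encode_duplex` and `mismatch_positions` are hypothetical. The resulting sparse features could be fed to a regression model such as an SVR.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# All 16 ordered (top, bottom) pairs: 4 Watson-Crick pairs and 12 mismatches.
ALPHABET = [top + bottom for top in BASES for bottom in BASES]


def mismatch_positions(top, bottom):
    """Positions where the opposite base is not the Watson-Crick complement."""
    return [i for i, (t, b) in enumerate(zip(top, bottom)) if COMPLEMENT[t] != b]


def encode_duplex(top, bottom):
    """Sparse position-specific 1-mer and 2-mer indicators over the 16-letter alphabet."""
    if len(top) != len(bottom):
        raise ValueError("strands must have equal length")
    letters = [t + b for t, b in zip(top, bottom)]
    features = {}
    for pos, letter in enumerate(letters):
        features[("1mer", pos, letter)] = 1
    for pos in range(len(letters) - 1):
        features[("2mer", pos, letters[pos] + letters[pos + 1])] = 1
    return features


# Top strand CACGTG paired against the complement of CACGCG gives a T-G mismatch.
top = "CACGTG"
bottom = "GTGCGC"
features = encode_duplex(top, bottom)
```

With this encoding, a mismatch and its corresponding base pair mutation map to different letters, so a model can assign them different effects.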
Our results suggest that interactions between mismatches (and more generally, damaged DNA) and transcription factors, which might act as roadblocks to repair, could play an important role in mutagenesis. Since these mutations arise from endogenous cellular processes, and are not generally subject to selection, characterizing them is of vital importance for identifying non-coding drivers. Thus, our new models of TF binding to mismatched DNA are critical for understanding mutagenesis in regulatory genomic regions.
Latent periodic process inference from single-cell RNA-seq data
Authors: Shaoheng Liang, Fang Wang, Jincheng Han and Ken Chen
Abstract: Development of cancer is underlain by complex, convoluted biological processes. Although single-cell RNA sequencing (scRNA-seq) provides an unprecedented opportunity to uncover latent biological processes, accurate characterization of these processes is often challenging because of their stochastic, dynamic and multifactorial nature. This is particularly the case for periodic processes such as the cell cycle. Distinct from processes that occur linearly across time, the cell cycle is periodic. It starts at the G0 phase, goes through G1, S and G2/M and returns to G0 in around 24 hours for human cells. This process is elegantly orchestrated by variable sets of genes (e.g., cyclins and cyclin-dependent kinases) that are turned on and off with fairly precise timing. As a result of such periodicity, cycling cells at different transcriptomic states form a circular, non-linear trajectory in high-dimensional gene expression space. The position of a cell along the circular trajectory indicates its timing (pseudo-time) within the cell cycle.
A variety of analytical methods have been developed to characterize the cell cycle from scRNA-seq data but are limited in various ways. Approaches based on linear representations such as principal component analysis (PCA), ccRemover, Cyclone, scLVM and Seurat are suboptimal at representing circular trajectories. Recent approaches such as reCAT can distill non-linear trajectories but require large, uniformly-sampled data. Most of these methods depend on prior marker genes, which may be biased or incomplete for studying malignant cells with mutated cell-cycling pathways.
To address these limitations, we developed a new method, Cyclum, which uses an autoencoder with sinusoidal activation functions to capture the circular trajectory formed by a periodic process in the high-dimensional gene expression space, without relying on prior gene sets. Specifically, this method adopts a double-layer perceptron as an encoder to compress the gene expression of each cell into one dimension, representing cell-cycling pseudo-time. The pseudo-time is then fed into the decoder, in which it is nonlinearly transformed into two dimensions using sine and cosine functions, and then linearly mapped back to the gene expression space. Optimizing the encoder and decoder simultaneously leads to an optimal inference of the cell pseudo-times. Conceptually, Cyclum identifies an optimal (least-squares) circular manifold embedded in the expression space and unfolds it onto a linear space to obtain pseudo-times. It infers genes whose expression dynamics match the inferred periodicity and are thus related to the periodic process. Consequently, Cyclum can be applied to decompose the confounded processes and remove the effects of cell cycling from the scRNA-seq data.
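The decoder described above can be sketched in a few lines of Python. This is only an illustration of the sine/cosine decoding step and the least-squares objective, with hypothetical function names and toy weights; the encoder and the training loop are omitted, and it is not the Cyclum implementation.

```python
import math


def decode(pseudotime, weights):
    """Map a scalar pseudo-time to expression space.

    weights: one (a_g, b_g) pair per gene. Gene g is reconstructed as
    a_g * cos(t) + b_g * sin(t), a linear map of the (cos t, sin t) pair.
    """
    c, s = math.cos(pseudotime), math.sin(pseudotime)
    return [a * c + b * s for a, b in weights]


def reconstruction_loss(expression, pseudotimes, weights):
    """Mean squared error between observed and decoded expression."""
    total, count = 0.0, 0
    for observed_cell, t in zip(expression, pseudotimes):
        for observed, predicted in zip(observed_cell, decode(t, weights)):
            total += (observed - predicted) ** 2
            count += 1
    return total / count


# Toy example: two genes peaking a quarter cycle apart (made-up weights).
toy_weights = [(1.0, 0.0), (0.0, 1.0)]
toy_cells = [decode(0.0, toy_weights), decode(math.pi / 2, toy_weights)]
```

Because the decoder depends on the pseudo-time only through its sine and cosine, `decode(t, w)` and `decode(t + 2 * math.pi, w)` coincide, which is what makes the fitted manifold circular. Training would adjust the weights and the encoder so that `reconstruction_loss` is minimized.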
Experiments using the scRNA-seq data from a set of proliferating cell-lines and mouse embryonic stem cells (mESCs) show that Cyclum reconstructed experimentally labeled cell-cycling stages and rediscovered known cell-cycling genes at significantly higher accuracy than PCA, reCAT and Cyclone. Downsampling experiments show that Cyclum is more robust to small sample sizes and experimental noise. Applying Cyclum to remove confounding cell-cycling effects recovered obscured clonal architecture in virtual tumor data, which were generated synthetically by perturbing the expression of randomly selected genes in the isogenic mESC data. Under almost all conditions, the accuracy and robustness of Cyclum exceed those of comparable approaches such as PCA, ccRemover and Seurat.
We further applied Cyclum to scRNA-seq data derived from melanoma patients. Cyclum produced accurate cell-cycle pseudo-time inference, as supported by Gene Set Enrichment Analysis (GSEA). Removing the cell-cycling effects revealed drug-resistant (MITF-low and AXL-high) cell subpopulations in pretreatment samples and novel cell-cycle genes such as KCNQ1OT1 and FBLIM1, which are implicated in the literature. Thus, Cyclum can be applied as a generic tool for characterizing periodic processes underlying cellular development/differentiation from scRNA-seq data and will particularly benefit cancer studies. The source code of Cyclum is freely available for academic use at https://github.com/KChen-lab/cyclum.
Methyl-sensitive transcription factor motifs in leukemia cell populations
Authors: Coby Viner, James Johnson, Laura García-Prat, Charles A. Ishak, Nicolas Walker, Hui Shi, Marcela Sjöberg-Herrera, Shu Yi Shen, David J. Adams, John E. Dick, Anne C. Ferguson-Smith, Daniel D. De Carvalho, Timothy L. Bailey and Michael M. Hoffman
Abstract: Many transcription factors initiate transcription only in specific contexts, providing the means for sequence specificity of transcriptional control. A four-letter DNA alphabet only partially describes the nucleobase diversity a transcription factor might encounter. Genomic cytosine is often modified to 5-methylcytosine (5mC) or 5-hydroxymethylcytosine (5hmC). Modification-sensitive transcription factors provide a mechanism for widespread changes in (hydroxy)methylation—downstream effectors of gene expression programs. In particular, acute myeloid leukemia (AML) and glioblastoma multiforme often have dysregulated epigenetic cytosine states, caused by mutations in DNMTs, TET2, or IDH1/2. AML is maintained by a population of distinct leukemia stem cells, which express particular markers that can be isolated via flow cytometry. Accurate prediction of modification sensitivity, across distinct populations of cancer cells, is crucial for understanding and deconvolving gene–regulatory effects.
We developed methods to discover motifs and identify transcription factor binding sites (TFBSs) in DNA with covalent modifications. Our models expand the standard A/C/G/T alphabet, adding m (5mC) and h (5hmC). We adapted the position weight matrix (PWM) model of TFBS affinity to an expanded alphabet. We engineered several tools to work with expanded-alphabet sequences and PWMs. First, we developed a program, Cytomod, to create a modified sequence, using data from bisulfite and oxidative bisulfite sequencing experiments. Second, we enhanced the MEME Suite to support arbitrary alphabets, including modification-sensitive motifs. In particular, a new version of CentriMo enables central motif enrichment analysis to infer direct DNA binding in an expanded-alphabet context. Third, we added support for our alphabet to the recently-developed RSAT matrix-clustering software, enabling clustering of modified PWMs. These versions permit users to specify new alphabets, anticipating future alphabet expansions, including analyses of the downstream oxidized covalent modifications of 5hmC.
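To make the expanded-alphabet PWM idea concrete, here is a minimal Python sketch that scores sites over the six symbols A, C, G, T, m and h. The matrix values, the uniform background, and the function names are invented for illustration; this is not the MEME Suite or Cytomod interface.

```python
import math

ALPHABET = "ACGTmh"  # m = 5mC, h = 5hmC, as in the text


def log_odds_score(pwm, site, background=None):
    """Sum of per-position log2(p / background) for `site` under `pwm`.

    pwm: list of dicts, one per motif position, mapping every symbol in
    ALPHABET to a probability. A uniform background is assumed by default.
    """
    if len(site) != len(pwm):
        raise ValueError("site and PWM lengths differ")
    if background is None:
        background = {sym: 1.0 / len(ALPHABET) for sym in ALPHABET}
    return sum(math.log2(column[sym] / background[sym])
               for column, sym in zip(pwm, site))


def best_site(pwm, sequence):
    """Return (score, offset) of the highest-scoring window in `sequence`."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max((log_odds_score(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


# Hypothetical two-column motif that prefers a methylated C followed by G.
toy_pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.1, "m": 0.5, "h": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.5, "T": 0.1, "m": 0.1, "h": 0.1},
]
```

With this toy matrix, the window `"mG"` scores higher than `"CG"`, so scanning a modified sequence such as `"ACGTmGA"` picks the methylated site. A four-letter model would treat the two sites identically.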
We created multiple expanded-alphabet sequences using whole-genome maps of 5mC and 5hmC in naive ex vivo mouse T cells and 5mC in the leukemia cell line K562. In addition to our analyses on K562, we have conducted whole genome bisulfite sequencing for three cell populations of patient-derived AML cell fractions (8227 cells): quiescent (CD34+/CD38-), progenitor (CD34+/CD38+), and senescent or terminal (CD34-) cells. Using these sequences and (organism-matched) ChIP-seq data from ENCODE and others, we identified modification-sensitive cis-regulatory modules. We elucidated known methylation binding preferences, in all cell types, including C/EBPβ’s preference for methylated motifs and c-Myc’s opposite preference. Using these known binding preferences to calibrate parameters, we then discovered novel preferences for 5 transcription factors, as well as numerous new 5mC and 5hmC motifs.
We have also determined specific clustered predictions of unmethylated- vs. methyl-preferring motif groups, finding evidence for small subsets of potentially methyl-preferring motifs in otherwise unmethylated-preferring factors. We have begun in vivo tests of several predictions in K562, using the recently-developed CUT&RUN assay, both with unconverted and bisulfite-converted DNA, to specifically validate our methylated binding motifs and their clusters. Using our methylation datasets from 8227 cell populations, we are applying these methods to discover binding preferences, potentially resulting from leukemia stem cell–specific transcriptomic re-programming.
Tracing tumor cellular evolution through copy number alterations
Authors: Fang Wang, Qihan Wang, Jincheng Han, Shaoheng Liang, Ruli Gao, Li Ding, Nicholas Navin and Ken Chen
Abstract: The accumulation of copy number alterations (CNAs) in an individual tumor leads to the presence of heterogeneous cell populations. CNA mutagenesis provides diverse genomic profiles upon which selection and evolution can act. These CNA records permit the life history of a tumor to be deciphered to infer the chronology, lineage and CNA events during its evolution. Computational and technological advances are needed to gain a deeper understanding of the course and dynamics of this evolution and to determine the tumor’s next move. Recent advances in single-cell sequencing (SCS) technologies provide an unprecedented opportunity to investigate such intricate developmental processes at cellular resolution. However, there is no computational method that encodes structural-biological properties of human genomes and knowledge about CNA mutagenesis in reconstructing cellular evolution from SCS data.
To accurately infer tumor cellular evolution from SCS data, we defined a novel minimal genome evolution distance (MGED) to represent the minimal number of CNA events required to evolve one genome to the next. This metric is biologically more meaningful in measuring the expected time lapse between two genomes than conventional metrics such as Euclidean distance, which are biased by the sizes and types of CNAs. We proved mathematically that identifying MGED is equivalent to inferring a sequence of longest possible CNAs, which can be solved using a greedy algorithm in linear time. We then adopted Edmonds’ algorithm to infer a rooted directed minimal genome evolution tree (RDMGET) from pair-wise MGEDs, which can be executed in polynomial time and is thereby scalable to thousands of unique genomes. To characterize the overall evolution dynamics, we further developed a statistical test that probabilistically categorizes RDMGETs into linear, branching, punctuated, or neutral evolution models.
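The contrast between an event-count distance and a Euclidean one can be illustrated with a simplified Python sketch. It assumes that one CNA event adds or removes exactly one copy over a contiguous run of bins and that copy numbers are otherwise unconstrained. Under that toy model the minimal event count has a closed form, used below. This is not the authors' MGED definition or their greedy algorithm, and the function name is hypothetical.

```python
def event_distance(source, target):
    """Minimal number of unit segment events turning `source` into `target`.

    Simplifying assumption: each event changes the copy number by +1 or -1
    over one contiguous run of bins.
    """
    if len(source) != len(target):
        raise ValueError("profiles must have the same number of bins")
    diff = [t - s for s, t in zip(source, target)]
    padded = [0] + diff + [0]
    # Every event raises the step at one bin boundary by 1 and lowers the
    # step at another by 1, so the positive steps add up to the event count.
    return sum(max(0, padded[i] - padded[i - 1])
               for i in range(1, len(padded)))
```

For example, `event_distance([2, 2, 2, 2], [2, 3, 3, 2])` counts one event, and a gain spanning every bin also counts one event however many bins it covers, whereas the Euclidean distance for that gain grows with segment length. Pairwise distances of this kind could then be handed to a minimum spanning arborescence routine (Edmonds' algorithm) to build a rooted tree.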
We evaluated our methods using SCS data simulated under various CNA mechanisms such as breakage-fusion-bridge and non-homologous end joining, CNA rate and evolution dynamics models. Our method reconstructed lineages and chronology at an accuracy much higher than traditional metrics/approaches. We applied our method to longitudinal single-cell DNA sequencing (scDNA-seq) data obtained from chemoresistant triple-negative breast cancer patients. The cellular chronologies inferred by our methods matched the timing and the histology of the samples. Moreover, we discovered lineages with accelerated CNA burdens, which would have been missed by populational clustering analysis. In some instances, we were able to identify lineage-specific putative functional CNAs affecting known cancer genes. Similar biological insights were gained when we applied our method on single-cell RNA sequencing (scRNA-seq) data obtained from head and neck cancer and multiple myeloma patients, preprocessed using InferCNV that transforms RNA expression to CNA profiles.
Contribution of synthetic lethality to cancer risk and onset time across human tissues
Authors: Nishanth Ulhas Nair, Kuoyuan Cheng, Joo Sang Lee and Eytan Ruppin
Abstract: The tissue-specificity of cancer and cancer risk is a fundamental open research question. Beyond advancing our understanding of carcinogenesis, elucidating the factors underlying cancer risk may also contribute to cancer prevention. Two studies by Tomasetti et al. published recently in Science have shown that the variation in tissue cancer risk can be explained by the number of tissue stem cell divisions occurring over a lifetime. Subsequently, Klutstein et al. showed that abnormal DNA methylation is another important mediator of cancer risk. While cancer risk is likely not determined by a single factor, no other factor has been reported since to account for this fundamental variation.
Here we show that, in addition to stem cell divisions and the levels of abnormal DNA methylation, synthetic lethality (SL) is another strong determinant of cancer risk across human tissues. SL is a well-known type of genetic interaction where cell death occurs under the combined inactivation of two paired SL genes but not either of them alone. Targeting SLs has been recognized as a highly valuable approach for cancer treatment. We hypothesized that since down-regulated SL gene pairs reduce the viability of cancer cells, they may impede the malignant transformation of normal cells, thus modulating cancer risk. Utilizing several recently published large-scale cancer SL networks, we systematically quantified the SL load (defined as the number of down-regulated SL gene pairs) in numerous normal and cancer tissues from the TCGA and GTEx datasets. Our key findings are:
1. SL load is higher in normal tissues vs tumors originating from them, and is higher in early-stage/less proliferative cancers vs more advanced/highly proliferative ones. This supports the notion that higher SL load may indeed impede cancer development.
2. SL load in normal tissues is strongly inversely correlated with their lifetime cancer risk. Importantly, this correlation remains significant after controlling for the number of tissue stem cell divisions or the levels of abnormal DNA methylation in each tissue.
3. In addition to cancer risk, SL load is also a novel predictor of cancer onset age across different normal tissues – higher SL load is associated with later onset of cancers in that tissue.
4. The SLs lost in the transition from healthy to cancer tissues tend to be the functionally stronger ones, indicating that their loss is not random and is functionally selected for.
5. Using an optimization approach, we identify a subset of SLs whose load is very strongly associated with lifetime cancer risk, at a level higher than that obtained with the total SL load or any of the previously reported factors. These SLs are also strongly predictive of cancer onset age across tissues although they were never optimized for that (and lifetime cancer risk and onset time are not correlated with each other).
In summary, our study points to SL load as a novel factor that is strongly associated with cancer risk, extending earlier reports. Furthermore, we show for the first time that SL load predicts cancer onset age across tissues. Taken together, our findings point to the pivotal role of synthetic lethality in mediating cancer development.
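The SL load defined above (the number of down-regulated SL gene pairs) can be written as a short Python function. The rule used here, that a pair counts when both partner genes fall below a fixed expression threshold, is an assumption for illustration, as are the function name and the placeholder gene identifiers; the abstract does not state how down-regulation was called.

```python
def sl_load(expression, sl_pairs, threshold):
    """Count SL pairs whose two genes are both below `threshold`.

    expression: dict mapping gene name to an expression value for one sample.
    sl_pairs:   iterable of (gene_a, gene_b) tuples from an SL network.
    Pairs with a gene missing from `expression` are skipped (assumption).
    """
    load = 0
    for gene_a, gene_b in sl_pairs:
        if gene_a not in expression or gene_b not in expression:
            continue
        if expression[gene_a] < threshold and expression[gene_b] < threshold:
            load += 1
    return load


# Placeholder data, not real genes or measurements.
toy_expression = {"G1": 0.2, "G2": 0.1, "G3": 5.0, "G4": 0.3}
toy_pairs = [("G1", "G2"), ("G1", "G3"), ("G2", "G4"), ("G1", "G9")]
```

Computing this value per sample and averaging within each tissue would give one number per tissue, which could then be compared with lifetime cancer risk or onset age.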
Integrating (phospho)proteomics for improved identification of signal pathway abnormalities in cancer
Authors: Kuoyuan Cheng, Sridhar Hannenhalli and Eytan Ruppin
Abstract: Transcriptomic profiling techniques have become a regular tool in studying the molecular abnormalities in cancers. To facilitate the functional interpretation of their findings, researchers commonly obtain pathway-level summarization of the gene-level data with a wide variety of gene set analysis methods. However, mRNA level can be a poor measure of protein activities, and emerging large-scale multi-omic cancer studies have shown that cellular pathway activities are better reflected by the (phospho)proteome, consistent with the long-established central role of protein phosphorylation in cell signal transduction. We hypothesize that combining (phospho)proteomic with transcriptomic data will allow a better determination of pathway activities, which is critical for identifying the targetable abnormal changes in cancers. In this study, we take advantage of the (phospho)proteome data from the NCI-CPTAC database to combine with the TCGA cancer data in a multi-omic integration approach. We found that transcriptome, proteome and phosphoproteome revealed different yet complementary aspects of the biological changes in cancers. We developed a computational pipeline for pathway analysis using multi-omic data, and showed that it achieved significantly higher accuracy in recovering known pathway activity changes in well-established subtypes of breast cancer, compared to traditional pathway analysis methods using only transcriptomic data. Considering that (phospho)proteomics has yet to be regularly employed in cancer studies and is thus not widely available, we further applied machine learning to predict the multi-omic-based pathway activity levels from gene (mRNA) expression only, which achieved high predictive accuracies for selected central cellular pathways. In summary, this study consolidates the added value of the (phospho)proteome, pointing to the importance of multi-omic approaches.
A multi-omic methodological framework is provided for improved accuracy of inferring signal pathway activities in cancers, which is likely to facilitate the discovery of novel anti-cancer targets.
Network-based approaches elucidate differences within APOBEC and clock-like signatures in breast cancer
Authors: Yoo-Ah Kim, Damian Wójtowicz, Rebecca Sarto Basso, Itay Sason, Welles Robinson, Dorit S. Hochbaum, Mark DM Leiserson, Roded Sharan, Fabio Vandin and Teresa Przytycka
Abstract: Studies of cancer mutations typically focus on identifying cancer driver mutations. However, in addition to the mutations that confer a growth advantage, cancer genomes accumulate a large number of passenger somatic mutations resulting from normal DNA damage and repair processes as well as mutations triggered by carcinogenic exposures or aberrations in DNA maintenance machinery. These mutagenic processes often produce characteristic mutational patterns called mutational signatures. Understanding the etiology of the mutational signatures shaping the landscape of a cancer genome is an important step towards understanding tumorigenesis as the information often provides clues to the nature of the disease. For example, some mutational signatures are linked to specific deficiencies in DNA repair mechanism, which can suggest an efficient cancer treatment strategy. Indeed, the mutational signature caused by homologous recombination repair deficiency (HRD) helped identify patients who can benefit from PARP inhibitor treatment.
In this study, we focus on uncovering the relation of mutational signature strength with other biological properties of cancer patients such as gene expression and/or alterations. Signature strength in a cancer patient can be measured by the number of mutations that are attributed to the given signature and thus can be considered as a continuous phenotype. Considering mutational signatures as cancer phenotypes, we asked two complementary questions: (i) what are the functional pathways whose gene expression levels are associated with a certain mutational signature, and (ii) what are the mutated pathways (if any) that might underlie specific mutational signatures? The correlated gene expression modules can hint at biological processes that are either mutagenic (e.g. DNA damage related to APOBEC enzyme activity) or are activated as a result of a mutagen (e.g. a DNA repair pathway). Similarly, mutated pathways can point to dysregulated DNA repair processes that underlie a given signature.
Applying the two complementary approaches to a breast cancer dataset, we have been able to identify pathways associated with a number of mutational signatures at the expression and/or mutation level, including: (i) differences between the two clock-like signatures (COSMIC signatures 1 and 5) with respect to their associations with the cell cycle, (ii) association of the NER pathway and oxidation processes with the strength of clock-like Signature 5, and (iii) differences in mutated subnetworks associated with the two APOBEC-related signatures (COSMIC signatures 2 and 13). In particular, the two clock-like signatures have previously been found to correlate with the age of patients, yet the strength of correlation differs between the two signatures and varies across cancer types. Our findings suggest that processes other than patient age, potentially including environmental factors, contribute to the rise of each of these signatures. We demonstrate that our findings are consistent with the results from recent studies and provide additional insights that are important for understanding mutagenic processes in cancer and developing anti-cancer drugs.
Analysis of clustered mutations on highly mutated proteins in endometrial cancer
Authors: Amanda Oliphant, Emily Hoskins, Samuel Pugh, Jonathan Jarman, Daniel Cui Zhou, David Adams, Sean Beecroft, Li Ding and Samuel Payne
Abstract: Endometrial cancer, the most common cancer of the female reproductive system, is driven by mutations which are often unique to an individual or shared between a small number of patients in a cohort. This complicates data analysis because it is more difficult to identify the commonalities between patients. Combining molecular data with three-dimensional protein structure allows us to identify ‘hotspots’, or spatially clustered regions of a folded protein which are frequently mutated. We used the program Hotspot3D to locate hotspots of mutation on TP53 and PIK3CA, two of the most commonly mutated proteins in endometrial cancer, in a cohort of 95 women with endometrial carcinoma. Integrating proteomic and phosphoproteomic data, we identified correlations between the presence of a hotspot mutation in TP53 or PIK3CA, protein abundance, and phosphorylation of both the mutated protein and other proteins. Analysis of the cis and trans effects of TP53 mutation revealed several changes correlated only with hotspot mutations, and many changes which occurred regardless of mutation location. In contrast, our analysis of PIK3CA revealed that mutation of this protein had surprisingly little observable impact on the proteome and phosphoproteome of endometrial cancer cells.
Investigation of TP53 revealed the presence of a mutational hotspot in the protein’s DNA-binding domain. Mutations within this hotspot showed effects on the expression of several cancer-related proteins such as AURKA and NOL7, even when mutations outside of the hotspot caused no significant change. Other proteins such as XPO1 and CDK1 were equally affected by hotspot and non-hotspot mutations. These results provide valuable insight into the mechanisms driving the development of endometrial cancer. Mutations within the hotspot decrease p53’s ability to bind DNA, affecting transcriptional targets, while proteins influenced by all mutations most likely interact with p53 on a protein level either directly or indirectly.
Nearly half of the patients in our cohort had mutations in PIK3CA, and many of these mutations were located in a well-known hotspot. In spite of this, we discovered that mutations in PIK3CA had no significant impact on the phosphorylation profile of downstream proteins in its signaling cascade. We broadened our analysis and scanned our entire proteomic and phosphoproteomic datasets for any possible changes correlated with PIK3CA mutation, comparing hotspot mutations against wildtype as well as any mutation against no mutation. No significant results were found, suggesting that high mutation frequency and the presence of a hotspot are not always indicators that a protein plays a central role in cancer development.
Information Theoretic approaches to interrogate T-cell receptor diversity for cancer immunotherapy
Authors: Ashok Sivakumar, Dylan Hirsch, Rohit Bhattacharya, Justin Huang, Collin Tokheim, Valsamo Anagnostou, Victor Velculescu, Simon Wing and Rachel Karchin
Abstract: The T-cell receptor (TCR) repertoire of a patient’s immune system provides a novel data source to assess the fitness of a patient’s immune capability and presents an opportunity for assessing the likelihood of response under various treatment paradigms for disease. This investigation quantifies PD-L1 immune checkpoint blockade response in public melanoma data and a clinical cohort of non-small cell lung cancers via changes in the TCRs of tumor biopsies over the time of treatment. We have developed a flexible and customizable computational suite of tools to analyze biologically significant attributes of the TCR repertoire. Key findings include changes in the mutual information of Variable and Joining gene usage and clonality, differences in TCR length of clones that are more public or shared between individuals, and modulations in clonality of repertoires demonstrating durable clinical benefit. Amino acids composing the highly dominating clones are also compared against known viral clones by HLA type, and modeled using an HMM across the landscape of samples to identify sequence characteristics and features for possible vaccine targets. Finally, all features are integrated into a machine learning pipeline to classify and predict patient response to therapies that specifically target immune cells in conjunction with current clinical features such as predicted neoantigen burden, indicating such synthesis of data is a viable strategy for more personalized immune therapies.
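Two of the repertoire statistics mentioned above can be sketched in plain Python. The mutual information function follows the standard definition applied to (V gene, J gene) pairs. The clonality function uses a common convention, one minus the normalized Shannon entropy of clone frequencies, which may differ from the authors' definition. Function names and the gene labels used in any example are placeholders.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Mutual information in bits between the two members of each pair.

    pairs: iterable of (v_gene, j_gene), one entry per read or clone.
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    v_counts = Counter(v for v, _ in pairs)
    j_counts = Counter(j for _, j in pairs)
    mi = 0.0
    for (v, j), count in joint.items():
        # p(v, j) * log2( p(v, j) / (p(v) * p(j)) ), written with counts.
        mi += (count / n) * math.log2(count * n / (v_counts[v] * j_counts[j]))
    return mi


def clonality(clone_counts):
    """1 - normalized Shannon entropy of clone frequencies (0 = even)."""
    counts = [c for c in clone_counts if c > 0]
    if not counts:
        raise ValueError("at least one clone with a positive count is needed")
    if len(counts) == 1:
        return 1.0  # a single clone is treated as fully clonal
    total = sum(counts)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts)
    return 1.0 - entropy / math.log2(len(counts))
```

When V and J usage are perfectly coupled across two equally frequent combinations, `mutual_information` returns 1 bit, and it returns 0 when usage is independent. `clonality` moves from 0 for a perfectly even repertoire towards 1 as a few clones dominate, so tracking both before and after treatment would capture the kinds of change the abstract describes.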
The Open Custom-Ranked Analysis of Variants Toolkit (OpenCRAVAT): A customizable annotation and prioritization pipeline for genes and variants
Authors: Kymberleigh Pagel, Lily Zheng, Rick Kim, Kyle Moad, Michael Ryan and Rachel Karchin
Abstract: The modern cancer genomics researcher is confronted with literally hundreds of published methods to identify driver genes, mutations, and pathways. Relevant resources include databases of genes and variants, phenotype-genotype relationships, algorithms that score and rank genes, and in silico variant effect prediction tools. Gene and variant prioritization is a multi-factorial problem, leading to the emergence of decision support frameworks which make it easier for users to integrate many resources in an interactive environment. Current analysis frameworks are limited by closed proprietary architectures, access to a restricted set of tools, lack of customizability, web dependencies that expose protected data, and limited scalability. OpenCRAVAT is an open source, scalable decision support framework with an extensive catalog of resources to support cancer driver variant and gene prioritization. It is highly customizable, does not expose protected data over the web and scales to very large datasets. As the term decision support implies, associating variants to disease ultimately relies on manual human expert review and interpretation. Automated systems can make these tasks tractable, by reducing the number of variants and genes to be considered to the most promising few. Interactive environments for visual exploration of results are critical to manual review. To our knowledge, OpenCRAVAT is the first open tool that provides both fast, configurable variant and gene prioritization and an integrated graphical interface that facilitates expert manual review of prioritized results.
ExploSig: Hypothesis-driven Exploration of Mutation Signature Etiology
Authors: Mark Keller, Welles Robinson, Mark Leiserson
Abstract: Mutation signatures provide insight into the mutational processes that have been operative in a tumor. Since the development of the computational methods that enable extraction of mutation signatures from cancer genomes, many signatures have been attributed to known endogenous or exogenous factors. Individual studies have pursued hypotheses probing the suspected environmental or molecular factors underlying individual signatures. As new hypotheses emerge along with novel signatures and large sequencing data sets, it can be difficult and time-consuming to perform analyses that span the iterative tasks of data processing, signature decomposition, and visualization. Here we present a web-based interactive visualization tool called ExploSig (https://explosig.lrgr.io) for analysis of mutation signatures, clinical data, and gene-level data, both within and across samples. No other mutation signature browser tools facilitate interactive visualization of these data types simultaneously, an important feature for exploration of signature etiology. Informed by human-computer interaction principles for information visualization systems, this tool enables users to perform the core tasks that form interactive analyses. We demonstrate that recent findings associating specific mutation signatures with particular clinical variables or alterations of genes can be reproduced using this tool, with no prior technical knowledge. ExploSig creates reproducible interactive workflows that can be saved or shared, confirming hypotheses or reinforcing prior results on additional data sets. The public instance of ExploSig currently contains data from over 10,000 cancer exomes from The Cancer Genome Atlas PanCanAtlas initiative and over 2,500 cancer genomes from the International Cancer Genome Consortium. We anticipate that such an interactive visualization tool will be especially useful in the near future for investigating potential etiologies of those signatures that remain unknown.