Model Assembly and Updates

Cancer types of interest

We start with six cancer types that are particularly relevant due to a combination of frequency of occurrence and lack of adequate treatments. The cancer types we have initially chosen are as follows.

  • Acute Myeloid Leukemia (aml)

  • Breast Carcinoma (brca)

  • Lung Adenocarcinoma (luad)

  • Pancreatic Adenocarcinoma (paad)

  • Prostate Adenocarcinoma (prad)

  • Skin Cutaneous Melanoma (skcm)

Each type is followed by a “code” in parentheses: the identifier under which the corresponding model is organized in the cloud, on AWS S3.

Model availability

EMMAA models may be browsed on the EMMAA Dashboard. For more information, see a tutorial to the dashboard here: EMMAA Dashboard, and the dashboard itself here: For example, the AML model can be accessed directly at

Defining model scope

Each model is initiated with a set of prior entities and mechanisms that take entities as arguments. Search terms to extend each model are derived from the set of entities.

Deriving relevant terms for a given type of cancer

Our goal is to identify a set of relevant entities (proteins/genes, families, complexes, small molecules, biological processes and phenotypes) that can be used to acquire information relevant to a given model. This requires three components:

  • A method to find entities that are specifically relevant to the given cancer type

  • A background knowledge network of interactions between entities

  • A method to identify relevant links and entities on the background knowledge network

These methods, as described in the subsections below, are implemented in the TcgaCancerPrior (emmaa.priors.cancer_prior.TcgaCancerPrior) class.

Finding disease genes

To identify genes that are relevant for a given type of cancer, we turn to The Cancer Genome Atlas (TCGA), a cancer patient genomics data set available via the cBio Portal.

We implemented a client to the cBio Portal which is documented here.

Through this client, we first curate a list of patient studies for the given cancer type. These patient studies are tabulated in emmaa/resources/cancer_studies.json.

Next, we query each study with a list of genes (the entire human genome, in batches) to determine which patients have mutations in which genes. From this, we calculate statistics of mutations per gene across the patient population.
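The per-gene statistics can be illustrated with a minimal sketch. The input format here (a mapping from patient ID to the set of mutated genes) and the function name mutation_stats are assumptions made for illustration, not the format returned by the cBio Portal client.

```python
from collections import Counter


def mutation_stats(patient_mutations):
    """Return per-gene mutated-patient counts and frequencies.

    patient_mutations maps a patient ID to the set of genes mutated
    in that patient (a hypothetical, simplified input format).
    """
    counts = Counter()
    for genes in patient_mutations.values():
        # Count each gene at most once per patient
        counts.update(set(genes))
    n_patients = len(patient_mutations)
    if n_patients == 0:
        return counts, {}
    freqs = {gene: n / n_patients for gene, n in counts.items()}
    return counts, freqs


# Toy data: three patients, one of them without any mutation
patients = {
    'P1': {'TP53', 'KRAS'},
    'P2': {'KRAS'},
    'P3': set(),
}
counts, freqs = mutation_stats(patients)
```

Patients without mutations still count toward the population size, so the frequencies are fractions of all patients in the study.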

Finding relevant entities in a knowledge network

Finding relevant entities requires a prior network that can be supplied as a parameter to TcgaCancerPrior. We use a network derived from processing and assembling the content of the PathwayCommons, SIGNOR, and BEL Large Corpus databases, as well as machine reading all biomedical literature (roughly 32% full text, 68% abstracts) with two machine reading systems: REACH and Sparser. This network has 2.5 million unique mechanisms (each corresponding to an edge).

Starting from the mutated genes described in the previous section, we use a heat diffusion algorithm to find other relevant nodes in the knowledge network. We first normalize the mutation counts by the length of each protein (since larger proteins are statistically more likely to have random mutations, which can lack functional significance). We then apply the normalized mutation count as a “heat” on the node in the network corresponding to the gene. When calculating the diffusion of heat from nodes, we take into account the amount of evidence for each edge in the network. We take the number of independent evidences for the edge (i.e., the number of database entries or extractions from sentences in publications by reading systems) and use a logistic function, with midpoint set to 20 by default (parameterizable), to set a weight on the edge. We use a normalized Laplacian matrix-based heat diffusion algorithm on an undirected version of the network, which can be calculated in closed form using scipy.sparse.linalg.expm_multiply.
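The following is a minimal sketch of this kind of diffusion, not the TcgaCancerPrior implementation. The function names, the steepness of the logistic function, the diffusion time t, and the toy edge list are all assumptions; only the midpoint of 20 and the use of scipy.sparse.linalg.expm_multiply come from the description above.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, identity
from scipy.sparse.linalg import expm_multiply


def logistic_weight(n_evidence, midpoint=20.0, steepness=0.25):
    """Map an evidence count to an edge weight in (0, 1).

    The steepness value is an assumption made for this sketch.
    """
    return 1.0 / (1.0 + np.exp(-steepness * (n_evidence - midpoint)))


def diffuse_heat(edges, heat0, n_nodes, t=1.0):
    """Diffuse heat0 over an undirected, evidence-weighted network.

    edges is an iterable of (i, j, n_evidence) with integer node indices.
    """
    rows, cols, vals = [], [], []
    for i, j, n_ev in edges:
        w = logistic_weight(n_ev)
        # Undirected network: store the weight in both directions
        rows += [i, j]
        cols += [j, i]
        vals += [w, w]
    W = csr_matrix((vals, (rows, cols)), shape=(n_nodes, n_nodes))
    deg = np.asarray(W.sum(axis=1)).ravel()
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    D = diags(inv_sqrt)
    # Normalized Laplacian: L = I - D^(-1/2) W D^(-1/2)
    L = identity(n_nodes, format='csr') - D @ W @ D
    # Closed-form solution of the heat equation: h(t) = exp(-t L) h(0)
    return expm_multiply(-t * L, np.asarray(heat0, dtype=float))


# Toy path network 0 - 1 - 2; only node 0 carries mutations
edges = [(0, 1, 40), (1, 2, 40)]
counts = np.array([12.0, 0.0, 0.0])      # mutation counts (made up)
lengths = np.array([393.0, 189.0, 1210.0])  # protein lengths (made up)
heat = diffuse_heat(edges, counts / lengths, n_nodes=3)
```

In this toy example the heat decreases with distance from the mutated node. Note that because the Laplacian is degree-normalized, scaling all edge weights by the same factor leaves the result unchanged; the evidence-based weights matter through their relative sizes around each node.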

Having calculated the amount of heat on each node, we apply a percentile-based cutoff to retain the nodes with the most heat.
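A percentile-based cutoff can be sketched as follows. The function name, the default percentile, and the heat values are assumptions chosen for illustration.

```python
import numpy as np


def top_nodes_by_heat(heat_by_node, pct=95):
    """Return the nodes whose heat is at or above the given percentile."""
    values = np.array(list(heat_by_node.values()), dtype=float)
    threshold = np.percentile(values, pct)
    return {node for node, h in heat_by_node.items() if h >= threshold}


# Made-up heat values for five nodes
heat_by_node = {
    'TP53': 0.9, 'KRAS': 0.7, 'MDM2': 0.4, 'ACTB': 0.05, 'GAPDH': 0.01,
}
selected = top_nodes_by_heat(heat_by_node, pct=60)
```

With these values the 60th percentile falls between the heat of MDM2 and KRAS, so only TP53 and KRAS are retained.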

Assembling a prior network

Given a set of entities of interest, we turn to the INDRA DB and query for all Statements about these entities. This set of Statements becomes the starting point from which the model begins a process of incremental extension and assembly. This is implemented in emmaa.priors.prior_stmts.

Updating the network

Given the search terms associated with the model, we use a client to the PubMed web service to search for new literature content.
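As an illustration of what such a search involves, the sketch below builds a query against the NCBI E-utilities esearch endpoint and parses PMIDs from a JSON response. It is not the client EMMAA uses; the helper names are hypothetical, and the sample response contains placeholder IDs rather than real search results.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'


def build_search_url(term, reldate=None, retmax=100):
    """Build an esearch URL for a PubMed search term."""
    params = {'db': 'pubmed', 'term': term,
              'retmode': 'json', 'retmax': retmax}
    if reldate is not None:
        # Restrict results to records from the last `reldate` days
        params['reldate'] = reldate
        params['datetype'] = 'edat'
    return ESEARCH_URL + '?' + urlencode(params)


def parse_pmids(response_text):
    """Extract the list of PMIDs from an esearch JSON response."""
    return json.loads(response_text)['esearchresult']['idlist']


url = build_search_url('KRAS', reldate=1)
# Placeholder response body; the IDs are not real search results
sample = '{"esearchresult": {"count": "2", "idlist": ["111", "222"]}}'
pmids = parse_pmids(sample)
```

Restricting the search by relative date keeps each update limited to recently added literature for a given search term.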


Given a set of PMIDs, we use our Amazon Web Services (AWS) content acquisition and high-throughput reading pipeline to collect and read publications using the REACH and Sparser systems. We then use INDRA’s input processors to extract INDRA Statements from the reader outputs. We also associate metadata with each Statement: the date at which it was created and the search terms which are associated with it. These functionalities are implemented in the emmaa.readers.aws_reader module.

To optimize the gathering and reading of new publications, we decoupled this step from EMMAA; it is now performed independently by a scheduled job of the INDRA DB once a day. EMMAA’s model update jobs query the DB directly for Statements extracted from the new publications each day, making the model update cycle significantly faster. These queries are implemented in emmaa.readers.db_client_reader.

Automated incremental assembly

Each time new “raw” Statements are added to the model from new literature results, an assembly process is run which involves the following steps:

  • Filter out hypotheses

  • Map grounding of entities

  • Map sequences of entities

  • Filter out Statements with ungrounded entities

  • Run preassembly in which exact and partial redundancies are found and resolved

  • Calculate belief score for each Statement

  • Filter to Statements above a configured belief threshold

  • Filter out subsumed Statements with respect to partial redundancy graph

  • (In some models) filter out Statements representing indirect mechanisms

The set of Statements obtained this way is considered to be “assembled” at the knowledge level. It is this assembled set of Statements that is considered when showing update statistics on the Dashboard. The newly obtained assembled Statements are also evaluated against Statements already existing in the model. Note that Statements below the belief threshold still remain in the “raw” model knowledge and can later advance to be included in the published model if they collect enough evidence to reach the threshold.
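The interplay between the “raw” Statements and the belief threshold can be sketched with toy data. Plain dictionaries stand in for INDRA Statements here, only three of the filtering steps are shown, and the function name, field names, and threshold value are assumptions for illustration.

```python
def assemble(raw_stmts, belief_threshold=0.8):
    """Apply a few of the filters listed above to toy Statement dicts."""
    # Filter out hypotheses
    stmts = [s for s in raw_stmts if not s['hypothesis']]
    # Filter out Statements with ungrounded entities
    stmts = [s for s in stmts if all(s['grounded'])]
    # Filter to Statements at or above the belief threshold
    stmts = [s for s in stmts if s['belief'] >= belief_threshold]
    return stmts


raw = [
    {'id': 'a', 'hypothesis': False, 'grounded': [True, True], 'belief': 0.95},
    {'id': 'b', 'hypothesis': True, 'grounded': [True, True], 'belief': 0.99},
    {'id': 'c', 'hypothesis': False, 'grounded': [True, False], 'belief': 0.90},
    {'id': 'd', 'hypothesis': False, 'grounded': [True, True], 'belief': 0.60},
]
published = assemble(raw)

# Statement 'd' stays in the raw set; once new evidence raises its
# belief above the threshold, the next assembly run includes it.
raw[3]['belief'] = 0.85
published_later = assemble(raw)
```

Because assembly always starts from the full raw set, a Statement that was filtered out in one update can appear in a later one without any special handling.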

A new Statement can relate to the existing model in the following ways:

  • Novel: there is no such mechanism yet in the model

  • Redundant / Corroborating: the mechanism represented by the Statement is already in the model, so the new Statement provides corroborating evidence for it

  • Generalization: the mechanism is a more general form of one already in the model

  • Subsumption: the mechanism is a more specific form of one already in the model

  • Conflicting: the mechanism conflicts with one already in the model
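One way these relationships could be determined is sketched below on toy data. The Stmt tuple, the relate function, and the rule that a Statement without a modification site is more general than one with a site are simplifying assumptions; they do not reproduce INDRA's refinement or conflict detection.

```python
from collections import namedtuple

# Toy Statement: type, subject, object, and an optional modification site
Stmt = namedtuple('Stmt', ['type', 'subj', 'obj', 'site'])

OPPOSITES = {'Activation': 'Inhibition', 'Inhibition': 'Activation'}


def relate(new, model):
    """Classify a new Statement against the Statements in the model.

    Returns the first relationship found, or 'novel' if there is none.
    """
    for old in model:
        if (new.subj, new.obj) != (old.subj, old.obj):
            continue
        if OPPOSITES.get(new.type) == old.type:
            return 'conflicting'
        if new.type != old.type:
            continue
        if new.site == old.site:
            return 'corroborating'
        if new.site is None:
            # New Statement omits a detail the existing one has
            return 'generalization'
        if old.site is None:
            # New Statement adds a detail the existing one lacks
            return 'subsumption'
    return 'novel'


model = [
    Stmt('Phosphorylation', 'MAP2K1', 'MAPK1', 'T185'),
    Stmt('Phosphorylation', 'AKT1', 'GSK3B', None),
    Stmt('Activation', 'KRAS', 'BRAF', None),
]
examples = {
    'corroborating': Stmt('Phosphorylation', 'MAP2K1', 'MAPK1', 'T185'),
    'generalization': Stmt('Phosphorylation', 'MAP2K1', 'MAPK1', None),
    'subsumption': Stmt('Phosphorylation', 'AKT1', 'GSK3B', 'S9'),
    'conflicting': Stmt('Inhibition', 'KRAS', 'BRAF', None),
    'novel': Stmt('Activation', 'TP53', 'MDM2', None),
}
results = {label: relate(stmt, model) for label, stmt in examples.items()}
```

Each example Statement is classified under the label it is keyed by, covering the five cases in the list above.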

Currently, the dashboard lists new Statements without explicitly showing what relationship they have to the existing model.