In these exercises, we will use a variety of text-mining tools and databases based on text mining to interpret the associations of genes and diseases. The exercises will teach you how to:
- automatically highlight named entities in a web page
- use named entity recognition for synonym-aware information retrieval
- extract associations based on cooccurrence of entities in the literature
All exercises are purely web-based. We recommend using Firefox, as some functionality will not work in the latest Chrome and Chrome-based browsers.
In this exercise we will first introduce the basics of text mining: 1) dictionary-based named entity recognition and 2) how this can be used to help retrieve literature. Afterwards we will move on to how one can use the complete literature to 3) extract associations between entities and finally 4) how these associations can be used for knowledge discovery.
1.1 Named Entity Recognition
The goal of named entity recognition (NER) is to find names mentioned in text and resolve them to the underlying biomedical entities (document → entity A, entity B, entity C). To illustrate this, we will use the EXTRACT tool, which is designed to use NER to support manual database curation.
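To make the idea of dictionary-based NER concrete, the following is a minimal Python sketch, not the implementation behind EXTRACT. It assumes a tiny hand-made dictionary in which the names "PARK8" and "Parkinson disease" are illustrative synonyms; the function names `build_pattern` and `tag` are hypothetical. Each match is resolved to an entity type and an identifier, which is the "document → entities" step described above.

```python
import re

# Toy dictionary: lowercase name -> (entity type, identifier).
# The synonym entries are illustrative assumptions; a real dictionary
# holds millions of names and handles spelling variants.
DICTIONARY = {
    "lrrk2": ("gene", "ENSP00000298910"),
    "park8": ("gene", "ENSP00000298910"),
    "parkinson's disease": ("disease", "DOID:14330"),
    "parkinson disease": ("disease", "DOID:14330"),
}


def build_pattern(names):
    """Compile one case-insensitive pattern that prefers longer names."""
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # The lookarounds stop a name from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


PATTERN = build_pattern(DICTIONARY)


def tag(text):
    """Return (start, end, matched text, entity type, identifier) tuples."""
    hits = []
    for match in PATTERN.finditer(text):
        entity_type, identifier = DICTIONARY[match.group(0).lower()]
        hits.append((match.start(), match.end(), match.group(0),
                     entity_type, identifier))
    return hits


if __name__ == "__main__":
    for hit in tag("Mutations in PARK8 cause Parkinson's disease."):
        print(hit)
```

Because both gene names map to the same identifier, a document is tagged with the same entity regardless of which synonym the authors used.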
Install the EXTRACT bookmarklet as described on the EXTRACT website. We recommend using Firefox, as some functionality might not work in the latest Chrome and Chrome-based browsers. If you wish to run EXTRACT on articles in several formats (e.g. Word documents or PDFs), please use the OnTheFly2.0 web server.
Hint: If the bookmarks toolbar is not showing in Firefox, go to the menu bar and select View → Toolbars → Bookmarks Toolbar → Always Show.
Open the paper “Identification of BCL-XL as highly active survival factor and promising therapeutic target in colorectal cancer” (Scherr et al., 2020) and click the EXTRACT bookmarklet. After a short time, terms should be highlighted in the text.
What do the different colors mean? How many different types of biomedical entities can you see in the abstract? Do any of these terms seem to have been put in the wrong category, and can you think of a reason why that happens if you hover over the term(s)?
By clicking or hovering over a tagged term, you will get a popup that includes its standard name, entity type, database or ontology identifier, and a link to its reference record. Click or hover over BCL-XL and colorectal cancer.
Is there a difference between BCL-XL and BCL2L1?
What is the Ensembl Protein ID of BCL-XL and what is the Disease Ontology identifier of colorectal cancer?
Select the title of the paper and click the EXTRACT bookmarklet. Hover your mouse over the terms in the title text. You will see that the identified terms are then highlighted in the results table.
Which information is then provided in addition to what is shown in the popup?
1.2 Information retrieval
The goal of information retrieval (IR) is to find the documents pertaining to a topic of interest. When the topic is a biological entity (A), NER can be used to index the literature and thereby support retrieval of relevant documents (A → documents).
We run the same NER system used in EXTRACT on all of PubMed every week and make the results available through a suite of web resources. One such resource is DISEASES. While primarily intended for viewing disease–gene associations extracted from the literature, it can also be used for information retrieval.
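The indexing step (A → documents) can be sketched as an inverted index from entity identifiers to documents. This is only an illustration under stated assumptions: the document IDs and the tagging results in `TAGGED_DOCS` are made up, and `build_index` and `retrieve` are hypothetical helper names, not part of any Jensenlab resource.

```python
from collections import defaultdict

# Hypothetical NER output: document ID -> identifiers tagged in that document.
TAGGED_DOCS = {
    "doc1": {"ENSP00000298910", "DOID:14330"},
    "doc2": {"DOID:14330"},
    "doc3": {"ENSP00000298910", "DOID:77"},
}


def build_index(tagged_docs):
    """Invert the tagging results: entity identifier -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, entities in tagged_docs.items():
        for entity in entities:
            index[entity].add(doc_id)
    return index


def retrieve(index, *entities):
    """Return the documents that mention every one of the given entities."""
    doc_sets = [index.get(entity, set()) for entity in entities]
    return set.intersection(*doc_sets) if doc_sets else set()


if __name__ == "__main__":
    index = build_index(TAGGED_DOCS)
    print(sorted(retrieve(index, "ENSP00000298910")))
    print(sorted(retrieve(index, "ENSP00000298910", "DOID:77")))
```

Since the index is keyed by identifier rather than by name, a query finds a document no matter which synonym of the entity appears in its text.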
Click the following link to retrieve abstracts that mention SCN2A:
Do the abstracts shown in the first two pages all mention SCN2A?
You can similarly use NER to retrieve abstracts for any disease in the Disease Ontology. For example, the following query will retrieve abstracts for neurodegenerative disease (DOID:1289):
Which diseases are highlighted in the abstracts? Can you think of the reason why they are highlighted?
1.3 Relation extraction
The goal of cooccurrence-based relation extraction (RE) is to link entities (A, B) to each other based on them being mentioned together in documents (A → documents → B).
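As a rough illustration of the A → documents → B idea, the sketch below counts how often each gene and disease are tagged in the same abstract and compares that count with what would be expected if mentions were independent. This simple observed-over-expected ratio is an assumption chosen for clarity; it is not the scoring scheme used by DISEASES. The abstracts and entity names are made up.

```python
from collections import Counter
from itertools import product

# Hypothetical NER output per abstract: (genes mentioned, diseases mentioned).
ABSTRACTS = [
    ({"GENE_A"}, {"DISEASE_X"}),
    ({"GENE_A", "GENE_B"}, {"DISEASE_X"}),
    ({"GENE_B"}, {"DISEASE_Y"}),
    ({"GENE_A"}, {"DISEASE_X", "DISEASE_Y"}),
]


def cooccurrence_scores(abstracts):
    """Map (gene, disease) -> (co-mention count, observed/expected ratio)."""
    n_docs = len(abstracts)
    gene_counts = Counter()
    disease_counts = Counter()
    pair_counts = Counter()
    for genes, diseases in abstracts:
        gene_counts.update(genes)
        disease_counts.update(diseases)
        # Every gene-disease combination in one abstract counts once.
        pair_counts.update(product(genes, diseases))

    scores = {}
    for (gene, disease), observed in pair_counts.items():
        # Expected co-mentions if the two entities occurred independently.
        expected = gene_counts[gene] * disease_counts[disease] / n_docs
        scores[(gene, disease)] = (observed, observed / expected)
    return scores


if __name__ == "__main__":
    ranked = sorted(cooccurrence_scores(ABSTRACTS).items(),
                    key=lambda item: item[1][1], reverse=True)
    for pair, (count, ratio) in ranked:
        print(pair, count, round(ratio, 2))
```

Normalizing by the expected count keeps very frequently mentioned entities from dominating the ranking purely because they appear in many abstracts.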
Go to https://diseases.jensenlab.org/ and query for colorectal cancer. Click on the disease term on the Search results page.
Which gene is most strongly associated with colorectal cancer according to text mining?
Click on TP53 in the text-mining table.
Do the abstracts in fact support an association between colorectal cancer and TP53? Compared with KRAS, which disease–gene association seems to be more clearly stated in the text? Can you think of a reason why?
Cooccurrence-based relation extraction is a very generic approach, which can be used to find associations between any two types of entities for which we can do NER. For example, we can use the same approach to extract EGFR-associated terms from the mammalian phenotype database:
Is the association between EGFR and Increased cell death well established in the literature? Do all the papers that mention the two terms in the first page of the results actually support this association?
In this exercise, we will focus on how one can utilize the text-mining tools used in exercise 1 to analyze an observed association between gastrointestinal system diseases and Parkinson's disease.
2.1 Using NER to dig deeper
LRRK2 is a protein that is well known to be involved in Parkinson's disease. To check if it has also been implicated in gastrointestinal system diseases, we will perform a systematic search for literature linking the two. A simple PubMed search retrieves no publications:
Since LRRK2 (ENSP00000298910) and gastrointestinal system diseases (DOID:77) are both named entities in our dictionary, we can instead use the results of NER to retrieve relevant documents:
The NER-based approach retrieves many more publications. Inspect some of these abstracts.
Are they relevant, and why were they not found by the initial search?
2.2 Linking diseases via genes
Above, we saw how existing text-mining resources can be used to retrieve abstracts connecting entities of interest and to extract associations. We will now use this in a few different ways to attempt to find genes that link two diseases of interest.
The first idea is to use NER-based information retrieval to find abstracts that mention both gastrointestinal system diseases (DOID:77) and Parkinson's disease (DOID:14330). Click the link below to view these abstracts:
Which, if any, genes do you see mentioned in these abstracts?
This approach obviously only works when the association between the two diseases is sufficiently well described in the literature for there to be abstracts that mention both diseases as well as the genes of interest. If that is not the case, but one has a candidate gene in mind, one can instead use information extraction to obtain a list of diseases for the gene in question to see if it is in fact associated with both diseases in the literature.
Go to https://diseases.jensenlab.org/, query for LRRK2, and view the disease associations obtained from text mining.
Is LRRK2 associated with both Parkinson's disease and gastrointestinal system diseases?
When one does not have a candidate gene, the best solution is to obtain two gene lists, one for each disease, and then identify the genes that rank high on both lists. This is a variant of the closed knowledge discovery problem. It is unfortunately not currently possible to perform such an analysis via a web interface, but it can be done by downloading the complete set of disease–gene associations from the DISEASES downloads page and analyzing them in Python or R. Alternatively, the analysis can be performed through Cytoscape as illustrated in the Cytoscape stringApp exercises.
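A minimal Python sketch of such a two-list analysis is shown below. It rests on several assumptions: the rows are made-up stand-ins for a downloaded file, the column positions (gene name in the second column, disease identifier in the third, score in the fifth) are assumed and should be checked against the actual file, and ranking shared genes by the lower of their two scores is just one reasonable choice. The function names are hypothetical.

```python
import csv
import io

# Made-up rows standing in for a downloaded association file.
# Assumed columns: gene ID, gene name, disease ID, disease name, score.
SAMPLE_TSV = (
    "G1\tGENE_A\tDOID:77\tgastrointestinal system disease\t3.0\n"
    "G1\tGENE_A\tDOID:14330\tParkinson's disease\t5.0\n"
    "G2\tGENE_B\tDOID:77\tgastrointestinal system disease\t4.5\n"
    "G2\tGENE_B\tDOID:14330\tParkinson's disease\t1.0\n"
    "G3\tGENE_C\tDOID:14330\tParkinson's disease\t6.0\n"
    "G4\tGENE_D\tDOID:77\tgastrointestinal system disease\t2.0\n"
    "G4\tGENE_D\tDOID:14330\tParkinson's disease\t2.5\n"
)

GENE_NAME_COL = 1   # assumed column positions; adjust to the real file
DISEASE_ID_COL = 2
SCORE_COL = 4


def gene_scores(rows, disease_id):
    """Map gene name -> best score among the rows for one disease."""
    scores = {}
    for row in rows:
        if row[DISEASE_ID_COL] == disease_id:
            gene = row[GENE_NAME_COL]
            score = float(row[SCORE_COL])
            scores[gene] = max(score, scores.get(gene, float("-inf")))
    return scores


def shared_genes(rows, disease_a, disease_b):
    """Genes linked to both diseases, ranked by the lower of the two scores."""
    rows = list(rows)
    scores_a = gene_scores(rows, disease_a)
    scores_b = gene_scores(rows, disease_b)
    shared = {gene: min(scores_a[gene], scores_b[gene])
              for gene in scores_a.keys() & scores_b.keys()}
    return sorted(shared.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    reader = csv.reader(io.StringIO(SAMPLE_TSV), delimiter="\t")
    for gene, score in shared_genes(reader, "DOID:77", "DOID:14330"):
        print(gene, score)
```

Taking the minimum of the two scores means a gene only ranks high if it is well supported for both diseases; to run this on a real download, replace the `io.StringIO` object with an opened file.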
Post any questions on this Padlet (just double-click in the background to add your question).