We are always looking for Master's students to join the lab to work on projects, ranging from small 7.5 or 15 ECTS point projects to full 30 ECTS point thesis projects. All projects require basic programming knowledge.
General project topics
Text mining of the biomedical literature has been successful in extracting interactions between a broad spectrum of biomedical entities. The JensenLab co-develops a range of widely used resources that catalog associations, such as STRING for protein–protein interactions and DISEASES for disease–gene associations. A central component of all the resources is automatic mining of abstracts and open access full-text articles available in PubMed and PubMed Central. As you can see below, we have already done many student projects related to text mining.
However, text mining is by no means the only topic available for student projects. The group also has a strong interest in network biology, including the application of machine/deep learning to networks, as well as computational analysis of mass spectrometry-based proteomics data.
Examples of past projects
Below is a list of projects that have already been done by students or are currently being done. These projects are thus not available, but they should give you an idea of the kind of concrete projects that are feasible to do as a student project.
Deep learning for link prediction in multilayer protein networks
The STRING database integrates many types of evidence that allow us to link proteins. These links include a broad range of edge types, such as functional associations, physical interactions, genetic interactions, and sequence similarity. The STRING network can thus be considered a so-called multilayer network. However, despite the large amounts of data integrated in STRING, our current knowledge of interactions is incomplete, and it is thus of interest to be able to predict missing edges, a task known as link prediction or link regression.
The aim of the project is to assess several state-of-the-art deep-learning (DL) methods designed to work on networks, such as node2vec, graph neural networks, and graph attention networks. You will first compile the needed training and test sets based on the data in STRING and subsequently use these to train DL models that produce vector embeddings of the network neighborhoods of nodes. You will primarily use these node embeddings as input for predicting links in the test set and evaluate the performance of each method for each type of interaction (e.g. physical interactions or genetic interactions).
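The evaluation step described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the toy embeddings, the layer names, the edge lists, and the choice of cosine similarity as the link score are all hypothetical stand-ins, and a real project would use embeddings learned by node2vec or a graph neural network and most likely a trained classifier on top of them.

```python
import math

def cosine_score(emb, u, v):
    """Score a candidate edge by the cosine similarity of its two node embeddings."""
    a, b = emb[u], emb[v]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def roc_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def evaluate(emb, edges_by_layer):
    """Compute one AUC per interaction type (network layer)."""
    results = {}
    for layer, edges in edges_by_layer.items():
        pos = [cosine_score(emb, u, v) for u, v in edges["pos"]]
        neg = [cosine_score(emb, u, v) for u, v in edges["neg"]]
        results[layer] = roc_auc(pos, neg)
    return results

# Hypothetical toy embeddings; real ones would be produced by a trained DL model.
embeddings = {
    "A": [1.0, 0.1],
    "B": [0.9, 0.2],
    "C": [0.1, 1.0],
    "D": [0.2, 0.9],
}
# Hypothetical held-out test edges ("pos") and sampled non-edges ("neg") per layer.
test_edges = {
    "physical": {"pos": [("A", "B")], "neg": [("A", "C")]},
    "genetic": {"pos": [("C", "D")], "neg": [("B", "D")]},
}

auc_by_layer = evaluate(embeddings, test_edges)
print(auc_by_layer)
```

Reporting one score per layer keeps the comparison between methods separate for each type of interaction, which is what the project asks for.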
Project advisor: Lars Juhl Jensen
Text mining of pharmacologically relevant human protein complexes
Text mining of biomedical literature is a powerful tool to automatically extract and integrate knowledge from the vast body of existing research that is documented in scientific publications. Widely used community databases, such as STRING and DISEASES, already utilize text mining to extract biological associations from the scientific literature. The first step in this process is to detect the names of biological entities in text. The text-mining pipeline used in the mentioned databases achieves this by utilizing constantly developed, concept-specific dictionaries of names, compiled from data deposited in a plethora of well-established resources like UniProtKB or OMIM. However, the recognition of higher-level concepts, like protein complexes, is still lacking, mainly because a lot of information is scattered across several databases and no single reference resource exists, rendering the creation of a dedicated dictionary a very difficult task.
The aim for this project will be to enable named entity recognition (NER) of protein complexes in biomedical texts. Specifically, during this project you will focus on three families of proteins/protein complexes that include prominent drug targets, namely G-protein coupled receptors, ion channels, and protein kinase complexes. Your goal is to make statements about the literature coverage of these complexes and their associations with other biological entities, for example drugs. To achieve this goal, names and synonyms for protein complexes from various resources (e.g. Complex Portal, Reactome and Gene Ontology) will need to be collected and integrated in a single concept-specific dictionary. This will include mapping the proteins in complexes between different resources and utilizing similarity measures to resolve cases where they disagree. Afterwards, this dictionary will be integrated in the existing text-mining pipeline, which will allow the extraction of relationships between complexes and other entities, like proteins and drugs. Ultimately, this will aid in uncovering novel associations of biological entities as a step towards the discovery of previously overlooked treatment options.
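As a rough illustration of the integration step, the sketch below merges complex records from different resources when their member proteins overlap sufficiently, pooling the names as synonyms. Everything in it is an assumption made for the example: the record format, the placeholder protein identifiers, the greedy merging strategy, and the use of the Jaccard index with a threshold of 0.5 as the similarity measure.

```python
def jaccard(a, b):
    """Jaccard index of two collections of protein identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_complexes(records, threshold=0.5):
    """Greedily merge records whose member sets overlap at or above the threshold.

    Each record is a (name, members) pair. Names of merged records are
    pooled as synonyms and their member sets are unioned.
    """
    merged = []
    for name, members in records:
        members = set(members)
        for entry in merged:
            if jaccard(entry["members"], members) >= threshold:
                entry["names"].add(name)
                entry["members"] |= members
                break
        else:
            # No sufficiently similar complex seen so far: start a new entry.
            merged.append({"names": {name}, "members": members})
    return merged

# Hypothetical records; identifiers are placeholders, not real accessions.
records = [
    ("complex X", {"p1", "p2", "p3"}),        # e.g. from resource 1
    ("X complex", {"p1", "p2", "p3", "p4"}),  # e.g. from resource 2
    ("complex Y", {"p5", "p6"}),
]

dictionary = merge_complexes(records)
for entry in dictionary:
    print(sorted(entry["names"]), sorted(entry["members"]))
```

A greedy pass like this depends on the order of the records, so a real implementation would probably need a more careful strategy for cases where the resources disagree.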
Should we treat articles equally? Introducing article weights in text mining of the scientific literature
Text mining combines information from multiple papers to find connections in unstructured text data. Text mining tools can be used in many disciplines, such as drug discovery, proteomics, ecology, healthcare and medicine. The JensenLab tagger is a highly efficient tagging algorithm, used in biomedical text mining and implemented in C++, which can match a document against a dictionary of terms of interest, such as genes/proteins and diseases. Associations between different terms are identified based on their co-occurrence in text. More specifically, different weights are attributed to the associations based on the distance between the entities, with higher scores for entities appearing in the same sentence, followed by entities in the same paragraph and finally entities in the same document. However, the information from different articles is considered equally “good”, and all articles thus carry equal weight in the final scores. The goal of this project is to extend tagger with the ability to weight documents differently.
The text-mined associations produced by tagger are used to populate various biological databases, such as the STRING database of protein interactions. Thus, to test the implications of document weighting for the text-mined associations, you will compare the results produced with and without document weighting to evaluate whether weighting can produce better-quality scores. Weighting will be introduced based on the impact factor (IF) of the journal in which the paper is published. Testing different weighting schemes based on journal IFs will give us a rough idea of whether articles should be treated equally regardless of the journal in which they are published. All in all, the project aims to build a software infrastructure that allows article weighting. As a first test case, an IF-based weighting scheme will be used and evaluated for its applicability.
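To make the idea concrete, here is a simplified sketch of co-occurrence scoring with and without document weights. It is not the scoring scheme implemented in tagger: the level weights, the input format, the function names, and the log-scaled impact-factor weighting are all assumptions chosen purely for illustration.

```python
import math
from collections import defaultdict

# Hypothetical weights: closer co-occurrences contribute more to the score.
LEVEL_WEIGHTS = {"sentence": 1.0, "paragraph": 0.5, "document": 0.25}

def if_weight(impact_factor):
    """One possible weighting scheme: log-scaled journal impact factor."""
    return math.log2(1.0 + impact_factor)

def cooccurrence_scores(cooccurrences, doc_weights=None):
    """Sum level weights over all co-occurrences of each entity pair.

    cooccurrences: iterable of (doc_id, (entity1, entity2), level).
    doc_weights: optional mapping doc_id -> weight; if omitted, every
    document counts equally, i.e. has weight 1.0.
    """
    scores = defaultdict(float)
    for doc_id, pair, level in cooccurrences:
        weight = 1.0 if doc_weights is None else doc_weights.get(doc_id, 1.0)
        scores[frozenset(pair)] += weight * LEVEL_WEIGHTS[level]
    return dict(scores)

# Hypothetical input: which pairs co-occur in which documents, and at what level.
cooccurrences = [
    ("doc1", ("geneA", "diseaseX"), "sentence"),
    ("doc2", ("geneA", "diseaseX"), "document"),
    ("doc2", ("geneB", "diseaseX"), "sentence"),
]
impact_factors = {"doc1": 3.0, "doc2": 1.0}  # made-up values
doc_weights = {doc: if_weight(v) for doc, v in impact_factors.items()}

unweighted = cooccurrence_scores(cooccurrences)
weighted = cooccurrence_scores(cooccurrences, doc_weights)
print(unweighted)
print(weighted)
```

Keeping the document weights as a separate, optional input means that the same scoring code can be run with and without weighting, which is the comparison the project calls for.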
Mining the literature to detect connections between lifestyle and diseases
The initial goal of this project is to enhance the prototype version of the lifestyle factor ontology, which will then be used to construct an exposure and lifestyle vocabulary. After this goal has been achieved, this vocabulary will be used, in combination with pre-constructed, well-annotated disease vocabularies, for the systematic dictionary-based text mining of known associations between diseases and lifestyle factors from the biomedical literature.
To enrich the lifestyle factor ontology, you will need to train state-of-the-art, context-based deep learning models for natural language processing that are specific to the biomedical domain, like BioBERT, to detect terms in the scientific literature that are not present in the current ontology, classify them as belonging to one of the pre-existing ontology branches, and then evaluate whether they are synonyms of existing terms or new terms. Moreover, you will use the results from in-house pre-trained deep learning models to resolve cases of ambiguity with other well-established dictionaries (e.g. clashes between lifestyle factors and chemicals). This will allow you to later focus on the extraction of associations with relevance to disease onset and development. Following the creation of the novel lifestyle vocabulary, you will use the JensenLab dictionary-based tagger to extract relations between these factors and diseases from the entire biomedical literature.
Using deep learning to decipher the relationships of lifestyle factors with diseases
In the JensenLab we have developed a lifestyle factor ontology, ranging from nutritional to socio-economic factors, which we used to create a dictionary and detect these terms in the literature. Subsequently, we were able to combine this dictionary with a well-established disease dictionary and perform simple co-occurrence-based relation extraction. However, naive extraction of lifestyle factors mentioned together with a disease results in a mix of factors that might increase the risk, decrease the risk, act as therapeutics, or play any other role in the disease.
The goal of this project is to create a prototype deep-learning model that will be able to detect lifestyle factors that affect the risk of disease onset and development. To this end, you will first manually annotate a corpus with diseases and lifestyle factors in order to generate a labelled set of positive and negative associations between the two. Then, you will train state-of-the-art, context-based deep learning models for natural language processing that are specific to the biomedical domain, like BioBERT, to predict associations between lifestyle factors and diseases based on the context of the words surrounding these entities. Finally, you will apply your model to the entire scientific literature in an effort to unveil the nature of relationships between lifestyle factors and diseases.
Using machine learning as a weapon to fight scientific fraud by detecting “paper-mill” publications
Scientific fraud is a growing concern in research, which has increasingly come to light in the last several years. The scientific community is fighting against it, and to support that purpose, databases like Retraction Watch and PubPeer have come into existence as a means to catalogue and act against this practice in science. Many scientists, e.g. Elisabeth Bik, voluntarily invest their time in this endeavour, but their work is becoming increasingly difficult, as there are no automated tools to support the detection of fraudulent research, and finding all the cases manually is a never-ending job.
In this project you will implement a machine learning-based method to detect fraud in the biomedical literature, which consists of over 30 million publications extracted from PubMed and PubMed Central. The method that you will develop during this project should be able, using the text and metadata from these publications, to detect those originating from known “paper mills” (papers by different authors and affiliations that all appear to have been generated by the same source), and potentially to detect as yet unknown “paper mills” using clustering techniques. Succeeding in developing this method would allow the generation of a list of publications that should not be text mined when searching for associations between biological entities in the scientific literature. This information will then be integrated in the text-mining pipeline that is used to update databases like STRING and DISEASES, which serve thousands of scientists every day.
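One simple way to picture the clustering idea is to group publications whose texts are near-duplicates of each other. The sketch below does this with word shingles, Jaccard similarity and single-linkage grouping. It is an illustrative assumption rather than the method of the project: the example texts are invented, the shingle size and threshold are arbitrary, and the all-against-all comparison would not scale to 30 million publications without an approximate technique.

```python
def shingles(text, k=3):
    """Set of all runs of k consecutive lowercased words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def cluster_similar(docs, threshold=0.4, k=3):
    """Group documents connected by chains of pairwise similar texts.

    docs: mapping doc_id -> text. Returns a list of clusters (sorted lists
    of doc_ids) containing at least two documents.
    """
    ids = list(docs)
    sets = {d: shingles(docs[d], k) for d in ids}
    parent = {d: d for d in ids}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            union = sets[a] | sets[b]
            if union and len(sets[a] & sets[b]) / len(union) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for d in ids:
        clusters.setdefault(find(d), []).append(d)
    return [sorted(c) for c in clusters.values() if len(c) > 1]

# Invented example texts; p1 and p2 differ by a single word.
docs = {
    "p1": "the expression of gene A was significantly increased in tumor tissues",
    "p2": "the expression of gene B was significantly increased in tumor tissues",
    "p3": "we present a new algorithm for network clustering",
}

suspicious = cluster_similar(docs)
print(suspicious)
```

Text similarity alone would also flag legitimate boilerplate, such as standard methods sections, so in practice it would need to be combined with the metadata mentioned above.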
BERT-based context-aware classifier for biomedical relation extraction
Relation extraction (RE) is a natural language processing task that aims to extract relationships between pairs of entities from a corpus of text, which can be either general or domain-specific. The three main approaches to this task are rule-based, unsupervised, and supervised RE. The supervised approach has been shown to perform better than the rule-based and unsupervised approaches. However, its downside is that it needs large amounts of annotated training data, which can be costly or infeasible to obtain, depending on the context. To mitigate this problem, distant supervision was proposed for training set generation. A recent study has demonstrated that a logistic classifier trained on a distantly supervised data set can outperform the baseline unsupervised model in identifying disease-gene and tissue-gene associations. Thus, there is support for the use of distant supervision in biomedical relation extraction. However, there is room for further improvement of the approach for some types of relationships, e.g. tissue-gene associations, which is the main goal of this study. For this task, you will use the recently developed Bidirectional Encoder Representations from Transformers (BERT) to generate embeddings. This approach has some advantages over fastText’s extension of word2vec embeddings that was used by Junge and Jensen. Embeddings from BERT can capture the context of the whole sequence, unlike those from word2vec, which have a limited context window. Furthermore, BERT embeddings can be fine-tuned for this task by extending BERT with a simple output layer, unlike word2vec embeddings, which need an additional task-specific neural network.
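Distant supervision, as used for training set generation, can be sketched in a few lines: every co-mention of two entities is labelled automatically by looking the pair up in a set of known associations, with no manual annotation. The function name, the input format, and the example sentences and entity names below are hypothetical and only serve to show the principle.

```python
def distant_labels(sentences, known_pairs):
    """Label every co-mentioned (tissue, gene) pair without manual annotation.

    sentences: iterable of (text, tissues, genes), where the entity mentions
    are assumed to have been found already by a tagger.
    known_pairs: set of (tissue, gene) pairs from an existing knowledge base.
    Returns a list of (text, tissue, gene, label) examples, with label 1 if
    the pair is in the knowledge base and 0 otherwise.
    """
    examples = []
    for text, tissues, genes in sentences:
        for tissue in tissues:
            for gene in genes:
                label = int((tissue, gene) in known_pairs)
                examples.append((text, tissue, gene, label))
    return examples

# Hypothetical knowledge base and pre-tagged sentences.
known_pairs = {("liver", "geneA")}
sentences = [
    ("geneA is highly expressed in liver.", ["liver"], ["geneA"]),
    ("Unlike in liver, geneB was not detected in brain.", ["liver", "brain"], ["geneB"]),
    ("geneA was not detected in liver samples.", ["liver"], ["geneA"]),
]

training_set = distant_labels(sentences, known_pairs)
for example in training_set:
    print(example)
```

The last sentence receives a positive label although it states the opposite, which illustrates why distantly supervised labels are noisy and why a classifier that takes the context of the sentence into account may help.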