Text-mining resources

Many of the other resources developed in the group rely heavily on dictionary-based named entity recognition (NER) of genes/proteins and other entities and concepts of biomedical relevance. We have therefore developed a wide range of biomedical dictionaries, tuned to work well with the associated open-source NER software, as well as corpora for benchmarking. We have also developed several methods for relation extraction.
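To illustrate what dictionary-based NER means in practice, the following is a minimal Python sketch of longest-match dictionary lookup over a text. It is not the group's tagger and does not use its dictionary format: the `DICTIONARY` entries, the identifiers, and the `tag` function are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy dictionary: lower-cased name -> (entity type, identifier).
# The names and identifiers are illustrative, not taken from the real dictionaries.
DICTIONARY = {
    "tp53": ("gene/protein", "ID:1"),
    "tumor protein p53": ("gene/protein", "ID:1"),
    "homo sapiens": ("organism", "ID:2"),
}

# Longest dictionary name, measured in words, bounds the lookup window.
MAX_WORDS = max(len(name.split()) for name in DICTIONARY)


def tag(text):
    """Return (start, end, matched_text, type, identifier) for each dictionary hit.

    At each token position the longest matching name wins, and matching
    resumes after the end of that name, so matches never overlap.
    """
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    matches = []
    i = 0
    while i < len(tokens):
        found = False
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            candidate = " ".join(text[s:e].lower() for s, e in tokens[i:i + n])
            if candidate in DICTIONARY:
                matches.append((start, end, text[start:end]) + DICTIONARY[candidate])
                i += n
                found = True
                break
        if not found:
            i += 1
    return matches
```

Calling `tag("Mutations in TP53 occur in Homo sapiens.")` would report one gene/protein match and one organism match, each with its character offsets. A production tagger additionally has to handle spelling variants, ambiguous names, and very large dictionaries, which this sketch ignores.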

Corpora

We have released three manually annotated text corpora for benchmarking NER of organism/species names and environmental descriptors:

Dictionaries

The gene/protein dictionary is based on the alias file of the STRING database. For convenience, we make available separate dictionaries for human and selected eukaryotic model organisms:

Most of our other dictionaries are built from existing taxonomy and ontology resources:

In addition to the dictionaries listed above, we also provide the combined full dictionary, which is used for text-mining in our databases, as well as the reduced tagger dictionary, which is used by the Tagger web service.

Tools

Together with collaborators, the group has developed two real-time text-mining tools, Reflect and EXTRACT, which run within a web browser and augment its functionality. Both tools identify named entities such as genes/proteins within web pages, highlight the identified names, and provide popups with information and functionality related to those entities. Whereas Reflect aims to augment pages with information relevant to typical readers, EXTRACT was designed primarily with database curators in mind. We currently recommend EXTRACT, as it is the more up-to-date of the two.

We also make available precomputed text-mining results, which are updated weekly and include both the direct results from named entity recognition and derived co-occurrence statistics. Users can access many of these through the topic-specific web resources ORGANISMS, COMPARTMENTS, TISSUES, and DISEASES. The text-mining results can also be accessed through a RESTful API.
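As a rough illustration of how co-occurrence statistics can be derived from NER output, the sketch below counts how often pairs of entities are mentioned in the same document and compares that with what independence would predict. This is an assumption-laden toy example, not the scoring scheme used in our resources: the function names, the input format (one set of entity identifiers per document), and the observed-over-expected ratio are all chosen for illustration only.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count single-entity and pairwise document frequencies.

    docs: iterable of sets of entity identifiers, one set per document
    (a hypothetical input format for this sketch).
    """
    single = Counter()
    pair = Counter()
    n_docs = 0
    for entities in docs:
        n_docs += 1
        single.update(entities)
        # Sort so that each unordered pair is stored under a single key.
        pair.update(combinations(sorted(entities), 2))
    return n_docs, single, pair


def observed_over_expected(a, b, n_docs, single, pair):
    """Ratio of observed co-mentions to the number expected under independence."""
    key = tuple(sorted((a, b)))
    expected = single[a] * single[b] / n_docs
    return pair[key] / expected if expected else 0.0


# Usage with made-up identifiers: G1 and D1 are co-mentioned in two of four documents.
docs = [{"G1", "D1"}, {"G1", "D1"}, {"G1"}, {"D2"}]
n_docs, single, pair = cooccurrence_counts(docs)
ratio = observed_over_expected("G1", "D1", n_docs, single, pair)
```

A ratio above 1 indicates that two entities are mentioned together more often than their individual frequencies would suggest; real scoring schemes typically also weight co-mentions by context, such as sentence versus document.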

The source code for the underlying tagger software is available on GitHub, along with a detailed README describing how to use the tagger.

To improve the quality of the text-mining results, we are developing a context-aware co-occurrence scoring system named CoCoScore. Although this is not yet implemented in the resources listed above, it is already available as open-source software on GitHub.