S1000 corpus

The S1000 corpus is a comprehensive manual reannotation and extension of the S800 corpus that allows highly accurate recognition of species names by both machine-learning and dictionary-based methods. On this page we have gathered links to the publication, the corpus, and the datasets and codebases related to this project.

Publication and Corpus

Publication in Oxford Bioinformatics: S1000: A better taxonomic name corpus for biomedical information extraction
The S1000 Corpus in BRAT standoff format
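
For readers unfamiliar with BRAT standoff, each document is a plain-text file paired with an annotation file in which text-bound entities appear as tab-separated lines: an ID starting with `T`, a label followed by character offsets, and the covered text. The following is a minimal sketch of reading such lines in Python. The sample sentence, the `Species` label, and the names `Entity` and `parse_ann` are illustrative assumptions, not taken from the corpus or its codebases.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """One text-bound annotation (hypothetical helper type)."""
    ann_id: str
    label: str
    spans: list  # list of (start, end) character offsets
    text: str


def parse_ann(ann_text: str) -> list:
    """Collect text-bound ("T") annotations from BRAT standoff content."""
    entities = []
    for line in ann_text.splitlines():
        # Only lines whose ID starts with "T" are text-bound entities;
        # notes, relations, and other annotation kinds are skipped here.
        if not line.startswith("T"):
            continue
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate their fragments with ";".
        spans = []
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Made-up example document and annotations (assumed label name "Species").
sample_txt = "Escherichia coli infects Homo sapiens."
sample_ann = (
    "T1\tSpecies 0 16\tEscherichia coli\n"
    "T2\tSpecies 25 37\tHomo sapiens\n"
    "#1\tAnnotatorNotes T1\texample note\n"
)

for entity in parse_ann(sample_ann):
    start, end = entity.spans[0]
    print(entity.ann_id, entity.label, sample_txt[start:end])
```

Offsets are character positions into the paired text file, with the end offset exclusive, so slicing the text with a span should reproduce the annotated string.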

Datasets

The Zenodo project related to S1000 contains:
- the S1000 corpus split into training, development and test sets, in BRAT and CoNLL formats
- the guidelines used during annotation of the corpus (also available as an annodoc here)
- the dictionary used by the Jensenlab tagger
- results from large-scale tagging with the Jensenlab tagger
- the model used for the large-scale run of the transformer-based method
- results from large-scale tagging with the transformer-based method
The input documents for the large-scale runs:
- for the Jensenlab tagger, hosted here and here
- for the transformer-based method, hosted here (Note: the pre-processed input documents are the same for both methods; only the document format differs)

Code

Useful scripts to reproduce the results presented in the publication can be found in this GitHub repo
The codebase for the transformer-based NER tagger can be found here, and the codebase used for the large-scale run here