S1000 corpus
The S1000 corpus is a comprehensive manual reannotation and extension of the S800 corpus that enables highly accurate recognition of species names with both machine-learning and dictionary-based methods. On this page we have gathered links to the publication, the corpus, the datasets and the codebases related to this project.
Publication and Corpus
- Publication in Oxford Bioinformatics: S1000: A better taxonomic name corpus for biomedical information extraction
Datasets
- The Zenodo project related to S1000, which contains:
  - The S1000 corpus split into training, development and test sets, in BRAT and CoNLL formats
  - The guidelines used during annotation of the corpus (also available as an annodoc here)
  - The dictionary used by the Jensenlab tagger
  - Results from large-scale tagging with the Jensenlab tagger
  - The model used for the large-scale run of the transformer-based method
  - Results from large-scale tagging with the transformer-based method
- The input documents for the large-scale runs with:
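
For readers who want to inspect the CoNLL version of the corpus, the following is a minimal Python sketch of a reader, not the project's own tooling. It assumes a common CoNLL-style layout that should be checked against the actual files: one token per line, whitespace-separated columns with the token first and an IOB2 tag last, and a blank line between sentences. The function names `read_conll` and `extract_mentions` and the label `Species` in the usage note are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs.

    Assumption: token is the first column, the tag is the last column,
    and blank lines separate sentences.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        fields = raw.split()
        if not fields:  # blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((fields[0], fields[-1]))
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences


def extract_mentions(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (mention text, label) pairs from IOB2 tags.

    An I- tag that does not continue an open mention of the same
    label is ignored rather than repaired.
    """
    mentions: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None
    for token, tag in sentence:
        if tag.startswith("B-"):
            if tokens:
                mentions.append((" ".join(tokens), label or ""))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            tokens.append(token)
        else:
            if tokens:
                mentions.append((" ".join(tokens), label or ""))
            tokens, label = [], None
    if tokens:
        mentions.append((" ".join(tokens), label or ""))
    return mentions
```

As a usage example, passing an open file handle (or any iterable of lines) to `read_conll` and then calling `extract_mentions` on each sentence yields the annotated names; for a sentence tagged `Escherichia B-Species`, `coli I-Species`, `grows O`, the result would be a single mention `("Escherichia coli", "Species")`, provided the files follow the layout assumed above.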
Code
- Useful scripts to reproduce the results presented in the publication can be found in this GitHub repo
- The codebase for the transformer-based NER tagger can be found here, and the codebase used for the large-scale run here