The S1000 corpus is a comprehensive manual reannotation and extension of the S800 corpus that allows highly accurate recognition of species names by both machine learning and dictionary-based methods. This page gathers links to the publication, the corpus, the datasets, and the codebases related to the project.

Publication and Corpus


  • The Zenodo project related to S1000, which contains:
    • The S1000 corpus split into training, development, and test sets in BRAT and CoNLL formats
    • The guidelines used during annotation of the corpus (also available as an annodoc here)
    • The dictionary used by the Jensenlab tagger
    • Results from large-scale tagging with the Jensenlab tagger
    • The model used for the large-scale run of the transformer-based method
    • Results from large-scale tagging with the transformer-based method
  • The input documents for the large-scale runs:
    • For the Jensenlab tagger, hosted here and here
    • For the transformer-based method, hosted here (Note: the pre-processed input documents are the same for both methods; only the document format differs)


  • Useful scripts to reproduce the results presented in the publication can be found in this GitHub repository

  • The codebase for the transformer-based NER tagger can be found here, and the codebase used for the large-scale run here
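
For readers loading the BRAT version of the corpus, the sketch below parses the text-bound annotations (the `T` lines) of a BRAT standoff `.ann` file. It is a minimal illustration and not code from the S1000 repositories: the function name `parse_brat_entities`, the `Entity` type, the sample annotations, and the `Species` label are assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    """One text-bound annotation (hypothetical helper type)."""
    ann_id: str                   # e.g. "T1"
    label: str                    # entity type, e.g. "Species" (assumed label)
    spans: List[Tuple[int, int]]  # character offsets; more than one if discontinuous
    text: str                     # surface string covered by the annotation


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Parse the T lines of a BRAT standoff file; other line types are skipped."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip notes (#), relations (R), attributes (A), etc.
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line
        ann_id, type_and_offsets, text = parts[0], parts[1], parts[2]
        label, _, offsets = type_and_offsets.partition(" ")
        spans = []
        # Discontinuous annotations separate their fragments with ";"
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Made-up sample for the text "Escherichia coli and Homo sapiens"
sample = (
    "T1\tSpecies 0 16\tEscherichia coli\n"
    "T2\tSpecies 21 33\tHomo sapiens\n"
    "#1\tAnnotatorNotes T1\texample note\n"
)

for entity in parse_brat_entities(sample):
    print(entity.ann_id, entity.label, entity.spans, entity.text)
```

In practice the `.ann` content would be read from a file in the corpus download, and the offsets would be checked against the matching `.txt` file; the actual entity labels should be taken from the annotation guidelines.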