This project seeks to improve access to a large body of scanned biodiversity literature by developing fuzzy matching of search terms, so that searches remain robust to errors introduced by OCR and other sources. Biological knowledge, especially taxonomic knowledge, is often presented in a stylised form, with typographical cues signalling its meaning. The project aims to use this typographical information, together with other contextual clues, to identify and tag document content by type. This combination of Natural Language Processing (NLP) with typographical information extraction should be applicable in other fields that have historically used structured data. We plan to demonstrate the generality of the procedures developed by Lu et al. (2008) and to extend them, applying them to the Biodiversity Heritage Library (BHL) scans from the Natural History Museum in London.
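To make the idea of OCR-tolerant search concrete, here is a minimal sketch of fuzzy term matching based on Levenshtein edit distance. It is an illustration only, not the method of Lu et al. (2008) or the project's actual implementation. The function names `edit_distance` and `fuzzy_find`, the sample tokens, and the distance threshold are all assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_find(term: str, tokens: list[str], max_dist: int = 2) -> list[str]:
    """Return tokens within max_dist edits of term, ignoring case."""
    term = term.lower()
    return [t for t in tokens if edit_distance(term, t.lower()) <= max_dist]


# Hypothetical OCR output: two tokens are plausible misreadings of "Quercus".
ocr_tokens = ["Quercus", "Qucrcus", "Ouercus", "robur", "Quercetum"]
matches = fuzzy_find("Quercus", ocr_tokens, max_dist=1)
print(matches)
```

With a threshold of one edit, the search for "Quercus" also retrieves the misread forms "Qucrcus" and "Ouercus", while leaving out the unrelated "robur" and the distinct term "Quercetum". A production system would more likely use an indexed approach than this linear scan, and would tune the threshold to term length.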

The primary goal is structural recognition, disambiguation and mark-up, from which metadata (taxon names, people's names, locations and dates) will be extracted to build indices and ontologies from the rapidly growing digital content of the BHL. The project is thus compatible with programme scope (b), Enhancement of existing collections. It will also generate approximately 10 volumes of scanned documents, which will be made freely available to the research community.