Identifying taxonomic names
How do you identify taxonomic names in text? A recent paper by Guido Sautter and colleagues (Sautter et al. 2006) provides an overview of some of the techniques that can be employed. Here we offer a simpler analysis, based on the observation that taxonomic names such as Pica pica are latinate. How different are taxonomic names from the English text that surrounds them?
We compared a long list of taxonomic names extracted from ITIS, the Integrated Taxonomic Information System, with an English-language text of similar length, the King James Version of the Bible obtained from Gutenberg.org. The results of our more detailed analyses may be presented elsewhere, but for now:
- The distribution of word lengths for the taxonomic names looks roughly normal, centred on an average length of 9 characters. The longest words were Archaeosphaerodiniopsidaceae (28 characters) and Heteroptera-enicocephalomorpha (30 characters).
- The peak word length for the Bible is 3 letters. The distribution seems to have a longer tail, but the longest words observed were 18 characters (two occurrences). The mean is 4.08 characters, less than half that of the taxonomic names.
- While 'a' and 'i' are the most common letters in the taxonomic names, 'e' is the most common letter in the Bible text.
- The letters 'w' and 'y' are much more common at the start of words in the Bible, being nearly ten times more likely to appear there than in the taxonomic names. At the end of words, 'a', 'i', 'm', 's' and 'x' are much more common in the taxonomic names, whilst other letters such as 'd' and 'e' are much more common in English. For example, 'd' ends 14% of words in the Bible but less than one hundredth of a percent of the taxonomic names, making it nearly 2000 times more common in the Bible.
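The measurements listed above can be approximated with a few counters. The following is a minimal sketch and not the code used for the analysis: the function name `word_stats` and the tiny sample lists are assumptions made for illustration, and stand in for the ITIS list and the Bible text.

```python
from collections import Counter


def word_stats(words):
    """Summarise word lengths and letter positions for a list of words."""
    words = [w.lower() for w in words if w]
    total = len(words)
    lengths = Counter(len(w) for w in words)
    letters = Counter(ch for w in words for ch in w if ch.isalpha())
    first = Counter(w[0] for w in words)
    last = Counter(w[-1] for w in words)
    return {
        "mean_length": sum(len(w) for w in words) / total,
        "peak_length": lengths.most_common(1)[0][0],
        "top_letter": letters.most_common(1)[0][0],
        # Share of words that begin or end with each letter.
        "first_share": {ch: n / total for ch, n in first.items()},
        "last_share": {ch: n / total for ch, n in last.items()},
    }


# Tiny illustrative samples, NOT the ITIS or Bible data used in the analysis.
taxon_words = ["Pica", "pica", "Archaeosphaerodiniopsidaceae", "Heteroptera"]
english_words = "in the beginning god created the heaven and the earth".split()

taxa = word_stats(taxon_words)
english = word_stats(english_words)
```

Comparing `taxa["last_share"]` with `english["last_share"]` letter by letter gives the kind of ratio quoted for 'd' above, although samples this small say nothing about the real figures.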
While you can argue that you might get different results if you compared a list of taxonomic names to other collections of English words, this comparison serves to illustrate that taxonomic names are potentially quite different from the English text around them, not least because they are frequently typeset in italics.
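One way such differences might be put to work is to score a word by how much more likely its final letter is in a list of taxonomic names than in English text. The sketch below is an assumption-laden illustration, not the method of Sautter et al. and not something we have evaluated: the names `last_letter_model` and `latinate_score`, the add-one smoothing, and the toy word lists are all hypothetical.

```python
import math
import string
from collections import Counter


def last_letter_model(words):
    """Smoothed probability of each ASCII letter appearing at the end of a word."""
    counts = Counter(w[-1].lower() for w in words if w)
    # Add-one smoothing so that unseen letters never get probability zero.
    total = sum(counts[ch] for ch in string.ascii_lowercase) + len(string.ascii_lowercase)
    return {ch: (counts[ch] + 1) / total for ch in string.ascii_lowercase}


def latinate_score(word, taxon_model, english_model):
    """Log-odds for the word's final letter; positive means more name-like."""
    if not word:
        return 0.0
    ch = word[-1].lower()
    if ch not in taxon_model:
        return 0.0  # no evidence either way for letters outside a-z
    return math.log(taxon_model[ch] / english_model[ch])


# Toy training lists, NOT the ITIS or Bible data.
taxon_model = last_letter_model(["Pica", "pica", "Heteroptera", "Archaeosphaerodiniopsidaceae"])
english_model = last_letter_model("and god said let there be light".split())

name_score = latinate_score("Formica", taxon_model, english_model)
word_score = latinate_score("created", taxon_model, english_model)
```

A single-letter feature like this is far too weak on its own; it is shown only to make the idea concrete, and a practical detector would presumably combine it with word length, initial letters, and typographic cues such as italics.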
Sautter, G., Böhm, K., & Agosti, D. (2006). A combining approach to find all taxon names (FAT) in legacy biosystematics literature. Biodiversity Informatics.