From documents to datasets: A MediaWiki-based method of annotating and extracting species observations in century-old field notebooks

ZOOKEYS (2012)

Journal

ZOOKEYS

Issue 209, Pages 235-253

Publisher

PENSOFT PUBL

DOI: 10.3897/zookeys.209.3247

Keywords

Field notes; notebooks; crowd sourcing; digitization; biodiversity; transcription; text-mining; Darwin Core; Junius Henderson; annotation; taxonomic referencing; natural history; Wikisource; Colorado; species occurrence records

Funding

Ben Brumfield
National Science Foundation, Directorate for Biological Sciences, Division of Biological Infrastructure [1062193, 1062148, 1062271, 1062200]

Abstract

Part diary, part scientific record, biological field notebooks often contain details necessary to understanding the location and environmental conditions during collecting events. Despite their clear value for (and recent use in) global change studies, the text-mining outputs from field notebooks have been idiosyncratic to specific research projects, and impossible to discover or re-use. Best practices and workflows for digitization, transcription, extraction, and integration with other sources are nascent or non-existent. In this paper, we demonstrate a workflow to generate structured outputs while also maintaining links to the original texts. The first step in this workflow was to place the already digitized and transcribed field notebooks of Junius Henderson, founder of the University of Colorado Museum of Natural History, on Wikisource, an open text transcription platform. Next, we created Wikisource templates to document places, dates, and taxa to facilitate annotation and wiki-linking. We then requested help from the public, through social media tools, to take advantage of volunteer efforts and energy. After three notebooks were fully annotated, content was converted into XML and annotations were extracted and cross-walked into Darwin Core compliant record sets. Finally, these record sets were vetted, to provide valid taxon names, via a process we call taxonomic referencing. The result is identification and mobilization of 1,068 observations from three of Henderson's thirteen notebooks and a publishable Darwin Core record set for use in other analyses. Although challenges remain, this work demonstrates a feasible approach to unlock observations from field notebooks that enhances their discovery and interoperability without losing the narrative context from which those observations are drawn.
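The extraction and cross-walk step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it skips the XML conversion and reads template annotations straight from wikitext with a regular expression. The template names (`taxon`, `place`, `dated`), the function `extract_records`, and the sample sentence are all hypothetical and invented for illustration; only the Darwin Core term names used as dictionary keys come from the standard.

```python
import re

# Assumed (hypothetical) annotation syntax: {{taxon|Name}}, {{place|Name}}, {{dated|YYYY-MM-DD}}
TEMPLATE = re.compile(r"\{\{(taxon|place|dated)\|([^{}|]+)\}\}")


def extract_records(page_title, wikitext):
    """Return one Darwin Core-style dict per taxon annotation on a page.

    The most recent date and place annotations seen before a taxon
    annotation are attached to that taxon's record.
    """
    records = []
    current_date = None
    current_place = None
    for match in TEMPLATE.finditer(wikitext):
        kind = match.group(1)
        value = match.group(2).strip()
        if kind == "dated":
            current_date = value
        elif kind == "place":
            current_place = value
        else:  # kind == "taxon"
            records.append({
                "scientificName": value,
                "eventDate": current_date,
                "locality": current_place,
                "basisOfRecord": "HumanObservation",
                # Link back to the source page so narrative context is kept.
                "references": page_title,
            })
    return records


# Invented sample text, not a quotation from the notebooks.
sample = (
    "{{dated|1905-07-14}} Walked to {{place|Boulder Creek}}. "
    "Saw {{taxon|Cinclus mexicanus}} and {{taxon|Oreohelix strigosa}}."
)
records = extract_records("Example notebook page", sample)
```

Each record keeps a pointer to the page it came from, which mirrors the paper's stated goal of producing structured output without losing the link to the original text. A real workflow would still need the taxonomic referencing step to validate the names.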
