
ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science

JOURNAL OF CHEMICAL INFORMATION AND MODELING (2021)

Journal

JOURNAL OF CHEMICAL INFORMATION AND MODELING

Volume 61, Issue 9, Pages 4280-4289

Publisher

AMER CHEMICAL SOC

DOI: 10.1021/acs.jcim.1c00446

Funding

EPSRC Centre for Doctoral Training in Computational Methods for Materials Science [EP/L015552/1]
BASF/Royal Academy of Engineering Research Chair in Data-Driven Molecular Engineering of Functional Materials
Science and Technology Facilities Council (STFC) via the ISIS Neutron and Muon Source

Automated Summary

The article introduces a framework for automatically populating ontologies, enabling the direct extraction of a larger group of properties linked by a semantic network. It exploits data-rich sources and presents a new model concept for extracting chemical and physical properties. Using automatically generated parsers and forward-looking interdependency resolution, the authors illustrate the power of the approach through the automatic extraction of a crystallographic hierarchy.

Abstract

The ever-growing abundance of data found in heterogeneous sources, such as scientific publications, has forced the development of automated techniques for data extraction. In the physical sciences, the focus has historically been on the precise extraction of individual properties, but attention has recently turned to the extraction of higher-level relationships. Here, we present a framework for the automated population of ontologies, that is, the direct extraction of a larger group of properties linked by a semantic network. We exploit data-rich sources, such as tables within documents, and present a new model concept that enables data extraction for chemical and physical properties with the ability to organize hierarchical data as nested information. Combining these capabilities with automatically generated parsers for data extraction and forward-looking interdependency resolution, we illustrate the power of our approach via the automatic extraction of a crystallographic hierarchy of information. This hierarchy includes 18 interrelated submodels of nested data, extracted from an evaluation set of scientific articles, yielding an overall precision of 92.2% across 26 different journals. Our method and associated toolkit, ChemDataExtractor 2.0, offer a key step toward the seamless integration of primary literature sources into a data-driven scientific framework.
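
The idea of organizing hierarchical data as nested information can be pictured with a small sketch. The code below is not ChemDataExtractor's API; it uses only Python's standard `dataclasses` module, and every class, field, and column name in it (`Compound`, `LatticeParameter`, `CrystalRecord`, `from_table_row`, the `a`/`b`/`c` columns) is a hypothetical stand-in chosen for illustration. It shows how one flat table row might be turned into a parent record that nests smaller submodels.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Compound:
    """Hypothetical submodel: the material a record refers to."""
    name: str


@dataclass
class LatticeParameter:
    """Hypothetical submodel: one cell length with its unit."""
    label: str
    value: float
    unit: str = "angstrom"


@dataclass
class CrystalRecord:
    """Hypothetical parent model that nests the submodels above."""
    compound: Compound
    space_group: Optional[str] = None
    lattice: list = field(default_factory=list)

    def is_complete(self) -> bool:
        # A record counts as complete only if every nested part is present.
        return bool(self.compound.name and self.space_group and self.lattice)


def from_table_row(row: dict) -> CrystalRecord:
    """Build a nested record from one flat table row (column names assumed)."""
    lattice = [
        LatticeParameter(label=key, value=float(row[key]))
        for key in ("a", "b", "c")
        if key in row
    ]
    return CrystalRecord(
        compound=Compound(name=row.get("compound", "")),
        space_group=row.get("space_group"),
        lattice=lattice,
    )


# Illustrative row only; it is not taken from the paper's evaluation set.
record = from_table_row(
    {"compound": "NaCl", "space_group": "Fm-3m", "a": "5.64", "b": "5.64", "c": "5.64"}
)
nested = asdict(record)  # plain nested dicts, ready to serialize
```

Here `asdict` walks the nested dataclasses, so the parent record and its submodels come out as a single nested dictionary. A real ontology of 18 interrelated submodels would need considerably more machinery, such as the parsers and interdependency resolution the abstract mentions, which this sketch does not attempt.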
