ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science

Journal

JOURNAL OF CHEMICAL INFORMATION AND MODELING
Volume 61, Issue 9, Pages 4280-4289

Publisher

AMER CHEMICAL SOC
DOI: 10.1021/acs.jcim.1c00446

Funding

  1. EPSRC Centre for Doctoral Training in Computational Methods for Materials Science [EP/L015552/1]
  2. BASF/Royal Academy of Engineering Research Chair in Data-Driven Molecular Engineering of Functional Materials
  3. Science and Technology Facilities Council (STFC) via the ISIS Neutron and Muon Source

The article introduces a framework for automatically populating ontologies, enabling the direct extraction of a larger group of properties linked by a semantic network. Exploiting data-rich sources, it presents a new model concept for extracting chemical and physical properties. Using automatically generated parsers and forward-looking interdependency resolution, the authors illustrate the power of the approach through the automatic extraction of a crystallographic hierarchy.
The ever-growing abundance of data found in heterogeneous sources, such as scientific publications, has forced the development of automated techniques for data extraction. While past work in the physical sciences has focused on the precise extraction of individual properties, attention has recently turned to the extraction of higher-level relationships. Here, we present a framework for the automated population of ontologies, that is, the direct extraction of a larger group of properties linked by a semantic network. We exploit data-rich sources, such as tables within documents, and present a new model concept that enables data extraction for chemical and physical properties with the ability to organize hierarchical data as nested information. Combining these capabilities with automatically generated parsers for data extraction and forward-looking interdependency resolution, we illustrate the power of our approach via the automatic extraction of a crystallographic hierarchy of information. This includes 18 interrelated submodels of nested data, extracted from an evaluation set of scientific articles drawn from 26 different journals, yielding an overall precision of 92.2%. Our method and associated toolkit, ChemDataExtractor 2.0, offer a key step toward the seamless integration of primary literature sources into a data-driven scientific framework.
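
As an illustration only, the idea of organizing hierarchical data as nested information can be sketched in plain Python. Everything below is an assumption made for the example: the class names, field names, the `build_record` helper, and the sample values are hypothetical, and none of it is the ChemDataExtractor 2.0 API or data from the paper. The `precision` helper simply applies the standard definition, true positives divided by all extracted records.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional, Tuple


@dataclass
class LatticeParameter:
    """One cell length, e.g. a = 5.0 angstrom (hypothetical submodel)."""
    axis: str
    value: float
    units: str


@dataclass
class UnitCell:
    """Groups lattice parameters and a space group (hypothetical submodel)."""
    lengths: List[LatticeParameter] = field(default_factory=list)
    space_group: Optional[str] = None


@dataclass
class CrystalRecord:
    """Top-level record that nests the unit cell under a compound name."""
    compound: str
    cell: UnitCell


def build_record(compound: str, rows: List[Tuple[str, str, str]]) -> CrystalRecord:
    """Fold flat (name, value, units) table rows into one nested record."""
    cell = UnitCell()
    for name, value, units in rows:
        if name in ("a", "b", "c"):
            cell.lengths.append(LatticeParameter(name, float(value), units))
        elif name == "space group":
            cell.space_group = value
    return CrystalRecord(compound=compound, cell=cell)


def precision(true_positives: int, false_positives: int) -> float:
    """Standard precision: correct extractions over all extractions."""
    return true_positives / (true_positives + false_positives)


# Made-up table rows for a made-up compound; not taken from the paper.
rows = [
    ("a", "5.0", "angstrom"),
    ("b", "6.0", "angstrom"),
    ("c", "7.0", "angstrom"),
    ("space group", "P2_1/c", ""),
]
record = build_record("ExampleOxide", rows)
nested = asdict(record)  # nested dict mirroring the model hierarchy
```

Here `nested["cell"]["lengths"]` holds the three lattice parameters and `nested["cell"]["space_group"]` holds the space group, so one record carries the whole hierarchy rather than four unrelated property values.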
