4.7 Article

Feature Selection for Polymer Informatics: Evaluating Scalability and Robustness of the FS4RVDD Algorithm Using Synthetic Polydisperse Data Sets

Publisher

AMER CHEMICAL SOC
DOI: 10.1021/acs.jcim.9b00867

Funding

  1. Argentinean National Council of Scientific and Technological Research (CONICET) [PIP 112-2017-0100829]
  2. Universidad Nacional del Sur (UNS), Bahia Blanca, Argentina [PGI 24/N042, PGI 24/ZM17]

The feature selection (FS) process is a key step in Quantitative Structure-Property Relationship (QSPR) modeling of physicochemical properties in cheminformatics. In particular, inferring QSPR models for polymeric material properties is a complex problem because of the uncertainty introduced by the polydispersity of these materials. The main challenge is how to capture the polydispersity information from the molecular weight distribution (MWD) curve to achieve a more effective computational representation of polymeric materials. To date, most existing QSPR techniques use only a single molecule to represent each of these materials, so polydispersity is not considered. Consequently, the QSPR models obtained by these approaches are oversimplified. For this reason, in a previous work we introduced a new FS algorithm called Feature Selection for Random Variables with Discrete Distribution (FS4RV(DD)), which can deal with polydisperse data. In the present paper, we evaluate both the scalability and the robustness of the FS4RV(DD) algorithm. To this end, we generated synthetic data by varying and combining different parameters: the size of the database, the cardinality of the selected feature subsets, the presence of noise in the data, and the type of correlation (linear and nonlinear). Moreover, the performance of FS4RV(DD) was compared with that of traditional FS techniques applied to different simplified representations of polymeric materials. The results show that the FS4RV(DD) algorithm outperformed the traditional FS methods in all proposed scenarios, which suggests the need for an algorithm such as FS4RV(DD) to deal with the uncertainty that polydispersity introduces in human-made polymers.
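To make the experimental setup more concrete, the sketch below shows one way such synthetic polydisperse data could be generated. It is not the authors' generator: the function names, the Dirichlet-distributed weights standing in for the MWD, the use of the expected descriptor value as the signal, and the particular nonlinear transform are all assumptions made for illustration. The knobs correspond to the parameters named in the abstract: database size, cardinality of the relevant feature subset, noise, and linear versus nonlinear correlation.

```python
import numpy as np


def make_polydisperse_dataset(n_materials=200, n_features=20, n_relevant=3,
                              n_chains=5, noise_std=0.0, nonlinear=False,
                              seed=0):
    """Hypothetical generator: each descriptor is a discrete random variable."""
    rng = np.random.default_rng(seed)
    # values[i, k, j]: descriptor j computed on the k-th chain of material i.
    values = rng.normal(size=(n_materials, n_chains, n_features))
    # weights[i, k]: probability of chain k in material i
    # (an assumed discrete stand-in for the MWD curve; each row sums to 1).
    weights = rng.dirichlet(np.ones(n_chains), size=n_materials)
    # Indices of the features that actually drive the target property.
    relevant = np.sort(rng.choice(n_features, size=n_relevant, replace=False))
    # Expected value of every descriptor under each material's distribution.
    expected = np.einsum("ik,ikj->ij", weights, values)
    signal = expected[:, relevant].sum(axis=1)
    if nonlinear:
        # One arbitrary nonlinear dependence, chosen only for illustration.
        signal = np.sin(signal) + signal ** 2
    target = signal + rng.normal(scale=noise_std, size=n_materials)
    return values, weights, target, relevant


def single_molecule_view(values, weights):
    """Simplified representation: keep only the most probable chain."""
    most_probable = weights.argmax(axis=1)
    return values[np.arange(values.shape[0]), most_probable, :]
```

A traditional FS method would receive the flat matrix returned by `single_molecule_view`, whereas a polydispersity-aware method would receive `values` and `weights` together. Comparing how often each recovers the indices in `relevant` gives a simple robustness check under these assumptions.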

Recommended

Review Biochemical Research Methods

Using molecular embeddings in QSAR modeling: does it make a difference?

Maria Virginia Sabando, Ignacio Ponzoni, Evangelos E. Milios, Axel J. Soto

Summary: With the consolidation of deep learning in drug discovery, several novel algorithms for learning molecular representations have been proposed. However, comparing different molecular embeddings and traditional representations is not straightforward, hindering the process of choosing suitable representations for QSAR modeling. The study conducted experiments comparing different embedding techniques and found that the predictive performance using molecular embeddings did not significantly surpass that of traditional representations.

BRIEFINGS IN BIOINFORMATICS (2022)

Article Materials Science, Multidisciplinary

Polymer informatics: Expert-in-the-loop in QSPR modeling of refractive index

Santiago A. Schustik, Fiorella Cravero, Ignacio Ponzoni, Monica F. Diaz

Summary: Refractive index is a crucial property for the design of new materials, and machine learning algorithms have been successfully applied in modeling, with the expert-in-the-loop approach showing promise in improving interpretability and generalizability of the models.

COMPUTATIONAL MATERIALS SCIENCE (2021)

Article Polymer Science

PolyMaS: A new software to generate high molecular weight polymer macromolecules from repeating structural units

Santiago A. Schustik, Fiorella Cravero, M. Jimena Martinez, Ignacio Ponzoni, Monica F. Diaz

Summary: The PolyMaS software utilizes SMILES codes to generate linear macromolecules without limiting their length and molar mass, and can adjust the length of the polymer as needed.

POLIMERY (2021)

Article Mathematics, Applied

Filtering non-balanced data using an evolutionary approach

Jessica A. Carballido, Ignacio Ponzoni, Rocio L. Cecchini

Summary: This article presents an evolutionary method called PreCLAS for handling matrices that cannot be analyzed using conventional clustering, regression or classification methods in big data research. The method significantly reduces the number of rows in the matrix and intelligently performs unsupervised row selection, improving the effectiveness of classification and clustering methods.

LOGIC JOURNAL OF THE IGPL (2023)

Article Environmental Sciences

Vehicular fleet characterisation and assessment of the on-road mobile source emission inventory of a Latin American intermediate city

Yamila S. Grassi, Nelida B. Brignole, Monica F. Diaz

Summary: The paper provides a comprehensive analysis of the vehicular fleet and mobile source emissions in Bahia Blanca, Argentina, in 2018. Motorcycles were identified as the main source of CO, NMVOC, CO2 and CH4, while light commercial vehicles emitted the most NOx. Despite the growth of the vehicular fleet, emissions in 2018 were lower than in 2013, which is attributed to the incorporation of more efficient emission control technology. However, this improvement resulted in increased GHG emissions, presenting a continued challenge in the area.

SCIENCE OF THE TOTAL ENVIRONMENT (2021)

Article Chemistry, Physical

Polymer informatics for QSPR prediction of tensile mechanical properties. Case study: Strength at break

Fiorella Cravero, Monica F. Diaz, Ignacio Ponzoni

Summary: This paper introduces an artificial intelligence-based method for predicting the mechanical properties of the tensile test. By using machine learning tools, visual analytics methods, and expert-in-the-loop strategies, a QSPR model composed of five molecular descriptors is proposed, achieving a high correlation coefficient.

JOURNAL OF CHEMICAL PHYSICS (2022)

Review Food Science & Technology

Could QSOR Modelling and Machine Learning Techniques Be Useful to Predict Wine Aroma?

Virginia Cardoso Schwindt, Mauricio M. Coletto, Monica F. Diaz, Ignacio Ponzoni

Summary: Food informatics is playing a significant role in improving the quality and efficiency of the food industry, particularly in the sensory analysis of wines. Machine learning models have been developed to predict wine-related characteristics, but accurate and sufficient data is still needed for reliable predictions. The use of quantitative structure-odour relationship (QSOR) models shows promise in quantitatively predicting wine sensory analysis.

FOOD AND BIOPROCESS TECHNOLOGY (2023)

Article Chemistry, Medicinal

Multitask Deep Neural Networks for Ames Mutagenicity Prediction

Maria Jimena Martinez, Maria Virginia Sabando, Axel J. Soto, Carlos Roca, Carlos Requena-Triguero, Nuria E. Campillo, Juan A. Paez, Ignacio Ponzoni

Summary: The Ames mutagenicity test is widely used to estimate the mutagenic potential of drug candidates. However, most existing in silico models for predicting mutagenicity do not consider the test results of individual experiments conducted for each strain. In this study, we propose a novel neural-based QSAR model that leverages experimental results from different strains involved in the Ames test using multitask learning. Our model outperforms single-task modeling strategies and ensemble models built from individual strains.

JOURNAL OF CHEMICAL INFORMATION AND MODELING (2022)

Article Polymer Science

Design of New Dispersants Using Machine Learning and Visual Analytics

Maria Jimena Martinez, Roi Naveiro, Axel J. Soto, Pablo Talavante, Shin-Ho Kim Lee, Ramon Gomez Arrayas, Mario Franco, Pablo Mauleon, Hector Lozano Ordonez, Guillermo Revilla Lopez, Marco Bernabei, Nuria E. Campillo, Ignacio Ponzoni

Summary: Artificial intelligence (AI) is revolutionizing the discovery of new materials, particularly in the field of virtual screening of chemical libraries. This study developed computational models that can predict the dispersancy efficiency of oil and lubricant additives, a critical property in their design. The proposed models combined machine learning techniques with visual analytics strategies in an interactive tool, aiding domain experts in decision-making. The best-performing model achieved a mean absolute error of 5.50±0.34 and a root mean square error of 7.56±0.47, demonstrating its effectiveness in predicting dispersancy efficiency.

POLYMERS (2023)

Article Genetics & Heredity

Papillary Thyroid Carcinoma: A thorough Bioinformatic Analysis of Gene Expression and Clinical Data

Ivan Petrini, Rocio L. Cecchini, Marilina Mascaro, Ignacio Ponzoni, Jessica A. Carballido

Summary: The likelihood of being diagnosed with thyroid cancer has increased in recent years. The aim of this study is to identify potential genes relevant to Papillary Thyroid Carcinoma (PTC) through bioinformatic analysis. Four genes, PTGFR, ZMAT3, GABRB2, and DPP6, were found to be highly relevant and worthy of further investigation.

Article Chemistry, Multidisciplinary

Explainable artificial intelligence: A taxonomy and guidelines for its application to drug discovery

Ignacio Ponzoni, Juan Antonio Paez Prosper, Nuria E. Campillo

Summary: Artificial intelligence (AI) is increasingly impacting drug discovery. However, in order to be accepted by the medicinal chemistry community, it is important for AI models to be able to explain their predictions in a trustworthy manner. Therefore, research and development of explainable artificial intelligence (XAI) methods have become crucial. This article provides a comprehensive literature review on explanation methodologies for AI models in the field of drug discovery, including a new taxonomy of XAI methods, and introduces visualization strategies for XAI in the chemical domain.

WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL MOLECULAR SCIENCE (2023)

Proceedings Paper Computer Science, Interdisciplinary Applications

Statistical Learning Analysis of Thyroid Cancer Microarray Data

Ivan Petrini, Rocio L. Cecchini, Marilina Mascaro, Ignacio Ponzoni, Jessica A. Carballido

Summary: This article presents a comprehensive and comparative analysis of thyroid cancer datasets, including stages for feature selection, hypothesis testing, and classification. The results suggest that some genes, especially the HINT3 gene, are worth further investigation.

BIOINFORMATICS AND BIOMEDICAL ENGINEERING, PT II (2022)
