Article Proceedings Paper

Convolutional neural network based on SMILES representation of compounds for detecting chemical motif

Journal

BMC BIOINFORMATICS
Volume 19, Issue -, Pages -

Publisher

BMC
DOI: 10.1186/s12859-018-2523-5

Keywords

Convolutional neural network; Chemical compound; Virtual screening; SMILES; TOX 21 Challenge

Funding

  1. MEXT-supported Program for the Strategic Research Foundation at Private Universities

Background: Previous studies have suggested deep learning to be a highly effective approach for screening lead compounds for new drugs. Several deep learning models have been developed by addressing the use of various kinds of fingerprints and graph convolution architectures. However, these methods are either advantageous or disadvantageous depending on whether they (1) can distinguish structural differences including chirality of compounds, and (2) can automatically discover effective features.

Results: We developed another deep learning model for compound classification. In this method, we constructed a distributed representation of compounds based on the SMILES notation, which linearly represents a compound structure, and applied the SMILES-based representation to a convolutional neural network (CNN). The use of SMILES allows us to process all types of compounds while incorporating a broad range of structure information, and representation learning by CNN automatically acquires a low-dimensional representation of input features. In a benchmark experiment using the TOX 21 dataset, our method outperformed conventional fingerprint methods, and performed comparably against the winning model of the TOX 21 Challenge. Multivariate analysis confirmed that the chemical space consisting of the features learned by SMILES-based representation learning adequately expressed a richer feature space that enabled the accurate discrimination of compounds. Using motif detection with the learned filters, not only important known structures (motifs) such as protein-binding sites but also structures of unknown functional groups were detected.

Conclusions: The source code of our SMILES-based convolutional neural network software in the deep learning framework Chainer is available at http://www.dna.bio.keio.ac.jp/smiles/, and the dataset used for performance evaluation in this work is available at the same URL.
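The idea of feeding a SMILES string to a convolutional filter and reading a motif off the strongest activation can be illustrated with a small sketch. This is not the authors' Chainer implementation: the one-hot encoding, the simplified tokenizer, and every function name below (`tokenize`, `build_vocab`, `one_hot`, `conv1d`, `make_filter`, `strongest_window`) are assumptions made for illustration, and the filter is hand-built rather than learned.

```python
import re

# Hypothetical, simplified SMILES tokenizer (an assumption, not the paper's):
# bracket atoms such as [C@H] stay whole, so chirality marks are kept as tokens;
# Br and Cl are two-letter atoms; everything else is one character per token.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    return TOKEN_RE.findall(smiles)


def build_vocab(smiles_list):
    """Map every token seen in the corpus to a column index."""
    tokens = sorted({t for s in smiles_list for t in tokenize(s)})
    return {t: i for i, t in enumerate(tokens)}


def one_hot(smiles, vocab, length):
    """Encode a SMILES string as a (length x |vocab|) matrix, zero-padded on the right."""
    matrix = [[0.0] * len(vocab) for _ in range(length)]
    for pos, tok in enumerate(tokenize(smiles)[:length]):
        matrix[pos][vocab[tok]] = 1.0
    return matrix


def conv1d(matrix, kernel):
    """Slide a (width x |vocab|) filter along the sequence axis, no padding."""
    width = len(kernel)
    out = []
    for start in range(len(matrix) - width + 1):
        total = 0.0
        for k in range(width):
            total += sum(x * w for x, w in zip(matrix[start + k], kernel[k]))
        out.append(total)
    return out


def make_filter(tokens, vocab):
    """Hand-built filter that responds to one exact token sequence (stand-in for a learned filter)."""
    return [[1.0 if vocab[t] == j else 0.0 for j in range(len(vocab))] for t in tokens]


def strongest_window(smiles, vocab, kernel, length):
    """Return (start index, tokens, activation) of the window where the filter fires most strongly."""
    acts = conv1d(one_hot(smiles, vocab, length), kernel)
    best = max(range(len(acts)), key=acts.__getitem__)
    return best, tokenize(smiles)[best:best + len(kernel)], acts[best]


corpus = ["CC(=O)O", "c1ccccc1Cl", "C[C@H](N)C(=O)O"]
vocab = build_vocab(corpus)
carbonyl_filter = make_filter(["(", "=", "O", ")"], vocab)
for smiles in corpus:
    print(smiles, strongest_window(smiles, vocab, carbonyl_filter, length=16))
```

In a trained network the filter weights would come from backpropagation, and the substring under the maximal activation is what would be inspected as a candidate motif. The sketch only shows the mechanics of that lookup.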

Recommended

Article Biochemistry & Molecular Biology

A Web Server for Designing Molecular Switches Composed of Two Interacting RNAs

Akito Taneda, Kengo Sato

Summary: The programmability of RNA-RNA interactions has been successfully utilized to design RNA devices that regulate gene expression, but designing structured RNA sequences that meet multiple criteria has become a complex problem. Despite the lack of a web service for multi-objective design of RNA switches utilizing RNA-RNA interactions, a web server based on the MODENA algorithm was developed to design two interacting RNAs in silico.

INTERNATIONAL JOURNAL OF MOLECULAR SCIENCES (2021)

Article Multidisciplinary Sciences

RNA secondary structure prediction using deep learning with thermodynamic integration

Kengo Sato, Manato Akiyama, Yasubumi Sakakibara

Summary: Combining thermodynamic information with deep learning can improve the robustness of RNA secondary structure prediction compared to existing algorithms.

NATURE COMMUNICATIONS (2021)

Article Multidisciplinary Sciences

Adipose-derived mesenchymal stem cells differentiate into heterogeneous cancer-associated fibroblasts in a stroma-rich xenograft model

Yoshihiro Miyazaki, Tatsuya Oda, Yuki Inagaki, Hiroko Kushige, Yutaka Saito, Nobuhito Mori, Yuzo Takayama, Yutaro Kumagai, Toutai Mitsuyama, Yasuyuki S. Kida

Summary: Cancer-associated fibroblasts (CAFs) play important roles in tumor progression and drug resistance in pancreatic ductal adenocarcinoma (PDAC), but existing mouse models have limitations in reproducing the characteristics of clinical CAFs. Researchers have developed a new human cell-derived stroma-rich CDX model, which successfully recapitulates the clinical features of pancreatic cancer by co-transplanting human adipose-derived mesenchymal stem cells and a PDAC cell line into mice.

SCIENTIFIC REPORTS (2021)

Article Multidisciplinary Sciences

Rational thermostabilisation of four-helix bundle dimeric de novo proteins

Shin Irumagawa, Kaito Kobayashi, Yutaka Saito, Takeshi Miyata, Mitsuo Umetsu, Tomoshi Kameda, Ryoichi Arai

Summary: This study successfully predicted and experimentally confirmed new mutations that improve protein stability through in silico saturation mutagenesis and molecular dynamics simulation. The double mutant N22A/H86K showed significant improvement, and these thermostable mutants have potential significance for constructing supramolecular protein complexes.

SCIENTIFIC REPORTS (2021)

Article Biochemical Research Methods

Evotuning protocols for Transformer-based variant effect prediction on multi-domain proteins

Hideki Yamaguchi, Yutaka Saito

Summary: Accurate variant effect prediction plays a significant role in protein engineering. Recent machine learning approaches focus on representation learning to generate feature vectors from unlabeled sequences. This article proposes DA-aware evolutionary fine-tuning protocols for Transformer-based variant effect prediction, achieving better performances than previous methods and incorporating structural information without direct supervision.

BRIEFINGS IN BIOINFORMATICS (2021)

Article Biotechnology & Applied Microbiology

Comparative analysis of the relationship between translation efficiency and sequence features of endogenous proteins in multiple organisms

Naoyuki Tajima, Toshitaka Kumagai, Yutaka Saito, Tomoshi Kameda

Summary: The relationship between translation efficiency and sequence features varies across organisms, reflecting their taxonomy. The codon adaptation index shows high correlation in all analyzed organisms.

GENOMICS (2021)

Article Mathematical & Computational Biology

Machine learning approach for discrimination of genotypes based on bright-field cellular images

Godai Suzuki, Yutaka Saito, Motoaki Seki, Daniel Evans-Yamamoto, Mikiko Negishi, Kentaro Kakoi, Hiroki Kawai, Christian R. Landry, Nozomu Yachie, Toutai Mitsuyama

Summary: Morphological profiling, combining optical microscopes and machine vision technologies, has been successfully applied in high-throughput phenotyping. The study demonstrates the potential to discriminate single-gene mutant cells from wild-type cells based on bright-field images. Machine learning was used to construct a model that successfully identified mutant cells.

NPJ SYSTEMS BIOLOGY AND APPLICATIONS (2021)

Article Chemistry, Physical

Machine-Learning-Guided Library Design Cycle for Directed Evolution of Enzymes: The Effects of Training Data Composition on Sequence Space Exploration

Yutaka Saito, Misaki Oikawa, Takumi Sato, Hikaru Nakazawa, Tomoyuki Ito, Tomoshi Kameda, Koji Tsuda, Mitsuo Umetsu

Summary: The study shows that machine learning is a useful tool in designing proteins with desired functions in protein engineering. Depending on the presence or absence of highly positive variants in the training data, machine learning-guided directed evolution can lead to improved variants in different regions of sequence space.

ACS CATALYSIS (2021)

Article Biology

A Max-Margin Model for Predicting Residue-Base Contacts in Protein-RNA Interactions

Shunya Kashiwagi, Kengo Sato, Yasubumi Sakakibara

Summary: Protein-RNA interactions are crucial for biological processes, and various computational methods have been developed to predict these interactions. However, accurately predicting residue-base contacts in PRIs remains a challenge. The proposed method using only sequence and predicted structural information shows promising results, comparable to methods based on known binding data.

LIFE-BASEL (2021)

Article Biochemical Research Methods

Integer programming for selecting set of informative markers in paternity inference

Soichiro Nishiyama, Kengo Sato, Ryutaro Tao

Summary: This study presents a novel approach for selecting informative markers based on binary integer programming. By combining with targeted SNP genotyping, this method allows for flexible analysis and has practical applications in large-scale problems in breeding and ecological research.

BMC BIOINFORMATICS (2022)

Article Medicine, Research & Experimental

Selection of target-binding proteins from the information of weakly enriched phage display libraries by deep sequencing and machine learning

Tomoyuki Ito, Thuy Duong Nguyen, Yutaka Saito, Yoichi Kurumida, Hikaru Nakazawa, Sakiya Kawada, Hafumi Nishi, Koji Tsuda, Tomoshi Kameda, Mitsuo Umetsu

Summary: The aim of this study was to design an improved library based on information from a weakly enriched library, as bias during biopanning often leads to the enrichment of undesired variants. Deep sequencing of previous biopanning results revealed that weak enrichment was partially due to biases during phage infection and amplification steps. Machine learning analysis of the deep sequencing data identified distinct sequence patterns, which were used to design phage libraries. Four improved variants with specific target affinity were identified using biopanning.

Article Biochemical Research Methods

Engineering the Substrate Specificity of Toluene Degrading Enzyme XylM Using Biosensor XylS and Machine Learning

Yuki Ogawa, Yutaka Saito, Hideki Yamaguchi, Yohei Katsuyama, Yasuo Ohnishi

Summary: Enzyme engineering using machine learning has made significant progress in recent years. This study explores the application of biosensor-based enzyme engineering method in machine learning. By evaluating the productivity of XylM variants using a fluorescence intensity-based biosensor, training data for machine learning was obtained and a XylM variant with 15 times higher productivity than wild-type XylM was successfully obtained. These findings demonstrate the quantitative and high-throughput capability of biosensors in indirect enzyme activity evaluation, expanding the versatility of machine learning in enzyme engineering.

ACS SYNTHETIC BIOLOGY (2023)

Article Genetics & Heredity

Direct Inference of Base-Pairing Probabilities with Neural Networks Improves Prediction of RNA Secondary Structures with Pseudoknots

Manato Akiyama, Yasubumi Sakakibara, Kengo Sato

Summary: This study proposes a new algorithm for directly inferring base-pairing probabilities of RNA secondary structures using neural networks, independent of their architecture. The algorithm outperforms existing methods in prediction accuracy, as demonstrated by benchmarks with and without pseudoknots.

GENES (2022)

Review Biochemical Research Methods

Recent trends in RNA informatics: a review of machine learning and deep learning for RNA secondary structure prediction and RNA drug discovery

Kengo Sato, Michiaki Hamada

Summary: Computational analysis of RNA sequences plays a crucial role in RNA biology. In recent years, the incorporation of artificial intelligence and machine learning techniques into RNA sequence analysis has gained significant traction. Machine learning-based approaches have shown remarkable advancements, enhancing the precision of sequence analysis related to RNA secondary structures. Furthermore, artificial intelligence and machine learning innovations are also applied in the analysis of RNA-small molecule interactions, RNA drug discovery, and the design of RNA aptamers.

BRIEFINGS IN BIOINFORMATICS (2023)

Article Chemistry, Multidisciplinary

AI and computational chemistry-accelerated development of an alotaketal analogue with conventional PKC selectivity

Jumpei Maki, Asami Oshimura, Chihiro Tsukano, Ryo C. Yanagita, Yutaka Saito, Yasubumi Sakakibara, Kazuhiro Irie

Summary: Protein kinase C (PKC) family is a potential target for treating cancer, Alzheimer's disease, and HIV infection. By screening compounds and designing analogues, we discovered a PKC ligand with remarkable isozyme selectivity.

CHEMICAL COMMUNICATIONS (2022)
