4.7 Article

A Random Forest Sub-Golgi Protein Classifier Optimized via Dipeptide and Amino Acid Composition Features

Publisher

FRONTIERS MEDIA SA
DOI: 10.3389/fbioe.2019.00215

Keywords

random forests; sub-Golgi protein classifier; ANOVA feature selection; split amino acid composition; k-gap dipeptide; synthetic minority over-sampling

Funding

  1. National Key R&D Program of China [2018YFC0910405]
  2. Natural Science Foundation of China [61922020, 61771331]
  3. Scientific Research Foundation in Shenzhen [JCYJ20180306172207178]

To gain insight into the malfunction of the Golgi apparatus and its relationship to various genetic and neurodegenerative diseases, the identification of sub-Golgi proteins, both cis-Golgi and trans-Golgi proteins, is of great significance. In this study, a state-of-the-art random forest sub-Golgi protein classifier, rfGPT, was developed. The rfGPT used 2-gap dipeptide and split amino acid composition for the feature vectors and was combined with the synthetic minority over-sampling technique (SMOTE) and an analysis of variance (ANOVA) feature selection method. The rfGPT was trained on a sub-Golgi protein sequence data set (137 sequences) with sequence identity less than 25%. For the optimal rfGPT classifier with 93 features, the accuracy (ACC) was 90.5%, the Matthews correlation coefficient (MCC) was 0.811, the sensitivity (Sn) was 92.6%, and the specificity (Sp) was 88.4%. The independent testing scores for the rfGPT were ACC = 90.6%, MCC = 0.696, Sn = 96.1%, and Sp = 69.2%. Although the independent testing accuracy was 4.4% lower than that of the best reported sub-Golgi classifier trained on a data set with 40% sequence identity (304 sequences), the rfGPT is currently the top sub-Golgi protein predictor that uses feature vectors without any position-specific scoring matrix or its derivative features. The rfGPT is therefore a more practical tool, because it requires no sequence alignment against tens of millions of protein sequences. To date, among Golgi classifiers optimized by training on smaller benchmark data sets, the rfGPT has the best independent testing scores. Feature importance analysis shows that the composition of non-polar and aliphatic residues, the (aromatic residues) + (non-polar, aliphatic residues) dipeptide, and the composition of aromatic residues between the NH2-terminal and COOH-terminal of protein sequences are the top three biological features for distinguishing the sub-Golgi proteins.
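The two sequence encodings and the four scores named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 25-residue terminal segment lengths, and the choice of positive class are assumptions made for illustration, and the SMOTE, ANOVA selection, and random forest steps are omitted. The confusion-matrix counts in the demo are hypothetical, chosen only because they reproduce the reported independent testing scores; the abstract does not give the actual counts.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kgap_dipeptide_composition(seq, k=2):
    """Frequencies of residue pairs separated by k intervening residues (400 values)."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in pairs]


def amino_acid_composition(segment):
    """Frequencies of the 20 standard residues in one segment."""
    n = sum(segment.count(a) for a in AMINO_ACIDS)
    return [segment.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def split_amino_acid_composition(seq, n_term=25, c_term=25):
    """Composition of N-terminal, middle, and C-terminal segments (60 values).

    The segment lengths are assumed defaults, not values taken from the paper.
    For sequences shorter than n_term + c_term the terminal segments overlap.
    """
    head = seq[:n_term]
    tail = seq[-c_term:] if c_term else ""
    middle = seq[n_term:len(seq) - c_term]
    return (amino_acid_composition(head)
            + amino_acid_composition(middle)
            + amino_acid_composition(tail))


def binary_metrics(tp, fn, tn, fp):
    """ACC, MCC, Sn, and Sp from the counts of a binary confusion matrix."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "Sn": sn, "Sp": sp}


if __name__ == "__main__":
    toy_seq = "MKVLAAGIVGLLLAQWFYHTRDEN" * 4  # made-up sequence, 96 residues
    features = kgap_dipeptide_composition(toy_seq, k=2) + split_amino_acid_composition(toy_seq)
    print(len(features))  # 400 + 60 = 460 raw features before any selection

    # Hypothetical counts, not taken from the paper.
    print(binary_metrics(tp=49, fn=2, tn=9, fp=4))
```

In a fuller pipeline these raw feature vectors would be balanced, filtered by a feature selection step, and passed to a random forest; the abstract reports that 93 features remained in the optimal classifier.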

Recommended

Article Biotechnology & Applied Microbiology

RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites

Zhibin Lv, Jun Zhang, Hui Ding, Quan Zou

FRONTIERS IN BIOENGINEERING AND BIOTECHNOLOGY (2020)

Article Computer Science, Artificial Intelligence

A Convolutional Neural Network Using Dinucleotide One-hot Encoder for identifying DNA N6-Methyladenine Sites in the Rice Genome

Zhibin Lv, Hui Ding, Lei Wang, Quan Zou

Summary: N6-methyladenine (m(6)A) is a crucial epigenetic modification related to the control of various DNA processes. The iRicem6A-CNN protocol, using machine learning, achieved high accuracy in identifying m(6)A sites in the rice genome, outperforming other predictors.

NEUROCOMPUTING (2021)

Article Biochemical Research Methods

Anticancer peptides prediction with deep representation learning features

Zhibin Lv, Feifei Cui, Quan Zou, Lichao Zhang, Lei Xu

Summary: The study introduced a computational method named iACP-DRLF for identifying anticancer peptides, utilizing light gradient boosting machine algorithm and two sequence embedding technologies. Results showed that deep representation learning features significantly enhanced the models' ability to differentiate anticancer peptides.

BRIEFINGS IN BIOINFORMATICS (2021)

Editorial Material Biotechnology & Applied Microbiology

Editorial: Feature Representation and Learning Methods With Applications in Protein Secondary Structure

Ni Yan, Zhibin Lv, Wenjing Hong, Xue Xu

FRONTIERS IN BIOENGINEERING AND BIOTECHNOLOGY (2021)

Article Computer Science, Information Systems

Mul-SNO: A Novel Prediction Tool for S-Nitrosylation Sites Based on Deep Learning Methods

Qian Zhao, Jiaqi Ma, Yu Wang, Fang Xie, Zhibin Lv, Yaoqun Xu, Hua Shi, Ke Han

Summary: SNO is crucial for plant immune response and human disease treatment, with the efficient prediction tool Mul-SNO showing promising results.

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS (2022)

Article Biochemistry & Molecular Biology

Identify Bitter Peptides by Using Deep Representation Learning Features

Jici Jiang, Xinxu Lin, Yueqi Jiang, Liangzhen Jiang, Zhibin Lv

Summary: This study presents the development of a machine learning prediction method called iBitter-DRLF, based on deep learning techniques, to accurately identify bitter peptides. By utilizing deep representation learning, this method can make accurate predictions solely based on peptide sequence data. This is of significant importance for improving the palatability of peptide therapeutics and dietary supplements.

INTERNATIONAL JOURNAL OF MOLECULAR SCIENCES (2022)

Article Genetics & Heredity

Using Sequence Similarity Based on CKSNP Features and a Graph Neural Network Model to Identify miRNA-Disease Associations

Mingxin Li, Yu Fan, Yiting Zhang, Zhibin Lv

Summary: The research focused on the impact of different feature information of miRNA sequences on the relationship between miRNA and disease. It found that a better graph neural network prediction model of miRNA-disease relationship could be built using CKSNAP feature, and the predicted miRNAs related to lung tumors, esophageal tumors, and kidney tumors were consistent with the wet experiment validation database.

Article Genetics & Heredity

Dynamic transcriptome analysis suggests the key genes regulating seed development and filling in Tartary buckwheat (Fagopyrum tataricum Garetn.)

Liangzhen Jiang, Changying Liu, Yu Fan, Qi Wu, Xueling Ye, Qiang Li, Yan Wan, Yanxia Sun, Liang Zou, Dabing Xiang, Zhibin Lv

Summary: This study assessed the transcriptional dynamics of filling stage Tartary buckwheat seeds and identified key genes related to seed development through RNA sequencing. Phytohormones ABA, AUX, ET, BR and CTK, along with related TFs, were found to substantially regulate seed development by targeting downstream expansin genes and structural starch biosynthetic genes. The transcriptome data could serve as a theoretical basis for improving the yield of Tartary buckwheat.

FRONTIERS IN GENETICS (2022)

Editorial Material Genetics & Heredity

Editorial: Machine learning for biological sequence analysis

Zhibin Lv, Mingxin Li, Yansu Wang, Quan Zou

FRONTIERS IN GENETICS (2023)

Article Food Science & Technology

iUP-BERT: Identification of Umami Peptides Based on BERT Features

Liangzhen Jiang, Jici Jiang, Xiao Wang, Yin Zhang, Bowen Zheng, Shuqi Liu, Yiting Zhang, Changying Liu, Yan Wan, Dabing Xiang, Zhibin Lv

Summary: This study developed a peptide sequence-based umami peptide predictor, iUP-BERT, using a deep learning pretrained neural network feature extraction method. After optimization, the model showed improved performance compared to existing methods. The built iUP-BERT web server can aid in improving the palatability of dietary supplements.

Article Chemistry, Multidisciplinary

Identification of Thermophilic Proteins Based on Sequence-Based Bidirectional Representations from Transformer-Embedding Features

Hongdi Pei, Jiayu Li, Shuhan Ma, Jici Jiang, Mingxin Li, Quan Zou, Zhibin Lv

Summary: Thermophilic proteins have the potential to be used as biocatalysts in biotechnology. BertThermo, a model using BERT as an automatic feature extraction tool, achieved high accuracy in identifying thermophilic proteins. It outperformed previous predictive algorithms and demonstrated robustness in various datasets.

APPLIED SCIENCES-BASEL (2023)

Article Food Science & Technology

A Machine Learning Method to Identify Umami Peptide Sequences by Using Multiplicative LSTM Embedded Features

Jici Jiang, Jiayu Li, Junxian Li, Hongdi Pei, Mingxin Li, Quan Zou, Zhibin Lv

Summary: A deep learning method called iUmami-DRLF was developed to identify umami peptides based solely on peptide sequence information. The results show that deep learning significantly improved the capability of models in identifying umami peptides. This method can be used to further enhance the umami flavor of food for a satisfying umami-flavored diet.

Article Biochemistry & Molecular Biology

Using the Random Forest for Identifying Key Physicochemical Properties of Amino Acids to Discriminate Anticancer and Non-Anticancer Peptides

Yiting Deng, Shuhan Ma, Jiayu Li, Bowen Zheng, Zhibin Lv

Summary: Anticancer peptides (ACPs) are a promising new therapeutic approach in cancer treatment, as they can selectively target cancer cells. This study utilized machine learning algorithms to predict potential ACP sequences based on physicochemical features extracted from peptide sequences. By using feature selection methods, 19 key amino acid physicochemical properties were identified that can predict the likelihood of a peptide sequence functioning as an ACP. The study aims to enhance the efficiency of designing peptide sequences for cancer treatment.

INTERNATIONAL JOURNAL OF MOLECULAR SCIENCES (2023)

Article Chemistry, Multidisciplinary

A Stacking Machine Learning Method for IL-10-Induced Peptide Sequence Recognition Based on Unified Deep Representation Learning

Jiayu Li, Jici Jiang, Hongdi Pei, Zhibin Lv

Summary: A new IL-10-induced peptide recognition method called IL10-Stack was introduced in this research, which utilized unified deep representation learning and a stacking algorithm. Feature extraction from peptide sequences was done using two approaches, Amino Acid Index (AAindex) and sequence-based unified representation (UniRep). The IL10-Stack model, constructed using a 1900-dimensional UniRep feature vector, demonstrated excellent performance in IL-10-induced peptide recognition with an accuracy of 0.910 and MCC of 0.820. Compared to existing methods, IL-10Pred and ILeukin10Pred, the IL10-Stack approach showed improved accuracy by 12.1% and 2.4% respectively. The IL10-Stack method has the potential to identify IL-10-induced peptides, aiding in the development of immunosuppressive drugs.

APPLIED SCIENCES-BASEL (2023)

Review Multidisciplinary Sciences

Review of T cell proliferation regulatory factors in treatment and prognostic prediction for solid tumors

Jiayu Li, Shuhan Ma, Hongdi Pei, Jici Jiang, Quan Zou, Zhibin Lv

Summary: This review focuses on the development of Tcprs for solid tumor therapy and prognostic prediction, and proposes strategies to enhance CAR-T cells through targeting different Tcprs, which may lead to the development of a new generation of cell therapies.

HELIYON (2023)