4.7 Article

Efficient Extraction of Protein-Protein Interactions from Full-Text Articles

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TCBB.2010.51

Keywords

Biology and genetics; text analysis; bioinformatics (genome or protein) databases

Funding

  1. Science Foundation Arizona [CAA 0277-08]
  2. Arizona Alzheimer's Disease Data Management Core under NIH [NIA P30 AG-19610]
  3. State of Arizona Alzheimer's Disease Research Consortium
  4. US National Science Foundation (NSF) [0412000]
  5. SFAZ [CAA 0289-08]
  6. NSF [OCI 0950440]
  7. Fulton School of Engineering
  8. Division of Information & Intelligent Systems
  9. Directorate for Computer & Information Science & Engineering [0412000] (funding source: National Science Foundation)
  10. Office of Advanced Cyberinfrastructure (OAC)
  11. Directorate for Computer & Information Science & Engineering [0950440] (funding source: National Science Foundation)

Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, are curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to ease the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to raise awareness of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions that handle a single full-text article in 10 seconds to 2 minutes. We propose strategies to transfer document-level annotations to the sentence level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an F-score of 22 percent for finding protein interactions and 43 percent for mapping proteins to UniProt IDs; disregarding species, the F-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text.
All data and pattern sets, as well as Java classes that extend third-party software, are available as supplementary information (see Appendix).

Recommended

Article Multidisciplinary Sciences

ReportAGE: Automatically extracting the exact age of Twitter users based on self-reports in tweets

Ari Z. Klein, Arjun Magge, Graciela Gonzalez-Hernandez

Summary: This study developed and evaluated a method for automatically identifying the exact age of users based on self-reports in their tweets. They achieved high accuracy in age identification using natural language processing and a deep neural network classifier, and successfully applied the method to a large Twitter dataset.

PLOS ONE (2022)

Article Psychology, Developmental

Adolescent Perceptions of Menstruation on Twitter: Opportunities for Advocacy and Education

Shelby H. Davies, Miriam D. Langer, Ari Klein, Graciela Gonzalez-Hernandez, Nadia Dowshen

Summary: This study examines youth perceptions about menstruation on Twitter and finds that there is a negative expectation and shame surrounding menstruation. A significant portion of tweets are related to advocacy or education, supporting the potential use of Twitter as a platform to improve public health messaging, health outcomes, and equity for youth who menstruate.

JOURNAL OF ADOLESCENT HEALTH (2022)

Review Health Care Sciences & Services

Methods to Establish Race or Ethnicity of Twitter Users: Scoping Review

Su Golder, Robin Stevens, Karen O'Connor, Richard James, Graciela Gonzalez-Hernandez

Summary: Social media data is increasingly used in health research. This study aims to identify different methods to extract race or ethnicity from social media and report on the challenges of using these methods. Through a scoping review, the study found that there is currently no standard approach to extract or infer the race or ethnicity of Twitter users, and there are challenges in terms of accuracy and ethical issues.

JOURNAL OF MEDICAL INTERNET RESEARCH (2022)

Article Public, Environmental & Occupational Health

Toward Using Twitter for PrEP-Related Interventions: An Automated Natural Language Processing Pipeline for Identifying Gay or Bisexual Men in the United States

Ari Z. Klein, Steven Meanley, Karen O'Connor, Jose A. Bauermeister, Graciela Gonzalez-Hernandez

Summary: By developing an automated natural language processing (NLP) pipeline, MSM at risk of HIV acquisition can be identified on Twitter, laying the groundwork for targeted PrEP-related interventions for this population on a large scale.

JMIR PUBLIC HEALTH AND SURVEILLANCE (2022)

Article Public, Environmental & Occupational Health

Patient-Reported Reasons for Switching or Discontinuing Statin Therapy: A Mixed Methods Study Using Social Media

Su Golder, Davy Weissenbacher, Karen O'Connor, Sean Hennessy, Robert Gross, Graciela Gonzalez Hernandez

Summary: Social media analysis revealed that the main reason for discontinuation of statin therapy was patient experience of adverse events, with musculoskeletal and connective tissue disorders being the most common. 60% of posters identified as female, with the most common age category being 55-64 years. The unique patient perspectives found on social media may provide valuable insights for interventions to reduce medication discontinuation.

DRUG SAFETY (2022)

Article Immunology

Explorations of the Role of Digital Technology in HIV-Related Implementation Research: Case Comparisons of Five Ending the HIV Epidemic Supplement Awards

Jeb Jones, Justin Knox, Steven Meanley, Cui Yang, David W. Lounsbury, Terry T. Huang, Jose Bauermeister, Graciela Gonzalez-Hernandez, Victoria Frye, Christian Grov, Viraj Patel, Stefan D. Baral, Patrick S. Sullivan, Sheree R. Schwartz

Summary: The use of digital technology in HIV-related interventions and implementation strategies is significant and presents both challenges and opportunities. This article explores five case studies that highlight the role of technology in HIV-related implementation research, emphasizing the importance of study design, outcome measurement, and equity.

JAIDS-JOURNAL OF ACQUIRED IMMUNE DEFICIENCY SYNDROMES (2022)

Article Psychology, Clinical

Exploring content of misinformation about HPV vaccine on twitter

Melanie L. Kornides, Sarah Badlis, Katharine J. Head, Mary Putt, Joseph Cappella, Graciela Gonzalez-Hernandez

Summary: Nearly a quarter of #HPV Tweets contain disinformation or misinformation about the HPV vaccine, with adverse health effects, mandatory vaccination, and vaccine inefficacy being the most prevalent categories. These misleading tweets are more likely to be retweeted than supportive tweets.

JOURNAL OF BEHAVIORAL MEDICINE (2023)

Article Computer Science, Artificial Intelligence

Comprehensively identifying Long Covid articles with human-in-the-loop machine learning

Robert Leaman, Rezarta Islamaj, Alexis Allot, Qingyu Chen, W. John Wilbur, Zhiyong Lu

Summary: A significant percentage of COVID-19 survivors experience ongoing multisystemic symptoms known as Long Covid. Identifying relevant scientific articles on Long Covid is challenging due to lack of standardized terminology. A machine learning framework combining data programming with active learning shows higher specificity and sensitivity compared to other methods. Analysis of the Long Covid Collection reveals that most articles do not refer to Long Covid by any name, and when mentioned, Long Covid is the most frequently used term associated with disorders in various body systems. The Long Covid Collection is regularly updated and searchable on the LitCovid portal.

PATTERNS (2023)

Letter Health Care Sciences & Services

Pregex: Rule-Based Detection and Extraction of Twitter Data in Pregnancy

Ari Z. Klein, Shriya Kunatharaju, Karen O'Connor, Graciela Gonzalez-Hernandez

JOURNAL OF MEDICAL INTERNET RESEARCH (2023)

Letter Health Care Sciences & Services

Automatically Identifying Self-Reports of COVID-19 Diagnosis on Twitter: An Annotated Data Set, Deep Neural Network Classifiers, and a Large-Scale Cohort

Ari Z. Klein, Shriya Kunatharaju, Karen O'Connor, Graciela Gonzalez-Hernandez

JOURNAL OF MEDICAL INTERNET RESEARCH (2023)

Editorial Material Urology & Nephrology

Retrieve, Summarize, and Verify: How Will ChatGPT Affect Information Seeking from the Medical Literature?

Qiao Jin, Robert Leaman, Zhiyong Lu

JOURNAL OF THE AMERICAN SOCIETY OF NEPHROLOGY (2023)

Review Health Care Sciences & Services

The Role of Social Media for Identifying Adverse Drug Events Data in Pharmacovigilance: Protocol for a Scoping Review

Su Golder, Karen O'Connor, Yunwen Wang, Graciela Gonzalez Hernandez

Summary: This study aims to evaluate and characterize the use of social media in adverse drug event detection and pharmacovigilance compared to other data sources. By comparing social media data with other sources, the added value of social media in monitoring adverse drug events can be concluded.

JMIR RESEARCH PROTOCOLS (2023)

Article Biochemical Research Methods

AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning

Ling Luo, Chih-Hsuan Wei, Po-Ting Lai, Robert Leaman, Qingyu Chen, Zhiyong Lu

Summary: Biomedical named entity recognition (BioNER) aims to automatically identify biomedical entities in natural language text, providing a necessary foundation for downstream text mining tasks and applications. Due to the expensive and domain-specific expertise required for manual annotation of training data, current BioNER approaches suffer from data scarcity and limitations in generalizability and entity coverage. In this paper, we propose an all-in-one (AIO) scheme that utilizes external annotated resources to enhance the accuracy and stability of BioNER models. We introduce AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO scheme, and demonstrate its effectiveness, robustness, and advantages over existing methods on 14 BioNER benchmark tasks and three independent tasks.

BIOINFORMATICS (2023)

Article Mathematical & Computational Biology

Chemical identification and indexing in full-text articles: an overview of the NLM-Chem track at BioCreative VII

Robert Leaman, Rezarta Islamaj, Virginia Adams, Mohammed A. Alliheedi, Joao Rafael Almeida, Rui Antunes, Robert Bevan, Yung-Chun Chang, Arslan Erdengasileng, Matthew Hodgskiss, Ryuki Ida, Hyunjae Kim, Keqiao Li, Robert E. Mercer, Lukrecia Mertova, Ghadeer Mobasher, Hoo-Chang Shin, Mujeen Sung, Tomoki Tsujimura, Wen-Chao Yeh, Zhiyong Lu

Summary: The BioCreative National Library of Medicine (NLM)-Chem track is a community effort to improve automated recognition of chemical names in biomedical literature. The track consists of two tasks: chemical identification and chemical indexing. The community challenge demonstrated the achievements in deep learning technologies and the challenges of the chemical indexing task. Further development of biomedical text-mining methods is expected to respond to the rapid growth of biomedical literature.

DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION (2023)

Article Health Care Sciences & Services

Using Twitter Data for Cohort Studies of Drug Safety in Pregnancy: Proof-of-concept With β-Blockers

Ari Z. Klein, Karen O'Connor, Lisa D. Levine, Graciela Gonzalez-Hernandez

Summary: This study examined the utility of Twitter data for analyzing the outcomes of pregnancies where beta-blockers were taken. The results suggest that Twitter can be a useful resource for cohort studies on drug safety during pregnancy.

JMIR FORMATIVE RESEARCH (2022)