Article

A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/jamia/ocv172

Keywords

phylogeography; information extraction; natural language processing

Funding

  1. National Institute of Allergy and Infectious Diseases of the National Institutes of Health [R56AI102559]

Objective: The metadata reflecting the location of the infected host (LOIH) of virus sequences in GenBank often lacks specificity. This work seeks to enhance this metadata by extracting more specific geographic information from related full-text articles and mapping it to latitude/longitude coordinates using knowledge derived from external geographical databases.

Materials and Methods: We developed a rule-based information extraction framework for linking GenBank records to the latitude/longitudes of the LOIH. Our system first extracts existing geospatial metadata from GenBank records and attempts to improve it by seeking additional, relevant geographic information from text and tables in related full-text PubMed Central articles. The final extracted locations of the records, based on data assimilated from these sources, are then disambiguated and mapped to their respective geo-coordinates. We evaluated our approach on a manually annotated dataset comprising 5728 GenBank records for the influenza A virus.

Results: We found the precision, recall, and f-measure of our system for linking GenBank records to the latitude/longitudes of their LOIH to be 0.832, 0.967, and 0.894, respectively.

Discussion: Our system had a high level of accuracy for linking GenBank records to the geo-coordinates of the LOIH. However, it can be further improved by expanding our database of geospatial data, incorporating spell correction, and enhancing the rules used for extraction.

Conclusion: Our system performs reasonably well for linking GenBank records for the influenza A virus to the geo-coordinates of their LOIH based on record metadata and information extracted from related full-text articles.
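The three reported scores can be checked against one another. The sketch below assumes the f-measure in the abstract is the balanced F1 score, that is, the harmonic mean of precision and recall; the abstract does not say which variant was used, and the function name `f_measure` and its `beta` parameter are illustrative only, not taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    beta = 1.0 gives the balanced F1 score. Treating the paper's
    "f-measure" as F1 is an assumption; the abstract does not state it.
    """
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both scores are zero, so the harmonic mean is defined as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Values as reported in the abstract.
reported_precision = 0.832
reported_recall = 0.967

f1 = f_measure(reported_precision, reported_recall)
print(round(f1, 3))  # 0.894
```

Under that assumption the computed value rounds to 0.894, which matches the reported f-measure. Recall is much higher than precision, so most of the remaining errors are incorrect links rather than missed ones.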

Recommended

Review Health Care Sciences & Services

Methods to Establish Race or Ethnicity of Twitter Users: Scoping Review

Su Golder, Robin Stevens, Karen O'Connor, Richard James, Graciela Gonzalez-Hernandez

Summary: Social media data is increasingly used in health research. This study aims to identify different methods to extract race or ethnicity from social media and report on the challenges of using these methods. Through a scoping review, the study found that there is currently no standard approach to extract or infer the race or ethnicity of Twitter users, and there are challenges in terms of accuracy and ethical issues.

JOURNAL OF MEDICAL INTERNET RESEARCH (2022)

Article Public, Environmental & Occupational Health

Identifying mitigation strategies for COVID-19 superspreading on flights using models that account for passenger movement

Sirish Namilae, Yuxuan Wu, Anuj Mubayi, Ashok Srinivasan, Matthew Scotch

Summary: Conventional models fail to explain superspreading patterns on flights, with passenger movement playing a significant role in infection spread. The use of FFP2/N95 masks is more effective in reducing infection risk, and leaving middle seats vacant is also effective. The results emphasize the importance of implementing stricter guidelines to reduce aviation-related transmission.

TRAVEL MEDICINE AND INFECTIOUS DISEASE (2022)

Article Public, Environmental & Occupational Health

Toward Using Twitter for PrEP-Related Interventions: An Automated Natural Language Processing Pipeline for Identifying Gay or Bisexual Men in the United States

Ari Z. Klein, Steven Meanley, Karen O'Connor, Jose A. Bauermeister, Graciela Gonzalez-Hernandez

Summary: By developing an automated natural language processing (NLP) pipeline, MSM at risk of HIV acquisition can be identified on Twitter, laying the groundwork for targeted PrEP-related interventions for this population on a large scale.

JMIR PUBLIC HEALTH AND SURVEILLANCE (2022)

Article Public, Environmental & Occupational Health

Patient-Reported Reasons for Switching or Discontinuing Statin Therapy: A Mixed Methods Study Using Social Media

Su Golder, Davy Weissenbacher, Karen O'Connor, Sean Hennessy, Robert Gross, Graciela Gonzalez Hernandez

Summary: Social media analysis revealed that the main reason for discontinuation of statin therapy was patient experience of adverse events, with musculoskeletal and connective tissue disorders being the most common. 60% of posters identified as female, with the most common age category being 55-64 years. The unique patient perspectives found on social media may provide valuable insights for interventions to reduce medication discontinuation.

DRUG SAFETY (2022)

Article Immunology

Explorations of the Role of Digital Technology in HIV-Related Implementation Research: Case Comparisons of Five Ending the HIV Epidemic Supplement Awards

Jeb Jones, Justin Knox, Steven Meanley, Cui Yang, David W. Lounsbury, Terry T. Huang, Jose Bauermeister, Graciela Gonzalez-Hernandez, Victoria Frye, Christian Grov, Viraj Patel, Stefan D. Baral, Patrick S. Sullivan, Sheree R. Schwartz

Summary: The use of digital technology in HIV-related interventions and implementation strategies is significant and presents both challenges and opportunities. This article explores five case studies that highlight the role of technology in HIV-related implementation research, emphasizing the importance of study design, outcome measurement, and equity.

JAIDS-JOURNAL OF ACQUIRED IMMUNE DEFICIENCY SYNDROMES (2022)

Article Psychology, Clinical

Exploring content of misinformation about HPV vaccine on twitter

Melanie L. Kornides, Sarah Badlis, Katharine J. Head, Mary Putt, Joseph Cappella, Graciela Gonzalez-Hernandez

Summary: Nearly a quarter of #HPV Tweets contain disinformation or misinformation about the HPV vaccine, with adverse health effects, mandatory vaccination, and vaccine inefficacy being the most prevalent categories. These misleading tweets are more likely to be retweeted than supportive tweets.

JOURNAL OF BEHAVIORAL MEDICINE (2023)

Article Microbiology

Genome Sequences of Anelloviruses, Genomovirus, and Papillomavirus Isolated from Nasal Pharyngeal Swabs

Courtney L. Collins, Simona Kraberger, Rafaela S. Fontenele, Temitope O. C. Faleye, Deborah Adams, Sangeet Adhikari, Helen Sandrolini, Sarah Finnerty, Rolf U. Halden, Matthew Scotch, Arvind Varsani

Summary: In this study, multiple viral infections including anelloviruses, papillomavirus, and influenza viruses were identified from nasopharyngeal swabs.

MICROBIOLOGY RESOURCE ANNOUNCEMENTS (2022)

Letter Health Care Sciences & Services

Pregex: Rule-Based Detection and Extraction of Twitter Data in Pregnancy

Ari Z. Klein, Shriya Kunatharaju, Karen O'Connor, Graciela Gonzalez-Hernandez

JOURNAL OF MEDICAL INTERNET RESEARCH (2023)

Letter Health Care Sciences & Services

Automatically Identifying Self-Reports of COVID-19 Diagnosis on Twitter: An Annotated Data Set, Deep Neural Network Classifiers, and a Large-Scale Cohort

Ari Z. Klein, Shriya Kunatharaju, Karen O'Connor, Graciela Gonzalez-Hernandez

JOURNAL OF MEDICAL INTERNET RESEARCH (2023)

Article Microbiology

Rhizobium Phage-Like Microvirus Genome Sequence Identified in Wastewater in Arizona, USA, in November 2020 Encodes an Endolysin and a Putative Multiheme Cytochrome c-like Protein

Ainsley R. Chapman, Jillian M. Wright, Nicole A. Kaiser, Peter M. Jones, Erin M. Driver, Rolf U. Halden, Arvind Varsani, Matthew Scotch, Temitope O. C. Faleye

Summary: This paper describes the genome of MAZ-Nov-2020, a microvirus identified from municipal wastewater in Maricopa County, Arizona, USA, in November 2020. The genome is 4,696 nucleotides long, with a GC content of 56% and a coverage of 3,641x. It encodes major capsid protein, endolysin, replication initiator protein, and two hypothetical proteins, one of which is predicted to be a membrane-associated multiheme cytochrome c.

MICROBIOLOGY RESOURCE ANNOUNCEMENTS (2023)

Review Health Care Sciences & Services

The Role of Social Media for Identifying Adverse Drug Events Data in Pharmacovigilance: Protocol for a Scoping Review

Su Golder, Karen O'Connor, Yunwen Wang, Graciela Gonzalez Hernandez

Summary: This study aims to evaluate and characterize the use of social media in adverse drug event detection and pharmacovigilance compared to other data sources. By comparing social media data with other sources, the added value of social media in monitoring adverse drug events can be concluded.

JMIR RESEARCH PROTOCOLS (2023)

Article Computer Science, Information Systems

Foundational domains and competencies for baccalaureate health informatics education

Saif Khairat, Sue S. Feldman, Arif Rana, Mohammad Faysel, Saptarshi Purkayastha, Matthew Scotch, Christina Eldredge

Summary: This article presents the foundational domains and corresponding competencies developed by AMIA's Academic Forum Baccalaureate Education Committee (BEC) for undergraduate health informatics education.

JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION (2023)

Article Infectious Diseases

Canine Parvovirus 2C Identified in Dog Feces from Poop Bags Collected from Outdoor Waste Bins in Arizona USA, June 2022

Temitope O. C. Faleye, Erin M. Driver, Devin A. Bowes, Abriana Smith, Nicole A. Kaiser, Jillian M. Wright, Ainsley R. Chapman, Rolf U. Halden, Arvind Varsani, Matthew Scotch

Summary: In this study, CPV genomes were sequenced from dog feces collected in poop bags, and a variant of CPV-2c with amino acid substitutions in NS1 and NS2 was identified in Arizona, USA in June 2022. This genome is the only CPV genome described in the USA from the 2022 season, despite reports of CPV outbreaks and fatalities in dogs. Further studies and experimental research are needed to enhance our understanding of the evolutionary process of CPV.

TRANSBOUNDARY AND EMERGING DISEASES (2023)

Article Health Care Sciences & Services

Using Twitter Data for Cohort Studies of Drug Safety in Pregnancy: Proof-of-concept With β-Blockers

Ari Z. Klein, Karen O'Connor, Lisa D. Levine, Graciela Gonzalez-Hernandez

Summary: This study examined the utility of Twitter data for analyzing the outcomes of pregnancies where beta-blockers were taken. The results suggest that Twitter can be a useful resource for cohort studies on drug safety during pregnancy.

JMIR FORMATIVE RESEARCH (2022)

Article Health Care Sciences & Services

Toward Using Twitter Data to Monitor COVID-19 Vaccine Safety in Pregnancy: Proof-of-Concept Study of Cohort Identification

Ari Z. Klein, Karen O'Connor, Graciela Gonzalez-Hernandez

Summary: This preliminary study aimed to use Twitter data to identify a cohort for epidemiologic studies of COVID-19 vaccination during pregnancy. By developing regular expressions and utilizing natural language processing tools, the study identified users who reported receiving COVID-19 vaccination during pregnancy and their pregnancy outcomes. Manual verification confirmed a portion of users received vaccination during pregnancy and reported outcomes, suggesting that Twitter can serve as a complementary resource to generate acceptance of COVID-19 vaccination in pregnant populations.

JMIR FORMATIVE RESEARCH (2022)
