Improved transcriptome assembly using a hybrid of long and short reads with StringTie

Journal

PLOS COMPUTATIONAL BIOLOGY
Volume 18, Issue 6, Pages -

Publisher

PUBLIC LIBRARY OF SCIENCE
DOI: 10.1371/journal.pcbi.1009730

Funding

  1. National Science Foundation [DBI-1759518]

Short-read and long-read RNA sequencing each have strengths and weaknesses. The new release of StringTie supports hybrid-read assembly, which combines the strengths of both read types. It is more accurate than assembling either read type alone, and it is both more accurate and faster than correcting long reads before assembly.
Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are rarely able to span multiple exons. Long-read technology can capture full-length transcripts, but its relatively high error rate often leads to mis-identified splice sites. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone. Here we demonstrate the improved accuracy on simulated data and real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly while also being substantially faster. StringTie is freely available as open source software at https://github.com/gpertea/stringtie.

Author summary

Identifying the genes that are active in a cell is a critical step in studying cell development, disease, the response to infection, the effects of mutations, and much more. During the last decade, high-throughput RNA-sequencing data have proven essential in characterizing the set of genes expressed in different cell types and conditions, which has driven a strong need for highly efficient, scalable and accurate computational methods to process these data. As sequencing costs have dropped, ever-larger experiments have been designed, often capturing hundreds of millions or even billions of reads in a single study. These enormous data sets require highly efficient and accurate computational methods for analysis, and they also present opportunities for discovery.
Recently developed long-read technology now allows researchers to capture entire transcripts in a single long read, enabling more accurate reconstruction of the full exon-intron structure of genes, although these reads have higher error rates and higher costs. In this study we use the high accuracy of short reads to correct the alignments of long RNA reads, with the goal of improving the identification of novel gene isoforms, and ultimately our understanding of transcriptome complexity.
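
The idea of using accurate short reads to correct long-read alignments can be illustrated with a toy example. The sketch below is not StringTie's algorithm; it only shows one simple way the principle could work, assuming splice sites are given as integer genomic coordinates. The function name `snap_splice_sites` and the `max_shift` tolerance are hypothetical and chosen for illustration.

```python
from bisect import bisect_left


def snap_splice_sites(long_read_sites, short_read_sites, max_shift=10):
    """Move each long-read splice site to the nearest short-read-supported
    site within max_shift bases; leave it unchanged otherwise.

    Illustrative only: max_shift is an assumed tolerance, not a StringTie
    parameter.
    """
    trusted = sorted(set(short_read_sites))
    corrected = []
    for pos in long_read_sites:
        i = bisect_left(trusted, pos)
        # Candidates are the trusted neighbours on either side of pos.
        candidates = trusted[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda s: abs(s - pos), default=None)
        if best is not None and abs(best - pos) <= max_shift:
            corrected.append(best)
        else:
            corrected.append(pos)
    return corrected


if __name__ == "__main__":
    # Two noisy long-read sites lie near trusted sites; the third does not.
    print(snap_splice_sites([103, 248, 330], [100, 250, 400]))
```

In this toy setting, a long-read splice site a few bases away from a short-read-supported site is moved onto it, while a site with no nearby support is kept as a possible novel junction.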
