4.7 Article

A k-mer grammar analysis to uncover maize regulatory architecture

Journal

BMC PLANT BIOLOGY
Volume 19

Publisher

BMC
DOI: 10.1186/s12870-019-1693-2

Keywords

Gene regulatory regions; Machine learning models; Crops genomics

Funding

  1. NSF Plant Genome Project [1238014]
  2. USDA-ARS

Background

Only a small percentage of the genome sequence is involved in regulation of gene expression, but to biochemically identify this portion is expensive and laborious. In species like maize, with diverse intergenic regions and lots of repetitive elements, this is an especially challenging problem that limits the use of the data from one line to the other. While regulatory regions are rare, they do have characteristic chromatin contexts and sequence organization (the grammar) with which they can be identified.

Results

We developed a computational framework to exploit this sequence arrangement. The models learn to classify regulatory regions based on sequence features - k-mers. To do this, we borrowed two approaches from the field of natural language processing: (1) bag-of-words which is commonly used for differentially weighting key words in tasks like sentiment analyses, and (2) a vector-space model using word2vec (vector-k-mers), that captures semantic and linguistic relationships between words. We built bag-of-k-mers and vector-k-mers models that distinguish between regulatory and non-regulatory regions with an average accuracy above 90%. Our bag-of-k-mers achieved higher overall accuracy, while the vector-k-mers models were more useful in highlighting key groups of sequences within the regulatory regions.

Conclusions

These models now provide powerful tools to annotate regulatory regions in other maize lines beyond the reference, at low cost and with high accuracy.
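
To make the bag-of-k-mers idea from the Results concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' pipeline: the abstract does not state the k-mer length, the weighting scheme, or the classifier, so the choice of `k`, the smoothed TF-IDF weighting, and the function names `kmers` and `bag_of_kmers` are all assumptions made for this example.

```python
from collections import Counter
import math


def kmers(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (sliding window, step 1)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_kmers(seqs, k=3):
    """Return (vocabulary, vectors): one TF-IDF weighted vector per sequence.

    Assumption: smoothed inverse document frequency is used as the
    "differential weighting" of k-mers; the paper may weight them differently.
    """
    # Count how often each k-mer occurs in each sequence (the "bag").
    counts = [Counter(kmers(s, k)) for s in seqs]
    vocab = sorted(set().union(*counts))
    n = len(seqs)
    # Document frequency: number of sequences that contain each k-mer.
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    # Smoothed IDF: k-mers found in fewer sequences get a larger weight.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = [[c[w] * idf[w] for w in vocab] for c in counts]
    return vocab, vectors


if __name__ == "__main__":
    vocab, vectors = bag_of_kmers(["ACGT", "ACGA"], k=2)
    print(vocab)
    print(vectors)
```

Vectors of this kind could then be passed to any standard classifier to separate regulatory from non-regulatory sequences. The vector-k-mers variant would instead treat each sequence as a sentence of k-mer "words" and learn dense embeddings with word2vec.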
