Abstract: Formal grammars can be used to describe complex repeatable structures such as DNA sequences. In
this paper, we describe the structural composition of DNA sequences using a context-free stochastic L-grammar.
L-grammars are a special class of parallel grammars that can model the growth of living organisms, such as plant
development, as well as the morphology of a variety of organisms. We believe that parallel grammars can also
be used to model genetic mechanisms and sequences such as promoters. Promoters are short regulatory
DNA sequences located upstream of a gene. Detection of promoters in DNA sequences is important for
successful gene prediction. Promoters can be recognized by certain patterns that are conserved within a species,
but there are many exceptions, which makes promoter recognition a complex problem. We replace the
problem of promoter recognition with the induction of context-free stochastic L-grammar rules, which are later used for
the structural analysis of promoter sequences. L-grammar rules are derived automatically from the Drosophila and
vertebrate promoter datasets using a genetic programming technique, and their fitness is evaluated using a
Support Vector Machine (SVM) classifier. The artificial promoter sequences generated using the derived L-grammar
rules are analyzed and compared with natural promoter sequences.
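
To make the notion of a context-free stochastic L-grammar concrete, the following minimal Python sketch shows parallel rewriting over the DNA alphabet. It is an illustration only: the production rules, their probabilities, the axiom, and the names RULES, rewrite_symbol and derive are assumptions invented for this example and are not the rules derived in this paper.

```python
import random

# Hypothetical stochastic productions over the DNA alphabet.
# Each symbol maps to a list of (successor, probability) pairs.
# These rules are invented for illustration; they are NOT derived from data.
RULES = {
    "A": [("AT", 0.5), ("A", 0.5)],
    "C": [("CG", 0.3), ("C", 0.7)],
    "G": [("G", 1.0)],
    "T": [("TA", 0.4), ("T", 0.6)],
}


def rewrite_symbol(symbol, rules, rng):
    """Pick one successor for a symbol according to its rule probabilities.

    The choice ignores neighbouring symbols, which is what makes the
    grammar context-free.
    """
    options = rules.get(symbol)
    if not options:
        return symbol  # symbols without a rule are copied unchanged
    successors = [s for s, _ in options]
    weights = [p for _, p in options]
    return rng.choices(successors, weights=weights, k=1)[0]


def derive(axiom, rules, steps, rng):
    """Apply the rules to every symbol of the string in each step.

    Rewriting all symbols in the same step (rather than one at a time)
    is the parallel behaviour that distinguishes L-grammars.
    """
    sequence = axiom
    for _ in range(steps):
        sequence = "".join(rewrite_symbol(s, rules, rng) for s in sequence)
    return sequence


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    print(derive("TATA", RULES, 3, rng))
```

In this sketch a derivation starts from a short axiom and grows over a fixed number of steps. Choosing which rules and probabilities to use is the induction problem addressed in the paper and is not attempted here.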
Keywords: stochastic context-free L-grammar, DNA modeling, machine learning, data mining, bioinformatics.
ACM Classification Keywords: F.4.2 Grammars and Other Rewriting Systems; I.2.6 Knowledge acquisition; I.5
Pattern recognition; J.3 Life and medical sciences.
DERIVATION OF CONTEXT-FREE STOCHASTIC L-GRAMMAR RULES
FOR PROMOTER SEQUENCE MODELING USING SUPPORT VECTOR MACHINE