Computing Genomic Signatures Using de Bruijn Chains

Lenwood S. Heath, Amrita Pati


Genomic DNA sequences have both deterministic and random aspects and exhibit features at numerous scales, from codons to regions of conserved or divergent gene order. Genomic signatures work by capturing one or more such features efficiently into a compact mathematical structure. We examine the unique manner in which oligonucleotides constitute a genome, within a graph-theoretic setting. A de Bruijn chain (DBC) is a kind of de Bruijn graph that includes a finite Markov chain. By representing a DNA sequence as a walk over a DBC and retaining specific information at nodes and edges, we obtain the de Bruijn chain genomic signature θdbc, based on graph structure and the stationary distribution of the DBC. We demonstrate that the θdbc signature is information-rich, efficient, sufficiently representative of the sequence from which it is derived, and superior to existing genomic signatures such as the dinucleotide odds ratio and word frequency based signatures. We develop a mathematical framework to elucidate the power of the θdbc signature to distinguish between sequences hypothesized to be generated by DBCs of distinct parameters. We study the effect of order of the θdbc signature, genome size, and variation within a genome on accuracy. We illustrate its superior performance over existing genomic signatures in predicting the origin of short DNA sequences.

Full Text:



S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, Basic local alignment search tool, Journal of Molecular Biology, 215 (1990), pp. 403–410.

A. Anguiano and A. Potti, Genomic signatures individualize therapeutic decisions in nonsmall-cell lung cancer, Future Drugs, 7 (2007), pp. 837–844.

D. N. Baker and B. Langmead, Dashing: Fast and accurate genomic distances with HyperLogLog, Genome Biology, 20 (2019), p. 12 pages.

A. Z. Broder, On the resemblance and containment of documents, in Compression and Complexity of Sequences 1997 — Proceedings, 1998, pp. 21–29.

C. T. Brown and L. Irber, sourmash: a library for MinHash sketching of DNA, The Journal of Open Source Software, 1 (2016), p. 27.

A. M. Campbell, J. Mrazek, and S. Karlin, Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA, Proceedings of the National Academy of Sciences of the United States of America, 96 (1999), pp. 9184–9189.

C. H. Cannon, C. S. Kua, E. K. Lobenhofer, and P. Hurban, capturing genomic signatures of DNA sequence variation using a standard anonymous microarray platform, Nucleic Acids Research, 34 (2006).

A. Carbone, F. Kepes, and A. Zinovyev, Codon bias signatures, organization of microorganisms in codon space, and lifestyle, Molecular Biology and Evolution, 22 (2005), pp. 547–561.

J. T. Chang and J. R. Nevins, GATHER: A systems approach to interpreting genomic signatures, Bioinformatics, 22 (2006), pp. 2926–2933.

T. Coenye and P. Vandamme, Use of the genomic signature in bacterial classification and identification, Systematic and Applied Microbiology, 27 (2004), pp. 175–185.

D. Dalevi, D. Dubhashi, and M. Hermansson, Bayesian classifiers for detecting HGT using fixed and variable order Markov models of genomic signatures, Bioinformatics, 22 (2006), pp. 517–522.

P. J. Deschavanne, A. Giron, J. Vilain, G. Fagot, and B. Fertil, Genomic signature: Characterization and classification of species assessed by chaos game representation of sequences, Molecular Biology and Evolution, 16 (1999), pp. 1391–1399.

H. K. Dressman, A. Bild, J. Garst, D. H. Jr, and A. Potti, Genomic signatures in non-small-cell lung cancer: Targeting the targeted therapies, Current Oncology Reports, 8 (2006), pp. 252–257.

C. Dufraigne, B. Fertil, S. Lespinats, A. Giron, and P. Deschavanne, Detection and characterization of horizontal transfers in prokaryotes using genomic signature, Nucleic Acids Research, 33 (2005), p. 12 pages.

C. Dutta and J. Das, Mathematical characterization of Chaos Game Representation: New algorithms for nucleotide sequence analysis, Journal of Molecular Biology, 228 (1992), pp. 715– 719.

W. Feller, An Introduction to Probability Theory and Its Applications, vol. I, John Wiley & Sons Inc., New York, third ed., 1968.

B. Fertil, M. Massin, S. Lespinats, C. Devic, P. Dumee, and A. Giron, GENSTYLE: Exploration and analysis of DNA sequences with genomic signature, Nucleic Acids Research, 33 (Web Server issue) (2005), pp. W512–W515.

J. W. Fickett, D. C. Torney, and D. R. Wolf, Base compositional structure of genomes, Genomics, 13 (1992), pp. 1056–1064.

A. J. Gentles and S. Karlin, Genome-scale compositional comparisons in eukaryotes, Genome Research, 11 (2001), pp. 540–546.

J. Goris, K. T. Konstantinidis, J. A. Klappenbach, T. Coenye, P. Vandamme, and J. M. Tiedje, DNA-DNA hybridization values and their relationship to whole-genome sequence similarities, International Journal of Systematic and Evolutionary Microbiology, 57 (2007), pp. 81–91.

M. P. Hande, T. V. Azizova, C. R. Geard, L. E. Burak, C. R. Mitchell, V. F. Khokhryakov, E. K. Vasilenko, and D. J. Brenner, Past exposure to densely ionizing radiation leaves a unique permanent signature in the genome, The American Society of Human Genetics, 72 (2003), pp. 1162–1170.

L. S. Heath and A. Pati, Genomic signatures from DNA word graphs, in Proceedings of the Third International Symposium on Bioinformatics Research and Applications (ISBRA), Springer-Verlag, 2007, pp. 317–328.

Genomic signatures in de Bruijn chains, in Proceedings of the Workshop on Algorithms in Bioinformatics (WABI), Springer-Verlag, 2007, pp. 216–227.

Predicting Markov chain order in genomic sequences, in Proceedings of the IEEE In-ternational Conference on Bioinformatics and Biomedicine (BIBM), IEEE Computer Society, 2007, pp. 159–164.

K. A. Hill, N. J. Schisler, and S. M. Singh, Chaos game representation of coding regions of human globin genes and alcohol dehydrogenase genes of phylogenetically divergent species, Journal of Molecular Evolution, 35 (1992), pp. 261–269.

H. J. Jeffrey, Chaos game representation of gene structure, Nucleic Acids Research, 18 (1990), pp. 2163–2170.

R. W. Jernigan and R. H. Baran, Pervasive properties of the genomic signature, BMC Genomics, 3 (2002), p. 9 pages.

P. Kaposi-Novak, J.-S. Lee, L. G`omez-Quiroz, C. Coulouarn, V. M. Factor, and S. S. Thorgeirsson, Met-regulated expression signature defines a subset of human hepatocellular carcinomas with poor prognosis and aggressive phenotype, Journal of Clinical Investigation, 116 (2006), pp. 1582–1595.

S. Karlin and C. Burge, Dinucleotide relative abundance extremes — A genomic signature, Trends in Genetics, 11 (1995), pp. 283–290.

S. Karlin, J. Mrazek, and A. M. Campbell, Compositional biases of bacterial genomes and evolutionary implications, Journal of Bacteriology, 179 (1997), pp. 3899–3913.

K. T. Konstantinidis and J. M. Tiedje, Towards a genome-based taxonomy for prokaryotes, Journal of Bacteriology, 187 (2005), pp. 6258–6264.

S. Kurtz, A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. L. Salzberg, Versatile and open software for comparing large genomes, Genome Biology, 5 (2004), p. 9 pages.

D. R. Maddison, K. S. Schulz, and W. P. Maddison, The Tree of Life Web project, Zootaxa, (2007), pp. 19–40.

S. Mandruzzato, A. Callegaro, G. Turcatel, S. Francescato, M. C. Montesco, V. Chiarion-Sileni, S. Mocellin, C. R. Rossi, S. Bicciato, E. Wang, F. M. Marincola, and P. Zanovello, A gene expression signature associated with survival in metastatic melanoma, Journal of Translational Medicine, 4 (2006).

J. H. Mao, J. Z. Li, T. Jiang, Q. Li, D. Wu, J. Perez-Losada, R. DelRosario, L. Peterson, A. Balmain, and W. W. Cai, Genomic instability in radiation-induced mouse lymphoma from p53 heterozygous mice, Oncogene, 24 (2005), pp. 7924–7934.

H. Marakeby, E. Badr, H. Torkey, Y. Song, S. Leman, C. L. Monteil, L. S. Heath, and B. A. Vinatzer, A system to automatically classify and name any individual genome-sequenced organism independently of current biological classification and nomenclature, PLOS One, 9 (2014), p. 12 pages.

G. Marcais, D. DeBlasio, P. Pandey, and C. Kingsford, Locality-sensitive hashing for the edit distance, Bioinformatics, 35 (2019), pp. I127–I135.

P. Mendiratta and P. Febbo, Genomic signatures associated with the development, progression, and outcome of prostate cancer, Molecular Diagnosis and Therapy, 11 (2007), pp. 345– 354.

M. Mitzenmacher and E. Upfal, Probability and Computing, Cambridge University Press, first ed., 2005.

B. B. Normark, O. P. Judson, and N. A. Moran, Genomic signatures of ancient asexual lineages, Biological Journal of the Linnean Society, 79 (2003), pp. 69–84.

B. D. Ondov, T. J. Treangen, P. Melsted, A. B. Mallonee, N. H. Bergman, S. Koren, and A. M. Phillippy, Mash: Fast genome and metagenome distance estimation using MinHash, Genome Biology, 17 (2016), p. 14 pages.

A. Pati, Graph-based Genomic Signatures, PhD thesis, Virginia Tech, 2008.

A. Pati, L. S. Heath, N. C. Kyrpides, and N. Ivanova, ClaMS: A classifier for metagenomics sequences, Standards in Genomic Sciences, 5 (2011), pp. 248–253.

P. A. Pevzner, DNA physical mapping and alternating Eulerian cycles in colored graphs, Algorithmica, 13 (1995), pp. 77–105.

P. A. Pevzner, H. X. Tang, and M. S. Waterman, An Eulerian path approach to DNA fragment assembly, Proceedings of The National Academy of Sciences of the United States of America, 98 (2001), pp. 9748–9753.

L. Pritchard, R. H. Glover, S. Humphris, J. G. Elphinstone, and I. K. Toth, Genomics and taxonomy in diagnostics for food security: Soft-rotting enterobacterial plant pathogens, Analytical Methods, 8 (2016), pp. 12–24.

B. Raphael, D. G. Zhi, H. X. Tang, and P. Pevzner, A novel method for multiple alignment of sequences with repeated and shuffled elements, Genome Research, 14 (2004), pp. 2336– 2346.

A. L. Rosenberg and L. S. Heath, Graph Separators, With Applications, Frontiers of Computer Science, Kluwer Academic/Plenum Publishers, 2000.

R. Sandberg, C. I. Branden, I. Ernberg, and J. Coster, Quantifying the species specificity in genomics signatures, synonymous codon choice, amino acid usage, and G+C

content, Gene, 311 (2003), pp. 35–42.

E. Solan and N. Vieille, Perturbed Markov chains, Journal of Applied Probability, 40 (2003), pp. 107–122.

H. Teeling, A. Meyerdierks, M. Buaer, R. Amann, and F. O. Glockner, Application of tetranucleotide frequencies for the assignment of genomic fragments, Environmental Microbiology, 6 (2004), pp. 938–947.

L. Tian, C. J. Huang, R. Mazloom, L. S. Heath, and B. A. Vinatzer, Linbase: A Web server for genome-based identification of prokaryotes as members of crowdsourced taxa, Nucleic Acids Research, 48 (2020), pp. W529–W537.

L. Tian, R. Mazloom, L. S. Heath, and B. A. Vinatzer, Linflow: A computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes, Peerj, 9 (2021), p. 17 pages.

V. Urquidia and S. Goodison, Genomic signatures of breast cancer metastasis, Cytogenetic and Genome Research, 118 (2007), pp. 116–129.

M. W. J. van Passel, A. Bart, H. H. Thygesen, A. C. M. Luyf, A. H. C. van Kampen, and A. van der Ende, An acquisition account of genomic islands based on genome signature comparisons, BMC Genomics, 6 (2005), p. 10 pages.

M. W. J. van Passel, E. E. Kuramae, A. C. M. Luyf, A. Bart, and T. Boekhout, The reach of the genome signature in prokaryotes, BMC Evolutionary Biology, 6 (2006), p. 8 pages.

B. A. Vinatzer, H. A. Elmarakeby, A. J. Weisberg, C. L. Monteil, and L. S. Heath, A new exclusively genome-based species-independent taxonomic framework for all life forms applied to Pseudomonas syringae, Phytopathology, 105 (2015), p. 143.

B. A. Vinatzer, L. Tian, and L. S. Heath, A proposal for a portal to make earth’s microbial diversity easily accessible and searchable, Antonie Van Leeuwenhoek International Journal of General and Molecular Microbiology, 110 (2017), pp. 1271–1279.

A. J. Weisberg, H. A. Elmarakeby, L. S. Heath, and B. A. Vinatzer, Similarity based codes sequentially assigned to Ebolavirus genomes are informative of species membership, associated outbreaks, and transmission chains, Open Forum Infectious Diseases, 2 (2015), p. 11 pages.

Y. Zhang and M. S. Waterman, An Eulerian path approach to global multiple alignment for DNA sequences, Journal of Computational Biology, 10 (2003), pp. 803–819.

An Eulerian path approach to local multiple alignment for DNA sequences, Proceedings of The National Academy of Sciences of the United States Of America, 102 (2005), pp. 1285– 1290.

X. F. Zhao, BinDash, software for fast genome distance estimation on a typical personal laptop, Bioinformatics, 35 (2019), pp. 671–673.



  • There are currently no refbacks.

Copyright (c) 2021 Lenwood S. Heath

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Advances in Computing and Engineering
E-ISSN: 2735-5985
P-ISSN: 2735-5977

Published by:

Academy Publishing Center (APC)
Arab Academy for Science, Technology and Maritime Transport (AASTMT)
Alexandria, Egypt