Our group has several open lines of investigation that apply informational complexity metrics to different sets of organisms, and we needed a centralised solution to support them. Of these complexity metrics, the most used in our analyses are BioBit and GS.
Genomic Signature (GS) (1) is an informational index based on k-mers. It is calculated by identifying the value of k that yields the most significant separation between the observed distribution of k-mer classes and their theoretical equifrequent expectation. This approach relies on the relative abundances of short oligonucleotides, together with the chaos game representation of the genome. BioBit (2), in turn, is a k-mer-based complexity metric that quantifies the entropy deficit: the difference between the maximal possible entropy for a given k (e.g., in a random genome of equivalent length) and the entropy actually calculated for the genome at that same k.
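The entropy deficit described above can be illustrated with a short Python sketch. It is not the published BioBit definition (2), only a minimal reading of the description given here. The function names `kmer_entropy` and `entropy_deficit` are hypothetical, and the maximal entropy is assumed to be the base-2 logarithm of the smaller of 4^k and the number of k-mer windows in the sequence.

```python
import math
from collections import Counter


def kmer_entropy(seq: str, k: int) -> float:
    """Shannon entropy (in bits) of the observed k-mer distribution of seq."""
    # Overlapping windows; ambiguous bases such as N are not treated specially.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_deficit(seq: str, k: int, alphabet_size: int = 4) -> float:
    """Maximal entropy minus observed entropy for k-mers of length k.

    Assumption: the maximal entropy is log2(min(alphabet_size**k, windows)),
    i.e. every k-mer that can occur in a sequence of this length is
    equally frequent.
    """
    windows = len(seq) - k + 1
    if windows < 1:
        raise ValueError("sequence is shorter than k")
    h_max = math.log2(min(alphabet_size ** k, windows))
    return h_max - kmer_entropy(seq, k)


# A sequence made of a single repeated base has the largest deficit for k = 1,
# while one with equifrequent bases has none.
low_complexity = entropy_deficit("AAAA", 1)
equifrequent = entropy_deficit("ACGT", 1)
```

Under this assumption the deficit is zero when all possible k-mers are equally represented, and it grows as the k-mer distribution becomes more skewed.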
To centralise these analyses, we propose a web application that integrates a database of reference and representative genomes with a set of genomic parameters and the calculation of the two metrics mentioned above.
A workflow was designed to process and filter the genomes available in RefSeq. Genomes with an assembly level of "Chromosome" or higher and a RefSeq category of "Representative genome" or "Reference genome" were kept. The two informational complexity metrics were then calculated, along with several genomic parameters. An automatic update system was designed so that a new version of the database, including the newest genomes available, is created every two months. The entire workflow was implemented using Python, Bash and Django.
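The filtering step can be sketched in Python as follows. This is an illustration under stated assumptions, not the workflow's actual code. It assumes the input is a tab-separated table with the columns `assembly_accession`, `refseq_category` and `assembly_level`, that "Chromosome or higher" means the levels "Chromosome" and "Complete Genome", and that a genome must satisfy both conditions to be kept. The function name `filter_assemblies` and the accessions in the sample table are made up for the example.

```python
import csv
import io

# Assumption: "Chromosome or higher" covers these two assembly levels.
KEPT_LEVELS = {"chromosome", "complete genome"}
KEPT_CATEGORIES = {"reference genome", "representative genome"}


def filter_assemblies(rows):
    """Keep rows that pass both the assembly-level and the category filter."""
    kept = []
    for row in rows:
        level = row["assembly_level"].strip().lower()
        category = row["refseq_category"].strip().lower()
        if level in KEPT_LEVELS and category in KEPT_CATEGORIES:
            kept.append(row)
    return kept


# Made-up sample table; the accessions are placeholders, not real records.
SAMPLE = (
    "assembly_accession\trefseq_category\tassembly_level\n"
    "EXAMPLE_1\treference genome\tComplete Genome\n"
    "EXAMPLE_2\trepresentative genome\tChromosome\n"
    "EXAMPLE_3\trepresentative genome\tScaffold\n"
    "EXAMPLE_4\tna\tComplete Genome\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
kept = filter_assemblies(rows)
kept_accessions = [row["assembly_accession"] for row in kept]
```

Comparing in lower case keeps the filter independent of how the level and category labels are capitalised in the input table.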
