Judging from the literature and scientific forums, no software is available that explores the diversity of a database or a sequence subset by applying the similarity measures reported to delimit the twilight zone according to all the previously mentioned thresholds. So far, to retrieve several similarity measures such as identity, similarity and scores in an all-vs-all pairwise sequence comparison, users must first run software such as needle (global alignment), water (local alignment), blast (local alignment) or even multiple sequence alignment (MSA) tools (http://imed.med.ucm.es/Tools/sias.html), and then parse the results to present them in an n×n matrix. However, going through all these steps to reach the final similarity matrix requires programming skills.
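To make the all-vs-all step concrete, here is a minimal Python sketch of how pairwise identities can be collected into an n×n matrix. It is only an illustration under stated assumptions: it uses a toy Needleman-Wunsch global alignment with made-up scoring values (match=1, mismatch=-1, linear gap=-2) instead of calling needle, water or blast, and the function names `global_identity` and `identity_matrix` are hypothetical, not part of any of those tools.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Percent identity over the columns of a toy global alignment.

    Assumption: simple linear gap penalty and match/mismatch scores,
    not the substitution matrices used by needle or blast.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting identical columns.
    i, j, identical, length = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        length += 1
    return 100.0 * identical / length if length else 0.0


def identity_matrix(seqs):
    """Symmetric n x n matrix of pairwise percent identities."""
    n = len(seqs)
    matrix = [[100.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Align each pair once and mirror the value.
            matrix[i][j] = matrix[j][i] = global_identity(seqs[i], seqs[j])
    return matrix


if __name__ == "__main__":
    for row in identity_matrix(["ACGT", "ACGT", "ACGA"]):
        print(row)
```

In practice the alignment itself would come from an established tool; the sketch only shows the shape of the final matrix that has to be assembled from the pairwise results.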
Here, we present SeqDivA, a Python-based tool with a friendly GUI that allows non-expert users to run alignment algorithms (water, needle and blast) to compare protein, DNA and RNA sequences all vs all. SeqDivA provides similarity, identity and bit-score matrices to explore the diversity/homology of the sequences, enabling the delimitation of the twilight zone. The resulting matrices are visualized as dot plot-like graphs representing the pairwise similarity measures (identity, similarity and bit-score). SeqDivA also allows redundancy reduction by exploring amino acid identities from global alignments, and it can be connected to the output of software that simulates related sequences with a known evolutionary history, i.e., ROSE [1] and INDELible [2], in order to get subsets of homologous sequences at different identity or bit-score ranges. The software can be freely downloaded at https://github.com/eancedeg/SeqDivA. It was published as part of the paper available at https://doi.org/10.3390/biom10010026
1- Stoye, J., D. Evers, and F. Meyer, Rose: generating sequence families. Bioinformatics, 1998. 14(2): p. 157-163.
2- Fletcher, W. and Z. Yang, INDELible: a flexible simulator of biological sequence evolution. Mol Biol Evol, 2009. 26(8): p. 1879-1888.
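As a rough illustration of redundancy reduction from an identity matrix, the following Python sketch keeps a sequence only if its identity to every already-kept sequence is below a cutoff. This greedy filter is an assumption chosen for simplicity, not necessarily the procedure SeqDivA implements; the function name `reduce_redundancy` and the example matrix values are hypothetical.

```python
def reduce_redundancy(identity, cutoff):
    """Return indices of a non-redundant subset.

    identity: square matrix (list of lists) of pairwise percent identities.
    cutoff:   sequences with identity >= cutoff to a kept one are dropped.
    Assumption: greedy, order-dependent filtering (first sequence wins).
    """
    kept = []
    for i in range(len(identity)):
        # Keep sequence i only if it is below the cutoff against all kept ones.
        if all(identity[i][k] < cutoff for k in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Hypothetical identities for three sequences.
    example = [
        [100.0, 95.0, 30.0],
        [95.0, 100.0, 40.0],
        [30.0, 40.0, 100.0],
    ]
    print(reduce_redundancy(example, cutoff=90.0))
```

Because the filter is order-dependent, sorting the sequences beforehand (for example by length) changes which representative of each redundant group survives.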
An accurate definition of the twilight zone for alignment algorithms will be an important step toward applying alternative methodologies for homology detection. In this sense, machine learning-based models are useful tools for remote homology detection in the twilight zone, where alignment-based algorithms start to fail. Artificial intelligence will play a crucial role in addressing many bioinformatics problems not solved by the standard methodologies.
Thanks for your comments
Is there a market niche for a spin-off or start-up launching software based on this kind of model?
Have you ever considered becoming an entrepreneur?
Thanks for your interest in the software.
Currently, the software is publicly available for academic purposes. In the future we will probably incorporate some improvements to handle genomic and transcriptomic data (big data), so that it could take a step forward and be incorporated into a pipeline analyzing genomic/transcriptomic and proteomic data. We would like to found a small enterprise :)