MetAlgNet: Metabolic Pathway Network Reconstruction from Algae Genome Annotation Data

Post-genomic molecular biology relies on high-throughput experimental techniques and is therefore a data-rich field. The goal of this tool is to use freely available biological data on green algae to produce new knowledge of metabolic pathways and to aid the mining of newly generated data. Biological sequence and functional information is stored in many different online databases, so gathering genome annotation from these databases is a challenging step in pathway reconstruction. Here we apply a data integration approach that provides a rich representation and enables text mining of biological data by pathway name, in terms of integrated networks and conceptual spaces. Publicly available annotated genome data of green algae can be used to aid the mining of biologically important enzymes in metabolic networks. We developed an integrative bioinformatics approach that uses publicly available knowledge of enzyme-metabolite interactions together with network topological measures such as betweenness, closeness and degree to assign quantitative importance values to nodes. Applying our software revealed the importance of potential enzymes in biological functions in terms of network centrality values, which were calculated by several algorithms. The results of this work indicate that integrating heterogeneous biological data facilitates advanced data mining for the creation of metabolic pathway networks. The methods can be applied to gain insight into the functions of enzymes, metabolites and other molecules, and to interpret the functional evolution of metabolites with the help of topological analysis and phylogenetic trees reconstructed from sequence data.


Introduction
SciForum http://sciforum.net/conference/mol2net-1

Advances in technology have led to the production of large amounts of biological data, such as high-throughput sequencing data, metabolomics data, transcriptomics data and many more. Metabolic networks are complex because of their size and the presence of bimolecular reactions, so combined knowledge of biology, computer science and graph theory helps in understanding molecular network complexity [1]. Within the biological sciences, one of the primary challenges is to investigate how the collective behavior of cells, tissues or organisms can be understood in terms of the properties of their molecular constituents within a metabolic network [2]. Metabolic networks play an essential role in all biological processes of a living cell, from biochemical pathways to protein interactions and from gene regulation to cellular communication. Traditionally, genes and proteins involved in different functions have been studied in isolation or in small clusters. However, the complex nature of a cell cannot be fully understood by studying individual components in isolation. To investigate the intricate connectivity of cellular systems, the analysis of complex networks has become an important part of molecular biology [3]. The study of cellular systems can be viewed as a combination of omics technologies, data integration, analysis, mining and visualization, often applied iteratively within hypothesis-driven, systematic experimental designs to gain an increased understanding of the structure and dynamics of biological systems [4]. As shown in Fig. 1, integrative bioinformatics starts with the integration of multiple datasets from one or more omics technologies, and possibly from multiple organisms, and forms the basis for systems biology analysis.

MetAlgNet creates a network from annotated data; the purpose of this network is to identify potentially important elements through topological analysis of the generated network. The software creates a random network from a GMT file (see Fig. 3). We created 55 different pathway networks from the standard biological pathways defined in the KEGG database; the main purpose of creating these networks is to draw inferences from them. However, the tool can generate further networks for any given search term.
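The step from a GMT file to a network can be sketched as follows. This is a minimal illustration, not the MetAlgNet implementation: it assumes the usual tab-separated GMT layout (set name, description, then members) and assumes, purely for the example, that each line names a metabolite followed by the enzymes that interact with it. The function name `gmt_to_network` and the sample records are hypothetical.

```python
from collections import defaultdict


def gmt_to_network(gmt_text):
    """Build an undirected bipartite graph (adjacency dict of sets) from GMT text.

    Assumed layout per line: set name <TAB> description <TAB> member1 <TAB> member2 ...
    """
    graph = defaultdict(set)
    for line in gmt_text.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        set_name, members = fields[0], fields[2:]
        for member in members:
            if member:
                # Link the set name and each member in both directions
                graph[set_name].add(member)
                graph[member].add(set_name)
    return dict(graph)


# Hypothetical records, for illustration only
example_gmt = (
    "pyruvate\tglycolysis\tEC:2.7.1.40\tEC:1.2.4.1\n"
    "acetyl-CoA\tcitrate cycle\tEC:1.2.4.1\tEC:2.3.3.1\n"
)
network = gmt_to_network(example_gmt)
```

In this sketch an enzyme shared by two lines (here `EC:1.2.4.1`) becomes the node that connects the two metabolites, which is one simple way such a file could yield a connected enzyme-metabolite network.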
Along with centrality, we also reconstruct a phylogenetic tree from the annotated data of a particular pathway. Here we summarize the pathway results generated by MetAlgNet data mining, considering four main outputs: 1) the network generated for a particular pathway from a specific search term; 2) identification of potential nodes of that network with the help of a node-ranking algorithm; 3) calculation of degree, closeness and betweenness centrality, with bar chart generation; and 4) phylogenetic tree generation from sequence data. The interpretation leads to identification of the major role of particular enzymes and chemical compounds in the network. The networks given below were generated using 1. Chlamydomonas reinhardtii, 2. Ostreococcus lucimarinus, 3. Ostreococcus tauri and 4. Volvox carteri. Collecting all the data from each organism's database tool creates a comprehensive GMT file.
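One way the per-organism annotation records might be merged into a single GMT file, optionally restricted to a pathway search term, is sketched below. The record layout, the function name `build_gmt`, the description text and the sample rows are all assumptions made for illustration; the source does not specify how MetAlgNet assembles the file.

```python
from collections import defaultdict


def build_gmt(records, search_term=None):
    """Group enzyme identifiers by pathway name and return GMT-formatted text.

    records: iterable of (organism, pathway_name, enzyme_id) tuples.
    search_term: optional case-insensitive substring filter on pathway names.
    """
    pathways = defaultdict(set)
    for _organism, pathway, enzyme in records:
        if search_term and search_term.lower() not in pathway.lower():
            continue
        # A set removes duplicates reported by several organisms
        pathways[pathway].add(enzyme)
    lines = []
    for pathway in sorted(pathways):
        members = sorted(pathways[pathway])
        # GMT columns: set name, description, then the members
        lines.append("\t".join([pathway, "merged annotation", *members]))
    return "\n".join(lines)


# Hypothetical annotation rows, for illustration only
records = [
    ("Chlamydomonas reinhardtii", "Glycolysis", "EC:2.7.1.40"),
    ("Volvox carteri", "Glycolysis", "EC:2.7.1.40"),
    ("Ostreococcus tauri", "Glycolysis", "EC:2.7.1.11"),
    ("Ostreococcus lucimarinus", "Citrate cycle", "EC:2.3.3.1"),
]
gmt_text = build_gmt(records, search_term="glyco")
```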
The GMT file is then used to create a network of enzymes and metabolites. The resulting networks showed a surprisingly high level of connectivity across different stages of linear metabolic pathways via enzyme and metabolite interactions. Centrality analysis plays a major role in identifying potential nodes in the network. A very high average closeness value suggests that the network is organized into functional units or modules. A high degree could indicate a central role in a biological network; it may indicate the relevance of a node as functionally capable of holding together other nodes in the network. The betweenness of a node indicates its capability to bring distant nodes into communication within the network (see Fig. 4).
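The three centrality measures named above can be sketched with the standard library alone. The code uses the textbook definitions for an undirected, unweighted graph (betweenness via Brandes' algorithm); the source does not state which algorithms or libraries MetAlgNet uses, so this is an assumption, and the small example graph is hypothetical.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {v: 0.0 for v in graph}
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}


def _bfs_distances(graph, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def closeness_centrality(graph):
    """Reachable nodes divided by the sum of distances to them (0 if isolated)."""
    result = {}
    for v in graph:
        dist = _bfs_distances(graph, v)
        total = sum(dist.values())
        result[v] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


def betweenness_centrality(graph):
    """Unnormalised betweenness computed with Brandes' algorithm."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # BFS from s, counting shortest paths (sigma) and recording predecessors
        sigma = {v: 0 for v in graph}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in graph}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint
    return {v: b / 2 for v, b in score.items()}


# Hypothetical chain: M1 - E1 - M2 - E2 - M3 (M = metabolite, E = enzyme)
graph = {
    "M1": {"E1"},
    "E1": {"M1", "M2"},
    "M2": {"E1", "E2"},
    "E2": {"M2", "M3"},
    "M3": {"E2"},
}
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
betweenness = betweenness_centrality(graph)
```

In this chain the middle metabolite `M2` has the highest betweenness and closeness, even though its degree equals that of the two enzymes, which illustrates why the three measures are reported separately.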

Materials and Methods
The primary requirement for annotation is collecting genome data for the desired organisms. If complete annotation data are not available, the data can be annotated with an available genome annotation pipeline. The raw data were collected from the NCBI Sequence Read Archive, DDBJ or EBI-SRA databases [5][6][7] for 1. Chlamydomonas reinhardtii, 2. Ostreococcus lucimarinus, 3. Ostreococcus tauri and 4. Volvox carteri. For these four algae, a list of data is available, which we downloaded and used in annotation. The major data used in building the database are listed in Table 1. However, many data sets are incomplete, so we tried to avoid using them in this study [8][9][10].
We used permanent draft and completely sequenced data for the study. The study also considers annotated data for the respective algae from other sources. Several community-based genome annotation projects are under way, such as OrcAE (Online Resource for Community Annotation of Eukaryotes) and Phytozome (the JGI annotation resource). Both online databases provide very useful annotation data, including gene locus name, transcript name, protein name, PFAM, Panther ID, KOG, EC number, KEGG Orthology and Gene Ontology [11]. Data from various public data sources were collected into our local database system. Curation of the freely available data from these databases involves several steps.
Multiple molecular biology databases provide descriptions of biological systems at different levels of abstraction. Some common types of biological information, along with the names of the primary databases providing them, are indicated in Fig. 2.

Conclusions
Interaction networks are created by retrieving data from multiple annotated databases, and the MetAlgNet software system visualizes the networks. Integrative text-based mining of data from 24 different databases is facilitated by using the annotated data as raw material for network construction and by visualizing the similarities using different Python libraries.
The MetAlgNet-based data mining approach may facilitate the discovery of novel or unexpected relationships among enzymes and metabolites, the formulation of new hypotheses, data annotation, the interpretation of new experimental data, and the construction and validation of new network-based models of biological systems. Our approach takes advantage of the connectivity of different annotated metabolic data of the respective green algae in the heterogeneous interactome network constructed by MetAlgNet, and shows that a connectivity-based approach is superior to traditional pathway analysis. The findings from this study establish the applicability of our network analysis strategy and support the hypothesis that modeling of local network topology dynamics can be used as an effective tool to study the activity of biological modules.

Omics data are also ever expanding, which poses challenges for updating and mining the data. Data warehousing approaches to data integration are useful and effective from the user's point of view. It is not possible to avoid these problems completely, but taking a standards-based approach to data integration can minimize them. An integrated approach is still lacking in the online biological data available across different databases; it would be better to develop interconnected databases for specific groups of organisms. The diversity of the data, and the fact that not all data sources adopt the standards, forced us to create our own schemas. We adopted a combination of multiple approaches to data integration. Although we imported all the databases into the local warehouse, the individual schemas were kept intact. We created an additional semantic mapping, using a Python cursor and an SQLite database, to facilitate the resolution of entities across databases; this mapping often does not need to change even when a new data source is added. The integration of data across databases and sophisticated queries are handled by Python programs. The data integration technique is applicable more broadly to any organism for which large-scale genome annotation data are available. Because enzyme identifiers are the central entities for data integration in our method, data mining works across different interaction databases that use consistent identifiers.
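The idea of keeping source schemas intact while resolving entities through an additional mapping can be sketched with Python's built-in `sqlite3` module. The table names, column names and rows below are hypothetical and chosen only for illustration; the actual MetAlgNet schemas are not described in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two source tables imported as they are (hypothetical schemas, kept intact)
cur.execute("CREATE TABLE source_a (locus TEXT, ec_no TEXT, pathway TEXT)")
cur.execute("CREATE TABLE source_b (gene_id TEXT, enzyme TEXT, go_term TEXT)")
cur.executemany("INSERT INTO source_a VALUES (?, ?, ?)", [
    ("locus_001", "2.7.1.40", "Glycolysis"),
    ("locus_002", "2.3.3.1", "Citrate cycle"),
])
cur.executemany("INSERT INTO source_b VALUES (?, ?, ?)", [
    ("gene_17", "EC:2.7.1.40", "GO:0004743"),
])

# Additional mapping table: one canonical enzyme identifier per source-specific key
cur.execute(
    "CREATE TABLE entity_map (canonical_ec TEXT, source TEXT, source_key TEXT)"
)
cur.executemany("INSERT INTO entity_map VALUES (?, ?, ?)", [
    ("EC:2.7.1.40", "source_a", "2.7.1.40"),
    ("EC:2.7.1.40", "source_b", "EC:2.7.1.40"),
    ("EC:2.3.3.1", "source_a", "2.3.3.1"),
])

# Resolve the same enzyme across both sources through the mapping
cur.execute("""
    SELECT ma.canonical_ec, a.locus, a.pathway, b.gene_id, b.go_term
    FROM entity_map AS ma
    JOIN source_a AS a ON ma.source = 'source_a' AND a.ec_no = ma.source_key
    JOIN entity_map AS mb ON mb.canonical_ec = ma.canonical_ec
                         AND mb.source = 'source_b'
    JOIN source_b AS b ON b.enzyme = mb.source_key
""")
rows = cur.fetchall()
conn.close()
```

Under this design, adding a further data source would mean importing its table unchanged and adding rows to `entity_map`; the mapping table's structure and the existing source tables would stay as they are, which is one plausible reading of why the mapping rarely needs to change.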