Bioinformatical characterization and machine learning classification of seed storage proteins
1  Grain Science and Industry at Kansas State University
Academic Editor: Antonello Santini

Abstract:

Introduction: Seed storage proteins have traditionally been classified using T.B. Osborne's sequential extraction method, categorizing them into albumins (water-soluble), globulins (salt-soluble), prolamins (alcohol-soluble), and glutelins (acid/base-soluble). While this classification system is widely used, the molecular differences between these protein classes remain unclear. Therefore, this study aims to identify the distinct properties of proteins in each class to provide insights into their solubility in different systems. Additionally, this study focuses on constructing an efficient classification model to discriminate between Osborne classes, which could aid in the design of transgenic proteins with desirable functional and nutritional properties.

Methods: The physicochemical properties of 898 seed storage proteins from 175 species were characterized using both protein sequences and structures predicted by AlphaFold 2. Bioinformatics tools such as ChimeraX and Quilt were used to extract key features. Classification models, including linear discriminant analysis (LDA), support vector machine (SVM), and k-nearest neighbors (KNN), were developed to assign these proteins to Osborne classes. Additionally, all-atom and coarse-grained molecular dynamics simulations (MDS) were used to study the behavior of selected model proteins in different solvent systems.
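As a rough illustration of the classification step, the sketch below implements a minimal k-nearest-neighbors classifier using only the Python standard library. It is not the pipeline used in this study: the two features (sulfur fraction and mean hydropathy), the toy values, and the helper names `fit_scaler`, `scale`, and `knn_predict` are all hypothetical, and KNN is shown only because it is the simplest of the three model families named above.

```python
import math
from collections import Counter

# Hypothetical toy data: [sulfur_fraction, mean_hydropathy] per protein.
# These numbers are invented for illustration and are not from the study.
train_X = [
    [0.08, -0.5], [0.09, -0.4], [0.07, -0.6],   # labeled "albumin"
    [0.02, 0.3], [0.01, 0.4], [0.02, 0.2],      # labeled "prolamin"
]
train_y = ["albumin"] * 3 + ["prolamin"] * 3


def fit_scaler(X):
    """Return per-feature means and population standard deviations."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return means, stds


def scale(row, means, stds):
    """Z-score one feature vector so no single feature dominates the distance."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def knn_predict(X, y, query, k=3):
    """Majority vote among the k training points nearest to the query."""
    means, stds = fit_scaler(X)
    q = scale(query, means, stds)
    ranked = sorted(
        (math.dist(scale(row, means, stds), q), label)
        for row, label in zip(X, y)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_predict(train_X, train_y, [0.08, -0.45]))
    print(knn_predict(train_X, train_y, [0.015, 0.35]))
```

Standardizing the features before computing distances matters here because the two features are on very different scales; a real dataset with many structural and sequence features would need the same care, plus a held-out test set for evaluation.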

Results: Among the four Osborne classes (albumin, globulin, prolamin, and glutelin), albumins and prolamins exhibited distinct characteristics, with albumins showing higher sulfur content and prolamins having greater hydrophobicity. Globulins and glutelins had similar profiles. Non-linear SVM performed best, achieving 96.02% accuracy on an independent test set, whereas conventional methods like PCA and PLS-DA were less effective. MDS revealed how solvent environments impact protein dynamics and aggregation.
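The two sequence-level properties highlighted above can be approximated directly from an amino acid sequence. The following sketch assumes that "sulfur content" is measured as the fraction of cysteine and methionine residues and that hydrophobicity is summarized as the mean Kyte-Doolittle hydropathy (GRAVY); the study may define these features differently, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale (one value per standard amino acid).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _standard_residues(seq):
    """Upper-case the sequence and keep only the 20 standard residues."""
    return [aa for aa in seq.upper() if aa in KYTE_DOOLITTLE]


def sulfur_fraction(seq):
    """Fraction of residues that are Cys or Met (assumed proxy for sulfur content)."""
    residues = _standard_residues(seq)
    if not residues:
        raise ValueError("sequence contains no standard residues")
    return sum(aa in "CM" for aa in residues) / len(residues)


def gravy(seq):
    """Mean Kyte-Doolittle hydropathy; higher values indicate a more hydrophobic sequence."""
    residues = _standard_residues(seq)
    if not residues:
        raise ValueError("sequence contains no standard residues")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real seed storage protein.
    toy = "MCCKQLLVIAAC"
    print(sulfur_fraction(toy), gravy(toy))
```

Features like these could feed a classifier as columns of a per-protein feature table; structure-derived features from predicted models would require separate tooling.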

Conclusion: This research provides an effective model for identifying seed storage proteins and a foundational dataset of their characteristics. It also shows how model plant proteins behave in different solvent systems. The study therefore offers insights that can be applied to improve crop quality, guide protein engineering, and deepen understanding of plant protein behavior in various environments.

Keywords: Seed storage proteins; Bioinformatics; Machine learning; Molecular dynamics simulation; Osborne fractionation

 
 