Bioinformatical characterization and machine learning classification of seed storage proteins

Hyukjin Kwon; Yonghui Li

¹ Grain Science and Industry at Kansas State University

Academic Editor: Antonello Santini

Published: 25 October 2024 by MDPI in The 5th International Electronic Conference on Foods session Application of Artificial Intelligence (AI) and Machine Learning in The Food Industry

Abstract:

Introduction: Seed storage proteins have traditionally been classified using T.B. Osborne's sequential extraction method, categorizing them into albumins (water-soluble), globulins (salt-soluble), prolamins (alcohol-soluble), and glutelins (acid/base-soluble). While this classification system is widely used, the molecular differences between these protein classes remain unclear. Therefore, this study aims to identify the distinct properties of proteins in each class to provide insights into their solubility in different systems. Additionally, this study focuses on constructing an efficient classification model to discriminate between Osborne classes, which could aid in the design of transgenic proteins with desirable functional and nutritional properties.

Methods: The physicochemical properties of 898 seed storage proteins from 175 species were characterized using both protein sequences and structures predicted by AlphaFold 2. Bioinformatics tools such as ChimeraX and Quilt were used to extract key features. Classification models, including linear discriminant analysis (LDA), support vector machine (SVM), and k-nearest neighbors (KNN), were developed to categorize these proteins into Osborne classes. Additionally, all-atom and coarse-grained molecular dynamics simulations (MDS) were employed to study the behavior of selected model proteins in different solvent systems.
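
As an illustration of the kind of sequence-derived feature mentioned above, the following minimal sketch computes two simple descriptors from an amino acid sequence: the mean Kyte-Doolittle hydropathy (often called GRAVY) and the fraction of sulfur-containing residues (Cys and Met). The function names `gravy` and `sulfur_fraction` are hypothetical, and using the Cys + Met fraction as a proxy for sulfur content is an assumption; the abstract does not state which descriptors or tools produced the study's features.

```python
# Kyte-Doolittle hydropathy scale (standard published values).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _standard_residues(seq):
    """Return the residues of seq that are among the 20 standard amino acids."""
    residues = [aa for aa in seq.upper() if aa in KD]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return residues


def gravy(seq):
    """Mean Kyte-Doolittle hydropathy; higher values are more hydrophobic."""
    residues = _standard_residues(seq)
    return sum(KD[aa] for aa in residues) / len(residues)


def sulfur_fraction(seq):
    """Fraction of residues that are Cys or Met (assumed sulfur proxy)."""
    residues = _standard_residues(seq)
    return sum(aa in "CM" for aa in residues) / len(residues)
```

Descriptors like these could form two columns of a feature table, alongside structure-derived features, before any classifier is trained.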

Results: Among the four Osborne classes (albumin, globulin, prolamin, and glutelin), albumins and prolamins exhibited distinct characteristics, with albumins showing higher sulfur content and prolamins having greater hydrophobicity. Globulins and glutelins had similar profiles. Non-linear SVM performed best, achieving 96.02% accuracy on an independent test set, whereas conventional methods like PCA and PLS-DA were less effective. MDS revealed how solvent environments impact protein dynamics and aggregation.
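
To make the idea of evaluating a classifier on a held-out test set concrete, here is a minimal sketch of k-nearest neighbors, one of the model families named in the Methods, using only the Python standard library. It is not the authors' pipeline, and it is not the non-linear SVM that performed best. The helper names `knn_predict` and `accuracy` are hypothetical, and the feature values in `toy_train` and `toy_test` are invented for illustration; they are not taken from the study's dataset.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of held-out points whose predicted label matches the true one."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


# Invented toy points: (mean hydropathy, sulfur fraction). Not real data.
toy_train = [
    ((-0.6, 0.09), "albumin"), ((-0.5, 0.08), "albumin"), ((-0.7, 0.10), "albumin"),
    ((0.3, 0.02), "prolamin"), ((0.4, 0.03), "prolamin"), ((0.2, 0.02), "prolamin"),
]
toy_test = [((-0.55, 0.09), "albumin"), ((0.35, 0.02), "prolamin")]

toy_accuracy = accuracy(toy_train, toy_test, k=3)
```

In practice, features on different scales would usually be standardized before computing distances, and the test set would be kept separate from any model selection.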

Conclusion: This research provides an effective model for identifying seed storage proteins and a fundamental dataset on their characteristics. It also shows how model plant proteins behave in different solvent systems. This study therefore offers insights that can be applied to improve crop quality, protein engineering, and the understanding of plant protein behavior in various environments.

Keywords: Seed storage proteins; Bioinformatics; Machine learning; Molecular dynamics simulation; Osborne fractionation
