Peptide Sequencing using Neural Machine Translation based on Sequence-2-Sequence Architecture and Long Short-Term Memory Networks
1  University of Glasgow
2  James Watt School of Engineering
Academic Editor: Jean-Marc Laheurte

https://doi.org/10.3390/ecsa-11-20402
Abstract:

Mass spectrometry is the most reliable and accurate approach for analyzing a complex biological sample and identifying its protein content, but it is time-consuming and reasonably expensive. One possible way to overcome these limitations is to use potentiometric sensors based on transistors. However, for such technology to work, a protein database containing information on billions of small peptides and amino acids (AA) is required. The only practical way to build such a database is to use machine learning, and this paper presents the initial steps towards achieving this aim. This study sheds light on a possible new approach to peptide sequencing that combines analytical simulations with Large Language Models (LLMs) based on a Sequence-2-Sequence (Seq-2-Seq) architecture built from Long Short-Term Memory (LSTM) networks. A total of 11,573 tokenized data points (voltage and capacitance cross-over points) with a vocabulary size of 504 are fed to the model; 80% of the data is used for training and validation, and 20% for testing. The model is evaluated on unseen data and achieves a test accuracy of 71.74%, which is significant compared to expensive and time-consuming conventional methods such as mass spectrometry. In conclusion, the results of this study show that the proposed Seq-2-Seq LLM architecture could be used to build a material database for a potentiometric sensor, replacing the mass spectrometry method.
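The data pipeline described above (tokenizing voltage/capacitance cross-over points into a fixed vocabulary, then holding out 20% for testing) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the grid of voltage and capacitance values is hypothetical, chosen only so that the number of distinct (voltage, capacitance) pairs matches the stated vocabulary size of 504, and all variable names are assumptions.

```python
import random

# Hypothetical raw readings: each pairs a voltage with a capacitance
# cross-over point. The 24 x 21 grid is illustrative and yields 504
# distinct pairs, matching the vocabulary size reported in the abstract.
raw_readings = [(round(0.1 * v, 1), round(0.01 * c, 2))
                for v in range(1, 25) for c in range(1, 22)]

# Tokenization: every distinct (voltage, capacitance) pair is assigned
# one integer token ID, so the vocabulary maps cross-over points to tokens.
vocab = {pair: idx for idx, pair in enumerate(sorted(set(raw_readings)))}
tokens = [vocab[pair] for pair in raw_readings]

# 80/20 split: 80% of the tokenized data for training and validation,
# 20% held out as an unseen test set, as described in the abstract.
random.seed(0)
random.shuffle(tokens)
split = int(0.8 * len(tokens))
train_val, test = tokens[:split], tokens[split:]

print(len(vocab), len(train_val), len(test))  # prints: 504 403 101
```

In a real Seq-2-Seq setup, `train_val` would feed an LSTM encoder-decoder; the split here simply mirrors the 80/20 partition the study reports.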

Keywords: Neural Machine Translation; Large Language Models; Peptide Sequencing; Amino Acids; Long Short-Term Memory