Peptide Sequencing using Neural Machine Translation based on Sequence-2-Sequence Architecture and Long Short-Term Memory Networks
1  University of Glasgow
2  James Watt School of Engineering
Academic Editor: Jean-Marc Laheurte

https://doi.org/10.3390/ecsa-11-20402
Abstract:

Mass spectrometry is the most reliable and accurate approach for analyzing a complex biological sample and identifying its protein content, but it is time-consuming and reasonably expensive. One possible way to overcome these limitations is to use potentiometric sensors based on transistors. However, for such technology to work, a protein database containing information on billions of small peptides and amino acids (AA) is required. The only practical way to build such a database is to use machine learning, and this paper presents the initial steps towards achieving this aim. This study sheds light on a possible new approach to peptide sequencing that combines analytical simulations with Large Language Models (LLMs) based on a Sequence-2-Sequence (Seq-2-Seq) architecture built from Long Short-Term Memory (LSTM) networks. A total of 11,573 tokenized data points (voltage and capacitance cross-over points) with a vocabulary size of 504 are fed to the model; 80% of the data is used for training and validation, and 20% for testing. The model is evaluated on unseen data and achieves a test accuracy of 71.74%, which is significant compared to expensive and time-consuming conventional methods such as mass spectrometry. In conclusion, the results of this study show that the proposed Seq-2-Seq LLM architecture could be used to build a material database for a potentiometric sensor, replacing the mass spectrometry method.
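The data pipeline described above (tokenizing voltage/capacitance cross-over points into a fixed vocabulary, then holding out 20% for testing) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the grid of voltage and capacitance values is hypothetical, chosen only so that the number of distinct (voltage, capacitance) pairs matches the stated vocabulary size of 504, and all variable names are assumptions.

```python
import random

# Hypothetical raw readings: each pairs a voltage with a capacitance
# cross-over point. The 24 x 21 grid is illustrative and yields 504
# distinct pairs, matching the vocabulary size reported in the abstract.
raw_readings = [(round(0.1 * v, 1), round(0.01 * c, 2))
                for v in range(1, 25) for c in range(1, 22)]

# Tokenization: every distinct (voltage, capacitance) pair is assigned
# one integer token ID, so the vocabulary maps cross-over points to tokens.
vocab = {pair: idx for idx, pair in enumerate(sorted(set(raw_readings)))}
tokens = [vocab[pair] for pair in raw_readings]

# 80/20 split: 80% of the tokenized data for training and validation,
# 20% held out as an unseen test set, as described in the abstract.
random.seed(0)
random.shuffle(tokens)
split = int(0.8 * len(tokens))
train_val, test = tokens[:split], tokens[split:]

print(len(vocab), len(train_val), len(test))  # prints: 504 403 101
```

In a real Seq-2-Seq setup, `train_val` would feed an LSTM encoder-decoder; the split here simply mirrors the 80/20 partition the study reports.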

Keywords: Neural Machine Translation; Large Language Models; Peptide Sequencing; Amino Acids; Long Short-Term Memory