Application of Random Forest ML algorithm to spectral recognition of MD vibrational spectra of nucleotides SERS sensor model
1  Toyama University, Faculty of Engineering
Academic Editor: Jun-Jie Zhu

Abstract:

Research on identifying DNA/RNA and proteins down to the single-oligomer level has advanced significantly. Nanopore-inspired systems have been extensively developed for genome sequencing and are being adapted for protein sequencing. Surface-enhanced Raman scattering (SERS) can detect single oligomers, but the spectra vary with environment and conformation. Machine learning (ML) methods have been applied successfully to such spectral measurements. Molecular dynamics (MD) can provide simulated vibrational spectra in various environments. Oligomers can be identified with the Random Forest (RF) ML algorithm, which has shown high accuracy on experimental SERS data. We investigate the applicability of the RF algorithm to identifying nucleotides from their vibrational spectra in an MD model of a SERS sensor. The ring-averaged vibrational spectra of the DNA nucleotides were used. The spectra were obtained for nucleotides interacting with a system of Au nanoparticles attached to a graphene sheet with a nanopore. The first step was a baseline correction that removes the decay component of the velocity correlation function present in the MD vibrational spectra. This adjusts the intensities, because at low frequencies the peak intensities become comparable with the subtracted decay component. A 20-point B-spline and a piecewise-linear baseline correction were tested. The frequencies f, the amplitudes I, and the differences between adjacent grid points were used as the training set for the RF algorithm, which reached an accuracy of 93-96% on a grid of about 170 spectral points. The RF algorithm identifies the methylated forms of cytosine and reproduces the differences between nucleotide spectra. However, for nucleotides the lower-frequency part of the spectra (<1000 cm⁻¹) is reproduced more reliably than the higher-frequency part above 1000 cm⁻¹.
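The preprocessing described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of baseline anchors (the minimum intensity in each of 20 equal segments), and the reading of "differences for adjacent grid points" as differences of baseline-corrected amplitudes are all assumptions made for illustration. The sketch estimates a piecewise-linear baseline, subtracts it, and assembles a feature vector from f, I, and the adjacent differences.

```python
from bisect import bisect_right


def piecewise_linear_baseline(freqs, intens, n_anchors=20):
    """Piecewise-linear baseline through per-segment minima.

    Assumption: anchors are the lowest-intensity point of each of
    n_anchors equal segments; the abstract does not say how anchors
    were chosen. freqs must be strictly increasing.
    """
    n = len(freqs)
    anchor_f, anchor_i = [], []
    for s in range(n_anchors):
        lo, hi = s * n // n_anchors, (s + 1) * n // n_anchors
        if hi <= lo:
            continue  # skip empty segments on very short grids
        k = min(range(lo, hi), key=lambda i: intens[i])
        anchor_f.append(freqs[k])
        anchor_i.append(intens[k])

    baseline = []
    for f in freqs:
        if f <= anchor_f[0]:
            baseline.append(anchor_i[0])    # hold flat before first anchor
        elif f >= anchor_f[-1]:
            baseline.append(anchor_i[-1])   # hold flat after last anchor
        else:
            j = bisect_right(anchor_f, f) - 1
            t = (f - anchor_f[j]) / (anchor_f[j + 1] - anchor_f[j])
            baseline.append(anchor_i[j] + t * (anchor_i[j + 1] - anchor_i[j]))
    return baseline


def correct_and_featurize(freqs, intens, n_anchors=20):
    """Return [f..., I_corrected..., adjacent differences of I_corrected...]."""
    baseline = piecewise_linear_baseline(freqs, intens, n_anchors)
    corrected = [y - b for y, b in zip(intens, baseline)]
    diffs = [corrected[i + 1] - corrected[i] for i in range(len(corrected) - 1)]
    return list(freqs) + corrected + diffs


# Hypothetical 170-point grid with a flat background and one sharp peak.
grid = [10.0 * i for i in range(170)]
spectrum = [5.0] * 170
spectrum[85] += 10.0
features = correct_and_featurize(grid, spectrum)
```

One feature vector per spectrum, stacked into a matrix with nucleotide labels, could then be passed to a Random Forest classifier such as scikit-learn's `RandomForestClassifier`; that step is omitted here because the abstract gives no training details.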

Keywords: Vibrational spectra; DNA nucleotides; Random Forest ML algorithm; SERS
