Machine learning (ML) methods in molecular simulations have recently been extended to tensorial properties such as molecular dipole moments and polarizability tensors, enabling the calculation of IR and Raman spectra. In DNA/RNA and protein research, ML methods could enable the automated identification of individual oligomers. Using ML in parallel on surface-enhanced Raman scattering (SERS) sensors and on their simulated models can enhance the detection of single oligomers by analyzing spectral variations linked to environmental interactions and conformational changes in the models.
Molecular dynamics (MD) provides vibrational spectra for various interaction environments and molecular conformations, which are reflected in spectral maps of individual bonds. Oligomers can be identified in different environments with a Random Forest (RF) ML algorithm, as used for experimental Raman spectra. In the MD model of the numerical SERS sensor, we applied the RF algorithm to identify pyrimidine and purine DNA nucleotides from ring-averaged vibrational spectra obtained during translocation through a nanopore in a graphene sheet with Au nanoparticles (1 to 4 NPs) attached to the pore's edge. The RF algorithm recognized the nucleotides from baseline-corrected, ring-averaged, equal-weight vibrational spectra on a dataset of 170 points. The vibrational spectral maps of nucleobase bonds were calculated for the ring averages. We demonstrate that implementing the bond polarizability model (BPM), which assumes that the overall molecular polarizability is a sum over bond contributions, makes it possible to use bond polarizabilities as weighting coefficients for each bond spectrum. The weighting coefficients were approximated from existing literature data on the bond polarizabilities of oligomers. The calculated spectral maps were baseline-corrected as a whole matrix using the SpectroChemPy (SCPy) framework for processing spectroscopic data, with the frequency region below 100 cm⁻¹ masked. A spectral map weighted by bond polarizabilities was added to the averaged spectra in the dataset and used as training and test data for the RF algorithm. With ring-averaged MD data alone, the RF algorithm reproduces differences in nucleotide spectra and identifies the methylated forms of cytosine, but the accuracy is only qualitative. Combining the bond-polarizability-weighted spectral map of the cytosine pyrimidine ring with the ring-averaged spectrum markedly improved the reproduction of the averaged spectrum by the RF algorithm.
The RF algorithm reproduced the mode frequencies and intensities quantitatively, in close agreement with the calculated data.
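
The BPM weighting described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `bpm_weighted_spectrum`, the Gaussian test bands, the frequency grid, the weight values, and the normalization of the weights to unit sum are all assumptions made for illustration. The sketch only shows how per-bond spectra might be combined into one ring spectrum with bond polarizabilities as weights, and how the region below 100 cm⁻¹ could be masked with a plain boolean index (SpectroChemPy is not used here).

```python
import numpy as np


def bpm_weighted_spectrum(bond_spectra, bond_weights):
    """Combine per-bond spectra into one spectrum.

    bond_spectra: array of shape (n_bonds, n_freq), one spectrum per bond.
    bond_weights: array of shape (n_bonds,), e.g. bond polarizabilities.
    Weights are normalized to sum to 1 (an assumption of this sketch),
    so equal weights reduce to the plain ring average.
    """
    spectra = np.asarray(bond_spectra, dtype=float)
    weights = np.asarray(bond_weights, dtype=float)
    if spectra.ndim != 2 or weights.shape != (spectra.shape[0],):
        raise ValueError("expected (n_bonds, n_freq) spectra and (n_bonds,) weights")
    total = weights.sum()
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return (weights / total) @ spectra


# Hypothetical frequency grid: 0 to 2000 cm^-1 in 5 cm^-1 steps.
freq = np.linspace(0.0, 2000.0, 401)


def gaussian_band(center, width=20.0):
    """Synthetic single-band spectrum used only as test input."""
    return np.exp(-0.5 * ((freq - center) / width) ** 2)


# Three synthetic "bond" spectra standing in for an MD spectral map.
bond_spectra = np.vstack(
    [gaussian_band(800.0), gaussian_band(1200.0), gaussian_band(1600.0)]
)

# Placeholder weights, NOT literature bond polarizabilities.
placeholder_weights = [1.0, 2.0, 1.0]

equal = bpm_weighted_spectrum(bond_spectra, np.ones(3))
weighted = bpm_weighted_spectrum(bond_spectra, placeholder_weights)

# Drop the region below 100 cm^-1, mirroring the masking step in the text.
mask = freq >= 100.0
equal_masked = equal[mask]
weighted_masked = weighted[mask]
```

With equal weights the function returns the ring average, so the equal-weight and polarizability-weighted spectra can be built by the same routine and placed side by side in a dataset for a classifier or regressor such as RF.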