Electroencephalography (EEG) is widely used in clinical diagnosis and human cognitive studies owing to its high temporal resolution. While EEG provides a wealth of physiological information about the brain, data analysis is time-consuming because of the massive volume of EEG recordings and the complexity of the algorithms involved. A typical example is Independent Component Analysis (ICA), which is widely used to separate unwanted noise artifacts from neural signals and to support cortical source localization. ICA is essentially a statistical signal unmixing method: it iteratively updates the unmixing/mixing matrix to maximize statistical independence among the components, and each iteration involves numerous matrix multiplication operations.
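To make the cost concrete, one iteration of the natural-gradient Infomax update can be sketched as below; this is a minimal NumPy illustration (not the accelerated implementation), assuming a logistic nonlinearity and a hypothetical learning rate `lr`. Note that each step is dominated by the matrix products `W @ X` and the update of `W`.

```python
import numpy as np

def infomax_step(W, X, lr=1e-3):
    """One natural-gradient Infomax ICA update (sketch).

    W: (n, n) unmixing matrix; X: (n, t) block of EEG samples.
    """
    n, t = X.shape
    U = W @ X                        # unmix the sources: matrix multiply
    Y = 1.0 / (1.0 + np.exp(-U))     # logistic nonlinearity
    # Natural-gradient update: dW = lr * (I + (1 - 2Y) U^T / t) W
    dW = lr * (np.eye(n) + (1.0 - 2.0 * Y) @ U.T / t) @ W
    return W + dW

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 1000))   # toy 4-channel block
W = infomax_step(np.eye(4), X)
print(W.shape)  # (4, 4)
```

With realistic channel counts and hours of high-rate EEG, these matrix products repeat over many blocks and iterations, which is exactly the workload tensor cores target.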
The tensor core is a hardware unit in most modern NVIDIA GPUs, first introduced in the Volta architecture, and is well suited to accelerating the time-consuming matrix multiplications in the ICA algorithm. Like the well-known CUDA core, the tensor core is a computing unit within the Streaming Multiprocessor (SM), but its inputs are small matrices rather than the single scalar values processed by CUDA cores. Each tensor core provides a 4×4×4 matrix processing array that computes D = A × B + C, where A, B, C, and D are 4×4 matrices. A tensor core can therefore perform 64 floating-point fused multiply-add (FMA) operations per clock cycle, 64 times the throughput of a traditional CUDA core. For algorithms dominated by matrix multiplication and addition, tensor cores can thus deliver several times the performance of CUDA cores alone.
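The Volta-style tensor core operation can be emulated on the CPU to see both its semantics and its numerical behavior: FP16 inputs A and B with FP32 accumulation into C. The sketch below (a NumPy stand-in, not GPU code) rounds the inputs to half precision before multiplying in single precision, which mimics the input quantization that a numerical error analysis must account for.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)).astype(np.float32)
B = rng.standard_normal((4, 4)).astype(np.float32)
C = rng.standard_normal((4, 4)).astype(np.float32)

# Tensor-core-style D = A x B + C: FP16 inputs, FP32 accumulation.
A16 = A.astype(np.float16).astype(np.float32)  # round inputs to half precision
B16 = B.astype(np.float16).astype(np.float32)
D_tc = A16 @ B16 + C

# Full single-precision reference for comparison.
D_ref = A @ B + C
print(np.max(np.abs(D_tc - D_ref)))  # error introduced by FP16 input rounding
```

On the GPU this computation would go through the WMMA API or cuBLAS rather than NumPy; the point here is only the mixed-precision arithmetic that tensor cores implement.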
In this presentation, we will introduce the strategy and implementation details of using tensor cores to accelerate the Infomax ICA algorithm, including performance profiling and comparison as well as numerical error analysis.