Probabilistic Programming for Spectroscopic Data Analysis: Applied to Vibrational Spectroscopy

  • J.P.C. (Johan) van Nispen

Student thesis: Master's Thesis

Abstract

Chemometrics is the science of extracting and interpreting chemical data using mathematical and statistical methods. One application of chemometrics is the analysis of spectroscopic data for the classification of chemical mixtures. To improve the performance of the data analysis, a common step is to pre-process the raw data to remove unwanted artefacts originating from instrumental and experimental sources. Data pre-processing usually involves multiple steps, and no clear guidelines exist on how to achieve an optimal result. Moreover, choosing the wrong order of steps can even decrease the data analysis performance.
Results from earlier research show that a simple convolutional neural network can surpass the performance of standard chemometric methods on raw vibrational spectral data, and that including a pre-processing step increases the analysis performance even further. A drawback of using neural networks for data analysis is their limited interpretability, whereas model interpretability is considered an important requirement for application within the chemical domain. In addition, spectral datasets often contain a low number of samples, which inherently limits neural network performance.
A potential alternative to neural networks for spectroscopic data analysis is probabilistic modelling. Probabilistic models are constructed from hidden and observed parameters, which are described by probability distributions. After construction, in a process known as inference, the probabilistic model is conditioned on observed data, and the model's prior probability distribution is updated to its posterior probability distribution. Probabilistic modelling applied within general-purpose programming is known as probabilistic programming. The main objective of this research is to explore the usefulness of probabilistic programming for spectroscopic data analysis.
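The update from prior to posterior described above can be illustrated with a minimal sketch. It is not the model developed in the thesis: it assumes a hypothetical single Gaussian peak whose amplitude is the only hidden parameter, a normal prior on that amplitude, a known noise level, and a simple grid approximation in place of a probabilistic programming framework. All names and numbers are illustrative assumptions.

```python
import math

def gaussian_peak(x, amplitude, center, width):
    """Intensity of one Gaussian-shaped peak at position x."""
    return amplitude * math.exp(-0.5 * ((x - center) / width) ** 2)

def log_normal_pdf(value, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - 0.5 * ((value - mean) / sd) ** 2

def grid_posterior(xs, ys, amplitudes, prior_mean, prior_sd, center, width, noise_sd):
    """Posterior probability of each candidate amplitude on a grid.

    Assumption: only the amplitude is unknown; center, width and
    noise level are treated as known.
    """
    log_post = []
    for a in amplitudes:
        # Log prior for this candidate amplitude.
        lp = log_normal_pdf(a, prior_mean, prior_sd)
        # Add the log likelihood of every observed intensity.
        for x, y in zip(xs, ys):
            lp += log_normal_pdf(y, gaussian_peak(x, a, center, width), noise_sd)
        log_post.append(lp)
    # Normalise in a numerically stable way.
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical observed spectrum: noise-free intensities from amplitude 2.0.
xs = [i * 0.5 for i in range(21)]
ys = [gaussian_peak(x, 2.0, 5.0, 1.0) for x in xs]

# Candidate amplitudes from 0 to 4, with a prior centred on 1.0.
amplitudes = [i * 0.05 for i in range(81)]
posterior = grid_posterior(xs, ys, amplitudes, prior_mean=1.0, prior_sd=1.0,
                           center=5.0, width=1.0, noise_sd=0.1)
posterior_mean = sum(a * p for a, p in zip(amplitudes, posterior))
```

In this sketch the prior is centred on an amplitude of 1.0, while the data were generated with an amplitude of 2.0; after conditioning on the data, the posterior mass concentrates near 2.0. A probabilistic programming framework automates this step for models with many hidden parameters, where a grid is no longer feasible.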
The inference results on spectroscopic datasets show that the probabilistic model developed during this research can capture broad spectral features of a vibrational spectrum, but that the characteristic spectral features needed for further data analysis remain largely unnoticed. Furthermore, the inferred noise level, which represents random measurement noise, is found to be much higher than the noise level in the observed spectroscopic data, which further decreases the usefulness of the model for spectroscopic data analysis.
To gain insight into the underlying cause of the high inferred noise levels, it was investigated how induced misalignments between the probabilistic model and the data affect the inference outcome. For this purpose, a dataset generator was built that can generate spectral datasets. In a set of scenarios, the effects of induced misalignments between model and data on the parameter inference outcome were systematically investigated. The major effect observed is that the inferred noise level increases as the misalignment between the probabilistic model and the data grows.
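The relationship between misalignment and inferred noise can be sketched with a toy experiment. This is only an illustration under stated assumptions, not the thesis's dataset generator or inference procedure: the hypothetical generator below produces a single Gaussian peak with seeded random noise, and a least-squares amplitude fit with a deliberately shifted peak centre stands in for Bayesian inference. The residual standard deviation serves as a stand-in for the inferred noise level.

```python
import math
import random

def gaussian_peak(x, amplitude, center, width):
    """Intensity of one Gaussian-shaped peak at position x."""
    return amplitude * math.exp(-0.5 * ((x - center) / width) ** 2)

def generate_spectrum(xs, peaks, noise_sd, rng):
    """Hypothetical generator: sum of Gaussian peaks plus random noise.

    Each entry of `peaks` is an (amplitude, center, width) tuple.
    """
    return [sum(gaussian_peak(x, *p) for p in peaks) + rng.gauss(0.0, noise_sd)
            for x in xs]

def residual_noise_sd(xs, ys, center, width):
    """Fit a single-peak model with fixed center and width by least squares
    and return the standard deviation of the residuals.

    Assumption: this residual spread is a simple stand-in for the noise
    level a probabilistic model would infer.
    """
    basis = [gaussian_peak(x, 1.0, center, width) for x in xs]
    amplitude = sum(b * y for b, y in zip(basis, ys)) / sum(b * b for b in basis)
    residuals = [y - amplitude * b for b, y in zip(basis, ys)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

rng = random.Random(0)                      # fixed seed for reproducibility
xs = [i * 0.1 for i in range(101)]          # measurement axis from 0 to 10
ys = generate_spectrum(xs, [(1.0, 5.0, 0.5)], noise_sd=0.02, rng=rng)

# Induce a growing misalignment by shifting the model's peak centre.
shifts = [0.0, 0.2, 0.5, 1.0]
noise_estimates = [residual_noise_sd(xs, ys, 5.0 + s, 0.5) for s in shifts]
```

With a correctly aligned model, the residual spread stays close to the true noise level of 0.02. As the peak centre is shifted, the unexplained signal is absorbed into the residuals, so the estimated noise level grows, which mirrors the qualitative effect reported above.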
It is concluded that the current probabilistic model is not yet ready to be used for data analysis on real-world spectroscopic datasets. A list of recommendations on how to improve the model is provided as future work.
Date of Award: 23 Jun 2020
Original language: English
Supervisor: Twan van Laarhoven (Examinator) & Arjen Hommersom (Co-assessor)
