Seminar Series in Quantitative Life Sciences and Medicine
Getting more out of mass spectrometry-based proteomics using supervised learning approaches and on-the-fly data analysis
Mathieu Lavallee (University of Ottawa)
Tuesday February 5, 12-1pm
Montreal Neurological Institute, deGrandpre Communications Centre
Abstract: Mass spectrometry-based proteomics is widely used to identify proteins in complex biological samples. Current proteomics approaches generate hundreds of thousands of mass spectra, yet, on average, only 25% of the mass spectra acquired in a mass spectrometry experiment are computationally matched to protein sequences. Furthermore, since this computational matching typically takes place after mass spectrometry data acquisition, many abundant proteins are analyzed more than is necessary for a confident identification, leaving little mass spectrometry time for the analysis of lower-abundance proteins. Increasing protein identification sensitivity is critical to provide a comprehensive understanding of the underlying biology of complex samples. Protein-protein interactions contain information that can improve the protein identification rate in mass spectrometry, yet most current algorithms do not use it. We therefore propose a novel machine learning algorithm that assesses the confidence of protein identifications using mass spectrometry data features and confidence scores along with protein-protein interaction data. Our approach is based on the hypothesis that the confidence of the identification of a given protein P in a sample increases when proteins interacting with P are also observed in the same sample. Upon benchmarking against a state-of-the-art approach, our algorithm identifies more spectra, peptides and proteins at low false discovery rates. In addition, to improve the identification sensitivity of low-abundance proteins, we designed a machine learning classifier that evaluates the reliability of protein identifications on the fly, as mass spectra are acquired. Proteins that are deemed confidently identified are excluded from further analysis in real time, saving mass spectrometry resources for lower-abundance proteins.
We show in silico that our approach can identify a similar number of proteins using significantly less mass spectrometry time than a traditional proteomics analysis, thereby freeing resources for more protein identifications. Together, these algorithms improve our ability to identify proteins in complex samples and will provide a more comprehensive understanding of the biological mechanisms of the cell.
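
To make the two ideas in the abstract concrete, here is a minimal illustrative sketch in Python. It is not the speaker's algorithm: the function names, the additive `weight` bonus, the `threshold` value, and the toy scores are all assumptions chosen for illustration. It only shows the general shape of (1) raising a protein's confidence when its known interaction partners are also observed and (2) building an exclusion list of proteins already considered confidently identified.

```python
from typing import Dict, List, Set


def ppi_support(protein: str, observed: Set[str],
                interactions: Dict[str, Set[str]]) -> float:
    """Fraction of a protein's known interaction partners also observed in the sample."""
    partners = interactions.get(protein, set())
    if not partners:
        return 0.0  # no known partners: no interaction evidence either way
    return len(partners & observed) / len(partners)


def rescore(scores: Dict[str, float], interactions: Dict[str, Set[str]],
            weight: float = 0.2) -> Dict[str, float]:
    """Add a hypothetical interaction bonus to each confidence score, capped at 1.0."""
    observed = set(scores)
    return {
        protein: min(1.0, score + weight * ppi_support(protein, observed, interactions))
        for protein, score in scores.items()
    }


def exclusion_list(scores: Dict[str, float], threshold: float = 0.9) -> List[str]:
    """Proteins whose score meets the (assumed) threshold and could be skipped in later scans."""
    return sorted(p for p, s in scores.items() if s >= threshold)


# Toy data (invented): initial confidence scores and a small interaction map.
scores = {"A": 0.8, "B": 0.5, "C": 0.85}
interactions = {"A": {"B", "C"}, "B": {"A", "D"}, "C": set()}

updated = rescore(scores, interactions)
excluded = exclusion_list(updated)
print(updated)
print(excluded)
```

In this toy example, protein A has both of its partners observed and so receives the full bonus, B has one of two partners observed and receives half, and C has no known partners and is unchanged. A real system would learn such weights from data and control the false discovery rate rather than use a fixed additive bonus.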