Hi all, for the first time, I've started digging into a few untargeted LC-MS datasets produced by a core facility and had a few questions about dealing with some quirks of the data available to me when running the MS-Dial > MS-CleanR > MS-Finder pipeline.
Skipping MS-Dial and using a feature/peak table for MS-CleanR
The core put some work in already to do some clean up:
They initially loaded the data into MS-Dial to make an Excel file of features, and then (manually, apparently) normalized based on the sum of their custom internal standards mix and did blank extraction, duplicate and isotope removal, MSMS matching, and some basic annotations. That leaves me with spreadsheets of about 5500 - 7000 features (combined between +/- ESI) per dataset.
I’d like to do additional adduct/noise removal using MS-CleanR, but it seems like that means I need to step back to the raw data and start at MS-Dial to get the right kind of files and info. Or is there a shortcut possible with a csv input, since the core did lots of the work that happens in MS-Dial? I’ve tested Binner for this (https://binner.med.umich.edu/), but wanted to see if MS-CleanR/Finder may yield a more robust final dataset.
If I need to go back to MS-Dial, I have one lipid and two metabolite (amines and polyphenols) datasets that might not fit easily into this pipeline, for the reasons below.
Partial MSMS coverage
For the lipidomics and amines (QTOF), I only have MSMS data from pool injections. These are meant to serve as a reference, so that their MS2 spectra can be associated with matching MS1 data from the other samples (blanks and experimental samples). Furthermore, the lipid MSMS pools are split into “lower” and “upper” files.
I know CleanR requires MSMS or kicks out a feature, so I’m not sure if I can correlate the pooled MS2 to their putative MS1 in the individual samples. How best to handle this? I suppose I could run the analysis on the MSMS pools only, and then manually match MS1 values in R to attach peak values from the samples of interest. On a test run, though, it looks like the program is possibly smart enough to do this. The output file 'MS_peak-final_selection.csv' now has about 600 features (+ ESI only), each with one or the other MSMS pool as the spectrum reference, but with some kind of peak values for all my sample types.
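In case the manual route turns out to be necessary, here is a minimal sketch of the kind of MS1 matching I have in mind. It is written in Python rather than R, and everything in it is an assumption for illustration: the function name, the "id"/"mz"/"rt" field names, and the tolerances (10 ppm, 0.1 min) are placeholders, not anything taken from MS-Dial or MS-CleanR.

```python
def match_features(reference, sample, ppm_tol=10.0, rt_tol=0.1):
    """Match each reference (MSMS pool) feature to the closest sample feature.

    reference, sample: lists of dicts with "id", "mz" and "rt" (minutes) keys.
    ppm_tol and rt_tol are placeholder tolerances, not recommended values.
    Returns {reference_id: sample_id, or None if nothing is within tolerance}.
    """
    matches = {}
    for ref in reference:
        # Convert the ppm tolerance into an absolute m/z window for this feature.
        mz_window = ref["mz"] * ppm_tol / 1e6
        candidates = [
            s for s in sample
            if abs(s["mz"] - ref["mz"]) <= mz_window
            and abs(s["rt"] - ref["rt"]) <= rt_tol
        ]
        # Among the candidates, keep the one closest in retention time.
        best = min(candidates, key=lambda s: abs(s["rt"] - ref["rt"]), default=None)
        matches[ref["id"]] = best["id"] if best else None
    return matches


# Made-up example values, only to show the call.
pool = [
    {"id": "P1", "mz": 760.5851, "rt": 5.20},
    {"id": "P2", "mz": 496.3398, "rt": 2.10},
]
samples = [
    {"id": "S1", "mz": 760.5855, "rt": 5.22},  # within both tolerances of P1
    {"id": "S2", "mz": 760.6000, "rt": 5.21},  # m/z too far from P1
    {"id": "S3", "mz": 496.3398, "rt": 3.00},  # RT too far from P2
]
result = match_features(pool, samples)
```

With these made-up values, P1 pairs with S1 and P2 is left unmatched. A real version would also need to decide what to do when several pool features claim the same sample feature, which this sketch ignores.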
Here is the lipids sample list for clarity:
4x Blanks_core (from the core facility, spaced over the injection series)
3x Blanks_experiment (we made these as mock-collection tubes, injected in random order, as experimental samples)
4x QC_pool (made at the facility, spaced over the injection series)
2x QC_pool_MSMS_lower (injected at the end of the experiment samples)
(only major difference in amines data is that they have 4x QC_pool_MSMS_all)
Missing QC pools
For the polyphenols (QE), I believe I have MSMS for every sample, but I have no QC_pools at all (apparently they simply “failed upon acquisition”). Is there a way to make that work in this pipeline?
Internal standard mixes
For all datasets, each sample was given a mix of internal standards (not SPLASH for the lipids). Is there a way to normalize by a group of internal standards? In the core’s manual process, they use the average peak heights of the standards (rather than a specified input concentration or matching compound classes).
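To make the question concrete, here is a minimal sketch of the normalization as I understand it, in Python with only the standard library. It assumes one scaling factor per sample, the mean peak height of the internal standards in that sample, rescaled so the factors average to 1. The function name, the table layout, and the feature IDs are all hypothetical, and this is my reading of the approach rather than the core's actual procedure.

```python
from statistics import mean


def normalize_by_internal_standards(peak_table, istd_ids):
    """Scale each sample by the mean peak height of its internal standards.

    peak_table: {feature_id: {sample_name: peak_height}}
    istd_ids:   feature IDs of the internal standards (hypothetical names)
    """
    samples = sorted({s for row in peak_table.values() for s in row})
    # One factor per sample: mean height of the internal standards.
    factors = {s: mean(peak_table[i][s] for i in istd_ids) for s in samples}
    if any(f <= 0 for f in factors.values()):
        raise ValueError("an internal-standard mean is zero or negative")
    # Rescale the factors to average 1 so values stay near their original scale.
    overall = mean(factors.values())
    return {
        fid: {s: h / (factors[s] / overall) for s, h in row.items()}
        for fid, row in peak_table.items()
    }


# Made-up peak heights: sample S2 was measured at twice the intensity of S1.
table = {
    "ISTD_1": {"S1": 100.0, "S2": 200.0},
    "ISTD_2": {"S1": 300.0, "S2": 600.0},
    "feat_A": {"S1": 50.0, "S2": 100.0},
}
normalized = normalize_by_internal_standards(table, ["ISTD_1", "ISTD_2"])
```

In this toy table the twofold difference between S1 and S2 disappears after normalization, so feat_A ends up with the same value in both samples. Using the sum of the standards instead of the mean would give the same result here, since the two differ only by a constant that the rescaling step cancels.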
Ok, that was a lot, but thank you!
After looking deeper, I think the MSMS coming from only a subset of samples is not a problem once everything is aligned (but do correct me if I'm wrong). That said, since this was DDA, I only get to use about 1/4th of my aligned peaks in CleanR.
I think the main thing for me right now is properly normalizing by a custom mix of internal standards. Even if I have to do that in R, I'm not sure how to go from a text file to something MS-Dial can read and turn into a .mat for MS-CleanR.