It is always a good idea to randomize samples across batches. There are special cases where you might want to perform block randomization, such as what Jan described. I guess you could say that the goal is to make systematic/technical variation orthogonal to biological variation.
Given that, how different are the distributions? Say there are 3 conditions to which all samples were randomly assigned. It would be fine to have one batch with 30/40/50 samples from each treatment (assuming they are run in a randomized order). However, it wouldn't be good to have a distribution of something like 50/0/60. Even worse would be 10/0/100.
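A quick way to eyeball batch balance is a simple cross-tabulation; here's a rough sketch in R, where `meta`, `batch` and `treatment` are all hypothetical names:

```r
## Hypothetical metadata: 120 samples, 3 treatments, 3 batches
set.seed(42)
meta <- data.frame(
  treatment = sample(rep(c("A", "B", "C"), times = c(30, 40, 50))),
  batch     = rep(1:3, each = 40)
)

## Cross-tabulate treatment by batch - look for empty or very skewed cells
table(meta$batch, meta$treatment)

## A chi-squared test gives a rough check of batch/treatment independence
chisq.test(table(meta$batch, meta$treatment))
```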
For pooled QCs: it's a good idea to create the pooled samples and put them with the biological samples as soon as possible. This ensures they will capture as much technical variation as possible, i.e. all the variation from sample storage, defrosting, extraction, running, etc. This is especially important when batches are processed separately (extracted separately or stored in different freezers).
## Open the mzML file that you want to convert
library(mzR)
dat <- openMSfile('myMSdata.mzML')

## Get spectra data
pks <- spectra(dat)

## Get file header
hdr <- header(dat)

## Remove all scans but MS2 scans
pks <- pks[hdr$msLevel == 2]
hdr <- hdr[hdr$msLevel == 2, ]

## Take a quick look and make sure everything looks ok
head(hdr)

## Provide new scan/acquisition numbers
hdr$seqNum <- hdr$acquisitionNum <- seq(nrow(hdr))

## Clear all precursor charge/intensity/m/z/scan-number columns
## Not sure if this is required, or how precursor ion scan data will differ
hdr$precursorCharge <- hdr$precursorIntensity <- hdr$precursorMZ <- hdr$precursorScanNum <- 0

## Overwrite msLevel to 'pretend' to be MS1 data
hdr$msLevel <- 1

## Write out the new 'MS1' data
writeMSData(pks, 'myAlteredMSdata.mzML', header = hdr)
There are a few fields that still contain MS2 information, so I'm not sure whether they will conflict with anything downstream, or how well this will work for you. But let everyone know if it's helpful.
AmelieV, if you only have one product ion per file, it's probably sufficient to "pretend" the MS2 data are MS1 scans. Pretty sure I have a solution for that somewhere and some precursor ion scan data to test it on. I'll see if I can dig it up.
However, if you have already found the precursor masses, it's probably quickest to set up a MassHunter Quant or Qual method to integrate all the peaks (assuming this is what you are interested in?).
There are probably more detailed descriptions/tutorials around, including on this forum. However, I've found that XCMS is fairly robust to most (reasonable) settings in MSConvert. I'd say the most important things are to have 'peak picking' as a filter and 'MS levels' with a 1 in the first position (indicating you want MS1 centroided). Both of the links above show this. The differences in settings between the links shouldn't affect XCMS function (mzXML vs mzML, 32 vs 64 bit, zlib compression).
I've been using different packages for annotation of isotopes, adducts, dimers and fragments. There are, of course, many ways this can be done: using just exact mass, examining co-elution peak shape, or correlating peak areas.
I'd be interested in looking at how your network algorithm can improve our annotations.
Look forward to some interesting discussions in the future!
I don't have a lot of HRMS experience, but I've gradually been doing more. Our focus is high-throughput, so robustness is something we care about as well. As a general rule, the top spec instruments tend to be less robust than the lower tier models, although there are always exceptions.
So far I've used an Agilent qTOF (6540) and two Thermo Orbitraps (Fusion Lumos and HF-X). All of these were set up to do analytical flow LC-MS. The Agilent system had quite a few boards replaced over a few years, but I can't say that the instrument was looked after particularly well. The Fusion Lumos ran well for over 1000 continuous samples (plasma), but beyond that I can't say how robust the machine is. The HF-X had a bad reputation for 'dirtying', but Thermo says they have fixed this in the current generation. If you are willing to wait a few months, I can fill you in on how it goes.
For the work that I was doing (lipidomics), the much higher resolution given by the Orbitrap was very useful. Overall sensitivity was higher with the Orbitraps as well - but we are comparing 2 very new instruments to a much older one. However, both Orbitraps have quite harsh ion funnels, which causes in-source fragmentation of fragile molecules. This isn't exclusive to Thermo instruments; most vendors are aware of this, so you should bring it up with them if this is a concern.
Regarding quantitation, there are different levels of quantitation that people expect. Most metabolomics/lipidomics people have a relaxed view on quantitation. That is, there are caveats and assumptions that everyone accepts can't be resolved (not enough internal standards, shotgun vs LC...). Most newer instruments provide a fairly decent dynamic range - certainly much more than the expected variability you would see in a single metabolite in a group of people. So it's important to ensure that sample prep is correct to put the right amount of 'stuff' in the machine.
I'm sure others can chime in with some more experiences and knowledge.
mass <- seq(peaks[1, "mz"], peaks[nrow(peaks), "mz"] + binSize, by = binSize / 2)
Then, it finds which peaks fall in each window and performs density based grouping on them (which is based on 'bw').
So, if you know that your instrument has very high accuracy/precision, you can reduce binSize to a smaller number. This should ensure that peaks with relatively different masses aren't grouped together. Thereafter, you can reduce 'bw' to fine-tune based on RT.
You could get a rough visualization of the mass groupings with some basic plotting:
## Get peaks that you want to find in a feature
specificPeaks <- chromPeaks(xdata, rt = rtr_1, mz = mzr)

## Plot the peak locations
plot(specificPeaks[, "rt"], specificPeaks[, "mz"])

## Create approximate grouping windows (not the values that 'groupChromPeaks' will generate)
mass <- seq(min(specificPeaks[, "mz"]), max(specificPeaks[, "mz"]) + binSize, by = binSize / 2)
abline(h = mass, col = rep(c('red', 'green')))
## Note these are overlapping windows, so look at groupings between the coloured lines
Both highlightChromPeaks and plotChromPeakDensity use this to find the peaks inside the ranges you specify.
What you should know, however, is that plotChromPeakDensity doesn't plot 'features'. It plots the density of the peaks found (see above) using the parameters you gave it (gparam). So, it is possible that 'groupChromPeaks' is including additional nearby (mz or rt) peaks, so that the median rt and mz are outside the range you specify in mzr and rtr_1.
So I would start by increasing the mzr range and trying to look again.
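As a sketch of that, here's how widening the m/z window and re-plotting might look (this reuses `xdata`, `mzr`, `rtr_1` and `gparam` from your snippets, so treat it as illustrative; the 0.01 widening is arbitrary):

```r
## Widen the m/z window by, say, 0.01 on each side
mzr_wide <- c(mzr[1] - 0.01, mzr[2] + 0.01)

## Re-plot the peak density with the same grouping parameters
plotChromPeakDensity(xdata, mz = mzr_wide, param = gparam)

## Check which peaks now fall inside the wider window
chromPeaks(xdata, mz = mzr_wide, rt = rtr_1)
```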
Could you provide a link to some of the forum discussions that you mention? I haven't seen them and would be interested in reading about what has been done before.
There are potentially a few places that you can try to limit these side/partial peaks. You seem to have tried a few approaches with some success. I would suggest trying to alter the grouping algorithm settings. Increasing bw will eventually cause these to merge into the larger features, but this may have unwanted effects on other features. So I would look at changing the minfrac and minsamp settings. As you are using fillPeaks, it will "find" the missing peaks in the rest of the samples. Looking at the images you posted, it doesn't look like XCMS will find the same side/partial peaks in many samples.
If that doesn't work, perhaps we can start to think of some post-processing solutions to handle it.
# Extract out spectra data
spec <- spectra(raw_data)

# Iterate through the spectra list and obtain the relevant information
specl <- lapply(spec, function(x) c(x@msLevel, x@rt, x@scanIndex))

# Compact the list to a matrix
specl <- do.call(rbind, specl)
This will create a matrix that contains the msLevel, retention time and scan number for every scan in the data. It shouldn't take long to cross reference a couple peaks to confirm whether 'scmin/scmax' need to be adjusted. If so, you have just generated a matrix that can be used to calculate the new scan counts!
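As a sketch of that cross-referencing, assuming the `specl` matrix from above (the column names and example scan numbers here are made up):

```r
## Name the columns generated above
colnames(specl) <- c("msLevel", "rt", "scanIndex")

## Hypothetical peak bounds in scan numbers
scmin <- 101
scmax <- 120

## Retention times at those scan indices
specl[specl[, "scanIndex"] %in% c(scmin, scmax), "rt"]

## If these don't match rtmin/rtmax from chromPeaks(), the scan counts
## are likely referenced from somewhere other than scan 1
```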
Regarding lmin and lmax: I am not too sure about these. They can be found in the centWave code, where they appear to be the minimum 'continuous wavelet transform' (CWT) coefficients on either side of a peak. Interestingly, it looks like rtmin/rtmax are defined using the same values. I could be way off, but I think scmin/scmax are estimates of the peak bounds based on the CWT.
As it looks like 'into' is calculated using the bounds defined in lmin/lmax, perhaps these are the better columns for you to use. But do note that these values are not always referenced from scan 1.
On another matter, I’m wondering how to cut down on the number of features while still maintaining a low intensity threshold. Currently I have ~13,000 features. My goal is to be able to get my peak table into EZinfo, which is not able to handle my 1950x13,000 peak table. I am interested in minor compounds, so I don’t just want to filter by intensity. I have a few ideas, and I would love it if anyone could offer feedback.
Depending on what you are trying to do, you could take the median peak area from each of the triplicates. That will cut your sample rows to nearly a third.
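A rough sketch of collapsing triplicates, where `X` (injections x features) and `sampleID` are hypothetical names:

```r
## Toy peak table: 9 injections (3 samples x 3 injections), 4 features
set.seed(1)
X <- matrix(rnorm(36, mean = 100), nrow = 9, ncol = 4)
sampleID <- rep(c("s1", "s2", "s3"), each = 3)

## Median peak area per biological sample - rows drop from 9 to 3
Xmed <- apply(X, 2, function(feature) tapply(feature, sampleID, median))
dim(Xmed)  ## 3 x 4
```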
Alternatively, you could filter the dataset to remove isotopes. Depending on the average number of carbons in your metabolites and signal/abundance, you might be able to reduce the dimensions 2-4 fold. The same can be done with removing adducts. I've briefly played around with CAMERA for this, but ended up using mz.unity.
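A minimal CAMERA sketch for flagging isotopes and adducts, assuming an old-style `xcmsSet` object `xset` (if you're on the newer interface, something like `as(xdata, 'xcmsSet')` may be needed; the polarity and the final filter are illustrative):

```r
library(CAMERA)

## Build the annotation object and group co-eluting peaks
xsa <- xsAnnotate(xset)
xsa <- groupFWHM(xsa)

## Annotate isotopes, refine groups by peak-shape correlation, find adducts
xsa <- findIsotopes(xsa)
xsa <- groupCorr(xsa)
xsa <- findAdducts(xsa, polarity = "positive")

## Keep unannotated and monoisotopic peaks, dropping the isotope peaks
pl <- getPeaklist(xsa)
pl <- pl[pl$isotopes == "" | grepl("[M]", pl$isotopes, fixed = TRUE), ]
```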
As a less useful suggestion, is it possible to use R for your data analysis? For a lot of multivariate analysis, mixOmics does pretty well. The website has a lot of examples and the inbuilt plotting functions have come a long way.
One idea is to set strict filters during correspondence to cut down on the number of “false” features. I tried re-running XCMS on this same dataset but using different params. In particular, for correspondence, I changed the way sample groups are defined. Previously, all samples were in the same group. This time, I defined each sample as its own group (so ~600 groups). For PeakDensityParam, I set minFraction=2/3 & minSamples=2. My thinking was that a true feature would be present in all 3 injections of a sample, but I set the cutoff to 2 out of 3 to be on the safe side. In this way, I hoped to eliminate false features. At any rate, the correspondence step took much longer than before, and I ran out of memory before the script was completed. I tried a couple times with the same result.
My thoughts on this differ from many in the 'untargeted' scene. I'm really only interested in features that are present in nearly all samples (<10% missing values). So, I always ask whether people expect to see features entirely missing from certain samples/groups.
The nice thing about XCMS is that you can set these parameters fairly loose early in the workflow. Then after fillChromPeaks, you can be more stringent. With so many samples, I would imagine that seeing the same feature in multiple groups is an almost certainty. So maybe put every sample in 1 group, but set minFraction=10/600 (or something of that sort).
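A sketch of that looser first pass, assuming an `XCMSnExp` object `xdata`; the bw/binSize values are placeholders:

```r
library(xcms)

## One group for all samples; keep any feature seen in at least 10 of ~600
pdp <- PeakDensityParam(
  sampleGroups = rep(1, length(fileNames(xdata))),
  minFraction  = 10 / 600,
  bw           = 5,     ## placeholder - tune for your chromatography
  binSize      = 0.01   ## placeholder - tune for your mass accuracy
)
xdata <- groupChromPeaks(xdata, param = pdp)

## After fillChromPeaks() you can filter features more stringently
```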
I'd love to hear other people's thoughts on this, as well.
Another idea is to filter out less informative markers based on variation of each feature across my sample set. My idea is to calculate the coefficient of variation for each marker across the entire dataset, and then exclude any markers below a certain CV value. I understand that p value and fold-change are often used for this kind of filtering, but as I understand it, these only make sense if the dataset contains multiple experimental groups. I don’t have any groups in my dataset; this is just an exploratory analysis. Does anyone have knowledge of or experience with filtering in this way? Any papers that you can suggest? How to determine an appropriate cutoff value for CV?
This is certainly a way you could go. Perhaps there is a way to empirically determine a good CV cutoff? If CV is mostly related to biological grouping, then you could determine the difference in CV when all injections are used compared to when you have averaged the triplicates. Determine the threshold CV by permuting the biological grouping and repeating the process (you will end up averaging non-triplicates randomly). Whatever the 95th percentile is, that is your critical value.
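A rough sketch of that permutation, where `X` (injections x features) and the triplicate labels `sampleID` are hypothetical names:

```r
cv <- function(x) sd(x) / mean(x)

## Per-feature CV after averaging the true triplicates
cv_true <- apply(X, 2, function(f) cv(tapply(f, sampleID, mean)))

## Null distribution: shuffle the triplicate labels and repeat
set.seed(1)
cv_perm <- replicate(100, {
  shuffled <- sample(sampleID)
  mean(apply(X, 2, function(f) cv(tapply(f, shuffled, mean))))
})

## 95th percentile of the permuted CVs as a candidate cutoff
quantile(cv_perm, 0.95)
```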
I've been using the Brunius batchCorr package, because I already know how to use R. However, given the characteristics of my dataset, I wonder if it is adequate.
Characteristics:
-- ~1950 files representing ~570 plant extracts (triplicate injection) + QC samples
-- 13 batches
-- All extracts are from the same species
-- The QC sample is an extract of about 40 accessions pooled together. However, it looks quantitatively different from most of the extracts in the sample set: the later-eluting peaks of the QC sample are generally bigger while the early peaks are smaller. I don't think there are many qualitative differences between QC and other samples. However, I can imagine that these might translate into presence/absence differences in the peak table for minor compounds.
The differences between QCs and samples shouldn't be that big of a deal. Depending on what batch correction method you use, you can assess the improvement in CV (RSD) of the QC features to determine how useful the batch correction was. Now, if the batch correction method optimized itself based on minimizing QC variation, then this approach is biased. Cross-validation should then be used to assess performance.
A simple visualization is to plot the pre-corrected CVs on the x-axis and the post-corrected CVs on the y-axis. Points that fall below the diagonal were improved; points on the diagonal weren't affected; points above the diagonal were negatively affected. This may be an easy way to get a 'gut' feel for what method works best for you.
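A sketch of that plot, where `qc_pre` and `qc_post` are hypothetical QC-only peak tables (samples x features) before and after correction:

```r
cv <- function(x) sd(x) / mean(x)
cv_pre  <- apply(qc_pre,  2, cv)
cv_post <- apply(qc_post, 2, cv)

plot(cv_pre, cv_post, pch = 16, col = "grey40",
     xlab = "CV before correction", ylab = "CV after correction")
abline(0, 1, col = "red")  ## points below the line were improved
```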
-- The extracts--other than QC--are not standardized by concentration or by equivalent weight of plant material. There is a range of weights of plant material that were extracted. Nonetheless, I do have for each sample the weight of plant material extracted and the weight of solvent used for extraction. From these values, I have generated a sample:solvent correction factor.
-- This is a pilot dataset and was not intended for publication.
My thinking is, now that the batch correction has been done, the next step is to apply the sample:solvent correction factor. The simplest thing to do would be, for each feature in a sample, divide the peak area value by the correction factor for that sample. However, I realize that detector response may not be linear in the range of interest for each feature; thus, the results may not be completely accurate. Nonetheless, I can't think of a better option. Any feedback on my approach?
This is a fairly common approach. Of course, you should always try to keep the sample:solvent ratio as consistent across all samples as possible. Remember that different sample:solvent ratios will cause variability in extraction efficiency, ionization and detector response.
If you are concerned about introducing associations into your data, consider using a linear model to remove the correction factor. Get the residuals from lm(peakArea~correctionFactor). This allows the detector response to not be 1:1, but doesn't do much for non-linearity.
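A sketch of that per-feature residual approach, where `peakTable` (samples x features) and `cf` (per-sample correction factor) are hypothetical names:

```r
## Residuals of each feature regressed on the correction factor
corrected <- apply(peakTable, 2, function(feature) residuals(lm(feature ~ cf)))

## Residuals are centred on zero; add back each feature's mean to keep scale
corrected <- sweep(corrected, 2, colMeans(peakTable), "+")
```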
Yes, I do believe you are correct with your interpretation of the documentation. It looks like the prefilter check is done here: do_findChromPeaks-functions.R. The vapply iterates over each ROI index list and counts the number of scans with intensity above the prefilter threshold (I). If this is above 'k', the ROI is kept.
Also, 'scans' would probably be a better word to use.