I have been using the 'IPO' package to determine my XCMS parameters. The resulting peak table contains over 50,000 features. I'm ok with this for now. Has anyone else used IPO? Thoughts?
I successfully ran CAMERA (thanks CoreyG for the suggestions) with the goal of finding redundant features (eg. adducts/isotopes). It identified~7,000 pc groups in the ~50,000 features. Now, if I am understanding correctly, I need to select one feature from each pc group to serve as the "representative" for that compound. Any recommendations on how to make this selection?
On another matter, I’m wondering how to cut down on the number of features while still maintaining a low intensity threshold. Currently I have ~13,000 features. My goal is to be able to get my peak table into EZinfo, which is not able to handle my 1950x13,000 peak table. I am interested in minor compounds, so I don’t just want to filter by intensity. I have a few ideas, and I would love it if anyone could offer feedback.
One idea is to set strict filters during correspondence to cut down on the number of “false” features. I tried re-running XCMS on this same dataset but using different params. In particular, for correspondence, I changed the way sample groups are defined. Previously, all samples were in the same group. This time, I defined each sample as its own group (so ~600 groups). For PeakDensityParam, I set minFraction=2/3 & minSamples=2. My thinking was that a true feature would be present in all 3 injections of a sample, but I set the cutoff to 2 out of 3 to be on the safe side. In this way, I hoped to eliminate false features. At any rate, the correspondence step took much longer than before, and I ran out of memory before the script was completed. I tried a couple times with the same result.
Another idea is to filter out less informative markers based on variation of each feature across my sample set. My idea is to calculate the coefficient of variation for each marker across the entire dataset, and then exclude any markers below a certain CV value. I understand that p value and fold-change are often used for this kind of filtering, but as I understand it, these only make sense if the dataset contains multiple experimental groups. I don’t have any groups in my dataset; this is just an exploratory analysis. Does anyone have knowledge of or experience with filtering in this way? Any papers that you can suggest? How to determine an appropriate cutoff value for CV?
I returned form vacation to find CoreyG's helpful responses. It turns out that I was not using "value='into'". I changed this param, and now my data look much better.
I've been using the Brunius batchCorr package, because I already know how to use R. However, given the characteristics of my dataset, I wonder if it is adequate.
Characteristics: -- ~1950 files representing ~570 plant extracts (triplicate injection) + QC samples -- 13 batches -- All extracts are from the same species -- The QC sample is an extract of about 40 accessions pooled together. However, it looks quantitatively different than most of the extracts in the sample set: the later eluting peaks of the QC sample are generally bigger while the early peaks are smaller. I don't think there are many qualitative differences between QC and other samples. However, I can imagine that these might translate into presence/absence differences in the peak table for minor compounds. -- The extracts--other than QC--are not standardized by concentration or by equivalent weight of plant material. There is a range of weight of plant material that was extracted. Nonetheless, I do have for each sample the weight of plant material extracted and the weight of solvent used for extraction. From these values, I have generated a sample:solvent correction factor. -- This is a pilot dataset and was not intended for publication.
My thinking is, now that the batch correction has been done, the next step is to apply the sample:solvent correction factor. The simplest thing to do would be, for each feature in a sample, divide the peak area value by the correction factor for that sample. However, I realize that detector response may not be linear in the range of interest for each feature; thus, the results may not be completely accurate. Nonetheless, I can't think of a better option. Any feedback on my approach?
I ran PCA on just the batch QC samples. I also plotted the intensity of several, randomly selected markers across all my QC samples. I've attached a few of the plots.
There does appear to be a pretty consistent increasing trend over the batches (with the exception of batch 20180728). This seems suspicious. I'd expect the intensities to decrease or increase over time depending on when I cleaned the sample cone.
The trend seen in the individual markers does seems consistent with the trend seen in PC1, though.
Also, I'm wondering what the lower intensity points are. Perhaps they were generated during the peak filling step. I'm not sure. If that's the case, maybe my bw=1 in PeakDensityParam is--as you suggested--too low?
These are not known compounds as you suggested. They they are just randomly selected markers. My reasoning was that all markers are contributing to the PCA. However, it does seem wise to quantify a few peaks using QuanLynx and see if the trend matches with what I'm seeing in the markers generated by XCMS.
Btw, here's the code I used to generate the plots of individual markers:
Here is the scores plot with loadings vectors. It looks like the overwhelming majority of features are moving in one direction. To me, this suggests a systematic drift in intensity between batches. What do you think?
Also, are there any R-based batch correction methods that you can recommend? It's taken me a while to get comfortable with R, and I'd prefer not have to learn any other languages.
I re-ran XCMS peak detection on just the 205 batch QC injections. I used the same peak detection parameters as were used for the set of 1948 samples (as seen in my previous PCA scores plot). I made boxplots for log2 of peak intensities for each of the QC samples. Out of curiosity, I also did box plots for the un-transformed intensities. Three plots are attached (log2; untransformed; untransformed and zoomed-in to show IQR's)
Each batch is represented by a separate color. There do appear to be some inter-batch differences in intensity. Could these differences be enough to be causing the differences that I'm seeing in the PCA?
Yes, I do get a lot of features (15,000+). It seems that the vast majority of the features have relatively high intensity and low variation. I've attached a plot of relative standard deviation (rsd) vs. mean for each feature. Could these be the false positives that you refer to?
I set bw=1 for groupChromPeaks based on the optimization procedure illustrated in Figure 12 (https://www.bioconductor.org/packages/devel/bioc/vignettes/xcms/inst/doc/xcms.html). I was using a subset of ~20 samples from across all my batches; these samples represented a wide range of retention times. I set bw this low in order to resolve some features that are not baseline separated. I was assuming that the Obiwarp step, which is upstream of the peak grouping step, would have corrected for shifts in retention time. Is this not a safe assumption?
Hello everyone! I have an update and some more questions. Thanks in advance for your help. You are all going into the Acknowledgements section of my dissertation!
--Update-- I successfully processed 2500+ injections with XCMS on the big server I mentioned before (Ubuntu 18.04, 350+ GB RAM, 80 cores). Total run time--from importing raw data to exporting a table of grouped features--was about 2 days. I set the number of cores available for parallel processing to 20.
Interestingly, I noticed as the script progressed that more and more RAM was being "used" by R (according to htop). In fact, the amount approached 400 GB (i.e. over my 300 GB cap). This amount was over an order of magnitude greater that total object size in R's memory. Based on my research, it seems most likely that R was simply being "lazy" and not returning any RAM back to the OS. At any rate, this did not seem to affect the stability of the server.
---Questions--- I ran a PCA on my set of markers from 1948 injections (I re-ran XCMS on a subset of the aforementioned 2500+ injections). The biggest trend I'm seeing is a batch effect. See attached scores plot. Note that PC1 accounts for over 73% of total variability.
I inspected one ion from the same peak across all of my 200+ QC samples; the m/z range is about 40 ppm for almost all of the samples. The m/z values for a small minority of samples are way off, and this was due to some LockSpray issues. The vast majority are in the 40 ppm range. The retention time range for this peak is 0.06 min for all but the first batch.
My hunch is that the batch effect I'm seeing is largely due to the m/z drift. What do you think?
If so, how would I correct this? I imagine that the batch effect is due to a lack of correspondence caused by the m/z drift. Is there a way to increase the m/z tolerance range in the correspondence step? I'm seeing that there is a binSize param in the groupChromPeaks-density function: "numeric(1) defining the size of the overlapping slices in mz dimension". It's not obvious to me how to set this parameter. Is it in terms of Da, ppm, or what? How would I set it to allow for a 40ppm difference?
Here are the calls that I used for the correspondence. I did not define binSize, and I think the default is 0.25
Is it even possible to correct for these inter-batch differences by using XCMS? Or would I need to employ other software? I've done some literature research on the topic (see below), although I haven't attempted any of these yet.
Rusilowicz, M., Dickinson, M., Charlton, A., O’Keefe, S., & Wilson, J. (2016). A batch correction method for liquid chromatography–mass spectrometry data that does not depend on quality control samples. Metabolomics, 12(3), 1–11. https://doi.org/10.1007/s11306-016-0972-2
Wehrens, R., Hageman, J. A., van Eeuwijk, F., Kooke, R., Flood, P. J., Wijnker, E., … de Vos, R. C. H. (2016). Improved batch correction in untargeted MS-based metabolomics. Metabolomics, 12(5). https://doi.org/10.1007/s11306-016-1015-8
Wang, S. Y., Kuo, C. H., & Tseng, Y. J. (2012). Batch Normalizer: a fast total abundance regression calibration method to simultaneously adjust batch and injection order effects in liquid chromatography/time-of-flight mass spectrometry-based metabolomics data and comparison with current calibration methods. Analytical chemistry, 85(2), 1037-1046.
Brunius, C., Shi, L., & Landberg, R. (2016). Large-scale untargeted LC-MS metabolomics data correction using between-batch feature alignment and cluster-based within-batch signal intensity drift correction. Metabolomics, 12(11), 173.
johannes.rainer, I tried your suggested code. It seems that my largest files are ~150MB. Using this number to make a conservative estimate, I would need about 12GB of RAM to hold 78 raw data files in memory, which is well below the 350+ GB on the server. But as you said, there are also other processes/objects that need RAM.
CoreyG mentioned that 4 threads keeps the RAM below 16 GB on a low spec desktop. So roughly 4 GB per core, which is much higher than the 150 MB core estimated above. But also, you mentioned that the fillChromPeaks is the most memory intensive process in your script, requiring a limit of 1 thread on a 32GB system. I've also noticed that fillChromPeaks is the step that takes the longest.
It does seem like I'll need to do some optimization on our system. Any advice on how to determine how stable the system is with a given number of cores? I don't mind starting with a large # of available cores and working down, as CoreyG suggests. However, for various reasons, I do not want to test stability by whether or not it crashes the server. Do you think that using the 'ulimit' approach to limit RAM would help to prevent the server from crashing? Can an I/O bottleneck cause a crash?
Perhaps I could work through the script step by step, optimizing the number of cores at each step before moving on to the next...
Thanks CoreyG. I am proposing to limit RAM because of my previous experience with MZmine 2; it crashed the server multiple times when it was given access to all the RAM. Would this not be an issue with R/xcms?
As to the issue of I/O bottlenecking, some more questions.
Jan mentioned NOT to use all the cores for reading from raw data files. Are there any steps in the data processing for which having 78 cores available would be a good idea? Would there be any advantage to using all cores after the onDisk object has been created? I'm envisioning a script where I change the number of available cores at different steps in the data processing...
If this is not a good option, then can anyone suggest guidelines for determining the number of cores to use?
Dankeschön, mange tak, 谢谢, and thank you! (hopefully I didn't miss anyone)
Looking through the LC_MS Preprocessing vignette for XCMS, it seems like this simple call could do the trick...
If I understand correctly, this would set the number of available cores to 78. Would it be better to use a smaller number of cores? johannes.rainer had cautioned using too many cores. Would setting the limit on RAM (as mentioned in previous post) prevent the "serious troubles" that johannes.rainer mentioned?
I'm back on this project now. I'm getting ready to test out a script on the big server mentioned earlier (Ubuntu 18.04, 80 cores, 350+ GB RAM). Can anyone offer tips on how to set parameters for parallel processing? I just downloaded the BiocParallel package, and I'm trying to make sense of it all.
In an effort to limit RAM to 300GB, I am planning to use the command below. Any thoughts on this approach? ulimit -m 300000000
Before, when I was running MZmine, the server kept crashing when the heap size was set to auto. When I lowered the heap size to 300GB, it stopped crashing. Now, with XCMS, I'm planning to use the ulimit command instead.
I am trying to process over 2500 files from a UPLC-QTOF-MS dataset. The goal is to eventually increase this number to 10,000 and beyond. Currently I am using MZmine 2. I am fortunate to have access to a big server (80 core, 350+ GB RAM). However, it seems that the peak alignment step is not optimized for this number of samples. See my other post for more details about this issue: https://github.com/mzmine/mzmine2/issues/518
Any ideas on a more efficient peak alignment method? As far as I can tell, the raw data are already pretty well aligned; the UPLC seems to be fairly consistent. My main objective right now is to get all the samples/peaks into a single matrix.
I am actively trying different approaches, but it all takes time. I am hoping that someone else who has trod this ground before can offer advice to help save time and effort.