Topic: Peak alignment with large dataset (over 2500 samples and growing) (Read 982 times)

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #15
Thanks all!!!

johannes.rainer, I tried your suggested code. It seems that my largest files are ~150MB. Using this number to make a conservative estimate, I would need about 12GB of RAM to hold 78 raw data files in memory, which is well below the 350+ GB on the server. But as you said, there are also other processes/objects that need RAM.

CoreyG mentioned that 4 threads keeps the RAM below 16 GB on a low-spec desktop. So roughly 4 GB per core, which is much higher than the 150 MB per core estimated above. But you also mentioned that fillChromPeaks is the most memory-intensive process in your script, requiring a limit of 1 thread on a 32 GB system. I've also noticed that fillChromPeaks is the step that takes the longest.

It does seem like I'll need to do some optimization on our system. Any advice on how to determine how stable the system is with a given number of cores? I don't mind starting with a large # of available cores and working down, as CoreyG suggests. However, for various reasons, I do not want to test stability by whether or not it crashes the server. Do you think that using the 'ulimit' approach to limit RAM would help to prevent the server from crashing? Can an I/O bottleneck cause a crash?

Perhaps I could work through the script step by step, optimizing the number of cores at each step before moving on to the next...

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #16
Hello everyone! I have an update and some more questions. Thanks in advance for your help. You are all going into the Acknowledgements section of my dissertation!

I successfully processed 2500+ injections with XCMS on the big server I mentioned before (Ubuntu 18.04, 350+ GB RAM, 80 cores). Total run time--from importing raw data to exporting a table of grouped features--was about 2 days. I set the number of cores available for parallel processing to 20.
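The exact call I used to register the workers was elided above; a minimal sketch of one common way to do it with BiocParallel (the parallel backend that xcms uses), assuming a Unix-like system:

```r
library(BiocParallel) # parallel backend used by xcms

# register 20 forked workers; subsequent xcms calls pick this up automatically
register(MulticoreParam(workers = 20))
```

On Windows, `SnowParam` would be used in place of `MulticoreParam`.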

In the Linux command line, before starting the run, I limited the amount of available RAM to 300 GB in an attempt to prevent R from crashing the server.
Code: [Select]
# units are kilobytes: 300000000 kB = 300 GB
# (note: ulimit -m is silently ignored by modern Linux kernels;
# ulimit -v, which caps virtual memory, is the limit that is actually enforced)
ulimit -m 300000000

Interestingly, I noticed as the script progressed that more and more RAM was being "used" by R (according to htop). In fact, the amount approached 400 GB (i.e. over my 300 GB cap). This amount was over an order of magnitude greater than the total object size reported within R. Based on my research, it seems most likely that R was simply being "lazy" and not returning freed RAM to the OS. At any rate, this did not seem to affect the stability of the server.

I ran a PCA on my set of markers from 1948 injections (I re-ran XCMS on a subset of the aforementioned 2500+ injections). The biggest trend I'm seeing is a batch effect. See attached scores plot. Note that PC1 accounts for over 73% of total variability.

I inspected one ion from the same peak across all of my 200+ QC samples; the m/z range is about 40 ppm for almost all of the samples. The m/z values for a small minority of samples are way off, which was due to some LockSpray issues. The retention time range for this peak is 0.06 min for all but the first batch.

My hunch is that the batch effect I'm seeing is largely due to the m/z drift. What do you think?

If so, how would I correct this? I imagine that the batch effect is due to a lack of correspondence caused by the m/z drift. Is there a way to increase the m/z tolerance range in the correspondence step? I'm seeing that there is a binSize param in the groupChromPeaks-density function: "numeric(1) defining the size of the overlapping slices in mz dimension". It's not obvious to me how to set this parameter. Is it in terms of Da, ppm, or what? How would I set it to allow for a 40ppm difference?

Here are the calls that I used for the correspondence. I did not define binSize; I think the default is 0.25.
Code: [Select]
# density-based correspondence settings
pdp <- PeakDensityParam(sampleGroups = sample.groups, minFraction = (1/length(file.paths)), bw = 1)
xdata <- groupChromPeaks(xdata, param = pdp) # group peaks into features across samples
# fill in missing peaks within fixed rt/mz windows
xdata <- fillChromPeaks(xdata, param = FillChromPeaksParam(fixedRt = 15, fixedMz = 0.1))

Is it even possible to correct for these inter-batch differences by using XCMS? Or would I need to employ other software? I've done some literature research on the topic (see below), although I haven't attempted any of these yet.

Rusilowicz, M., Dickinson, M., Charlton, A., O’Keefe, S., & Wilson, J. (2016). A batch correction method for liquid chromatography–mass spectrometry data that does not depend on quality control samples. Metabolomics, 12(3), 1–11.

Wehrens, R., Hageman, J. A., van Eeuwijk, F., Kooke, R., Flood, P. J., Wijnker, E., … de Vos, R. C. H. (2016). Improved batch correction in untargeted MS-based metabolomics. Metabolomics, 12(5).

Wang, S. Y., Kuo, C. H., & Tseng, Y. J. (2012). Batch Normalizer: a fast total abundance regression calibration method to simultaneously adjust batch and injection order effects in liquid chromatography/time-of-flight mass spectrometry-based metabolomics data and comparison with current calibration methods. Analytical Chemistry, 85(2), 1037–1046.

Brunius, C., Shi, L., & Landberg, R. (2016). Large-scale untargeted LC-MS metabolomics data correction using between-batch feature alignment and cluster-based within-batch signal intensity drift correction. Metabolomics, 12(11), 173.

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #17
You can set `binSize` in `PeakDensityParam`. It is specified in Daltons; you cannot set it in ppm. The default is 0.25 Da, which is already far wider than 40 ppm, so increasing it is unlikely to help you. You should probably set it much lower.
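To translate a ppm tolerance into the Dalton units that `binSize` expects, you can scale by the largest m/z you care about; a quick sketch (the m/z values here are just example numbers, not from this dataset):

```r
# translate a ppm tolerance into the Da units that binSize uses,
# scaled by the upper end of the m/z range of interest
ppm    <- 40
mz.max <- 1000                    # example upper end of the m/z range
binSize <- mz.max * ppm * 1e-6    # 40 ppm at m/z 1000 = 0.04 Da
binSize
```

Because ppm is a relative tolerance, a single Da-valued `binSize` is necessarily a compromise: wide enough for your highest masses, wider than needed for your lowest.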
With such a large dataset you are very likely to suffer from intensity drift issues that you'd need to correct. XCMS does not provide tools for that, but the papers you list describe some.

A few other observations:
  • setting `minFraction` to 1/n is extremely aggressive; I assume you get a lot of features. `minSamples`/`minFraction` are critical for removing false positives from the peak picking
  • are you sure `bw` is right? A `bw` of 1 (seconds) means you don't expect more than ~1/60 ≈ 0.017 min of retention time difference between any of your samples. Setting it that low might also give you batch effects if you have retention time differences between batches.

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #18
Yes, I do get a lot of features (15,000+). It seems that the vast majority of the features have relatively high intensity and low variation. I've attached a plot of relative standard deviation (rsd) vs. mean for each feature. Could these be the false positives that you refer to?
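For reference, the RSD-vs-mean values behind that plot can be computed per feature from the feature matrix; a sketch in base R, with a toy matrix standing in for the real output of `featureValues(xdata, value = "into")`:

```r
# fv: features x samples matrix of integrated peak areas
# (toy stand-in; in practice this comes from featureValues(xdata, value = "into"))
fv <- rbind(c(100, 100, 100, 100),
            c(100, 120,  80, 100),
            c( 10,  30,  20,  40))

feat.mean <- rowMeans(fv, na.rm = TRUE)
feat.sd   <- apply(fv, 1, sd, na.rm = TRUE)
feat.rsd  <- 100 * feat.sd / feat.mean   # RSD as a percentage

plot(feat.mean, feat.rsd, pch = 20,
     xlab = "mean peak area", ylab = "RSD (%)")
```

Features with both low mean and high RSD are the usual suspects for peak-picking false positives.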

I set bw=1 for groupChromPeaks based on the optimization procedure illustrated in Figure 12. I was using a subset of ~20 samples from across all my batches; these samples represented a wide range of retention times. I set bw this low in order to resolve some features that are not baseline separated. I was assuming that the Obiwarp step, which is upstream of the peak grouping step, would have corrected for shifts in retention time. Is this not a safe assumption?

I think the next diagnostic step will be to look at the peak density distribution per sample for all of my batch QC samples (as in Figure 8).

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #19
I re-ran XCMS peak detection on just the 205 batch QC injections. I used the same peak detection parameters as were used for the set of 1948 samples (as seen in my previous PCA scores plot). I made boxplots of the log2 peak intensities for each of the QC samples. Out of curiosity, I also made boxplots of the untransformed intensities. Three plots are attached (log2; untransformed; untransformed and zoomed in to show the IQRs).

Each batch is represented by a separate color. There do appear to be some inter-batch differences in intensity. Could these differences be enough to be causing the differences that I'm seeing in the PCA?
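The boxplot code can be sketched as below, with a toy matrix standing in for the real feature intensities and `batch` assumed to be a factor of batch labels in injection order:

```r
set.seed(1)
fv <- matrix(2^rnorm(120, mean = 10), nrow = 12)  # toy: 12 features x 10 samples
batch <- factor(rep(c("b1", "b2"), each = 5))     # toy batch labels, one per sample

# one box per sample (column), colored by batch
boxplot(log2(fv), col = as.integer(batch) + 1,
        xlab = "sample (injection order)", ylab = "log2 peak intensity")
```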

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #20
I did not know you had done Obiwarp. If it works well then it could be fine.

Looking at intensity distributions alone IMO won't tell you enough. As illustrated in the Brunius paper, features behave differently and some remain constant. I would look at a few compounds I know and see how they behave across your batches.

Looking at the loadings of your PCA might also give you a clue. You should be able to see whether your batch difference is dominated by a few features or whether it is all features moving in one direction.
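One way to make that concrete is to pull out the PC1 loadings with base R's `prcomp`; a sketch with a toy matrix (in practice the rows would be your samples and the columns your features):

```r
set.seed(1)
fv <- matrix(rnorm(200), nrow = 10)   # toy: 10 samples x 20 features
pca <- prcomp(fv, center = TRUE, scale. = TRUE)

# PC1 loadings: if nearly all features load with the same sign, the batch
# difference is a global shift rather than a handful of bad features
pc1.load <- pca$rotation[, 1]
table(sign(pc1.load))
hist(pc1.load, main = "PC1 loadings", xlab = "loading")
```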

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #21
Here is the scores plot with loadings vectors. It looks like the overwhelming majority of features are moving in one direction. To me, this suggests a systematic drift in intensity between batches. What do you think?

Also, are there any R-based batch correction methods that you can recommend? It's taken me a while to get comfortable with R, and I'd prefer not to have to learn any other languages.


Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #22
Yes, the loadings do suggest that. But since you don't see that in your boxplots, I think the pattern is too complex to appreciate there, so you need to look at some individual compounds. The correction methods always work on each feature individually for the same reason.

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #23
I ran PCA on just the batch QC samples. I also plotted the intensity of several randomly selected markers across all my QC samples. I've attached a few of the plots.

There does appear to be a pretty consistent increasing trend over the batches (with the exception of batch 20180728). This seems suspicious. I'd expect the intensities to decrease or increase over time depending on when I cleaned the sample cone.

The trend seen in the individual markers does seem consistent with the trend seen in PC1, though.

Also, I'm wondering what the lower-intensity points are. Perhaps they were generated during the peak filling step; I'm not sure. If that's the case, maybe my bw=1 in PeakDensityParam is, as you suggested, too low?

These are not known compounds, as you suggested; they are just randomly selected markers. My reasoning was that all markers contribute to the PCA. However, it does seem wise to quantify a few known peaks using QuanLynx and see whether the trend matches what I'm seeing in the markers generated by XCMS.

Btw, here's the code I used to generate the plots of individual markers:
Code: [Select]
library(dplyr) # for sorting the dataframe

dfr <- arrange(dfr, batch) # sort the dataframe by batch

m <- dim(dfr)[2] # number of columns in the dataframe

rand <- function() {round(runif(1, 16, m), 0)} # random integer between 16 (i.e. the first column of features) and the last column, used to pick a marker

plot.random.marker <- function() { # simplify plotting; the original function name was lost, so "plot.random.marker" is a stand-in

n <- rand()

plot(dfr[, n] # peak areas of the randomly chosen marker
  , col=dfr$batch
  , main=paste("marker #", (n-15), "batch QC samples only")
  , pch=20
  , ylab="peak area"
  , xlab="sample number"
) # end of plot call

legend("topright" # legend position was lost; "topright" is a guess
  , levels(dfr$batch)
  , col=1:length(levels(dfr$batch))
  , cex=0.8
  , pch=20
  , title="Batch"
) # end of legend call
} # end of function definition; call plot.random.marker() to plot a new random marker

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #24
Glad to see things moving along, Metabolon1.

Regarding batch correction, take a look at a recent topic that came up on the forums: Feature intensity drift correction suggestions. In addition, check out the Metabolomics Society Wiki page on batch effects for some other packages: Batch Effect.
There are also a few RandomForest based methods floating around. I'll grab the name of the package tomorrow.

Regarding the latest plots you put up, they look remarkably similar to plots I made when I "exported" my data with featureValues but didn't specify 'value = "into"'. If you don't specify this, featureValues just returns the row number of that peak in chromPeaks rather than its intensity.

Lastly, I too had to set 'binSize' (in PeakDensityParam) quite a bit lower, as I observed that the feature m/z range was a bit too large given the ppm error that I would expect.


Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #25
Random forest batch correction (Quality Control – Random Forest Signal Correction; QC-RFSC) is incorporated into statTarget, which is available in R.

Alternatively, there was a recent paper by Fiehn's lab on Systematical Error Removal using Random Forest (SERRF); some details of the method can be found in that paper. Unfortunately, they have set it up as an online portal, where you have to upload your data to their server. This could constitute a breach of ethics for data sharing/storage, so be sure you have permission to do so if you use it.