Topic: Peak alignment with large dataset (over 2500 samples and growing) (Read 982 times)

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #15
Thanks all!!!

johannes.rainer, I tried your suggested code. It seems that my largest files are ~150MB. Using this number to make a conservative estimate, I would need about 12GB of RAM to hold 78 raw data files in memory, which is well below the 350+ GB on the server. But as you said, there are also other processes/objects that need RAM.

CoreyG mentioned that 4 threads keeps the RAM below 16 GB on a low-spec desktop. So roughly 4 GB per core, which is much higher than the 150 MB per core estimated above. But you also mentioned that fillChromPeaks is the most memory-intensive process in your script, requiring a limit of 1 thread on a 32 GB system. I've also noticed that fillChromPeaks is the step that takes the longest.

It does seem like I'll need to do some optimization on our system. Any advice on how to determine how stable the system is with a given number of cores? I don't mind starting with a large # of available cores and working down, as CoreyG suggests. However, for various reasons, I do not want to test stability by whether or not it crashes the server. Do you think that using the 'ulimit' approach to limit RAM would help to prevent the server from crashing? Can an I/O bottleneck cause a crash?

Perhaps I could work through the script step by step, optimizing the number of cores at each step before moving on to the next...

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #16
Hello everyone! I have an update and some more questions. Thanks in advance for your help. You are all going into the Acknowledgements section of my dissertation!

I successfully processed 2500+ injections with XCMS on the big server I mentioned before (Ubuntu 18.04, 350+ GB RAM, 80 cores). Total run time--from importing raw data to exporting a table of grouped features--was about 2 days. I set the number of cores available for parallel processing to 20.
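The exact call I used to register the workers was elided above; a minimal sketch of one common way to do it with BiocParallel (the parallel backend that xcms uses), assuming a Unix-like system:

```r
library(BiocParallel) # parallel backend used by xcms

# register 20 forked workers; subsequent xcms calls pick this up automatically
register(MulticoreParam(workers = 20))
```

On Windows, `SnowParam` would be used in place of `MulticoreParam`.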

In the Linux command line, before starting the run, I limited the amount of available RAM to 300 GB in an attempt to prevent R from crashing the server.
Code: [Select]
# units are kilobytes: 300000000 kB = 300 GB
# (note: ulimit -m is silently ignored by modern Linux kernels;
# ulimit -v, which caps virtual memory, is the limit that is actually enforced)
ulimit -m 300000000

Interestingly, I noticed as the script progressed that more and more RAM was being "used" by R (according to htop). In fact, the amount approached 400 GB (i.e. over my 300 GB cap). This amount was over an order of magnitude greater than the total object size reported within R. Based on my research, it seems most likely that R was simply being "lazy" and not returning freed RAM to the OS. At any rate, this did not seem to affect the stability of the server.

I ran a PCA on my set of markers from 1948 injections (I re-ran XCMS on a subset of the aforementioned 2500+ injections). The biggest trend I'm seeing is a batch effect. See attached scores plot. Note that PC1 accounts for over 73% of total variability.

I inspected one ion from the same peak across all of my 200+ QC samples; the m/z range is about 40 ppm for almost all of the samples. The m/z values for a small minority of samples are way off, which was due to some LockSpray issues. The retention time range for this peak is 0.06 min for all but the first batch.

My hunch is that the batch effect I'm seeing is largely due to the m/z drift. What do you think?

If so, how would I correct this? I imagine that the batch effect is due to a lack of correspondence caused by the m/z drift. Is there a way to increase the m/z tolerance range in the correspondence step? I'm seeing that there is a binSize param in the groupChromPeaks-density function: "numeric(1) defining the size of the overlapping slices in mz dimension". It's not obvious to me how to set this parameter. Is it in terms of Da, ppm, or what? How would I set it to allow for a 40ppm difference?

Here are the calls that I used for the correspondence. I did not define binSize; I think the default is 0.25.
Code: [Select]
# density-based correspondence settings
pdp <- PeakDensityParam(sampleGroups = sample.groups, minFraction = (1/length(file.paths)), bw = 1)
xdata <- groupChromPeaks(xdata, param = pdp) # group peaks into features across samples
# fill in missing peaks within fixed rt/mz windows
xdata <- fillChromPeaks(xdata, param = FillChromPeaksParam(fixedRt = 15, fixedMz = 0.1))

Is it even possible to correct for these inter-batch differences by using XCMS? Or would I need to employ other software? I've done some literature research on the topic (see below), although I haven't attempted any of these yet.

Rusilowicz, M., Dickinson, M., Charlton, A., O’Keefe, S., & Wilson, J. (2016). A batch correction method for liquid chromatography–mass spectrometry data that does not depend on quality control samples. Metabolomics, 12(3), 1–11.

Wehrens, R., Hageman, J. A., van Eeuwijk, F., Kooke, R., Flood, P. J., Wijnker, E., … de Vos, R. C. H. (2016). Improved batch correction in untargeted MS-based metabolomics. Metabolomics, 12(5).

Wang, S. Y., Kuo, C. H., & Tseng, Y. J. (2012). Batch Normalizer: a fast total abundance regression calibration method to simultaneously adjust batch and injection order effects in liquid chromatography/time-of-flight mass spectrometry-based metabolomics data and comparison with current calibration methods. Analytical Chemistry, 85(2), 1037–1046.

Brunius, C., Shi, L., & Landberg, R. (2016). Large-scale untargeted LC-MS metabolomics data correction using between-batch feature alignment and cluster-based within-batch signal intensity drift correction. Metabolomics, 12(11), 173.

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #17
You can set `binSize` in `PeakDensityParam`. It is specified in Daltons; you cannot set it in ppm. The default is 0.25 Da, which is already far wider than 40 ppm, so increasing it is unlikely to help you. You should probably set it much lower.
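To translate a ppm tolerance into the Dalton units that `binSize` expects, you can scale by the largest m/z you care about; a quick sketch (the m/z values here are just example numbers, not from this dataset):

```r
# translate a ppm tolerance into the Da units that binSize uses,
# scaled by the upper end of the m/z range of interest
ppm    <- 40
mz.max <- 1000                    # example upper end of the m/z range
binSize <- mz.max * ppm * 1e-6    # 40 ppm at m/z 1000 = 0.04 Da
binSize
```

Because ppm is a relative tolerance, a single Da-valued `binSize` is necessarily a compromise: wide enough for your highest masses, wider than needed for your lowest.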
With such a large dataset you are very likely to suffer from intensity drift issues that you'd need to correct. XCMS does not provide tools for that, but the papers you list describe some.

A few other observations:
  • setting `minFraction` to 1/n is extremely aggressive; I assume you get a lot of features. `minSamples`/`minFraction` are critical for removing false positives from the peak picking
  • are you sure `bw` is right? A `bw` of 1 (seconds) means you don't expect more than ~1/60 ≈ 0.017 min of retention time difference between any of your samples. Setting it that low might also give you batch effects if you have retention time differences between batches.

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #18
Yes, I do get a lot of features (15,000+). It seems that the vast majority of the features have relatively high intensity and low variation. I've attached a plot of relative standard deviation (rsd) vs. mean for each feature. Could these be the false positives that you refer to?
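For reference, the RSD-vs-mean values behind that plot can be computed per feature from the feature matrix; a sketch in base R, with a toy matrix standing in for the real output of `featureValues(xdata, value = "into")`:

```r
# fv: features x samples matrix of integrated peak areas
# (toy stand-in; in practice this comes from featureValues(xdata, value = "into"))
fv <- rbind(c(100, 100, 100, 100),
            c(100, 120,  80, 100),
            c( 10,  30,  20,  40))

feat.mean <- rowMeans(fv, na.rm = TRUE)
feat.sd   <- apply(fv, 1, sd, na.rm = TRUE)
feat.rsd  <- 100 * feat.sd / feat.mean   # RSD as a percentage

plot(feat.mean, feat.rsd, pch = 20,
     xlab = "mean peak area", ylab = "RSD (%)")
```

Features with both low mean and high RSD are the usual suspects for peak-picking false positives.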

I set bw=1 for groupChromPeaks based on the optimization procedure illustrated in Figure 12. I was using a subset of ~20 samples from across all my batches; these samples represented a wide range of retention times. I set bw this low in order to resolve some features that are not baseline separated. I was assuming that the Obiwarp step, which is upstream of the peak grouping step, would have corrected for shifts in retention time. Is this not a safe assumption?

I think the next diagnostic step will be to look at the peak density distribution per sample for all of my batch QC samples (as in Figure 8).

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #19
I re-ran XCMS peak detection on just the 205 batch QC injections. I used the same peak detection parameters as were used for the set of 1948 samples (as seen in my previous PCA scores plot). I made boxplots of the log2 peak intensities for each of the QC samples. Out of curiosity, I also made boxplots of the untransformed intensities. Three plots are attached (log2; untransformed; untransformed and zoomed in to show the IQRs).

Each batch is represented by a separate color. There do appear to be some inter-batch differences in intensity. Could these differences be enough to be causing the differences that I'm seeing in the PCA?
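The boxplot code can be sketched as below, with a toy matrix standing in for the real feature intensities and `batch` assumed to be a factor of batch labels in injection order:

```r
set.seed(1)
fv <- matrix(2^rnorm(120, mean = 10), nrow = 12)  # toy: 12 features x 10 samples
batch <- factor(rep(c("b1", "b2"), each = 5))     # toy batch labels, one per sample

# one box per sample (column), colored by batch
boxplot(log2(fv), col = as.integer(batch) + 1,
        xlab = "sample (injection order)", ylab = "log2 peak intensity")
```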

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #20
I did not know you had done Obiwarp. If it works well then it could be fine.

Looking at intensity distributions alone IMO won't tell you enough. As illustrated in the Brunius paper, features behave differently and some remain constant. I would look at a few compounds I know and see how they behave across your batches.

Looking at the loadings of your PCA might also give you a clue. You should be able to see whether your batch difference is dominated by a few features or whether it is all features moving in one direction.
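One way to make that concrete is to pull out the PC1 loadings with base R's `prcomp`; a sketch with a toy matrix (in practice the rows would be your samples and the columns your features):

```r
set.seed(1)
fv <- matrix(rnorm(200), nrow = 10)   # toy: 10 samples x 20 features
pca <- prcomp(fv, center = TRUE, scale. = TRUE)

# PC1 loadings: if nearly all features load with the same sign, the batch
# difference is a global shift rather than a handful of bad features
pc1.load <- pca$rotation[, 1]
table(sign(pc1.load))
hist(pc1.load, main = "PC1 loadings", xlab = "loading")
```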

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #21
Here is the scores plot with loadings vectors. It looks like the overwhelming majority of features are moving in one direction. To me, this suggests a systematic drift in intensity between batches. What do you think?

Also, are there any R-based batch correction methods that you can recommend? It's taken me a while to get comfortable with R, and I'd prefer not to have to learn any other languages.


Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #22
Yes, the loadings do suggest that. But since you don't see that in your boxplots, I think the pattern is too complex to appreciate there, so you need to look at some individual compounds. The correction methods always work on each feature individually for the same reason.

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #23
I ran PCA on just the batch QC samples. I also plotted the intensity of several randomly selected markers across all my QC samples. I've attached a few of the plots.

There does appear to be a pretty consistent increasing trend over the batches (with the exception of batch 20180728). This seems suspicious. I'd expect the intensities to decrease or increase over time depending on when I cleaned the sample cone.

The trend seen in the individual markers does seem consistent with the trend seen in PC1, though.

Also, I'm wondering what the lower-intensity points are. Perhaps they were generated during the peak filling step; I'm not sure. If that's the case, maybe my bw=1 in PeakDensityParam is, as you suggested, too low?

These are not known compounds, as you suggested; they are just randomly selected markers. My reasoning was that all markers contribute to the PCA. However, it does seem wise to quantify a few known peaks using QuanLynx and see whether the trend matches what I'm seeing in the markers generated by XCMS.

Btw, here's the code I used to generate the plots of individual markers:
Code: [Select]
library(dplyr) # for sorting the dataframe

dfr <- arrange(dfr, batch) # sort the dataframe by batch

m <- dim(dfr)[2] # number of columns in the dataframe

rand <- function() {round(runif(1, 16, m), 0)} # random integer between 16 (i.e. the first column of features) and the last column, used to pick a marker

plot.random.marker <- function() { # simplify plotting; the original function name was lost, so "plot.random.marker" is a stand-in

n <- rand()

plot(dfr[, n] # peak areas of the randomly chosen marker
  , col=dfr$batch
  , main=paste("marker #", (n-15), "batch QC samples only")
  , pch=20
  , ylab="peak area"
  , xlab="sample number"
) # end of plot call

legend("topright" # legend position was lost; "topright" is a guess
  , levels(dfr$batch)
  , col=1:length(levels(dfr$batch))
  , cex=0.8
  , pch=20
  , title="Batch"
) # end of legend call
} # end of function definition; call plot.random.marker() to plot a new random marker

Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #24
Glad to see things moving along, Metabolon1.

Regarding batch correction, take a look at a recent topic that came up on the forums: Feature intensity drift correction suggestions. In addition, check out the Metabolomics Society Wiki page on batch effects for some other packages: Batch Effect.
There are also a few RandomForest based methods floating around. I'll grab the name of the package tomorrow.

Regarding the latest plots you put up, they look remarkably similar to plots I made when I "exported" my data with featureValues but didn't specify 'value = "into"'. If you don't specify this, featureValues just returns the row number of that peak in chromPeaks rather than its intensity.

Lastly, I too had to set 'binSize' (in PeakDensityParam) quite a bit lower, as I observed that the feature m/z range was a bit too large given the ppm error that I would expect.


Re: Peak alignment with large dataset (over 2500 samples and growing)

Reply #25
Random forest batch correction (Quality Control – Random Forest Signal Correction; QC-RFSC) is incorporated into statTarget, which is available in R.

Alternatively, there was a recent paper by Fiehn's lab on Systematical Error Removal using Random Forest (SERRF); some details of the method can be found in that paper. Unfortunately, they have set it up as an online portal, where you have to upload your data to their server. This could constitute a breach of ethics for data sharing/storage, so be sure you have permission to do so if you use it.