Metabolomics Society Forum

Software => Other => Topic started by: metabolon1 on December 08, 2018, 10:07:34 PM

Title: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on December 08, 2018, 10:07:34 PM
Dear all,

I am trying to process over 2500 files from a UPLC-QTOF-MS dataset. The goal is to eventually increase this number to 10,000 and beyond. Currently I am using MZmine 2. I am fortunate to have access to a big server (80 core, 350+ GB RAM). However, it seems that the peak alignment step is not optimized for this number of samples. See my other post for more details about this issue: https://github.com/mzmine/mzmine2/issues/518

Any ideas on a more efficient peak alignment method? As far as I can tell, the raw data are already pretty well aligned; the UPLC seems to be fairly consistent. My main objective right now is to get all the samples/peaks into a single matrix.

I am actively trying different approaches, but it all takes time. I am hoping that someone else who has trod this ground before can offer advice to help save time and effort.

Many thanks!
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: Jan Stanstrup on December 09, 2018, 05:32:36 AM
MZmine is known to use a lot of memory. I imagine that is where your bottleneck is, but you should check that.

XCMS is much more memory-efficient. Be aware that each core will use a certain amount of memory, so on a system like yours using fewer cores will use less memory and might save you if memory is your bottleneck. Also, don't use 80 cores on processes that are bottlenecked by HDD reads (like reading the raw data).

That said, with 10,000 samples you really need to be careful about how greedy you are in terms of how low in intensity you want to pick peaks.
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: Guillermo Quintas on December 09, 2018, 02:29:04 PM
Hi, as a workaround I usually split big datasets into subsets (~250 runs) and process them independently using XCMS. Then I use a linear or non-linear (rsc, svr, ...) fit of the retention time shift, using 'known' metabolites, to match variables across peak tables. As said, it's just a workaround, but you can process each subset in parallel and greatly reduce the computing time and memory needed.
g
PS. If the raw data are already pretty well aligned, peak tables can be aligned using m/z & RT tolerances and a 'master-slave' approach in MATLAB/R/Python/etc.
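For illustration, a minimal sketch of such m/z & RT tolerance matching in R (the table layout, column names and tolerances are made up; a real implementation would also need to handle ties and one-to-many matches):
Code: [Select]
## Match features of a "slave" peak table to a "master" peak table using
## simple m/z (ppm) and RT tolerances; both tables need 'mz' and 'rt' columns.
match_tables <- function(master, slave, ppm = 20, rt_tol = 10) {
  vapply(seq_len(nrow(slave)), function(i) {
    dmz <- abs(master$mz - slave$mz[i]) / slave$mz[i] * 1e6   # ppm difference
    drt <- abs(master$rt - slave$rt[i])                       # RT difference
    hits <- which(dmz <= ppm & drt <= rt_tol)
    if (length(hits) == 0) return(NA_integer_)
    hits[which.min(dmz[hits] + drt[hits])]                    # crude "closest" tie-break
  }, integer(1))
}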
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: johannes.rainer on December 12, 2018, 06:27:14 AM
Just chiming in with some explanation of how xcms now works with large projects. We now use an object from the MSnbase package to represent the raw data, which only reads the raw (spectrum) data when required; that way memory use is minimized. Peak detection is then performed on a per-file basis, i.e. reading the full data from one file, performing the peak detection on it, and then removing the spectrum data from memory again. As Jan mentioned, you should be careful not to have too many parallel processes running, as the I/O will be the bottleneck, not the number of CPUs. On our cluster I use no more than 24 CPUs in parallel (although we have 292), because otherwise I ran into serious trouble with I/O (most likely because our disk setup is ... suboptimal).

Just have a look at a recent xcms vignette (R 3.5.1, Bioconductor 3.8) for how to perform the analysis. xcms uses this "onDisk" mode by default.
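A minimal sketch of that workflow (file paths and parameter values are placeholders, not recommendations):
Code: [Select]
library(xcms)   # attaches MSnbase as well

## "onDisk" mode keeps only header information in memory; the spectra are
## read from the mzML files on demand.
raw <- readMSData(files = mzml_files, mode = "onDisk")

## Peak detection then runs file by file: load, detect, drop the spectra again.
cwp <- CentWaveParam(ppm = 25, peakwidth = c(5, 30))   # example values only
xdata <- findChromPeaks(raw, param = cwp)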

cheers, jo
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: CoreyG on December 18, 2018, 12:35:24 AM
The "onDisk" mode of xcms has allowed us to process ~1,000 samples comfortably on a desktop machine - although it does take some time. Retention time alignment and correspondence happens quite fast and hasn't given us any trouble at all.
The only problem we've had is with fillChromPeaks, where we need to run it single threaded due to memory constraints.
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: lzheng on December 19, 2018, 07:27:27 PM
Hi, I'm facing the same issue and my dataset is even bigger. My question: after you have extracted peaks and aligned them well across a dataset A containing 10,000 samples, say with xcms, and another dataset B with 10,000 more samples arrives, how can you align dataset B to A? Should I combine A and B and redo peak picking and grouping across all 20,000 samples? Thanks very much.
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: Jan Stanstrup on December 20, 2018, 03:20:41 AM
Asked here too with @johannes.rainer giving an answer: https://github.com/sneumann/xcms/issues/344
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on January 10, 2019, 03:04:23 PM
Thanks for all of your responses. I got sidetracked by some other projects, but I should be back on this one soon. Best wishes to you all in 2019!
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: Elena Legrand on January 28, 2019, 01:00:36 PM
Best wishes as well! And let us know if you figured it out!
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on February 05, 2019, 06:58:50 PM
I'm back on this project now. I'm getting ready to test out a script on the big server mentioned earlier (Ubuntu 18.04, 80 cores, 350+ GB RAM). Can anyone offer tips on how to set parameters for parallel processing? I just downloaded the BiocParallel package, and I'm trying to make sense of it all.

In an effort to limit RAM to 300GB, I am planning to use the command below. Any thoughts on this approach?
ulimit -m 300000000   # -m is specified in kB, so this is roughly 300 GB

Before, when I was running MZmine, the server kept crashing when the heap size was set to auto. When I lowered the heap size to 300GB, it stopped crashing. Now, with XCMS, I'm planning to use the ulimit command instead.

Many thanks!


Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on February 05, 2019, 07:17:45 PM
Looking through the LC-MS preprocessing vignette for XCMS, it seems like this simple call could do the trick...

register(bpstart(MulticoreParam()))

If I understand correctly, this would set the number of available cores to 78. Would it be better to use a smaller number of cores? johannes.rainer had cautioned against using too many cores. Would setting the limit on RAM (as mentioned in the previous post) prevent the "serious troubles" that johannes.rainer mentioned?

Any other thoughts?

Many, many thanks!
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: CoreyG on February 06, 2019, 05:27:17 PM
Take my comments with a grain of salt, because I only ever work with windows desktop machines.

I would imagine it comes down to the capabilities of the machine (IO performance etc).
There is a batch of samples running on a desktop right next to me, using 6 threads and 20 GB of RAM (including Windows overhead). The hard drives are barely being touched; just the occasional access. So it would hardly be a problem here. But you could imagine that if 78 cores are all reading/writing data to the hard drives, that would be a significant bottleneck.

Given what Johannes said, I would limit the core count. The worst thing you could do is limit the available RAM so much that it has to keep paging memory to the hard drives (I'm assuming servers would have to do this as well?).

Good luck!
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on February 06, 2019, 06:27:03 PM
Thanks CoreyG. I am proposing to limit RAM because of my previous experience with MZmine 2; it crashed the server multiple times when it was given access to all the RAM. Would this not be an issue with R/xcms?

As to the issue of I/O bottlenecking, some more questions.

Jan mentioned NOT to use all the cores for reading from raw data files. Are there any steps in the data processing for which having 78 cores available would be a good idea?  Would there be any advantage to using all cores after the onDisk object has been created? I'm envisioning a script where I change the number of available cores at different steps in the data processing...

If this is not a good option, then can anyone suggest guidelines for determining the number of cores to use?

Dankeschön, mange tak, 谢谢, and thank you! (hopefully I didn't miss anyone)

Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: CoreyG on February 06, 2019, 10:02:22 PM
Hopefully someone with more experience will chime in when they get the chance.

In my experience, limiting the number of available cores will reduce how much total RAM is used/allocated, i.e. RAM usage should be roughly linear with the number of parallel processes. On our lower-spec desktops, we use 4 threads to keep the RAM usage below 16 GB.

Based on my understanding, readMSData generates the "onDisk" object without having to load all the raw data into RAM. findChromPeaks will cause each parallel process to load a raw data file from the hard drive into memory, perform the peak detection, then clear the unneeded data from memory. So this step will be memory- and I/O-intensive.
After that, retention time alignment may or may not be similarly demanding. Obiwarp works on the raw data, so it will need to be loaded into RAM again. Peak-groups alignment, on the other hand, doesn't require the raw data.
The last operation I usually perform is fillChromPeaks. This, again, requires the raw data to be loaded. In my hands, this seems to be the most memory-intensive step, requiring me to run it single-threaded even with 32 GB of system RAM.

You certainly could get away with changing the number of available cores at different steps, but you might need to experiment to determine what works best for your system. In our case, we ran scripts with different numbers of threads and monitored the system, reducing the number until it was stable.
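For example, something along these lines ('raw', 'cwp' and 'sample_groups' stand for your data object, peak-detection parameters and group labels; the worker counts are placeholders to tune for your own system):
Code: [Select]
library(xcms)
library(BiocParallel)

## Peak detection loads whole files, so keep the worker count modest
register(MulticoreParam(workers = 20))
xdata <- findChromPeaks(raw, param = cwp)

## Obiwarp re-reads the raw data as well
xdata <- adjustRtime(xdata, param = ObiwarpParam())

## Correspondence works on the detected peaks only
xdata <- groupChromPeaks(xdata, param = PeakDensityParam(sampleGroups = sample_groups))

## fillChromPeaks was the memory hog for us: drop to a single worker
register(SerialParam())
xdata <- fillChromPeaks(xdata)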
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: johannes.rainer on February 07, 2019, 01:06:12 AM
Excellent summary @CoreyG! In the end the 78 cores don't help unless you have enough memory to fit the full data of 78 mzML files into memory at once. Peak detection and filling in missing peak data both need to load the full data of a file (all spectra). You could load one of your mzML files into memory and then calculate its size using the object_size function from the pryr package:

Code: [Select]
obj <- readMSData(filename, mode = "inMem")   # load one raw file completely into memory
pryr::object_size(obj)                        # size of that file's data in RAM

You could thus estimate how many processes you could run in parallel. On top of that, however, you will also need some more memory to fit the chromPeaks matrix and later the featureDefinitions data frame, and R will need some extra space to copy things in the background.
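To make that concrete, a rough back-of-the-envelope sketch (the headroom factor is an arbitrary guess, not a measured value):
Code: [Select]
library(MSnbase)
library(pryr)

obj <- readMSData(filename, mode = "inMem")   # one representative (large) raw file
per_file <- as.numeric(object_size(obj))      # bytes needed to hold that file in memory

## leave generous headroom for chromPeaks, featureDefinitions and copying
headroom <- 3
floor(300e9 / (per_file * headroom))          # rough upper bound on parallel workers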

Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on February 07, 2019, 03:47:38 PM
Thanks all!!!

johannes.rainer, I tried your suggested code. It seems that my largest files are ~150 MB. Using this number to make a conservative estimate, I would need about 12 GB of RAM to hold 78 raw data files in memory, which is well below the 350+ GB on the server. But as you said, there are also other processes/objects that need RAM.

CoreyG mentioned that 4 threads keeps the RAM below 16 GB on a lower-spec desktop. So roughly 4 GB per core, which is much higher than the ~150 MB per core estimated above. But you also mentioned that fillChromPeaks is the most memory-intensive process in your script, requiring a limit of 1 thread on a 32 GB system. I've also noticed that fillChromPeaks is the step that takes the longest.

It does seem like I'll need to do some optimization on our system. Any advice on how to determine how stable the system is with a given number of cores? I don't mind starting with a large # of available cores and working down, as CoreyG suggests. However, for various reasons, I do not want to test stability by whether or not it crashes the server. Do you think that using the 'ulimit' approach to limit RAM would help to prevent the server from crashing? Can an I/O bottleneck cause a crash?

Perhaps I could work through the script step by step, optimizing the number of cores at each step before moving on to the next...

Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on March 07, 2019, 10:41:15 PM
Hello everyone! I have an update and some more questions. Thanks in advance for your help. You are all going into the Acknowledgements section of my dissertation!

--Update--
I successfully processed 2500+ injections with XCMS on the big server I mentioned before (Ubuntu 18.04, 350+ GB RAM, 80 cores). Total run time--from importing raw data to exporting a table of grouped features--was about 2 days. I set the number of cores available for parallel processing to 20.
Code: [Select]
register(bpstart(MulticoreParam(20))) 

In the Linux command line, before starting the run, I limited the amount of available RAM to 300 GB in an attempt to prevent R from crashing the server.
Code: [Select]
ulimit -m 300000000   # value is in kB, i.e. roughly 300 GB

Interestingly, I noticed as the script progressed that more and more RAM was being "used" by R (according to htop). In fact, the amount approached 400 GB (i.e. over my 300 GB cap). This amount was over an order of magnitude greater than the total object size in R's memory. Based on my research, it seems most likely that R was simply being "lazy" and not returning RAM to the OS. At any rate, this did not seem to affect the stability of the server.


---Questions---
I ran a PCA on my set of markers from 1948 injections (I re-ran XCMS on a subset of the aforementioned 2500+ injections). The biggest trend I'm seeing is a batch effect. See attached scores plot. Note that PC1 accounts for over 73% of total variability.

I inspected one ion from the same peak across all of my 200+ QC samples; the m/z range is about 40 ppm for the vast majority of samples. The m/z values for a small minority of samples are way off, due to some LockSpray issues. The retention time range for this peak is 0.06 min for all but the first batch.

My hunch is that the batch effect I'm seeing is largely due to the m/z drift. What do you think?

If so, how would I correct this? I imagine that the batch effect is due to a lack of correspondence caused by the m/z drift. Is there a way to increase the m/z tolerance range in the correspondence step? I'm seeing that there is a binSize param in the groupChromPeaks-density function: "numeric(1) defining the size of the overlapping slices in mz dimension". It's not obvious to me how to set this parameter. Is it in terms of Da, ppm, or what? How would I set it to allow for a 40ppm difference?

Here are the calls that I used for the correspondence. I did not define binSize; I think the default is 0.25.
Code: [Select]
pdp <- PeakDensityParam(sampleGroups = sample.groups, minFraction = (1/length(file.paths)), bw = 1)   # correspondence settings; all samples in one group
xdata <- groupChromPeaks(xdata, param = pdp)                                                          # group peaks into features across samples
xdata <- fillChromPeaks(xdata, param = FillChromPeaksParam(fixedRt = 15, fixedMz = 0.1))              # fill in missing peak data



Is it even possible to correct for these inter-batch differences by using XCMS? Or would I need to employ other software? I've done some literature research on the topic (see below), although I haven't attempted any of these yet.

Rusilowicz, M., Dickinson, M., Charlton, A., O’Keefe, S., & Wilson, J. (2016). A batch correction method for liquid chromatography–mass spectrometry data that does not depend on quality control samples. Metabolomics, 12(3), 1–11. https://doi.org/10.1007/s11306-016-0972-2

Wehrens, R., Hageman, J. A., van Eeuwijk, F., Kooke, R., Flood, P. J., Wijnker, E., … de Vos, R. C. H. (2016). Improved batch correction in untargeted MS-based metabolomics. Metabolomics, 12(5). https://doi.org/10.1007/s11306-016-1015-8

Wang, S. Y., Kuo, C. H., & Tseng, Y. J. (2012). Batch Normalizer: a fast total abundance regression calibration method to simultaneously adjust batch and injection order effects in liquid chromatography/time-of-flight mass spectrometry-based metabolomics data and comparison with current calibration methods. Analytical chemistry, 85(2), 1037-1046.

Brunius, C., Shi, L., & Landberg, R. (2016). Large-scale untargeted LC-MS metabolomics data correction using between-batch feature alignment and cluster-based within-batch signal intensity drift correction. Metabolomics, 12(11), 173.




[attachment deleted by admin]
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: Jan Stanstrup on March 08, 2019, 06:22:00 AM
You can set `binSize` in `PeakDensityParam`. It is specified in Daltons; you cannot set it in ppm. The default is 0.25 Da, so increasing it is unlikely to help you; you should probably set it much lower.
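For reference, a 40 ppm window at, say, m/z 500 corresponds to only about 0.02 Da:
Code: [Select]
mz  <- 500          # example m/z
ppm <- 40
mz * ppm / 1e6      # ~0.02 Da, far narrower than the 0.25 Da default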
With such a large dataset you are very likely to suffer from big intensity-drift issues that you'd need to fix. XCMS does not provide tools for that, but the papers you list do.

A few other observations:
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on March 08, 2019, 06:16:44 PM
Yes, I do get a lot of features (15,000+). It seems that the vast majority of the features have relatively high intensity and low variation. I've attached a plot of relative standard deviation (RSD) vs. mean for each feature. Could these be the false positives that you refer to?

I set bw=1 for groupChromPeaks based on the optimization procedure illustrated in Figure 12 (https://www.bioconductor.org/packages/devel/bioc/vignettes/xcms/inst/doc/xcms.html). I was using a subset of ~20 samples from across all my batches; these samples represented a wide range of retention times. I set bw this low in order to resolve some features that are not baseline separated. I was assuming that the Obiwarp step, which is upstream of the peak grouping step, would have corrected for shifts in retention time. Is this not a safe assumption?

I think the next diagnostic step will be to look at the peak density distribution per sample for all of my batch QC samples (as in figure 8, https://www.bioconductor.org/packages/devel/bioc/vignettes/xcms/inst/doc/xcms.html).

[attachment deleted by admin]
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on March 08, 2019, 09:40:50 PM
I re-ran XCMS peak detection on just the 205 batch QC injections. I used the same peak detection parameters as were used for the set of 1948 samples (as seen in my previous PCA scores plot). I made boxplots of log2 peak intensities for each of the QC samples. Out of curiosity, I also made box plots of the untransformed intensities. Three plots are attached (log2; untransformed; untransformed and zoomed in to show the IQRs).

Each batch is represented by a separate color. There do appear to be some inter-batch differences in intensity. Could these differences be enough to be causing the differences that I'm seeing in the PCA?

[attachment deleted by admin]
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: Jan Stanstrup on March 09, 2019, 03:56:59 AM
I did not know you had done Obiwarp. If it works well then it could be fine.

Looking at intensity distributions IMO won't tell you enough. As illustrated in the Brunius paper, features behave differently and some remain constant. I would look at a few compounds I know and see how they behave across your batches.

Looking at the loadings of your PCA might also give you a clue. You should be able to see whether your batch difference is dominated by a few features or whether all features are moving in one direction.
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on March 09, 2019, 01:55:19 PM
Here is the scores plot with loadings vectors. It looks like the overwhelming majority of features are moving in one direction. To me, this suggests a systematic drift in intensity between batches. What do you think?

Also, are there any R-based batch correction methods that you can recommend? It's taken me a while to get comfortable with R, and I'd prefer not to have to learn any other languages.

[attachment deleted by admin]
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: Jan Stanstrup on March 09, 2019, 02:06:28 PM
Yes, the loadings do suggest that. But since you don't see that in your boxplots, I think the pattern is too complex to appreciate there. So I think you need to look at some individual compounds. The correction methods always work on each feature individually for the same reason.
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on March 09, 2019, 09:00:20 PM
I ran PCA on just the batch QC samples. I also plotted the intensity of several randomly selected markers across all my QC samples. I've attached a few of the plots.

There does appear to be a pretty consistent increasing trend over the batches (with the exception of batch 20180728). This seems suspicious. I'd expect the intensities to decrease or increase over time depending on when I cleaned the sample cone.

The trend seen in the individual markers does seem consistent with the trend seen in PC1, though.

Also, I'm wondering what the lower intensity points are. Perhaps they were generated during the peak filling step. I'm not sure. If that's the case, maybe my bw=1 in PeakDensityParam is--as you suggested--too low?

These are not known compounds, as you had suggested looking at; they are just randomly selected markers. My reasoning was that all markers contribute to the PCA. However, it does seem wise to quantify a few peaks using QuanLynx and see whether the trend matches what I'm seeing in the markers generated by XCMS.

Btw, here's the code I used to generate the plots of individual markers:
Code: [Select]
library(dplyr) # for sorting the data frame

dfr <- arrange(dfr, batch) # sort the data frame by batch (batch is a factor)

m <- ncol(dfr) # number of columns in the data frame

# random feature column; columns 16:m hold the feature intensities
rand <- function() sample(16:m, 1)

plot.it <- function() { # helper to plot a randomly chosen marker

  n <- rand()

  plot(dfr[, n]
    , col = dfr$batch
    , main = paste0("marker #", n - 15, ", batch QC samples only")
    , pch = 20
    , ylab = "peak area"
    , xlab = "sample number"
  ) # end of plot call

  legend("topleft"
    , legend = levels(dfr$batch)
    , col = 1:length(levels(dfr$batch))
    , cex = 0.8
    , pch = 20
    , title = "Batch"
  ) # end of legend call
} # end of plot.it definition

plot.it() # call this to plot a new random marker


[attachment deleted by admin]
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: CoreyG on March 12, 2019, 06:44:25 AM
Glad to see things moving along, Metabolon1.

Regarding batch correction, take a look at a recent topic that came up on the forums: Feature intensity drift correction suggestions (http://www.metabolomics-forum.com/index.php?topic=1281.0). In addition, check out the Metabolomics Society Wiki page on batch effects for some other packages: Batch Effect (http://wiki.metabolomicssociety.org/index.php/Batch_Effect).
There are also a few RandomForest based methods floating around. I'll grab the name of the package tomorrow.

Regarding the latest plots you put up, they looked remarkably similar to plots I made when I "exported" my data with featureValues but didn't specify 'value = "into"'. If you don't specify this, featureValues just returns the row number in chromPeaks for that peak.

Lastly, I too had to set 'binSize' (in PeakDensityParam) quite a bit lower as I observed the feature m/z range was a bit too large given the ppm error that I would expect.

Cheers,
Corey
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: CoreyG on March 13, 2019, 12:10:42 AM
Random Forest batch correction (Quality Control – Random Forest Signal Correction; QC-RFSC) is incorporated into statTarget and is available in R (https://bioconductor.org/packages/release/bioc/html/statTarget.html).

Alternatively, there was a recent paper by Fiehn's lab on Systematical Error Removal using Random Forest (SERRF). Some details of the method can be found at https://slfan2013.github.io/SERRF-online/ (https://slfan2013.github.io/SERRF-online/). Unfortunately, they have set it up as an online portal, where you have to upload your data to their server. This could constitute a breach of ethics for data sharing/storage, so be sure you have permission to do so if you use it.
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on April 18, 2019, 05:04:54 PM
Hello folks,

Here's an update from my end.

I returned from vacation to find CoreyG's helpful responses. It turns out that I was not using "value='into'". I changed this param, and now my data look much better.

I've been using the Brunius batchCorr package, because I already know how to use R. However, given the characteristics of my dataset, I wonder if it is adequate.

Characteristics:
-- ~1950 files representing ~570 plant extracts (triplicate injection) + QC samples
-- 13 batches
-- All extracts are from the same species
-- The QC sample is an extract of about 40 accessions pooled together. However, it looks quantitatively different than most of the extracts in the sample set: the later eluting peaks of the QC sample are generally bigger while the early peaks are smaller. I don't think there are many qualitative differences between QC and other samples. However, I can imagine that these might translate into presence/absence differences in the peak table for minor compounds.
-- The extracts--other than QC--are not standardized by concentration or by equivalent weight of plant material. There is a range of weight of plant material that was extracted. Nonetheless, I do have for each sample the weight of plant material extracted and the weight of solvent used for extraction. From these values, I have generated a sample:solvent correction factor.
-- This is a pilot dataset and was not intended for publication.

My thinking is, now that the batch correction has been done, the next step is to apply the sample:solvent correction factor. The simplest thing to do would be, for each feature in a sample, divide the peak area value by the correction factor for that sample. However, I realize that detector response may not be linear in the range of interest for each feature; thus, the results may not be completely accurate. Nonetheless, I can't think of a better option. Any feedback on my approach?
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on April 18, 2019, 06:10:16 PM
On another matter, I’m wondering how to cut down on the number of features while still maintaining a low intensity threshold. Currently I have ~13,000 features. My goal is to be able to get my peak table into EZinfo, which is not able to handle my 1950x13,000 peak table. I am interested in minor compounds, so I don’t just want to filter by intensity. I have a few ideas, and I would love it if anyone could offer feedback.

One idea is to set strict filters during correspondence to cut down on the number of “false” features. I tried re-running XCMS on this same dataset but using different params. In particular, for correspondence, I changed the way sample groups are defined. Previously, all samples were in the same group. This time, I defined each sample as its own group (so ~600 groups). For PeakDensityParam, I set minFraction=2/3 & minSamples=2. My thinking was that a true feature would be present in all 3 injections of a sample, but I set the cutoff to 2 out of 3 to be on the safe side. In this way, I hoped to eliminate false features. At any rate, the correspondence step took much longer than before, and I ran out of memory before the script was completed. I tried a couple times with the same result.

Another idea is to filter out less informative markers based on variation of each feature across my sample set. My idea is to calculate the coefficient of variation for each marker across the entire dataset, and then exclude any markers below a certain CV value. I understand that p value and fold-change are often used for this kind of filtering, but as I understand it, these only make sense if the dataset contains multiple experimental groups. I don’t have any groups in my dataset; this is just an exploratory analysis. Does anyone have knowledge of or experience with filtering in this way? Any papers that you can suggest? How to determine an appropriate cutoff value for CV?

Thanks!
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: CoreyG on April 21, 2019, 08:43:10 PM
Quote from: metabolon1 on April 18, 2019, 05:04:54 PM
I returned from vacation to find CoreyG's helpful responses. It turns out that I was not using "value='into'". I changed this param, and now my data look much better.
Glad to hear I could be of help.

Quote from: metabolon1 on April 18, 2019, 05:04:54 PM
I've been using the Brunius batchCorr package, because I already know how to use R. However, given the characteristics of my dataset, I wonder if it is adequate.

Characteristics:
-- ~1950 files representing ~570 plant extracts (triplicate injection) + QC samples
-- 13 batches
-- All extracts are from the same species
-- The QC sample is an extract of about 40 accessions pooled together. However, it looks quantitatively different than most of the extracts in the sample set: the later eluting peaks of the QC sample are generally bigger while the early peaks are smaller. I don't think there are many qualitative differences between QC and other samples. However, I can imagine that these might translate into presence/absence differences in the peak table for minor compounds.
The differences between QCs and samples shouldn't be that big of a deal.
Depending on what batch correction method you use, you can assess the improvement in CV (RSD) of the QC features to determine how useful the batch correction was. Now, if the batch correction method optimized itself based on minimizing QC variation, then this approach is biased. Cross-validation should then be used to assess performance.

A simple visualization is to plot the pre-corrected CVs on the x-axis and the post-corrected CVs on the y-axis. Points that fall below the diagonal were improved; points on the diagonal weren't affected; points above the diagonal were negatively affected.
This may be an easy way to get a 'gut' feel for what method works best for you.
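Something like this, assuming you already have per-feature CVs from the QC samples before and after correction (the variable names are made up):
Code: [Select]
## cv_pre, cv_post: per-feature CVs of the QC samples before/after correction
plot(cv_pre, cv_post,
     xlab = "CV before correction", ylab = "CV after correction",
     pch = 20, col = rgb(0, 0, 0, 0.3))
abline(0, 1, col = "red")   # points below the line were improved by the correction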

Quote from: metabolon1 on April 18, 2019, 05:04:54 PM
-- The extracts--other than QC--are not standardized by concentration or by equivalent weight of plant material. There is a range of weight of plant material that was extracted. Nonetheless, I do have for each sample the weight of plant material extracted and the weight of solvent used for extraction. From these values, I have generated a sample:solvent correction factor.
-- This is a pilot dataset and was not intended for publication.

My thinking is, now that the batch correction has been done, the next step is to apply the sample:solvent correction factor. The simplest thing to do would be, for each feature in a sample, divide the peak area value by the correction factor for that sample. However, I realize that detector response may not be linear in the range of interest for each feature; thus, the results may not be completely accurate. Nonetheless, I can't think of a better option. Any feedback on my approach?

This is a fairly common approach. Of course, you should always try to keep the sample:solvent ratio as consistent across all samples as possible. Remember that different sample:solvent ratios will cause variability in extraction efficiency, ionization and detector response.

If you are concerned about introducing associations into your data, consider using a linear model to remove the correction factor.
Get the residuals from lm(peakArea~correctionFactor). This allows the detector response to not be 1:1, but doesn't do much for non-linearity.
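A rough sketch of that idea, assuming the peak areas are in a matrix X (rows = samples, columns = features) and cf holds one correction factor per sample (both names are placeholders):
Code: [Select]
## Remove the sample:solvent correction factor feature by feature via a linear model
adjusted <- apply(X, 2, function(y) {
  fit <- lm(y ~ cf, na.action = na.exclude)
  residuals(fit) + mean(y, na.rm = TRUE)   # residuals, shifted back to the original scale
})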
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: CoreyG on April 22, 2019, 04:27:03 AM
Quote from: metabolon1 on April 18, 2019, 06:10:16 PM
On another matter, I’m wondering how to cut down on the number of features while still maintaining a low intensity threshold. Currently I have ~13,000 features. My goal is to be able to get my peak table into EZinfo, which is not able to handle my 1950x13,000 peak table. I am interested in minor compounds, so I don’t just want to filter by intensity. I have a few ideas, and I would love it if anyone could offer feedback.
Depending on what you are trying to do, you could take the median peak area from each set of triplicates. That will cut your sample rows to roughly a third.

Alternatively, you could filter the dataset to remove isotopes. Depending on the average number of carbons in your metabolites and the signal/abundance, you might be able to reduce the dimensions 2-4 fold. The same can be done by removing adducts.
I've briefly played around with CAMERA for this, but ended up using mz.unity (https://github.com/nathaniel-mahieu/mz.unity).
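In case it helps, a minimal CAMERA sketch along those lines (parameters are just examples; if you start from a newer XCMSnExp object you may first need to coerce it to an xcmsSet):
Code: [Select]
library(CAMERA)

xset <- as(xdata, "xcmsSet")        # coerce from an XCMSnExp object if needed
xsa  <- xsAnnotate(xset)
xsa  <- groupFWHM(xsa)              # group co-eluting peaks
xsa  <- findIsotopes(xsa)           # flag isotope peaks
xsa  <- groupCorr(xsa)              # refine groups using peak-shape correlation
xsa  <- findAdducts(xsa, polarity = "positive")   # example polarity
pl   <- getPeaklist(xsa)            # peak table with isotope/adduct/pcgroup annotations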

As a less useful suggestion, is it possible to use R for your data analysis?
For a lot of multivariate analysis, mixOmics (http://mixomics.org/) does pretty well. The website has a lot of examples and the inbuilt plotting functions have come a long way.

Quote from: metabolon1 on April 18, 2019, 06:10:16 PM
One idea is to set strict filters during correspondence to cut down on the number of “false” features. I tried re-running XCMS on this same dataset but using different params. In particular, for correspondence, I changed the way sample groups are defined. Previously, all samples were in the same group. This time, I defined each sample as its own group (so ~600 groups). For PeakDensityParam, I set minFraction=2/3 & minSamples=2. My thinking was that a true feature would be present in all 3 injections of a sample, but I set the cutoff to 2 out of 3 to be on the safe side. In this way, I hoped to eliminate false features. At any rate, the correspondence step took much longer than before, and I ran out of memory before the script was completed. I tried a couple times with the same result.
My thoughts on this differ from many in the 'untargeted' scene: I'm really only interested in features that are present in nearly all samples (<10% missing values). So I always ask whether people expect to see features entirely missing from certain samples/groups.

The nice thing about XCMS is that you can set these parameters fairly loose early in the workflow. Then after fillChromPeaks, you can be more stringent.
With so many samples, I would imagine that seeing the same feature in multiple groups is an almost certainty. So maybe put every sample in 1 group, but set minFraction=10/600 (or something of that sort).
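i.e. something along these lines (the 10/600 is only an example):
Code: [Select]
pdp <- PeakDensityParam(sampleGroups = rep(1, length(fileNames(xdata))),
                        minFraction = 10/600, bw = 1)
xdata <- groupChromPeaks(xdata, param = pdp)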

I'd love to hear other people's thoughts on this, as well.

Quote from: metabolon1 on April 18, 2019, 06:10:16 PM
Another idea is to filter out less informative markers based on variation of each feature across my sample set. My idea is to calculate the coefficient of variation for each marker across the entire dataset, and then exclude any markers below a certain CV value. I understand that p value and fold-change are often used for this kind of filtering, but as I understand it, these only make sense if the dataset contains multiple experimental groups. I don’t have any groups in my dataset; this is just an exploratory analysis. Does anyone have knowledge of or experience with filtering in this way? Any papers that you can suggest? How to determine an appropriate cutoff value for CV?
This is certainly a way you could go.
Perhaps there is a way to empirically determine a good CV cutoff?
If CV is mostly related to biological grouping, then you could determine the difference in CV when all injections are used compared to when you have averaged the triplicates. Determine the threshold CV by permuting the biological grouping and repeating the process (you will end up averaging non-triplicates randomly). Whatever the 95th percentile is, that is your critical value.
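A small sketch of the CV calculation and filter itself, assuming a matrix X of peak areas (rows = samples, columns = features); the permutation idea above would wrap around this to pick the cutoff:
Code: [Select]
## coefficient of variation of each feature across the whole dataset
cv <- apply(X, 2, function(y) sd(y, na.rm = TRUE) / mean(y, na.rm = TRUE))

cutoff <- 0.3                    # placeholder; e.g. the permutation-derived 95th percentile
X_filtered <- X[, cv >= cutoff]  # keep the more variable (potentially informative) features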
Title: Re: Peak alignment with large dataset (over 2500 samples and growing)
Post by: metabolon1 on June 08, 2019, 09:55:45 PM
Hello folks,

Some updates and a few more questions.

I have been using the 'IPO' package to determine my XCMS parameters. The resulting peak table contains over 50,000 features. I'm ok with this for now. Has anyone else used IPO? Thoughts?

I successfully ran CAMERA (thanks CoreyG for the suggestions) with the goal of finding redundant features (e.g. adducts/isotopes). It identified ~7,000 pcgroups among the ~50,000 features. Now, if I am understanding correctly, I need to select one feature from each pcgroup to serve as the "representative" for that compound. Any recommendations on how to make this selection?

Many thanks!