
Topic: Peak alignment with large dataset (over 2500 samples and growing)

  • metabolon1
Peak alignment with large dataset (over 2500 samples and growing)
Dear all,

I am trying to process over 2500 files from a UPLC-QTOF-MS dataset. The goal is to eventually increase this number to 10,000 and beyond. Currently I am using MZmine 2. I am fortunate to have access to a big server (80 core, 350+ GB RAM). However, it seems that the peak alignment step is not optimized for this number of samples. See my other post for more details about this issue: https://github.com/mzmine/mzmine2/issues/518

Any ideas on a more efficient peak alignment method? As far as I can tell, the raw data are already pretty well aligned; the UPLC seems to be fairly consistent. My main objective right now is to get all the samples/peaks into a single matrix.

I am actively trying different approaches, but it all takes time. I am hoping that someone else who has trod this ground before can offer advice to help save time and effort.

Many thanks!

  • Jan Stanstrup
  • Administrator
Re: Peak alignment with large dataset (over 2500 samples and growing)
Reply #1
MZmine is known to use a lot of memory. I imagine that is where your bottleneck is, but you should check that.

XCMS is much more memory efficient. Be aware that each core will use a certain amount of memory, so on a system like yours, not using all cores will use less memory and might save you if memory is your bottleneck. Also, don't use 80 cores on processes that are bottlenecked by disk reads (like reading the raw data).

That said, with 10,000 samples you really need to think about how greedy you want to be in terms of how low in intensity you pick peaks.
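Something along these lines would cap the number of workers and raise the centWave intensity cutoffs in xcms; the worker count and all thresholds below are placeholders to tune for your own data, not recommendations:

# Sketch: limit parallel workers (each worker holds one file in memory at a time)
# and be less greedy about low-intensity peaks. All numbers are placeholders.
library(xcms)
library(BiocParallel)
register(SnowParam(workers = 16))   # use only a fraction of the 80 cores

# 'noise' drops centroids below that intensity before ROI detection;
# 'prefilter' requires at least 3 consecutive scans with intensity >= 5000.
cwp <- CentWaveParam(ppm = 25, peakwidth = c(2, 20),
                     noise = 1000, prefilter = c(3, 5000))
# cwp is then passed to findChromPeaks() for peak detection.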
  • University of Copenhagen, Denmark
Blog: stanstrup.github.io

Re: Peak alignment with large dataset (over 2500 samples and growing)
Reply #2
Hi, as a workaround I usually split big data sets into subsets (~250 runs) and process them independently with XCMS. Then I use a linear or non-linear fit (rsc, svr, ...) of the retention time shift, based on 'known' metabolites, to match variables across the peak tables. As said, it's just a workaround, but you can process each subset in parallel and reduce (a lot) the computing time and memory needed.
g
PS: if the raw data are already pretty well aligned, the peak tables can be matched using m/z and RT tolerances and a 'master-slave' approach in MATLAB/R/Python/etc.
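A minimal sketch of that matching step in R; the column names (mz, rt) and the tolerances are only illustrative:

# Match each feature of a new peak table to a master table by m/z (ppm)
# and RT tolerance. Column names and tolerances are illustrative only.
match_to_master <- function(master, newtab, ppm = 10, rt_tol = 10) {
  sapply(seq_len(nrow(newtab)), function(i) {
    dmz  <- abs(master$mz - newtab$mz[i]) / newtab$mz[i] * 1e6
    drt  <- abs(master$rt - newtab$rt[i])
    hits <- which(dmz <= ppm & drt <= rt_tol)
    if (length(hits) == 0) return(NA_integer_)   # no match in the master table
    hits[which.min(dmz[hits])]                   # closest m/z if several hits
  })
}
# Returns, for each row of 'newtab', the matching row index in 'master' (or NA).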
  • Last Edit: December 09, 2018, 02:34:26 PM by Guillermo Quintas

Re: Peak alignment with large dataset (over 2500 samples and growing)
Reply #3
Just chiming in with some explanation of how xcms now handles large projects. We now use an object from the MSnbase package to represent the raw data, which only reads the raw (spectrum) data when required; that way memory use is minimized. Peak detection is then performed on a per-file basis, i.e. the full data from one file are read, peak detection is run on them, and the spectrum data are then removed from memory again. As Jan mentioned, you should be careful not to run too many parallel processes, as the I/O will be the bottleneck, not the number of CPUs. On our cluster I use no more than 24 CPUs in parallel (although we have 292), because otherwise I ran into serious trouble with the I/O (most likely because our disk setup is ... suboptimal).

Just have a look at a recent xcms vignette (R 3.5.1, Bioconductor 3.8) for how to perform the analysis; xcms uses this "onDisk" mode by default.
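In short, the skeleton of such a workflow looks roughly like this; 'files' and 'groups' as well as all parameter values are placeholders (see the vignette for the real details):

# Rough onDisk skeleton; file paths, sample groups and parameters are placeholders.
library(xcms)               # also attaches MSnbase
library(BiocParallel)
register(SnowParam(workers = 24))                  # keep the I/O load manageable

raw <- readMSData(files, mode = "onDisk")          # spectra stay on disk

cwp   <- CentWaveParam(ppm = 25, peakwidth = c(2, 20), noise = 1000)
xdata <- findChromPeaks(raw, param = cwp)          # per-file peak detection

xdata <- adjustRtime(xdata, param = ObiwarpParam())        # RT alignment
xdata <- groupChromPeaks(xdata,
                         param = PeakDensityParam(sampleGroups = groups))
xdata <- fillChromPeaks(xdata)                     # gap filling

int_mat <- featureValues(xdata, value = "into")    # feature x sample matrix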

cheers, jo
  • Eurac Research, Bolzano, Italy

  • CoreyG
Re: Peak alignment with large dataset (over 2500 samples and growing)
Reply #4
The "onDisk" mode of xcms has allowed us to process ~1,000 samples comfortably on a desktop machine - although it does take some time. Retention time alignment and correspondence happens quite fast and hasn't given us any trouble at all.
The only problem we've had is with fillChromPeaks, where we need to run it single threaded due to memory constraints.
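In case it helps anyone, forcing the gap filling to run serially is just a matter of passing a serial back-end (assuming a recent xcms version):

# Run gap filling single-threaded to keep memory use down.
library(BiocParallel)
xdata <- fillChromPeaks(xdata, BPPARAM = SerialParam())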
  • Baker Heart and Diabetes Institute

  • lzheng
Re: Peak alignment with large dataset (over 2500 samples and growing)
Reply #5
Hi, I'm facing the same issue and my dataset is even bigger. My question: after you have extracted the peaks and aligned them well across a dataset A of 10,000 samples, say with xcms, and then another dataset B of 10,000 more samples arrives, how can you align dataset B to A? Should I combine A and B and do peak picking across all 20,000 samples? Thanks very much.
  • Stanford University

  • Jan Stanstrup
  • Administrator
Re: Peak alignment with large dataset (over 2500 samples and growing)
Reply #6
  • Last Edit: January 11, 2019, 02:25:20 AM by Jan Stanstrup
  • University of Copenhagen, Denmark
Blog: stanstrup.github.io

  • metabolon1
Re: Peak alignment with large dataset (over 2500 samples and growing)
Reply #7
Thanks for all of your responses. I got sidetracked by some other projects, but I should be back on this one soon. Best wishes to you all in 2019!