Peak alignment with large dataset (over 2500 samples and growing)

December 08, 2018, 10:07:34 PM

Dear all,

I am trying to process over 2500 files from a UPLC-QTOF-MS dataset. The goal is to eventually increase this number to 10,000 and beyond. Currently I am using MZmine 2. I am fortunate to have access to a big server (80 core, 350+ GB RAM). However, it seems that the peak alignment step is not optimized for this number of samples. See my other post for more details about this issue: https://github.com/mzmine/mzmine2/issues/518

Any ideas on a more efficient peak alignment method? As far as I can tell, the raw data are already pretty well aligned; the UPLC seems to be fairly consistent. My main objective right now is to get all the samples/peaks into a single matrix.

I am actively trying different approaches, but it all takes time. I am hoping that someone else who has trod this ground before can offer advice to help save time and effort.

Many thanks!

Reply #1 – December 09, 2018, 05:32:36 AM

MZmine is known to use a lot of memory. I imagine that is where your bottleneck is, but you should check that.

XCMS is much more memory efficient. Be aware that each core will use a certain amount of memory, so on a system like yours, not using all cores will use less memory and might save you if memory is your bottleneck. Also, don't use 80 cores on processes that are bottlenecked by HDD reads (like reading the raw data).

That said, with 10,000 samples you really need to be careful about how greedy you are, i.e. how low in intensity you want to pick peaks.

Reply #2 – December 09, 2018, 02:29:04 PM

Hi, as a workaround I usually split big data sets into subsets (~250 runs) and process them independently using XCMS. Then I fit the retention time shift with a linear or non-linear (rsc, svr, ...) model using 'known' metabolites to match variables across peak tables. As said, it's just a workaround, but you can process each subset in parallel and greatly reduce the computing time and memory needed.
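
To make the retention time fit concrete, here is a minimal sketch in Python (standard library only) of the linear case. It assumes you already have a few 'known' metabolites with retention times measured both in a reference subset and in the subset to be corrected. The function names and the anchor values are hypothetical, and a non-linear fit would need a different model.

```python
from statistics import mean


def fit_linear_rt_shift(rt_subset, rt_reference):
    """Ordinary least-squares line mapping subset RTs onto reference RTs.

    Both arguments are equal-length sequences of retention times (seconds)
    for the same anchor metabolites. Returns (slope, intercept).
    """
    if len(rt_subset) != len(rt_reference) or len(rt_subset) < 2:
        raise ValueError("need at least two paired anchor retention times")
    mx, my = mean(rt_subset), mean(rt_reference)
    sxx = sum((x - mx) ** 2 for x in rt_subset)
    if sxx == 0:
        raise ValueError("anchor retention times must not all be identical")
    sxy = sum((x - mx) * (y - my) for x, y in zip(rt_subset, rt_reference))
    slope = sxy / sxx
    return slope, my - slope * mx


def correct_rt(rt, slope, intercept):
    """Map a retention time from the subset onto the reference time axis."""
    return slope * rt + intercept


# Hypothetical anchor retention times (seconds) for four known metabolites.
anchors_subset = [62.0, 181.5, 305.0, 422.5]
anchors_reference = [60.0, 180.0, 304.0, 422.0]

slope, intercept = fit_linear_rt_shift(anchors_subset, anchors_reference)
corrected = [correct_rt(rt, slope, intercept) for rt in anchors_subset]
```

The corrected retention times of every peak in the subset can then be compared against the reference peak table with a fixed RT tolerance.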
g
P.S. If the raw data are already pretty well aligned, peak tables can be aligned using m/z and RT tolerances and a 'master-slave' approach in MATLAB/R/Python/etc.
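
A rough Python sketch of that tolerance-based matching, assuming each peak table is simply a list of (m/z, RT) pairs and that one table has been chosen as the master. The function name, the tolerance defaults and the example values are all hypothetical, and the nearest-match scoring is just one reasonable choice.

```python
def match_to_master(master, peaks, mz_tol=0.01, rt_tol=10.0):
    """Assign each peak to the closest master feature within both tolerances.

    master, peaks: lists of (mz, rt) tuples; mz_tol in Da, rt_tol in seconds.
    Returns a list with one entry per peak: the index of the matched master
    feature, or None if no master feature lies within both tolerances.
    """
    assignments = []
    for mz, rt in peaks:
        best, best_score = None, None
        for i, (m_mz, m_rt) in enumerate(master):
            d_mz, d_rt = abs(mz - m_mz), abs(rt - m_rt)
            if d_mz <= mz_tol and d_rt <= rt_tol:
                # Combine both distances, each scaled by its tolerance.
                score = d_mz / mz_tol + d_rt / rt_tol
                if best_score is None or score < best_score:
                    best, best_score = i, score
        assignments.append(best)
    return assignments


# Hypothetical master table and one sample's peak table.
master = [(180.0634, 61.0), (255.2330, 305.2)]
peaks = [(180.0630, 62.5), (255.2400, 304.0), (300.1000, 120.0)]

print(match_to_master(master, peaks))  # [0, 1, None]
```

This brute-force loop compares every peak with every master feature, so for large tables you would want to sort by m/z first. It also does not resolve the case where two peaks from the same sample match the same master feature.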

Reply #3 – December 12, 2018, 06:27:14 AM

Just chiming in with some explanations of how xcms now works with large projects. We now use an object from the MSnbase package to represent the raw data that only reads the raw (spectrum) data when required. That way memory use is minimized. Peak detection is then performed on a per-file basis, i.e. reading the full data from one file, performing the peak detection on that, and then removing the spectrum data from memory again. As Jan mentioned, you should be careful not to have too many parallel processes running, as I/O will be the bottleneck, not the number of CPUs. On our cluster I use no more than 24 CPUs in parallel (although we have 292) because otherwise I run into serious trouble with the I/O (most likely because our disk setup is ... suboptimal).

Just have a look at a recent xcms vignette (R 3.5.1, Bioconductor 3.

how to perform the analysis. xcms uses this "onDisk" mode by default.

cheers, jo

Reply #4 – December 18, 2018, 12:35:24 AM

The "onDisk" mode of xcms has allowed us to process ~1,000 samples comfortably on a desktop machine, although it does take some time. Retention time alignment and correspondence happen quite fast and haven't given us any trouble at all.
The only problem we've had is with fillChromPeaks, which we need to run single-threaded due to memory constraints.

Reply #5 – December 19, 2018, 07:27:27 PM

Hi, I'm facing the same issue and my dataset is even bigger. My question: suppose you have extracted the peaks and obtained a very good alignment across all of dataset A, containing 10,000 samples, say with xcms, and then another dataset B arrives containing 10,000 more samples. How can you align dataset B to A? Should I combine A and B and pick peaks across all 20,000 samples? Thanks very much.

Reply #6 – December 20, 2018, 03:20:41 AM

Asked here too with @johannes.rainer giving an answer: https://github.com/sneumann/xcms/issues/344

Reply #7 – January 10, 2019, 03:04:23 PM

Thanks for all of your responses. I got sidetracked by some other projects, but I should be back on this one soon. Best wishes to you all in 2019!

Reply #8 – January 28, 2019, 01:00:36 PM

Best wishes as well! And let us know if you figured it out!

Reply #9 – February 05, 2019, 06:58:50 PM

I'm back on this project now. I'm getting ready to test out a script on the big server mentioned earlier (Ubuntu 18.04, 80 cores, 350+ GB RAM). Can anyone offer tips on how to set parameters for parallel processing? I just downloaded the BiocParallel package, and I'm trying to make sense of it all.

In an effort to limit RAM to 300GB, I am planning to use the command below. Any thoughts on this approach?
ulimit -m 300000000

Before, when I was running MZmine, the server kept crashing when the heap size was set to auto. When I lowered the heap size to 300GB, it stopped crashing. Now, with XCMS, I'm planning to use the ulimit command instead.
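
A note on the ulimit approach: according to the Linux setrlimit man page, the resident-set limit that `ulimit -m` sets is not enforced on current kernels, so the address-space limit (`ulimit -v`, also given in kilobytes) is the one usually used instead. Below is a minimal Python sketch of the same idea using the standard `resource` module (Unix only). The helper names are hypothetical, and the 300 figure is just the budget mentioned above.

```python
import resource


def gib_to_bytes(gib):
    """Convert GiB to bytes (1 GiB = 1024**3 bytes)."""
    return int(gib * 1024 ** 3)


def cap_address_space(limit_bytes):
    """Lower the soft address-space limit of the current process.

    Child processes inherit the limit. The requested value is clamped to the
    existing hard limit, because a soft limit may not exceed it.
    Returns the limit that was actually applied, in bytes.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        limit_bytes = min(limit_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
    return limit_bytes


# Hypothetical usage: start an R script as a child process under the cap.
# import subprocess
# subprocess.run(
#     ["Rscript", "process.R"],  # script name is a placeholder
#     preexec_fn=lambda: cap_address_space(gib_to_bytes(300)),
# )
```

Note that a hard cap makes allocations fail once the limit is reached, so the R session would stop with an allocation error rather than take down the server.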

Many thanks!

Reply #10 – February 05, 2019, 07:17:45 PM

Looking through the LC_MS Preprocessing vignette for XCMS, it seems like this simple call could do the trick...

register(bpstart(MulticoreParam()))

If I understand correctly, this would set the number of available cores to 78. Would it be better to use a smaller number of cores? johannes.rainer had cautioned against using too many cores. Would setting the limit on RAM (as mentioned in my previous post) prevent the "serious troubles" that johannes.rainer mentioned?

Any other thoughts?

Many, many thanks!

Reply #11 – February 06, 2019, 05:27:17 PM

Take my comments with a grain of salt, because I only ever work with Windows desktop machines.

I would imagine it comes down to the capabilities of the machine (I/O performance etc.).
There is a batch of samples running on a desktop right next to me, using 6 threads and 20 GB of RAM (including Windows overhead). The hard drives are barely being touched; just the occasional access. So it would hardly be a problem here. But you could imagine that if 78 cores are all reading from and writing to the hard drives, that would be a significant bottleneck.

Given what Johannes said, I would limit the core count. The worst thing you could do is limit the available RAM so that it has to keep paging memory to the hard drives (I'm assuming servers would have to do this as well?).

Good luck!

Reply #12 – February 06, 2019, 06:27:03 PM

Thanks CoreyG. I am proposing to limit RAM because of my previous experience with MZmine 2; it crashed the server multiple times when it was given access to all the RAM. Would this not be an issue with R/xcms?

As to the issue of I/O bottlenecking, some more questions.

Jan mentioned NOT to use all the cores for reading from raw data files. Are there any steps in the data processing for which having 78 cores available would be a good idea? Would there be any advantage to using all cores after the onDisk object has been created? I'm envisioning a script where I change the number of available cores at different steps in the data processing...

If this is not a good option, then can anyone suggest guidelines for determining the number of cores to use?

Dankeschön, mange tak, 谢谢, and thank you! (hopefully I didn't miss anyone)

Reply #13 – February 06, 2019, 10:02:22 PM

Hopefully someone with more experience will chime in when they get the chance.

In my experience, limiting the number of available cores will reduce how much total RAM is used/allocated, i.e. RAM usage should be roughly linear in the number of parallel processes. On our lower-spec desktops, we use 4 threads to keep the RAM usage below 16 GB.

Based on my understanding, readMSData generates the "onDisk" object without having to load all the raw data into RAM. findChromPeaks will cause each parallel process to load a raw data file from the hard drive into memory, perform the peak detection, then clear the unneeded data from memory. So this step will be memory and I/O intensive.
After that, retention time alignment may or may not be similarly demanding. Obiwarp works on the raw data, so it will need to be loaded into RAM again. Peakgroups, on the other hand, doesn't require the raw data.
The last operation I usually perform is fillChromPeaks. This, again, requires the raw data to be loaded. In my hands, this seems to be the most memory-intensive step, requiring me to run it single-threaded even with 32 GB of system RAM.

You certainly could get away with changing the number of available cores at different steps. But you might need to experiment to determine what works best for your system. In our case, we ran scripts with different number of threads and monitored the systems - reducing the number until it was stable.
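
That trial-and-error procedure could be sketched like this in Python. Here `run_step` is a hypothetical callable standing in for one processing step run with a given number of workers (for example, a wrapper that launches an R script), and it is assumed to raise MemoryError when the run fails for lack of memory. Halving the worker count on each failure is also just an assumption; stepping down by one works the same way.

```python
def find_stable_workers(run_step, start_workers, min_workers=1):
    """Retry a step with fewer workers until it completes.

    run_step: callable taking the worker count and returning a result,
              raising MemoryError if the run fails for lack of memory.
    Returns (workers, result) for the first worker count that succeeds.
    """
    workers = start_workers
    while workers >= min_workers:
        try:
            return workers, run_step(workers)
        except MemoryError:
            workers //= 2  # halve the worker count and try again
    raise RuntimeError("step failed even with the minimum number of workers")


def fake_step(workers):
    """Stand-in for a real step: pretends more than 4 workers exhaust RAM."""
    if workers > 4:
        raise MemoryError
    return f"finished with {workers} workers"


workers, result = find_stable_workers(fake_step, start_workers=16)
print(workers, result)  # 4 finished with 4 workers
```

Running the search separately for each step (peak detection, alignment, filling) would give a per-step worker count, which fits the idea of changing the number of cores between steps.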

Reply #14 – February 07, 2019, 01:06:12 AM

Excellent summary @CoreyG ! In the end the 78 cores don't help unless you have enough memory to fit the full data from 78 mzML files at once into memory. Peak detection and filling in missing peak data both need to load the full data of a file (all spectra). You could load one of your mzML files into memory and then calculate the size of that using the object_size function from the pryr package:

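library(MSnbase)
# Load one file fully into memory (MS1 only; "inMemory" mode defaults to MS2)
obj <- readMSData(filename, msLevel. = 1, mode = "inMemory")
pryr::object_size(obj)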

You could thus estimate how many processes you can run in parallel. On top of that, however, you will need some more memory to fit the chromPeaks matrix and, later, the featureDefinitions data frame, and R will need some more space to copy stuff in the background.
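
As a back-of-the-envelope version of that estimate, here is a small Python sketch. All parameter names and numbers are hypothetical: `per_file_gb` would come from the object size measured above, `reserve_gb` stands for the memory set aside for the chromPeaks matrix and featureDefinitions, `overhead_factor` is a guessed multiplier for the copies R makes in the background, and `max_workers` is whatever cap you choose for I/O reasons.

```python
def estimate_workers(ram_budget_gb, per_file_gb, reserve_gb=0.0,
                     overhead_factor=1.0, max_workers=None):
    """Estimate how many files can be processed in parallel within a RAM budget.

    ram_budget_gb:   total memory you are willing to use
    per_file_gb:     in-memory size of one fully loaded raw file
    reserve_gb:      memory kept free for result tables and the main session
    overhead_factor: multiplier on per_file_gb for temporary copies
    max_workers:     optional upper cap (e.g. chosen to avoid I/O contention)
    Always returns at least 1.
    """
    if per_file_gb <= 0 or overhead_factor <= 0:
        raise ValueError("per_file_gb and overhead_factor must be positive")
    usable_gb = ram_budget_gb - reserve_gb
    n = int(usable_gb // (per_file_gb * overhead_factor))
    if max_workers is not None:
        n = min(n, max_workers)
    return max(n, 1)


# Hypothetical numbers: 300 GB budget, 50 GB reserved, 2 GB per loaded file,
# a 2x allowance for copies, and a cap of 24 workers for I/O reasons.
print(estimate_workers(300, 2.0, reserve_gb=50, overhead_factor=2.0,
                       max_workers=24))  # 24

# With 10 GB per loaded file, memory becomes the limit instead of the cap.
print(estimate_workers(300, 10.0, reserve_gb=50, overhead_factor=2.0,
                       max_workers=24))  # 12
```

The result is only a starting point. Actual memory use should still be monitored during a test run, since the per-file size varies between samples.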