Recent Posts
21
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Last post by metabolon1 -
Thanks all!!!

johannes.rainer, I tried your suggested code. It seems that my largest files are ~150MB. Using this number to make a conservative estimate, I would need about 12GB of RAM to hold 78 raw data files in memory, which is well below the 350+ GB on the server. But as you said, there are also other processes/objects that need RAM.
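As a quick back-of-envelope check (these numbers just restate the estimate above; the in-memory size of a file can be larger than its size on disk):

Code:
## rough estimate only: 78 files at ~150 MB each
per_file_gb <- 150 / 1024
n_files     <- 78
per_file_gb * n_files   # ~11.4 GB, i.e. roughly 12 GB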

CoreyG mentioned that 4 threads keep the RAM usage below 16 GB on a low-spec desktop. That is roughly 4 GB per core, which is much higher than the ~150 MB per file (and thus per core) estimated above. CoreyG also mentioned that fillChromPeaks is the most memory-intensive step in his workflow, requiring a limit of 1 thread even on a 32 GB system. I've also noticed that fillChromPeaks is the step that takes the longest.

It does seem like I'll need to do some optimization on our system. Any advice on how to determine how stable the system is with a given number of cores? I don't mind starting with a large # of available cores and working down, as CoreyG suggests. However, for various reasons, I do not want to test stability by whether or not it crashes the server. Do you think that using the 'ulimit' approach to limit RAM would help to prevent the server from crashing? Can an I/O bottleneck cause a crash?

Perhaps I could work through the script step by step, optimizing the number of cores at each step before moving on to the next...

22
XCMS / Re: Implementing custom retention time alignment algorithms
Last post by johannes.rainer -
Thanks for clarifying!

Regarding your concern: fillChromPeaks will always use the adjusted retention times if retention time adjustment has been performed, so the results should be the same with or without applyAdjustedRtime.

Regarding the error: I fixed that. You can install the updated version from github:

For the development version you are using:
Code:
devtools::install_github("sneumann/xcms")

For the Bioconductor 3.8 release (R-3.5.x):
Code:
devtools::install_github("sneumann/xcms", ref = "RELEASE_3_8")

23
XCMS / Re: Implementing custom retention time alignment algorithms
Last post by CoreyG -
Hi Johannes,
Just to be clear, I am using 'adjustedRtime' to apply the adjusted retention times (and not following it with 'applyAdjustedRtime').
Code:
adjustedRtime(xdata) <- scans

If I use applyAdjustedRtime after adjustedRtime, I do not get an error with fillChromPeaks. This is likely because hasAdjustedRtime returns FALSE, so the processHistory check never gets performed (methods-XCMSnExp.R#L651, I think).
My concern with using applyAdjustedRtime is that the data will be slightly warped, so the integration during fillChromPeaks will be slightly off. That is, unless it uses the adjusted retention times when loading the raw data again? In which case the whole thing is solved :)
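For reference, a minimal sketch of the two variants being compared here (assuming xdata is an XCMSnExp object with detected peaks and scans holds the custom adjusted retention times, as above):

Code:
## Variant A: set the adjusted retention times only; the raw times are kept
## alongside and hasAdjustedRtime(xdata) returns TRUE. This is the call that
## triggered the fillChromPeaks error discussed in this thread.
adjustedRtime(xdata) <- scans
xdata_a <- fillChromPeaks(xdata)

## Variant B: additionally replace the raw retention times with the adjusted
## ones; hasAdjustedRtime() then returns FALSE and fillChromPeaks runs
## without the error.
xdata_b <- fillChromPeaks(applyAdjustedRtime(xdata))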

Nonetheless, I am using 'R Under development (unstable) (2019-01-14 r75992)', 'xcms_3.5.1' and 'MSnbase_2.9.3'.
I compiled this version of xcms from the github page ("sneumann/xcms") to utilize the subset feature, so I'm not sure if that version number above is necessarily correct.

Thanks
24
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Last post by johannes.rainer -
Excellent summary @CoreyG! In the end, the 78 cores don't help unless you have enough memory to hold the full data from 78 mzML files at once. Peak detection and filling in missing peak data both need to load the full data of a file (all spectra). You could load one of your mzML files into memory and then calculate its size using the object_size function from the pryr package:

Code:
obj <- readMSData(filename, mode = "inMem")
pryr::object_size(obj)

You could thus estimate how many processes you can run in parallel. On top of that, however, you will also need some memory for the chromPeaks matrix and later the featureDefinitions data frame, and R will need some additional space to copy objects in the background.
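For example, a minimal sketch of such an estimate (the 350 GB total and the 0.5 safety factor are illustrative values, not recommendations):

Code:
library(MSnbase)
obj <- readMSData(filename, mode = "inMem")

per_file_gb  <- as.numeric(pryr::object_size(obj)) / 1024^3
total_ram_gb <- 350   # RAM available on the server
safety       <- 0.5   # headroom for chromPeaks, featureDefinitions, copies
floor(total_ram_gb * safety / per_file_gb)   # upper bound on parallel workers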

25
XCMS / Re: Implementing custom retention time alignment algorithms
Last post by johannes.rainer -
This looks like a nasty bug in xcms. I will look into it, as you should not get this error when doing fillChromPeaks. Also, applyAdjustedRtime is a valid approach, but it should not affect fillChromPeaks.

I have opened an issue at the xcms github repo (https://github.com/sneumann/xcms/issues/349)

Could you just let me know which versions of R and xcms you are using?
26
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Last post by CoreyG -
Hopefully someone with more experience will chime in when they get the chance.

In my experience, limiting the number of available cores will reduce how much total RAM is used/allocated, i.e. RAM usage should scale roughly linearly with the number of parallel processes. On our lower-spec desktops, we use 4 threads to keep RAM usage below 16 GB.

Based on my understanding, readMSData generates the "onDisk" object without having to load all the raw data into RAM. findChromPeaks will cause each parallel process to load a raw data file from the hard drive into memory, perform the peak detection, then clear the unneeded data from memory. So this step will be memory and I/O intensive.
After that, retention time alignment may or may not be similarly demanding. Obiwarp works on the raw data, so it will need to be loaded into RAM again. The peak-groups method, on the other hand, doesn't require the raw data.
The last operation I usually perform is fillChromPeaks. This, again, requires the raw data to be loaded. In my hands, this seems to be the most memory-intensive step, requiring me to run it single-threaded even with 32 GB of system RAM.

You certainly could get away with changing the number of available cores at different steps, but you might need to experiment to determine what works best for your system. In our case, we ran scripts with different numbers of threads and monitored the system, reducing the number until it was stable.
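For illustration, a minimal sketch of what such a per-step setup could look like (the worker counts, files and groups are placeholders, not recommendations):

Code:
library(xcms)
library(BiocParallel)

raw_data <- readMSData(files, mode = "onDisk")

## Peak detection reads the raw data per file: keep the worker count low
## enough that (number of workers) x (per-file memory) fits in RAM.
register(MulticoreParam(workers = 8))
xdata <- findChromPeaks(raw_data, param = CentWaveParam())

## Obiwarp alignment reads the raw data again; the peak-groups method does not.
xdata <- adjustRtime(xdata, param = ObiwarpParam())

## Correspondence works on the detected peaks only.
xdata <- groupChromPeaks(xdata, param = PeakDensityParam(sampleGroups = groups))

## Gap filling was the most memory-hungry step here, so run it serially.
register(SerialParam())
xdata <- fillChromPeaks(xdata)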
27
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Last post by metabolon1 -
Thanks CoreyG. I am proposing to limit RAM because of my previous experience with MZmine 2; it crashed the server multiple times when it was given access to all the RAM. Would this not be an issue with R/xcms?

As to the issue of I/O bottlenecking, some more questions.

Jan mentioned NOT to use all the cores for reading from raw data files. Are there any steps in the data processing for which having 78 cores available would be a good idea?  Would there be any advantage to using all cores after the onDisk object has been created? I'm envisioning a script where I change the number of available cores at different steps in the data processing...

If this is not a good option, then can anyone suggest guidelines for determining the number of cores to use?

Dankeschön, mange tak, 谢谢, and thank you! (hopefully I didn't miss anyone)

28
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Last post by CoreyG -
Take my comments with a grain of salt, because I only ever work with Windows desktop machines.

I would imagine it comes down to the capabilities of the machine (I/O performance, etc.).
There is a batch of samples running on a desktop right next to me, using 6 threads and 20 GB of RAM (including Windows overhead). The hard drives are barely being touched; just the occasional access. So it would hardly be a problem here. But you could imagine that if 78 cores were all reading/writing data to the hard drives, that would be a significant bottleneck.

Given what Johannes said, I would limit the core count. The worst thing you could do is limit the available RAM so that it has to keep paging memory to the hard drives (I'm assuming servers would have to do this as well?).

Good luck!
29
Announcements / EMN Travel Grant 2019: Award Scheme
Last post by Elena Legrand -
The Early-career Members Network of the Metabolomics Society Travel Grant 2019: Award Scheme

The Early-career Members Network of the Metabolomics Society Travel Grant aims to provide support to Early-Career scientists* to attend and actively participate in national and international events related to the metabolomics field that will promote their professional development.

Eligible applicants must be:
A member of the Metabolomics Society; and
Within 5 years of obtaining their highest degree.

Up to 4 travel awards for (i) graduate students (Master's or PhD students) and (ii) post-doctoral researchers will cover travel and accommodation costs and conference fees up to $500. Applicants must be presenting their own metabolomics research and be willing to promote the Metabolomics Society at the event through the use of logos and acknowledgements. Applications will be reviewed by the Early-career Members Network Committee following a pre-established scoring system. The application deadline is February 28th 2019, 12 pm UTC; all entries must be received before the stated deadline in order to be considered. To apply, please download the travel award application form. The application form must be accompanied by a short CV of the applicant (max. 2 pages), an abstract of the research project (max. 300 words), a letter of support from the supervisor and a letter of motivation (max. 1 page). By signing this form you agree to the terms and conditions. Send your application to info.emn@metabolomicssociety.org and include the following line in your email subject: “Application: EMN TRAVEL AWARD 2019”. All documents should be submitted in one single PDF file.

Winners will be notified by e-mail by May 2019 and announced on the Metabolomics Society website and in MetaboNews. Travel awards will be valid for 12 months from the date of announcement. Selected abstracts and conference reports will be published in MetaboNews and/or on the Metabolomics Society website.

Please visit the link to download the application form and for more information:
http://metabolomicssociety.org/members-center/early-career-members-network

*All members of the Metabolomics Society who are graduate students (Master's or PhD) or within 5 years post-PhD while in a training position (e.g. post-doc) are considered Early-Career scientists. In certain cases where someone is beyond the 5-year post-PhD limit, exceptions can be made if time off was taken for personal reasons (e.g. family, career break). Any person with 5 years of full-time work experience (or equivalent) after obtaining their PhD shall not be considered an Early-Career scientist.
30
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Last post by metabolon1 -
Looking through the LC-MS Preprocessing vignette for xcms, it seems like this simple call could do the trick...

register(bpstart(MulticoreParam()))

If I understand correctly, this would set the number of available cores to 78. Would it be better to use a smaller number of cores? johannes.rainer had cautioned against using too many cores. Would setting a limit on RAM (as mentioned in a previous post) prevent the "serious troubles" that johannes.rainer mentioned?
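If a smaller count is wanted, a minimal sketch (the worker count here is only a placeholder):

Code:
library(BiocParallel)
## Passing `workers` caps the number of parallel processes instead of
## letting MulticoreParam() pick its default.
register(bpstart(MulticoreParam(workers = 16)))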

Any other thoughts?

Many, many thanks!