
Long runtime while grouping!

Hello everyone,

We are currently establishing a new LC-MS system in our group at the IPK Gatersleben
(LECO Citius LC-HRT + Agilent 1290 Infinity + Gerstel MPS).

I am still a newbie in using R and the packages xcms and CAMERA.
So, playing around with the first test data sets generated, I face some problems:

R takes a very long time to generate the xcmsSet object, and even longer for the first grouping (multiple hours / days).
Moreover, there are problems with correct peak finding / alignment across the samples. I have adjusted many parameters, but still have no idea what the problem is.


The raw data were preprocessed using LECO's "ChromaTOF-HRT" software (mass calibration only) and exported as netCDF files. One file is around 1 GB in size.

Below you find the specification of the system R is running on:

   Win7 Ultimate (Service Pack1) 64 bit version
   Intel core i7-3770 processor CPU 3.40 GHz
   16 GB RAM
   Windows, R and Tinn-R are installed / running on a Kingston SV300S37A 240G SATA SSD

and the first lines of my script:

  setwd("C:/LCMS_Processed_Data/data_processing/20130320_Test")
  files <- list.files("C:/LCMS_Processed_Data/data_processing/20130320_Test/netCDF_data", recursive = TRUE, full.names = TRUE)

  library(CAMERA)   # loads xcms as a dependency

  xs <- xcmsSet(files, method = "centWave", ppm = 2, peakwidth = c(6, 15), snthresh = 3)

        An "xcmsSet" object with 5 samples
        Time range: 1-659 seconds (0-11 minutes)
        Mass range: 49.9804-2408.8713 m/z
        Peaks: 1446480 (about 289296 per sample)
        Peak Groups: 0
        Sample classes: netCDF_data
        Profile settings: method = bin
        step = 0.1
        Memory usage: 122 MB

  #up to this point it needs some hours

  xs <- group(xs, method = "nearest", mzCheck = 0.008, rtCheck = 10, kNN = 1)

  # This grouping needs days


I have tried the group.density method before, but from manually inspecting the data files I found that peaks were not correctly aligned.


Hope someone can help me -Thanks a lot in advance

Dominic


ps. If it is helpful I can provide the chromatograms / files as download

Re: Long runtime while grouping!

Reply #1
Hey Dominic,

1.) Which R version are you actually running? 32 or 64 bit? What's the output of
Code: [Select]
R.version
Can you see from the task manager how much memory is allocated to Rsession?

2.) Your CPU has 4 cores, you should make use of it! At least xcmsSet can be parallelized by
Code: [Select]
xcmsSet( ..., nSlaves = 4 )
However, in my experience, using 3 slaves on a low number of chromatograms might be faster in the end, especially considering how Ivy Bridge's Turbo Boost works. In your case (just 5 chromatograms) I would even use only 2 slaves. In order to use the multicore functionality you need Rmpi or snow installed as the parallelization backend.

3.) I am not sure, but your netCDF files seem to be really huge. Are the MS data in profile mode? If yes, could you convert them to centroid mode prior to the xcms analysis?

4.) Do 289,296 peaks per sample seem reasonable to you? It is this peak count that makes the grouping so slow. Are you really expecting a scan-to-scan accuracy of 2 ppm? Our qTOF, which is advertised with a mass accuracy below 5 ppm (meaning a weighted m/z mean over an entire peak), actually exhibits a scan-to-scan accuracy of up to 35 ppm. Choosing the ppm parameter too low might lead to peaks being split along the m/z dimension. Additionally, peaks might be split in the time dimension as well if peakwidth is not chosen properly. I guess you are running a UPLC (in UPLC mode), given the peakwidth of (6, 15)?

5.) You mention that your peaks are not correctly aligned. Aligned along retention time? Then you should consider calling retcor() prior to grouping. But watch out: retcor.obiwarp always took 10-50 times longer for me than calling group.density on the same data set. So, if it makes any sense to you, try to lower the number of found peaks before proceeding.

Cheers,
Isam

Re: Long runtime while grouping!

Reply #2
Hey Isam,

Thanks so much for the rapid help!

Quote
1.) Which R version are you actually running? 32 or 64 bit? What's the output of

Code: Select all
    R.version

Can you see from the task manager how much memory is allocated to Rsession?
I am using R version 2.15.3 (64bit)

> R.version
                    _                           
platform      x86_64-w64-mingw32         
arch          x86_64                     
os            mingw32                     
system        x86_64, mingw32             
status                                     
major          2                           
minor          15.3                       
year          2013                       
month          03                         
day            01                         
svn rev        62090                       
language      R                           
version.string R version 2.15.3 (2013-03-01)
nickname      Security Blanket           


According to the task manager around 1 GB RAM is allocated to the R session.

Quote
2.) Your CPU has 4 cores, you should make use of it! At least xcmsSet can be parallelized by

Code: Select all
    xcmsSet( ..., nSlaves = 4 )

However, in my experience, using 3 slaves on a low number of chromatograms might be faster in the end, especially considering how Ivy Bridge's Turbo Boost works. In your case (just 5 chromatograms) I would even use only 2 slaves. In order to use the multicore functionality you need Rmpi or snow installed as the parallelization backend.
Yes, I am aware of the possibility of using multiple cores. Just today we found a way to run Rmpi on Win 7.  :) The runtime for xcmsSet() has been reduced to one hour. But the grouping would still take days if we applied group.nearest.

Quote
3.) I am not sure, but your netCDF files seem to be really huge. Are the MS data in profile mode? If yes, could you convert them to centroid mode prior to the xcms analysis?
The MS data are in centroid mode. We are also surprised about the size, but so far we have found no indication that the files are corrupt.

Quote
4.) Do 289,296 peaks per sample seem reasonable to you? It is this peak count that makes the grouping so slow. Are you really expecting a scan-to-scan accuracy of 2 ppm? Our qTOF, which is advertised with a mass accuracy below 5 ppm (meaning a weighted m/z mean over an entire peak), actually exhibits a scan-to-scan accuracy of up to 35 ppm. Choosing the ppm parameter too low might lead to peaks being split along the m/z dimension. Additionally, peaks might be split in the time dimension as well if peakwidth is not chosen properly. I guess you are running a UPLC (in UPLC mode), given the peakwidth of (6, 15)?
Of course we are aware that the number is incredibly high and slows down the procedure. Our TOF has an accuracy below 1 ppm (resolution = 50,000) and we have not seen large deviations in accuracy between scans. We also looked at the data, and for some targets whose accuracy we know, it was typically below 1 ppm and rarely higher. We just ran the peak finding with 4 ppm instead of 2, and it reduced the peak number only marginally, from 302275 to 293018. We then set ppm to 8 and again got 293018 peaks. We received the following warning at 8 ppm: "Warning: There were 5518 peak data insertion problems. Please try lowering the "ppm" parameter." I am not sure, but I think at 8 ppm ions from other peaks might be crossing over. We are using a UHPLC and estimated the chromatographic peak width to be 6-15 seconds (above baseline, not FWHM).

Quote
5.) You mention that your peaks are not correctly aligned. Aligned along retention time? Then you should consider calling retcor() prior to grouping. But watch out: retcor.obiwarp always took 10-50 times longer for me than calling group.density on the same data set. So, if it makes any sense to you, try to lower the number of found peaks before proceeding.
I meant the grouping by group.density. The retention times were pretty stable, with the apices of solid peaks being only 1-2 seconds apart between chromatograms.


Cheers,
Dominic

ps. just tell me if I can run some test that may help to solve this problem

Re: Long runtime while grouping!

Reply #3
Dominic,

I am afraid the grouping just takes that long. Usually I use group.density for my data, and for comparison I just tried group.nearest on an example data set (6 samples, around 3000 peaks/sample). My machine has almost the same configuration as yours (but running Mac OS X). Grouping takes about 1.5 minutes. From the description I expect that the algorithm runs in something like O(n²) time, meaning that if you have 100 times more peaks, your runtime should be around 10,000 times longer. I am curious whether someone else has an idea, but I am afraid that you either have to live with it or lower the number of detected peaks somehow.
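If the scaling is really quadratic, a quick back-of-envelope extrapolation from my example data set would land in the multi-day range, consistent with what you see (the 1.5-minute baseline is just my single measurement, so treat this as an order-of-magnitude estimate, nothing more):

```r
# Back-of-envelope: assume group.nearest scales quadratically with total peak count
baselinePeaks   <- 6 * 3000      # my example data set: 6 samples x ~3000 peaks
baselineMinutes <- 1.5           # measured grouping time on that data set
totalPeaks      <- 1446480       # total peak count from your xcmsSet output

scale <- (totalPeaks / baselinePeaks)^2
estimatedDays <- baselineMinutes * scale / 60 / 24
round(estimatedDays, 1)
```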

Before going on to grouping, I'd try to better understand the peak table. Maybe you should start by looking only at the peaks of one sample:
Code: [Select]
pks <- peaks(xset)
pksSample1 <- pks[ pks[ ,"sample" ] == 1, ]

I'd focus on three aspects:

a) How many peaks overlap in RT? This could give you an idea of whether you have to contend with many adducts or in-source fragmentations.
b) How many peaks are very close in m/z, and how close are they to each other on the RT axis?
c) How are the abundances distributed? Maybe you are collecting too much noise at low abundance and could avoid this directly when calling xcmsSet.
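A minimal sketch of these three checks (using a toy peak matrix in place of peaks(xset); the 2 s and 0.005 m/z thresholds are arbitrary and would need adjusting to your data):

```r
# Toy peak matrix standing in for peaks(xset); columns follow the xcms naming
pks <- cbind(
  mz     = c(100.0500, 100.0504, 250.1000, 250.1010, 400.2000),
  rt     = c(120.1, 120.3, 300.0, 300.2, 450.5),
  into   = c(1e6, 5e3, 2e5, 8e2, 3e4),
  sample = c(1, 1, 1, 1, 1)
)
pksSample1 <- pks[pks[, "sample"] == 1, , drop = FALSE]

# a) peaks within 2 s of the next peak in RT (possible co-elution / adducts)
rtDiffs <- diff(sort(pksSample1[, "rt"]))
sum(rtDiffs < 2)

# b) peaks closer than 0.005 m/z to their nearest neighbour in m/z
mzDiffs <- diff(sort(pksSample1[, "mz"]))
sum(mzDiffs < 0.005)

# c) abundance distribution -- a heavy low-intensity tail suggests noise peaks
quantile(pksSample1[, "into"])
```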

Just a couple of ideas. Sorry that I couldn't help more.

Cheers,
Isam

Re: Long runtime while grouping!

Reply #4
Hey Isam,

Thanks for the advice.

It really looks like we are collecting too much noise at low abundance.
We decided to use the density algorithm for our further work and tried to optimize some parameters like bw and mzwid.

Code: [Select]
xsneg   <- xcmsSet("C:/LCMS_Processed_Data/data_processing/20130327_Test/netCDF_data/negative", method="centWave", ppm=4, peakwidth=c(3,8), snthr=3, nSlaves=8)             
xspos  <- xcmsSet("C:/LCMS_Processed_Data/data_processing/20130327_Test/netCDF_data/positive", method="centWave", ppm=4, peakwidth=c(3,8), snthr=3, nSlaves=8)

gnego    <- group(xsneg, method="density", bw=0.01, mzwid=0.003, max=100, minsamp = 4)
gposo    <- group(xspos, method="density", bw=0.01, mzwid=0.003, max=100, minsamp = 4)

neg  <- xsAnnotate(gnego, nSlaves=8, polarity="negative") #, calcCiS=TRUE, calcIso=FALSE, calcCaS=FALSE)
neg  <- groupFWHM(neg)
neg  <- groupCorr(neg) 
neg  <- findAdducts(neg, polarity="negative", mzabs=0.002)
neg  <- findIsotopes(neg, mzabs = 0.002)

pos  <- xsAnnotate(gposo, nSlaves=8, polarity="positive")#, calcCiS=TRUE, calcIso=FALSE, calcCaS=FALSE)
pos  <- groupFWHM(pos)
pos  <- groupCorr(pos) 
pos  <- findAdducts(pos, polarity="positive", mzabs=0.002)
pos  <- findIsotopes(pos, mzabs = 0.002)


combipos <- combinexsAnnos(pos, neg, pos=TRUE, tol=2, ruleset=NULL)
combineg <- combinexsAnnos(pos, neg, pos=FALSE, tol=2, ruleset=NULL)

It resulted in around 30 thousand features in positive mode and 10 thousand in negative mode that are present in all four samples (around one third are isotope peaks).

Cheers,
Dominic

Re: Long runtime while grouping!

Reply #5
Hey Dominic,

good to hear that.

I am just curious: within the CAMERA workflow, you call findIsotopes() as the last step. Is there a reason for that? I am asking because I usually call findIsotopes() between groupFWHM() and groupCorr(), since the isotope information can be used during correlation grouping, which avoids separating isotope peaks of the same compound into different pseudospectra.

Cheers,
Isam

Re: Long runtime while grouping!

Reply #6
Hi Isam,

We thought the other way round: that dissecting the groupFWHM() pseudospectra into more potentially independent analytes by using groupCorr() would save us from assigning isotopic peaks wrongly.
As an example: after groupFWHM() we have a pseudospectrum A with 100 peaks. After groupCorr() some peaks will be removed due to low correlations, and pseudospectrum A will only have, let's say, 70 peaks.
I thought it would be better to assign isotopic peaks only within these 70 peaks. Would it be better to do it as you suggested?

Cheers

Dominic

Re: Long runtime while grouping!

Reply #7
Hi Dominic,

to be honest, I am not a CAMERA ninja but have only just started getting my hands dirty. So this is just a thought:

Quote
after groupFWHM() we have a pseudospectrum A with 100 peaks. After groupCorr() some peaks will be removed due to low correlations, and pseudospectrum A will only have, let's say, 70 peaks.

Let's say within those 100 peaks you find two peaks which fulfill the criteria for being isotopes of each other (12C/13C ratio + m/z difference), but their peak shapes correlate less than groupCorr() "expects". Which of these contradictory observations is more reliable? Especially when considering that the M+1 or M+2 peak's abundance is typically only a small fraction of that of the corresponding M peak? Consequently, the peak shapes might look different in certain cases, considering that large M peaks might be subject to ion suppression, while low M+X peaks could interfere with noise and the baseline.

Now, if those peaks are separated by groupCorr(), neither of them will be removed, but they will be assigned to independent pseudospectra A and B. If you then apply findIsotopes(), you will lose the M+1 information for the M peak (in pseudospectrum A), making a subsequent formula generation less reliable. And you'll find in pseudospectrum B the M+1 without any annotation, or worse, with a wrong one.

That's why I currently apply findIsotopes() prior to groupCorr(), as suggested in the vignette. Does this make sense to you?
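Concretely, on your negative-mode script that would mean reordering the calls roughly like this (a sketch of the order I mean, not tested on your data; `gnego` is assumed to be your grouped negative-mode xcmsSet from before):

```r
library(CAMERA)

neg <- xsAnnotate(gnego, polarity = "negative")
neg <- groupFWHM(neg)
neg <- findIsotopes(neg, mzabs = 0.002)  # isotopes first ...
neg <- groupCorr(neg)                    # ... so correlation grouping can use them
neg <- findAdducts(neg, polarity = "negative", mzabs = 0.002)
```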

Cheers,
Isam

Re: Long runtime while grouping!

Reply #8
That is a lot of features per sample.
I've not seen data from a LECO TOF before, but one of the first things I would check is the quality of the centroidization algorithm.
What do the centroid spectra look like compared to the profile spectra? Sometimes artifacts (split peaks, ringing effects, etc.) are generated when centroidization does not work well. Does it apply some kind of thresholding?
If you feel that centWave is picking up too many features with very low intensity, you should play with the prefilter option.
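For example (a sketch only: centWave's default is prefilter = c(3, 100), i.e. keep only regions with at least 3 consecutive scans above intensity 100, and the c(3, 5000) below is an arbitrary starting point you would need to tune to your data):

```r
# Raising the prefilter intensity cutoff discards low-intensity regions of
# interest early, before the expensive peak fitting step
xs <- xcmsSet(files, method = "centWave", ppm = 4, peakwidth = c(3, 8),
              snthresh = 3, prefilter = c(3, 5000), nSlaves = 4)
```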

 

Re: Long runtime while grouping!

Reply #9
Hi Isam,

Yes, this makes sense to me and from now on we will do it this way.
Thanks for the advice. :)


Hello Ralf,

We are also “pleasantly” surprised about the number of features.
Sorry, but I can't really say much about the quality of the centroidization algorithm or the profile spectra.
The system, or rather the ChromaTOF software, collects the data directly in centroid mode because of the high data volume / scan rate!

Cheers, Dominic