Metabolomics Society Forum

Software => R => XCMS => Topic started by: andgan on August 14, 2012, 08:09:01 AM

Title: memory error during fillpeak step
Post by: andgan on August 14, 2012, 08:09:01 AM
Dear all,

I'm aligning a large number of samples (~9,000). Everything works fine until I start the peak-filling step.

Code: [Select]

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8      LC_NUMERIC=C             
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8 
 [7] LC_PAPER=C                LC_NAME=C               
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C     

attached base packages:
[1] stats    graphics  grDevices utils    datasets  methods  base   

other attached packages:
[1] xcms_1.30.3

......

> xset3
An "xcmsSet" object with 8954 samples

Time range: 2.1-1319.6 seconds (0-22 minutes)
Mass range: 49.9841-1200.0173 m/z
Peaks: 101230521 (about 11306 per sample)
Peak Groups: 18501
Sample classes: inj2, inj3

Profile settings: method = bin
                  step = 0.1

Memory usage: 9040 MB

>xset4 <- fillPeaks.chrom(xset3)

Error in rbind(deparse.level, ...) :
  resulting vector exceeds vector length limit in 'AnswerType'

> traceback()

4: rbind(peakmat, matrix(nrow = sum(is.na(gvals)), ncol = ncol(peakmat)))
3: .local(object, ...)
2: fillPeaks.chrom(xset3)
1: fillPeaks.chrom(xset3)


Note that fillPeaks.chrom worked fine when I lowered the number of peak groups (by lowering bw or mzwidth):

Code: [Select]

> xset3
An "xcmsSet" object with 8954 samples

Time range: 2.1-1319.6 seconds (0-22 minutes)
Mass range: 49.9841-1200.0173 m/z
Peaks: 89205782 (about 9963 per sample)
Peak Groups: 13674
Sample classes: inj2, inj3

Profile settings: method = bin
                  step = 0.1

Memory usage: 8000 MB

> xset4 <- fillPeaks.chrom(xset3)
021228_TwinGene_000101 021228_TwinGene_000102 etc....


I have plenty of RAM available, since I'm using a node with 72GB of RAM and I even tried a node with 512GB of RAM. This process should use max 30GB of RAM.

I wonder if there is any memory limitation in the fillPeaks.chrom function.

Thanks in advance!
Title: Re: memory error during fillpeak step
Post by: Ralf on August 14, 2012, 10:32:30 AM
Can you run
Code: [Select]
options(error=recover)
xset4 <- fillPeaks.chrom(xset3)
then hit 3 (to enter the local environment of the fillPeaks function)
and enter
Code: [Select]
sum(is.na(gvals))
?

I would like to see if sum(is.na(gvals)) is greater than .Machine$integer.max (2147483647)

Ralf
Title: Re: memory error during fillpeak step
Post by: andgan on August 14, 2012, 10:50:42 AM
Hi Ralf,

Code: [Select]

> xset4 <- fillPeaks.chrom(xset3)
Error in rbind(deparse.level, ...) :
  resulting vector exceeds vector length limit in 'AnswerType'

Enter a frame number, or 0 to exit 

1: fillPeaks.chrom(xset3)
2: fillPeaks.chrom(xset3)
3: .local(object, ...)
4: rbind(peakmat, matrix(nrow = sum(is.na(gvals)), ncol = ncol(peakmat)))

Selection: 3
Called from: .local(object, ...)
Browse[1]> sum(is.na(gvals))
[1] 133210130


It looks to be lower. Any other suggestions?

Andrea
Title: Re: memory error during fillpeak step
Post by: Ralf on August 14, 2012, 11:50:10 AM
It looks like the resulting matrix is simply too big for R (at the moment).

Code: [Select]
R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

> ncol=12
> peakmat <- matrix(0, nrow=101230521, ncol=ncol)
> peakmat2 <- rbind(peakmat, matrix(nrow =133210130, ncol =ncol))
Error in rbind(deparse.level, ...) :
  resulting vector exceeds vector length limit in 'AnswerType'
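The arithmetic matches the 32-bit length limit: the combined peak matrix would hold (101230521 + 133210130) rows times 12 columns, and that element count overflows a signed 32-bit integer (a quick sanity check, assuming the 12 columns used in the reproduction above):

```r
## Element count of the rbind() result vs the vector length limit
n_peaks  <- 101230521   # peaks in xset3
n_filled <- 133210130   # sum(is.na(gvals)) from the recover session
(n_peaks + n_filled) * 12            # 2,813,287,812 elements
.Machine$integer.max                 # 2,147,483,647
(n_peaks + n_filled) * 12 > .Machine$integer.max   # TRUE
```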

http://svn.r-project.org/R/trunk/src/main/bind.c
Code: [Select]
#ifndef LONG_VECTOR_SUPPORT
    if (data->ans_length < 0)
errorcall(call, _("resulting vector exceeds vector length limit in '%s'"), "AnswerType");
#endif

Looks like the R team is working on it and long vector support is coming in 2.16
http://permalink.gmane.org/gmane.comp.lang.r.datatable/1125
http://stat.ethz.ch/R-manual/R-devel/doc/html/NEWS.html

Ralf
Title: Re: memory error during fillpeak step
Post by: andgan on August 14, 2012, 12:09:16 PM
Thanks, great!
Title: Re: memory error during fillpeak step
Post by: Ralf on August 14, 2012, 06:28:52 PM
The example below works for me with today's R-devel:
Code: [Select]
R Under development (unstable) (2012-08-14 r60264) -- "Unsuffered Consequences"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

> ncol=12
> peakmat <- matrix(0, nrow=101230521, ncol=ncol)
> peakmat2 <- rbind(peakmat, matrix(nrow =133210130, ncol =ncol))
> print(object.size(peakmat2),unit="Gb")
21 Gb
Title: Re: memory error during fillpeak step
Post by: andgan on September 20, 2012, 05:07:45 AM
Dear Ralf,

Now I have more samples and even with the new R release, the matrix seems to be too large:

Code: [Select]
R Under development (unstable) (2012-09-19 r60762) -- "Unsuffered Consequences"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

> xset3
An "xcmsSet" object with 12634 samples

Time range: 2.2-1319.6 seconds (0-22 minutes)
Mass range: 49.9841-1200.0173 m/z
Peaks: 118463468 (about 9377 per sample)
Peak Groups: 21258

Profile settings: method = bin
                  step = 0.1

Memory usage: 10700 MB

>options(error=recover)
>xset4 <- fillPeaks.chrom(xset3)
Error in rbind(peakmat, matrix(nrow = sum(is.na(gvals)), ncol = ncol(peakmat))) :
  long vectors not supported yet: bind.c:1524
Browse[1]> sum(is.na(gvals))
[1] 230456952

>peakmat <- matrix(0, nrow=118463468, ncol=ncol)
> peakmat2 <- rbind(peakmat, matrix(nrow =230456952, ncol =ncol))
Error in rbind(peakmat, matrix(nrow = 230456952, ncol = ncol)) :
  long vectors not supported yet: bind.c:1524


Is there a way to split the grouped data into two sub-datasets (half the samples in one and half in the other) and then run the fillPeaks step on each dataset? This shouldn't change the feature names, and it should be straightforward to merge them back together.

Best,
Title: Re: memory error during fillpeak step
Post by: Jan Stanstrup on September 20, 2012, 01:23:04 PM
That is a lot of samples!!! ~200 days of run time.

You can split your xcmsSet: see ?split.xcmsSet.
But the peak group info is discarded, so it won't be trivial to put it back together again after re-grouping and filling. Is there no logical way to divide this data for individual analysis?
Title: Re: memory error during fillpeak step
Post by: andgan on September 21, 2012, 12:26:34 AM
Yes, indeed there are a lot of samples. However, I guess that in the future it will be more common to see large metabolomics efforts like this.
I can align each study separately (there are two studies involved) and then find the common features with metaXCMS. The problem is that metaXCMS needs the diffreport output, and I don't need to do any association analysis (or I want to do it with other software). I just want to find the common features between the two separately aligned studies. Is there a way to do that with metaXCMS?
Title: Re: memory error during fillpeak step
Post by: Carsten on September 21, 2012, 04:47:18 AM
Quote from: "Jan Stanstrup"
You can split your xcmsSet: see ?split.xcmsSet.
But the peak group info is discarded, so it won't be trivial to put it back together again after re-grouping and filling. Is there no logical way to divide this data for individual analysis?
I actually tried that yesterday, but it turns out to be trickier than I thought.
You need some adaptation in findPeaks or a script which simply redoes that step.
I will try again over the weekend and report back!

Carsten
Title: Re: memory error during fillpeak step
Post by: andgan on September 30, 2012, 03:41:49 PM
Hi all,
the problem with the split.xcmsSet function is that I lose both the retention time correction and the groups. Basically it is the same as aligning the two studies separately.

I directly aligned the two studies separately, using the same XCMS parameters. I then need to find the common features in both studies.
There are two options to do that:

Fixed approach, where I specify the max deviation in masses and retention times between the two studies (I think this is similar to what metaXCMS does).
How it works: a feature from study 1 is matched to any feature within the specified mass tolerance and within x seconds in study 2.

However, we know that the retention time correction is a function of the retention time and is not uniform across retention times.

This is for example the median retention time correction for study 1. The dashed lines are 1.96*median absolute deviation.


[attachment: a1.png]


This is the same graph for study 2.


[attachment: a2.png]


The second possibility is:

Dynamic approach, where I specify the max deviation in masses (e.g. +/- 0.02 Da) and a dynamic deviation in retention time.
A dynamic retention time means that when I try to find features with a similar retention time, given that they have a similar mass, I will not use a fixed range, but allow the deviation in retention times to depend on the retention time itself.


Let's see an example:

I have this feature from study 1: M280.093T32.408

My goal is to determine if there is a similar feature in study 2.

This is the algorithm I used to compare features and find out if they are the same:


1. For the feature i of interest from study 1, with retention time t_i and mass m_i, search for all the features in study 2 with mass m such that |m_i-m| < 0.02. Obtain x(1)...x(n) features.

Example: M280.093T31.831 M280.089T144.406 M280.111T333.820 M280.108T454.073 M280.104T578.557 M280.103T890.490 M280.103T299.882

2. Determine the confidence intervals of t_i.
To do that, you need the median retention time correction (median(to)_s1) and the variability of the retention time correction (mad(to)_s1) for the retention time closest to t_i (Figure 1). These values can be obtained from XCMS (datasetname1@rt) by looping across all individuals to get the median and the variability.
Then the upper confidence interval of t_i is:
U_t_i=t_i+|median(to)_s1+1.96*mad(to)_s1|
and the lower confidence interval is:
L_t_i=t_i-|median(to)_s1-1.96*mad(to)_s1|.

Example: median(to)_s1+1.96*mad(to)_s1 of ~32.408=0.942 and median(to)_s1-1.96*mad(to)_s1 of ~32.408=-0.573 then U_t_i=33.350 and L_t_i=31.835


3. Start with feature x(1) and determine the confidence intervals of t_x(1) (the retention time of x(1)).
To do that, you need the median retention time correction (median(to)_s2) and the variability of the retention time correction (mad(to)_s2) for the retention time closest to t_x(1) (Figure 2). These values can be obtained from XCMS (datasetname2@rt) by looping across all individuals to get the median and the variability.
Then the upper confidence interval of t_x(1) is:
U_t_x(1)=t_x(1)+|median(to)_s2+1.96*mad(to)_s2|
and the lower confidence interval is:
L_t_x(1)=t_x(1)-|median(to)_s2-1.96*mad(to)_s2|.

Example: median(to)_s2+1.96*mad(to)_s2 of ~31.831=1.912 and median(to)_s2-1.96*mad(to)_s2 of ~31.831=-1.061 then U_t_x(1)=33.743 and L_t_x(1)=30.770


4. Check if the confidence intervals of t_i and t_x(1) overlap.

Example: does [31.835, 33.350] overlap [30.770, 33.743]? TRUE


5. Repeat steps 3 and 4 for x(2)...x(n).

Example: TRUE FALSE FALSE FALSE FALSE FALSE FALSE

Then I conclude that M280.093T32.408 from study 1 is the same as M280.093T31.831 from study 2.
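The steps above can be sketched roughly in R. Here med_corr() and mad_corr() are hypothetical helpers standing in for the median and the MAD of the retention time correction at the retention time closest to t (built from datasetname@rt as described in steps 2 and 3):

```r
## Confidence interval of a retention time t, given helpers med_corr(t)
## and mad_corr(t) returning the median and MAD of the rt correction
## at the retention time closest to t (hypothetical, see steps 2-3).
ci <- function(t, med_corr, mad_corr) {
  c(lower = t - abs(med_corr(t) - 1.96 * mad_corr(t)),
    upper = t + abs(med_corr(t) + 1.96 * mad_corr(t)))
}

## Step 4: two features match if their rt confidence intervals overlap.
same_feature <- function(t1, t2, med1, mad1, med2, mad2) {
  ci1 <- ci(t1, med1, mad1)
  ci2 <- ci(t2, med2, mad2)
  ci1[["lower"]] <= ci2[["upper"]] && ci2[["lower"]] <= ci1[["upper"]]
}
```

With the numbers from the worked example (t_i = 32.408, t_x(1) = 31.831 and the correction summaries given above) this returns TRUE, matching the conclusion for M280.093T31.831.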

What do you think? Does this approach make sense?

Best,

Andrea

[attachment deleted by admin]
Title: Re: memory error during fillpeak step
Post by: prasadpb on December 18, 2012, 06:28:33 AM
Dear XCMS users,
I am facing the same problem with just two data files from a 30 min run on a Q-Exactive, after converting them to CDF files (each file is up to 1.3 GB).
Initially I was working with my data of 30 samples and ran into memory allocation problems above 40000 MB, so I decided to run with just two samples. They worked fine until fillPeaks() and I got the following error.
How can I process my data using a 64-bit PC with 4 GB RAM, when I am getting a memory error with just 2 samples?
Is there any way out?

Please help.



> xset2<-fillPeaks(xset1)
NS-Coat-161212-STG-1 Error: cannot allocate vector of size 840.9 Mb
In addition: Warning messages:
1: In mzR:::netCDFRawData(cdf) :
  Reached total allocation of 4004Mb: see help(memory.size)
2: In mzR:::netCDFRawData(cdf) :
  Reached total allocation of 4004Mb: see help(memory.size)
3: In mzR:::netCDFRawData(cdf) :
  Reached total allocation of 4004Mb: see help(memory.size)
4: In mzR:::netCDFRawData(cdf) :
  Reached total allocation of 4004Mb: see help(memory.size)

R Under development (unstable) (2012-12-01 r61188)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                         
[5] LC_TIME=English_United States.1252   

attached base packages:
[1] parallel  stats    graphics  grDevices utils    datasets  methods  base   

other attached packages:
[1] multtest_2.14.0    Biobase_2.19.1    BiocGenerics_0.5.1 xcms_1.34.0        mzR_1.5.2          Rcpp_0.10.1     

loaded via a namespace (and not attached):
[1] codetools_0.2-8  MASS_7.3-22      splines_2.16.0  stats4_2.16.0    survival_2.36-14
>
Title: Re: memory error during fillpeak step
Post by: Ralf on December 18, 2012, 11:24:49 AM
Quote
Is there any way out?

Yes there is. Simply purchase sufficient RAM for your workstation.
Title: Re: memory error during fillpeak step
Post by: prasadpb on December 20, 2012, 09:35:58 AM
Quote from: "Ralf"
Quote
Is there any way out?

Yes there is. Simply purchase sufficient RAM for your workstation.
I mean, can I generate just the peak report (without the 'multtest' comparison) for those two chromatograms without doing fillPeaks()?
Thanks
Title: Re: memory error during fillpeak step
Post by: Jan Stanstrup on March 13, 2013, 08:16:52 AM
I think I found an unfortunate behavior that is causing fillPeaks to need an excessive amount of memory.

The problem seems to be here:
Code: [Select]
   argList <- apply(ft,1,function(x) list(file=x["file"],id=as.numeric(x["id"]),
                                          params=list(method="chrom",
                                            gvals=gvals,
                                            prof=prof,
                                            dataCorrection=object@dataCorrection,
                                            polarity=object@polarity,
                                            rtcor=object@rt$corrected[[as.numeric(x["id"])]],
                                            peakrange=peakrange)
                                          ))
gvals is repeated for each sample as far as I can see.

This is the object I used for testing (we have a bigger set where it runs out of memory):
Code: [Select]
An "xcmsSet" object with 456 samples

Time range: 0.6-361 seconds (0-6 minutes)
Mass range: 60.0429-999.9048 m/z
Peaks: 1050355 (about 2303 per sample)
Peak Groups: 4232
Sample classes: data_converted_pos

Profile settings: method = bin
                  step = 0.005

Memory usage: 266 MB

So gvals is 4232 x 456 and takes 7.6 MB. When it is repeated for each sample, that eats 3.5 GB alone in this case, and it grows quadratically as you increase the number of samples.
If I understand correctly, gvals is the same in each index of the list. So can the structure of argList be changed to hold only one copy?
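The duplication can be illustrated with object.size(), which does not detect sharing and so counts each list element's copy separately, i.e. roughly what the list costs once serialized out to slaves. This is a sketch with assumed dimensions matching the set above (and assuming gvals is an integer matrix, which fits the reported 7.6 MB):

```r
## A groups x samples matrix the size of gvals above (integer, ~7.4 Mb)
gvals <- matrix(0L, nrow = 4232, ncol = 456)
print(object.size(gvals), units = "Mb")

## argList stores one copy per sample; serialized to the slaves this
## behaves like 456 independent copies, roughly 3.3 Gb in total
argList <- lapply(seq_len(ncol(gvals)), function(i) list(gvals = gvals))
print(object.size(argList), units = "Gb")
```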


edit: not to mention that the memory requirement is multiplied by the number of cores you want to use for gapfilling  :shock:
Title: Re: memory error during fillpeak step
Post by: Jan Stanstrup on May 26, 2013, 12:07:15 PM
bump?


I also find that fillPeaks now seems to "lose" the intb column. I couldn't find the cause though.
Title: Re: memory error during fillpeak step
Post by: gunnar on June 19, 2013, 01:55:32 AM
fillPeaks seems to lose all values except mz, mzmin, mzmax, rt, rtmin, rtmax, into, maxo and sample.

Is this somehow related to R3.0?

any solutions?

br
Gunnar
Title: Re: memory error during fillpeak step
Post by: sneumann on June 23, 2013, 01:26:59 PM
Hi,

Quote from: "gunnar"
fillPeaks seems to lose all values except mz, mzmin, mzmax, rt, rtmin, rtmax, into, maxo and sample.
I think that problem was fixed with

CHANGES IN VERSION 1.37.1
--------------------------
BUG FIXES
    o fixed fillPeaks, which 1) dropped non-standard columns
    and 2) failed if nothing to do, based on patches by Tony Larson.

Or is it still present there ?

Yours,
Steffen
Title: Re: memory error during fillpeak step
Post by: cbroeckl on January 23, 2014, 02:44:42 PM
I just wanted to refresh this thread with my own memory problems:
Windows 7 machine, 64 GB RAM
R 3.0.2
xcms 1.38.0

> xset
An "xcmsSet" object with 1610 samples

Time range: 1.7-1205.4 seconds (0-20.1 minutes)
Mass range: 55.0152-1199.8425 m/z
Peaks: 7274098 (about 4518 per sample)
Peak Groups: 6469

Memory usage: 1360 MB


xset <- fillPeaks.chrom(xset, nSlaves=2)
Error: cannot allocate vector of size 39.7 Mb

Windows task manager has the memory full upon failure, which gc() cleans up.
Title: Re: memory error during fillpeak step
Post by: sneumann on January 24, 2014, 02:54:43 AM
Hi,

how far do you get with nSlaves=1 ? The parallel fillPeaks currently
passes the xcmsSet down to all slaves, and the slaves operate on a subset
of them. The more clever way would be to pass trimmed xcmsSets to the slaves,
so the memory requirement is not multiplied by the number of slaves.

Steffen
Title: Re: memory error during fillpeak step
Post by: Jan Stanstrup on January 27, 2014, 11:18:49 AM
Quote from: "sneumann"
Hi,

how far do you get with nSlaves=1 ? The parallel fillPeaks currently
passes the xcmsSet down to all slaves, and the slaves operate on a subset
of them. The more clever way would be to pass trimmed xcmsSets to the slaves,
so the memory requirement is not multiplied by the number of slaves.

Steffen
Though this is true, the biggest problem is that the data is repeated for each sample...
Title: Re: memory error during fillpeak step
Post by: cbroeckl on January 30, 2014, 04:37:32 PM
nSlaves=1 still fails.  I don't know how far it gets, but the memory climbs pretty quickly until I receive the same error message as before.
Title: Re: memory error during fillpeak step
Post by: sneumann on February 05, 2014, 09:44:06 AM
Hi,

It would be great if someone could test xcms prior to 1.35.4, e.g. 1.34.0 from
http://bioconductor.org/packages/2.11/bioc/html/xcms.html

That would be a way to rule out that my changes for parallel fillPeaks()
cause the problem. If 1.34.0 is fine, I'd need to think about an optimisation
for the parallel case.

Yours,
Steffen
Title: Re: memory error during fillpeak step
Post by: cbroeckl on February 06, 2014, 02:45:42 PM
Steffen et al.,

I just started testing the fillPeaks step using R v2.15.2 and xcms v1.34.0. I haven't completed the process yet, but I am 99.9% certain that this version will complete the fillPeaks step on the dataset I began earlier. I am ~10% through the data files and the memory usage by R has hardly changed over that time.
I will update this when I know for certain whether it successfully finished.
Corey
Title: Re: memory error during fillpeak step
Post by: cbroeckl on February 06, 2014, 05:56:28 PM
It did succeed - it seems the newer versions of fillPeaks are rather memory inefficient.
Title: Re: memory error during fillpeak step
Post by: Jan Stanstrup on March 27, 2014, 09:59:05 AM
I have a suggestion for a quick fix: use an environment to pass gvals. This way it is not repeated once per sample, and the code change is minimal.
In my test with 600 samples this used 140 MB instead of 10.4 GB for argList.

Code: [Select]
gvals_env <- new.env(parent=baseenv())
assign("gvals", gvals, envir = gvals_env)

argList <- apply(ft,1,function(x) {
  ## Add only those samples which actually have NA in them
  if (!any(is.na(gvals[,as.numeric(x["id"])]))) {
    ## nothing to do.
    list()
  } else {
    list(file=x["file"],id=as.numeric(x["id"]),
        params=list(method="chrom",
                    gvals=gvals_env,
                    prof=prof,
                    dataCorrection=object@dataCorrection,
                    polarity=object@polarity,
                    rtcor=object@rt$corrected[[as.numeric(x["id"])]],
                    peakrange=peakrange))
  }
})


fillPeaksChromPar would then need to do:
Code: [Select]
 gvals <- params$gvals$gvals
instead of
Code: [Select]
 gvals <- params$gvals
I don't know if there is a way around this. Is this used by any other functions? If not, it seems like a fast fix, though my knowledge of environments is very limited, so I don't know if this has any unforeseen consequences.
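A toy comparison shows why the environment version stays small: object.size() counts each list element's copy of an atomic matrix, but reports an environment only as a reference, without its contents (sizes here are made up for illustration):

```r
## 100 "samples", each list element carrying a ~3.8 Mb integer matrix
g <- matrix(0L, nrow = 1000, ncol = 1000)
plain <- lapply(1:100, function(i) list(gvals = g))

## the same matrix passed once via an environment
gvals_env <- new.env(parent = baseenv())
assign("gvals", g, envir = gvals_env)
shared <- lapply(1:100, function(i) list(gvals = gvals_env))

print(object.size(plain),  units = "Mb")  # ~380 Mb: copies counted
print(object.size(shared), units = "Mb")  # tiny: only env references
```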
Title: Re: memory error during fillpeak step
Post by: sneumann on March 28, 2014, 01:19:44 AM
Thanks Jan, this sounds impressive!

We now have also created the github-bioc bridge
at https://github.com/sneumann/xcms
Could you send this patch as a pull request?

Thanks in advance,
yours,
Steffen
Title: Re: memory error during fillpeak step
Post by: Jan Stanstrup on March 28, 2014, 04:11:20 AM
Done. I also managed to compile xcms and test the function directly. It appears to work as intended :)