Skip to main content
Topic: memory error during fillpeak step (Read 10476 times) previous topic - next topic

memory error during fillpeak step

Dear all,

I'm aligning a large number of samples (~9,000). Everything works fine until I start the filling peak part.

Code: [Select]

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8      LC_NUMERIC=C             
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8 
 [7] LC_PAPER=C                LC_NAME=C               
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C     

attached base packages:
[1] stats    graphics  grDevices utils    datasets  methods  base   

other attached packages:
[1] xcms_1.30.3

......

> xset3
An "xcmsSet" object with 8954 samples

Time range: 2.1-1319.6 seconds (0-22 minutes)
Mass range: 49.9841-1200.0173 m/z
Peaks: 101230521 (about 11306 per sample)
Peak Groups: 18501
Sample classes: inj2, inj3

Profile settings: method = bin
                  step = 0.1

Memory usage: 9040 MB

>xset4 <- fillPeaks.chrom(xset3)

Error in rbind(deparse.level, ...) :
  resulting vector exceeds vector length limit in 'AnswerType'

> traceback()

4: rbind(peakmat, matrix(nrow = sum(is.na(gvals)), ncol = ncol(peakmat)))
3: .local(object, ...)
2: fillPeaks.chrom(xset3)
1: fillPeaks.chrom(xset3)


Note that fillPeaks.chrom worked fine when I lowered the number of peak groups (by lowering bw or mzwidth):

Code: [Select]

> xset3
An "xcmsSet" object with 8954 samples

Time range: 2.1-1319.6 seconds (0-22 minutes)
Mass range: 49.9841-1200.0173 m/z
Peaks: 89205782 (about 9963 per sample)
Peak Groups: 13674
Sample classes: inj2, inj3

Profile settings: method = bin
                  step = 0.1

Memory usage: 8000 MB

> xset4 <- fillPeaks.chrom(xset3)
021228_TwinGene_000101 021228_TwinGene_000102 etc....


I have plenty of RAM available, since I'm using a node with 72GB of RAM and I even tried a node with 512GB of RAM. This process should use max 30GB of RAM.

I wonder if there is any memory limitation in the fillPeaks.chrom function.

Thanks in advance!

Re: memory error during fillpeak step

Reply #1
Can you run
Code: [Select]
options(error=recover)
xset4 <- fillPeaks.chrom(xset3)
the hit 3  (enter the local environment of the fillpeaks function)
and enter
Code: [Select]
sum(is.na(gvals))
?

I would like to see if sum(is.na(gvals)) is greater than .Machine$integer.max (2147483647)

Ralf

Re: memory error during fillpeak step

Reply #2
Hi Ralf,

Code: [Select]

> xset4 <- fillPeaks.chrom(xset3)
Error in rbind(deparse.level, ...) :
  resulting vector exceeds vector length limit in 'AnswerType'

Enter a frame number, or 0 to exit 

1: fillPeaks.chrom(xset3)
2: fillPeaks.chrom(xset3)
3: .local(object, ...)
4: rbind(peakmat, matrix(nrow = sum(is.na(gvals)), ncol = ncol(peakmat)))

Selection: 3
Called from: .local(object, ...)
Browse[1]> sum(is.na(gvals))
[1] 133210130


It looks to be lower/ Other suggestions?

Andrea

Re: memory error during fillpeak step

Reply #3
It looks like the resulting matrix is simply too big for R (at the moment).

Code: [Select]
R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

> ncol=12
> peakmat <- matrix(0, nrow=101230521, ncol=ncol)
> peakmat2 <- rbind(peakmat, matrix(nrow =133210130, ncol =ncol))
Error in rbind(deparse.level, ...) :
  resulting vector exceeds vector length limit in 'AnswerType'

http://svn.r-project.org/R/trunk/src/main/bind.c
Code: [Select]
#ifndef LONG_VECTOR_SUPPORT
    if (data->ans_length < 0)
errorcall(call, _("resulting vector exceeds vector length limit in '%s'"), "AnswerType");
#endif

Looks like the R team is working on it and long vector support is coming in 2.16
http://permalink.gmane.org/gmane.comp.l ... table/1125
http://stat.ethz.ch/R-manual/R-devel/doc/html/NEWS.html

Ralf

Re: memory error during fillpeak step

Reply #4
Thanks great!

Re: memory error during fillpeak step

Reply #5
the example below works for me with today's Rdev
Code: [Select]
R Under development (unstable) (2012-08-14 r60264) -- "Unsuffered Consequences"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

> ncol=12
> peakmat <- matrix(0, nrow=101230521, ncol=ncol)
> peakmat2 <- rbind(peakmat, matrix(nrow =133210130, ncol =ncol))
> print(object.size(peakmat2),unit="Gb")
21 Gb

Re: memory error during fillpeak step

Reply #6
Dear Ralf,

Now I have more samples and even with the new R release, the matrix seems to be too large:

Code: [Select]
R Under development (unstable) (2012-09-19 r60762) -- "Unsuffered Consequences"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

> xset3
An "xcmsSet" object with 12634 samples

Time range: 2.2-1319.6 seconds (0-22 minutes)
Mass range: 49.9841-1200.0173 m/z
Peaks: 118463468 (about 9377 per sample)
Peak Groups: 21258

Profile settings: method = bin
                  step = 0.1

Memory usage: 10700 MB

>options(error=recover)
>xset4 <- fillPeaks.chrom(xset3)
Error in rbind(peakmat, matrix(nrow = sum(is.na(gvals)), ncol = ncol(peakmat))) :
  long vectors not supported yet: bind.c:1524
Browse[1]> sum(is.na(gvals))
[1] 230456952

>peakmat <- matrix(0, nrow=118463468, ncol=ncol)
> peakmat2 <- rbind(peakmat, matrix(nrow =230456952, ncol =ncol))
Error in rbind(peakmat, matrix(nrow = 230456952, ncol = ncol)) :
  long vectors not supported yet: bind.c:1524


Is there a way to split the grouped data in two sub-dataset (half samples in one and half in the other) and  then run the fillPeaks step in each dataset?. This shouldn't change the features names and it should be straightforward to merge them back together.

Best,

Re: memory error during fillpeak step

Reply #7
That is a lot of samples!!! ~200 days of run time.

You can split you xcmsset: see ?split.xcmsSet
But the peak group info is discarded so it won't be trivial to put it back together again after re-grouping and filling. There is no logical way to divide this data for individual analysis?
Blog: stanstrup.github.io

Re: memory error during fillpeak step

Reply #8
Yes indeed there are lot of samples. However I guess that in the future it will be more common to see this large metabolomics efforts.
I can align separate each study (there are two studies involved) and then find the common features with metaXCMS. The problem here is that metaXCMS needs the diffreport output and I don't need to do any association analysis (or I want it to do with other software). I just want to find the common features between the two separate aligned studies. Is there a way to do that with metaXCMS?

Re: memory error during fillpeak step

Reply #9
[quote author="Jan Stanstrup"]
You can split you xcmsset: see ?split.xcmsSet
But the peak group info is discarded so it won't be trivial to put it back together again after re-grouping and filling. There is no logical way to divide this data for individual analysis?[/quote]
I actually tried that yesterday but it turns out to be more tricky then I thought.
You need some adoption in findPeaks or a script which simply redo that step.
I will try again over the weekend and report back!

Carsten

Re: memory error during fillpeak step

Reply #10
Hi all,
the problem with the split.xcmsSet function is that I lose both retention time correction and groups. Basically it is the same that aligning the two studies separately.

I directely aligned the two studies separately using the same XCMS parameters. I then need to find the common features in both studies.
There are two options to do that:

    1. To use a
fixed approach where I specify the max deviation in masses and retention times between the two studies (I think this is similar to what metaXCMS does).
How does it work:

    a. Find all the masses in study 1 that are equal +/- 0.02 Dalton in study 2.
    b. For each common mass find the retention times that are equal +/-
x seconds in study 2.  [/list]
[/list]

However, we know that the retention time correction is a function of the retention time and not uniform for all the retention times.

This is for example the median retention time correction for study 1. The dashed lines are 1.96*median absolute deviation.


[attachment=1:jzr26mk2]a1.png[/attachment:jzr26mk2]


This is the same graph for study 2.


[attachment=0:jzr26mk2]a2.png[/attachment:jzr26mk2]


The second possibility is:

    2. To use a
dynamic approach  where I specify the max deviation in masses (e.g. +/- 0.02 Da) and a dynamic deviation in retention time.
A dynamic retention time means that when I try to find features with similar retention time, given that they have similar mass, I will not use a fixed range, but I allow the deviation in retention times to depend on the retention time.


Let's see an example:

I have this feature from study 1: M280.093T32.408

My goal is to determine if there is a similar feature in study 2.

This is the algorithm I used to compare features and find out if they are the same:


    1.Given a feature
i of interest from study 1 with retention time t_i and mass m_i search for all the features in study 2 with mass m so that: |m_i-m| < 0.02. Obtain x(1)...x(n) features.

Example: M280.093T31.831 M280.089T144.406 M280.111T333.820 M280.108T454.073 M280.104T578.557 M280.103T890.490 M280.103T299.882

2.Determine the confidence intervals of  t_i.
To do that you need to get the median retention time correction (median(to)_s1) and the variability of the retention time correction (mad(to)_s1) for the retention time closest to t_i (Figure 1). These values can be obtained from XCMS (datasetname1@rt), looping across all individuals to get the median and the variability.
Then the upper confidence interval of t_i is:
U_t_i=t_i+|median(to)_s1+1.96*mad(to)_s1|
and the lower confidence interval is:
L_t_i=t_i-|median(to)_s1-1.96*mad(to)_s1|.

Example: median(to)_s1+1.96*mad(to)_s1 of ~32.408=0.942 and median(to)_s1-1.96*mad(to)_s1 of ~32.408=-0.573 then U_t_i=33.350 and L_t_i=31.835


3.Start with x(1) feature and  determine the confidence intervals of t_x(1) (retention time of x(1)).
To do that you need to get the median retention time correction (median(to)_s2) and the variability of the retention time correction (mad(to)_s2) for the retention time closest to t_x(1) (Figure 2). These values can be obtained from XCMS (datasetname2@rt), looping across all individuals to get the median and the variability.
Then the upper confidence interval of t_x(1) is:
U_t_x(1)=x(1)+|median(to)_s2+1.96*mad(to)_s2|
and the lower confidence interval is:
L_t_x(1)=x(1)-|median(to)_s2-1.96*mad(to)_s2|.

Example: median(to)_s2+1.96*mad(to)_s2 of ~31.831=1.912 and median(to)_s2-1.96*mad(to)_s2 of ~31.831=-1.061 then U_t_x(1)=33.743 and L_t_x(1)=30.770


4. Check if the confidence intervals of t_i and t_x(1) overlap.

Example: 31.835-33.350 ? 30.770-33.743 ? TRUE


5.Repeat steps 3 and 4 for x(2)...x(n).

Example: : TRUE FALSE FALSE FALSE FALSE FALSE FALSE[/list][/list]

Then I conclude that M280.093T32.408 from study 1 is the same as M280.093T31.831 from study 2.

What do you think? Is this approach making sense?

Best,

Andrea

Re: memory error during fillpeak step

Reply #11
Dear XCMS users,
I am facing same problem with my just two data files of 30 min run from Q-exactive after converting them to cdf file (each file goes upto 1.3 GB )
Initially working with my data of 30 samples. I faced problem of memory allocation over 40000MB, So I decided to run with just two samples. They work fine until peakFill() and got following error.
How I can process my data using 64bit PC with 4GB RAM, as with just 2 samples only I am getting memory error.
I there any way out?

Please help.



> xset2<-fillPeaks(xset1)
NS-Coat-161212-STG-1 Error: cannot allocate vector of size 840.9 Mb
In addition: Warning messages:
1: In mzR:::netCDFRawData(cdf) :
  Reached total allocation of 4004Mb: see help(memory.size)
2: In mzR:::netCDFRawData(cdf) :
  Reached total allocation of 4004Mb: see help(memory.size)
3: In mzR:::netCDFRawData(cdf) :
  Reached total allocation of 4004Mb: see help(memory.size)
4: In mzR:::netCDFRawData(cdf) :
  Reached total allocation of 4004Mb: see help(memory.size)







R Under development (unstable) (2012-12-01 r61188)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                         
[5] LC_TIME=English_United States.1252   

attached base packages:
[1] parallel  stats    graphics  grDevices utils    datasets  methods  base   

other attached packages:
[1] multtest_2.14.0    Biobase_2.19.1    BiocGenerics_0.5.1 xcms_1.34.0        mzR_1.5.2          Rcpp_0.10.1     

loaded via a namespace (and not attached):
[1] codetools_0.2-8  MASS_7.3-22      splines_2.16.0  stats4_2.16.0    survival_2.36-14
>

Re: memory error during fillpeak step

Reply #12
Quote
I there any way out?

Yes there is. Simply purchase sufficient RAM for your workstation.

Re: memory error during fillpeak step

Reply #13
[quote author="Ralf"]
Quote
I there any way out?

Yes there is. Simply purchase sufficient RAM for your workstation.[/quote]
I mean can I generate just peak report (without 'mullest' comparison) for those two chromatograms without doing fillPeaks()?
Thanks

Re: memory error during fillpeak step

Reply #14
I think I found an unfortunate behavior that is causing fillpeaks to need an excessive amount of memory.

The problem seems to be here:
Code: [Select]
   argList <- apply(ft,1,function(x) list(file=x["file"],id=as.numeric(x["id"]),
                                          params=list(method="chrom",
                                            gvals=gvals,
                                            prof=prof,
                                            dataCorrection=object@dataCorrection,
                                            polarity=object@polarity,
                                            rtcor=object@rt$corrected[[as.numeric(x["id"])]],
                                            peakrange=peakrange)
                                          ))
gvals is repeated for each sample as far as I can see.

This is the object I used for testing (we have a bigger set where it runs out of memory):
Code: [Select]
An "xcmsSet" object with 456 samples

Time range: 0.6-361 seconds (0-6 minutes)
Mass range: 60.0429-999.9048 m/z
Peaks: 1050355 (about 2303 per sample)
Peak Groups: 4232
Sample classes: data_converted_pos

Profile settings: method = bin
                  step = 0.005

Memory usage: 266 MB

So gvals is 4232 x 456  and takes 7.6 MB. When it is repeated for each sample that eats 3.5GB alone in this case and would increase to the power of 2 when you increase the number of samples.
If I understand correctly gvals is the same in each index of the list. So can the structure of argList be changed to only have one copy?


edit: not to mention that the memory requirement is multiplied by the number of cores you want to use for gapfilling  :shock:
Blog: stanstrup.github.io