Dear all,
I'm aligning a large number of samples (~9,000). Everything works fine until I start the peak-filling step.
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] xcms_1.30.3
......
> xset3
An "xcmsSet" object with 8954 samples
Time range: 2.1-1319.6 seconds (0-22 minutes)
Mass range: 49.9841-1200.0173 m/z
Peaks: 101230521 (about 11306 per sample)
Peak Groups: 18501
Sample classes: inj2, inj3
Profile settings: method = bin
step = 0.1
Memory usage: 9040 MB
>xset4 <- fillPeaks.chrom(xset3)
Error in rbind(deparse.level, ...) :
resulting vector exceeds vector length limit in 'AnswerType'
> traceback()
4: rbind(peakmat, matrix(nrow = sum(is.na(gvals)), ncol = ncol(peakmat)))
3: .local(object, ...)
2: fillPeaks.chrom(xset3)
1: fillPeaks.chrom(xset3)
Note that fillPeaks.chrom worked fine when I lowered the number of peak groups (by lowering bw or mzwidth):
> xset3
An "xcmsSet" object with 8954 samples
Time range: 2.1-1319.6 seconds (0-22 minutes)
Mass range: 49.9841-1200.0173 m/z
Peaks: 89205782 (about 9963 per sample)
Peak Groups: 13674
Sample classes: inj2, inj3
Profile settings: method = bin
step = 0.1
Memory usage: 8000 MB
> xset4 <- fillPeaks.chrom(xset3)
021228_TwinGene_000101 021228_TwinGene_000102 etc....
I have plenty of RAM available, since I'm using a node with 72GB of RAM and I even tried a node with 512GB of RAM. This process should use max 30GB of RAM.
I wonder if there is any memory limitation in the fillPeaks.chrom function.
Thanks in advance!
Can you run
options(error=recover)
xset4 <- fillPeaks.chrom(xset3)
then hit 3 (to enter the local environment of the fillPeaks function)
and enter
sum(is.na(gvals))
?
I would like to see if sum(is.na(gvals)) is greater than .Machine$integer.max (2147483647)
Ralf
Hi Ralf,
> xset4 <- fillPeaks.chrom(xset3)
Error in rbind(deparse.level, ...) :
resulting vector exceeds vector length limit in 'AnswerType'
Enter a frame number, or 0 to exit
1: fillPeaks.chrom(xset3)
2: fillPeaks.chrom(xset3)
3: .local(object, ...)
4: rbind(peakmat, matrix(nrow = sum(is.na(gvals)), ncol = ncol(peakmat)))
Selection: 3
Called from: .local(object, ...)
Browse[1]> sum(is.na(gvals))
[1] 133210130
It looks to be lower. Other suggestions?
Andrea
It looks like the resulting matrix is simply too big for R (at the moment).
R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
> ncol=12
> peakmat <- matrix(0, nrow=101230521, ncol=ncol)
> peakmat2 <- rbind(peakmat, matrix(nrow =133210130, ncol =ncol))
Error in rbind(deparse.level, ...) :
resulting vector exceeds vector length limit in 'AnswerType'
http://svn.r-project.org/R/trunk/src/main/bind.c
#ifndef LONG_VECTOR_SUPPORT
if (data->ans_length < 0)
errorcall(call, _("resulting vector exceeds vector length limit in '%s'"), "AnswerType");
#endif
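For the numbers in this thread, the overflow is just arithmetic (ncol = 12 as in the transcript above): the combined matrix has more elements than a 32-bit index can address, even though the number of new rows alone is below the limit.

```r
## Why rbind() fails here: in pre-3.0 R without long-vector support, the
## *total* element count of the result must fit in a 32-bit integer, even
## though sum(is.na(gvals)) alone is below .Machine$integer.max.
existing_peaks <- 101230521           # rows already in peakmat
missing_peaks  <- 133210130           # sum(is.na(gvals)) from the browser
ncols          <- 12                  # columns of the peak matrix, as above

total_elements <- (existing_peaks + missing_peaks) * ncols
total_elements                          # 2813287812
total_elements > .Machine$integer.max   # TRUE -> "vector length limit" error
```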
Looks like the R team is working on it and long vector support is coming in 2.16
http://permalink.gmane.org/gmane.comp.lang.r.datatable/1125
http://stat.ethz.ch/R-manual/R-devel/doc/html/NEWS.html
Ralf
Thanks, great!
the example below works for me with today's Rdev
R Under development (unstable) (2012-08-14 r60264) -- "Unsuffered Consequences"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
> ncol=12
> peakmat <- matrix(0, nrow=101230521, ncol=ncol)
> peakmat2 <- rbind(peakmat, matrix(nrow =133210130, ncol =ncol))
> print(object.size(peakmat2),unit="Gb")
21 Gb
Dear Ralf,
Now I have more samples and even with the new R release, the matrix seems to be too large:
R Under development (unstable) (2012-09-19 r60762) -- "Unsuffered Consequences"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
> xset3
An "xcmsSet" object with 12634 samples
Time range: 2.2-1319.6 seconds (0-22 minutes)
Mass range: 49.9841-1200.0173 m/z
Peaks: 118463468 (about 9377 per sample)
Peak Groups: 21258
Profile settings: method = bin
step = 0.1
Memory usage: 10700 MB
>options(error=recover)
>xset4 <- fillPeaks.chrom(xset3)
Error in rbind(peakmat, matrix(nrow = sum(is.na(gvals)), ncol = ncol(peakmat))) :
long vectors not supported yet: bind.c:1524
Browse[1]> sum(is.na(gvals))
[1] 230456952
>peakmat <- matrix(0, nrow=118463468, ncol=ncol)
> peakmat2 <- rbind(peakmat, matrix(nrow =230456952, ncol =ncol))
Error in rbind(peakmat, matrix(nrow = 230456952, ncol = ncol)) :
long vectors not supported yet: bind.c:1524
Is there a way to split the grouped data into two sub-datasets (half the samples in one and half in the other) and then run the fillPeaks step on each dataset? This shouldn't change the feature names, and it should be straightforward to merge them back together.
Best,
That is a lot of samples!!! ~200 days of run time.
You can split your xcmsSet: see ?split.xcmsSet
But the peak group info is discarded, so it won't be trivial to put it back together again after re-grouping and filling. Is there no logical way to divide this data for individual analysis?
Yes, indeed there are a lot of samples. However, I guess that in the future it will be more common to see such large metabolomics efforts.
I can align each study separately (there are two studies involved) and then find the common features with metaXCMS. The problem here is that metaXCMS needs the diffreport output, and I don't need to do any association analysis (or I want to do it with other software). I just want to find the common features between the two separately aligned studies. Is there a way to do that with metaXCMS?
I actually tried that yesterday, but it turns out to be more tricky than I thought.
You need some adaptation in findPeaks, or a script which simply redoes that step.
I will try again over the weekend and report back!
Carsten
Hi all,
the problem with the split.xcmsSet function is that I lose both the retention time correction and the groups. Basically it is the same as aligning the two studies separately.
I directly aligned the two studies separately using the same XCMS parameters.
I then need to find the common features in both studies. There are two options to do that.
The first is a fixed approach, where I specify the max deviation in masses and retention times between the two studies (I think this is similar to what metaXCMS does). How it works:
a. Find all the masses in study 1 that are equal +/- 0.02 Dalton in study 2.
b. For each common mass, find the retention times that are equal +/- x seconds in study 2.
However, we know that the retention time correction is a function of the retention time and not uniform for all the retention times.
This is for example the median retention time correction for study 1. The dashed lines are 1.96*median absolute deviation.
[attachment: a1.png (Figure 1)]
This is the same graph for study 2.
[attachment: a2.png (Figure 2)]
The second possibility is a dynamic approach, where I specify the max deviation in masses (e.g. +/- 0.02 Da) and a dynamic deviation in retention time.
A dynamic retention time means that when I try to find features with similar retention time, given that they have similar mass, I will not use a fixed range, but allow the deviation in retention times to depend on the retention time itself.
Let's see an example:
I have this feature from study 1: M280.093T32.408
My goal is to determine if there is a similar feature in study 2.
This is the algorithm I used to compare features and find out if they are the same:
1. For the feature i of interest from study 1, with retention time t_i and mass m_i, search for all the features in study 2 with mass m so that |m_i - m| < 0.02. Obtain features x(1)...x(n).
Example: M280.093T31.831 M280.089T144.406 M280.111T333.820 M280.108T454.073 M280.104T578.557 M280.103T890.490 M280.103T299.882
2. Determine the confidence intervals of t_i.
To do that you need to get the median retention time correction (median(to)_s1) and the variability of the retention time correction (mad(to)_s1) for the retention time closest to t_i (Figure 1). These values can be obtained from XCMS (datasetname1@rt), looping across all individuals to get the median and the variability.
Then the upper confidence limit of t_i is U_t_i = t_i + |median(to)_s1 + 1.96*mad(to)_s1| and the lower confidence limit is L_t_i = t_i - |median(to)_s1 - 1.96*mad(to)_s1|.
Example: median(to)_s1 + 1.96*mad(to)_s1 at ~32.408 is 0.942 and median(to)_s1 - 1.96*mad(to)_s1 at ~32.408 is -0.573, so U_t_i = 33.350 and L_t_i = 31.835
3. Start with feature x(1) and determine the confidence intervals of t_x(1) (the retention time of x(1)).
To do that you need to get the median retention time correction (median(to)_s2) and the variability of the retention time correction (mad(to)_s2) for the retention time closest to t_x(1) (Figure 2). These values can be obtained from XCMS (datasetname2@rt), looping across all individuals to get the median and the variability.
Then the upper confidence limit of t_x(1) is U_t_x(1) = t_x(1) + |median(to)_s2 + 1.96*mad(to)_s2| and the lower confidence limit is L_t_x(1) = t_x(1) - |median(to)_s2 - 1.96*mad(to)_s2|.
Example: median(to)_s2 + 1.96*mad(to)_s2 at ~31.831 is 1.912 and median(to)_s2 - 1.96*mad(to)_s2 at ~31.831 is -1.061, so U_t_x(1) = 33.743 and L_t_x(1) = 30.770
4. Check if the confidence intervals of t_i and t_x(1) overlap.
Example: [31.835, 33.350] vs [30.770, 33.743] -> TRUE
5. Repeat steps 3 and 4 for x(2)...x(n).
Example: TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Then I conclude that M280.093T32.408 from study 1 is the same as M280.093T31.831 from study 2.
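The steps above can be sketched in a few lines of R. This is a minimal illustration, not xcms code: the med/mad values are back-computed from the bounds quoted in the worked example, and in practice they would come from the retention-time correction (datasetname@rt), looping across samples.

```r
## Step 1: mass window between a study-1 feature and study-2 candidates.
mass_match <- function(m_i, m, tol = 0.02) abs(m_i - m) < tol

## Steps 2-3: confidence interval around a retention time, given the median
## and MAD of the retention-time correction at that rt.
rt_ci <- function(t, med, mad) {
  c(lower = t - abs(med - 1.96 * mad),
    upper = t + abs(med + 1.96 * mad))
}

## Step 4: do two confidence intervals overlap?
rt_overlap <- function(a, b)
  a[["lower"]] <= b[["upper"]] && b[["lower"]] <= a[["upper"]]

## Worked example: M280.093T32.408 (study 1) vs M280.093T31.831 (study 2).
ci_s1 <- rt_ci(32.408, med = 0.1845, mad = 0.3865)  # ~ [31.835, 33.350]
ci_s2 <- rt_ci(31.831, med = 0.4255, mad = 0.7584)  # ~ [30.770, 33.743]
mass_match(280.093, 280.093)   # TRUE
rt_overlap(ci_s1, ci_s2)       # TRUE -> same feature
```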
What do you think? Is this approach making sense?
Best,
Andrea
Dear XCMS users,
I am facing the same problem with just two data files of a 30 min run from a Q-Exactive after converting them to CDF files (each file goes up to 1.3 GB).
Initially I was working with my data of 30 samples and faced a memory-allocation problem over 40000 MB, so I decided to run with just two samples. They work fine until fillPeaks() and I got the following error.
How can I process my data using a 64-bit PC with 4 GB RAM, when I get a memory error with just 2 samples?
Is there any way out?
Please help.
> xset2<-fillPeaks(xset1)
NS-Coat-161212-STG-1 Error: cannot allocate vector of size 840.9 Mb
In addition: Warning messages:
1: In mzR:::netCDFRawData(cdf) :
Reached total allocation of 4004Mb: see help(memory.size)
2: In mzR:::netCDFRawData(cdf) :
Reached total allocation of 4004Mb: see help(memory.size)
3: In mzR:::netCDFRawData(cdf) :
Reached total allocation of 4004Mb: see help(memory.size)
4: In mzR:::netCDFRawData(cdf) :
Reached total allocation of 4004Mb: see help(memory.size)
R Under development (unstable) (2012-12-01 r61188)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] multtest_2.14.0 Biobase_2.19.1 BiocGenerics_0.5.1 xcms_1.34.0 mzR_1.5.2 Rcpp_0.10.1
loaded via a namespace (and not attached):
[1] codetools_0.2-8 MASS_7.3-22 splines_2.16.0 stats4_2.16.0 survival_2.36-14
>
Yes there is. Simply purchase sufficient RAM for your workstation.
I mean, can I generate just a peak report (without the 'multtest' comparison) for those two chromatograms, without doing fillPeaks()?
Thanks
I think I found an unfortunate behavior that causes fillPeaks to need an excessive amount of memory.
The problem seems to be here:
argList <- apply(ft,1,function(x) list(file=x["file"],id=as.numeric(x["id"]),
params=list(method="chrom",
gvals=gvals,
prof=prof,
dataCorrection=object@dataCorrection,
polarity=object@polarity,
rtcor=object@rt$corrected[[as.numeric(x["id"])]],
peakrange=peakrange)
))
gvals is repeated for each sample as far as I can see.
This is the object I used for testing (we have a bigger set where it runs out of memory):
An "xcmsSet" object with 456 samples
Time range: 0.6-361 seconds (0-6 minutes)
Mass range: 60.0429-999.9048 m/z
Peaks: 1050355 (about 2303 per sample)
Peak Groups: 4232
Sample classes: data_converted_pos
Profile settings: method = bin
step = 0.005
Memory usage: 266 MB
So gvals is 4232 x 456 and takes 7.6 MB. When it is repeated for each sample, that eats 3.5 GB alone in this case, and the cost grows quadratically as you increase the number of samples.
If I understand correctly gvals is the same in each index of the list. So can the structure of argList be changed to only have one copy?
edit: not to mention that the memory requirement is multiplied by the number of cores you want to use for gapfilling :shock:
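The duplication is easy to see with a toy serialization experiment (sizes are illustrative stand-ins, not the real xcms objects): when every argList element carries its own gvals, the payload shipped to the slaves grows as n_samples * size(gvals), and since gvals has one column per sample the cost is quadratic in the sample count.

```r
## Toy stand-in for gvals: 1000 peak groups x 50 samples.
gvals <- matrix(rnorm(1000 * 50), nrow = 1000)

## One task's worth of arguments vs. a full argList with gvals repeated.
one  <- serialize(list(params = list(gvals = gvals)), NULL)
many <- serialize(lapply(1:50, function(i) list(params = list(gvals = gvals))),
                  NULL)

## gvals is written out once per element, so the payload scales with samples.
length(many) / length(one)   # ~50
```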
bump?
I also find that fillPeaks now seems to "lose" the intb column. I couldn't find the cause though.
fillPeaks seems to lose all values except mz, mzmin, mzmax, rt, rtmin, rtmax, into, maxo and sample.
Is this somehow related to R 3.0?
any solutions?
br
Gunnar
Hi,
I think that problem was fixed with
CHANGES IN VERSION 1.37.1
--------------------------
BUG FIXES
o fixed fillPeaks, which 1) dropped non-standard columns
and 2) failed if nothing to do, based on patches by Tony Larson.
Or is it still present there ?
Yours,
Steffen
I just wanted to refresh this thread with my own memory problems:
Windows 7 machine, 64 GB RAM
R 3.0.2
xcms 1.38.0
> xset
An "xcmsSet" object with 1610 samples
Time range: 1.7-1205.4 seconds (0-20.1 minutes)
Mass range: 55.0152-1199.8425 m/z
Peaks: 7274098 (about 4518 per sample)
Peak Groups: 6469
Memory usage: 1360 MB
xset <- fillPeaks.chrom(xset, nSlaves=2)
Error: cannot allocate vector of size 39.7 Mb
Windows task manager has the memory full upon failure, which gc() cleans up.
Hi,
how far do you get with nSlaves=1 ? The parallel fillPeaks currently
passes the xcmsSet down to all slaves, and the slaves operate on a subset
of them. The more clever way would be to pass trimmed xcmsSets to the slaves,
so the memory requirement is not multiplied by the number of slaves.
Steffen
Though this is true, the biggest problem is that the data is repeated for each sample...
nSlaves=1 still fails. I don't know how far it gets, but the memory climbs pretty quickly until I receive the same error message as before.
Hi,
It would be great if someone could test xcms prior to 1.35.4, e.g. 1.34.0 from
http://bioconductor.org/packages/2.11/b ... /xcms.html (http://bioconductor.org/packages/2.11/bioc/html/xcms.html)
That would be a way to rule out that my changes for parallel fillPeaks()
cause the problem. If 1.34.0 is fine, I'd need to think about an optimisation
for the parallel case.
Yours,
Steffen
Steffen et all,
I just started testing the fillPeaks step using R v 2.15.2, xcms v1.34.0. I haven't completed the process yet, but I am 99.9% certain that this version will complete the fillPeaks step on the dataset I began earlier. I am ~10% through the data files and the memory usage by R has hardly changed over that time.
I will update this when I know for certain whether it successfully finished.
Corey
It did succeed - it seems like the newer versions of fillPeaks are rather memory inefficient.
I have a suggestion for a quick fix: using an environment to pass gvals. This way it is not repeated once per sample, and the code change is only minimal.
In my test with 600 samples this used 140 MB instead of 10.4 GB on argList.
gvals_env <- new.env(parent=baseenv())
assign("gvals", gvals, envir = gvals_env)
argList <- apply(ft,1,function(x) {
## Add only those samples which actually have NA in them
if (!any(is.na(gvals[,as.numeric(x["id"])]))) {
## nothing to do.
list()
} else {
list(file=x["file"],id=as.numeric(x["id"]),
params=list(method="chrom",
gvals=gvals_env,
prof=prof,
dataCorrection=object@dataCorrection,
polarity=object@polarity,
rtcor=object@rt$corrected[[as.numeric(x["id"])]],
peakrange=peakrange))
}
})
fillPeaksChromPar would then need to do:
gvals <- params$gvals$gvals
instead of
gvals <- params$gvals
I don't know if there is a way around this. Is this used by any other functions? If not it seems like a fast fix, though my knowledge of environments is very limited, so I don't know if this has any unforeseen consequences.
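A minimal demonstration of the reference semantics the patch relies on (toy construction, not the real xcms code path): every list element holds the same environment handle, so gvals exists only once instead of once per sample.

```r
## Environments are passed by reference in R, unlike matrices.
gvals_env <- new.env(parent = baseenv())
assign("gvals", matrix(0, nrow = 4232, ncol = 456), envir = gvals_env)

## Each argList element now carries only a handle to gvals_env.
args <- lapply(1:456, function(i) list(params = list(gvals = gvals_env)))

## Retrieval inside the worker, as in the patched fillPeaksChromPar:
g <- args[[1]]$params$gvals$gvals
identical(g, get("gvals", envir = gvals_env))   # TRUE: same object, no copy
```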
Thanks Jan, this sounds impressive!
We now have also created the github-bioc bridge
at https://github.com/sneumann/xcms (https://github.com/sneumann/xcms)
could you send this patch as a pull request ?
Thanks in advance,
yours,
Steffen
Done. I also managed to compile xcms and test the function directly. It appears to work as intended :)