Show Posts - hpbenton

This section allows you to view all Topics made by this member. Note that you can only see Topics made in areas you currently have access to.

Topics - hpbenton

XCMS - Cookbook / Heatmap of data

February 02, 2013, 12:02:03 PM

You want to look at a heatmap of your final data, maybe have a dendrogram as well.

Code: [Select]

library(xcms)
cdfpath <- system.file("cdf", package = "faahKO")
cdffiles <- list.files(cdfpath, recursive = TRUE,full=T)
xset <- xcmsSet(cdffiles)
xsg <- group(xset)
xsg <- retcor(xsg)
xsg <- group(xsg,bw=10)
xsg <- fillPeaks(xsg)

Load up the default toy dataset. Extract the intensities for each sample for each feature from the xcmsSet object. We'll take the 'into' group here but if you have used centWave for peak detection then you can also use 'intb' which is the background subtracted intensity.

Code: [Select]

library(gplots)
dat<-groupval(xsg, "medret", "into")
heatmap.2(dat, col=redgreen, scale="row", trace="none")

This code will produce a heatmap of all of your data matrix. The scaling will scale each row (feature) of your data into a Z-score. For more on z-scores look at
http://http://en.wikipedia.org/wiki/Standard_score

[attachment deleted by admin]

XCMS - FAQ / How do I pick peaks on a selected RT range?

September 21, 2012, 10:08:36 AM

So you have a load of data and you're only really interested in picking peaks on a small range of data. There are two ways of doing this, the new way, which is an upgrade since mzR was incorporated into xcms and the older way. We'll look at both.

1) The new way. Here we'll use the scanrange option to look at only a certain range of the spectra. To do so please make sure that your xcms version is updated to the latest code. (Tested with xcms v 1.35.2)
First, we need to convert the scan time into a scans. Each scan has a certain time. You know the time that you want but the instrument knows that the scan was 1 or 100. The actual time will be slight different after this conversion for the different files.

Code: [Select]

library(xcms)
library(faahKO)
cdfpath <- system.file("cdf", package = "faahKO")
cdffiles <- list.files(cdfpath, recursive = TRUE,full=T)
## lets convert into time into scans
RTrange<-c(2600,3000)
xr<-xcmsRaw(cdffiles[5]) ## loads up a single file into an xcmsRaw object
which(xr@scantime > 2600 & xr@scantime < 2605)[1]
## [1] 65
> which(xr@scantime > 3499 & xr@scantime < 3500)
## [1] 639
xr@scantime[c(65,639)] This will be our new scan range

xset <- xcmsSet(cdffiles, scanrange=c(65,636))

2)The old way. In this method all of the LC-MS spectrum is used for peak detection.
Afterwards the xcmsSet object is slimmed down to only the required range. This method can be quite useful if you want to have both the slimmed down object and the full RT range object in your R user space at the same time.

Code: [Select]

library(xcms)
library(faahKO)
cdfpath <- system.file("cdf", package = "faahKO")
cdffiles <- list.files(cdfpath, recursive = TRUE,full=T)
xset <- xcmsSet(cdffiles)
## Do peak detection on everything

## Now edit the object to the required range
RTrange<c(2600, 3500)
ix.rt<-which(xset@peaks[,"rt"] > RTrange[1] & xset@peaks[,"rt"] < RTrange[2])
xsetRT<-xset## copy the object into a new object to save the old one :)
xsetRT@peaks<-xset@peaks[ix.rt,] ## selects only the peaks that we want in the range and copy it into  the new object
xset
xsetRT

The results should look something like this. Notice the RT range is now changed:
xset :
An "xcmsSet" object with 12 samples

Time range: 2506.1-4147.7 seconds (41.8-69.1 minutes)
Mass range: 200.1-599.3338 m/z
Peaks: 4721 (about 393 per sample)
Peak Groups: 0
Sample classes: KO, WT

Profile settings: method = bin
step = 0.1

Memory usage: 0.713 MB

xsetRT:
An "xcmsSet" object with 12 samples

Time range: 2604.7-3499.8 seconds (43.4-58.3 minutes)
Mass range: 200.1-597.3132 m/z
Peaks: 2722 (about 227 per sample)
Peak Groups: 0
Sample classes: KO, WT

Profile settings: method = bin
step = 0.1

Memory usage: 0.515 MB

XCMS - Cookbook / Normalise with median fold change

June 01, 2012, 05:55:55 PM

Recently there was a paper published with suggested that median fold change was a suitable normalization method for urine and other samples that are effected by dilutions.

Code: [Select]

normalize.medFC <- function(mat) {
	# Perform median fold change normalisation
	#           X - data set [Variables & Samples]
	medSam <- apply(mat, 1, median)
	medSam[which(medSam==0)] <- 0.0001
	mat <- apply(mat, 2, function(mat, medSam){
		medFDiSmpl <- mat/medSam
		vec<-mat/median(medFDiSmpl)
		return(vec)
	}, medSam)
	return (mat)
}

On the faahKO dataset it would work something like the following:

Code: [Select]

cdfpath <- system.file("cdf", package = "faahKO")
cdffiles <- list.files(cdfpath, recursive = TRUE,full=T)
faahko <- xcmsSet(cdffiles)
faahko <- group(faahko)
faahko <- retcor(faahko)
faahko <- fillPeaks(faahko)

report <- cbind(groups(faahko), normalize.medFC(groupval(faahko, "medret", "into")) )
head(report)
write.csv(report, file="reportFile.csv")

(1) Veselkov, K. A.; Vingara, L. K.; Masson, P.; Robinette, S. L.; Want, E.; Li, J. V.; Barton, R. H.; Boursier-Neyret, C.; Walther, B.; Ebbels, T. M. D.; Pelczer, I.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2011, 83, 5864–5872.

XCMS - Cookbook / Changing or adding minFrac after processing

April 30, 2012, 08:25:49 PM

So you're finished with processing but you want to change minFrac. There must be an automated way to do this other than hand picking in excel!
You used the group.nearest method and would really like to get rid of some of the non reproducible groups. How can I do this before printing the diffreport.

Using the following function below and assuming that you still have an xcmsSet object saved or in the R console its easy. Example code follows using the faahko dataset.

Code: [Select]

post.minFrac<-function(object, minFrac=0.5){
	ix.minFrac<-sapply(1:length(unique(sampclass(object))), function(x, object, mf){
		meta<-groups(object)
		minFrac.idx<-numeric(length=nrow(meta))
		idx<-which(meta[,levels(sampclass(object))[x]] >= mf*length(which(levels(sampclass(object))[x] == sampclass(object)) ))
		minFrac.idx[idx]<-1
		return(minFrac.idx)
	}, object, minFrac)
	ix.minFrac<-as.logical(apply(ix.minFrac, 1, sum))
	ix<-which(ix.minFrac == TRUE)
	## return(ix)
        object@groupidx<-object@groupidx[ix]
        object@groups<-object@groups[ix,]
        return(object)
}

The faahKO toy dataset:

Code: [Select]

library(faahKO) ## These files do not have this problem to correct for but just for an 
cdfpath <- system.file("cdf", package = "faahKO")
cdffiles <- list.files(cdfpath, recursive = TRUE, full.names = TRUE)

xset<-xcmsSet(cdffiles)
gxset<-group(xset, method="nearest")
nrow(gxset@groups) == 1096 ## the number of features before minFrac

gxset<-post.minFrac(gxset)
nrow(gxset@groups) == 465 ## and after minFrac !

XCMS - FAQ / Time for scan x greater than scan y - ProteoWizard

November 11, 2011, 06:42:52 AM

You converted a file using proteoWizard to an mzXML, you load it into xcms and you get something like the following error :

Quote

Time for scan X greater than scan Y

This is the default converting action of protoWizard. You will need to load the the not graphical user interface (GUI) version of the converter, msconvert.exe. Using a line somthing like the following in the dos prompt:

Code: [Select]

msconvert D:DataMyFile.RAW --filter "sortByScanTime" --mzXML -o D:DataOutput

Currently, version 2.2, the "sortByScanTime" option is only avaiable in the command line version of the software and has not been ported over to the GUI yet.

Hint: If you're using MassLynx you can load up the run list and have it automatically run the converter on each file:

1) Open MassLynx and above the Sample Table select Samples ? Format ? Customise. Enable the following Fields to display
Process
Process Options

2) Choose Samples ? Format ? Save and save the new Sample List format.

3) Make sure that MassWolf/msconvert is in the MassLynx directory. Choose massWolf/msconvert from the Process drop down list. As massWolf/msconvert is a command-line tool it requires some parameters to make it function. In the below example the following line has been added to the 'Process Options' column;

massWolf add: --mzXML C:MassLynxDefault.proData1234.raw

msconvert add: D:DataMyFile.RAW --filter "sortByScanTime" --mzXML -o D:DataOutput

4) To process data that has already been acquired use the usual MassLynx Run (play) icon and select “Auto Process Samples”. When processing starts a command prompt will open when required and will output the mzXML file into the parent folder.

XCMS - FAQ / How do I perform multiple testing correction in xcms?

September 04, 2011, 07:43:33 PM

One of the great things about xcms is since it's in R this is very easy to do. Multiple testing correction gives you a corrected pvalue due to the fact that you have performed multiple univariate tests. However, you will see that while your false discovery rate decreases, you will also sacrifice a few real results. There are many papers about the problems associated with multiple testing correction and many new algorithms to make the situation better (see links below). But you asked the question, so here's the answer!

Code: [Select]

require(multtest)
library(xcms)
library(faahKO)
gxs<-group(faahko)
ret<-retcor(gxs, p="m", f="s")
gret<-group(ret)
fill<-fillPeaks(gret)
diff<-diffreport(fill, "KO", "WT")

First we need the diffreport, so I'll generate one using the default dataset. If you have already saved your datasets then you can read it back in using read.tsv
Now we'll see how the pvalue adjustment function works:

Code: [Select]

colnames(diff)
crpval<-mt.rawp2adjp(diff[,"pvalue"], proc="Bonf") ## I'll go with the default adjustment 
idx<-sort(crpval$index, index.return=TRUE)
AdjPval<-crpval$adjp[idx$ix,2]
diff<-cbind(diff[,1:4], AdjustedPval=AdjPval, diff[,5:ncol(diff)])

write.csv(diff, file="NewDiffreport.csv")

Here are some links to explain why correction is needed, what it is and problems associated with it.
http://en.wikipedia.org/wiki/Bonferroni_correction
http://rss.acs.unt.edu/Rdoc/library/mul ... 2adjp.html - This is the R page for the function we are using.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1112991/
http://www.google.co.uk/search?client=s ... 76&bih=877

XCMS - FAQ / How do I get data from the xcmsSet/xcmsRaw object?

August 24, 2011, 11:26:38 AM

The xcmsSet object is an S4 object. If you don't know what that means, its ok, it just a defined way of hold information.

Lets open our default example set the faahko dataset.

Code: [Select]

library(xcms)
library(faahKO)
faahko

Notice that calling the object will give you some information about it but not tell you everything. This is because the object is calling a method called 'show'. We'll test this in the next bit of code. To find where this info is coming from we need to find the data in the object. This data is held in slots within the object and can be different types of data.

Code: [Select]

show(faahko) ## same as just faahko
names(attributes(faahko)) ## these are the slots within the object
class(faahko@peaks)
class(faahko@rt)
head(faahko@peaks)

Again notice that the type/classes for peaks and rt are different. If you want to access the data you need to use the @ symbol to access the slot. However, if the programmer was good then many of these slots have direct access using specified methods. The programmer for xcms was very good and so we have just that.

Code: [Select]

head(groups(faahko))
head(peaks(faahko))
phenoData(faahko)

For the moment while I'm still editing this I'm going to drop this code here, eventually this will be cleaned up

[code]
require(faahko) || stop("Install faahkon")

faahko<-group(faahko) ## the data needs to be grouped first to find the matching values between the samples

head(groups(faahko)) ## gives the meta data for each variable

vals<-groupval(faahko, "medret", "into") ## returns the intensity values for each feature and each sample
dim(vals) ## features X samples

## ungrouped xcmsSet objects have the only the following slots filled:
head(faahko@peaks) ## just a peak list, sorted by sample
unique(faahko@peaks[,"sample"]) ## these number are indexes to ->
faahko@phenoData ## which is a class ->
class(faahko@phenoData)
## but method ->
sampclass(faahko) ## is a factor and is the better way to id against the peak list.

length(faahko@rt) ## one == raw rt, 2 == corrected rt
length(faahko@rt$raw) ## == number of samples
length(faahko@rt$raw[[1]]) == number of scans, element is the scan time for that run

[code]

More to come:
Explain how each slot relates to each other.
Show xcmsRaw slots
xcmsRaw -> xcmsSet via findPeaks method
and difference between @ and $

XCMS - FAQ / My Peaks are not picked!

August 19, 2011, 12:56:09 PM

I can see my peaks, but centWave won't detect them.
To check why centWave does not detect "your" peak, you have to zoom into the raw data, and obtain some intermediate results. As seen in the following plot, the feature at m/z 507.3 seems to have "holes", i.e. the LC Peak is not detected by the FTICR in individual scans. In it's first stage the centWave peak picker determines regions-of-interest, i.e. mass traces with a low (<ppm threshhold) difference between consecutive scans. Hence, with these "holes" this (otherwise nice) LC peak is cut into a few ROIs, which subsequently are eliminated by second phase of centWave, the peak shape detection via wavelets:

[attachment=2:xytw8wi0]180px-PlotRawROIs.png[/attachment:xytw8wi0]
ROIs detected by centWave

Code: [Select]

library(xcms)
data.file="B08-08318_c.mzXML"
xraw <- xcmsRaw(data.file)
massrange=c(507,508)
timerange=c(3200,3250)
dev=50*1e-6
minCentroids=3
prefilter=c(3,10)

Manually extract ROIs

Code: [Select]

scanrange=range(which(xraw@scantime>timerange[1]
                   & xraw@scantime<timerange[2]))
featlist <- xcms:::findmzROI(xraw,scanrange=scanrange,
                            dev=dev,
                            minCentroids=minCentroids,
                            prefilter=prefilter)
means <- sapply(featlist,function(x) c(x$mzmin, x$mzmax,
                                      xraw@scantime[x$scmin],
                                      xraw@scantime[x$scmax]))
par(cex=2)
plotRaw(xraw,mzrange=massrange,rtrange=timerange, log=FALSE)
rect(means[3,],means[1,],means[4,], means[2,])

This can be verified looking at the corresponding EIC. Beware that plotChrom() uses the resampled profile Matrix for the values, and hence profMethod and profStep have an influence on the shape:

[attachment=1:xytw8wi0]180px-PlotChromBinLin.png[/attachment:xytw8wi0]
Filtered (intlin) EIC

[attachment=0:xytw8wi0]180px-PlotChromBin.png[/attachment:xytw8wi0]
Unfiltered (bin) EIC
It might be that matchedFilter is a bit better here, because it is using the smoothed (unless you use profMethod="bin") profile matrix.

Code: [Select]

## Check EIC
##
profStep(xraw) <- 0.05
## With smoothing
profMethod(xraw) <- "intlin"
plotChrom(xraw, massrange=c(507.25, 507.35), timerange=c(3100,3350))
## Without smoothing
profMethod(xraw) <- "bin"
plotChrom(xraw, massrange=c(507.25, 507.35), timerange=c(3100,3350))

[attachment deleted by admin]

XCMS - FAQ / I think I found a bug, how do I report it?

August 19, 2011, 12:50:26 PM

Usually you will get some help on the xcms mailing list. Good bug reports will save both you and the fellow users (or developers) time.
The long answer can be found in the Posting Guide of the bioconductor project, and most points also apply to xcms. Simon Tatham's essay on "How to Report Bugs Effectively" is not restricted to a particular programming language, but a very recommended read in general.
The short answer is:
Check the growing xcms FAQ
Report the shortest possible steps to reproduce the problem, even better try to reproduce it with the public faahKO data as e.g. in the PcaMdaExample.
Show where the bug happened:

Code: [Select]

traceback()

Tell us what version of XCMS you are running, the bug might be already fixed, or you could be running the (sometimes) unstable developer version:

Code: [Select]

sessionInfo()

If you found a solution, don't hesitate to sign up here and put it in the FAQ ;-)

XCMS - FAQ / How good is fillPeaks ?

August 19, 2011, 12:47:44 PM

fillPeaks() will add intensities for peaks not observed in a certain sample, but in some/most of the other samples of a smple class. The question is:
How good does fillPeaks() estimate the intenisty for my data ?
One way to estimate the quality is to take an xcmsSet grouped without NA values (i.e. one class and minfrac=1), remove the peaks of one sample, and fill them back in.
This way you can create a scatterplot of found peaks vs. filled peaks, and also check for which intensities they occur:

[attachment=1:mvim2ak9]FaahKOscatterFilledFound.png[/attachment:mvim2ak9]
Original vs. filled-in intensities

[attachment=0:mvim2ak9]FaahKOintensityFilledFoundRatio.png[/attachment:mvim2ak9]
Intensity ratios vs. intensity

For our evaluation of Bruker microTOFq data, we found that 94% of the restored intensities are correct, 1% had a deviation of factor 1.1 or more, and 5% had a ratio of less than 0.9 compared to the original peaks.
[attachment=3:mvim2ak9]ScatterFilledFound.png[/attachment:mvim2ak9]Original vs. filled-in intensities
[attachment=2:mvim2ak9]IntensityFilledFoundRatio.png[/attachment:mvim2ak9]

This is the code used:

Code: [Select]

library(faahKO)
 
 ## Extract xcmsSet with KO only
 xss <- split(faahko, f=sampclass(faahko))
 xs <- xss[[2]]
 
 xsg <- group(xs, minfrac=1)
 
 ## remove extra peaks within groups
 peaks <- cbind(peaks(xsg),(1:nrow(peaks(xsg))))
 meds <- as.vector(groupval(xsg,"medret"))
 meds <- meds[which(!is.na(meds))]
 peaks <-peaks[meds,]
 colnames(peaks) <- c(colnames(peaks(xsg)),"index")
 
 ## cleanup groupidx list
 for (a in 1: length(groupidx(xsg))) {
     gxs=NULL
     for (b in 1:length(groupidx(xsg)[[a]])){
         if (any(meds == groupidx(xsg)[[a]][b]))
             gxs<-c(gxs,groupidx(xsg)[[a]][b])
     }
     groupidx(xsg)[[a]] <- gxs
 }
 
 ## change index in groupidx
 for (a in 1: length(groupidx(xsg))) {
     gxs=NULL
     for (b in 1:length(groupidx(xsg)[[a]])){
         groupidx(xsg)[[a]][b] <- which(peaks[,"index"]==groupidx(xsg)[[a]][b])
     }
 }
 
 ## Write "cleaned" peaklist
 peaks(xsg) <- peaks[,(1:(ncol(peaks)-1))]
 
 ## Create backup for later comparison
 xsbackup <- xsg
 
 ## remove peaks from last sample
 lasts <- max(peaks(xsg)[,"sample"])
 lsmin <- min(which(peaks(xsg)[,"sample"]==lasts))
 
 ## remove peaksIDs from groupidx
 for (a in 1: length(groupidx(xsg))){
     groupidx(xsg)[[a]] <- groupidx(xsg)[[a]][which(groupidx(xsg)[[a]]<lsmin)]
 }
 
 ## remove peaks from peaklist
 peaks(xsg) <- peaks(xsg)[1:(lsmin-1),]
 
 ##
 ## Preparations finished, now fillPeaks()
 ##
 xsgf <- fillPeaks(xsg)
 
 ## fillPeaks again ?!
 xsbackup2 <- fillPeaks(xsbackup)
 
 ## compare peaks in both xcmsSets groupwise and samplewise
 gxorig <- groupidx(xsbackup2) ## das fillpeaks-ergebnis des backups
 gxfill <- groupidx(xsgf) ## das fillpeaks-ergebnis des XS ohne sample lasts
 
 ## plot original and filled peaks
 mxo<-NA
 mxf<-NA
 ino<-NA
 inf<-NA
 for (a in 1: length(gxorig))
 {
     opeak <- gxorig[[a]][which(peaks(xsbackup2)[gxorig[[a]],"sample"]==length(sampnames(xs)))]
     fpeak <- gxfill[[a]][which(peaks(xsgf)[gxfill[[a]],"sample"]==length(sampnames(xs)))]
     mxo[a] <- peaks(xsbackup2)[opeak,"mz"]
     mxf[a] <- peaks(xsgf)[fpeak,"mz"]
     ino[a] <- peaks(xsbackup2)[opeak,"into"]
     inf[a] <- peaks(xsgf)[fpeak,"into"]
 }
 
 png(filename = "faahKOscatterFilledFound.png")
 plot(log(ino),log(inf))
 dev.off()
 png(filename = "faahKOintensityFilledFoundRatio.png")
 plot(ino,inf/ino)
 dev.off()

[attachment deleted by admin]

XCMS - FAQ / nPeaks larger than total sample number, how?

August 19, 2011, 12:42:20 PM

How is it possible to have a npeaks value higher than the total number of samples ?

A peak group simply contains all the peaks found with in an array defined by a m/z range and a retention time range.
Depending on how dense the peaks are, it is certainly possible that a peak group will contain multiple peaks from a given sample, and even more peaks than the total number of samples.
The retention time alignment algorithm takes this into account when selecting peaks to use as standards. Check out the xcms paper for more information.
Knowing the number of peaks included in a group can help provide some diagnostic information about the quality of that group. I find it useful to know.
old forum link
old forum link2
-Colin

XCMS - FAQ / What is the influence of BW in group upon retcor

August 19, 2011, 12:37:26 PM

What does BW do?
The bw parameter or bandwidth controls the size of the retention time window for the groups. Too large or too small and few 'well behaved' groups will be found. You need just the right bw, goldilocks. Most of the time the run type will have a good default bw (HPLC~=30, UPLC~=10-20). The code below demonstrates how changing the bw parameter affects the retention time alignment and the number of detected groups. As the script run watch the retention time plots change.

Code: [Select]

library(xcms)
library(faahKO)

RetList<-list()
meanRet<-matrix(0, nrow=length(faahko@rt$raw[[1]]), ncol=length(seqBW))
seqBW<-seq(from=1, to=50, by=1)
for(i in seqBW){
	gxs<-group(faahko, bw=i)
	ret<-retcor(gxs, f="s", p="d")
	
	foo<-unlist(ret@rt$corrected)
	dim(foo)<-c(length(ret@rt$corrected[[1]]),length(ret@rt$corrected))
	retMean<-rowMeans(foo)
	RetList[[i]]<-foo
	meanRet[,i]<-retMean
	plot(0, xlim=c(from=min(foo), to=max(foo)), ylim=c(min(retMean - foo), max(retMean - foo)), 
		type="n", ylab="Ret Dev", xlab="retention time")
	colRet<-rainbow(ncol(foo))
	for(j in 1:ncol(foo)){
		points(seq(from=min(foo), to=max(foo), length.out=1278), (retMean - foo[,j]), type="l", col=colRet[j])
	}
	cat(paste("BW: ", seqBW[i], " - Features:", nrow(groups(gxs)), "n", sep=""))
	rm(ret)
	gc()
}

Now we can see the mean difference between each bw. The following code is basically a repeat of the code above but using the mean retention time deviation between all samples.

Code: [Select]

 dim(meanRet)
## code to come

XCMS / mz sorting violation

August 19, 2011, 12:07:55 PM

I have noticed that a lot of people have been getting the mz sort violation error while processing their data. In an effort to know where this is coming from I've created a poll. If people who have had this problem could vote and let us know where and how they converted. Maybe it's a certain instrument/parameter/converter or totally random?

Paul

XCMS - FAQ / fillPeaks has zeros ?

August 19, 2011, 10:55:59 AM

fillPeaks() will add intensities for peaks not observed in a certain sample, but in some/most of the other samples of a smple class.
Some people have observed, that even after fillPeaks(), some values are still "0" or even "NaN" (NotANumber).
fillPeaks() is trying to (blindly) extract/integrate intensities from those raw files, where the peaks are "missing" after the peak picking step (matchedFilter or centWave).
The information, where to look, is taken from the other samples where the peak was observed, i.e. the rtmin/rtmax and mzmin/mzmax after group()ing. Two cases where this will fail are most prominent:

If the peak is near the beginning (or end) of the LC run, this time range could be "before" or "beyond" the runtime of the sample in question, especially if retcor() was used to correct/modify any RT shifts. In this case, fillPeaks() will report "NaN". There is nothing xcms could sensibly do otherwise, and this non-number should flag a warning that there was something fishy. You might workaround by removing the first / last seconds (depending on the gradient they usually don't contain important/reliable peaks anyway) using the scanrange=c(min,max) parameter to xcmsSet() and findPeaks().
If your m/z calibration is very good for most runs, the group will have a very small mzin/mzmax window. For centroided data it is then quite easy to miss the actual raw points by just a tiny bit, as seen in the graph. In that case, the sum of intensities will be zero.

[attachment=1:2z4nlwmv]300px-ZeroPeaks-26.png[/attachment:2z4nlwmv] raw data just above raw data fillPeaks looking "beyond" raw data: The red EIC ends, because the peak is located at the end of the run, and the retention time has been corrected previously.

[attachment=1:2z4nlwmv]300px-ZeroPeaks-26.png[/attachment:2z4nlwmv] fillPeaks looking "beyond" raw data: The red EIC ends, because the peak is located at the end of the run, and the retention time has been corrected previously.

This is the code used:

Code: [Select]

library(xcms)
load("Anal10909_SP.Rdata")
xs <- fill.ret.Anal10909_SP

# Shortcut: aliases for groups
gr <- groups(xs)
gv <- groupval(xs, value="into")

# Shortcut: get offending group(s):
nanidx <- which(is.na(gv), arr.ind=TRUE)
zeridx <- which(gv==0, arr.ind=TRUE)

# Plot offending group(s):
e <- getEIC(xs, groupidx=nanidx[1], )
# Bad NaN guy in red:
col=rep("grey", length(sampnames(xs)))
col[nanidx[2]] <- "red"
plot(e, xs, col=col, sleep=1)

Code: [Select]

## PlotRaws
xr <- xcmsRaw(filepaths(xs)[nanidx[2]])
rtcor <- xs@rt$corrected
xr@scantime <- rtcor[[nanidx[2]]]

pdf(file="zeroPeaks.pdf")
par(mfrow=c(2,1))

for (i in 1:length(zeridx[,1])) {
    # EICs
    e <- getEIC(xs, groupidx=zeridx[i,1], )
    plot(e, xs, col=as.character(col[zeridx[i,1],]), sleep=1)

    ## PlotRaws
    xr <- xcmsRaw(filepaths(xs)[zeridx[i,2]])
    currentGroup <- gr[zeridx[i,1], ]

    ## Disable warnings for empty regions

    suppressWarnings(plotRaw(xr,
                             massrange=currentGroup[c("mzmin","mzmax")]+c(-0.1,+0.1),
                             timerange=currentGroup[c("rtmin","rtmax")]+c(-5,+5),
                             log=TRUE,
                             title=paste("Group/Samp/mz/RT: ", paste(zeridx[i,], collapse=" "), paste(round(currentGroup[c("mzmin","mzmax","rtmin","rtmax")], digits=2), collapse=" "))))

    rect(currentGroup["rtmin"],
         currentGroup["mzmin"],
         currentGroup["rtmax"],
         currentGroup["mzmax"], border="grey")
}
dev.off()

[attachment deleted by admin]

XCMS - FAQ / Reduce file size

August 19, 2011, 10:30:50 AM

This tip was brought to you by Tony Larson.

I would add that apart from going to Linux to use yet more memory (which you must have physically available!), you can also do other things to reduce the size of your xcmsSet and thus save memory and increase processing speed.

The easiest solution is to reduce the size of your .cdf, .mzXML or .mzData input files. You can do this using the tools available in OpenMS, which can be downloaded and easily installed under 32 bit Windows (Linux installation also possible but tricky). Once installed, openMS allows you to trim source .cdf, .mzXML or .mzData files to a specific start and stop time, reduce the m/z scan range, or even re-sample the data to fewer scans/unit time. These operations can all be run as DOS batch files on your dataset (but be aware that .cdf files cannot be read by openms in a 64bit OS (Windows or Linux) environment, so if you need to process .cdf files you must install openms in a 32 bit environment, which is the standard Windows OS). In xcms, there is very limited ability to filter data from a source file before processing - as far as I know, this is limited to specifying a scanrange if you use the centWave method in xcmsSet().

Of course, all these parameters should ideally be optimized when you first acquire the data from your instrument - but often the benefit of hindsight reveals that some post-acquisition data reduction would be beneficial.

Finally, be aware that limiting the input file size will buy you more input files/ output vector in R, and hence allow you to process more input files. However, you will eventually come up to the ~3Gb hard-limit in Windows if you process a very large number of input files, especially when you come to use xcms functions like fillPeaks() where the entire data matrix is loaded into memory. Then Linux is the only option, and this should be running on a system with lots more RAM than a typical Windows system.