Load up the default toy dataset. Extract the intensities for each sample for each feature from the xcmsSet object. We'll take the 'into' group here but if you have used centWave for peak detection then you can also use 'intb' which is the background subtracted intensity.
This code will produce a heatmap of all of your data matrix. The scaling will scale each row (feature) of your data into a Z-score. For more on z-scores look at http://http://en.wikipedia.org/wiki/Standard_score
So you have a load of data and you're only really interested in picking peaks on a small range of data. There are two ways of doing this, the new way, which is an upgrade since mzR was incorporated into xcms and the older way. We'll look at both.
1) The new way. Here we'll use the scanrange option to look at only a certain range of the spectra. To do so please make sure that your xcms version is updated to the latest code. (Tested with xcms v 1.35.2) First, we need to convert the scan time into a scans. Each scan has a certain time. You know the time that you want but the instrument knows that the scan was 1 or 100. The actual time will be slight different after this conversion for the different files.
library(xcms) library(faahKO) cdfpath <- system.file("cdf", package = "faahKO") cdffiles <- list.files(cdfpath, recursive = TRUE,full=T) ## lets convert into time into scans RTrange<-c(2600,3000) xr<-xcmsRaw(cdffiles[5]) ## loads up a single file into an xcmsRaw object which(xr@scantime > 2600 & xr@scantime < 2605)[1] ## [1] 65 > which(xr@scantime > 3499 & xr@scantime < 3500) ## [1] 639 xr@scantime[c(65,639)] This will be our new scan range
xset <- xcmsSet(cdffiles, scanrange=c(65,636))
2)The old way. In this method all of the LC-MS spectrum is used for peak detection. Afterwards the xcmsSet object is slimmed down to only the required range. This method can be quite useful if you want to have both the slimmed down object and the full RT range object in your R user space at the same time.
## Now edit the object to the required range RTrange<c(2600, 3500) ix.rt<-which(xset@peaks[,"rt"] > RTrange[1] & xset@peaks[,"rt"] < RTrange[2]) xsetRT<-xset## copy the object into a new object to save the old one :) xsetRT@peaks<-xset@peaks[ix.rt,] ## selects only the peaks that we want in the range and copy it into the new object xset xsetRT
The results should look something like this. Notice the RT range is now changed: xset : An "xcmsSet" object with 12 samples
Time range: 2506.1-4147.7 seconds (41.8-69.1 minutes) Mass range: 200.1-599.3338 m/z Peaks: 4721 (about 393 per sample) Peak Groups: 0 Sample classes: KO, WT
Profile settings: method = bin step = 0.1
Memory usage: 0.713 MB
xsetRT: An "xcmsSet" object with 12 samples
Time range: 2604.7-3499.8 seconds (43.4-58.3 minutes) Mass range: 200.1-597.3132 m/z Peaks: 2722 (about 227 per sample) Peak Groups: 0 Sample classes: KO, WT
Recently there was a paper published with suggested that median fold change was a suitable normalization method for urine and other samples that are effected by dilutions.
(1) Veselkov, K. A.; Vingara, L. K.; Masson, P.; Robinette, S. L.; Want, E.; Li, J. V.; Barton, R. H.; Boursier-Neyret, C.; Walther, B.; Ebbels, T. M. D.; Pelczer, I.; Holmes, E.; Lindon, J. C.; Nicholson, J. K. Anal. Chem. 2011, 83, 5864–5872.
So you're finished with processing but you want to change minFrac. There must be an automated way to do this other than hand picking in excel! You used the group.nearest method and would really like to get rid of some of the non reproducible groups. How can I do this before printing the diffreport.
Using the following function below and assuming that you still have an xcmsSet object saved or in the R console its easy. Example code follows using the faahko dataset.
library(faahKO) ## These files do not have this problem to correct for but just for an cdfpath <- system.file("cdf", package = "faahKO") cdffiles <- list.files(cdfpath, recursive = TRUE, full.names = TRUE)
xset<-xcmsSet(cdffiles) gxset<-group(xset, method="nearest") nrow(gxset@groups) == 1096 ## the number of features before minFrac
gxset<-post.minFrac(gxset) nrow(gxset@groups) == 465 ## and after minFrac !
You converted a file using proteoWizard to an mzXML, you load it into xcms and you get something like the following error :
Quote
Time for scan X greater than scan Y
This is the default converting action of protoWizard. You will need to load the the not graphical user interface (GUI) version of the converter, msconvert.exe. Using a line somthing like the following in the dos prompt:
Currently, version 2.2, the "sortByScanTime" option is only avaiable in the command line version of the software and has not been ported over to the GUI yet.
Hint: If you're using MassLynx you can load up the run list and have it automatically run the converter on each file:
1) Open MassLynx and above the Sample Table select Samples ? Format ? Customise. Enable the following Fields to display Process Process Options
2) Choose Samples ? Format ? Save and save the new Sample List format.
3) Make sure that MassWolf/msconvert is in the MassLynx directory. Choose massWolf/msconvert from the Process drop down list. As massWolf/msconvert is a command-line tool it requires some parameters to make it function. In the below example the following line has been added to the 'Process Options' column;
4) To process data that has already been acquired use the usual MassLynx Run (play) icon and select “Auto Process Samples”. When processing starts a command prompt will open when required and will output the mzXML file into the parent folder.
One of the great things about xcms is since it's in R this is very easy to do. Multiple testing correction gives you a corrected pvalue due to the fact that you have performed multiple univariate tests. However, you will see that while your false discovery rate decreases, you will also sacrifice a few real results. There are many papers about the problems associated with multiple testing correction and many new algorithms to make the situation better (see links below). But you asked the question, so here's the answer!
First we need the diffreport, so I'll generate one using the default dataset. If you have already saved your datasets then you can read it back in using read.tsv Now we'll see how the pvalue adjustment function works:
Notice that calling the object will give you some information about it but not tell you everything. This is because the object is calling a method called 'show'. We'll test this in the next bit of code. To find where this info is coming from we need to find the data in the object. This data is held in slots within the object and can be different types of data.
show(faahko) ## same as just faahko names(attributes(faahko)) ## these are the slots within the object class(faahko@peaks) class(faahko@rt) head(faahko@peaks)
Again notice that the type/classes for peaks and rt are different. If you want to access the data you need to use the @ symbol to access the slot. However, if the programmer was good then many of these slots have direct access using specified methods. The programmer for xcms was very good and so we have just that.
For the moment while I'm still editing this I'm going to drop this code here, eventually this will be cleaned up [code] require(faahko) || stop("Install faahkon")
faahko<-group(faahko) ## the data needs to be grouped first to find the matching values between the samples
head(groups(faahko)) ## gives the meta data for each variable
vals<-groupval(faahko, "medret", "into") ## returns the intensity values for each feature and each sample dim(vals) ## features X samples
## ungrouped xcmsSet objects have the only the following slots filled: head(faahko@peaks) ## just a peak list, sorted by sample unique(faahko@peaks[,"sample"]) ## these number are indexes to -> faahko@phenoData ## which is a class -> class(faahko@phenoData) ## but method -> sampclass(faahko) ## is a factor and is the better way to id against the peak list.
length(faahko@rt) ## one == raw rt, 2 == corrected rt length(faahko@rt$raw) ## == number of samples length(faahko@rt$raw[[1]]) == number of scans, element is the scan time for that run
[code]
More to come: Explain how each slot relates to each other. Show xcmsRaw slots xcmsRaw -> xcmsSet via findPeaks method and difference between @ and $
I can see my peaks, but centWave won't detect them. To check why centWave does not detect "your" peak, you have to zoom into the raw data, and obtain some intermediate results. As seen in the following plot, the feature at m/z 507.3 seems to have "holes", i.e. the LC Peak is not detected by the FTICR in individual scans. In it's first stage the centWave peak picker determines regions-of-interest, i.e. mass traces with a low (<ppm threshhold) difference between consecutive scans. Hence, with these "holes" this (otherwise nice) LC peak is cut into a few ROIs, which subsequently are eliminated by second phase of centWave, the peak shape detection via wavelets:
[attachment=2:xytw8wi0]180px-PlotRawROIs.png[/attachment:xytw8wi0] ROIs detected by centWave
This can be verified looking at the corresponding EIC. Beware that plotChrom() uses the resampled profile Matrix for the values, and hence profMethod and profStep have an influence on the shape:
[attachment=0:xytw8wi0]180px-PlotChromBin.png[/attachment:xytw8wi0] Unfiltered (bin) EIC It might be that matchedFilter is a bit better here, because it is using the smoothed (unless you use profMethod="bin") profile matrix.
Usually you will get some help on the xcms mailing list. Good bug reports will save both you and the fellow users (or developers) time. The long answer can be found in the Posting Guide of the bioconductor project, and most points also apply to xcms. Simon Tatham's essay on "How to Report Bugs Effectively" is not restricted to a particular programming language, but a very recommended read in general. The short answer is: Check the growing xcms FAQ Report the shortest possible steps to reproduce the problem, even better try to reproduce it with the public faahKO data as e.g. in the PcaMdaExample. Show where the bug happened:
fillPeaks() will add intensities for peaks not observed in a certain sample, but in some/most of the other samples of a smple class. The question is: How good does fillPeaks() estimate the intenisty for my data ? One way to estimate the quality is to take an xcmsSet grouped without NA values (i.e. one class and minfrac=1), remove the peaks of one sample, and fill them back in. This way you can create a scatterplot of found peaks vs. filled peaks, and also check for which intensities they occur:
[attachment=1:mvim2ak9]FaahKOscatterFilledFound.png[/attachment:mvim2ak9] Original vs. filled-in intensities
[attachment=0:mvim2ak9]FaahKOintensityFilledFoundRatio.png[/attachment:mvim2ak9] Intensity ratios vs. intensity
For our evaluation of Bruker microTOFq data, we found that 94% of the restored intensities are correct, 1% had a deviation of factor 1.1 or more, and 5% had a ratio of less than 0.9 compared to the original peaks. [attachment=3:mvim2ak9]ScatterFilledFound.png[/attachment:mvim2ak9]Original vs. filled-in intensities [attachment=2:mvim2ak9]IntensityFilledFoundRatio.png[/attachment:mvim2ak9]
## Extract xcmsSet with KO only xss <- split(faahko, f=sampclass(faahko)) xs <- xss[[2]]
xsg <- group(xs, minfrac=1)
## remove extra peaks within groups peaks <- cbind(peaks(xsg),(1:nrow(peaks(xsg)))) meds <- as.vector(groupval(xsg,"medret")) meds <- meds[which(!is.na(meds))] peaks <-peaks[meds,] colnames(peaks) <- c(colnames(peaks(xsg)),"index")
## cleanup groupidx list for (a in 1: length(groupidx(xsg))) { gxs=NULL for (b in 1:length(groupidx(xsg)[[a]])){ if (any(meds == groupidx(xsg)[[a]][b])) gxs<-c(gxs,groupidx(xsg)[[a]][b]) } groupidx(xsg)[[a]] <- gxs }
## change index in groupidx for (a in 1: length(groupidx(xsg))) { gxs=NULL for (b in 1:length(groupidx(xsg)[[a]])){ groupidx(xsg)[[a]][b] <- which(peaks[,"index"]==groupidx(xsg)[[a]][b]) } }
## Create backup for later comparison xsbackup <- xsg
## remove peaks from last sample lasts <- max(peaks(xsg)[,"sample"]) lsmin <- min(which(peaks(xsg)[,"sample"]==lasts))
## remove peaksIDs from groupidx for (a in 1: length(groupidx(xsg))){ groupidx(xsg)[[a]] <- groupidx(xsg)[[a]][which(groupidx(xsg)[[a]]<lsmin)] }
## remove peaks from peaklist peaks(xsg) <- peaks(xsg)[1:(lsmin-1),]
## ## Preparations finished, now fillPeaks() ## xsgf <- fillPeaks(xsg)
## fillPeaks again ?! xsbackup2 <- fillPeaks(xsbackup)
## compare peaks in both xcmsSets groupwise and samplewise gxorig <- groupidx(xsbackup2) ## das fillpeaks-ergebnis des backups gxfill <- groupidx(xsgf) ## das fillpeaks-ergebnis des XS ohne sample lasts
## plot original and filled peaks mxo<-NA mxf<-NA ino<-NA inf<-NA for (a in 1: length(gxorig)) { opeak <- gxorig[[a]][which(peaks(xsbackup2)[gxorig[[a]],"sample"]==length(sampnames(xs)))] fpeak <- gxfill[[a]][which(peaks(xsgf)[gxfill[[a]],"sample"]==length(sampnames(xs)))] mxo[a] <- peaks(xsbackup2)[opeak,"mz"] mxf[a] <- peaks(xsgf)[fpeak,"mz"] ino[a] <- peaks(xsbackup2)[opeak,"into"] inf[a] <- peaks(xsgf)[fpeak,"into"] }
How is it possible to have a npeaks value higher than the total number of samples ?
A peak group simply contains all the peaks found with in an array defined by a m/z range and a retention time range. Depending on how dense the peaks are, it is certainly possible that a peak group will contain multiple peaks from a given sample, and even more peaks than the total number of samples. The retention time alignment algorithm takes this into account when selecting peaks to use as standards. Check out the xcms paper for more information. Knowing the number of peaks included in a group can help provide some diagnostic information about the quality of that group. I find it useful to know. old forum link old forum link2 -Colin
What does BW do? The bw parameter or bandwidth controls the size of the retention time window for the groups. Too large or too small and few 'well behaved' groups will be found. You need just the right bw, goldilocks. Most of the time the run type will have a good default bw (HPLC~=30, UPLC~=10-20). The code below demonstrates how changing the bw parameter affects the retention time alignment and the number of detected groups. As the script run watch the retention time plots change.
Now we can see the mean difference between each bw. The following code is basically a repeat of the code above but using the mean retention time deviation between all samples.
I have noticed that a lot of people have been getting the mz sort violation error while processing their data. In an effort to know where this is coming from I've created a poll. If people who have had this problem could vote and let us know where and how they converted. Maybe it's a certain instrument/parameter/converter or totally random?
fillPeaks() will add intensities for peaks not observed in a certain sample, but in some/most of the other samples of a smple class. Some people have observed, that even after fillPeaks(), some values are still "0" or even "NaN" (NotANumber). fillPeaks() is trying to (blindly) extract/integrate intensities from those raw files, where the peaks are "missing" after the peak picking step (matchedFilter or centWave). The information, where to look, is taken from the other samples where the peak was observed, i.e. the rtmin/rtmax and mzmin/mzmax after group()ing. Two cases where this will fail are most prominent:
If the peak is near the beginning (or end) of the LC run, this time range could be "before" or "beyond" the runtime of the sample in question, especially if retcor() was used to correct/modify any RT shifts. In this case, fillPeaks() will report "NaN". There is nothing xcms could sensibly do otherwise, and this non-number should flag a warning that there was something fishy. You might workaround by removing the first / last seconds (depending on the gradient they usually don't contain important/reliable peaks anyway) using the scanrange=c(min,max) parameter to xcmsSet() and findPeaks().
If your m/z calibration is very good for most runs, the group will have a very small mzin/mzmax window. For centroided data it is then quite easy to miss the actual raw points by just a tiny bit, as seen in the graph. In that case, the sum of intensities will be zero.
[attachment=1:2z4nlwmv]300px-ZeroPeaks-26.png[/attachment:2z4nlwmv] raw data just above raw data fillPeaks looking "beyond" raw data: The red EIC ends, because the peak is located at the end of the run, and the retention time has been corrected previously.
[attachment=1:2z4nlwmv]300px-ZeroPeaks-26.png[/attachment:2z4nlwmv] fillPeaks looking "beyond" raw data: The red EIC ends, because the peak is located at the end of the run, and the retention time has been corrected previously.
# Plot offending group(s): e <- getEIC(xs, groupidx=nanidx[1], ) # Bad NaN guy in red: col=rep("grey", length(sampnames(xs))) col[nanidx[2]] <- "red" plot(e, xs, col=col, sleep=1)
I would add that apart from going to Linux to use yet more memory (which you must have physically available!), you can also do other things to reduce the size of your xcmsSet and thus save memory and increase processing speed.
The easiest solution is to reduce the size of your .cdf, .mzXML or .mzData input files. You can do this using the tools available in OpenMS, which can be downloaded and easily installed under 32 bit Windows (Linux installation also possible but tricky). Once installed, openMS allows you to trim source .cdf, .mzXML or .mzData files to a specific start and stop time, reduce the m/z scan range, or even re-sample the data to fewer scans/unit time. These operations can all be run as DOS batch files on your dataset (but be aware that .cdf files cannot be read by openms in a 64bit OS (Windows or Linux) environment, so if you need to process .cdf files you must install openms in a 32 bit environment, which is the standard Windows OS). In xcms, there is very limited ability to filter data from a source file before processing - as far as I know, this is limited to specifying a scanrange if you use the centWave method in xcmsSet().
Of course, all these parameters should ideally be optimized when you first acquire the data from your instrument - but often the benefit of hindsight reveals that some post-acquisition data reduction would be beneficial.
Finally, be aware that limiting the input file size will buy you more input files/ output vector in R, and hence allow you to process more input files. However, you will eventually come up to the ~3Gb hard-limit in Windows if you process a very large number of input files, especially when you come to use xcms functions like fillPeaks() where the entire data matrix is loaded into memory. Then Linux is the only option, and this should be running on a system with lots more RAM than a typical Windows system.