For metabolomics data, the biostatisticians we collaborate with strongly discourage any filtering of mass features based on group membership. For example, they say it would be a bad idea to detect peaks and then, during peak alignment, keep only mass features that were present in 50% of the samples in treatment group A or 50% of the samples in treatment group B. Better, they say, to keep mass features present in 50% of all samples, so that the data preprocessing steps do not bias the outcome.
With that in mind, I'm trying to make sure that XCMS is not filtering at all. Here's my issue: I get different results when I group my samples (by putting my raw data files into different folders) than when I do not, so somewhere I'm filtering when I don't mean to be. Here is the code I'm using:
library(xcms)
Samples <- list.files(getwd(), pattern = "mzdata.xml", full.names = FALSE, recursive = TRUE)
# Peak detection with centWave
U5g.raw <- xcmsSet(Samples, method = "centWave", snthresh = 10, ppm = 15, peakwidth = c(6, 12), mzCenterFun = "apex", integrate = 1, fitgauss = TRUE)
U5g.raw
# First grouping of peaks across samples
U5g.grouped <- group(U5g.raw, method = "density", minsamp = 1, mzwid = 0.004, bw = 10, max = 10)
# Retention time correction
U5g.RTcor <- retcor(U5g.grouped, missing = 15, extra = 30, smooth = "loess", family = "symmetric", plottype = "mdevden")
# Regroup after retention time correction
U5g.grouped2 <- group(U5g.RTcor, method = "density", minsamp = 1, mzwid = 0.004, bw = 10, max = 10)
# Fill in missing peaks, then export the peak table
U5g.filledpeaks <- fillPeaks(U5g.grouped2)
U5g.peaks <- peakTable(U5g.filledpeaks, filebase = "ESI+ urine 5g peak table")
When I have my samples in four folders, one for each of my four treatment groups, I get 12,793 mass features. (Yes, I know that many of those are noise and that I'm probably too stringent on some of my mass spectral resolution parameters. :-) I'll adjust that later, once I better understand what's going on here.) When I put those exact same data files all together into one folder, I get 3,806 mass features.
Does anyone have any thoughts on what's going on? I thought that setting "minsamp=1" as a grouping parameter meant that I wasn't filtering at all. But if I'm not filtering based on group membership, why do I get a different number of mass features when I group my samples by treatment group than when I don't?
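To make the distinction I'm asking about concrete, here is a minimal base-R sketch on a made-up presence/absence matrix. The matrix values, the group labels, and the names present, grp, and min_frac are all invented for illustration. This is not XCMS code and it makes no claim about what XCMS does internally; it only shows how the two 50% rules from the first paragraph can disagree on the same data.

```r
# Toy presence/absence matrix: rows are mass features, columns are samples.
# All values are invented for illustration; this is not XCMS output.
present <- rbind(
  f1 = c(1, 1, 0, 0, 0, 0, 0, 0),  # in 2 of 4 group-A samples only
  f2 = c(1, 0, 1, 0, 1, 0, 1, 0),  # in half of all samples
  f3 = c(0, 0, 0, 0, 1, 1, 1, 0),  # in 3 of 4 group-B samples only
  f4 = c(1, 0, 0, 0, 0, 0, 0, 0)   # in a single sample
) == 1
grp <- rep(c("A", "B"), each = 4)  # hypothetical treatment labels

min_frac <- 0.5

# Group-blind rule: present in at least 50% of all samples.
keep_blind <- rowMeans(present) >= min_frac

# Group-aware rule: present in at least 50% of the samples of any one group.
frac_by_group <- sapply(split(seq_along(grp), grp),
                        function(idx) rowMeans(present[, idx, drop = FALSE]))
keep_aware <- apply(frac_by_group >= min_frac, 1, any)

sum(keep_blind)  # 1 feature kept (f2)
sum(keep_aware)  # 3 features kept (f1, f2, f3)
```

In this toy case the group-aware rule keeps more features than the group-blind rule, which is the same direction as the difference between my four-folder and one-folder runs, though I can't say whether that is what is actually happening in my data.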
Thanks in advance. This board has really, really been helpful to me in the past!
Laura