
Undesired filtering somewhere

For metabolomics data, the biostatisticians with whom we collaborate strongly discourage any kind of filtering of mass features based on group membership. For example, they say that it would be a bad idea to detect peaks and then, during peak alignment, only keep mass features that were present in 50% of the samples in treatment group A or 50% of the samples in treatment group B. Better, they say, to keep mass features present in 50% of all samples so that your data preprocessing steps do not bias your outcome. With that in mind, I'm trying to make sure that XCMS is not filtering at all. Here's my issue: I get different results when I group my samples -- putting my raw data files into different folders -- than when I do not, so somewhere, I'm filtering when I don't mean to be. Here is the code I'm using:
Code:
library(xcms)

# Find all mzData files below the working directory; subfolders become the sample classes
Samples <- list.files(getwd(), pattern="mzdata.xml", full.names=FALSE, recursive=TRUE)

# Detect peaks in every file with centWave
U5g.raw <- xcmsSet(Samples, method="centWave", snthresh=10, ppm=15, peakwidth=c(6,12), mzCenterFun="apex", integrate=1, fitgauss=TRUE)

U5g.raw

# Group peaks across samples
U5g.grouped <- group(U5g.raw, method="density", minsamp=1, mzwid=0.004, bw=10, max=10)

# Correct retention times
U5g.RTcor <- retcor(U5g.grouped, missing=15, extra=30, smooth="loess", family="symmetric", plottype="mdevden")

# Regroup after retention time correction
U5g.grouped2 <- group(U5g.RTcor, method="density", minsamp=1, mzwid=0.004, bw=10, max=10)

# Fill in peaks missing from individual samples
U5g.filledpeaks <- fillPeaks(U5g.grouped2)

# Export the peak table
U5g.peaks <- peakTable(U5g.filledpeaks, filebase="ESI+ urine 5g peak table")
When I have my samples in 4 folders, one for each of my four treatment groups, I get 12,793 mass features. (Yes, I know that many of those are noise and that I'm probably too stringent on some of my mass spectral resolution parameters. :-) I'll adjust that later, once I better understand what's going on here.) When I put those exact same data files all together into one folder, I get 3,806 mass features.

Anyone have any thoughts on what's going on? I thought that if I put "minsamp=1" for a grouping parameter that meant that I wasn't filtering at all, but if I'm not filtering based on group membership, why do I get a different number of mass features when I group my samples by treatment group than when I don't?

Thanks in advance. This board has really, really been helpful to me in the past!

Laura

Re: Undesired filtering somewhere

Reply #1
You might want to adjust the minfrac parameter.

minfrac = minimum fraction of samples necessary in at least one of the sample groups for it to be a valid group

The default value is 0.5, i.e. a feature is filtered out unless it is present in at least 50% of the samples of at least one group.
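
The effect of sample grouping on this rule can be sketched in a few lines of base R. This is only an illustration, not the xcms source: `keep_feature` is a hypothetical helper, the counts are made up, and it assumes a feature is kept when it satisfies both minfrac and minsamp in at least one sample group.

```r
# Hypothetical helper (not an xcms function): decides whether a feature
# survives, assuming it must satisfy both minfrac and minsamp in at least
# one sample group.
keep_feature <- function(found, class_size, minfrac = 0.5, minsamp = 1) {
  # found:      number of samples per group in which the peak was detected
  # class_size: total number of samples per group
  any(found >= class_size * minfrac & found >= minsamp)
}

# Illustrative numbers only: a peak detected in 3 of 5 samples of one
# treatment group and in none of the other 15 samples.
grouped <- keep_feature(found = c(3, 0, 0, 0), class_size = c(5, 5, 5, 5))
pooled  <- keep_feature(found = 3, class_size = 20)
relaxed <- keep_feature(found = 3, class_size = 20, minfrac = 0)

print(c(grouped = grouped, pooled = pooled, relaxed = relaxed))
```

Under these assumptions the same peak passes when the files are split into four folders, fails when all 20 files form one group, and passes again once minfrac is set to 0.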

Re: Undesired filtering somewhere

Reply #2
Great! Thanks, Ralf; that worked! I changed the group function command, adding the minfrac parameter like this:
Code:
group(U5g.raw, method="density", minsamp=1, minfrac=0, mzwid=0.004, bw=10, max=10)
and that worked! Now, I've got 13,541 mass features.

That was the missing piece. I wasn't sure which would overrule which, the minfrac or the minsamp parameter. It appears that XCMS uses whichever filtering level is higher.

Thank you for your help!

Laura

Re: Undesired filtering somewhere

Reply #3
mzwid=0.004 appears to be very small. What kind of data is this?
For QToF data, try something like 0.025.

Also, I would not limit max to 10; set it to 50 (the default), 100 or more.

Re: Undesired filtering somewhere

Reply #4
Hi, Ralf.

These are QToF data. I picked the value of mzwid=0.004 because, for peaks that appear in multiple samples, the difference in m/z between a peak detected in one sample and the same peak detected in another sample by our instrument is generally slightly less than 0.004 m/z. I saw in Table 1 of the Nature Protocols paper from February by you, Gary Patti, and Gary Siuzdak that you recommend mzwid=0.025 with ppm=30 for HPLC/QToF data and mzwid=0.015 with ppm=15 for high-resolution HPLC/QToF data. How did you determine those numbers? We're scanning from 100-1000 m/z in our runs, and the mean m/z we detect in plasma and urine samples is ~400 m/z. A difference between samples of 0.004 m/z for a molecule with m/z=400 is 10 ppm, so shouldn't I set mzwid=0.004 and ppm=10? But maybe you aren't determining those parameters the way that I'm thinking because a difference in m/z of 0.025 for a common small molecule would be ~60 ppm, not 30 ppm.

Thanks for any clarification!

Laura
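
The ppm arithmetic in this post can be checked with a short base R sketch. `mz_diff_to_ppm` is a hypothetical helper name, not part of xcms; it simply divides an absolute m/z difference by the m/z value and scales by one million.

```r
# Hypothetical helper: express an absolute m/z difference as ppm.
mz_diff_to_ppm <- function(delta_mz, mz) {
  delta_mz / mz * 1e6
}

# The two cases from the post, both evaluated at m/z = 400.
ppm_narrow <- mz_diff_to_ppm(0.004, 400)
ppm_wide   <- mz_diff_to_ppm(0.025, 400)

# A fixed absolute tolerance corresponds to a different ppm value at each
# end of the 100-1000 m/z scan range.
ppm_across_range <- mz_diff_to_ppm(0.004, c(100, 400, 1000))

print(ppm_narrow)
print(ppm_wide)
print(ppm_across_range)
```

The last line shows why an absolute width such as mzwid and a relative tolerance such as ppm are not interchangeable: the same 0.004 m/z corresponds to a much larger ppm value at m/z 100 than at m/z 1000.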

Re: Undesired filtering somewhere

Reply #5
Hi Laura,

In short, the centWave algorithm searches for m/z signals occurring in consecutive scans within a specific m/z error.
The minimum and maximum numbers of necessary scans are calculated from the peakwidth parameter.
The m/z error is the combination of the mzwid and ppm parameters.

At the peak apex the m/z error is certainly within the 10 ppm range you mention, but at the peak borders,
which are normally at low intensities, the mass accuracy is worse. This also applies to low-abundance peaks.
That is the reason for choosing higher ppm values.

To get an impression of your data, you can look at the @peaks slot or the general peak list, where for each peak, besides the mz value (which is calculated at the peak apex),
the mzmin and mzmax values are also reported.
 
Cheers,

Carsten
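
One way to follow this suggestion is to express the mzmin to mzmax spread of each detected peak in ppm. The sketch below uses a toy matrix with the mz, mzmin and mzmax column names mentioned above. The numbers are invented for illustration, and `pk` and `ppm_spread` are hypothetical names. With real data, `pk` would presumably be replaced by the peak matrix of the xcmsSet, e.g. `U5g.raw@peaks`.

```r
# Toy peak table using the mz, mzmin and mzmax column names; the values
# are made up for illustration and do not come from real data.
pk <- cbind(mz    = c(400.2000, 250.1000, 612.3000),
            mzmin = c(400.1980, 250.0960, 612.2950),
            mzmax = c(400.2020, 250.1040, 612.3060))

# Spread of the m/z values within each peak, expressed in ppm.
ppm_spread <- (pk[, "mzmax"] - pk[, "mzmin"]) / pk[, "mz"] * 1e6

# A summary gives a rough idea of how wide the ppm tolerance needs to be.
print(summary(ppm_spread))
```

If most peaks show a spread well above the mass error seen at the apex, that would support choosing a ppm value larger than the apex-based estimate.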

Re: Undesired filtering somewhere

Reply #6
Thanks, Carsten! That helped!

Laura