I also observe that group.nearest produces more feature groups, but more doesn't mean "better" But I haven't had the time to really compare both methods, so normally I stick to density, because I'm more familiar with it.
I spoke with Ralf about the group.density problem and we removed the threshold in the next xcms version, because it's unnecessary.
After some debugging I think I have an idea, where the problem lies. The group.density function bins all m/z values (meaning all m/z values over the whole retention time) and then loops over these bins. It defines a number of peaks which has to occur in that bin, otherwise it loops into the next. The value is defined as the minimum number of samples per class divided per two. So for the faahKO data set this number is 3, because we have 6 samples per class.
If the number of peaks within that bin is higher than the threshold, the density function divides the m/z bin into different retention time groups. For these groups the minfrac and minsamp values are then checked.
I have no idea why this threshold exist and if it can be safely removed.
What software/R package or script are you using to bin (bucket) data from LC-MS?
Feeding an xcmsRaw with a txt file could be a painful approach. If you only want to bin values have a look at the R function cut. Here is a small example, which you could use as a starting point
#Generate some data x <- rnorm(1000,0,10) #sort data x.sort <- sort(x) #take a quick look head(x.sort) #[1] -31.44805 -29.43685 -28.45767 -28.18506 -26.27450 -25.92044 #generate bins with bin size of 0.1 over whole data range bin <- seq(min(x)-0.1,max(x)+0.1,by=0.1) #take a look at the bins head(bin) #[1] -31.54805 -31.44805 -31.34805 -31.24805 -31.14805 -31.04805 #So first bin starts at -31.54805 to -31.44805 binned.x=cut(x, breaks=bin, labels=FALSE) head(binned.x) #[1] 1 22 31 34 53 57 #As an results you retrieve bin indes, so first x values goes into bin of -31.54805 to -31.44805
It exists a method for plotting EICs from a xcmsSet. You can use this methods for your problem, because the original xcmsSet object is stored within a xsAnnotate object.
So with xs <- object@xcmsSet (object is here the processed xsAnnotate object) you can retrieve the original xcmsSet object.
xsa <- annotate(xset) ##Only for the case the xset is lost. xset <- xsa@xcmsSet
##For plotting first fillPeaks your xcmsSet xset.fill <- fillPeaks(xset) ##Get EICs xeic <- getEIC(xset.fill, groupidx=1:nrow(xsa@groupInfo)) ##Plot EICs for the first peak plot(xeic, groupidx=1) ## For more information about this function see ?xcms::plot.xcmsEIC
CAMERA has the plotEICs function, which plots the peak EICs for each peak group created by CAMERA. But you want to plot for one peak the EICs from all its samples in one plot. Did I understand you right?
By the way, if you use the xcmsSet constructor, your data is already processed by the default peak picker. If you want to use the "centWave" peak picker, then change your code to
so do you need the diffreport function or are you fine with the peaktable generated by the script from Paul?
And for using CAMERA at which step if complains about the NA values? The only function I could imagine would be annoteDiffreport, which uses the diffreport and therefore were are back at the start of this thread But all other functions should work with NA values.
If you don't want to use the wrapper functions, you find in every manpage some code examples for the "normal" script based analyis. If you need help within any function, just write me a note. Perhaps in the CAMERA sub forum that so that we not "pollute" this one
the diffreport function calculates the fold changes between two samples for every feature. If in one sample the value is NA (meaning the feature could not be detected or the feature intensity was lower than your filtering settings), then the fold change can't be calculated. For those cases the fillPeaks function was written and works quite well.
You can split you xcmsset: see ?split.xcmsSet But the peak group info is discarded so it won't be trivial to put it back together again after re-grouping and filling. There is no logical way to divide this data for individual analysis?
I actually tried that yesterday but it turns out to be more tricky then I thought. You need some adoption in findPeaks or a script which simply redo that step. I will try again over the weekend and report back!
I add a new parameter mzdec to the diffreport function, which is the number of decimals (defaults to 2) in the plot title. Should be build on BioC (version 1.33.17) within the next 24 hours.
I need help in the Interpretation of XCMS-Results, too. For my analysis it was not possible to use the same amount of plant tissue in each sample. Does XCMS compare absolute or relative amounts of substances?
To keep threads clean it's would be best to create a new one instead of posting here, but unfortunately I have no moderation rights to move your posting, so lets keep it here.
Sorry for the late answer, but I was on holiday the last few days.
Quote from: "osuct"
So, points that line up represent peaks from multiple samples that have the same retention time, correct?
Exactly!
Quote from: "osuct"
The y-axis is labeled density. What density is this referring to?
This density is calculated with a "standard" gaussian kernel estimation. By the way the bw parameter from the group function is the smoothing bandwidth for this kernel.
Quote from: "osuct"
I tried to change the mzwid parameter to 0.025 and got a lot more detected features (> 1000) but is there a diagnostic like the group plot to see whether this is too small or too big?
Not a general one, as far as I know. If I want to optimize my parameters I normally look at some specific features and their group plots, also in how many cases the npeaks number is higher than the number of samples, ...
But keep in mind that the mzwid parameter is unfortunately an absolute value, not a relative one. So mzwid= 0.01 could be optimal for m/z 100 - 400, but for 400 - 1000 you need mzwid=0.025 and therefore it must be set to 0.025 in general.
Quote from: "osuct"
Also, for the following figure, does it mean two features (with different two different rt range) are detected because of the two dashed lines? or is it only 1 feature is detected with larger rt range? Why there is only 1 gaussian curve for these two line of points?
As you can see in you groups result, you have only one feature at m/z 76.95. The dashed lines marks here the left and the right end of the rt region, where the feature is defined. So according to the feature, the left line is at 212.638 and the right at 223.605 mzmed mzmin mzmax rtmed rtmin rtmax npeaks 01andQC [5,] 76.95368 76.95315 76.95496 213.623 212.638 223.605 265 239
The width of the gaussian curve depends certainly on the detected peaks but also on the previously mentioned bw parameter. The kernels are scaled such that the bw parameter is the standard deviation of those kernels. So if you know that your chromatography is very stable and the retention time for a feature doesn't change much even over hundreds of samples, then you can decrease the bw parameter. This results in much more narrow curves, which would separate this feature here in two.
the centWave algorithm has a two step feature detection framework. The first step is the detection of connected centroids, which are called ROI (region of interest). If within one scan the selection of a centroid is not distinct, this could be caused by a too large search window (ppm parameter).
If you have not many warnings, you can safely ignore them. If not, then you should lower your ppm parameter.