Find outlier runs

August 19, 2011, 09:08:37 AM

Sometimes, especially with larger sample sets, the standard retcor() will complain with "Too few peak groups, reverting to linear method" or "Not enough well behaved peak groups even for linear smoothing of retention times". There are several possible reasons why this can happen.
The underlying group() was probably not performed with a large enough bw param, e.g. UPLC parameters used for an HPLC setup. For large sample sets (way more than one hundred) you can increase the retcor() parameters missing and extra, which allows more peaks to be considered "well behaved". The retcor(method="obiwarp") method does not rely one such peak groups, and uses raw spectra correlation. It creates an xcmsSet with corrected retention times.
[attachment=1:1b8dp77j]RetcorObiwarpWithOutliers.png[/attachment:1b8dp77j]

Retcor plot of a large sample set with a few outliers
One way or another you might end up with a retcor plot like the one to the right, which shows that most runs are well or at least similar behaving, and a few outliers. If you can identify and remove them, the remaining alignment will be of much better quality. The problem is that from the plot alone the culprits can not be identified (unless they're among the first few samples shown in the legend).
The following code calculates for each sample the RMSD between the raw and uncorrected retention time, i.e. the amount of correction
[attachment=0:1b8dp77j]rmsd.png[/attachment:1b8dp77j]

Code: [Select]

# Check RMSD of retention time deviation
 library(xcms)
 library(faahKO)
 xs <- group(faahko)
 xs <- retcor(xs)
 rmsd <- sapply(1:length(xs@rt$corrected), 
                function(x) {
                   sqrt(sum((xs@rt$raw[[x]]-xs@rt$corrected[[x]]) ** 2))
                })
 plot(rmsd)

In addition, it is even possible to cluster the samples w.r.t. their retention time profiles, which can discriminate between samples with the same RMSD in different areas of the gradient:

Code: [Select]

 minlength <- min(sapply(xs@rt$raw, length ))
 devs <- sapply(1:length(xs@rt$corrected), 
                function(x) { (xs@rt$raw[[x]]-xs@rt$corrected[[x]])[1:minlength] })
 colnames(devs) <- sampnames(xs)
 ddevs <- dist(t(devs))
 hdevs <- hclust(ddevs) 
 plot(hdevs)

[attachment deleted by admin]