
Messages - CoreyG

1
XCMS / Re: peak group detected with `peakDensityParam` but not listed in feature matrix
Hi cpataki,

You can look at the peak table with the following command:
Code: [Select]
chromPeaks(xset5, rt = rtr_1, mz = mzr, type = "all", msLevel = 1L)

Both highlightChromPeaks and plotChromPeakDensity use this to find the peaks inside the ranges you specify.

What you should know, however, is that plotChromPeakDensity doesn't plot 'features'. It plots the density of the peaks found (see above) using the parameters you gave it (gparam). So, it is possible that 'PeakDensityParam' is including additional nearby (mz or rt) peaks, so that the median rt and mz are outside the range you specify in mzr and rtr_1.

So I would start by increasing the mzr range and taking another look.
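
Something along these lines, just a sketch reusing your object names (xset5, mzr, rtr_1, gparam); the +/- 0.01 widening is an arbitrary example and the plotChromPeakDensity arguments are from memory:
Code: [Select]
# Widen the m/z window slightly and check which peaks now fall inside it
mzr_wide <- c(mzr[1] - 0.01, mzr[2] + 0.01)

# Peaks that were detected within the wider window
chromPeaks(xset5, rt = rtr_1, mz = mzr_wide, type = "all", msLevel = 1L)

# Re-plot the peak density over the wider window with the same grouping parameters
plotChromPeakDensity(xset5, mz = mzr_wide, rt = rtr_1, param = gparam)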

Cheers,
Corey
2
XCMS / Re: Side/ Partial Peak artifacts
Hi hhabra,

Could you provide a link to some of the forum discussions that you mention? I haven't seen them and would be interested in reading about what has been done before.

There are potentially a few places that you can try to limit these side/partial peaks. You seem to have tried a few approaches with some success.
I would suggest trying to alter the grouping algorithm settings. Increasing bw will eventually cause these to merge into the larger features, but this may have unwanted effects on other features. So I would look at changing the minfrac and minsamp settings.
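As a sketch using the old xcms interface (since you mention fillPeaks/minfrac/minsamp); 'xset' and the numeric values are placeholders to tune for your own data:
Code: [Select]
# Re-run the density grouping with stricter minfrac/minsamp so that side/partial
# peaks found in only a few samples don't become their own feature
xset <- group(xset, method = "density",
              bw      = 10,   # increase to merge shoulders into the main feature
              minfrac = 0.5,  # fraction of samples in a group that must contain the peak
              minsamp = 2)    # minimum number of samples in at least one group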
As you are using fillPeaks, it will "find" the missing peaks in the rest of the samples. Looking at the images you posted, it doesn't look like XCMS will find the same side/partial peaks in many samples.

If that doesn't work, perhaps we can start to think of some post-processing solutions to handle it.
3
XCMS / Re: Scan numbering for DDA/IDA-experiments
Hi Tony,

While I don't know the answer to your questions, you could probably test it by filtering out the MS/MS data using MSConvert. Compare the original data to the filtered data and see if they are the same.

Alternatively, you could take a look at one of the files like this:
Code: [Select]
# Read the data file (MS1 only)
raw_data <- readMSData(files = files[1], pdata = new("NAnnotatedDataFrame", pd), msLevel. = 1)

# Extract the spectra
spec <- spectra(raw_data)

# Iterate through the spectra list and obtain the relevant information
specl <- lapply(spec, function(x) c(x@msLevel, x@rt, x@scanIndex))

# Compact the list to a matrix
specl <- do.call(rbind, specl)

This will create a matrix that contains the msLevel, retention time and scan number for every scan in the data. It shouldn't take long to cross reference a couple peaks to confirm whether 'scmin/scmax' need to be adjusted. If so, you have just generated a matrix that can be used to calculate the new scan counts!

Regarding lmin and lmax: I am not too sure about these. They appear in the centWave code, where they correspond to the minima of the 'continuous wavelet transform' (CWT) coefficients on either side of a peak. Interestingly, it looks like rtmin/rtmax are defined using the same values.
I could be way off, but I think scmin/scmax are estimates of the peak bounds based on the CWT.

As it looks like 'into' is calculated using the bounds defined in lmin/lmax, perhaps these are the better columns for you to use. But do note that these values are not always referenced from scan 1.
4
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Quote
On another matter, I’m wondering how to cut down on the number of features while still maintaining a low intensity threshold. Currently I have ~13,000 features. My goal is to be able to get my peak table into EZinfo, which is not able to handle my 1950x13,000 peak table. I am interested in minor compounds, so I don’t just want to filter by intensity. I have a few ideas, and I would love it if anyone could offer feedback.
Depending on what you are trying to do, you could get the median peak area from each of the triplicates. That will cut your sample rows to roughly a third.
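For example, something like this (a sketch; 'peakTable' as an injections x features table and 'sampleID' mapping each injection to its parent sample are assumed names for your own objects):
Code: [Select]
# Collapse triplicate injections to one row per sample by taking the median peak area
med_table <- aggregate(as.data.frame(peakTable),
                       by = list(sample = sampleID),
                       FUN = median, na.rm = TRUE)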

Alternatively, you could filter the dataset to remove isotopes. Depending on the average number of carbons in your metabolites and signal/abundance, you might be able to reduce the dimensions 2-4 fold. The same can be done by removing adducts.
I've briefly played around with CAMERA for this, but ended up using mz.unity.
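In case it helps, a rough CAMERA sketch ('xset' is a placeholder grouped xcmsSet; the ppm/polarity settings and the filter at the end are just one way to drop non-monoisotopic peaks):
Code: [Select]
library(CAMERA)

# Annotate isotopes and adducts, then drop non-monoisotopic peaks
xsa <- xsAnnotate(xset)
xsa <- groupFWHM(xsa)                           # group peaks by retention time
xsa <- findIsotopes(xsa, ppm = 10)              # annotate isotope clusters
xsa <- groupCorr(xsa)                           # refine groups by peak-shape correlation
xsa <- findAdducts(xsa, polarity = "positive")  # annotate adducts

pl <- getPeaklist(xsa)

# Keep unannotated peaks and monoisotopic ([M]) peaks only
keep <- pl$isotopes == "" | grepl("\\[M\\]", pl$isotopes)
pl_reduced <- pl[keep, ]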

As a less useful suggestion, is it possible to use R for your data analysis?
For a lot of multivariate analysis, mixOmics does pretty well. The website has a lot of examples and the inbuilt plotting functions have come a long way.

Quote
One idea is to set strict filters during correspondence to cut down on the number of “false” features. I tried re-running XCMS on this same dataset but using different params. In particular, for correspondence, I changed the way sample groups are defined. Previously, all samples were in the same group. This time, I defined each sample as its own group (so ~600 groups). For PeakDensityParam, I set minFraction=2/3 & minSamples=2. My thinking was that a true feature would be present in all 3 injections of a sample, but I set the cutoff to 2 out of 3 to be on the safe side. In this way, I hoped to eliminate false features. At any rate, the correspondence step took much longer than before, and I ran out of memory before the script was completed. I tried a couple times with the same result.
My thoughts on this differ from many in the 'untargeted' scene. I'm really only interested in features that are present in nearly all samples (<10% missing values). So, I always ask whether people expect to see features entirely missing from certain samples/groups.

The nice thing about XCMS is that you can set these parameters fairly loose early in the workflow. Then after fillChromPeaks, you can be more stringent.
With so many samples, I would imagine that seeing the same feature in multiple groups is almost a certainty. So maybe put every sample in 1 group, but set minFraction=10/600 (or something of that sort).
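
As a sketch with the new-style interface ('xdata' is a placeholder XCMSnExp object; bw and the cutoffs are placeholders to tune):
Code: [Select]
# One big sample group with a very low minFraction during correspondence
pdp <- PeakDensityParam(sampleGroups = rep(1, length(fileNames(xdata))),
                        bw = 5, minFraction = 10/600, minSamples = 1)
xdata <- groupChromPeaks(xdata, param = pdp)
xdata <- fillChromPeaks(xdata)

# Be more stringent afterwards, e.g. keep features genuinely detected (not filled)
# in more than 90% of injections
detected <- featureValues(xdata, value = "into", filled = FALSE)
keep <- rowMeans(is.na(detected)) < 0.10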

I'd love to hear other people's thoughts on this, as well.

Quote
Another idea is to filter out less informative markers based on variation of each feature across my sample set. My idea is to calculate the coefficient of variation for each marker across the entire dataset, and then exclude any markers below a certain CV value. I understand that p value and fold-change are often used for this kind of filtering, but as I understand it, these only make sense if the dataset contains multiple experimental groups. I don’t have any groups in my dataset; this is just an exploratory analysis. Does anyone have knowledge of or experience with filtering in this way? Any papers that you can suggest? How to determine an appropriate cutoff value for CV?

Thanks!
This is certainly a way you could go.
Perhaps there is a way to empirically determine a good CV cutoff?
If CV is mostly driven by the biological grouping, you could compare the per-feature CV calculated from all injections against the CV calculated after averaging the triplicates. To get a threshold, permute the biological grouping and repeat the process (so you end up averaging random non-triplicates); the 95th percentile of the permuted values is your critical value.
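
A rough sketch of that idea, assuming 'ft' is a features x injections matrix and 'sampleID' maps each injection to its biological sample (both names are hypothetical):
Code: [Select]
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Per-feature CV across all injections
cv_all <- apply(ft, 1, cv)

# Per-feature CV after averaging injections that share a sample label
cv_averaged <- function(ids) {
  avg <- sapply(unique(ids), function(s) rowMeans(ft[, ids == s, drop = FALSE], na.rm = TRUE))
  apply(avg, 1, cv)
}

# Observed drop in CV when the true triplicates are averaged
obs_diff <- cv_all - cv_averaged(sampleID)

# Null distribution: permute the sample labels and repeat
set.seed(1)
null_diff <- replicate(100, cv_all - cv_averaged(sample(sampleID)))

# 95th percentile of the permuted differences as the critical value
crit <- quantile(null_diff, 0.95, na.rm = TRUE)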
5
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Quote
Hello folks,

Here's an update from my end.

I returned from vacation to find CoreyG's helpful responses. It turns out that I was not using "value='into'". I changed this param, and now my data look much better.
Glad to hear I could be of help.

Quote
I've been using the Brunius batchCorr package, because I already know how to use R. However, given the characteristics of my dataset, I wonder if it is adequate.

Characteristics:
-- ~1950 files representing ~570 plant extracts (triplicate injection) + QC samples
-- 13 batches
-- All extracts are from the same species
-- The QC sample is an extract of about 40 accessions pooled together. However, it looks quantitatively different than most of the extracts in the sample set: the later eluting peaks of the QC sample are generally bigger while the early peaks are smaller. I don't think there are many qualitative differences between QC and other samples. However, I can imagine that these might translate into presence/absence differences in the peak table for minor compounds.
The differences between QCs and samples shouldn't be that big of a deal.
Depending on what batch correction method you use, you can assess the improvement in CV (RSD) of the QC features to determine how useful the batch correction was. Now, if the batch correction method optimized itself based on minimizing QC variation, then this approach is biased. Cross-validation should then be used to assess performance.

A simple visualization is to plot the pre-corrected CVs on the x-axis and the post-corrected CVs on the y-axis. Points that fall below the diagonal were improved; points on the diagonal weren't affected; points above the diagonal were negatively affected.
This may be an easy way to get a 'gut' feel for what method works best for you.
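
In R that plot is just a couple of lines (sketch; 'cv_pre' and 'cv_post' are hypothetical vectors of per-feature QC CVs before and after correction):
Code: [Select]
# Per-feature QC CV before (x) vs after (y) batch correction
plot(cv_pre, cv_post,
     xlab = "CV before correction", ylab = "CV after correction",
     pch = 16, col = "#00000040")
abline(0, 1, col = "red", lty = 2)  # points below the line were improved

# Fraction of features improved by the correction
mean(cv_post < cv_pre, na.rm = TRUE)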

Quote
-- The extracts--other than QC--are not standardized by concentration or by equivalent weight of plant material. There is a range of weight of plant material that was extracted. Nonetheless, I do have for each sample the weight of plant material extracted and the weight of solvent used for extraction. From these values, I have generated a sample:solvent correction factor.
-- This is a pilot dataset and was not intended for publication.

My thinking is, now that the batch correction has been done, the next step is to apply the sample:solvent correction factor. The simplest thing to do would be, for each feature in a sample, divide the peak area value by the correction factor for that sample. However, I realize that detector response may not be linear in the range of interest for each feature; thus, the results may not be completely accurate. Nonetheless, I can't think of a better option. Any feedback on my approach?

This is a fairly common approach. Of course, you should always try to keep the sample:solvent ratio as consistent across all samples as possible. Remember that different sample:solvent ratios will cause variability in extraction efficiency, ionization and detector response.

If you are concerned about introducing associations into your data, consider using a linear model to remove the correction factor.
Get the residuals from lm(peakArea~correctionFactor). This allows the detector response to not be 1:1, but doesn't do much for non-linearity.
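Per feature, that could look something like this (a sketch; 'ft' as a features x samples matrix and a per-sample 'correctionFactor' vector are assumed names):
Code: [Select]
# Regress each feature on the sample:solvent correction factor and keep the
# residuals (adding back the feature mean keeps values on the original scale)
ft_corrected <- t(apply(ft, 1, function(y) {
  fit <- lm(y ~ correctionFactor, na.action = na.exclude)
  resid(fit) + mean(y, na.rm = TRUE)
}))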
7
XCMS / Re: findChromPeaks-centWave {xcms} - Prefilter Settings and Code Documentation
Hi Tony,

Yes, I do believe you are correct with your interpretation of the documentation.
It looks like the prefilter check is done here: do_findChromPeaks-functions.R. The vapply iterates over each ROI index list and counts the number of scans with intensity at or above the prefilter intensity threshold (I). If that count is at least 'k', the ROI is kept.
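
Conceptually it boils down to something like this (an illustration only, not the actual xcms code; 'roiList' and 'intensity' stand in for the real objects, and the prefilter values are placeholders):
Code: [Select]
# Keep an ROI only if at least k of its scans have intensity >= I
prefilter <- c(3, 100)  # k = 3 scans, I = 100 counts

keep <- vapply(roiList, function(roi) {
  sum(intensity[roi$scmin:roi$scmax] >= prefilter[2]) >= prefilter[1]
}, logical(1))

roiList <- roiList[keep]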

Also, 'scans' would probably be a better word to use  :)
8
XCMS / Re: Question regarding the plot-methods {MSnbase}, type = "XIC"
Hi Tony,

There are potentially a few different ways of getting around this kind of problem. But I can't quite think of anything elegant...
At first, I was hoping that you could pass 'panel.last=c(abline(v = 181.0, col = "red", lty = 2),abline(h = 106.05, col = "red", lty = 2))' as another parameter, but that doesn't work.

As a pretty poor workaround, I've pulled out the code from the MSnbase GitHub page and edited the plotting functions.
Code: [Select]
plotXIC_MSnExp_ex <- function(x, ...) {
  ## Restrict to MS level 1
  x <- filterMsLevel(x, 1L)
  if (!length(x))
    stop("No MS1 data available")
  fns <- basename(fileNames(x))
  if (isMSnbaseVerbose())
    message("Retrieving data ...", appendLF = FALSE)
  x <- as(x, "data.frame")
  x <- split(x, x$file)
  if (isMSnbaseVerbose())
    message("OK")
  ## Check if we are greedy and plot a too large area
  if (any(unlist(lapply(x, nrow)) > 20000))
    warning("The MS area to be plotted seems rather large. It is suggested",
            " to restrict the data first using 'filterRt' and 'filterMz'. ",
            "See also ?chromatogram and ?Chromatogram for more efficient ",
            "functions to plot a total ion chromatogram or base peak ",
            "chromatogram.",
            immediate. = TRUE, call. = FALSE)
  ## Define the layout.
  dots <- list(...)
  if (any(names(dots) == "layout")) {
    if (!is.null(dots$layout))
      layout(dots$layout)
    dots$layout <- NULL
  } else
    layout(.vertical_sub_layout_ex(length(x)))
  tmp <- mapply(x, fns, FUN = function(z, main, ...) {
    .plotXIC_ex(x = z, main = main, layout = NULL, ...)
  }, MoreArgs = dots)
}


.plotXIC_ex <- function(x, main = "", col = "grey", colramp = topo.colors,
                     grid.color = "lightgrey", pch = 21,
                     layout = matrix(1:2, ncol = 1), ...) {
  # start edit
  plot_strip<-function(..., v, h) plot(...)
  dots <- list(...)
  # end edit
 
  print(list(...))
  if (is.matrix(layout))
    layout(layout)
  ## Chromatogram.
  bpi <- unlist(lapply(split(x$i, x$rt), max, na.rm = TRUE))
  brks <- lattice::do.breaks(range(x$i), nint = 256)
  par(mar = c(0, 4, 2, 1))
 
  # start edit
  plot_strip(as.numeric(names(bpi)), bpi, xaxt = "n", col = col, main = main,
       bg = lattice::level.colors(bpi, at = brks, col.regions = colramp), xlab = "",
       pch = pch, ylab = "", las = 2, ...)
  # end edit
 
  if(!is.null(dots$v)) abline(v=dots$v,col='red',lty=2)
 
  mtext(side = 4, line = 0, "Intensity", cex = par("cex.lab"))
  grid(col = grid.color)
  par(mar = c(3.5, 4, 0, 1))
 
  # start edit
  plot_strip(x$rt, x$mz, main = "", pch = pch, col = col, xlab = "", ylab = "",
       yaxt = "n", bg = lattice::level.colors(x$i, at = brks, col.regions = colramp),
       ...)
 
  if(!is.null(dots$h)) abline(h=dots$h,col='red',lty=2)
  if(!is.null(dots$v)) abline(v=dots$v,col='red',lty=2)
  # end edit
 
  axis(side = 2, las = 2)
  grid(col = grid.color)
  mtext(side = 1, line = 2.5, "Retention time", cex = par("cex.lab"))
  mtext(side = 4, line = 0, "m/z", cex = par("cex.lab"))
}

.vertical_sub_layout_ex <- function(x, sub_plot = 2) {
  sqrt_x <- sqrt(x)
  ncol <- ceiling(sqrt_x)
  nrow <- round(sqrt_x)
  rws <- split(1:(ncol * nrow * sub_plot), f = rep(1:nrow,
                                                   each = sub_plot * ncol))
  do.call(rbind, lapply(rws, matrix, ncol = ncol))
}

If you run these functions, then you can plot using 'plotXIC_MSnExp_ex(serine,v=181.0,h=106.05)'.
As you would expect, this isn't a great solution. But if it is particularly helpful, perhaps suggest it on the MSnbase GitHub page and it might get implemented in a future release.

Of course, if anybody else has a better solution, please feel free to post it.

Cheers,
Corey
9
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Random Forest batch correction (Quality Control – Random Forest Signal Correction; QC-RFSC) is incorporated into statTarget and is available in R (https://bioconductor.org/packages/release/bioc/html/statTarget.html).

Alternatively, there was a recent paper by Fiehn's lab on Systematical Error Removal using Random Forest (SERRF). Some details of the method can be found at https://slfan2013.github.io/SERRF-online/. Unfortunately, they have set it up as an online portal, where you have to upload your data to their server. This could constitute a breach of ethics for data sharing/storage, so be sure you have permission to do so if you use it.
10
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Glad to see things moving along, Metabolon1.

Regarding batch correction, take a look at a recent topic that came up on the forums: Feature intensity drift correction suggestions. In addition, check out the Metabolomics Society Wiki page on batch effects for some other packages: Batch Effect.
There are also a few RandomForest based methods floating around. I'll grab the name of the package tomorrow.

In regards to the latest plots you put up, they looked remarkably similar to plots I made when I "exported" my data with featureValues but didn't specify 'value = "into"'. If you don't specify this, featureValues just returns the row number of chromPeaks for that peak.
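
For reference, it's just something like this (sketch; 'xdata' stands in for the XCMSnExp object):
Code: [Select]
# Export feature areas rather than chromPeaks row indices
ft <- featureValues(xdata, method = "medret", value = "into")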

Lastly, I too had to set 'binSize' (in PeakDensityParam) quite a bit lower as I observed the feature m/z range was a bit too large given the ppm error that I would expect.

Cheers,
Corey
11
XCMS / Re: Implementing custom retention time alignment algorithms
Thanks Johannes,

I'll take a look at making the changes and generating a pull request (never done one before).

No problem regarding xcms centWave. I don't think I'd be brave enough to suggest changes there.

Cheers,
Corey
12
Other / Re: Open software for SRM experiments
Hi VSO,

I've used Skyline (https://skyline.ms/project/home/software/Skyline/begin.view) a fair bit in the past for metabolomics SRM data analysis. You can export peak areas, retention times (apex, start of integration, end of integration) and FWHM fairly easily (using the document grid). I don't know if it generates a metric for noise (or S/N)...

The hardest part when getting started is that you have to manually specify which transitions you want to look at (Edit -> Insert -> Transition List). Skyline won't read the transition list from a file automatically.

Cheers,
Corey
13
XCMS / Re: Implementing custom retention time alignment algorithms
Thanks for looking into and fixing the error - very much appreciated.

I was quite intrigued that you said fillChromPeaks always uses the adjusted retention time, so I looked a bit deeper into the code.
It seems fairly simple to allow the ability to integrate using the original rt range.

If you included another parameter in getChromPeakData to select whether to switch back to the unadjusted rt range, you could store the unadjusted rtime, work out which indices of rtim correspond to rtmin and rtmax, and then use those indices to get the original rt range for the calculation of res[, "into"]:
Code: [Select]
.getChromPeakData <- function(object, peakArea, sample_idx,
                              mzCenterFun = "weighted.mean",
                              cn = c("mz", "rt", "into", "maxo", "sample"),
                              unadjusted = FALSE) {
...
rtim <- rtime(object)
if (unadjusted) rtim_unadjusted <- rtime(object, adjusted = FALSE)
...
rtScans <- range(which(rtim >= rtr[1] & rtim <= rtr[2]))
...
if (unadjusted) rtrange <- rtim_unadjusted[rtScans] else rtrange <- rtim[rtScans]
...
res[i, "into"] <- sum(mtx[, 3], na.rm = TRUE) *
          ((rtrange[2] - rtrange[1]) /
             max(1, (sum(rtim >= rtr[1] & rtim <= rtr[2]) - 1)))

However, this highlighted something else in the code that felt odd to me (again, I'm making a lot of assumptions).
By using 'rtr[2] - rtr[1]' in the calculation of "into", don't we always end up overestimating the area of the peak?
rtr comes from the medians of other samples, but getChromPeakData integrates using the scans found between these limits. So the rt range over which it actually integrates is notionally smaller than 'rtr[2] - rtr[1]'. In the example above, rtrange is indeed smaller (with unadjusted=FALSE).

Could we iterate over peakArea and calculate new rtmin and rtmax based on the actual rtime?
Code: [Select]
peakArea <- apply(peakArea, 1, function(pk) {
    # Get the start and end index of rtim within the rt range
    rtScans <- range(which(rtim >= pk["rtmin"] & rtim <= pk["rtmax"]))

    # Convert the median-based rt range to the actual rt range
    pk[c("rtmin", "rtmax")] <- rtim[rtScans]

    # If the user wants the unadjusted rt range, give it to them; otherwise use the (adjusted) rt range
    if (unadjusted) rtrange <- rtim_unadjusted[rtScans] else rtrange <- rtim[rtScans]

    # Save the rt range in peakArea, so it can be used instead of rtr[2] - rtr[1] for res[i, "into"]
    pk["rtDiff"] <- diff(rtrange)

    return(pk)
})
peakArea <- t(peakArea)

I'm not sure how centWave integrates peaks and how the rtmin and rtmax are chosen. So maybe this doesn't make sense...

Cheers,
Corey
14
XCMS / Re: Implementing custom retention time alignment algorithms
Hi Johannes,
Just to be clear, I am using 'adjustedRtime' to apply the adjusted retention times (and not following it with 'applyAdjustedRtime').
Code: [Select]
adjustedRtime(xdata) <- scans

If I use applyAdjustedRtime after adjustedRtime, I do not get an error with fillChromPeaks. This is likely because hasAdjustedRtime returns FALSE, so the processHistory check never gets performed (methods-XCMSnExp.R#L651, I think)
My concern with using applyAdjustedRtime is that the data will be slightly warped, and so the integration during fillChromPeaks will be slightly off. That is, unless it uses the original retention times when loading the raw data again? In which case the whole thing is solved  :))

Nonetheless, I am using 'R Under development (unstable) (2019-01-14 r75992)', 'xcms_3.5.1' and 'MSnbase_2.9.3'.
I compiled this version of xcms from the github page ("sneumann/xcms") to utilize the subset feature, so I'm not sure if that version number above is necessarily correct.

Thanks
15
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Hopefully someone with more experience will chime in when they get the chance.

In my experience, limiting the number of available cores will reduce how much total RAM is used/allocated, i.e. RAM usage should be roughly linear with the number of parallel processes. On our lower-spec desktops, we use 4 threads to keep RAM usage below 16 GB.

Based on my understanding, readMSData generates the "onDisk" object without having to load all the raw data into RAM. findChromPeaks will cause each parallel process to load a raw data file from the hard drive into memory, perform the peak detection, then clear the unneeded data from memory. So this step will be memory- and IO-intensive.
After that, retention time alignment may or may not be similarly demanding. Obiwarp works on the raw data, so it will need to be loaded into RAM again. Peakgroups, on the other hand, doesn't require the raw data.
The last operation I usually perform is fillChromPeaks. This, again, requires the raw data to be loaded. In my hands, this seems to be the most memory-intensive step, requiring me to run it single-threaded even with 32 GB of system RAM.

You certainly could get away with changing the number of available cores at different steps. But you might need to experiment to determine what works best for your system. In our case, we ran scripts with different numbers of threads and monitored the system, reducing the number until it was stable.
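
For what it's worth, a sketch of how the worker count can be switched between steps with BiocParallel (SnowParam shown for Windows; MulticoreParam works on Linux/macOS; 'raw_data' and 'cwp' are placeholders for your object and CentWaveParam settings):
Code: [Select]
library(BiocParallel)

# Peak detection: several workers, each loads one raw file at a time
register(SnowParam(workers = 4))
xdata <- findChromPeaks(raw_data, param = cwp)

# Obiwarp alignment also reads the raw data, so keep the worker count modest
xdata <- adjustRtime(xdata, param = ObiwarpParam())

# fillChromPeaks is the most memory-hungry step for us, so drop to a single worker
register(SerialParam())
xdata <- fillChromPeaks(xdata)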