Recent Posts
11
XCMS Online / Re: Discrepancy between XCMS online and XCMS R-Package
Last post by johannes.rainer -
AFAIK xcmsOnline uses an old version of xcms (that's what you see in the logs) - and most likely also an old version of R. The development of xcms and xcmsOnline somehow diverged at some point and all the new developments in xcms (aka xcms3) are not used/available in the online version.
 
Some discrepancies could possibly be explained by the changes we made to xcms during the update and modernization (have a look also at the xcms vignettes https://bioconductor.org/packages/release/bioc/html/xcms.html, specifically "New and modified functionality in xcms").

For a thorough comparison one would however have to start with a standalone R xcms version 1.47.3 and compare its results to those of xcmsOnline. It could be that xcmsOnline has some internal helper functions and modifications that are not available in the standalone R version of xcms ... but I am only guessing here.
12
XCMS / Re: Scan numbering for DDA/IDA-experiments
Last post by johannes.rainer -
I would be careful with the scmin/scmax and lmin/lmax columns - I do not recall exactly what they mean. By default we do not record from which spectrum the data of a chromatographic peak comes, but with the retention time and m/z range available it is easy to subset/extract all spectra for one chromatographic peak.
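As a minimal sketch of what I mean (object names are assumptions, not from this thread: 'raw_data' would be the OnDiskMSnExp from readMSData() and 'pks' the matrix returned by chromPeaks()):
Code: [Select]
## Extract all spectra that fall within the rt/m/z window of one chromatographic peak
library(MSnbase)

pk <- pks["CP000001", ]                              # one chromatographic peak
sub <- filterRt(raw_data, rt = c(pk["rtmin"], pk["rtmax"]))
sub <- filterMz(sub, mz = c(pk["mzmin"], pk["mzmax"]))

pk_spectra <- spectra(sub)   # list of Spectrum objects for this peak
length(pk_spectra)           # number of scans within the peak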

What exactly do you want/need to do with the data? Maybe there is a simple solution for that...
13
XCMS / Re: Scan numbering for DDA/IDA-experiments
Last post by CoreyG -
Hi Tony,

While I don't know the answers to your questions, you could probably test it by filtering out the MS/MS data using MSConvert. Compare the original data to the filtered data and see if they are the same.

Alternatively, you could take a look at one of the files like this:
Code: [Select]
# Load MSnbase and read the first data file (MS1 level only)
library(MSnbase)
raw_data <- readMSData(files = files[1], pdata = new("NAnnotatedDataFrame", pd), msLevel. = 1)

# Extract the spectra
spec <- spectra(raw_data)

# Iterate through the spectra list and obtain the relevant information
specl <- lapply(spec, function(x) c(x@msLevel, x@rt, x@scanIndex))

# Compact the list into a matrix
specl <- do.call(rbind, specl)

This will create a matrix that contains the msLevel, retention time and scan number for every scan in the data. It shouldn't take long to cross-reference a couple of peaks to confirm whether 'scmin/scmax' needs to be adjusted. If so, you have just generated a matrix that can be used to calculate the new scan counts!
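For example, a rough cross-check of one peak from your table could look like this (assuming the chromPeaks() output is stored in 'pks'):
Code: [Select]
## Give the columns of specl sensible names
colnames(specl) <- c("msLevel", "rt", "scanIndex")

## Pick one chromatographic peak and count the scans inside its rt window
pk <- pks["CP000001", ]
in_peak <- specl[, "rt"] >= pk["rtmin"] & specl[, "rt"] <= pk["rtmax"]

sum(in_peak)                    # scans in the rt window (MS1 only, as read above)
pk["scmax"] - pk["scmin"] + 1   # scan count implied by scmin/scmax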

Regarding lmin and lmax: I am not too sure about these. They can be found in the centWave code, where they appear to be the positions of the minima of the 'continuous wavelet transform' (CWT) coefficients on either side of a peak. Interestingly, it looks like rtmin/rtmax are defined using the same values.
I could be way off, but I think scmin/scmax are estimates of the peak bounds based on the CWT.

As it looks like 'into' is calculated using the bounds defined in lmin/lmax, perhaps these are the better columns for you to use. But do note that these values are not always referenced from scan 1.
14
XCMS Online / Re: Discrepancy between XCMS online and XCMS R-Package
Last post by Purva Kulkarni -
Hi,

I have been trying to do a similar comparison recently with identical parameters using XCMS Online (3.7.1 as displayed on their website) and the R package (version 3.4.2). I also get different results in terms of the number of features. When I checked the log file created by XCMS Online, I found the following header text:
------------------------------------------------------------------------
XCMSOnline  version  2.7.2
XCMS        version  1.47.3
CAMERA      version  1.34.0
------------------------------------------------------------------------

I am not sure why the XCMS Online log file displays old version numbers for XCMS Online and XCMS. I have also reported this to the XCMS group via their contact form.
Hope someone here can help me to make a fair comparison.
15
XCMS / Scan numbering for DDA/IDA-experiments
Last post by Tony -
Dear Forum,

we are conducting metabolomics experiments on an AB Sciex 5600 TripleTOF with DDA (Data Dependent Acquisition), using R 3.6.0 with MSnbase (2.9.5) and xcms (3.5.5).
So the raw data files intrinsically contain both MS1 and MS2 scans.

The question is how to obtain the correct number of scans per peak in one file. The raw data was read with:
readMSData(files = files, pdata = new("NAnnotatedDataFrame", pd), msLevel. = 1)

Given the information from the function chromPeaks(object, bySample = FALSE, rt = numeric(), mz = numeric(), ppm = 0, type = "any"), this results in the following table.

               mz    mzmin    mzmax     rt  rtmin  rtmax      into      intb     maxo  sn egauss mu sigma  h  f dppm scale scpos scmin scmax lmin lmax sample is_filled
CP000001 185.0415 185.0409 185.0423 46.568 40.745 53.293 1623.3019 1605.9049 193.8009  25     NA NA    NA NA  6    1     9   169   160   178  148  185      1         0
CP000002 185.0419 185.0409 185.0429  3.887  0.724  6.577  763.4537  755.4926 170.2468  21     NA NA    NA NA  6    4     7    15     8    22    3   25      1         0
CP000003 512.8859 512.8845 512.8887 51.321 49.069 52.634  322.0898  319.1189 175.0130 174     NA NA    NA NA  7    8     7   182   175   189   87   93      1         0
CP000004 271.9464 271.9443 271.9484 51.321 48.780 53.293  303.1867  299.2378 142.4416 141     NA NA    NA NA  8    8     7   182   175   189   87   95      1         0
CP000005 385.9267 385.9250 385.9298 51.321 48.780 53.293  275.9011  271.9522 131.7186 131     NA NA    NA NA  9    5     7   182   175   189   87   95      1         0
CP000006 498.9059 498.9042 498.9077 50.666 49.069 53.293  256.1620  252.5414 133.5325 133     NA NA    NA NA 10    7     7   181   174   188   87   94      1         0

Is it okay to use the columns "scmin" and "scmax", i.e. to compute scmax - scmin to get the correct number of scans for each peak,
or do we need to take into account that several scans have to be omitted because they are MS2 scans?

Basically, the question (for DDA experiments) condenses to how the scan numbering works:
How are the MS1 scans numbered intrinsically?
How are the MS2 scans numbered intrinsically?

By the way, what is the meaning of the columns lmin and lmax? I could not find them explained in the documentation of chromPeaks() ...

Thanks for an answer.

kind regards
Tony
17
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Last post by CoreyG -
Quote from: metabolon1
On another matter, I’m wondering how to cut down on the number of features while still maintaining a low intensity threshold. Currently I have ~13,000 features. My goal is to be able to get my peak table into EZinfo, which is not able to handle my 1950x13,000 peak table. I am interested in minor compounds, so I don’t just want to filter by intensity. I have a few ideas, and I would love it if anyone could offer feedback.
Depending on what you are trying to do, you could get the median peak area from each set of triplicates. That will cut your sample rows to roughly a third.
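As a small sketch (object names are only placeholders): if 'ft' is your features x injections matrix (e.g. from featureValues()) and 'injection_extract' maps each injection to its extract, then:
Code: [Select]
## Collapse triplicate injections to their median peak area per extract
ft_med <- sapply(split(seq_len(ncol(ft)), injection_extract),
                 function(idx) apply(ft[, idx, drop = FALSE], 1, median, na.rm = TRUE))

dim(ft_med)   # features x extracts: roughly a third of the original columns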

Alternatively, you could filter the dataset to remove isotopes. Depending on the average number of carbons in your metabolites and the signal/abundance, you might be able to reduce the dimensions 2-4 fold. The same can be done by removing adducts.
I've briefly played around with CAMERA for this, but ended up using mz.unity.
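If you want to try the CAMERA route, a very rough sketch could look like the following (treat it only as a starting point, since I ended up using mz.unity; 'xdata' is assumed to be your processed XCMSnExp object):
Code: [Select]
library(CAMERA)

xset <- as(xdata, "xcmsSet")        # CAMERA works on the older xcmsSet class
an   <- xsAnnotate(xset)
an   <- groupFWHM(an)               # group co-eluting peaks
an   <- findIsotopes(an, ppm = 10)  # annotate isotope peaks

pl <- getPeaklist(an)
## Keep monoisotopic or unannotated peaks, drop the annotated [M+1], [M+2], ... peaks
keep <- pl$isotopes == "" | grepl("\\[M\\]", pl$isotopes)
pl_noiso <- pl[keep, ]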

As a less useful suggestion, is it possible to use R for your data analysis?
For a lot of multivariate analysis, mixOmics does pretty well. The website has a lot of examples and the inbuilt plotting functions have come a long way.

Quote from: metabolon1
One idea is to set strict filters during correspondence to cut down on the number of “false” features. I tried re-running XCMS on this same dataset but using different params. In particular, for correspondence, I changed the way sample groups are defined. Previously, all samples were in the same group. This time, I defined each sample as its own group (so ~600 groups). For PeakDensityParam, I set minFraction=2/3 & minSamples=2. My thinking was that a true feature would be present in all 3 injections of a sample, but I set the cutoff to 2 out of 3 to be on the safe side. In this way, I hoped to eliminate false features. At any rate, the correspondence step took much longer than before, and I ran out of memory before the script was completed. I tried a couple times with the same result.
My thoughts on this differ from many in the 'untargeted' scene. I'm really only interested in features that are present in nearly all samples (<10% missing values). So, I always ask whether people expect to see features entirely missing from certain samples/groups.

The nice thing about XCMS is that you can set these parameters fairly loose early in the workflow. Then after fillChromPeaks, you can be more stringent.
With so many samples, I would imagine that seeing the same feature in multiple groups is an almost certainty. So maybe put every sample in 1 group, but set minFraction=10/600 (or something of that sort).
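In xcms3 syntax that could look roughly like this (parameter values are only placeholders; 'xdata' is assumed to be the XCMSnExp after peak detection):
Code: [Select]
library(xcms)

## Loose correspondence: a feature only needs to show up in ~10 of the ~600 samples
pdp <- PeakDensityParam(sampleGroups = rep(1, length(fileNames(xdata))),
                        bw = 30, minFraction = 10 / 600)
xdata <- groupChromPeaks(xdata, param = pdp)
xdata <- fillChromPeaks(xdata)

## After gap filling, be more stringent: drop features with many remaining NAs
ft <- featureValues(xdata, value = "into")
keep <- rowMeans(is.na(ft)) < 0.10   # e.g. < 10 % missing values
ft <- ft[keep, ]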

I'd love to hear other people's thoughts on this, as well.

Quote from: metabolon1
Another idea is to filter out less informative markers based on variation of each feature across my sample set. My idea is to calculate the coefficient of variation for each marker across the entire dataset, and then exclude any markers below a certain CV value. I understand that p value and fold-change are often used for this kind of filtering, but as I understand it, these only make sense if the dataset contains multiple experimental groups. I don’t have any groups in my dataset; this is just an exploratory analysis. Does anyone have knowledge of or experience with filtering in this way? Any papers that you can suggest? How to determine an appropriate cutoff value for CV?

Thanks!
This is certainly a way you could go.
Perhaps there is a way to empirically determine a good CV cutoff?
If CV is mostly driven by the biological grouping, then you could determine the difference in CV between using all injections and using the averaged triplicates. Determine the threshold CV by permuting the biological grouping and repeating the process (you will end up averaging random non-triplicates). Whatever the 95th percentile of those permuted values is, that is your critical value.
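A rough sketch of one way to implement that (all object names are assumptions: 'ft' is a features x injections matrix, 'injection_extract' the triplicate grouping of each injection):
Code: [Select]
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

cv_all <- apply(ft, 1, cv)   # per-feature CV across all injections

## Average injections according to some grouping, then recompute the CVs
avg_by <- function(groups)
    sapply(split(seq_len(ncol(ft)), groups),
           function(idx) rowMeans(ft[, idx, drop = FALSE], na.rm = TRUE))

d_true <- cv_all - apply(avg_by(injection_extract), 1, cv)   # CV drop from true triplicates

## Null distribution: permute the grouping so random non-triplicates get averaged
set.seed(1)
d_perm <- replicate(100, cv_all - apply(avg_by(sample(injection_extract)), 1, cv))

quantile(d_perm, 0.95, na.rm = TRUE)   # candidate critical value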
18
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Last post by CoreyG -
Quote from: metabolon1
Hello folks,

Here's an update from my end.

I returned from vacation to find CoreyG's helpful responses. It turns out that I was not using "value='into'". I changed this param, and now my data look much better.
Glad to hear I could be of help.

Quote from: metabolon1
I've been using the Brunius batchCorr package, because I already know how to use R. However, given the characteristics of my dataset, I wonder if it is adequate.

Characteristics:
-- ~1950 files representing ~570 plant extracts (triplicate injection) + QC samples
-- 13 batches
-- All extracts are from the same species
-- The QC sample is an extract of about 40 accessions pooled together. However, it looks quantitatively different than most of the extracts in the sample set: the later eluting peaks of the QC sample are generally bigger while the early peaks are smaller. I don't think there are many qualitative differences between QC and other samples. However, I can imagine that these might translate into presence/absence differences in the peak table for minor compounds.
The differences between QCs and samples shouldn't be that big of a deal.
Depending on what batch correction method you use, you can assess the improvement in CV (RSD) of the QC features to determine how useful the batch correction was. Now, if the batch correction method optimized itself based on minimizing QC variation, then this approach is biased. Cross-validation should then be used to assess performance.

A simple visualization is to plot the pre-corrected CVs on the x-axis and the post-corrected CVs on the y-axis. Points that fall below the diagonal were improved; points on the diagonal weren't affected; points above the diagonal were negatively affected.
This may be an easy way to get a 'gut' feel for what method works best for you.
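Something like this, for example (names assumed: 'ft_raw' and 'ft_corr' are features x QC-injection matrices before and after correction):
Code: [Select]
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

cv_pre  <- apply(ft_raw,  1, cv)
cv_post <- apply(ft_corr, 1, cv)

plot(cv_pre, cv_post,
     xlab = "CV before correction", ylab = "CV after correction",
     pch = 16, col = rgb(0, 0, 0, 0.3))
abline(0, 1, col = "red")   # points below the line were improved by the correction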

Quote from: metabolon1
-- The extracts--other than QC--are not standardized by concentration or by equivalent weight of plant material. There is a range of weight of plant material that was extracted. Nonetheless, I do have for each sample the weight of plant material extracted and the weight of solvent used for extraction. From these values, I have generated a sample:solvent correction factor.
-- This is a pilot dataset and was not intended for publication.

My thinking is, now that the batch correction has been done, the next step is to apply the sample:solvent correction factor. The simplest thing to do would be, for each feature in a sample, divide the peak area value by the correction factor for that sample. However, I realize that detector response may not be linear in the range of interest for each feature; thus, the results may not be completely accurate. Nonetheless, I can't think of a better option. Any feedback on my approach?

This is a fairly common approach. Of course, you should always try to keep the sample:solvent ratio as consistent across all samples as possible. Remember that different sample:solvent ratios will cause variability in extraction efficiency, ionization and detector response.

If you are concerned about introducing associations into your data, consider using a linear model to remove the correction factor.
Get the residuals from lm(peakArea~correctionFactor). This allows the detector response to not be 1:1, but doesn't do much for non-linearity.
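As a small sketch of that idea (object names assumed: 'ft' is a features x samples matrix, 'cf' the per-sample correction factor):
Code: [Select]
## Regress each feature on the correction factor and keep the residuals;
## adding the feature mean back keeps the values on the original intensity scale.
## na.exclude keeps missing values in place so the matrix dimensions are preserved.
ft_adj <- t(apply(ft, 1, function(y) {
    fit <- lm(y ~ cf, na.action = na.exclude)
    residuals(fit) + mean(y, na.rm = TRUE)
}))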
19
XCMS / Re: Using xcmsSet getting Error in R_nc4_close: NetCDF: Not a valid ID
Last post by CoreyG -
Hi Dominic,

The issue you are having was recently noted on the xcms github page: NetCDF: Not a valid ID error when using CentOS with NetCDF library 4.6.2

It looks like Johannes has already fixed the issue. You can install the patched version using the command:
Code: [Select]
devtools::install_github("sneumann/xcms", ref = "RELEASE_3_8")

Hopefully that fixes the issue for you!
20
Other / Re: Peak alignment with large dataset (over 2500 samples and growing)
Last post by metabolon1 -
On another matter, I’m wondering how to cut down on the number of features while still maintaining a low intensity threshold. Currently I have ~13,000 features. My goal is to be able to get my peak table into EZinfo, which is not able to handle my 1950x13,000 peak table. I am interested in minor compounds, so I don’t just want to filter by intensity. I have a few ideas, and I would love it if anyone could offer feedback.

One idea is to set strict filters during correspondence to cut down on the number of “false” features. I tried re-running XCMS on this same dataset but using different params. In particular, for correspondence, I changed the way sample groups are defined. Previously, all samples were in the same group. This time, I defined each sample as its own group (so ~600 groups). For PeakDensityParam, I set minFraction=2/3 & minSamples=2. My thinking was that a true feature would be present in all 3 injections of a sample, but I set the cutoff to 2 out of 3 to be on the safe side. In this way, I hoped to eliminate false features. At any rate, the correspondence step took much longer than before, and I ran out of memory before the script was completed. I tried a couple times with the same result.

Another idea is to filter out less informative markers based on variation of each feature across my sample set. My idea is to calculate the coefficient of variation for each marker across the entire dataset, and then exclude any markers below a certain CV value. I understand that p value and fold-change are often used for this kind of filtering, but as I understand it, these only make sense if the dataset contains multiple experimental groups. I don’t have any groups in my dataset; this is just an exploratory analysis. Does anyone have knowledge of or experience with filtering in this way? Any papers that you can suggest? How to determine an appropriate cutoff value for CV?

Thanks!