How to run featureSpectra (or another function) on a subset of samples

Topic: How to run featureSpectra (or another function) on a subset of samples (Read 5569 times) previous topic - next topic

How to run featureSpectra (or another function) on a subset of samples

December 13, 2019, 02:03:11 PM

Dear All,

I'm developing a pipeline that will allow me to extract a common set of features from MSe and DDA acquisitions run on similar samples. See here for more background info: https://groups.google.com/forum/#!category-topic/molecular_networking_bug_reports/ideas-for-new-features/bLmtPLnQrR8

What I'm trying to do now is to export the MS2 spectra from the DDA files only using the featureSpectra function. I want to integrate this with GNPS for FBMN.

If I run it on the unmodified XCMSnExp object, I get an error that it contains MS1 spectra.

Code: [Select]

> filteredMs2Spectra <- featureSpectra(xdata, return.type = "Spectra")
Error: BiocParallel errors
  element index: 9, 10, 11, 12, 13, 14, ...
  first error: This experiment contains MS1 spectra

If I use filterFile, I lose the correspondence results.

Code: [Select]

> MS2.file.names <- key$file.name[which(key$ms.method == "DDA")]
> xdata <- filterFile(xdata, MS2.file.names, keepAdjustedRtime=TRUE)
Correspondence results (features) removed.

The XCMS manual says "All subsetting methods try to ensure that the returned data is consistent. Correspondence results for example are removed if the data set is sub-setted by file, since the correspondence results are dependent on the files on which correspondence was performed. "

Even after reading the above, I still don't fully understand why it is necessary to remove the correspondence results. In my mind, after correspondence and peak filling, the XCMSnExp is a peak table with a bunch of metadata associated with it (I realize that this is an oversimplification and probably not true). And if so, why does correspondence have to be dropped when filtering by file?

At any rate, is there a way to do what I am trying to do? That is, pre-process MSe and DDA together (which I've successfully done) and then export MS2 data from only the DDA files. Basically, I'm trying to arrive at a set of MS1 features that is common to both MSe and DDA datasets. The ultimate goal is to utilize old MSe files to generate metadata about the features and then integrate that metadata into a feature-based molecular network.

Thank you in advance for your insights!
Taylan

Re: How to run featureSpectra (or another function) on a subset of samples

Reply #1 – December 15, 2019, 11:38:46 PM

Hi Metabolon1,

A great deal of work was put into making the XCMS analysis reproducible. This ensures the same results could be obtained just by having the final object i.e. all the data and 'process history' is contained within.

I'll suggest a bit of a hack that you can try, which I'm not able to test at the moment, but hopefully it can help get you somewhere. There is almost certainly a proper way to do what you want, but I can appreciate the frustration that can ensue when things don't work the way you want

This is basically following the calls that featureSpectra makes internally (skipping all the checks along the way...)
featureSpectra

Code: [Select]

[url=https://github.com/sneumann/xcms/blob/557b936967271690140e19224be707d87ea63168/R/functions-XCMSnExp.R#L1804]ms2_spectra_for_all_peaks[/url]
pks <- chromPeaks(xdata)
xdata_filtered <- filterMsLevel(as(xdata, "OnDiskMSnExp"), 2L)
## Split data per file
file_factor <- factor(pks[, "sample"])
pks <- split.data.frame(pks, f = file_factor)
xdata_filtered <- lapply(as.integer(levels(file_factor)), filterFile, object = xdata_filtered)

## You then need to loop through xdata_filtered and pks for the samples you need. Each entry in xdata_filtered becomes 'x' and 'pks' is the corresponding entry in the pks list.
sps <- spectra(x)
pmz <- precursorMz(x)
rtm <- rtime(x)

[url=https://github.com/sneumann/xcms/blob/557b936967271690140e19224be707d87ea63168/R/functions-XCMSnExp.R#L1877]ms2_spectra_for_peaks_from_file[/url]
## Make sure you define all the required parameters i.e. method = 'closest_mz' 
res <- vector(mode = "list", nrow(pks))
for (i in 1:nrow(pks)) {
  if (is.na(pks[i, "mz"]))
    next
  idx <- which(pmz >= pks[i, "mzmin"] & pmz <= pks[i, "mzmax"] &
                 rtm >= pks[i, "rtmin"] & rtm <= pks[i, "rtmax"])
  if (length(idx)) {
    if (length(idx) > 1 & method != "all") {
      if (method == "closest_rt")
        idx <- idx[order(abs(rtm[idx] - pks[i, "rt"]))][1]
      if (method == "closest_mz")
        idx <- idx[order(abs(pmz[idx] - pks[i, "mz"]))][1]
      if (method == "signal") {
        sps_sub <- sps[idx]
        ints <- vapply(sps_sub, function(z) sum(intensity(z)),
                       numeric(1))
        idx <- idx[order(abs(ints - pks[i, "maxo"]))][1]
      }
    }
    res[[i]] <- lapply(sps[idx], function(z) {
      z@fromFile = fromFile
      z
    })
  }
}
names(res) <- rownames(pks)

If that is sounding like too much stuffing around, you can save featureDefinitions, featureValues and chromPeaks. Filter the file to just the DDA samples. Then you can edit these to be internally consistent and write them back to xdata.

After all that, hopefully someone who knows what they are talking about comes in to give you the proper directions

Re: How to run featureSpectra (or another function) on a subset of samples

Reply #2 – December 24, 2019, 12:17:44 PM

Hi CoreyG,

Thanks for the hack! I'll give it a try an see how it works.

Happy Holidays!
Taylan

Re: How to run featureSpectra (or another function) on a subset of samples

Reply #3 – December 24, 2019, 06:17:30 PM

Update: I have yet to succeed. I made some headway with CoreyG's 2nd suggestion:

Quote

If that is sounding like too much stuffing around, you can save featureDefinitions, featureValues and chromPeaks. Filter the file to just the DDA samples. Then you can edit these to be internally consistent and write them back to xdata.

I'm getting an error when I try to incorporate the saved featureDefinitions back into the filtered xdata object.

Code: [Select]

j <- grep("NEG_DDA", fileNames(xdata))
MS2.file.paths <- fileNames(xdata)[j]
MS2.file.names <- gsub(".*/", "", MS2.file.paths)

# extracting parts of xdata & filtering them ------------------
fd <- featureDefinitions(xdata)
fv <- featureValues(xdata)
fv.filtered <- fv[, colnames(fv) %in% MS2.file.names]
cp <- chromPeaks(xdata)
cp.filtered <- cp[which(cp[,which(colnames(cp) == "sample")] %in% j),]

# filtering files in xdata -----------------------------------------
xdata.filtered <- filterFile(xdata, MS2.file.names, keepAdjustedRtime=TRUE)

# changing sample numbers in filtered, extracted chromatographic peak object to match those in filtered xdata object
# if the sample numbers do not match, an error is generated
file_factor <- factor(cp.filtered[, "sample"])
cp.filtered.split <- split.data.frame(cp.filtered, f=file_factor)
cp.filtered.v2 <- c()
for(i in 1:length(cp.filtered.split)){
    cp.filtered.split[[i]][,"sample"] <- i
    cp.filtered.v2 <- rbind(cp.filtered.v2, cp.filtered.split[[i]])
} #end for loop

# incorporate filtered peaks back into filtered xdata
chromPeaks(xdata.filtered) <- cp.filtered.v2

featureDefinitions(xdata.filtered) <- fd

The last line above gives the error message:

Code: [Select]

Error in validObject(object) : 
  invalid class “XCMSnExp” object: Some of the indices in column 'peakidx' of element 'featureDefinitions' do not match rows of the 'chromPeaks' matrix!

To overcome this error, I tried the code below, with the same error message resulting.

Code: [Select]

peak.list <- as.numeric(gsub("CP", "", row.names(chromPeaks(xdata.filtered))))
for(i in 1:length(fd@listData$peakidx)){ # this loop removes chromatographic peaks in "fd" that are not present in cp.filtered.v2 (i.e. chromPeaks(xdata.filtered) )
    k <- which(fd@listData$peakidx[[i]] %in% peak.list)
    fd@listData$peakidx[[i]] <- fd@listData$peakidx[[i]][k]
} # end for loop

featureDefinitions(xdata.filtered) <- fd

I'm not sure how to proceed. Ideas?

Also, regarding the 1st approach CoreyG suggested, what would I do with the 'res' object that is generated by this code?

Many thanks!
Taylan

Re: How to run featureSpectra (or another function) on a subset of samples

Reply #4 – December 26, 2019, 01:02:27 PM

Update: I tested the code from CoreyG's 1st suggestion for a single file...

Code: [Select]

pks <- chromPeaks(xdata) # extract chromatographic peaks as a separate object
xdata_filtered <- filterMsLevel(as(xdata, "OnDiskMSnExp"), 2L) # extract MS2 data as a separate object
## Split data per file
file_factor <- factor(pks[, "sample"]) #define each sample as a separate factor
pks <- split.data.frame(pks, f = file_factor) # make a separate dataframe for each sample
xdata_filtered <- lapply(as.integer(levels(file_factor)), filterFile, object = xdata_filtered)

## You then need to loop through xdata_filtered and pks for the samples you need. Each entry in xdata_filtered becomes 'x'; and 'pks' is the corresponding entry in the pks list.
n <- 1 # index in xdata_filtered and pks
method = 'closest_mz'

sps <- spectra(xdata_filtered[[n]])
pmz <- precursorMz(xdata_filtered[[n]])
rtm <- rtime(xdata_filtered[[n]])

#https://github.com/sneumann/xcms/blob/557b936967271690140e19224be707d87ea63168/R/functions-XCMSnExp.R#L1877
## Make sure you define all the required parameters i.e. method = 'closest_mz'
res <- vector(mode = "list", nrow(pks[[n]]))
for (i in 1:nrow(pks[[n]])) {
   if (is.na(pks[[n]][i, "mz"]))
     next
   idx <- which(pmz >= pks[[n]][i, "mzmin"] & pmz <= pks[[n]][i, "mzmax"] &
                  rtm >= pks[[n]][i, "rtmin"] & rtm <= pks[[n]][i, "rtmax"])
   if (length(idx)) {
     if (length(idx) > 1 & method != "all") {
       if (method == "closest_rt")
         idx <- idx[order(abs(rtm[idx] - pks[[n]][i, "rt"]))][1]
       if (method == "closest_mz")
         idx <- idx[order(abs(pmz[idx] - pks[[n]][i, "mz"]))][1]
       if (method == "signal") {
         sps_sub <- sps[idx]
         ints <- vapply(sps_sub, function(z) sum(intensity(z)),
                        numeric(1))
         idx <- idx[order(abs(ints - pks[[n]][i, "maxo"]))][1]
       }
     }
     res[[i]] <- lapply(sps[idx], function(z) {
       z@fromFile = fromFile
       z
     })
   }
 }

...and got an error message

Code: [Select]

Error in (function (cl, name, valueClass)  : 
  assignment of an object of class “standardGeneric” is not valid for @‘fromFile’ in an object of class “Spectrum2”; is(value, "integer") is not TRUE

Assuming this code does work, I still think I'm missing something. I'm not seeing how looping through this code would produce an output that has a data structure equivalent to calling featureSpectra on xdata. If I loop through the above code for each element in xdata_filtered and pks that correspond to a file with MS2, then each time res will be overwritten by the data from the next file. At any given time, res will have the results from only one file. Any ideas?

Also, as per my previous post, I'm still stuck at the same place regarding CoreyG's 2nd suggestion.

Thanks everyone!
Taylan

Re: How to run featureSpectra (or another function) on a subset of samples

Reply #5 – December 28, 2019, 03:25:09 AM

Great work, Taylan. You've made good progress considering the minimal/incomplete advice I gave. However, I'm sorry you still haven't got a working solution.

I'm been travelling and haven't had much internet/computer access. I'll be able to put together a more complete solution mid next week.

Regarding the second solution, it helps to know that peakidx in featureDefinitions corresponds to the row numbers of chromPeaks (not the row.names). This means that deleting rows in chromPeaks will misalign chromPeaks and peakidx.
The simplest solution is to add a column to chromPeaks that contains the row number. cp<-cbind(cp,rowid=seq(nrow(cp))). Then filter chromPeaks, then 'match' peakidx to the temporary column. match(fd@listData$peakidx[[i ]], cp[,"rowid"]).
This gives you the new peakidx to use. Note that this will probably contain NAs for the missing samples. NA.omit() might be useful to get rid of them.
For consistency, it might be good to overwrite the row.names of cp (not sure if it matters).

This could take a while as match doesn't consider the rowid is a sorted vector. If cp or fd are large, you are doing a whole heap of value lookups.

Regarding solution 1, internally xcms calls a (bioconductor?) Parallel version of lapply. So each file returning 'res' just gets built up into a list. Looking at the xcms code on github might make this simpler to understand than my rambling

This should get you over the line. If not, I'll be able to give you some more help mid week. It's also a good idea to compare the final featurevalue before and after doing this to make sure the results are consistent for the samples still present.
^ again, the above is all from memory. So you might need to use some judgment.

EDIT: looks like some of my code is being scrambled by the wysiwyg editor.
Cheers,
Corey

Re: How to run featureSpectra (or another function) on a subset of samples

Reply #6 – December 31, 2019, 11:54:45 AM

Thanks Dr. Corey! I'll give it a try and report back. - Taylan

Re: How to run featureSpectra (or another function) on a subset of samples

Reply #7 – December 31, 2019, 04:20:59 PM

Hi Dr. Corey,

I think I got it to work (mostly)! Here's the winning code.

Code: [Select]

j <- grep("NEG_DDA", fileNames(xdata)) # index of DDA files
MS2.file.paths <- fileNames(xdata)[j] # file paths of DDA files
MS2.file.names <- gsub(".*/", "", MS2.file.paths) # names of DDA files
fd <- featureDefinitions(xdata) # extract feature defs as a new object
fv <- featureValues(xdata) # extract feature values as a new object
fv.filtered <- fv[, colnames(fv) %in% MS2.file.names] # filter feature values to include only DDA samples
cp <- chromPeaks(xdata) # extract chromatographic peaks as a new object
cp <- cbind(rowid=seq(nrow(cp)), cp) # add a temporary column to cp for matching with 'peakidx' in feature definitions (suggested by CoreyG)
cp.filtered <- cp[which(cp[,which(colnames(cp) == "sample")] %in% j),] # filter cp to include only DDA samples

peakidx.filtered <- list() # create object to store results of loop below
for(i in 1:length(fd@listData$peakidx)){
     temp <- match(fd@listData$peakidx[[i]], cp.filtered[,"rowid"])
     peakidx.filtered[[i]] <- temp[which(!is.na(temp))]
} # end loop; this filters to include only peakidx in fd that correspond to peaks in DDA samples
fd.filtered <- fd # duplicate original feature definitions
fd.filtered@listData$peakidx <- peakidx.filtered # overwrite peakidx in duplicated fd with filtered peakidx generated by loop above

xdata.filtered <- filterFile(xdata, MS2.file.names, keepAdjustedRtime=TRUE) # create a new xdata object with only DDA samples; correspondence results are removed and will be added back in below

file_factor <- factor(cp.filtered[, "sample"]) # outputs a vector to correspond peak number with sample number
cp.filtered.split <- split.data.frame(cp.filtered, f=file_factor) # splits cp.filtered (a dataframe) into a list of dataframes (one for each remaining DDA sample)

cp.filtered.v2 <- c() # new object for storing results
for(i in 1:length(cp.filtered.split)){
    cp.filtered.split[[i]][,"sample"] <- i
    cp.filtered.v2 <- rbind(cp.filtered.v2, cp.filtered.split[[i]])
} #end loop; renumbers samples in filtered cp list, starting at 1

chromPeaks(xdata.filtered) <- cp.filtered.v2 # GOAL !!!
featureDefinitions(xdata.filtered) <- fd.filtered # GOAL !!!

# export MS1 and MS2 features
filteredMs2Spectra <- featureSpectra(xdata.filtered, return.type = "Spectra")
filteredMs2Spectra <- clean(filteredMs2Spectra, all = TRUE)
filteredMs2Spectra <- formatSpectraForGNPS(filteredMs2Spectra)
writeMgfData(filteredMs2Spectra, paste(run.name, "_ms2spectra_all.mgf", sep=""))

# generate data table (i.e. peak table) in format needed for GNPS/FBMN
setwd(save.dir)
featuresDef <- featureDefinitions(xdata.filtered)
featuresIntensities <- featureValues(xdata.filtered, value = "into")
dataTable <- merge(featuresDef, featuresIntensities, by = 0, all = TRUE)
dataTable <- dataTable[, !(colnames(dataTable) %in% c("peakidx"))]
write.table(dataTable, paste(run.name, "_xcms_all.txt", sep=""), sep = "\t", quote = FALSE, row.names = FALSE) # UPLOAD TO GNPS for FBMN

#export MS2 features only
setwd(save.dir)
filteredMs2Spectra_maxTic <- combineSpectra(filteredMs2Spectra,
                                            fcol = "feature_id",
                                            method = maxTic)
writeMgfData(filteredMs2Spectra_maxTic, paste(run.name, "_ms2spectra_maxTic.mgf", sep="")) # UPLOAD TO GNPS for FBMN
filteredDataTable <- dataTable[which(dataTable$Row.names %in% filteredMs2Spectra@elementMetadata$feature_id),]
write.table(filteredDataTable, paste(run.name, "_xcms_onlyMS2.txt", sep=""), sep = "\t", quote = FALSE, row.names = FALSE) #NOTE: apparently this is the table that can be used with GNPS/FBMN

The two key changes were adding rowid to cp and filtering peakidx to remove peaks not present in peakidx. Incidentally, I did not overwrite row names in cp/cp.filtered, and it didn't seem to cause any problems.

However, as you can see, I did not write fv.filtered back into xdata.filtered. It seems that featureValues() is a getter but NOT a setter, so

Code: [Select]

> featureValues(xdata.filtered) <- fv.filtered
Error in featureValues(xdata.filtered) <- fv.filtered : 
  could not find function "featureValues<-"

I couldn't quite figure out the structure of xdata.filtered, but I'm sure there's some way to overwrite the feature values through subsetting. Nonetheless, even without writing fv.filtered back into xdata.filtered, it seems that the feature values are re-populated somehow after

Code: [Select]

featureDefinitions(xdata.filtered)<-fd.filtered

I'm not quite sure how this happens. And while the newly populated feature values have the same dimensions as fv.filtered, they are not identical:

Code: [Select]

> all(fv.filtered == featureValues(xdata.filtered))
[1] FALSE

Visual side-by-side inspection of fv.filtered and featureValues(xdata.filtered) shows consistency for some samples [with very small values in fv.filtered replaced with "NA" in featureValues(xdata.filtered)], but the values for other samples are completely different.

Looking at str(xdata.filtered), I don't see any obvious objects to replace with fv.filtered. I thought maybe xdata.filtered@featureData@data$totIonCurrent would correspond to the values in featureValues(xdata.filtered), but they don't seem to.

Looking at the code for featureValues() , I don't see anything obvious. It may need to look at this code again with rested eyes.

Any idea how to get fv.filtered back into xdata.filtered?

Thanks!
Taylan