Hi All,
My problem is to find the common features in two studies (each with many samples) that were run on the same machine but processed separately with XCMS.
Specifically, I aligned the two studies separately using the same XCMS parameters.
Now I need to find the features that are common to both studies. There are two options:
1. Fixed approach, where I specify the max deviation in masses and retention times between the two studies (I think this is similar to what metaXCMS does).
How does it work:
a. Find all the masses in study 1 that are equal +/- 0.02 Dalton in study 2.
b. For each common mass, find the retention times that are equal +/- x seconds in study 2.
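A minimal sketch of the fixed approach in Python, only to make the two steps concrete. The function name `match_fixed` and the retention time tolerance of 10 seconds are assumptions (the post leaves "x seconds" open); the features are the ones used in the example further down.

```python
from typing import List, Tuple

Feature = Tuple[float, float]  # (mass in Da, retention time in seconds)


def match_fixed(study1: List[Feature], study2: List[Feature],
                mz_tol: float = 0.02, rt_tol: float = 10.0) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) whose mass and retention time both agree
    within fixed tolerances. rt_tol=10.0 is a placeholder value."""
    pairs = []
    for i, (m1, t1) in enumerate(study1):
        for j, (m2, t2) in enumerate(study2):
            # Step a: mass within +/- mz_tol; step b: RT within +/- rt_tol.
            if abs(m1 - m2) <= mz_tol and abs(t1 - t2) <= rt_tol:
                pairs.append((i, j))
    return pairs


study1 = [(280.093, 32.408)]
study2 = [(280.093, 31.831), (280.089, 144.406), (280.111, 333.820),
          (280.108, 454.073), (280.104, 578.557), (280.103, 890.490),
          (280.103, 299.882)]
fixed_pairs = match_fixed(study1, study2)
```

With these inputs only the first study 2 feature falls inside both windows. The nested loop is quadratic, which is fine for a sketch but would need sorting or binning on real feature tables.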
However, we know that the retention time correction is a function of the retention time and not uniform for all the retention times.
Figure 1 shows, as an example, the median retention time correction for study 1. The dashed lines mark the median +/- 1.96 * the median absolute deviation.
[Figure 1: rt_correction_vs_rt_corrected_twge.pdf]
Figure 2 is the same graph for study 2.
[Figure 2: rt_correction_vs_rt_corrected_pivus.pdf]
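The curves in these two graphs could be computed roughly as in the sketch below. It assumes that the raw and corrected retention times have been exported per sample (for example from the @rt slot of the XCMS object), that every sample has the same number of scans, and that the correction is defined as corrected minus raw. The scaling constant 1.4826 mirrors the default of R's `mad()`, which is also an assumption about how the variability was calculated. The input numbers are invented purely for illustration.

```python
from statistics import median


def mad(values, constant=1.4826):
    """Median absolute deviation, scaled as R's mad() does by default."""
    med = median(values)
    return constant * median([abs(v - med) for v in values])


def correction_band(raw, corrected, z=1.96):
    """raw, corrected: one list of retention times per sample, same shape.
    Returns (median, lower, upper) of the correction at each scan position."""
    band = []
    for k in range(len(raw[0])):
        # Correction applied to scan k in every sample (assumed: corrected - raw).
        shifts = [c[k] - r[k] for r, c in zip(raw, corrected)]
        med = median(shifts)
        spread = z * mad(shifts)
        band.append((med, med - spread, med + spread))
    return band


# Toy input: three samples, two scans each (made-up values).
raw = [[10.0, 20.0], [10.0, 20.0], [10.0, 20.0]]
corrected = [[10.5, 20.0], [11.0, 21.0], [10.0, 22.0]]
band = correction_band(raw, corrected)
```

Each tuple in `band` corresponds to one point on the solid line and the two dashed lines of the graphs.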
The second possibility is:
2. Dynamic approach, where I specify the max deviation in masses (e.g. +/- 0.02 Da) and a dynamic deviation in retention time.
A dynamic retention time deviation means that when I try to find features with similar retention time, given that they have similar mass, I do not use a fixed range; instead I allow the deviation in retention time to depend on the retention time.
Let's see an example:
I have this feature from study 1: M280.093T32.408
My goal is to determine if there is a similar feature in study 2.
This is the algorithm I used to compare features and find out if they are the same:
1. For the feature i of interest from study 1, with retention time t_i and mass m_i, search for all the features in study 2 with mass m such that |m_i-m| < 0.02. Obtain features x(1)...x(n).
Example: M280.093T31.831 M280.089T144.406 M280.111T333.820 M280.108T454.073 M280.104T578.557 M280.103T890.490 M280.103T299.882
2. Determine the confidence interval of t_i.
To do that you need to get the median retention time correction (median(to)_s1) and the variability of the retention time correction (mad(to)_s1) for the retention time closest to t_i (Figure 1). These values can be obtained from XCMS (datasetname1@rt), looping across all individuals to get the median and the variability.
Then the upper confidence limit of t_i is:
U_t_i = t_i + |median(to)_s1 + 1.96*mad(to)_s1|
and the lower confidence limit is:
L_t_i = t_i - |median(to)_s1 - 1.96*mad(to)_s1|
Example: at ~32.408, median(to)_s1 + 1.96*mad(to)_s1 = 0.942 and median(to)_s1 - 1.96*mad(to)_s1 = -0.573, so U_t_i = 33.350 and L_t_i = 31.835.
3. Start with feature x(1) and determine the confidence interval of t_x(1) (the retention time of x(1)).
To do that you need to get the median retention time correction (median(to)_s2) and the variability of the retention time correction (mad(to)_s2) for the retention time closest to t_x(1) (Figure 2). These values can be obtained from XCMS (datasetname2@rt), looping across all individuals to get the median and the variability.
Then the upper confidence limit of t_x(1) is:
U_t_x(1) = t_x(1) + |median(to)_s2 + 1.96*mad(to)_s2|
and the lower confidence limit is:
L_t_x(1) = t_x(1) - |median(to)_s2 - 1.96*mad(to)_s2|
Example: at ~31.831, median(to)_s2 + 1.96*mad(to)_s2 = 1.912 and median(to)_s2 - 1.96*mad(to)_s2 = -1.061, so U_t_x(1) = 33.743 and L_t_x(1) = 30.770.
4. Check if the confidence intervals of t_i and t_x(1) overlap.
Example: [31.835, 33.350] overlaps [30.770, 33.743]: TRUE
5. Repeat steps 3 and 4 for x(2)...x(n).
Example: TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Then I conclude that M280.093T32.408 from study 1 is the same as M280.093T31.831 from study 2.
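Steps 1 to 5 can be sketched in Python as follows. The names `rt_interval`, `overlaps`, `match_dynamic`, `band_s1` and `band_s2` are hypothetical. Each band function is assumed to return the pair (median + 1.96*mad, median - 1.96*mad) of the retention time correction nearest to a given retention time. The post only gives these values at ~32.408 for study 1 and at ~31.831 for study 2, so the sketch reuses the study 2 pair for every candidate, which is a simplification; a real lookup would read them from the curve in Figure 2.

```python
def rt_interval(t, upper_shift, lower_shift):
    """Confidence interval of t as defined in steps 2 and 3:
    (t - |median - 1.96*mad|, t + |median + 1.96*mad|)."""
    return (t - abs(lower_shift), t + abs(upper_shift))


def overlaps(a, b):
    """True if the closed intervals a and b share at least one point (step 4)."""
    return a[0] <= b[1] and b[0] <= a[1]


def match_dynamic(feature, candidates, band1, band2, mz_tol=0.02):
    """Return [(candidate, flag), ...] for candidates within the mass tolerance."""
    m_i, t_i = feature
    ci_i = rt_interval(t_i, *band1(t_i))           # step 2
    results = []
    for m, t in candidates:
        if abs(m_i - m) >= mz_tol:                 # step 1: mass filter
            continue
        ci_x = rt_interval(t, *band2(t))           # step 3
        results.append(((m, t), overlaps(ci_i, ci_x)))  # steps 4 and 5
    return results


def band_s1(t):
    return (0.942, -0.573)    # values quoted in the post for ~32.408


def band_s2(t):
    return (1.912, -1.061)    # values quoted for ~31.831, reused for all t (assumption)


feature_s1 = (280.093, 32.408)
candidates_s2 = [(280.093, 31.831), (280.089, 144.406), (280.111, 333.820),
                 (280.108, 454.073), (280.104, 578.557), (280.103, 890.490),
                 (280.103, 299.882)]
dynamic_result = match_dynamic(feature_s1, candidates_s2, band_s1, band_s2)
flags = [ok for _, ok in dynamic_result]
```

Under these assumptions `flags` is TRUE for the first candidate only, in line with the example above. If more than one candidate overlapped, an extra rule (for instance the smallest retention time difference) would be needed to pick one.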
What do you think? Does this approach make sense?
Best,
Andrea