Find same feature in separately processed studies
October 23, 2012, 03:31:48 PM

Hi All,

My problem is to find the common features in two studies (each with many samples) that were run on the same machine but processed separately with XCMS.

Specifically, I aligned the two studies separately using the same XCMS parameters. Now I need to find the features that are common to both studies.

There are two options to do that:

1. Use a fixed approach, where I specify the maximum deviation in mass and in retention time between the two studies (I think this is similar to what metaXCMS does).

How does it work:
a. For each mass in study 1, find all the masses in study 2 that are equal within +/- 0.02 Da.
b. For each common mass, find the retention times in study 2 that are equal within +/- x seconds.

However, we know that the retention time correction is a function of the retention time and is not uniform across all retention times.

This is, for example, the median retention time correction for study 1 (Figure 1). The dashed lines are 1.96 * median absolute deviation.

[attachment=0:23oyzgmy]rt_correction_vs_rt_corrected_twge.pdf[/attachment:23oyzgmy]

This is the same graph for study 2 (Figure 2).

[attachment=1:23oyzgmy]rt_correction_vs_rt_corrected_pivus.pdf[/attachment:23oyzgmy]

The second possibility is:

2. Use a dynamic approach, where I specify the maximum deviation in mass (e.g.
+/- 0.02 Da) and a dynamic deviation in retention time.

A dynamic retention time deviation means that when I look for features with similar retention times, given that they have similar masses, I do not use a fixed range; instead, I let the allowed deviation depend on the retention time.

Let's see an example. I have this feature from study 1: M280.093T32.408
My goal is to determine whether there is a similar feature in study 2.

This is the algorithm I used to compare features and decide whether they are the same:

1. Given a feature i of interest from study 1 with retention time t_i and mass m_i, search for all the features in study 2 with mass m such that |m_i - m| < 0.02. This gives n candidate features x(1)...x(n).

Example: M280.093T31.831 M280.089T144.406 M280.111T333.820 M280.108T454.073 M280.104T578.557 M280.103T890.490 M280.103T299.882

2. Determine the confidence interval of t_i. To do that, you need the median retention time correction (median(to)_s1) and the variability of the retention time correction (mad(to)_s1) at the retention time closest to t_i (Figure 1). These values can be obtained from XCMS (datasetname1@rt) by looping across all individuals to get the median and the variability. Then the upper limit of the confidence interval of t_i is:
U_t_i = t_i + |median(to)_s1 + 1.96*mad(to)_s1|
and the lower limit is:
L_t_i = t_i - |median(to)_s1 - 1.96*mad(to)_s1|

Example: at ~32.408, median(to)_s1 + 1.96*mad(to)_s1 = 0.942 and median(to)_s1 - 1.96*mad(to)_s1 = -0.573, so U_t_i = 33.350 and L_t_i = 31.835

3. Start with feature x(1) and determine the confidence interval of t_x(1) (the retention time of x(1)). To do that, you need the median retention time correction (median(to)_s2) and the variability of the retention time correction (mad(to)_s2) at the retention time closest to t_x(1) (Figure 2).
These values can be obtained from XCMS (datasetname2@rt) by looping across all individuals to get the median and the variability. Then the upper limit of the confidence interval of t_x(1) is:
U_t_x(1) = t_x(1) + |median(to)_s2 + 1.96*mad(to)_s2|
and the lower limit is:
L_t_x(1) = t_x(1) - |median(to)_s2 - 1.96*mad(to)_s2|

Example: at ~31.831, median(to)_s2 + 1.96*mad(to)_s2 = 1.912 and median(to)_s2 - 1.96*mad(to)_s2 = -1.061, so U_t_x(1) = 33.743 and L_t_x(1) = 30.770

4. Check whether the confidence intervals of t_i and t_x(1) overlap.

Example: does [31.835, 33.350] overlap [30.770, 33.743]? TRUE

5. Repeat steps 3 and 4 for x(2)...x(n).

Example: TRUE FALSE FALSE FALSE FALSE FALSE FALSE

I then conclude that M280.093T32.408 from study 1 is the same feature as M280.093T31.831 from study 2.

What do you think? Does this approach make sense?

Best,
Andrea

[attachment deleted by admin]
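
For concreteness, the five steps of the dynamic approach can be sketched in code. This is only a minimal Python sketch under stated assumptions, not XCMS code: the names Feature, rt_interval, overlaps and match_feature are made up for illustration, and the two "band" functions (which would return median - 1.96*mad and median + 1.96*mad of the retention time correction near a given retention time) are hypothetical helpers whose construction from datasetname@rt is left out. The worked example uses only the numbers quoted in the post for x(1).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MASS_TOL = 0.02  # Da, mass tolerance from step 1

# Hypothetical helper type: maps a retention time to the pair
# (median - 1.96*mad, median + 1.96*mad) of the RT correction near that time.
Band = Callable[[float], Tuple[float, float]]


@dataclass(frozen=True)
class Feature:
    name: str
    mz: float
    rt: float


def rt_interval(rt: float, low_offset: float, high_offset: float) -> Tuple[float, float]:
    """Steps 2 and 3: widen rt by the absolute values of the two offsets."""
    return rt - abs(low_offset), rt + abs(high_offset)


def overlaps(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """Step 4: two closed intervals overlap if each starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def match_feature(query: Feature, candidates: List[Feature],
                  band1: Band, band2: Band,
                  mass_tol: float = MASS_TOL) -> List[Feature]:
    """Steps 1 to 5: return the study 2 candidates whose mass is within
    mass_tol of the query and whose RT interval overlaps the query's."""
    query_interval = rt_interval(query.rt, *band1(query.rt))
    matches = []
    for cand in candidates:
        if abs(query.mz - cand.mz) >= mass_tol:  # step 1: mass filter
            continue
        cand_interval = rt_interval(cand.rt, *band2(cand.rt))  # step 3
        if overlaps(query_interval, cand_interval):  # step 4
            matches.append(cand)
    return matches


# Worked example with the offsets quoted in the post (x(1) only).
query = Feature("M280.093T32.408", 280.093, 32.408)
x1 = Feature("M280.093T31.831", 280.093, 31.831)
qi = rt_interval(query.rt, -0.573, 0.942)
xi = rt_interval(x1.rt, -1.061, 1.912)
print([round(v, 3) for v in qi], [round(v, 3) for v in xi], overlaps(qi, xi))
```

With the offsets from the post, the two intervals come out at approximately [31.835, 33.350] and [30.770, 33.743], and the overlap check is true, which agrees with the example in step 4. Passing the bands as functions keeps the retention-time-dependent part separate from the matching logic, so a fixed window (option 1) would just be a band function that returns the same pair for every retention time.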