
Dealing with outliers...

I was checking my data quality before running PCA/PLS, but I am confused by the different articles I've read on this. Basically, they say that Hotelling's T2 identifies severe outliers and DModX identifies moderate outliers.

What I am seeing in my data:
          - My QCs all cluster at the center of the PCA.
          - I have some samples located outside the Hotelling's T2 ellipse.
          - Depending on how many PCA components I include in the DModX plot, different samples exceed the D-Crit value.

My questions:
          - How should I decide how many PCA components to include in the DModX?
          - Should I remove all outliers detected by the DModX graph and the Hotelling's T2 plot from my data set to prevent skewing of my PCA/PLS?
          - Is it possible that the outliers are of interest and that I should keep them in further analysis?

I really appreciate any and all advice!!!!

Re: Dealing with outliers...

Reply #1
Hi Rebecca,
I'm aware that you posted your question many months ago, but since I am also investigating outlier handling, I thought I'd ask whether you had found any answers that you could share.
I posted a more general question on ResearchGate and got some answers, in case others can benefit from them:

https://www.researchgate.net/post/Outlier_detection_and_handling_in_metabolomics-how_to_do_better_than_obvious_outliers_on_PCA#view=5e0cf872a7cbafd5090b12f4

I'm curious about this Hotelling's T2 ellipse but have no experience yet with the test.

Thanks for any new input you may have.



Re: Dealing with outliers...

Reply #2
Hi Julie,

I ended up contacting Dr. Jasper Engel, who is a biostatistician. Below is his response:

Quote
I think it would be useful to read the following paper: Hubert, M., Rousseeuw, P. J., & Vanden Branden, K. (2005). ROBPCA: a new approach to robust principal component analysis. Technometrics, 47(1), 64-79.

As shown in figure 1 of the paper, three different types of outliers can be detected with respect to a PCA subspace. So-called good leverage points are observations that lie close to the PCA subspace, but far from the majority of the observations (i.e. they have extreme scores). The PCA subspace captures the patterns in the data that are associated with the largest variance between the metabolic profiles of the observations. The metabolomic profile of a good leverage point matches well with these patterns (e.g. correlations between metabolites) but it is more extreme. An example would be a urine NMR spectrum of a sample that is much more diluted compared to the other samples. Good leverage points will have a large Hotelling T2-value, but a small DmodX-value.

Orthogonal outliers have a large orthogonal distance to the PCA space. These have high values for DmodX. This happens if the metabolic profile of this observation contains some abnormality, i.e. a metabolite that is not present in the majority of the observations, or abnormal patterns of metabolite concentrations.

Bad leverage points have a large orthogonal distance, and their scores are also far from the majority of the observations. They therefore have large values for both Hotelling's T2 and DmodX.

In other words, Hotelling's T2 and DmodX detect different types of outliers, independent of their severity.

Given the definitions of Hotelling's T2 and DmodX as used in PCA, it is no surprise that the set of samples marked as outliers changes as a function of the number of PCs selected. Unfortunately, there is no clear rule for determining how many components to include. Depending on your software and application, you could base the number of selected components on the percentage of explained variance, a breakpoint in a scree plot, or some cross-validated measure such as PRESS.

I would not always remove all outliers as indicated by PCA. By definition, about 5 percent of your observations will be marked as outliers regardless of whether they really are outlying or not (5% is your significance level). This is very similar to the type I error in, e.g., a t-test. What I would do is visualize the outliers in an outlier map (see the paper referred to above) and mark any observation whose distances are clearly far away from those of the other observations. If you have a good explanation for why such an observation could be an outlier (e.g. you dropped the sample on the floor), I would remove it from further analysis. However, typically things are not so clear. Also, please note that samples that are extreme from a statistical perspective (given quite restrictive assumptions about the distribution of your data) are not necessarily biologically outlying. Perhaps there really are subpopulations in your controls that you should take into account during analysis. Therefore, I would carry out further analysis with and without the potential outliers. If the results are quite similar, you don't have to worry about them. If the results are very dissimilar, you should carefully report and discuss this in your paper.

He also suggested this paper: Bro, R., & Smilde, A. K. (2014). Principal component analysis. Analytical Methods, 6(9), 2812-2831.

I didn't have the capability to make an outlier map as Jasper suggested, so for better or for worse I ended up using both the Hotelling's T2 plot and DModX to find samples that were outside the cutoff for both tests, and to ensure that the quality control (QC) samples clustered together to check data normality. I then went back through all of the potential outliers to see whether there was a reason for them to be outliers, i.e. poor instrument performance, sample handling, etc. If there was such a reason, I removed the sample. If there wasn't, I chalked it up to biological variability and left the outlier in the group.
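
For readers whose software also lacks an outlier map, here is a minimal sketch in Python/NumPy of the two distances it plots and of the three outlier types described in the quote. It is only an illustration under stated assumptions: it uses classical PCA via SVD (not the robust ROBPCA fit from the Hubert et al. paper), the orthogonal distance is the plain residual norm (a DModX-like quantity, not necessarily the normalized DModX a given package reports), and the function names and cutoff arguments are hypothetical. Cutoffs must be supplied by the user, for example from the critical values the software reports.

```python
import numpy as np


def pca_distances(X, n_components):
    """Return (t2, od) for each row of X from a classical PCA model.

    t2: Hotelling's T2 (squared scores scaled by each component's variance).
    od: Euclidean norm of the residual, a DModX-like orthogonal distance.
    Assumption: classical (non-robust) PCA on mean-centered data.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # mean-center each variable
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                       # loadings (variables x components)
    T = Xc @ P                                    # scores (samples x components)
    score_var = s[:n_components] ** 2 / (X.shape[0] - 1)
    t2 = np.sum(T ** 2 / score_var, axis=1)       # distance within the PCA subspace
    od = np.linalg.norm(Xc - T @ P.T, axis=1)     # distance to the PCA subspace
    return t2, od


def classify_outliers(t2, od, t2_cut, od_cut):
    """Label each sample by the quadrant of the outlier map it falls in.

    t2_cut and od_cut are user-supplied cutoffs (hypothetical parameters).
    """
    labels = np.full(t2.shape, "regular", dtype=object)
    labels[(t2 > t2_cut) & (od <= od_cut)] = "good leverage"
    labels[(t2 <= t2_cut) & (od > od_cut)] = "orthogonal outlier"
    labels[(t2 > t2_cut) & (od > od_cut)] = "bad leverage"
    return labels
```

Plotting `od` against `t2` (or its square root) with the two cutoff lines gives a simple outlier map. Because classical PCA is itself pulled toward extreme samples, the flags from this sketch can differ from those of a robust fit.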

For the PCA component number I chose to use the breakpoint in a scree plot.
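
As a rough illustration of that choice, the sketch below computes the explained-variance fractions that a scree plot displays and picks the breakpoint with a crude heuristic: the point where the curve bends most sharply (largest second difference). The heuristic and the function names are assumptions made for illustration only; in practice the breakpoint is usually judged by eye, and the sketch may not match what any particular software does.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                        # mean-center each variable
    s = np.linalg.svd(Xc, compute_uv=False)        # singular values, descending
    var = s ** 2
    return var / var.sum()


def elbow_n_components(ratios):
    """Crude scree-plot breakpoint (hypothetical heuristic).

    Finds where the curve flattens most abruptly, using the largest
    second difference, and returns the number of components before it.
    """
    ratios = np.asarray(ratios, dtype=float)
    second_diff = np.diff(ratios, n=2)
    return int(np.argmax(second_diff)) + 1
```

For example, for fractions of 0.6, 0.3, 0.04, 0.03 and 0.03 the curve flattens at the third component, so the heuristic keeps two components. Checking the result against the plotted curve is still advisable, since noisy scree plots can have more than one apparent bend.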

Please let me know if you have any other questions about this.