y-h-taguchi
DESCRIPTION
Presentation at "New Developments of Multivariate Statistical Methodologies -Robust, High Speed, and High-Accuracy", 25th-27th Nov 2014, Tsukuba Univ., Japan, http://www.math.tsukuba.ac.jp/~aoshima-lab/symposium.html. The book chapter is here: https://www.researchgate.net/publication/271198208_Heuristic_Principal_Component_Analysis-Based_Unsupervised_Feature_Extraction_and_Its_Application_to_Bioinformatics
Citation preview
2. 0. Why PCA?
PCA = principal component analysis
Motivation: Unsupervised Feature Selection
How PCA?

3. 100 features (10 ordered features + 90 random features), 20 samples (class 1, class 2)
10 ordered features:
11111111110000000000
11111111110000000000
. .
11111111110000000000
90 random features:
01000000110110011111
00011110000101011101
. . .
01000011000110101111
How to select the 10 ordered features without classification information?

4. Embedding 100 features into 2D using PCA
[Figure labels: 90 random features; 10 ordered features]

5. PC1 represents discrimination between class 1 and class 2
[Figure labels: class 1; class 2; 20 samples]

6. Applying weak unitary transformation to the space spanned by 20 samples...
[Figure labels: 100 features (10 ordered features, 90 random features); 20 samples; class 1; class 2]

7. The same 2D embedding. Thus we can select 10 features.
[Figure labels: 10 ordered features; 90 random features]

8. PC1 weakly represents discrimination between class 1 and class 2
[Figure labels: class 1; class 2; 20 samples]

9. Linear discriminant analysis + leave-one-out cross-validation using 10 ordered features
            True class
            1    2
Predict 1   8    2
Predict 2   2    8
Accuracy = Sensitivity = Specificity = 80%
How about real examples?

10. 1. Real example 1: Disease-associated aberrant promoter methylation
[Figure labels: methylation; gene; promoter]
Three autoimmune diseases: SLE, RA, DM
[MZ twins (healthy + sick) + 2 healthy controls] × 5 = 20 samples
× 3 diseases = 60 samples vs 1000 potential methylation sites

11. Embedding of 1000 promoters within 20 RA samples into 2D with PCA (PC2 vs PC3)
[Figure labels: axes PC2 and PC3; outlier promoters, selected]

12. PC2: RA
[Figure legend: Male; Female; Sick Twin; Healthy Twin; +: Healthy Control 1; Healthy Control 2; 20 samples]
Twins: Healthy > Sick
Controls: No
The 4th set: No
The reason why unsupervised feature selection is needed.

13. Scatter plots between healthy/RA twins. Red dots = selected promoters. Healthy twins RA twins P