2. 0. Why PCA?
PCA = principal component analysis
Motivation: unsupervised feature selection
How PCA?

3. 100 features (10 ordered features + 90 random features) x 20 samples (Class 1, Class 2)
10 ordered features:
  11111111110000000000
  11111111110000000000
  . .
  11111111110000000000
90 random features:
  01000000110110011111
  00011110000101011101
  . . .
  01000011000110101111
How to select the 10 ordered features without classification information?

4. Embedding 100 features into 2D using PCA
[Figure labels: 90 random features; 10 ordered features]

5. PC1 represents discrimination between class 1 and class 2
[Figure labels: Class 1; Class 2; 20 samples]

6. Applying a weak unitary transformation to the space spanned by 20 samples...
[Figure labels: 20 samples; 100 features; Class 1, Class 2; 10 ordered features; 90 random features]

7. The same 2D embedding. Thus we can select 10 features.
[Figure labels: 10 ordered features; 90 random features]

8. PC1 weakly represents discrimination between class 1 and class 2
[Figure labels: Class 1; Class 2; 20 samples]

9. Linear discriminant analysis + leave-one-out cross-validation using 10 ordered features
              True class
              1    2
  Predict 1   8    2
  Predict 2   2    8
Accuracy = Sensitivity = Specificity = 80%
How about real examples?

10. 1. Real example 1: Disease-associated aberrant promoter methylation
[Figure labels: methylation; gene; promoter]
Three autoimmune diseases: SLE, RA, DM
[MZ twins (healthy + sick) + 2 healthy controls] x 5 = 20 samples
x 3 diseases = 60 samples vs 1000 potential methylation sites

11. Embedding of 1000 promoters within 20 RA samples into 2D with PCA (PC2 vs PC3)
[Figure: axes PC2 and PC3; outlier promoters are selected]

12. PC2: RA
[Figure: 20 samples; legend: Male, Female, Sick Twin, Healthy Twin, +: Healthy Control 1, Healthy Control 2]
Twins: Healthy > Sick
Controls: No
The 4th set: No
The reason why unsupervised feature selection is needed.

13. Scatter plots between healthy/RA twins. Red dots = selected promoters
Healthy twins RA twins P
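
The toy example on slides 3 to 5 and the selection step on slide 7 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the procedure actually used for the slides: the helper names `make_toy_data` and `embed_features`, the random seed, the column centering, and the rule "keep the 10 features with the largest |PC1| score" are all choices made here for the sketch. Each feature (row) is treated as a point in the 20-dimensional sample space, so PCA embeds features rather than samples.

```python
import numpy as np

def make_toy_data(n_ordered=10, n_random=90, n_per_class=10, seed=0):
    """Build a features x samples matrix like slide 3 (seed is an assumption)."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([1, 2], n_per_class)           # class label of each sample
    ordered_row = (labels == 1).astype(float)         # 1 for class 1, 0 for class 2
    ordered = np.tile(ordered_row, (n_ordered, 1))    # identical ordered features
    random_part = rng.integers(0, 2, size=(n_random, 2 * n_per_class)).astype(float)
    return np.vstack([ordered, random_part]), labels

def embed_features(X, n_components=2):
    """PCA via SVD, treating each feature (row) as a point in sample space."""
    Xc = X - X.mean(axis=0, keepdims=True)            # center each sample column (assumed preprocessing)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # feature coordinates on the PCs
    loadings = Vt[:n_components]                      # one weight per sample for each PC
    return scores, loadings

X, labels = make_toy_data()
scores, loadings = embed_features(X)

# Rank features by distance from the origin along PC1 and keep the top 10.
selected = np.sort(np.argsort(-np.abs(scores[:, 0]))[:10])
print("selected feature indices:", selected)
print("mean PC1 loading, class 1:", loadings[0, labels == 1].mean())
print("mean PC1 loading, class 2:", loadings[0, labels == 2].mean())
```

Because the 10 ordered features are identical rows, they land on the same point of the embedding, which is why they stand out as a cluster away from the random features. Whether the top-10 rule recovers exactly the ordered features depends on the random draw, so the selection should be checked rather than assumed. No class labels are used in `embed_features`; `labels` only appears afterwards to inspect the PC1 loadings.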