Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
eNOSE calibrationFOR LUNG CANCER DETECTION
Andrea Merlina, Valerio Bruschi
Statistics
In 2008 approximately 12.7 million cancers were diagnosed In 2010 nearly 7.98 million people died
Cancers account for approximately 13% of deaths
Most common type:▪ lung cancer (1,4 million deaths)
▪ stomach cancer (740.000)
▪ liver cancer (700.000)
Invasive cancer are the leading cause of death in the developed world and the second leading in the developing world
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Intro
b) LDA
c) Mahalanobis
d) PCA-DA
e) Feature Selection
f) MLR
5. Conclusions
eNose
Objectives
Composition of the breath of patients with lung cancer contains information that could be used to detect the disease
Breath samples were collected and analyzed by two electronic noses
Two goals:
✓ Instrument calibration using a set of key compounds
✓ Quality assurance of eNose data in the medical setting
eNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Intro
b) LDA
c) Mahalanobis
d) PCA-DA
e) Feature Selection
f) MLR
5. Conclusions
Mapping between instruments through the calculation of a model
distinguish healthy and ill patience
Data contextualization
Two datasets obtained as measures from two electronic noses
▪ Cyranose
▪ ROTV
In a medical environment
eNose
1. Statistics
2. Objectives
3. Data contextualization1. Outliers
4. Data Analysis
a) Intro
b) LDA
c) Mahalanobis
d) PCA-DA
e) Feature Selection
f) MLR
5. Conclusions
≈ 350 samples/dataset10 features 10 gas non-selective sensors
Outliers
Data Analysis
Using raw data to develop an inter instrumental model:
eNose 1 as traning and eNose 2 as test results:
eNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysisa) Intro
b) LDA
c) Mahalanobis
d) PCA-DA
e) Feature Selection
f) MLR
5. Conclusions
Class #1 Class #2
Class #1 45 74
Class #2 123 115
Classification Classification
Error (%)
Cla
sse
s
62,18
51,68
44,82Accuracy
• LDA: everybody is classified as ill• PCA – DA: everybody is classified as healthy• Mahalanobis:
RAW DATA PREPROCESSING CLASSIFICATION
Data Analysis: Intro
Hypotesis:
Normalization of a trunk using a baseline of:• 1 sample
• 5 samples (about 15%* error for class 1)
• 10 samples (about 15%* error for class 1)
• entire chunck (25%*)
In order to minimize memory effects and maximize reproducibility
Best tradeoff performances/easy of calibration
eNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Introb) LDA
c) Mahalanobis
d) PCA-DA
e) Feature Selection
f) MLR
5. Conclusions
* With our best classification method
1• Max Accuracy
2• Min False Negative
3• Min number of samples
Data Analysis: IntroeNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Introb) LDA
c) Mahalanobis
d) PCA-DA
e) Feature Selection
f) MLR
5. Conclusions
Linear discrimination
Mahalanobis
Components discrimination
Data Analysis: LDA
Class #1 Class #2
Class #1 89 29
Class #2 99 139
Classification Classification
Error (%)
Cla
sse
s
24,58
41,60
64,04Accuracy
eNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Intro
b) LDAc) Mahalanobis
d) PCA-DA
e) Feature Selection
f) MLR
5. Conclusions Resulting confusion matrix:
Data Analysis: MahalanobiseNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Intro
b) LDA
c) Mahalanobisd) PCA-DA
e) Feature Selection
f) MLR
5. Conclusions
▪Sample points are distributed about the center of mass in a ellipsoidal manner.
▪the probability of the test point to belong to the set depends not only on the distance from the center of mass, but also on the direction.
▪Result: everybody is classified as healthy
example
Data Analysis: PCA - DAeNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Intro
b) LDA
c) Mahalanobis
d) PCA-DAe) Feature Selection
f) MLR
5. Conclusions
Data Analysis: PCA - DAeNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Intro
b) LDA
c) Mahalanobis
d) PCA-DAe) Feature Selection
f) MLR
5. Conclusions
Class #1 Class #2
Class #1 100 18
Class #2 59 179
Classification Classification
Error (%)
Cla
sse
s
15,25
24,79
78,37Accuracy
• Optimum solution
• Maximum accuracy
• Minimum False Negative
Data Analysis: Feature SelectioneNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Intro
b) LDA
c) Mahalanobis
d) PCA-DA
e) Feature Selectionf) MLR
5. Conclusions
Based on loadings of PCA
Data Analysis: Feature Selection
Class #1 Class #2
Class #1 32 86
Class #2 22 216
Classification Classification
Error (%)
Cla
sse
s
72,88
9,24
69,66Accuracy
eNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Intro
b) LDA
c) Mahalanobis
d) PCA-DA
e) Feature Selectionf) MLR
5. Conclusions
Class #1 Class #2
Class #1 101 17
Class #2 65 173
Classification Classification
Error (%)
Cla
sse
s
14,41
27,31
76,97Accuracy
Class #1 Class #2
Class #1 90 28
Class #2 95 143
Classification Classification
Error (%)
Cla
sse
s
23,73
39,92
65,45Accuracy
Based on loadings of PCA
PCA - DA
Mahalanobis
LDA
Close to resultswithout feature
selection !!
Data Analysis: MLR (intro)eNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Intro
b) LDA
c) Mahalanobis
d) PCA-DA
e) Feature Selection
f) MLR
5. Conclusions
❖ Can we find a regression model in order to reduce the number of features
(sensors) used in the experiment?
❖ Is the model reliable during time?
We noticed correlation between featureson the same eNose.
Multiple linear regression attempts to model the relationship between variables by fitting a linearequation to observed data.
𝑌 = 𝑋𝐵 + 𝐸
• Dataset was divided in training and test sets in a temporal way.
(first 3 days as training, the last one as testing)
𝐵 = 𝑋+ ∙ 𝑌
• RMSECV was computed for each feature predicted
𝑅𝑀𝑆𝐸𝐶𝑉 =𝑃𝑅𝐸𝑆𝑆
𝑁
with 𝑃𝑅𝐸𝑆𝑆 = σ𝑖 𝑦𝑖𝐿𝑆 − 𝑦𝑖
2
MLR hypothesis: 𝑒𝑦 ≈ 𝑒𝑥.
Regression less robust
Data Analysis: MLReNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Intro
b) LDA
c) Mahalanobis
d) PCA-DA
e) Feature Selection
f) MLR
5. Conclusions
Class #1 Class #2
Class #1 100 18
Class #2 66 172
76,40Accuracy
Classification Classification
Error (%)C
lass
es
15,25
27,73
ConclusionseNose
1. Statistics
2. Objectives
3. Data contextualization
1. Outliers
4. Data Analysis
a) Intro
b) LDA
c) Mahalanobis
d) PCA-DA
e) Feature Selection
f) MLR
5. Conclusions
Optimum
Lesscomplexity
• 3 features removed
• False Positive (24% -> 27)
Prediction• 1 features predicted
per eNose
• As feature selection
• PCA –DA technique• Accuracy 78%
Thanks for the attention