eNOSE calibration

eNOSE calibrationFOR LUNG CANCER DETECTION

Andrea Merlina, Valerio Bruschi

Statistics

In 2008 approximately 12.7 million cancers were diagnosed In 2010 nearly 7.98 million people died

Cancers account for approximately 13% of deaths

Most common type:▪ lung cancer (1,4 million deaths)

▪ stomach cancer (740.000)

▪ liver cancer (700.000)

Invasive cancer are the leading cause of death in the developed world and the second leading in the developing world

1. Statistics

2. Objectives

3. Data contextualization

1. Outliers

4. Data Analysis

a) Intro

b) LDA

c) Mahalanobis

d) PCA-DA

e) Feature Selection

f) MLR

5. Conclusions

eNose

Objectives

Composition of the breath of patients with lung cancer contains information that could be used to detect the disease

Breath samples were collected and analyzed by two electronic noses

Two goals:

✓ Instrument calibration using a set of key compounds

✓ Quality assurance of eNose data in the medical setting

eNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysis

a) Intro

b) LDA

c) Mahalanobis

d) PCA-DA


f) MLR

5. Conclusions

Mapping between instruments through the calculation of a model

distinguish healthy and ill patience

Data contextualization

Two datasets obtained as measures from two electronic noses

▪ Cyranose

▪ ROTV

In a medical environment

eNose

1. Statistics

2. Objectives

3. Data contextualization1. Outliers

4. Data Analysis

a) Intro

b) LDA

c) Mahalanobis

d) PCA-DA


f) MLR

5. Conclusions

≈ 350 samples/dataset10 features 10 gas non-selective sensors

Outliers

Data Analysis

Using raw data to develop an inter instrumental model:

eNose 1 as traning and eNose 2 as test results:

eNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysisa) Intro

b) LDA

c) Mahalanobis

d) PCA-DA


f) MLR

5. Conclusions

Class #1 Class #2

Class #1 45 74

Class #2 123 115

Classification Classification

Error (%)

Cla

sse

s

62,18

51,68

44,82Accuracy

• LDA: everybody is classified as ill• PCA – DA: everybody is classified as healthy• Mahalanobis:

RAW DATA PREPROCESSING CLASSIFICATION

Data Analysis: Intro

Hypotesis:

Normalization of a trunk using a baseline of:• 1 sample

• 5 samples (about 15%* error for class 1)

• 10 samples (about 15%* error for class 1)

• entire chunck (25%*)

In order to minimize memory effects and maximize reproducibility

Best tradeoff performances/easy of calibration

eNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysis

a) Introb) LDA

c) Mahalanobis

d) PCA-DA


f) MLR

5. Conclusions

* With our best classification method

1• Max Accuracy

2• Min False Negative

3• Min number of samples

Data Analysis: IntroeNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysis

a) Introb) LDA

c) Mahalanobis

d) PCA-DA


f) MLR

5. Conclusions

Linear discrimination

Mahalanobis

Components discrimination

Data Analysis: LDA

Class #1 Class #2

Class #1 89 29

Class #2 99 139


Error (%)

Cla

sse

s

24,58

41,60

64,04Accuracy

eNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysis

a) Intro

b) LDAc) Mahalanobis

d) PCA-DA


f) MLR

5. Conclusions Resulting confusion matrix:

Data Analysis: MahalanobiseNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysis

a) Intro

b) LDA

c) Mahalanobisd) PCA-DA


f) MLR

5. Conclusions

▪Sample points are distributed about the center of mass in a ellipsoidal manner.

▪the probability of the test point to belong to the set depends not only on the distance from the center of mass, but also on the direction.

▪Result: everybody is classified as healthy

example

Data Analysis: PCA - DAeNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysis

a) Intro

b) LDA

c) Mahalanobis

d) PCA-DAe) Feature Selection

f) MLR

5. Conclusions

Data Analysis: PCA - DAeNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysis

a) Intro

b) LDA

c) Mahalanobis

d) PCA-DAe) Feature Selection

f) MLR

5. Conclusions

Class #1 Class #2

Class #1 100 18

Class #2 59 179


Error (%)

Cla

sse

s

15,25

24,79

78,37Accuracy

• Optimum solution

• Maximum accuracy

• Minimum False Negative

Data Analysis: Feature SelectioneNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysis

a) Intro

b) LDA

c) Mahalanobis

d) PCA-DA

e) Feature Selectionf) MLR

5. Conclusions

Based on loadings of PCA

Data Analysis: Feature Selection

Class #1 Class #2

Class #1 32 86

Class #2 22 216


Error (%)

Cla

sse

s

72,88

9,24

69,66Accuracy

eNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysis

a) Intro

b) LDA

c) Mahalanobis

d) PCA-DA

e) Feature Selectionf) MLR

5. Conclusions

Class #1 Class #2

Class #1 101 17

Class #2 65 173


Error (%)

Cla

sse

s

14,41

27,31

76,97Accuracy

Class #1 Class #2

Class #1 90 28

Class #2 95 143


Error (%)

Cla

sse

s

23,73

39,92

65,45Accuracy

Based on loadings of PCA

PCA - DA

Mahalanobis

LDA

Close to resultswithout feature

selection !!

Data Analysis: MLR (intro)eNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysis

a) Intro

b) LDA

c) Mahalanobis

d) PCA-DA


f) MLR

5. Conclusions

❖ Can we find a regression model in order to reduce the number of features

(sensors) used in the experiment?

❖ Is the model reliable during time?

We noticed correlation between featureson the same eNose.

Multiple linear regression attempts to model the relationship between variables by fitting a linearequation to observed data.

𝑌 = 𝑋𝐵 + 𝐸

• Dataset was divided in training and test sets in a temporal way.

(first 3 days as training, the last one as testing)

𝐵 = 𝑋+ ∙ 𝑌

• RMSECV was computed for each feature predicted

𝑅𝑀𝑆𝐸𝐶𝑉 =𝑃𝑅𝐸𝑆𝑆

𝑁

with 𝑃𝑅𝐸𝑆𝑆 = σ𝑖 𝑦𝑖𝐿𝑆 − 𝑦𝑖

2

MLR hypothesis: 𝑒𝑦 ≈ 𝑒𝑥.

Regression less robust

Data Analysis: MLReNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysis

a) Intro

b) LDA

c) Mahalanobis

d) PCA-DA


f) MLR

5. Conclusions

Class #1 Class #2

Class #1 100 18

Class #2 66 172

76,40Accuracy


Error (%)C

lass

es

15,25

27,73

ConclusionseNose

1. Statistics

2. Objectives


1. Outliers

4. Data Analysis

a) Intro

b) LDA

c) Mahalanobis

d) PCA-DA


f) MLR

5. Conclusions

Optimum

Lesscomplexity

• 3 features removed

• False Positive (24% -> 27)

Prediction• 1 features predicted

per eNose

• As feature selection

• PCA –DA technique• Accuracy 78%

Thanks for the attention

Documents

eNOSE calibration