NSF REU Program in Medical Informatics

D. Raicu (1), J. Furst (1), D. Channin (2), S. Armato (3), and K. Suzuki (3)

(1) DePaul University, (2) Northwestern University, and (3) University of Chicago

REU Data Overview

Goal: continue promoting interdisciplinary studies at the frontier between information technology and medicine to undergraduate students, especially students from groups historically underrepresented in the exact sciences

Duration: 10 weeks over the summer

Example Teaching:

Interdisciplinary tutorials: Image processing, machine learning

Technology tools tutorials: MatLab, SPSS

Presentations by mentors about projects

Example Activities:

Follow-on activities

Bi-weekly group meetings, presentations to the entire MedIX group, final reports (in conference format), seminars to support student publication

Special events

“Day in the life of a PhD student”, “Developing a research career”, “Women in science”, tours of medical facilities, etc.

Unique Site Aspect:

Multi-institution & multi-disciplinary site @ the frontier between computer science & medicine

Outcomes (2005-2007)

88% of students had at least one research publication

Over 23 publications (1 journal paper, 15 conference papers, 8 extended abstracts)

3 honors theses and senior projects, 4 graduate fellowships, and 1 CRA honorable mention for outstanding undergraduate research

Statistics (2005-2007)

Student demographics: 8 students per year

Female: 46%; First-generation college: 15%; From outside the home institutions: 73%

Prior research experience: had given a visual (poster) research presentation (31%), had given an oral research presentation (27%), had (co-)authored a publication in an academic journal (12%), or had been involved in a research project in the previous two years (42%).

Total number of Faculty mentors: 4

Years of operation: 2005 to 2010

Example research topic: see the research poster reproduced below

Introduction

This work thoroughly investigates ways to predict the results of a semantic-based image retrieval system by using solely content-based image features. We extend our previous work [1] by studying the relationships between the two types of retrieval, content-based and semantic-based, with the final goal of integrating them into a system that will take advantage of both retrieval approaches. Our results on the Lung Image Database Consortium (LIDC) dataset show that a substantial number of nodules identified as similar based on image features are also identified as similar based on semantic characteristics. Furthermore, by integrating the two types of features, the similarity retrieval improves with respect to certain nodule characteristics.

Methodology

The radiologist compares similar nodules to aid diagnosis. Often, comparing similar nodules can lead to a more certain diagnosis [3].

The goal is to find nodules similar to the query in order to make a better diagnosis. Content-based image retrieval is the target, as it would involve little human interaction on very large data sets.

The 149 CT scans, one per nodule, are from the Lung Image Database Consortium (LIDC).

Up to four radiologists rated the nodules on 9 distinct features. Only 7 features varied enough to be incorporated; each is rated on a scale of 1 to 5.

Computation to best represent semantic-based similarity values using only content-based features.

Results greatly improve the usefulness of the content-based image retrieval system.

Figure 1: Methodology

The LIDC contains complete thoracic CT scans for 85 patients with lesions. Nodules with a diameter larger than three millimeters were rated by a panel of four radiologists [2].

The radiologists rated nine characteristics of the masses that they considered nodules. Seven of those characteristics, each rated on a scale of one to five, are useful to our analysis: Lobulation, Malignancy, Margin, Sphericity, Spiculation, Subtlety, and Texture.

For each image, we calculated 64 different content-based features [1]:

Shape Features: circularity, roughness, elongation, compactness, eccentricity, solidity, extent, and standard deviation of radial distance

Size Features: area, convex area, perimeter, convex perimeter, equivalence diameter, major axis length, and minor axis length

Gray-Level Intensity Features: minimum, maximum, mean, standard deviation, and difference

Texture Features: based on co-occurrence matrices, Gabor filters, and Markov random fields

Content-based versus Semantic-based Similarity Retrieval: A LIDC Case Study

Sarah Jabon (a), Jacob Furst (b), Daniela Raicu (b)

(a) Rose-Hulman Institute of Technology, Terre Haute, IN 47803; (b) Intelligent Multimedia Processing Laboratory, School of Computer Science, Telecommunications, and Information Systems, DePaul University, Chicago, IL, USA, 60604

Using k Number of Matches

The number of nodules that had 2 to 5 matches was relatively consistent across all image features, but slightly higher for Gabor and Markov. No combination of image features had more than 10 matches out of the twenty most similar.
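The grouping of per-query match counts into the ranges 0-1, 2-5, and 6-10 (the ranges reported in the match-count table, Figure 6) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `bin_label` and `summarize` and the sample counts are hypothetical.

```python
from collections import Counter
from typing import Iterable


def bin_label(matches: int) -> str:
    """Map a per-query match count to one of the ranges used in Figure 6."""
    if matches <= 1:
        return "0-1"
    if matches <= 5:
        return "2-5"
    if matches <= 10:
        return "6-10"
    return ">10"


def summarize(match_counts: Iterable[int]) -> Counter:
    """Tally how many query nodules fall into each range."""
    return Counter(bin_label(m) for m in match_counts)


# Hypothetical match counts for seven query nodules (not data from the study).
example_counts = [0, 1, 3, 5, 6, 10, 2]
summary = summarize(example_counts)
print(dict(summary))
```

With the full data set, the three ranges would sum to 149 for each feature set, since every nodule serves as the query exactly once.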

Figure 7 is a scatter plot of the content-based similarity value versus the semantic-based similarity value.

[1] Lam, M., Disney, T., Pham, M., Raicu, D., Furst, J., “Content-Based Image Retrieval for Pulmonary Computed Tomography Nodule Images”, SPIE Medical Imaging Conference, San Diego, CA, February 2007.

[2] The National Cancer Institute, “Lung Imaging Database Consortium (LIDC)”, http://imaging.cancer.gov/programsandresources/InformationSystems/LIDC.

[3] Li, Q., Li, F., Shiraishi, J., Katsuragawa, S., Sone, S., Doi, K., “Investigation of New Psychophysical Measures for Evaluation of Similar Images on Thoracic Computed Tomography for Distinction between Benign and Malignant Nodules”, Medical Physics 30:2584-2593, 2003.

[4] Han, J., Kamber, M., “Data Mining: Concepts and Techniques”, London: Academic Press, 2001.

Image Data

Calculating Similarity

Similarity Comparisons

In order to assess the correlation between the two similarity measures, we used a round-robin approach in which we extracted one nodule as a query and compared it to the remaining 148 nodules. We took the k most similar nodules from each query’s semantic-based similarity ordered list and content-based similarity ordered list and counted how many nodules were common to both lists.

Figure 5 gives an example with nodule 117 as the query nodule, listing the most similar nodules with their attributes. Notice that the semantic similarity values have a much smaller range, from 0 to about 0.3, whereas the content-based similarities range from 0 to 1. Most of the semantic features are very similar. A ranking of i signifies that the nodule was the ith most similar nodule in the list of similar nodules based on the appropriate feature set.
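The round-robin comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: smaller values mean "more similar" (as in Figure 5, where the query itself scores 0), ties are broken by nodule id, and the names `top_k`, `count_matches`, and `round_robin`, along with the toy values, are hypothetical.

```python
from typing import Dict, List


def top_k(values: Dict[int, float], k: int) -> List[int]:
    """Return the ids of the k nodules with the smallest values (most similar)."""
    return sorted(values, key=lambda nid: (values[nid], nid))[:k]


def count_matches(semantic: Dict[int, float], content: Dict[int, float], k: int) -> int:
    """Count nodules that appear in both top-k lists for one query."""
    return len(set(top_k(semantic, k)) & set(top_k(content, k)))


def round_robin(semantic_all: Dict[int, Dict[int, float]],
                content_all: Dict[int, Dict[int, float]],
                k: int) -> Dict[int, int]:
    """Treat every nodule as the query once and record its match count."""
    return {q: count_matches(semantic_all[q], content_all[q], k) for q in semantic_all}


# Hypothetical values for a single query nodule (ids and numbers are made up).
semantic_values = {1: 0.004, 2: 0.010, 3: 0.020, 4: 0.150, 5: 0.300}
content_values = {1: 0.40, 2: 0.90, 3: 0.35, 4: 0.38, 5: 0.95}

matches = count_matches(semantic_values, content_values, k=3)
print(matches)
```

In the study's setting, each inner dictionary would hold the 148 remaining nodules for a given query and k would be 20.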

Analysis

References

Conclusions

Our preliminary results show that a substantial number of nodules identified as similar based on image features are also identified as similar based on semantic characteristics and, therefore, that the image features capture properties that radiologists look at when interpreting lung nodules. There are many similarity metrics that can be used to try to correlate the two retrieval systems. We found the Euclidean distance to be best for the content-based features and the cosine similarity measure to be best for the semantic-based characteristics. In our future work, we will try principal component analysis and linear regression on the data. Further research is necessary to investigate the correlations between the two types of features and to integrate them into one retrieval system that will be of clinical use.

Rad.         Lob.  Mal.  Marg.  Spher.  Spic.  Subt.  Text.
A            3     4     4      2       4      3      4
B            4     3     4      4       3      5      5
C            4     2     3      4       3      4      5
D            4     3     2      2       4      3      3
Summarized   4     3     4      3       3      3      5

Figure 2: Sample CT Scan with Four Radiologists’ Ratings

Semantic-Based Features

Content-Based Features

[Histogram of semantic-based similarity values (variable VAR00002): x-axis 0.00 to 0.40, y-axis Frequency (0 to 1,200); Mean = 0.0766, Std. Dev. = 0.06374, N = 11,026]

[Histogram of content-based similarity values (variable VAR00001): x-axis 0.0 to 1.0, y-axis Frequency (0 to 600); Mean = 0.2840127, Std. Dev. = 0.154278896, N = 11,026]

Figure 4 is a histogram of the content-based similarity values for all 11,026 nodule pairs. The similarity values are calculated with the Euclidean distance between the nodules' content-based feature vectors, and then min-max normalization is applied [4].
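The content-based calculation can be sketched as follows. This is a minimal illustration, assuming that min-max normalization is applied to the pairwise distances (the text could also be read as normalizing the features first) and that a smaller value means a more similar pair, as in Figure 5 where the query scores 0. The function names and the toy two-dimensional features are hypothetical; the real vectors have 64 features.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_normalized(features: Dict[int, List[float]]) -> Dict[Tuple[int, int], float]:
    """Distance for every unordered nodule pair, rescaled to [0, 1] by min-max."""
    raw = {(i, j): euclidean(features[i], features[j])
           for i, j in combinations(sorted(features), 2)}
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    # If all distances are equal, there is nothing to rescale.
    return {pair: (d - lo) / span if span else 0.0 for pair, d in raw.items()}


# 149 nodules give 149 * 148 / 2 unordered pairs, the N reported for the histograms.
n_pairs = math.comb(149, 2)

# Hypothetical toy features for three nodules (not data from the study).
toy_features = {1: [0.0, 0.0], 2: [3.0, 4.0], 3: [6.0, 8.0]}
normalized = pairwise_normalized(toy_features)
print(n_pairs, normalized)
```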

Figure 4: Histogram of Content-Based Similarity

At the end of the feature extraction process, each nodule is represented by a vector in which c stands for a semantic concept and f for an image feature.
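One way to write the nodule vector described above is given here, assuming the seven retained semantic characteristics and the 64 content-based features are simply concatenated. The symbol v and the ordering of the entries are assumptions, since the original expression is not preserved in this text.

```latex
\[
\mathbf{v} = (\,c_1, c_2, \dots, c_7,\; f_1, f_2, \dots, f_{64}\,)
\]
```

Under this reading, the semantic-based similarity uses only the c entries and the content-based similarity uses only the f entries.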

Figure 3: Histogram of Semantic-Based Similarity

The cosine similarity measure minimized the ceiling effect. The semantic-based similarity value is calculated with the cosine formula applied to the two nodules' semantic feature vectors.
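The cosine calculation can be sketched as follows. Reporting the value as 1 minus the cosine similarity is an assumption, but it is consistent with Figure 5: applied to the rating vectors of query nodule 117 and nodules 104 and 126, it gives approximately 0.00445 and 0.00460, in line with the tabulated 0.004452 and 0.004596. The function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(q: Sequence[float], n: Sequence[float]) -> float:
    """Cosine of the angle between query vector q and database vector n."""
    dot = sum(a * b for a, b in zip(q, n))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_n = math.sqrt(sum(b * b for b in n))
    return dot / (norm_q * norm_n)


def semantic_value(q: Sequence[float], n: Sequence[float]) -> float:
    """Assumed form of the reported value: 0 for identical directions, larger when less similar."""
    return 1.0 - cosine_similarity(q, n)


# Ratings (Lob, Mal, Mar, Sph, Spic, Sub, Tex) taken from Figure 5.
nodule_117 = [2, 3, 5, 5, 2, 4, 5]  # query
nodule_104 = [2, 3, 4, 4, 2, 3, 4]
nodule_126 = [2, 3, 5, 5, 1, 4, 5]

value_104 = semantic_value(nodule_117, nodule_104)
value_126 = semantic_value(nodule_117, nodule_126)
print(value_104, value_126)
```

Because the cosine depends only on the direction of the rating vectors, nodules whose ratings differ by a roughly constant factor still score as very similar, which helps explain the narrow range of the semantic values.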

Figure 3 is a histogram of the semantic-based similarity values for all 11,026 nodule pairs. Although the values do not follow a perfect normal curve, the ceiling effect was drastically reduced compared with performing a simple distance on the seven characteristics.

Query Nodule (Q):

Database Nodule (N):

No.   Semantic-Based   Semantic-Based     Content-Based   Content-Based      Semantic Feature Vector
      Ranking          Similarity Value   Ranking         Similarity Value   Lob  Mal  Mar  Sph  Spic  Sub  Tex
117   -                0                  -               0                  2    3    5    5    2     4    5
104   2                0.004452           5               0.415918           2    3    4    4    2     3    4
126   3                0.004596           6               0.421249           2    3    5    5    1     4    5
98    6                0.006817           17              0.505317           2    3    4    5    2     3    5
28    8                0.009119           16              0.504996           1    3    5    5    1     4    5
27    11               0.012752           2               0.380517           1    3    5    5    1     3    5
137   14               0.013072           9               0.430289           1    3    4    5    1     3    4
127   16               0.013606           11              0.474226           2    4    5    4    3     4    5
119   17               0.015268           20              0.538589           3    3    4    4    2     3    5
90    20               0.016383           7               0.425751           1    2    3    4    1     2    4

Figure 5: Example of Image Retrieval Results

Matches   Gabor   Markov   Co-Occurrence   Gabor, Markov, and Co-Occurrence   All Features
6-10      24      18       31              36                                 43
2-5       107     104      94              98                                 93
0-1       18      27       24              15                                 13

Figure 6: Match Count in 20 Most Similar Nodules

Applying a Threshold

We analyzed the difference in the scales of similarity by counting how many matches there were based on thresholds. The graph below shows two different thresholds of similarity, 0.02 and 0.04. These thresholds are applied to the semantic similarity values. There were many more matches within these thresholds.
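One possible reading of the threshold comparison is sketched below. The text does not say how the content-based list is cut off once the semantic list is thresholded, so this sketch assumes it is truncated to the same length as the thresholded semantic set. That choice, the function name `threshold_matches`, and the toy values are all assumptions rather than the authors' procedure.

```python
from typing import Dict


def threshold_matches(semantic: Dict[int, float],
                      content: Dict[int, float],
                      threshold: float) -> int:
    """Count nodules within the semantic threshold that also rank among the
    equally many most similar nodules by content (assumed truncation rule)."""
    within = {nid for nid, value in semantic.items() if value <= threshold}
    ranked_content = sorted(content, key=lambda nid: (content[nid], nid))
    return len(within & set(ranked_content[:len(within)]))


# Hypothetical values for a single query nodule (ids and numbers are made up).
semantic_values = {1: 0.004, 2: 0.010, 3: 0.030, 4: 0.150, 5: 0.300}
content_values = {1: 0.40, 2: 0.90, 3: 0.35, 4: 0.38, 5: 0.95}

matches_at_002 = threshold_matches(semantic_values, content_values, 0.02)
matches_at_004 = threshold_matches(semantic_values, content_values, 0.04)
print(matches_at_002, matches_at_004)
```

A looser threshold admits more nodules into the semantic set, so the number of possible matches per query is no longer capped at a fixed k.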

Figure 7: Content-Based Similarity vs. Semantic-Based Similarity

[Scatter plot "Similarity Values: Content-Based vs. Semantic-Based": x-axis Semantic-Based Similarity (0 to 0.5), y-axis Content-Based Similarity (0 to 1)]

[Graph "Thresholds on Semantic-Based Characteristics": x-axis Number of Matches (0 to 56), y-axis Occurrence (Out of 149) (0 to 35); series: Threshold 0.02 and Threshold 0.04]

Figure 8: Match Based on All Features and Thresholds
