

NSF REU Program in Medical Informatics

D. Raicu (1), J. Furst (1), D. Channin (2), S. Armato (3), and K. Suzuki (3)

(1) DePaul University, (2) Northwestern University, and (3) University of Chicago

REU Data Overview

Goal: Continue promoting interdisciplinary studies at the frontier between information technology and medicine to undergraduate students, especially students from groups historically underrepresented in the exact sciences

Duration: 10 weeks over the summer

Example Teaching:

Interdisciplinary tutorials: Image processing, machine learning

Technology tools tutorials: MatLab, SPSS

Presentations by mentors about projects

Example Activities:

Follow-on activities

Bi-weekly group meetings, presentations to the entire MedIX group, final reports (in conference formats), seminars to support student publication

Special events

“Day in the life of a PhD student”, “Developing a research career”, “Women in science”, tours of medical facilities, etc.

Unique Site Aspect:

Multi-institution & multi-disciplinary site @ the frontier between computer science & medicine

Outcomes (2005-2007)

88% of students had at least one research publication

Over 23 publications (1 journal paper, 15 conference papers, 8 extended abstracts)

3 honors theses and senior projects, 4 graduate fellowships, and 1 CRA honorable mention for outstanding undergraduate research

Statistics (2005-2007)

Student demographics: 8 students per year

Female: 46%; First-generation college: 15%; From outside the home institutions: 73%

Before the program, 31% had given a visual (poster) research presentation, 27% had given an oral research presentation, 12% had (co-)authored a publication in an academic journal, and 42% had been involved in a research project in the previous two years.

Total number of Faculty mentors: 4

Years of operation: 2005 to 2010

Example Research topics: see on the left side

Introduction

This work thoroughly investigates ways to predict the results of a semantic-based image retrieval system by using solely content-based image features. We extend our previous work [1] by studying the relationships between the two types of retrieval, content-based and semantic-based, with the final goal of integrating them into a system that will take advantage of both retrieval approaches. Our results on the Lung Image Database Consortium (LIDC) dataset show that a substantial number of nodules identified as similar based on image features are also identified as similar based on semantic characteristics. Furthermore, by integrating the two types of features, the similarity retrieval improves with respect to certain nodule characteristics.

Methodology

Computation to best represent semantic-based similarity values using only content-based features.

The goal is to find similar nodules to make a better diagnosis of the query. Content-based image retrieval is the goal, as that would involve little human interaction on very large data sets.

The 149 CT scans, one of each nodule, are from the Lung Image Database Consortium (LIDC).

The results greatly improve the usefulness of a content-based image retrieval system.

Up to four radiologists rated the nodules on 9 distinct features. Only 7 features varied enough to incorporate; each is rated on a scale of 1 to 5.

The radiologist compares similar nodules to aid in the diagnosis. Often, comparing similar nodules can lead to a more certain diagnosis [3].

Figure 1 — Methodology

The LIDC contains complete thoracic CT scans for 85 patients with lesions. Nodules with a diameter larger than three millimeters were rated by a panel of four radiologists [2]. They rated nine characteristics of the masses that they considered nodules. Seven of those characteristics, all rated on a scale of one to five, are useful to our analysis: lobulation, malignancy, margin, sphericity, spiculation, subtlety, and texture.

For each image, we calculated 64 different content-based features [1]:

Shape Features: circularity, roughness, elongation, compactness, eccentricity, solidity, extent, and standard deviation of radial distance

Size Features: area, convex area, perimeter, convex perimeter, equivalence diameter, major axis length, and minor axis length

Gray-Level Intensity Features: minimum, maximum, mean, standard deviation, and difference

Texture Features: based on co-occurrence matrices, Gabor filters, and Markov random fields

Content-based versus Semantic-based Similarity Retrieval: A LIDC Case Study

Sarah Jabon (a), Jacob Furst (b), Daniela Raicu (b)

(a) Rose-Hulman Institute of Technology, Terre Haute, IN 47803; (b) Intelligent Multimedia Processing Laboratory, School of Computer Science, Telecommunications, and Information Systems, DePaul University, Chicago, IL, USA, 60604

Using k Number of Matches

The number of nodules that had 2 to 5 matches was relatively consistent throughout all image features, but slightly higher for Gabor and Markov. No combination of image features had more than 10 matches out of the twenty most similar.

Below is a scatter plot of the content-based similarity versus the semantic-based similarity value.

[1] Lam, M., Disney, T., Pham, M., Raicu, D., Furst, J., “Content-Based Image Retrieval for Pulmonary Computed Tomography Nodule Images”, SPIE Medical Imaging Conference, San Diego, CA, February 2007.

[2] The National Cancer Institute, “Lung Imaging Database Consortium (LIDC)”, http://imaging.cancer.gov/programsandresources/InformationSystems/LIDC.

[3] Li, Q., Li, F., Shiraishi, J., Katsuragawa, S., Sone, S., Doi, K., “Investigation of New Psychophysical Measures for Evaluation of Similar Images on Thoracic Computed Tomography for Distinction between Benign and Malignant Nodules”, Medical Physics 30:2584-2593, 2003.

[4] Han, J., Kamber, M., “Data Mining: Concepts and Techniques”, London: Academic Press, 2001.

Image Data

Calculating Similarity

Similarity Comparisons

In order to assess the correlation between the two similarity measures, we used a round-robin approach where we extracted one nodule as a query and compared it to the remaining 148 nodules. We took the k most similar values from each query’s semantic-based similarity ordered list and content-based similarity ordered list and counted how many nodules were common to both lists.
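The round-robin comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the nested-dictionary layout, and the toy distance values are all assumptions made for the example, and smaller values are taken to mean "more similar".

```python
def top_k(distances, query, k):
    """Return the k nodules closest to `query` (smaller value = more similar).

    `distances[query]` maps every other nodule id to its similarity value.
    Ties are broken by nodule id so the ordering is deterministic.
    """
    others = distances[query]
    ranked = sorted(others, key=lambda n: (others[n], n))
    return ranked[:k]


def match_counts(semantic, content, k):
    """For each query, count nodules present in both top-k lists."""
    counts = {}
    for query in semantic:
        semantic_top = set(top_k(semantic, query, k))
        content_top = set(top_k(content, query, k))
        counts[query] = len(semantic_top & content_top)
    return counts


# Toy, made-up values for two query nodules (hypothetical data).
semantic = {
    1: {2: 0.1, 3: 0.2, 4: 0.9},
    2: {1: 0.1, 3: 0.5, 4: 0.6},
}
content = {
    1: {2: 0.3, 3: 0.8, 4: 0.4},
    2: {1: 0.2, 3: 0.3, 4: 0.9},
}
counts = match_counts(semantic, content, k=2)
print(counts)
```

With 149 nodules, each query would have 148 candidates, and k = 20 corresponds to the "twenty most similar" lists discussed in the matches section.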

Here is an example with nodule 117 as the query nodule. Below are the most similar nodules listed with their attributes.

Notice that the semantic similarity values have a much smaller range, from 0 to about 0.3, whereas the content-based similarities range from 0 to 1. Most of the semantic features are very similar. A ranking of i signifies that the nodule was the i-th most similar nodule in the list of similar nodules based on the appropriate feature set.

Analysis

References

Conclusions

Our preliminary results show that a substantial number of nodules identified as similar based on image features are also identified as similar based on semantic characteristics; therefore, the image features capture properties that radiologists look at when interpreting lung nodules. There are many similarity metrics that can be used to try to correlate the two retrieval systems. We found the Euclidean distance to be better for the content-based features and the cosine similarity measure to be best for the semantic-based characteristics. In our future work, we will try principal component analysis and linear regression on the data. Further research is necessary to investigate the correlations between the two types of features and to integrate them in one retrieval system that will be of clinical use.

Rad.         Lob.  Mal.  Marg.  Spher.  Spic.  Subt.  Text.
A            3     4     4      2       4      3      4
B            4     3     4      4       3      5      5
C            4     2     3      4       3      4      5
D            4     3     2      2       4      3      3
Summarized:  4     3     4      3       3      3      5

Figure 2 — Sample CT Scan with Four Radiologists’ Ratings

Semantic-Based Features

Content-Based Features

[Histogram of semantic-based similarity values: x-axis VAR00002, 0.00 to 0.40; y-axis Frequency, 0 to 1,200; Mean = 0.0766, Std. Dev. = 0.06374, N = 11,026]

[Histogram of content-based similarity values: x-axis VAR00001, 0.0 to 1.0; y-axis Frequency, 0 to 600; Mean = 0.2840127, Std. Dev. = 0.154278896, N = 11,026]

At right is a histogram of the content-based similarity values for all 11,026 nodule pairs. The similarity values are calculated with the Euclidean distance, which is defined below, and then min-max normalization is applied [4].
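A minimal sketch of this step, assuming that min-max normalization is applied to the pairwise distances after they are computed (the text could also be read as normalizing the features first). The function names and the toy feature vectors are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def normalized_pair_distances(features):
    """Map each unordered nodule pair to a min-max normalized distance.

    `features` maps a nodule id to its content-based feature vector.
    The output values lie in [0, 1]; 0 is the closest pair overall.
    """
    raw = {
        (i, j): euclidean(features[i], features[j])
        for i, j in combinations(sorted(features), 2)
    }
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    # Guard against a zero span (all pairs equally distant).
    return {pair: (d - lo) / span if span else 0.0 for pair, d in raw.items()}


# Toy, made-up two-feature vectors for three nodules (hypothetical data).
toy = {1: [0.0, 0.0], 2: [3.0, 4.0], 3: [6.0, 8.0]}
dist = normalized_pair_distances(toy)

# 149 nodules give C(149, 2) unordered pairs, matching N = 11,026.
n_pairs = math.comb(149, 2)
print(dist, n_pairs)
```

The pair count is a useful sanity check: 149 × 148 / 2 = 11,026, the N reported for both histograms.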

At the end of the feature extraction process, each nodule is represented by a vector as shown below, where c stands for a semantic concept and f for an image feature.

Figure 4: Histogram of Content-Based Similarity

Figure 3: Histogram of Semantic-Based Similarity

The cosine similarity measure minimized the ceiling effect. The similarity value calculation using the cosine formula is shown below.
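A minimal sketch of this calculation, assuming the reported value is one minus the cosine of the angle between the query's and the database nodule's seven-element semantic vectors. That form is an assumption, suggested by the query scoring 0 against itself in Figure 5; the function name is illustrative. The vectors are the semantic feature vectors listed in Figure 5 for nodules 117, 104, and 126.

```python
import math


def cosine_distance(q, n):
    """One minus the cosine similarity of two semantic rating vectors."""
    dot = sum(a * b for a, b in zip(q, n))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_n = math.sqrt(sum(b * b for b in n))
    return 1.0 - dot / (norm_q * norm_n)


# Semantic vectors (Lob, Mal, Mar, Sph, Spic, Sub, Tex) from Figure 5.
query_117 = [2, 3, 5, 5, 2, 4, 5]
nodule_104 = [2, 3, 4, 4, 2, 3, 4]
nodule_126 = [2, 3, 5, 5, 1, 4, 5]

d104 = cosine_distance(query_117, nodule_104)
d126 = cosine_distance(query_117, nodule_126)
print(round(d104, 6), round(d126, 6))
```

Under this assumption the sketch gives values that round to 0.004452 and 0.004596, the semantic-based similarity values listed for nodules 104 and 126 in Figure 5.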

The histogram to the right is of the semantic-based similarity values for all 11,026 nodule pairs. Although the values do not represent a perfect normal curve, the ceiling effect was drastically reduced compared with performing a simple distance on the seven characteristics.

Query Nodule (Q):

Database Nodule (N):

(Nodule images omitted. The seven rightmost columns are the semantic feature vector.)

      Semantic-Based       Content-Based
No.   Ranking  Value       Ranking  Value       Lob  Mal  Mar  Sph  Spic  Sub  Tex
117   -        0           -        0           2    3    5    5    2     4    5
104   2        0.004452    5        0.415918    2    3    4    4    2     3    4
126   3        0.004596    6        0.421249    2    3    5    5    1     4    5
98    6        0.006817    17       0.505317    2    3    4    5    2     3    5
28    8        0.009119    16       0.504996    1    3    5    5    1     4    5
27    11       0.012752    2        0.380517    1    3    5    5    1     3    5
137   14       0.013072    9        0.430289    1    3    4    5    1     3    4
127   16       0.013606    11       0.474226    2    4    5    4    3     4    5
119   17       0.015268    20       0.538589    3    3    4    4    2     3    5
90    20       0.016383    7        0.425751    1    2    3    4    1     2    4

Figure 5 — Example of Image Retrieval Results

Applying a Threshold

We analyzed the difference in the scales of similarity by counting how many matches there were based on thresholds. Below is a graph for two similarity thresholds, 0.02 and 0.04. These thresholds are applied to the semantic similarity values. There were many more matches within these thresholds.

Matches   Gabor   Markov   Co-Occurrence   Gabor, Markov, and Co-Occurrence   All Features
6-10      24      18       31              36                                 43
2-5       107     104      94              98                                 93
0-1       18      27       24              15                                 13

Figure 6 — Match Count in 20 Most Similar Nodules

Figure 7 — Content-Based Similarity vs. Semantic-Based Similarity

[Scatter plot "Similarity Values: Content-Based vs. Semantic-Based": x-axis Semantic-Based Similarity, 0 to 0.5; y-axis Content-Based Similarity, 0 to 1]

[Chart "Thresholds on Semantic-Based Characteristics": x-axis Number of Matches, 0 to 56; y-axis Occurrence (Out of 149), 0 to 35; series Threshold 0.02 and Threshold 0.04]

Figure 6 — Match Based on All Features and Thresholds