
CHAPTER 3

ECHO IMAGE SEGMENTATION

3.1 INTRODUCTION

Segmentation is a vital aspect of medical imaging. It aids in the visualization of medical
data and in the diagnosis of various diseases. Ultrasound image segmentation, and echo image
segmentation in particular, is strongly influenced by the quality of the data. Characteristic
artifacts such as attenuation, speckle, shadows, and signal dropout complicate the task, and
the orientation dependence of acquisition can result in missing boundaries. Further
complications arise because the contrast between areas of interest, such as the LV and RV, is
often low [Alison, 2006]. Many general segmentation methods described in the literature are
applied to one or two ultrasound images without specific reference to ultrasound image
formation or context. The focus of this work, however, is specific to echo images, from which
regions of interest such as the LV and RV are extracted.

The segmentation process subdivides an image into its constituent regions or objects.
Automatic LV segmentation gives relatively poor results because of the speckle noise present
in echo images [Alison, 2006]. Nearly 75 years ago, Wertheimer pointed out the importance of
perceptual grouping and organization in vision and listed several key factors, such as
similarity, proximity, and good continuation, which lead to visual grouping. However, even to
this day, many of the computational issues of perceptual grouping remain unresolved. Many
algorithms have been proposed for image segmentation tasks, but they typically involve
extensive processing and therefore long computation times. Hence, fast image segmentation is
an important issue to be addressed.

Among the different clustering algorithms, the most widely used is K-Means clustering in
general and Lloyd’s algorithm in particular [Bommanna, 2010]. As part of this work a novel
technique for segmenting 2D echo images is proposed. It is based primarily on the K-Means
clustering algorithm, implemented with a set of DBMS tables and SQL statements. The SQL
implementation stays close to the dataset and avoids large raw data transfers to and from
the RDBMS [Wei Li, 2004]. Five variants of the K-Means algorithm are proposed here to reduce
the running time and speed up the query processing.

1. SQL based K-Means Algorithm using UPDATE statement

2. Fast SQL based K-Means using TRUNCATE-INSERT statements

3. Quick K-Means

4. K-Means with Stored Procedures

5. K-Means with External Procedures

All the above said algorithms have been implemented and evaluated for their performance

in terms of execution time considering 2D and color Doppler echo image data.

3.2 ECHO IMAGE PREPROCESSING

Ultrasound images suffer from an inherent imaging artifact called speckle. Speckle is the
random granular texture that obscures anatomy in ultrasound images and is usually described as
“noise”. It is created by a complex interference of ultrasound echoes made by reflectors
spaced closer together than the ultrasound system’s resolution limit. Many different
preprocessing, or advanced image processing, approaches have been proposed for speckle
reduction. The most common categories of approaches are median filters, Wiener filters,
diffusion filters, and wavelet filters. A more recent technique called SRI (Speckle Reduction
Imaging) is described as a real-time algorithm that removes speckle without the disadvantages
that have plagued other methods. Despite the merits of SRI, the median filter is used in this
research work to remove speckle from 2D echocardiographic images because of its simplicity
[Dong, 2008].

(a) (b)

Fig. 3.1 Median Filtering (a) Original echo image (b) After applying median filter

Median filtering is a smoothing technique, as is linear Gaussian filtering, and it can be
implemented with a per-pixel cost that is O(1) in the window size. All smoothing techniques
are effective in removing noise in smooth regions of a signal, but adversely affect edges. In
echo images it is especially important to preserve the edges so that the cardiac boundary can
be traced accurately. Edges are of critical importance to the visual appearance of images.
For small to moderate levels of Gaussian noise, the median filter is demonstrably better than
Gaussian blur at removing noise whilst preserving edges for a given, fixed window size. Its
performance is not better than Gaussian blur for high levels of noise, but for speckle noise
it is particularly effective. As the focus of this research work is to develop an efficient
segmentation algorithm, the existing median filter class of AForge.NET is used; its effect is
shown in Figure 3.1.

// AForge.NET Median filter class

Median filter = new Median();

// apply the filter

filter.ApplyInPlace(queryImage);

Here queryImage is the input echo image to which the median filter is applied. To preserve
the sharp edges, the smallest window is selected for the filter, which is the default of the
constructor (a radius of 1 px, i.e. a 3 × 3 neighbourhood). The larger the window size, the
stronger the blurring effect; a large window erodes the edges, which leads to a number of
discontinuities in the boundaries of the LV and other chambers.
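The edge-preserving behaviour described above can be illustrated with a minimal sketch in Python. It is not the AForge.NET implementation: the function name `median_filter_3x3`, the fixed 3 × 3 window, and the choice to leave border pixels unchanged are assumptions made only for this illustration.

```python
from statistics import median


def median_filter_3x3(img):
    """Replace each interior pixel by the median of its 3x3 neighbourhood.

    img is a list of rows of gray values; border pixels are copied unchanged
    (an assumption of this sketch, not necessarily what AForge.NET does).
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out


# A vertical edge (0 | 200) with one speckle-like outlier (255) next to it.
noisy = [
    [0, 0, 200, 200],
    [0, 0, 200, 200],
    [0, 255, 200, 200],
    [0, 0, 200, 200],
    [0, 0, 200, 200],
]
filtered = median_filter_3x3(noisy)
```

In this toy image the outlier is removed while the edge between the second and third columns stays exactly where it was, which is the property needed for tracing the cardiac boundary.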

3.3 CONVENTIONAL K-MEANS CLUSTERING ALGORITHM

The goal of this section is to describe the traditional K-Means algorithm (also called
Lloyd’s algorithm in the Computer Science community) and the issues related to it. The aim is
to design a simple, elegant, yet robust algorithm that segments a cardiac image for
extracting its main features [Francis, 2008]. For this purpose the K-Means clustering
algorithm is selected, which partitions a data set into several groups such that points
within a cluster are similar and points in different clusters are dissimilar. Segmentation is
a fundamental process for higher level medical image analysis, and K-Means is suitable for
biomedical image segmentation since the number of clusters (k) is usually known for images of
particular regions of human anatomy [Bommanna, 2010]. Watershed segmentation is another
popular method, but it suffers from the drawbacks mentioned in [Ashish, 2005]. Though K-Means
has been shown to be effective in producing good clustering results, its main drawback is its
running time on large data sets: the time complexity is O(nkd), where n is the number of data
points, k is the number of clusters, and d is the number of dimensions [Fahim, 2006]
[Chang, 1998] [Khaled]. Integrating the K-Means algorithm with SQL has the following
advantages:



1. The image data can easily be stored in relational DBMS and we can perform all

computations faster in SQL [Pitchaimalai, 2008]

2. Since the resolution of the image is generally large, handling such huge data sets is

much easier with the help of DBMS [Carlos, 2006a]

3. No need to transfer data from DBMS address space to application address space and

vice-versa, because all the patient data reside in DBMS

3.3.1 CLASSICAL PARTITIONING METHOD: K-MEANS

The most well-known and commonly used partitioning methods for clustering data are K-Means
and its variations such as K-Medoids, C-Means, etc. The K-Means algorithm takes the input
parameter, k, and partitions a set of n objects into k clusters so that the resulting
intracluster similarity is high but the intercluster similarity is low. Cluster similarity is
measured with respect to the mean value of the objects in a cluster, which can be viewed as
the cluster’s centroid or center of gravity.

The K-Means algorithm works as follows. First, it randomly selects k of the objects, each of
which initially represents a cluster mean or center. Each of the remaining objects is then
assigned to the cluster to which it is the most similar, based on the distance between the
object and the cluster mean.

Algorithm K_MeansCluster (k)
   // k  - number of Clusters
   // D  - data set containing n objects
   // Cj - clusters containing subset of n objects, where j ∈ {1, 2, .., k}
   Randomly choose k objects from D as the initial clusters
   Repeat
      foreach object il (where, l ∈ {1, 2, .., n}) do
         (re)assign il to Cj using Euclidean distance calculation (similar objects)
      foreach Cluster Cj do
         Update the cluster means (or centroids) considering all objects currently in the cluster
   Until no more reassigning
end K_MeansCluster.

Fig. 3.2 The Classical K-Means Algorithm

It then computes the new mean for each cluster. This process iterates until the criterion

function converges. The K-Means procedure is summarized in Figure 3.2. Typically, the

square-error criterion is used to terminate the algorithm and is defined in equation 3.1.


$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2 \qquad (3.1)$$

where, E is the sum of the square error for all objects in the data set; p is the given object;

mi is the mean of cluster Ci. In other words, for each object in each cluster, the distance from

the object to its cluster center is squared, and the distances are added.
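The procedure of Figure 3.2 and the criterion of equation 3.1 can be sketched in a few lines of Python for one-dimensional data. This is an illustrative sketch, not the implementation evaluated in this chapter: the function name `kmeans_1d`, the explicit seed centroids (instead of a random choice), and the rule that an empty cluster keeps its previous mean are assumptions. The seven values are those of the worked example in Section 3.5.1.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Lloyd's algorithm on scalar data.

    Returns (assignment, centroids, sse), where assignment[n] is the
    0-based cluster index of values[n] and sse is E from equation 3.1.
    """
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid (ties go to the lowest index).
        new_assignment = [
            min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            for v in values
        ]
        if new_assignment == assignment:  # no object moved: converged
            break
        assignment = new_assignment
        # Update step: recompute the mean of every non-empty cluster.
        for j in range(len(centroids)):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    sse = sum((v - centroids[a]) ** 2 for v, a in zip(values, assignment))
    return assignment, centroids, sse


pixels = [1, 2, 14, 8, 15, 9, 50]
assignment, means, sse = kmeans_1d(pixels, centroids=[1, 14, 9])
print(assignment)  # [0, 0, 2, 2, 2, 2, 1]
print(means)       # [1.5, 50.0, 11.5]
print(sse)         # 37.5
```

With these seeds the assignment stops changing after the second reassignment, which is the "no more reassigning" termination condition of Figure 3.2.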

Working of K-Means Algorithm

Suppose there are a set of objects, marked as black small triangles located in space as

shown in Figure 3.3(a) and let k = 3. According to the algorithm shown in Figure 3.2, three

objects are selected randomly as marked red, black, and blue circles in Figure 3.3(b). Each

object is assigned to a cluster based on the cluster center to which it is the nearest.

Fig. 3.3 Clustering of n-objects based on k-Means method, assuming k = 3.

The distance between each object (Di) and each centroid (Cj) over all d dimensions is
computed using the Euclidean distance [Carlos, 2006a] [Shehroz, 2004], as given in equation
3.2; each object is assigned to the centroid for which this distance is the smallest.

$$\text{Euclidean Distance} = \sqrt{\sum_{l=1}^{d} (D_{li} - C_{lj})^2} \qquad (3.2)$$
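Equation 3.2 and the nearest-centroid rule can be written directly in Python. The helper names `euclidean` and `nearest` are hypothetical and serve only to illustrate the formula for points with d coordinates.

```python
import math


def euclidean(point, centroid):
    """Equation 3.2: square root of the summed squared coordinate differences."""
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))


def nearest(point, centroids):
    """1-based index j of the centroid closest to point (lowest j on ties)."""
    distances = [euclidean(point, c) for c in centroids]
    return distances.index(min(distances)) + 1


print(euclidean((0, 0), (3, 4)))            # 5.0
print(nearest((8,), [(1,), (14,), (9,)]))   # 3
```

For the gray scale pixel data used later in this chapter d = 1, so the distance reduces to the absolute difference of two intensity values.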

Next, the cluster centers are updated; that is, the mean value of each cluster is
recalculated based on the objects currently in the cluster. Using the new cluster centers,
the objects are reassigned to the clusters based on which cluster center is the nearest. This
reassignment is shown as dotted circles in Figure 3.3(c).

This process is iterated until no objects move from one cluster to another. This situation is
shown in Figure 3.3(e), and at this point the algorithm terminates. The final clusters with
their objects are returned as the result. The algorithm attempts to determine k partitions
that minimize the square-error function. The method is scalable and efficient in processing
large data sets because its theoretical complexity is linear in the number of objects. In
practice, however, it can take considerable time, and the method often terminates at a local
optimum.

Variations of K-Means

A number of variations to the K-Means algorithm have been developed in an effort to

improve its computational efficiency or extend its expressiveness in categorical and mixed

data. The ISODATA algorithm uses the technique of merging and splitting clusters in order to

obtain the optimal partition, starting from any arbitrary initial partition, utilizing appropriate

threshold values for performing this process [Georgios, 2007].

The dynamic clustering algorithm permitted other representations than the center of a

cluster utilizing maximum-likelihood estimation, selecting a different criterion function. Other

research efforts improved computational complexity by reducing the number of (dis)similarity

calculations. But two very important steps in the evolution of the K-Means algorithm family

involve its extension to categorical, mixed numeric and categorical values through the

development of the K-modes and K-prototypes algorithms. K-modes uses a simple matching

dissimilarity measure to deal with categorical objects while the K-prototypes defines a

combined dissimilarity measure, integrating the K-modes and K-Means algorithms to allow for

clustering of mixed numeric and categorical attributes.

Issues of K-Means Method

There are several issues to be addressed in implementing K-means for any practical

applications such as image segmentation. These are summarized below:

1. Selection of number of clusters. With regard to echo image clustering, it is normally set

to 3 as there is no need to identify more regions.

2. Initial values or seed values for the cluster centroids. Most of the researchers follow

random selection of initial cluster centers.


3. If one or more clusters become empty, this has to be taken care of in the design of the
model and the SQL statements.

4. Use of an appropriate distance metric. A number of methods are specified in the literature
for the distance calculation: Manhattan, Minkowski, Cosine-similarity, etc. However, the
Euclidean distance is reported to perform better than the other techniques.

5. Condition to terminate the algorithm. K-Means, as explained earlier, can take a long time
to converge, because the number of iterations depends upon the distribution of the data.
Fixing the number of iterations in advance is a faster approach.

One of the major issues of K-Means algorithm is its poor time complexity for large data

sizes [Fahim, 2006]. Due to this drawback, K-Means may not be suitable for practical

applications such as images with high resolution. To address this, instead of implementing this

algorithm in application address space, a model can be devised where all the K-Means steps are

executed within the DBMS address space. This procedure is expected to speed up the process

with appropriate SQL and/or PL/SQL statements.

3.4 SQL BASED K-MEANS ALGORITHM: USING UPDATE STATEMENT

This section describes the modified K-Means algorithm for the image segmentation process

by using a well designed schema (a set of DBMS tables) and SQL statements. The main

contribution lies in speeding up the Euclidean distance and update operations of cluster

assignment. For K-means, the most intensive step is distance computation, which has time

complexity O(dkn). This step requires both significant CPU use and I/O [Carlos, 2006a].

3.4.1 DEFINITIONS

The input for K-Means is a data set D containing n points with d dimensions, D = {i1, i2, i3,
.., in}. Here k = 3 is chosen, because echo images are segmented into three major regions in
the cardiac chamber: blood pool (black region), near endocardium (white region), and the rest
(gray region). As explained earlier in Chapter 2, the data set consists of the pixel values
of the given image f(x, y) of size M × N, where f(x, y) is the gray scale value of the pixel
at location (x, y). All 2D echo images are stored as RGB images whose three channels carry
the same intensity value. Hence, the pixel data set can be treated as 1-D rather than 3-D,
which makes the algorithm simpler.

Table 3.1 Matrices / tables

Matrix      Size     Description
Data        n × d    Pixel Data
Centroid    k × d    Cluster Mean
Eucl        n × k    Euclidean Distance
CV          n × 1    Cluster Assignment
CD          n × d    Cluster Data

A total of five tables are needed for this algorithm and their structures are shown in Table

3.1 which store the image pixel data.

Each tuple in Data represents a pixel with its spatial co-ordinates and the gray scale

intensity [0-255] value. Since k = 3, the Centroid table always contains 3 rows with the pixels

being selected as centroids in each iteration. Next, in order to store the Euclidean distances, the

table Eucl is used and each entry in this table gives the distance of ith pixel to the respective

centroids in k clusters. The table CV (Cluster Vector) is a temporary table that stores the pixels

and their assigned cluster number (j). Finally, CD (Cluster Data) is a join of CV and Data tables

giving the pixel (its x, y, and the gray scale value) details along with the cluster number

assigned after the specified number of iterations. This is the desired output and this table can be

used for processing. The following subscripts are defined as:

i : 1..n : number of data points (pixels)

j : 1..k : number of clusters

l : 1..d : number of dimensions

3.4.2 PROPOSED METHOD – K-MEANS SQL ALGORITHM

This subsection describes the proposed algorithm for segmentation process by using an

efficient implementation of K-Means algorithm using SQL.

Data:      i | x | y | val
Centroid:  j | x | y | val
Eucl:      i | d1 | d2 | d3
CV:        i | j
CD:        i | j | x | y | val

Fig. 3.4 Schema of all five tables for K-Means Algorithm. Shaded cells are Primary Keys.

It follows almost the same approach as that of the algorithm shown in Figure 3.2, except
that all computations are carried out with the five relational DBMS tables shown above and a
series of UPDATE statements written in SQL. Figure 3.4 shows the relational schema with the
attributes.

The Data table consists of 4 attributes. The first attribute, i, is the identification number
(id) of each pixel; the second and third attributes, <x, y>, are the spatial co-ordinates of
the ith pixel in the echo image; and the fourth attribute is its intensity value. Here, i is
declared as the primary key. Next, the Centroid table is used to store the mean value of each
cluster. For this table j is selected as the primary key. It is also a foreign key which
references the primary key i in the Data table. The seed values for the three clusters are
picked from the Data table using random number generation and stored in this table.
Subsequently, the new mean of each cluster is calculated and updated. During these iterations
the <x, y> values have no meaning. Since k = 3, the Centroid table always has 3 tuples. For
larger values of k (k > 3), that many rows are inserted in this table and no change in its
schema is required.

The distance between each data point in Data with each of cluster mean, i.e. each row in

Centroid, is computed using Euclidean distance and the computed value is stored in Eucl table.

The attributes d1, d2, and d3 represent the distance of ith pixel to each cluster center. To assign

a cluster number for each pixel, the minimum of <d1, d2, d3> is computed and stored in the

table CV. That is, the attribute i is the pixel id and j is the assigned cluster number during a

particular iteration. To produce the result a natural join of CV with Data is carried out to obtain

the final output table CD, showing the pixel id, cluster number, x-y coordinates, and pixel

intensity value.

Except Data table, all the other tables are updated during every iteration. It is assumed that

all these tables are properly indexed. The algorithm is terminated after a finite number of

iterations. Better results for clustering were obtained with 4 to 6 iterations.
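The schema described above can be sketched as follows, using Python's built-in `sqlite3` module as a stand-in for the Oracle DBMS. The column types and the primary keys of Eucl, CV, and CD (taken here to be the pixel id i) are assumptions, and the foreign key from Centroid to Data is left out of this sketch.

```python
import sqlite3

# Five tables of Fig. 3.4; types and the keys of Eucl, CV, CD are assumed.
SCHEMA = """
CREATE TABLE Data     (i INTEGER PRIMARY KEY, x INTEGER, y INTEGER, val INTEGER);
CREATE TABLE Centroid (j INTEGER PRIMARY KEY, x INTEGER, y INTEGER, val REAL);
CREATE TABLE Eucl     (i INTEGER PRIMARY KEY, d1 REAL, d2 REAL, d3 REAL);
CREATE TABLE CV       (i INTEGER PRIMARY KEY, j INTEGER);
CREATE TABLE CD       (i INTEGER PRIMARY KEY, j INTEGER,
                       x INTEGER, y INTEGER, val INTEGER);
"""


def create_schema(conn):
    """Create the five K-Means tables on an open connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = sorted(
    name for (name,) in
    conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)  # ['CD', 'CV', 'Centroid', 'Data', 'Eucl']
```

Declaring the id columns as primary keys gives each table an index on the column used in the correlated updates, in line with the assumption that the tables are properly indexed.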

Algorithm K_Means_SQL(Q)
   // Input: Grayscale image, I; Q - # of iterations
   // Output: Segmented Image, S
   DeleteTableData()   // Delete all rows in Data
   Data ← I
   Initialize Centroid table with random values selected from Data table
   Initialize Eucl, CV, CD tables
   for m ← 1 to Q do
      a) Compute the mean for each cluster grouped by j in CD table and update
         Centroid table
      b) Compute the Euclidean distance for each pixel in Data table with each
         cluster mean in Centroid table and update Eucl table
      c) Compute the minimum for each row in Eucl and assign cluster number.
         Update CV table.
      d) Update CD table by joining CV and Data tables for the next iteration
   Foreach cluster data do
      a) C1 ← 0; C2 ← 150; C3 ← 255
      b) Update CD
   Create the image, S out of CD table data
   Return S
end K_Means_SQL.

Fig. 3.5 Algorithm for SQL based K-Means

The proposed algorithm is shown in Figure 3.5. Its running-time efficiency comes mainly from
the for loop, whose steps are executed using only UPDATE statements and not the costly
two-step process of DELETE and INSERT.

3.4.3 INITIALIZATION STEPS

For proper working of the SQL based K-Means the database tables must be initialized with

proper data as explained below.

(1) DeleteTableData(): Before populating the database tables, any old data must be deleted to
avoid integrity constraint violations. This is done using the following SQL statements:

DELETE FROM CD;

DELETE FROM CV;

DELETE FROM Eucl;

DELETE FROM Centroid;

DELETE FROM Data;

(2) Load image data to Data table: The image data is loaded into the Data table using:

INSERT INTO Data (i, x, y, val)
VALUES (:i, :x, :y, :val);

where :i, :x, :y, :val are arrays holding the pixel ids, pixel coordinates, and intensity
values. This parameterized insertion is more efficient: instead of inserting one row at a
time, which is time consuming, four one-dimensional arrays (id, x, y, and val) are bound to
the statement so that all rows are inserted in a single operation.
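A comparable bulk load can be sketched in Python with `sqlite3`, where `executemany` plays the role of the array-bound INSERT. This is an analogy rather than the C#.NET/Oracle code used in this work; the function name `load_image` and the row-major numbering of pixel ids starting at 1 are assumptions.

```python
import sqlite3


def load_image(conn, image):
    """Flatten a 2-D gray scale image (list of rows) into (i, x, y, val)
    tuples and insert them with a single parameterized statement."""
    width = len(image[0])
    rows = [
        (y * width + x + 1, x, y, val)   # ids numbered row by row from 1
        for y, line in enumerate(image)
        for x, val in enumerate(line)
    ]
    conn.execute("DELETE FROM Data")     # remove any old image first
    conn.executemany(
        "INSERT INTO Data (i, x, y, val) VALUES (?, ?, ?, ?)", rows
    )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Data (i INTEGER PRIMARY KEY, x INTEGER, y INTEGER, val INTEGER)"
)
n = load_image(conn, [[1, 2, 14], [8, 15, 9]])
print(n)  # 6
```

The statement is prepared once and executed for all rows, which avoids the per-row statement overhead that the text identifies as time consuming.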

(3) Initialize Centroid: In order to initialize the Centroid table, 3 pixels are selected from
Data using random number generation.

INSERT INTO Centroid (
   SELECT 1, x, y, val FROM Data WHERE i = :j1);
INSERT INTO Centroid (
   SELECT 2, x, y, val FROM Data WHERE i = :j2);
INSERT INTO Centroid (
   SELECT 3, x, y, val FROM Data WHERE i = :j3);

where j1, j2, and j3 are the three random numbers generated using the host language
(C#.NET) restricted to 0-255.

(4) Initializing other Tables: The other tables are initialized in a similar way and are shown

below:

INSERT INTO Eucl (

SELECT i, sqrt(power((Data.val - c1.val), 2)) as d1,

sqrt(power((Data.val - c2.val), 2)) as d2,

sqrt(power((Data.val - c3.val), 2)) as d3

FROM Data,

(SELECT * FROM Centroid WHERE j = 1) c1,

(SELECT * FROM Centroid WHERE j = 2) c2,

(SELECT * FROM Centroid WHERE j = 3) c3)

ORDER BY i;

INSERT INTO CV (
   SELECT i,
      Case when d1 <= d2 and d1 <= d3 then 1
           when d2 <= d3 and d2 <= d1 then 2
           else 3
      end as j
   FROM Eucl);

INSERT INTO CD (

SELECT Data.i, j, Data.x, Data.y, Data.val

FROM Data, CV

WHERE Data.i = CV.i);

Here the Euclidean distance of every pixel to all three centroids is calculated with a single
SQL statement. No join condition is needed: Data is simply combined with the three one-row
subqueries c1, c2, and c3, and the time taken for this operation is negligible because the
Centroid table has only 3 rows. Similarly, the CV table is populated in one step, with a Case
expression selecting the cluster at minimum distance. Using the cluster assignment details
from the CV table, the table CD gets all the details of the data points with an equi-join
operation.
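For one-dimensional pixel data the expression sqrt(power(a - b, 2)) is simply the absolute difference |a - b|, and the Case expression resolves ties in favour of the lowest cluster number. Both points can be checked with a small Python sketch; the helper names `eucl_row` and `assign_cluster` are hypothetical and merely mirror the two SQL statements.

```python
import math


def eucl_row(val, centroids):
    """Distances d1, d2, d3 of one pixel value to the three cluster means,
    computed exactly as in the SQL: sqrt(power(val - c, 2))."""
    return tuple(math.sqrt((val - c) ** 2) for c in centroids)


def assign_cluster(d1, d2, d3):
    """Mirror of the Case expression used to populate CV."""
    if d1 <= d2 and d1 <= d3:
        return 1
    if d2 <= d3 and d2 <= d1:
        return 2
    return 3


row = eucl_row(8, (1, 14, 9))
print(row)                      # (7.0, 6.0, 1.0)
print(assign_cluster(*row))     # 3
print(assign_cluster(5, 5, 5))  # 1  (ties go to the lowest cluster id)
```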

3.4.4 UPDATING DBMS TABLES

The four sub-steps (a) to (d) under the for loop of the algorithm can each be designed using
an UPDATE statement. An alternative is to use a DELETE-INSERT combination. However, the
DELETE statement is slow in any database, because all the deleted records are written to the
redo/undo log files for a possible rollback operation. It is more often a row-by-row
operation than a bulk operation.

The fundamental requirement is that all tables except the Data table must be updated.
Depending upon the data, certain clusters may be empty. For instance, assume that in a given
image all the pixels are of the same intensity, leading to just one cluster. Then the other
two clusters would be empty. This normally leads to errors during the SQL join operations.
Hence, the queries are designed to take care of this situation using a left outer join. The
sub-steps of the for loop in Figure 3.5 can be accomplished using the following SQL
statements:

(a) Update of Centroid Table

During each iteration, the table data of CD is nothing but the pixels with assigned cluster id

(1, or, 2, or 3). As per the K-Means algorithm the new mean must be computed and stored in

Centroid table. This is obtained by executing the following query:

SELECT j, Avg(val) as val

FROM CD

GROUP BY j;


This query produces 3 rows, one for each cluster (fewer if a cluster is empty). The update
must be carried out in Centroid without deleting the old data. This is achieved by using a
correlated subquery as shown below:

UPDATE Centroid c3

SET (j, val) =

( SELECT c1.j, c2.val

FROM Centroid c1,

(SELECT j, Avg(val) as val

FROM CD

GROUP BY j) c2

WHERE c1.j = c2.j(+) AND c1.j = c3.j );

This concept is adopted for all table updates.
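The effect of the outer join on an empty cluster can be reproduced with a small sketch in Python and `sqlite3`. SQLite does not support Oracle's `(+)` notation or multi-column `SET (j, val) = (...)` with this syntax in older versions, so the sketch uses `LEFT OUTER JOIN` and a single-column correlated update instead; the table contents are made-up illustration data in which cluster 2 has no pixels.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Centroid (j INTEGER PRIMARY KEY, val REAL);
    CREATE TABLE CD (i INTEGER PRIMARY KEY, j INTEGER, val REAL);
    INSERT INTO Centroid VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO CD VALUES (1, 1, 4), (2, 1, 6), (3, 3, 40);  -- cluster 2 is empty
""")

# Outer join: every cluster keeps its row; the empty one gets a NULL mean.
means = conn.execute("""
    SELECT c.j, m.val
    FROM Centroid c
    LEFT OUTER JOIN (SELECT j, AVG(val) AS val FROM CD GROUP BY j) m
      ON m.j = c.j
    ORDER BY c.j
""").fetchall()

# Correlated update in place, without deleting the old Centroid rows.
conn.execute("""
    UPDATE Centroid
    SET val = (SELECT AVG(val) FROM CD WHERE CD.j = Centroid.j)
""")
updated = conn.execute("SELECT j, val FROM Centroid ORDER BY j").fetchall()
print(updated)  # [(1, 5.0), (2, None), (3, 40.0)]
```

The NULL mean of the empty cluster is what the NVL function in the next step has to deal with.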

(b) Update of Eucl Table

The purpose of Eucl table is to find the Euclidean distance from each pixel in Data table to

each of the cluster centroid, i.e. each Centroid table row. This operation must be carried out

with a single SQL statement so that the design can be made more general. In other words, the

same SQL statement will be applicable for any value k; thus achieving scalability.

UPDATE Eucl e1
SET (i, d1, d2, d3) =
   ( SELECT i, sqrt(power((e2.val - c1.val), 2)) as d1,
               sqrt(power((e2.val - c2.val), 2)) as d2,
               sqrt(power((e2.val - c3.val), 2)) as d3
     FROM Data e2,
          ( SELECT j, x, y, NVL(val, 0) as val
            FROM Centroid
            WHERE j = 1) c1,
          ( SELECT j, x, y, NVL(val, 0) as val
            FROM Centroid
            WHERE j = 2) c2,
          ( SELECT j, x, y, NVL(val, 0) as val
            FROM Centroid
            WHERE j = 3) c3
     WHERE e1.i = e2.i);

The subquery generates four tables: Data, c1, c2, and c3 corresponding to pixel data and the

three clusters. The SELECT clause computes the distance and updates the three distances d1,



d2, and d3. Note that NVL function is used to take care of the possible NULL values in any of

the clusters.

(c) Update CV Table

Cluster assignment is another important step, in which each pixel is assigned the appropriate
cluster id based upon the minimum of d1, d2, and d3 from the table Eucl. This is done using
the following query:

UPDATE CV v1
SET (i, j) =
   ( SELECT i,
        Case when d1 <= d2 and d1 <= d3 then 1
             when d2 <= d3 and d2 <= d1 then 2
             else 3
        End as j
     FROM Eucl v2
     WHERE v1.i = v2.i);

It is interesting to note that this query uses a Case statement and no join operation, except

for correlated update operation.

(d) Update CD Table

For the next iteration to work properly, the actual input data pixels and its assigned cluster

id must be ready. This means that the Data table and CV table must be joined to obtain the

same. Below query shows this operation:

UPDATE CD c1

SET (i, j, x, y, val) =

( SELECT d2.i, j, d2.x, d2.y, d2.val

FROM Data d2, CV d3

WHERE d2.i = d3.i AND c1.i = d2.i);

Instead of DELETE and then INSERT, an UPDATE statement is used for this purpose.

After the desired number of iterations, i.e. Q the final cluster assignment can be found in table

CD.

(e) Final Output Table - CD

In order to display the segmented image, 0 (black) is assigned to all pixels in cluster 1, 150

(gray) is assigned to all pixels in cluster 2, and 255 (white) is assigned to all pixels in cluster 3.


UPDATE CD c1
SET (j, val) =
   ( SELECT j, DECODE (j, 1, 0,
                          2, 150,
                          3, 255) val
     FROM CD c2
     WHERE c1.i = c2.i);

Fig. 3.6 Query for segmented image pixels

Since it is faster to do this task with a single SQL statement, the same is achieved by using

a DECODE statement as shown in Figure 3.6. Now the table CD contains the pixels with

appropriate intensity values (0/150/255) corresponding to the 3 clusters and using this data we

can construct an image. The computational efficiency, quality of clustering, and other details

related to this design are discussed later in this chapter.
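How an image can be constructed from the rows of CD is sketched below in Python. The function name `build_image` and the assumption that x is the column index and y the row index are illustrative only; the mapping of cluster ids to gray levels is the one given by the DECODE statement.

```python
# Gray level shown for each cluster id, as in the DECODE statement.
LEVELS = {1: 0, 2: 150, 3: 255}


def build_image(cd_rows, width, height):
    """Turn (i, j, x, y) rows of CD into a 2-D list of gray values.

    x is taken as the column and y as the row (an assumption of this sketch).
    """
    image = [[0] * width for _ in range(height)]
    for _i, j, x, y in cd_rows:
        image[y][x] = LEVELS[j]
    return image


cd_rows = [(1, 1, 0, 0), (2, 3, 1, 0), (3, 2, 0, 1), (4, 3, 1, 1)]
segmented = build_image(cd_rows, width=2, height=2)
print(segmented)  # [[0, 255], [150, 255]]
```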

3.4.5 ISSUES RELATED TO SQL K-MEANS ALGORITHM

Though the SQL based K-Means algorithm is faster than the other versions, it suffers from

the following shortcomings:

1. The major drawback of the current design is that UPDATE operation is slower, because

it is a row-by-row operation.

2. To achieve UPDATE query design, the correlated version requires an extra join

operation. This obviously slows down the execution.

3. Two tables, CV and CD, are used in the design. These can be merged into a single table.

These weaknesses can be addressed by modifying the schema design and this modified

design is discussed in the next section.

3.5 FAST SQL BASED K-MEANS ALGORITHM USING TRUNCATE-INSERT STATEMENTS

As explained in section 3.4.5 UPDATE is a row based statement leading to inefficient

execution. To overcome this inefficiency DELETE-INSERT combination can be used to update

the rows in each of the tables.

However, DELETE is time consuming as stated earlier and therefore a faster version which

can be used in place of DELETE is TRUNCATE. It is faster for the reason that this statement



does not store the records in the redo/undo log buffer and it is a page-wise operation. Also,

merging CV and CD tables into one table called as CVCD will further speed up the execution.

Data:      i | x | y | val
Centroid:  j | x | y | val
Eucl:      i | d1 | d2 | d3
CVCD:      i | j | val
SI:        i | j | val

Fig. 3.7 Schema for Fast K-Means Algorithm. Shaded cells are Primary Keys

The modified schema is shown in Figure 3.7, where the table structures remain the same as in
the previous design except that CV and CD are merged into a single table called CVCD. The
segmented image pixel data is available in a new table called SI. In this design the update
operations/steps are reduced from 4 to 3, as shown in Figure 3.8.

Algorithm FKM_SQL(I, Q)
   // Input: Echo image, I and # of iterations: Q
   // Output: Segmented Image, S
   1. DeleteTableData()   // TRUNCATE table
   2. Load Image pixel values to Data table.
   3. Initialize Centroid table with random values selected from Data table.
   4. Initialize Eucl and CVCD tables.
   5. for i ← 1 to Q do
      a) INSERT INTO Centroid ← Average(CVCD Group By j)   // average of each
         cluster, j = 1..k
      b) INSERT INTO Eucl ← Euclidean distance (Datai * Centroidj), for i = 1..n
         and j = 1..k
      c) INSERT INTO CVCD ← Min(Eucli(d1, d2, d3))   // cluster assignment
   6. For each cluster data in CVCD, assign a special value [0, 150, 255] and insert
      into SI.
   7. Create the image, S out of SI table data.
   8. Return the segmented image, S.
   9. end FKM-SQL.

Fig. 3.8 Algorithm for Fast SQL based K-Means

Before inserting any new image data pixels into Data table, all the rows in this table must

be deleted. For this task, the following statements are used:

TRUNCATE TABLE CVCD;

TRUNCATE TABLE EUCL;

TRUNCATE TABLE CENTROID;

TRUNCATE TABLE DATA;

The for loop iterates Q times and executes only 3 steps. Sub-step (a) replaces the Centroid
table data with the newly computed mean of each cluster. Next, sub-step (b) computes the
distance between each pixel in the Data table and each cluster centroid and inserts the
results into the Eucl table. Finally, sub-step (c) computes the minimum of d1, d2, d3,
assigns the corresponding cluster id to each pixel, and inserts the result into CVCD.

The experimental results show that approximately 4 to 6 iterations are sufficient to attain
good segmentation. To show the segmented image, the pixel data in each cluster is assigned a
distinct gray level: black (0), gray (150), white (255). Lines 6 and 7 perform these
operations and the final segmented image data is stored in table SI.

SQL Implementation

The SQL code for the update operations of lines 5(a), 5(b), and 5(c) is shown here.

Inserting into Centroid Table

As the number of tuples in Centroid is always 3, there is no need to apply TRUNCATE and

then INSERT statements. A direct approach is to use the UPDATE statement itself as shown

below:

UPDATE Centroid c3

SET ( j, val ) =

( SELECT c1.j, c2.val FROM Centroid c1,

( SELECT j, Avg(val) as val

FROM CVCD

GROUP BY j ) c2

WHERE c1.j = c2.j(+) AND c1.j = c3.j );

Inserting into Eucl Table

Before inserting new values into Eucl table, its contents are removed by executing

TRUNCATE statement.



INSERT INTO Eucl (i, d1, d2, d3)

( SELECT i, sqrt(power((e2.val - c1.val), 2)) as d1,

sqrt(power((e2.val - c2.val), 2)) as d2,

sqrt(power((e2.val - c3.val), 2)) as d3

FROM Data e2,

(SELECT j, x, y, NVL(val, 0) as val FROM Centroid WHERE j = 1) c1,

(SELECT j, x, y, NVL(val, 0) as val FROM Centroid WHERE j = 2) c2,

(SELECT j, x, y, NVL(val, 0) as val FROM Centroid WHERE j = 3) c3 );

Inserting into CVCD Table

The previous SQL statement populates each row of the Eucl table with the three distances to
the cluster centers. The ith pixel in this table belongs to one of the clusters (i.e. 1 or 2
or 3) depending upon which of d1, d2, and d3 is the minimum.

INSERT INTO CVCD (i, j, val)
   ( SELECT v1.i,
        Case when d1 <= d2 and d1 <= d3 then 1
             when d2 <= d3 and d2 <= d1 then 2
             else 3
        end as j, v1.val
     FROM Eucl v2, Data v1
     WHERE v2.i = v1.i);

A Case statement is used in order to find the least value out of d1, d2, and d3 and assign the

corresponding value as cluster id. Hence, after the execution of this query the CVCD table

would contain pixel data id, cluster number, and the actual grayscale value corresponding to i,

j, and val respectively.

These three operations are executed Q times and the algorithm converges quickly as

explained earlier. The next step is to assign distinct grayscale values to pixels in each cluster.

The following query does this:

INSERT INTO SI (i, j, val)

( SELECT i, j, DECODE

(j, 1, 0,

2, 150,

3, 255) val

FROM CVCD

);



It is a simple query that assigns 0 (black) to all pixels in cluster 1, 150 (gray) to all pixels in

cluster 2, and 255 (white) to all pixels in cluster 3 and stores these pixel data in a new table SI.
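The recoloring step is a plain lookup from cluster id to gray level. A minimal Python sketch of the same mapping, assuming the CVCD rows are available as (i, j, val) tuples; the names `CLUSTER_TO_GRAY` and `recolor` are illustrative and not part of the original implementation:

```python
# Gray level per cluster id, mirroring DECODE(j, 1, 0, 2, 150, 3, 255).
CLUSTER_TO_GRAY = {1: 0, 2: 150, 3: 255}


def recolor(cvcd_rows):
    """Return SI rows (i, j, gray) for CVCD rows (i, j, val)."""
    # dict.get returns None for an unknown cluster id, just as DECODE
    # without a default value returns NULL.
    return [(i, j, CLUSTER_TO_GRAY.get(j)) for (i, j, _val) in cvcd_rows]
```

As in the SQL query, the original gray value is discarded and only the cluster id decides the output intensity.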

3.5.1 EXAMPLE

To illustrate the working of the algorithm as shown in Figure 3.8, assume just 7 rows in the

Data table. Figure 3.9 shows the initial data in the following tables: Data, Centroid, Eucl, and

CVCD.

Data
i   x   y   val
1   0   0   1
2   0   1   2
3   0   2   14
4   1   0   8
5   1   1   15
6   2   1   9
7   2   2   50

Centroid
j   x   y   val
1   0   0   1
2   0   2   14
3   2   1   9

Eucl
i   d1   d2   d3
1   0    13   8
2   1    12   7
3   13   0    5
4   7    6    1
5   14   1    6
6   8    5    0
7   49   36   41

CVCD
i   j   val
1   1   1
2   1   2
3   2   14
4   3   8
5   2   15
6   3   9
7   2   50

Fig. 3.9 Table data after initialization process

Now when the SQL statements in the for loop are executed, i.e. the update of Centroid and two insert statements of Eucl and CVCD tables, the table data get modified. The output of these tables at the end of first iteration is shown in Figure 3.10.

Centroid
j   x   y   val
1   0   0   1.5
2   0   2   26.33
3   2   1   8.5

Eucl
i   d1     d2      d3
1   .5     25.33   7.5
2   .5     24.33   6.5
3   12.5   12.33   5.5
4   6.5    18.33   .5
5   13.5   11.33   6.5
6   7.5    17.33   .5
7   48.5   23.66   41.5

CVCD
i   j   val
1   1   1
2   1   2
3   3   14
4   3   8
5   3   15
6   3   9
7   2   50

Fig. 3.10 Contents of tables at the end of first iteration

Comparing the CVCD after first iteration with the initial values, it is clear that no change

occurs in cluster 1. But elements 14 and 15 have moved from cluster 2 to cluster 3. Hence, the

cluster data after the first iteration is C1 = {1, 2}, C2 = {50}, C3 = {14, 8, 15, 9}. We can easily

observe that any further iteration will not shift the elements. The termination of the algorithm is

purely based on Q, which is set by visually verifying the quality of clustering output.
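The worked example can be checked with a short Python sketch that mimics the three SQL steps on in-memory lists. The function names (`centroids_from`, `distances`, `assign`) are illustrative assumptions; only the 1-D grayscale case with k = 3 is covered, and empty clusters fall back to 0 in the spirit of the NVL(val, 0) calls.

```python
def centroids_from(cvcd, k=3):
    """Step (a): mean gray value per cluster (the Centroid update)."""
    means = {}
    for j in range(1, k + 1):
        vals = [val for (_i, cj, val) in cvcd if cj == j]
        means[j] = sum(vals) / len(vals) if vals else 0.0  # like NVL(val, 0)
    return means


def distances(data, means):
    """Step (b): one Eucl row (i, d1, d2, d3) per pixel; in 1-D the
    Euclidean distance reduces to |val - centroid|."""
    return [(i, *(abs(val - means[j]) for j in sorted(means)))
            for (i, val) in data]


def assign(data, eucl):
    """Step (c): cluster id = position of the smallest distance
    (ties go to the lower cluster id, as in the CASE expression)."""
    vals = dict(data)
    return [(i, ds.index(min(ds)) + 1, vals[i]) for (i, *ds) in eucl]


DATA = [(1, 1), (2, 2), (3, 14), (4, 8), (5, 15), (6, 9), (7, 50)]  # (i, val)
cvcd0 = assign(DATA, distances(DATA, {1: 1, 2: 14, 3: 9}))    # Figure 3.9
means1 = centroids_from(cvcd0)
cvcd1 = assign(DATA, distances(DATA, means1))                 # Figure 3.10
cvcd2 = assign(DATA, distances(DATA, centroids_from(cvcd1)))  # one more pass
```

Here `cvcd1` reproduces the CVCD table of Figure 3.10, and `cvcd2` equals `cvcd1`, which confirms that a further iteration does not move any element.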

Fig. 3.11 Segmentation by applying Fast SQL K-Means Algorithm: (a) original image, (b) segmented image with Q = 6



As an example of segmentation with the Fast SQL based K-Means algorithm, Figure 3.11 shows (a) the original echo image in the apical four-chamber view of a normal patient and (b) its segmented image.

3.5.2 THEORETICAL TIME COMPLEXITY

In a clustering problem, given n points in $\mathbb{R}^d$ and an integer k which denotes the number of clusters, a partition S = (S1, S2, …, Sk) of the given points into disjoint non-empty sets is constructed and corresponding centers C = (c1, c2, …, ck) are chosen such that

$$\sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - c_i \rVert^2$$

is minimized. So given the clusters, the centers would have to be the centroids. Similarly, given the centers, the clusters would be formed by assigning each point in the set to its nearest center. However, if neither is given, the problem is NP-complete [Leonard, 2009] even for k = 2. This is denoted as

.

The most common clustering algorithm used is the Lloyd's algorithm. The algorithm starts

with some k clusters, then computes centroids of those clusters and does a reassignment of

points based on the following. For each point, the nearest centroid is designated and the point is

said to belong to the corresponding cluster. This is repeated until the point assignment

stabilizes; that is, there is no more rearrangement of points. It can be observed that the

algorithm gives rise to a local optimum. For this process a theoretical upper bound has to be

determined.

There is a trivial upper bound of $O(k^n)$ iterations, since no partition of points into clusters is ever repeated during the course of the algorithm. In d-dimensional space, this bound was slightly improved by Inaba et al. to $O(n^{kd})$ by counting the number of distinct Voronoi partitions on n points [Leonard, 2009].

In some cases the upper bound of the traditional K-Means algorithm is determined to be O(m k d n), where n is the number of data objects, d is the number of dimensions, k is the number of clusters, and m is the number of iterations. In this algorithm most of the time is spent computing the vector distances. Even a linear algorithm can be quite slow if one of the arguments of O(...) is large, and it is observed that d is usually large. In echo image segmentation, however, the value of d is just 1 for grayscale images and 3 for color images.
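As a rough worked instance of this bound, assume a 400×250 grayscale echo image (n = 100,000, d = 1) with k = 3 clusters and m = 6 iterations. These values are taken from elsewhere in this chapter, and the count ignores constant factors:

```latex
m \cdot k \cdot d \cdot n = 6 \times 3 \times 1 \times 100{,}000 = 1.8 \times 10^{6}
```

For a color image of the same size (d = 3), the same product gives $5.4 \times 10^{6}$, so the dimensionality enters only as a small constant factor in this application.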

In the SQL implementation, the time complexity of the algorithm shown in Figure 3.8 can

be written as follows. The active operations that contribute for the complexity analysis are:



for i ← 1 to Q do
    (a) Centroid ⋈ CVCD
    (b) Data × Centroid(j=1) × Centroid(j=2) × Centroid(j=3)
    (c) Data ⋈ Eucl
end for

The complexity of these three steps can be calculated as:

Step a: Since CVCD table is indexed, its time-complexity is O (log n), assuming the

size of Centroid is small (in this case, it is 3)

Step b: In the worst-case, all pixels may be in a single cluster and Data table is

indexed. Hence, the upper bound is O(log n)

Step c: Both Data and Eucl tables are of size n. Assuming Data table is indexed, the

upper bound for this step will be O(log n * n)

Since Q is constant and does not depend on n, the total running time can be written as

T(n) = O(log n) + O(log n) + O(n log n) = O(n log n)

This result suggests that the proposed algorithm with proper indexing of the tables should

give better performance than the traditional algorithm.

3.6 QUICK K-MEANS: AN IMPROVED VERSION

The Fast SQL K-Means was proposed as an alternative to the conventional K-Means algorithm. However, its shortcoming is the use of two tables, Eucl and CVCD, which need to be truncated and populated in each iteration; this is still a time-consuming process. Instead, the two tables can be denormalized into a single table that accomplishes the entire task. It is observed that this results in an improvement of 40%-50% compared to the Fast K-Means algorithm.

3.6.1 SQL QUERY

The design is based on the SQL WITH clause, which allows a single query to create virtual tables and carry out the desired operations in one pass. The query is:

Insert into Eucl2 (i, d1, d2, d3, val, j)
Select * From (
  With dist as (
    Select i,
           sqrt(power((e2.val - (SELECT NVL(val, 0) as val FROM Centroid2 WHERE j = 1)), 2)) as d1,
           sqrt(power((e2.val - (SELECT NVL(val, 0) as val FROM Centroid2 WHERE j = 2)), 2)) as d2,
           sqrt(power((e2.val - (SELECT NVL(val, 0) as val FROM Centroid2 WHERE j = 3)), 2)) as d3,
           e2.val
    From Data2 e2)
  Select i, dist.d1, dist.d2, dist.d3, dist.val,
         Case when dist.d1 <= dist.d2 and dist.d1 <= dist.d3 then 1
              when dist.d2 <= dist.d3 and dist.d2 <= dist.d1 then 2
              when dist.d3 <= dist.d2 and dist.d3 <= dist.d1 then 3
         End as j
  From dist );

The schema of Eucl2 extends that of Eucl with two more attributes, val and j. Here, j is the cluster number and val is the grayscale value, both of which were part of the CVCD table in the earlier design. The virtual table created by the WITH clause is named “dist”. The second Select statement in the Insert does the same job as the CVCD insert of Fast K-Means, with dist as the source table. The advantage is that no separate table is required; everything is accomplished in a single SQL query. This avoids deleting a huge number of tuples from two separate tables, thus speeding up the process.

3.7 K-MEANS ALGORITHM USING PL/SQL STORED PROCEDURE

A stored procedure is a named set of PL/SQL statements designed to perform an action.

Stored procedures are stored inside the database. They define a programming interface for the

database rather than allowing the client application to interact with database objects directly. As

objects such as stored procedures and triggers become popular, more application code will

move away from external programs and into the database engine. In a database, applications

can be stored and deployed with all of the process logic stored in either Java or PL/SQL blocks.

The aim of this section is to describe a method to accomplish the K-Means algorithm using stored procedures.

3.7.1 ADVANTAGES OF PL/SQL STORED PROCEDURES

There are many compelling benefits to putting all Oracle SQL inside stored

procedures. These include:

• Better performance - Stored procedures are loaded once into the SGA and remain there unless they become paged out. Subsequent executions of the stored procedure are far faster than external code. Also, stored procedures are executed on the database server, which is likely to be more powerful than the clients, which in turn means that stored procedures should run faster.

• Coupling of data with behavior - Relational tables can be coupled with the behaviors that are associated with them by using naming conventions.

• Isolation of code - Since all SQL is moved out of the external programs and into stored procedures, the application programs become nothing more than calls to stored procedures. As such, it becomes very simple to swap out one database and swap in another.

• The code is stored in a compiled form, which means that it is syntactically valid and does not need to be compiled at run-time, thereby saving resources.

• Less network traffic, which improves scalability.

The use of packages to combine related objects (procedures, functions and variables) into

one physical unit enhances these advantages.

3.7.2 THE DESIGN

The overall design of the stored procedure approach has two fundamental steps: first, a set of procedures is executed to initialize the various database tables; second, a procedure is executed to update these tables.

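CREATE OR REPLACE PROCEDURE INIT_CENTROID_3 (N IN PLS_INTEGER)
IS
  Rand1 PLS_INTEGER;
  Rand2 PLS_INTEGER;
  Rand3 PLS_INTEGER;
BEGIN
  -- Draw three random pixel ids, one seed per cluster (k = 3)
  Select floor(dbms_random.value(1,N)) Into Rand1 From Dual;
  Select floor(dbms_random.value(1,N)) Into Rand2 From Dual;
  Select floor(dbms_random.value(1,N)) Into Rand3 From Dual;
  -- Copy the chosen pixels into Centroid as clusters 1, 2, and 3
  Insert into Centroid ( Select 1, X, Y, val From Data Where Data.i = Rand1 );
  Insert into Centroid ( Select 2, X, Y, val From Data Where Data.i = Rand2 );
  Insert into Centroid ( Select 3, X, Y, val From Data Where Data.i = Rand3 );
  Commit;
END;
/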



Another important addition to this new design is the /*+APPEND*/ hint, which speeds up the insert DML operations on the Eucl and CVCD tables. The code above shows the PL/SQL stored procedure for initializing the Centroid table; the two procedures below initialize the Eucl and CVCD tables, as explained under the Fast K-Means algorithm earlier in this chapter.

The INIT_CENTROID_3 procedure generates 3 random numbers (since k = 3) from 1 to N, where N is the total number of pixels, passed as a parameter. As in Fast K-Means, the other two tables are also initialized as shown below:

CREATE OR REPLACE PROCEDURE INIT_EUCL_3 IS

BEGIN

Insert /*+APPEND*/ into Eucl (

Select i, sqrt(power((data.val-c1.val),2)) as d1,

sqrt(power((data.val-c2.val),2)) as d2,

sqrt(power((data.val-c3.val),2)) as d3

From Data, (Select * From Centroid where j = 1) c1,

(Select * From Centroid where j = 2) c2,

(Select * From Centroid where j = 3) c3)

Order by i;

END;

/

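CREATE OR REPLACE PROCEDURE INIT_CVCD_3 IS
BEGIN
  Insert /*+APPEND*/ into CVCD (
    Select Data.i,
           -- least of d1, d2, d3 gives the cluster id (same test as in Section 3.6.1)
           Case when Eucl.d1 <= Eucl.d2 and Eucl.d1 <= Eucl.d3 then 1
                when Eucl.d2 <= Eucl.d3 and Eucl.d2 <= Eucl.d1 then 2
                when Eucl.d3 <= Eucl.d2 and Eucl.d3 <= Eucl.d1 then 3
           End as j, Data.val
    From Eucl, Data
    Where Eucl.i = Data.i);
END;
/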

Oracle's APPEND hint bypasses the half-empty data blocks on the freelist chain. Instead, Oracle 10g extends the table and uses brand-new, completely empty data blocks for the inserts. This results in more rows per I/O, speeding up insert performance. The next PL/SQL procedure, UPDATE_TABLES_3, updates the three tables Centroid, Eucl, and CVCD: Centroid through an UPDATE statement, and Eucl and CVCD through INSERT statements combined with the APPEND hint. The code is given below:

CREATE OR REPLACE PROCEDURE UPDATE_TABLES_3 IS

BEGIN

Execute immediate 'Truncate Table Eucl';

Update Centroid c3

Set (j, val) =

( Select c1.j, c2.val

From Centroid c1, ( Select j, Avg(val) as val

From CVCD Group by j) c2

Where c1.j = c2.j(+) and c1.j = c3.j);

Insert /*+APPEND*/ into Eucl (i, d1, d2, d3)

( Select i, sqrt(power((e2.val-c1.val), 2)) as d1,

sqrt(power((e2.val-c2.val), 2)) as d2,

sqrt(power((e2.val-c3.val), 2)) as d3

From Data e2, (Select j, x, y, NVL(val,0)as val From Centroid where j = 1) c1,

(Select j, x, y, NVL(val,0)as val From Centroid where j = 2) c2,

(Select j, x, y, NVL(val,0)as val From Centroid where j = 3) c3

);

Execute immediate 'Truncate Table CVCD';

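Insert /*+APPEND*/ into CVCD (i, j, val)
( Select Data.i,
         -- least of d1, d2, d3 gives the cluster id (same test as in Section 3.6.1)
         Case when Eucl.d1 <= Eucl.d2 and Eucl.d1 <= Eucl.d3 then 1
              when Eucl.d2 <= Eucl.d3 and Eucl.d2 <= Eucl.d1 then 2
              when Eucl.d3 <= Eucl.d2 and Eucl.d3 <= Eucl.d1 then 3
         End as j, Data.val
  From Eucl, Data
  Where Eucl.i = Data.i);
Commit;
END;
/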

There are other ways to enhance the performance of bulk inserts. For example, Oracle 11g Release 2 offers the APPEND_VALUES hint, which together with the FORALL statement is many times faster than row-by-row inserts. With these procedures compiled and stored in the database server, the application program (C#.NET) invokes them as follows:



using (OracleConnection conn = new OracleConnection("Data Source=ORCL; User ID=scott; Password=tiger"))
{
    OracleCommand sqlCommandEnable = new OracleCommand();
    sqlCommandEnable.CommandType = CommandType.StoredProcedure;
    sqlCommandEnable.Connection = conn;
    sqlCommandEnable.CommandText = "INIT_TABLES_3";
    conn.Open();
    sqlCommandEnable.ExecuteNonQuery();
}

using (OracleConnection conn = new OracleConnection("Data Source=ORCL; User ID=scott; Password=tiger"))
{
    OracleCommand sqlCommandEnable = new OracleCommand();
    sqlCommandEnable.CommandType = CommandType.StoredProcedure;
    sqlCommandEnable.Connection = conn;
    sqlCommandEnable.CommandText = "UPDATE_TABLES_3";
    conn.Open();
    sqlCommandEnable.ExecuteNonQuery();
}

The only difference between the two blocks in the above code is that the CommandText

parameter must be loaded with the appropriate procedure name.

3.8 SEGMENTATION USING K-MEANS ALGORITHM AS EXTERNAL PROCEDURE (KMEP)

Any database query based design must access each record from the specified table for manipulation. Without a proper choice of layered design, the applications may not run fast. Designing a database application for optimal performance can seem a daunting challenge. There are many choices to make (development tools, database design, application structure, query design, choice of interface), and the "right" choice in each of these areas depends on the unique application requirements and on the skill set of the designer. A few basic principles and trade-offs of database development help considerably in achieving optimum performance. This section describes a simple but effective three-layered architecture that achieves a better running time for the K-Means clustering algorithm than the previous designs.



The main concept is to implement the K-Means algorithm in C, build it as a .dll file, and call it as an external procedure from Oracle PL/SQL code. This module is invoked to segment each patient’s echo image.

3.8.1 EXTERNAL PROCEDURE

An external procedure, also sometimes referred to as an external routine, is a procedure

stored in a dynamic link library (DLL). For instance, a complete C program unit can be

compiled and produce a .dll file so that it can be dynamically called whenever needed from an

Oracle 10g PL/SQL program unit. These procedures participate fully in the current transaction

and can call back to the database to perform SQL operations. The procedures are loaded only

when necessary, so memory can be conserved. Because the call specification is decoupled from its implementation body, the procedures can be enhanced without affecting the calling programs.

External procedures have the following advantages:

• Isolate execution of client applications and processes from the database instance to ensure that any problem on the client side does not adversely impact the database.

• Move computation-bound programs from client to server, where they execute faster (because they avoid the round trips of network communication).

• Interface the database server with external systems and data sources.

• Extend the functionality of the database server itself.

These are important to this research work because the CPU-bound work of the K-Means algorithm is separated from the database yet remains available as an external procedure, so that both speed and security can be achieved.

3.8.2 ORACLE EXTERNAL PROCEDURE ARCHITECTURE

As shown in Figure 3.12, the process flow starts with a PL/SQL application that calls a

special PL/SQL "module body." In this example, it is the K-Means program. PL/SQL then

looks for a special Net8 listener process, which already is running in the background. Net8 is

the name for what was formerly the Oracle SQL*Net product. A Net8 listener is a background

process, typically configured by the DBA, which enables other processes to connect to a given

service such as the Oracle server. At this point, the listener will spawn an executable program



called extproc. This process loads the dynamic library and then invokes the desired routine in

the shared library, whereupon it returns its results back to PL/SQL.

Fig. 3.12 External Procedure (.dll) model in Oracle 10g

To limit overhead, only one extproc process needs to run for a given Oracle session; this

process starts with the first external procedure call and terminates when the session exits. For

each distinct external procedure, this extproc process loads the associated shared library, but

only if it hasn't already been loaded.

Calling a dynamically linked routine simply maps the shared code pages into the address

space of the "user" process. Then, when that process touches one of the pages of the shared

library, it will be paged into physical memory, if it isn't already there. The resident pages of the

mapped shared library file will be shared automatically between users of that library. In practical terms, this means that heavy concurrent use of the external procedure often requires much less processing power and memory than, say, a more primitive approach based on database pipes.

3.8.3 CREATING EXTERNAL PROCEDURES

In order to create an external procedure for K-Means algorithm, all the necessary functions

are written in C language and stored as .c or .cpp file. Since a callable DLL has to be created,

this program should not have a main() function. Along with this file a module definition file or

.def file containing one or more module statements that describe various attributes of a DLL, is

also created as part of the syntax rules.



This K-Means external DLL file is normally stored in the home path of the database.

External procedures require a listener to be set up as part of the Oracle Net. To call the

K-Means external function, a PL/SQL wrapper is created which maps names, parameter types,

and return types for the C program to the SQL types. For segmenting the echo images, another

PL/SQL procedure is written such that the image data and other parameters are sent to the

external procedure which returns the segmented data back to the database thereby achieving the

tightly-coupled design. A detailed step-by-step procedure for creating external procedures is

given in Appendix A.

3.8.4 DESIGN OF KMEP

The implementation of K-Means with External Procedures (KMEP) is a C implementation of the conventional K-Means algorithm. However, the implementation uses an array as the main data structure to store the image data and cluster id. Consider a dataset with n objects and d dimensions (attb1..attbd) stored as a simple C/C++ array, as shown in Figure 3.13(a). These data objects are to be grouped into k clusters. The (d+1)-th column holds the cluster id, which is initially set to 0 for all data objects.

Data Array
id   attb1   attb2   attb3   attb4   attbi   attbd   cluster id
1    0       0       23      89      0       125     0
2    0       1       45      12      0       255     0
3    0       2       112     67      9       0       0
..   ..      ..      ..      ..      ..      ..      ..
..   ..      ..      ..      ..      ..      ..      ..
..   ..      ..      ..      ..      ..      ..      ..
n    400     250     ..      ..      ..      45      0

Centroid Array
id   attb1   attb2   attb3   attb4   attbi   attbd   Mean
1    0       0       23      89      0       125     0
2    0       1       45      12      0       255     0
..   ..      ..      ..      ..      ..      ..      ..
k    400     250     ..      ..      ..      45      0

Fig. 3.13 (a) Structure of the Data Array for a general case (n data points, d dimensions, and k clusters) (b) Centroid Array



First, the centroid array is initialized by selecting k rows from Data array at random. This

acts as the seed value. Next, the first data object is extracted from the data array and the first

row from the centroid array, and then the Euclidean distance between these two objects is

computed. This process is repeated for all cluster entries in the centroid table. Thus, k distances

are computed which are stored in a temporary single dimensional array, dist[1..k].
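The array-based procedure of this section (seeding the centroid array, filling the temporary distance array, writing the position of the minimum into the cluster-id column, and updating the centroids) can be sketched as follows. The sketch is in Python rather than C for brevity, covers only the grayscale case d = 1, and the function name `kmep_sketch` and the `seed_rows` parameter are assumptions made for illustration; the original C code is not reproduced here.

```python
import random


def kmep_sketch(data, k=3, q=6, seed_rows=None):
    """data: list of rows [x, y, val, cluster_id]; the cluster-id column
    is updated in place with values 1..k. Returns the final centroids."""
    # Seed the centroid array from k rows of the data array
    # (chosen at random unless explicit row indices are given).
    if seed_rows is None:
        seed_rows = random.sample(range(len(data)), k)
    centroid = [float(data[r][2]) for r in seed_rows]
    dist = [0.0] * k                      # temporary array dist[1..k]
    for _ in range(q):
        for row in data:
            for c in range(k):
                dist[c] = abs(row[2] - centroid[c])  # Euclidean distance, d = 1
            row[3] = dist.index(min(dist)) + 1       # position of the minimum
        for c in range(k):
            members = [row[2] for row in data if row[3] == c + 1]
            if members:                   # an empty cluster keeps its old centroid
                centroid[c] = sum(members) / len(members)
    return centroid


pixels = [[0, 0, 1, 0], [0, 1, 2, 0], [0, 2, 14, 0], [1, 0, 8, 0],
          [1, 1, 15, 0], [2, 1, 9, 0], [2, 2, 50, 0]]
final_centroids = kmep_sketch(pixels, seed_rows=[0, 2, 5])
```

With the seven pixels of Section 3.5.1 and the same seeds, the cluster-id column ends as 1, 1, 3, 3, 3, 3, 2, matching the SQL result. No intermediate tables are created; the only extra storage is the k-element `dist` list.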

Data Array
id       X     Y     val   cluster id
1        0     0     125   2
2        0     1     255   1
3        0     2     0     0
..       ..    ..    ..    ..
..       ..    ..    ..    ..
..       ..    ..    ..    ..
100000   400   250   45    0

Centroid Array
id   X     Y    val   mean
1    34    78   125   46
2    0     89   250   250
3    145   56   125   134

dist[1..k]
1      2       3
23.6   123.8   0

Fig. 3.14 Array showing the pixel data and centroid for an image size of 400×250, k = 3. dist[1..k] is a temporary array.

The position of the minimum distance is obtained; this gives the cluster id, which in turn is written into the data array row of the corresponding data object. This process is iterated for all data objects. Figure 3.14 shows this for a typical case of a grayscale echo image with k = 3, d = 1, and n = 100,000.

The Centroid array is updated during each iteration. The advantage of this method is that all operations are carried out on in-memory data. Because the data array itself, with one extra column, stores the cluster id, memory usage is reduced. Since updates are done through the array index, a considerable speedup can be achieved. Hence, this method should provide a suitable model for a K-Means implementation that achieves both space and time efficiency.

3.9 CONSTRAINT-BASED K-MEANS CLUSTERING ALGORITHM (CKM)

In the previous section, a PL/SQL based external procedure was designed in C for the K-Means algorithm to speed up the segmentation/clustering process. This approach achieves a considerable speedup: an echo image of size 400×250 is segmented in less than 0.5 seconds. However, this can be improved further. Here a novel technique, the “Constraint-based k-Means” (CKM) algorithm, is proposed. Experiments show that this approach segments the same image in approximately half the time taken by the regular method.

Fig. 3.15 Typical echo image with black pixels shown in yellow contours

The regular K-Means, unfortunately, cannot handle the constraint-based approach to clustering [Anthony, 2001]. The heuristic used in the clustering algorithm is based on a property of the echo image. Consider a typical echo image as shown in Figure 3.15, where at least 30% of the pixels are black. The yellow islands in Figure 3.15, i.e. the pixels inside the LV, RV, etc., are all black pixels that indicate the blood pool. These pixels need not be considered for clustering. This intuition helps to formulate a constraint which is application specific and results in approximately 50% improvement in execution speed.

3.9.1 CONSTRAINED CLUSTERING (CC) PROBLEM

Constrained clustering (CC), i.e. finding clusters that satisfy user-specified constraints, is highly desirable in many applications. It leads to effective and fruitful data mining by capturing application-specific semantics. Formally, constrained clustering can be defined as follows:

Definition: Given a set D with n data points, a distance function $df : D \times D \rightarrow \mathbb{R}$, a positive integer k, and a set of constraints C, find a k-clustering $\{Cl_1, Cl_2, \ldots, Cl_k\}$ such that

$$DISP = \sum_{i=1}^{k} disp(Cl_i, rep_i)$$

is minimized, and each cluster satisfies the constraints C, denoted as $Cl_i \models C$.


Here, the “dispersion” of cluster Cli, disp(Cli, repi), measures the total distance between

each object in Cli and the repi. In the case of k-means the representative object is normally the

centroid of the cluster. Depending on the nature of the constraints and applications the CC

problem can be categorized into constraints on individual object, obstacle object as constraints,

etc.

The constraint used here limits clustering to a subset of the pixels. It can be enforced by first executing a SQL query that selects only the pixels satisfying the constraint (i.e. retrieves only the non-black pixels for clustering) while the black pixels are retained in their original spatial locations. The reduced data can then be passed to the unconstrained k-means algorithm. The number of black pixels in an echo image is not constant, but in the samples collected 90% of the images have 30% pure black pixels.
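The filter-then-cluster idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the image is a flat list of gray values in raster order, pure black is the value 0, black pixels receive the label `BLACK_LABEL`, and `cluster_fn` stands for any unconstrained clustering routine (for example a k-means function) that maps a list of values to a list of labels of the same length. None of these names come from the original implementation.

```python
BLACK_LABEL = 0  # label kept for black pixels (an assumption of this sketch)


def constrained_segment(pixels, cluster_fn, black=0):
    """Cluster only the non-black pixels and leave black pixels in place."""
    # Positions that satisfy the constraint (the role of the SQL filter).
    keep = [idx for idx, v in enumerate(pixels) if v != black]
    # The unconstrained algorithm sees only the reduced data.
    labels = cluster_fn([pixels[idx] for idx in keep])
    # Black pixels stay at their original positions with BLACK_LABEL.
    out = [BLACK_LABEL] * len(pixels)
    for idx, lab in zip(keep, labels):
        out[idx] = lab
    return out


seen_sizes = []


def toy_cluster(values):
    """Stand-in for k-means: a fixed threshold, used only for the demo."""
    seen_sizes.append(len(values))
    return [1 if v < 100 else 2 for v in values]


result = constrained_segment([0, 50, 0, 200, 0, 120], toy_cluster)
```

In this toy input half of the pixels are black, so the clustering routine receives three values instead of six; the saving in a real echo image depends on its share of black pixels.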

3.10 RESULTS AND DISCUSSIONS

This section presents some important experimental results of executing the proposed algorithms. The objective is to compare, through experiments, the proposed SQL based K-Means algorithms with other K-Means methods reported in the literature [Carlos, 2006a] [Tapas, 2002] [Velmurugan, 2010]. To establish practical efficiency, the algorithms were implemented using C/C++, C#.NET, and Oracle 10g SQL, along with other software tools. The input consists of 2D echo images of various resolutions.

3.10.1 RESULTS OF SEGMENTATION

Figure 3.16 shows the segmented output of four 2D echo images. It can be noticed that the

cardiac objects are clearly visible and the region has no gradients.

Fig. 3.16 Two-Dimensional original and segmented echo images in different views (400×250). Columns: Image Id, Input Image, Segmented Image; rows: Img1 to Img4.

3.10.2 COMPARISON WITH RESULTS OF OTHER AUTHORS

A comparison between the SQL, PL/SQL based K-Means algorithm and the OptKM,

IncrKM [Carlos, 2006a], Filtering algorithm [Tapas, 2002], and an Enhanced K-Means

[Velmurugan, 2010] is presented in Table 3.2.

Table 3.2 Running time comparison of Fast SQL K-Means with other variants (Sl. No. 1; running time T in seconds)

Method                                             Settings                                       Running time
Proposed K-Means                                   FKM: n = 100K, k = 3, Q = 4                    FKM: T = 10.1 s; KMEP: T = 1.407 s
Carlos Ordonez [Carlos, 2006a], 4 CPUs, 40 AMPs    n = 100K, k = 4, d = 8, Q = 4                  OptKM = 17.0 s; IncrKM = 44 s
Java based [Velmurugan, 2010]                      n = 1000                                       T = 7.6 s
Tapas Kanungo [Tapas, 2002]                        n = 262144, dataset ‘Israel’, k = 8, d = 4     T = 140.4 s
Tian Zhang: BIRCH [Tian, 1996], pp. 112-113        n = 100K, dataset DS1                          T = 47.1 s (increasing k increases T)



According to these results, the proposed modified K-Means algorithms perform better than the other variants. Further, the machine configurations and test scenarios of these variants are comparable to the configuration of the machine used for the proposed algorithms.

3.10.3 CONSOLIDATED PERFORMANCE OF PROPOSED K-MEANS ALGORITHMS

Figure 3.17 compares all SQL and PL/SQL stored procedure based K-Means algorithms: FKM (SQL), QKM (SQL), FKM (SP), and PL/SQL (EP). Here, d = 1 and k = 3 are used, and the running time is shown in seconds for n varying from 10,000 to 1,000,000 for one stage of operation. The PL/SQL (EP) algorithm is faster than all the other algorithms. Comparing FKM (SP) with QKM, for n greater than 600,000 data points Quick K-Means (QKM) outperforms the FKM (SP) algorithm. This is because QKM uses a denormalized table, whereas the FKM (SP) algorithm uses several tables; the difference is significant for large values of n rather than for small ones.

Fig. 3.17 Running time of SQL and PL/SQL (EP) and PL/SQL (SP) K-Means algorithms, d = 1, k = 3. Axes: Data Size (n × 1000), 0 to 1000, against Running Time (s), 0 to 80. Series: FKM (SQL), QKM (SQL), FKM (SP), PL/SQL (EP).

Fig. 3.18 Running time of proposed algorithms with [Carlos, 2006a]: (a) n = 0 to 100,000 (b) n = 0 to 1,000,000. Axes: Data Size (n × 1000) against Running Time (s). Series: FKM (SQL), FKM (SP), PL/SQL (EP), Carlos (C++), Carlos (SQL).

Figure 3.18 compares the proposed algorithms with the variants proposed by Carlos [Carlos, 2006a], using C++ and SQL implementations. Here k = 8 and d = 8 are used to match the experimental setup of Carlos. The results show a considerable improvement by the proposed algorithms as compared to other variants proposed in the literature. Further, a detailed discussion of various other results is given in Chapter 10.

3.11 K-MEANS VERSUS OTHER SEGMENTATION FRAMEWORKS

K-Means clustering is often suitable for biomedical image segmentation since the number

of clusters (k) is usually known for images of particular regions of human anatomy. In

biomedical applications, the spatially varying intensity change of a biomedical structure is

usually caused by inhomogeneity in the process of image acquisition, such as the

inhomogeneous distribution of the contrast agent in CT imaging or inhomogeneous distribution

of the magnetic field gradient in MR imaging [Chang Wen, 1998].

Many researchers have described general segmentation methods in the literature, showing the application of a particular method to one or two ultrasound images without specific reference to ultrasound image formation or context. However, the focus of this work is more specific to echo images, from which regions of interest such as the LV, RV, etc. are extracted. A number of segmentation algorithms are mentioned in the literature: watershed, fuzzy entropy based segmentation, Delaunay triangulation, fractals, edge flow, etc. What such an algorithm can segment is only regions, not objects. To obtain a high-level object, which is desirable in image analysis and retrieval, human assistance is needed. This is provided with the help of the active contour model, where an initial contour is specified by the operator.

Several algorithms have been tried in practice on echo images, including the Otsu threshold method, Markov random fields, morphology-based methods, the anisotropic diffusion model, etc. However, all of these algorithms spoil the shape or the edges of the cardiac cavity regions, or both. The basic requirement in this research is to obtain regions free from uneven pixel intensity so that the contour can move faster.

Another possibility is a hierarchical clustering algorithm, but this is normally used when the number of clusters is unknown. In echo image clustering the k value is always 3, because medically only three regions are of importance: endocardium, myocardium, and pericardium. Further, the K-Means algorithm has been widely used by many researchers for medical image segmentation, in the past and at present [Bhagwati Charan, 2010], [Chang Wen, 1998], [Ng, 2006], [Ridho, 2010], [Muthukannan, 2010], [AKJain, press]. In [Luo, 2011], blood vessel segmentation is done based on k-means clustering and morphological thinning.

Another reason for choosing the K-Means clustering approach is that all the algorithms used in this research work are data mining oriented. K-Means clustering is a popular technique used for various applications, including medical image segmentation [Han, 2010].

3.12 SUMMARY

Segmentation is a vital aspect of medical imaging. It aids in the visualization of medical

data and diagnostics of various diseases. Five variants of K-Means algorithm are proposed to

reduce the time complexity and speedup the query processing.

1. SQL based K-Means Algorithm using UPDATE statement

2. Fast SQL based K-Means using TRUNCATE-INSERT statements

3. Quick K-Means

4. K-Means with Stored Procedures

5. K-Means with External Procedures

These algorithms are implemented in a tightly-coupled database environment so that movement of patient image data is avoided and the entire analysis is done in the DB address space. According to the results obtained, the proposed modified K-Means algorithms perform better than other variants. Further, their clustering quality is adequate for the boundary detection phase. Other image segmentation methods such as thresholding, the Otsu algorithm, anisotropic diffusion, fuzzy k-means, watershed, etc. were tried on echo images experimentally, and K-Means was found to be better in terms of speedup and quality of segmentation.
