CHAPTER 3
ECHO IMAGE SEGMENTATION
3.1 INTRODUCTION
Segmentation is a vital aspect of medical imaging. It aids in the visualization of medical
data and the diagnosis of various diseases. Ultrasound image segmentation, and echo image
segmentation in particular, is strongly influenced by the quality of the data. Characteristic
artifacts such as attenuation, speckle, shadows, and signal dropout complicate the
segmentation task, and the orientation dependence of acquisition can result in missing boundaries.
Further complications arise because the contrast between areas of interest such as the LV and RV is
often low [Alison, 2006]. Many general segmentation methods described in the literature
are applied to one or two ultrasound
images without specific reference to ultrasound image formation or context. The
focus of this work, however, is specific to echo images, from which regions of interest such as
the LV and RV are extracted.
The segmentation process subdivides an image into its constituent regions or objects. Automatic
LV segmentation gives relatively poor results because of the speckle noise present in echo images
[Alison, 2006]. Nearly 75 years ago, Wertheimer pointed out the importance of perceptual
grouping and organization in vision and listed several key factors, such as similarity, proximity,
and good continuation, which lead to visual grouping. However, even to this day, many of the
computational issues of perceptual grouping remain unresolved. Many researchers have
proposed algorithms for image segmentation tasks, but these involve
extensive processing that consumes considerable computational time. Hence, fast image segmentation
is an important issue to be addressed.
Among the different clustering algorithms, the most widely used is K-Means clustering
in general and Lloyd’s algorithm in particular [Bommanna, 2010]. As part of this work, a novel
technique for segmenting 2D echo images is proposed. It is primarily based on the K-Means
clustering algorithm implemented with a set of DBMS tables and SQL statements. The
SQL implementation stays close to the data set and avoids large raw data transfers to and from
the RDBMS [Wei Li, 2004]. Five variants of the K-Means algorithm are proposed here to
reduce the time complexity and speed up the query processing.
1. SQL based K-Means Algorithm using UPDATE statement
2. Fast SQL based K-Means using TRUNCATE-INSERT statements
3. Quick K-Means
4. K-Means with Stored Procedures
5. K-Means with External Procedures
All of the above algorithms have been implemented and evaluated in terms of
execution time on 2D and color Doppler echo image data.
3.2 ECHO IMAGE PREPROCESSING
Ultrasound images suffer from an inherent imaging artifact called speckle. Speckle is the
random granular texture that obscures anatomy in ultrasound images and is usually described as
“noise”. It is created by a complex interference of ultrasound echoes made by reflectors
spaced closer together than the ultrasound system’s resolution limit. Many different
preprocessing, or advanced image processing, approaches have been proposed for speckle
reduction. The most common categories are median filters, Wiener filters,
diffusion filters, and wavelet filters. A more recent technique called SRI (Speckle Reduction
Imaging) is described as the first real-time algorithm that removes speckle without the disadvantages that
have plagued other methods. Despite the qualities of SRI, the median filter is used in this research work to
remove speckle from 2D echocardiographic images because of its simplicity [Dong, 2008].
Fig. 3.1 Median Filtering (a) Original echo image (b) After applying median filter
Median filtering is one kind of smoothing technique, as is linear Gaussian filtering, and
implementations whose cost per pixel is O(1) in the window size are known. All smoothing techniques are effective in removing
noise in smooth regions of a signal, but adversely affect edges. In echo images in particular, it is
important to preserve the edges so that the cardiac boundary can be traced accurately; edges are of
critical importance to the visual appearance of images. For small to moderate levels of
Gaussian noise, the median filter is demonstrably better than Gaussian blur at removing noise
whilst preserving edges for a given, fixed window size. Its performance is not better
than Gaussian blur for high levels of noise, but for speckle noise it is particularly
effective. As the focus of this research work is to develop an efficient segmentation algorithm,
the existing median filter class of AForge.NET is used, with the result shown in Figure 3.1.
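// AForge.NET Median filter class (namespace AForge.Imaging.Filters)
using AForge.Imaging.Filters;
// create the filter with its default window size
Median filter = new Median();
// apply the filter, overwriting the input image
filter.ApplyInPlace(queryImage);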
Here queryImage is the input echo image to which the median filter is applied. To preserve
the sharp edges, the smallest window size is used, which is the default value of
the constructor (3 × 3 pixels in AForge.NET). The larger the window, the stronger the blurring effect; it erodes the edges,
which leads to a number of discontinuities in the boundaries of the LV and other chambers.
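The edge-preserving behaviour described above can be illustrated with a minimal sketch. It is not the AForge.NET implementation: it works on a single 1-D row of pixels instead of a 2-D window, and the function names and the sample row are assumptions chosen for illustration only.

```python
from statistics import median

def median_filter_1d(signal, window=3):
    """Sliding-window median; borders are handled by repeating the edge sample."""
    half = window // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(signal))]

def mean_filter_1d(signal, window=3):
    """Sliding-window mean with the same border handling, for comparison."""
    half = window // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(signal))]

# Hypothetical row: a dark-to-bright step edge with one bright speck in the dark part.
row = [10, 10, 200, 10, 10, 10, 200, 200, 200, 200]
smoothed_median = median_filter_1d(row)   # speck removed, step edge kept sharp
smoothed_mean = mean_filter_1d(row)       # speck and edge are both smeared
```

With a window of 3 the median removes the isolated speck and leaves the step between the sixth and seventh samples intact, whereas the mean filter produces intermediate gray values on both sides of the step.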
3.3 CONVENTIONAL K-MEANS CLUSTERING ALGORITHM
This section describes the traditional K-Means algorithm (also called
Lloyd’s algorithm in the Computer Science community) and various issues related to it.
The goal is to design a simple, elegant, yet robust algorithm that segments a cardiac image for
extracting its main features [Francis, 2008]. For this purpose, the K-Means clustering algorithm is
selected, which partitions a data set into several groups such that points within a cluster are
similar and points in different clusters are dissimilar. Segmentation is a fundamental process for
higher level medical image analysis, and K-Means is suitable for biomedical image
segmentation since the number of clusters (k) is usually known for images of particular regions
of human anatomy [Bommanna, 2010]. Watershed segmentation is another popular method, but
it suffers from the drawbacks mentioned in [Ashish, 2005]. Though K-Means has been shown to
be effective in producing good clustering results, its main drawback is its running time:
each iteration costs O(nkd), where n is the number of data points, k is the number of clusters, and d is
the number of dimensions [Fahim, 2006] [Chang, 1998] [Khaled]. Integrating the K-Means
algorithm with SQL has the following advantages:
1. The image data can easily be stored in relational DBMS and we can perform all
computations faster in SQL [Pitchaimalai, 2008]
2. Since the resolution of the image is generally large, handling such huge data sets is
much easier with the help of DBMS [Carlos, 2006a]
3. No need to transfer data from DBMS address space to application address space and
vice-versa, because all the patient data reside in DBMS
3.3.1 CLASSICAL PARTITIONING METHOD: K-MEANS
The most well-known and commonly used partitioning method for clustering data is
K-Means, together with its variations such as K-Medoids, C-Means, etc. The K-Means algorithm takes
the input parameter k and partitions a set of n objects into k clusters so that the resulting
intracluster similarity is high but the intercluster similarity is low. Cluster similarity is
measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster’s
centroid or center of gravity.
The K-Means algorithm works as follows. First, it randomly selects
k of the objects, each of which initially represents a cluster mean or center. Each of the
remaining objects is assigned to the cluster to which it is the most similar, based on
the distance between the object and the cluster mean.
Algorithm K_MeansCluster (k)
// k - number of clusters
// D - data set containing n objects
// Cj - clusters, each containing a subset of the n objects, where j ∈ {1, 2, .., k}
Randomly choose k objects from D as the initial cluster centers
Repeat
    foreach object il (where l ∈ {1, 2, .., n}) do
        (re)assign il to the cluster Cj whose mean is nearest (Euclidean distance)
    foreach cluster Cj do
        Update the cluster mean (or centroid) considering all objects currently in
        the cluster
Until no more reassigning
end K_MeansCluster.
Fig. 3.2 The Classical K-Means Algorithm
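The steps of Figure 3.2 can be sketched in a few lines of Python for one-dimensional data such as gray scale values. This is an illustrative sketch only; the function name, the max_iter and seed parameters, and the rule of keeping the old centroid when a cluster becomes empty are assumptions, not part of the algorithm as stated above.

```python
import random

def k_means(values, k, max_iter=100, seed=0):
    """Lloyd's algorithm on 1-D data; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(list(values), k)   # k objects chosen as initial means
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid (ties go to the lowest index).
        new_labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                      for v in values]
        if new_labels == labels:              # no object changed cluster: stop
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its current members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:                       # assumption: keep old mean if empty
                centroids[j] = sum(members) / len(members)
    return centroids, labels

pixels = [1, 2, 14, 8, 15, 9, 50]             # hypothetical gray scale values
centroids, labels = k_means(pixels, k=3)
```

Because the initial centers are chosen at random, different seeds can lead to different final partitions, which is the local-optimum behaviour discussed later in this section.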
It then computes the new mean for each cluster. This process iterates until the criterion
function converges. The K-Means procedure is summarized in Figure 3.2. Typically, the
square-error criterion is used to terminate the algorithm and is defined in equation 3.1.
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2 \qquad (3.1)$$
where E is the sum of the squared error for all objects in the data set, p is a given object, and
mi is the mean of cluster Ci. In other words, for each object in each cluster, the distance from
the object to its cluster center is squared, and the squared distances are summed.
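As a small worked instance of the square-error criterion for scalar data, the following sketch sums the squared deviations of each object from its cluster mean. The function name and the three sample clusters are assumptions chosen for illustration.

```python
def squared_error(clusters):
    """Sum over clusters of squared distances from each object to the cluster mean."""
    total = 0.0
    for members in clusters:
        mean = sum(members) / len(members)
        total += sum((p - mean) ** 2 for p in members)
    return total

# Hypothetical partition of seven gray scale values into k = 3 clusters.
clusters = [[1, 2], [14, 15, 50], [8, 9]]
E = squared_error(clusters)
```

Here the first and third clusters contribute 0.5 each, and almost all of the error comes from the second cluster, whose member 50 lies far from the mean of about 26.33.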
Working of K-Means Algorithm
Suppose there is a set of objects, marked as small black triangles, located in space as
shown in Figure 3.3(a), and let k = 3. According to the algorithm shown in Figure 3.2, three
objects are selected randomly as initial centers, marked as red, black, and blue circles in Figure 3.3(b). Each
object is assigned to the cluster whose center is nearest to it.
Fig. 3.3 Clustering of n-objects based on k-Means method, assuming k = 3.
The distance between each centroid (Cj) and each object (Di), taken over
all d dimensions, can be computed using the Euclidean distance [Carlos, 2006a] [Shehroz, 2004] as
given in equation 3.2.
$$\text{Euclidean Distance} = \sqrt{\sum_{l=1}^{d} (D_{li} - C_{lj})^2} \qquad (3.2)$$
Next, the cluster centers are updated: the mean value of each cluster is
recalculated based on the current objects in the cluster. Using the new cluster centers, the
objects are reassigned to the clusters according to which cluster center is nearest. This
reassignment is shown as dotted circles in Figure 3.3(c).
This process is iterated until no objects move from one cluster to another. This situation is
shown in Figure 3.3(e), and at this point the algorithm terminates. The final clusters with their
objects are returned as the result. The algorithm attempts to determine k partitions that
minimize the square-error function. The method is scalable and efficient in processing large
data sets, because the complexity of each iteration is linear in the number of objects. In practice, however, it can take considerable time,
and the method often terminates at a local optimum.
Variations of K-Means
A number of variations to the K-Means algorithm have been developed in an effort to
improve its computational efficiency or extend its expressiveness in categorical and mixed
data. The ISODATA algorithm uses the technique of merging and splitting clusters in order to
obtain the optimal partition, starting from any arbitrary initial partition, utilizing appropriate
threshold values for performing this process [Georgios, 2007].
The dynamic clustering algorithm permitted representations other than the center of a
cluster by utilizing maximum-likelihood estimation and selecting a different criterion function. Other
research efforts improved the computational complexity by reducing the number of (dis)similarity
calculations. Two very important steps in the evolution of the K-Means algorithm family
are its extensions to categorical data and to mixed numeric and categorical data through the
development of the K-modes and K-prototypes algorithms. K-modes uses a simple matching
dissimilarity measure to deal with categorical objects, while K-prototypes defines a
combined dissimilarity measure, integrating the K-modes and K-Means algorithms to allow for
clustering of mixed numeric and categorical attributes.
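The simple matching dissimilarity mentioned above is commonly taken to be the number of attributes on which two categorical objects disagree. A minimal sketch under that assumption follows; the function name and the sample records are hypothetical.

```python
def simple_matching_dissimilarity(a, b):
    """Number of attribute positions at which two categorical objects differ."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return sum(1 for x, y in zip(a, b) if x != y)

# Two hypothetical categorical records with three attributes each.
d = simple_matching_dissimilarity(("male", "smoker", "A+"),
                                  ("male", "non-smoker", "O-"))
```

The two records agree on the first attribute and differ on the other two, so the dissimilarity is 2.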
Issues of K-Means Method
There are several issues to be addressed in implementing K-means for any practical
applications such as image segmentation. These are summarized below:
1. Selection of number of clusters. With regard to echo image clustering, it is normally set
to 3 as there is no need to identify more regions.
2. Initial values or seed values for the cluster centroids. Most of the researchers follow
random selection of initial cluster centers.
3. If one or more clusters become empty, this has to be taken care of in the design of the
model and SQL statements.
4. Use of appropriate distance metrics. A number of methods are specified in
the literature for the distance calculations: Manhattan, Minkowski, Cosine-similarity,
etc. However, Euclidean distance has been found to perform better than the other
techniques.
5. Condition to terminate the algorithm. The time K-Means takes to converge, as explained earlier,
depends upon the distribution of the data. Fixing the number of iterations in
advance is a better and faster approach.
One of the major issues of K-Means algorithm is its poor time complexity for large data
sizes [Fahim, 2006]. Due to this drawback, K-Means may not be suitable for practical
applications such as images with high resolution. To address this, instead of implementing this
algorithm in application address space, a model can be devised where all the K-Means steps are
executed within the DBMS address space. This procedure is expected to speed up the process
with appropriate SQL and/or PL/SQL statements.
3.4 SQL BASED K-MEANS ALGORITHM: USING UPDATE STATEMENT
This section describes the modified K-Means algorithm for the image segmentation process
by using a well designed schema (a set of DBMS tables) and SQL statements. The main
contribution lies in speeding up the Euclidean distance and update operations of cluster
assignment. For K-means, the most intensive step is distance computation, which has time
complexity O(dkn). This step requires both significant CPU use and I/O [Carlos, 2006a].
3.4.1 DEFINITIONS
The input for K-Means is a data set D containing n points with d dimensions, D = {i1, i2, i3,
.., in}. Here k = 3 is chosen, because echo images are segmented into three major
regions in the cardiac chamber: blood pool (black region), near endocardium (white region),
and the rest (gray region). As explained earlier in Chapter 2, the data set is the set of pixel values of
the given image f(x, y) of size M × N, where f(x, y) is the gray scale value of the pixel at location
(x, y). All 2D echo images are stored as RGB images whose three channels hold the same intensity value. Hence, the pixel
data set can be treated as 1-D rather than 3-D, which makes the algorithm simpler.
Table 3.1 Matrices / tables
Matrix      Size     Description
Data        n × d    Pixel Data
Centroid    k × d    Cluster Mean
Eucl        n × k    Euclidean Distance
CV          n × 1    Cluster Assignment
CD          n × d    Cluster Data
A total of five tables are needed for this algorithm; their structures are shown in Table
3.1.
Each tuple in Data represents a pixel with its spatial co-ordinates and its gray scale
intensity value [0-255]. Since k = 3, the Centroid table always contains 3 rows, one for the
centroid of each cluster in the current iteration. Next, in order to store the Euclidean distances, the
table Eucl is used; each row in this table gives the distances of the ith pixel to the
centroids of the k clusters. The table CV (Cluster Vector) is a temporary table that stores the pixels
and their assigned cluster number (j). Finally, CD (Cluster Data) is a join of the CV and Data tables
giving the pixel details (its x, y, and gray scale value) along with the cluster number
assigned after the specified number of iterations. This is the desired output, and this table can be
used for further processing. The following subscripts are used:
i : 1..n : number of data points (pixels)
j : 1..k : number of clusters
l : 1..d : number of dimensions
3.4.2 PROPOSED METHOD – K-MEANS SQL ALGORITHM
This subsection describes the proposed algorithm for segmentation process by using an
efficient implementation of K-Means algorithm using SQL.
Data(i, x, y, val)   Centroid(j, x, y, val)   Eucl(i, d1, d2, d3)
CV(i, j)   CD(i, j, x, y, val)
Fig. 3.4 Schema of all five tables for K-Means Algorithm. Shaded cells are Primary Keys.
It follows almost the same approach as that of the algorithm shown in Figure 3.2, except
that all computations are carried out with the five relational DBMS tables shown above and a
series of UPDATE statements written in SQL. Figure 3.4 shows the relational schema with the
attributes.
The Data table consists of 4 attributes. The first attribute, i, is the identification number (id)
of each pixel, the second and third attributes, <x, y>, give the spatial co-ordinates of the ith
pixel in the echo image, and the fourth attribute is its intensity value. Here, i is declared as the
primary key. Next, the Centroid table is used to store the mean value of each cluster. For
this table j is selected as the primary key. It is also a foreign key which references the primary key i
in the Data table. The seed values for the three clusters are picked from the Data table using random number
generation and stored in this table. Subsequently, the new mean of
each cluster is calculated and updated. During these iterations the <x, y> values have no
meaning. Since k = 3, the Centroid table will always have 3 tuples. For larger values of k (k >
3), that many rows are inserted in this table and no change in its schema is required.
The distance between each data point in Data with each of cluster mean, i.e. each row in
Centroid, is computed using Euclidean distance and the computed value is stored in Eucl table.
The attributes d1, d2, and d3 represent the distance of ith pixel to each cluster center. To assign
a cluster number for each pixel, the minimum of <d1, d2, d3> is computed and stored in the
table CV. That is, the attribute i is the pixel id and j is the assigned cluster number during a
particular iteration. To produce the result a natural join of CV with Data is carried out to obtain
the final output table CD, showing the pixel id, cluster number, x-y coordinates, and pixel
intensity value.
Except Data table, all the other tables are updated during every iteration. It is assumed that
all these tables are properly indexed. The algorithm is terminated after a finite number of
iterations. Better results for clustering were obtained with 4 to 6 iterations.
Algorithm K_Means_SQL(Q)
// Input: Grayscale image, I; Q - # of iterations
// Output: Segmented Image, S
DeleteTableData() // Delete all rows in Data
Data ← I
Initialize Centroid table with random values selected from Data table
Initialize Eucl, CV, CD tables
for m ← 1 to Q do
a) Compute the mean for each cluster grouped by j in CD table and update
Centroid table
b) Compute the Euclidean distance for each pixel in Data table with each
cluster mean in Centroid table and update Eucl table
c) Compute the minimum for each row in Eucl and assign cluster number.
Update CV table.
d) Update CD table by joining CV and Data tables for the next iteration
Foreach cluster data do
a) C1 ← 0; C2 ← 150; C3 ← 255
b) Update CD
Create the image, S out of CD table data
Return S
end K_Means_SQL.
Fig. 3.5 Algorithm for SQL based K-Means
The proposed algorithm is shown in Figure 3.5. Its running time efficiency comes mainly
from the for loop, whose steps are executed using only UPDATE statements
and not the costly two-step process of DELETE and INSERT.
3.4.3 INITIALIZATION STEPS
For proper working of the SQL based K-Means the database tables must be initialized with
proper data as explained below.
(1) DeleteTableData(): Before populating the database tables, any old data must be deleted
to avoid integrity constraint violations. This is done using the following SQL statements:
DELETE FROM CD;
DELETE FROM CV;
DELETE FROM Eucl;
DELETE FROM Centroid;
DELETE FROM Data;
(2) Load image data to Data table: Next, the image data is loaded into the Data table
using:
INSERT INTO Data (i, x, y, val)
VALUES (:i, :x, :y, :val);
where :i, :x, :y, :val are arrays holding the pixel ids, pixel coordinates, and intensity values. This
method of parameterized insertion is more efficient: instead of inserting one row at a time, which
consumes time, four one-dimensional arrays are bound to the parameters :i, :x, :y, and :val, so that
all rows are inserted by a single execution of the statement.
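The loading step can be sketched as follows. This is not the C#.NET and Oracle code used in this work: SQLite (through Python's sqlite3 module) stands in for the DBMS, executemany stands in for array binding, and the helper name image_to_rows, the sample image, and the convention that x is the row index and y the column index are assumptions for illustration.

```python
import sqlite3

def image_to_rows(image):
    """Flatten a 2-D grayscale image (list of rows) into (i, x, y, val) tuples."""
    rows, i = [], 1
    for x, line in enumerate(image):
        for y, val in enumerate(line):
            rows.append((i, x, y, val))
            i += 1
    return rows

image = [[1, 2, 14],       # hypothetical 2 x 3 image
         [8, 15, 9]]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (i INTEGER PRIMARY KEY, x INTEGER, y INTEGER, val INTEGER)")
# One parameterized statement, executed once for the whole batch of rows.
conn.executemany("INSERT INTO Data (i, x, y, val) VALUES (?, ?, ?, ?)",
                 image_to_rows(image))
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM Data").fetchone()[0]
```

The statement text is prepared once and reused for every pixel, which is the point of the parameterized form regardless of the driver used.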
(3) Initialize Centroid: In order to initialize the Centroid table, 3 pixels are selected from Data
using random number generation.
INSERT INTO Centroid (
SELECT 1, x, y, val FROM Data WHERE i = :j1);
INSERT INTO Centroid (
SELECT 2, x, y, val FROM Data WHERE i = :j2);
INSERT INTO Centroid (
SELECT 3, x, y, val FROM Data WHERE i = :j3);
where :j1, :j2, and :j3 are bind variables holding the three random numbers generated using the host language
(C#.NET) restricted to 0-255.
(4) Initializing other Tables: The other tables are initialized in a similar way and are shown
below:
INSERT INTO Eucl (
SELECT i, sqrt(power((Data.val - c1.val), 2)) as d1,
sqrt(power((Data.val - c2.val), 2)) as d2,
sqrt(power((Data.val - c3.val), 2)) as d3
FROM Data,
(SELECT * FROM Centroid WHERE j = 1) c1,
(SELECT * FROM Centroid WHERE j = 2) c2,
(SELECT * FROM Centroid WHERE j = 3) c3)
ORDER BY i;
INSERT INTO CV (
SELECT i,
Case when d1 <= d2 and d1 <= d3 then 1
when d2 <= d3 and d2 <= d1 then 2
when d3 <= d2 and d3 <= d1 then 3
end as j
FROM Eucl);
INSERT INTO CD (
SELECT Data.i, j, Data.x, Data.y, Data.val
FROM Data, CV
WHERE Data.i = CV.i);
Here the Euclidean distance is calculated for all pixels with respect to the centroids
with a single SQL statement and without any join condition. The time taken to combine c1, c2, and c3
with the Data table is negligible, because each of these derived tables holds exactly one row of the 3-row Centroid table. Similarly, the CV
table is populated with the cluster of minimum distance in one pass with a Case expression. Using the
cluster assignment details from the CV table, the table CD gets all the details of the data points with
an equi-join operation.
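Since the pixel data is one-dimensional, the expression sqrt(power(a - b, 2)) used in the statements above reduces to the absolute difference |a - b|. The following sketch makes this explicit with a general d-dimensional distance function; the function name is an assumption, and the sample values are those of the worked example in Section 3.5.1.

```python
import math

def euclidean(point, centroid):
    """Euclidean distance between two d-dimensional points (Equation 3.2)."""
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))

centroid_vals = [1, 14, 9]                 # seed centroids of the worked example
pixel_vals = [1, 2, 14, 8, 15, 9, 50]      # pixel intensities of the worked example
# One row per pixel with the distances d1, d2, d3, as in the Eucl table.
eucl = [[euclidean((v,), (c,)) for c in centroid_vals] for v in pixel_vals]
```

Each row of eucl corresponds to one row of the Eucl table; for instance, the pixel with intensity 50 is at distances 49, 36, and 41 from the three centroids.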
3.4.4 UPDATING DBMS TABLES
The four sub-steps (a) to (d) under the for loop of the algorithm can be implemented using
UPDATE statements. An alternative is to use a combination of DELETE and INSERT
statements. However, the running time required to execute a DELETE statement is higher in any database,
because it stores all the deleted records in the redo/undo log files for a possible rollback
operation. It is more often a row-by-row operation than a bulk operation.
The fundamental requirement is that, except for the Data table, all the other tables must be
updated. Depending upon the data, certain clusters may be empty. For instance, assume that in
a given image all the pixels have the same intensity, leading to just one cluster. Then the
other two clusters would be empty, which normally causes errors during the SQL join
operations. Hence, the queries are designed to take care of this situation using a left outer join.
The sub-steps of the for loop in Figure 3.5 can be accomplished using the following SQL
statements:
(a) Update of Centroid Table
During each iteration, the table CD holds the pixels with their assigned cluster id
(1, 2, or 3). As per the K-Means algorithm, the new mean of each cluster must be computed and stored in the
Centroid table. The means are obtained by executing the following query:
SELECT j, Avg(val) as val
FROM CD
GROUP BY j;
This query produces 3 rows, one for each cluster, when no cluster is empty. The update must be carried out in
Centroid without deleting the old data. This is achieved by using a correlated subquery
as shown below:
UPDATE Centroid c3
SET (j, val) =
( SELECT c1.j, c2.val
FROM Centroid c1,
(SELECT j, Avg(val) as val
FROM CD
GROUP BY j) c2
WHERE c1.j = c2.j(+) AND c1.j = c3.j );
This concept is adopted for all table updates.
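The effect of the outer join on an empty cluster can be checked with a small sketch. SQLite is used here as a stand-in for the DBMS, and the ANSI LEFT OUTER JOIN syntax replaces the (+) operator; the sample data, in which every pixel falls into cluster 1, is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Centroid (j INTEGER PRIMARY KEY, val REAL);
    CREATE TABLE CD (i INTEGER PRIMARY KEY, j INTEGER, val INTEGER);
    INSERT INTO Centroid VALUES (1, 10), (2, 120), (3, 240);
    -- Every pixel has the same intensity, so clusters 2 and 3 are empty.
    INSERT INTO CD VALUES (1, 1, 10), (2, 1, 10), (3, 1, 10);
""")
new_means = conn.execute("""
    SELECT c.j, m.val
    FROM Centroid c
    LEFT OUTER JOIN (SELECT j, AVG(val) AS val FROM CD GROUP BY j) m
        ON c.j = m.j
    ORDER BY c.j
""").fetchall()
```

The grouped subquery alone returns a single row, but the outer join still yields one row per centroid, with a NULL mean (None in Python) for each empty cluster. These NULL values are what the NVL calls in the later distance query have to handle.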
(b) Update of Eucl Table
The purpose of Eucl table is to find the Euclidean distance from each pixel in Data table to
each of the cluster centroid, i.e. each Centroid table row. This operation must be carried out
with a single SQL statement so that the design can be made more general. In other words, the
same SQL statement will be applicable for any value k; thus achieving scalability.
UPDATE Eucl e1
SET (i, d1, d2, d3) =
( SELECT i, sqrt(power((e2.val - c1.val), 2)) as d1,
sqrt(power((e2.val - c2.val), 2)) as d2,
sqrt(power((e2.val - c3.val), 2)) as d3
FROM Data e2,
( SELECT j, x, y, NVL(val, 0) as val
FROM Centroid
WHERE j = 1) c1,
( SELECT j, x, y, NVL(val, 0) as val
FROM Centroid
WHERE j = 2) c2,
( SELECT j, x, y, NVL(val, 0) as val
FROM Centroid
WHERE j = 3) c3
WHERE e1.i = e2.i);
The subquery generates four tables: Data, c1, c2, and c3 corresponding to pixel data and the
three clusters. The SELECT clause computes the distance and updates the three distances d1,
d2, and d3. Note that NVL function is used to take care of the possible NULL values in any of
the clusters.
(c) Update CV Table
Cluster assignment is another important step, in which each pixel is assigned the
cluster id corresponding to the minimum of d1, d2, and d3 in the table Eucl. This is done using
the following query:
UPDATE CV v1
SET (i, j) =
( SELECT i,
Case when d1 <= d2 and d1 <= d3 then 1
when d2 <= d3 and d2 <= d1 then 2
when d3 <= d2 and d3 <= d1 then 3
End as j
FROM Eucl v2
WHERE v1.i = v2.i);
It is interesting to note that this query uses a Case statement and no join operation, except
for correlated update operation.
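The behaviour of the Case expression, including how ties are resolved, can be mirrored in a short sketch. The function name and the sample distance triples are assumptions for illustration.

```python
def assign_cluster(d1, d2, d3):
    """Mirror of the Case expression: the first branch whose test holds wins."""
    if d1 <= d2 and d1 <= d3:
        return 1
    if d2 <= d3 and d2 <= d1:
        return 2
    if d3 <= d2 and d3 <= d1:
        return 3
    return None  # not reached for ordinary numbers; corresponds to a NULL result

# Hypothetical distance rows; the last one is a three-way tie.
assignments = [assign_cluster(*row) for row in
               [(0, 13, 8), (13, 0, 5), (7, 6, 1), (5, 5, 5)]]
```

Because the branches are tested in order and use <=, a pixel that is equally close to several centroids is assigned to the cluster with the lowest id.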
(d) Update CD Table
For the next iteration to work properly, the input data pixels and their assigned cluster
ids must be available together. This means that the Data table and CV table must be joined.
The query below shows this operation:
UPDATE CD c1
SET (i, j, x, y, val) =
( SELECT d2.i, j, d2.x, d2.y, d2.val
FROM Data d2, CV d3
WHERE d2.i = d3.i AND c1.i = d2.i);
Instead of DELETE and then INSERT, an UPDATE statement is used for this purpose.
After the desired number of iterations, i.e. Q the final cluster assignment can be found in table
CD.
(e) Final Output Table - CD
In order to display the segmented image, 0 (black) is assigned to all pixels in cluster 1, 150
(gray) is assigned to all pixels in cluster 2, and 255 (white) is assigned to all pixels in cluster 3.
UPDATE CD c1
SET (j, val) =
( SELECT j, DECODE (j, 1, 0,
2, 150,
3, 255) val
FROM CD c2
WHERE c1.i = c2.i);
Fig. 3.6 Query for segmented image pixels
Since it is faster to do this task with a single SQL statement, the same is achieved by using
a DECODE statement as shown in Figure 3.6. Now the table CD contains the pixels with
appropriate intensity values (0/150/255) corresponding to the 3 clusters and using this data we
can construct an image. The computational efficiency, quality of clustering, and other details
related to this design are discussed later in this chapter.
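Constructing the output image from the rows of CD can be sketched as follows. The function name, the dictionary GRAY_LEVEL, the sample rows, and the convention that x is the row index and y the column index are assumptions for illustration; the gray levels are those assigned by the DECODE expression.

```python
GRAY_LEVEL = {1: 0, 2: 150, 3: 255}   # cluster id -> display intensity

def rows_to_image(cd_rows, height, width):
    """Rebuild a 2-D image from (i, j, x, y, val) rows, recoloured by cluster id."""
    image = [[0] * width for _ in range(height)]
    for _i, j, x, y, _val in cd_rows:
        image[x][y] = GRAY_LEVEL[j]
    return image

# Hypothetical CD rows for a 2 x 2 image: (i, j, x, y, val).
cd_rows = [(1, 1, 0, 0, 1), (2, 1, 0, 1, 2), (3, 2, 1, 0, 14), (4, 3, 1, 1, 9)]
segmented = rows_to_image(cd_rows, height=2, width=2)
```

The original intensity val is ignored at this stage; only the cluster id and the coordinates determine the output pixel.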
3.4.5 ISSUES RELATED TO SQL K-MEANS ALGORITHM
Though the SQL based K-Means algorithm is faster than the other versions, it suffers from
the following shortcomings:
1. The major drawback of the current design is that UPDATE operation is slower, because
it is a row-by-row operation.
2. To achieve UPDATE query design, the correlated version requires an extra join
operation. This obviously slows down the execution.
3. Two tables, CV and CD, are used in the design. They can be merged into a single
table.
These weaknesses can be addressed by modifying the schema design and this modified
design is discussed in the next section.
3.5 FAST SQL BASED K-MEANS ALGORITHM USING TRUNCATE-INSERT STATEMENTS
As explained in section 3.4.5, UPDATE is a row-based statement, which leads to inefficient
execution. To overcome this inefficiency, a DELETE-INSERT combination can be used to refresh
the rows in each of the tables.
However, DELETE is time consuming as stated earlier; a faster statement that
can be used in place of DELETE is TRUNCATE. It is faster because it
does not store the records in the redo/undo log buffer and it is a page-wise operation. Also,
merging the CV and CD tables into one table called CVCD further speeds up the execution.
Data Centroid Eucl
CVCD SI
Fig. 3.7 Schema for Fast K-Means Algorithm. Shaded cells are Primary Keys
The modified schema is shown in Figure 3.7 where the other table structures remain same
as the previous design except that CV and CD are merged and called as CVCD. The segmented
image pixel data will be available in a new table called SI. In this design we reduce the update
operations/steps from 4 to 3 as shown in Figure 3.8.
Algorithm FKM_SQL(I, Q)
// Input: Echo image, I and # of iterations: Q
// Output: Segmented Image, S
1. DeleteTableData() // TRUNCATE table
2. Load Image pixel values to Data table.
3. Initialize Centroid table with random values selected from Data table.
4. Initialize Eucl and CVCD tables.
5. for m ← 1 to Q do
a) INSERT INTO Centroid ← Average(CVCD Group By j) // average of each
cluster, j = 1..k
b) INSERT INTO Eucl ← Euclidean distance (Datai * Centroidj), for i = 1..n
and j = 1..k
c) INSERT INTO CVCD ← Min(Eucli(d1, d2, d3)) // cluster assignment
6. For each cluster data in CVCD, assign a special value [0, 150, 255] and insert
into SI.
7. Create the image, S out of SI table data.
8. Return the segmented image, S.
9. end FKM_SQL.
Fig. 3.8 Algorithm for Fast SQL based K-Means
Fig. 3.7 table attributes: Data(i, x, y, val); Centroid(j, x, y, val); Eucl(i, d1, d2, d3); CVCD(i, j, val); SI(i, j, val)
Before inserting any new image data pixels into Data table, all the rows in this table must
be deleted. For this task, the following statements are used:
TRUNCATE TABLE CVCD;
TRUNCATE TABLE EUCL;
TRUNCATE TABLE CENTROID;
TRUNCATE TABLE DATA;
The for loop iterates Q times and executes only 3 steps. Sub-step (a) replaces the
Centroid table data with the newly computed average or mean of
each cluster. Next, step (b) computes the distance between each pixel in the Data table and each
cluster centroid and inserts the distances into the Eucl table. Finally, step (c) computes the minimum of
d1, d2, d3, assigns the corresponding cluster id to each pixel, and inserts the result into CVCD.
The experimental results show that approximately 4 to 6 iterations are sufficient to attain
good segmentation. To show the segmented image, the pixel data in each cluster is assigned a
distinct gray level: black (0), gray (150), white (255). Lines 6 and 7 perform these operations, and the
final segmented image data is stored in table SI.
SQL Implementation
The SQL code for the update operations of lines 5(a), 5(b), and 5(c) is shown here.
Inserting into Centroid Table
As the number of tuples in Centroid is always 3, there is no need to apply TRUNCATE and
then INSERT statements. A direct approach is to use the UPDATE statement itself as shown
below:
UPDATE Centroid c3
SET ( j, val ) =
( SELECT c1.j, c2.val FROM Centroid c1,
( SELECT j, Avg(val) as val
FROM CVCD
GROUP BY j ) c2
WHERE c1.j = c2.j(+) AND c1.j = c3.j );
Inserting into Eucl Table
Before inserting new values into Eucl table, its contents are removed by executing
TRUNCATE statement.
INSERT INTO Eucl (i, d1, d2, d3)
( SELECT i, sqrt(power((e2.val - c1.val), 2)) as d1,
sqrt(power((e2.val - c2.val), 2)) as d2,
sqrt(power((e2.val - c3.val), 2)) as d3
FROM Data e2,
(SELECT j, x, y, NVL(val, 0) as val FROM Centroid WHERE j = 1) c1,
(SELECT j, x, y, NVL(val, 0) as val FROM Centroid WHERE j = 2) c2,
(SELECT j, x, y, NVL(val, 0) as val FROM Centroid WHERE j = 3) c3 );
Inserting into CVCD Table
The previous SQL statement populates each row of the Eucl table with the three distances
to the cluster centroids. The ith pixel in this table belongs to one of the
clusters (i.e. 1, 2, or 3) depending upon which d is the minimum.
INSERT INTO CVCD (i, j, val)
( SELECT v1.i,
Case when d1 <= d2 and d1 <= d3 then 1
when d2 <= d3 and d2 <= d1 then 2
when d3 <= d2 and d3 <= d1 then 3
end as j, v1.val
FROM Eucl v2, Data v1
WHERE v2.i = v1.i);
A Case statement is used in order to find the least value out of d1, d2, and d3 and assign the
corresponding value as cluster id. Hence, after the execution of this query the CVCD table
would contain pixel data id, cluster number, and the actual grayscale value corresponding to i,
j, and val respectively.
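Two details of these queries are easy to miss: for a single grayscale dimension, sqrt(power(x, 2)) is simply the absolute difference |x|, and the CASE branches are tested in order, so a tie between two distances goes to the lower cluster id. The Python sketch below is illustrative only; the function names distances and assign_cluster are assumptions for this example.

```python
def distances(val, centroids):
    """Distances of one pixel value to the centroids; None is treated as 0, like NVL(val, 0)."""
    return tuple(abs(val - (0 if c is None else c)) for c in centroids)


def assign_cluster(d1, d2, d3):
    """Mirror of the CASE expression: branches are tested in order."""
    if d1 <= d2 and d1 <= d3:
        return 1
    if d2 <= d3 and d2 <= d1:
        return 2
    if d3 <= d2 and d3 <= d1:
        return 3
    return None  # a CASE without ELSE yields NULL when no branch matches


print(assign_cluster(*distances(8, (1, 14, 9))))   # pixel value 8, centroids 1, 14, 9
print(assign_cluster(5, 5, 9))                      # tie between d1 and d2
```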
These three operations are executed Q times and the algorithm converges quickly as
explained earlier. The next step is to assign distinct grayscale values to pixels in each cluster.
The following query does this:
INSERT INTO SI (i, j, val)
( SELECT i, j, DECODE
(j, 1, 0,
2, 150,
3, 255) val
FROM CVCD
);
It is a simple query that assigns 0 (black) to all pixels in cluster 1, 150 (gray) to all pixels in
cluster 2, and 255 (white) to all pixels in cluster 3 and stores these pixel data in a new table SI.
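The DECODE call behaves like a lookup table from cluster id to gray level. A minimal Python equivalent is sketched below; the names GRAY_LEVEL and to_segmented_image are assumptions for illustration.

```python
# Cluster id -> display gray level, as in DECODE(j, 1, 0, 2, 150, 3, 255).
GRAY_LEVEL = {1: 0, 2: 150, 3: 255}


def to_segmented_image(cvcd):
    """Build SI rows (i, j, val) from CVCD rows (i, j, original val)."""
    # dict.get returns None for an unknown id, as DECODE without a default returns NULL.
    return [(i, j, GRAY_LEVEL.get(j)) for i, j, _val in cvcd]


si = to_segmented_image([(1, 1, 1), (3, 2, 14), (4, 3, 8)])
print(si)
```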
3.5.1 EXAMPLE
To illustrate the working of the algorithm as shown in Figure 3.8, assume just 7 rows in the
Data table. Figure 3.9 shows the initial data in the following tables: Data, Centroid, Eucl, and
CVCD.
Data
i   x   y   val
1   0   0   1
2   0   1   2
3   0   2   14
4   1   0   8
5   1   1   15
6   2   1   9
7   2   2   50
Centroid
j   x   y   val
1   0   0   1
2   0   2   14
3   2   1   9
Eucl
i   d1   d2   d3
1   0    13   8
2   1    12   7
3   13   0    5
4   7    6    1
5   14   1    6
6   8    5    0
7   49   36   41
CVCD
i   j   val
1   1   1
2   1   2
3   2   14
4   3   8
5   2   15
6   3   9
7   2   50
Fig. 3.9 Table data after initialization process
When the SQL statements in the for loop are executed, i.e. the update of Centroid and the
two insert statements for the Eucl and CVCD tables, the table data are modified. The contents
of these tables at the end of the first iteration are shown in Figure 3.10.
Centroid
j   x   y   val
1   0   0   1.5
2   0   2   26.33
3   2   1   8.5
Eucl
i   d1     d2      d3
1   .5     25.33   7.5
2   .5     24.33   6.5
3   12.5   12.33   5.5
4   6.5    18.33   .5
5   13.5   11.33   6.5
6   7.5    17.33   .5
7   48.5   23.66   41.5
CVCD
i   j   val
1   1   1
2   1   2
3   3   14
4   3   8
5   3   15
6   3   9
7   2   50
Fig. 3.10 Contents of tables at the end of first iteration
Comparing the CVCD after first iteration with the initial values, it is clear that no change
occurs in cluster 1. But elements 14 and 15 have moved from cluster 2 to cluster 3. Hence, the
cluster data after the first iteration is C1 = {1, 2}, C2 = {50}, C3 = {14, 8, 15, 9}. We can easily
observe that any further iteration will not shift the elements. The termination of the algorithm is
purely based on Q, which is set by visually verifying the quality of clustering output.
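The seven-pixel example can be replayed with a short Python sketch. It is a minimal in-memory imitation of the initialization step and the for loop, not the SQL implementation itself; the function names assign and recompute are assumptions, and an empty cluster is given the mean 0.0 to imitate NVL(val, 0).

```python
data = {1: 1, 2: 2, 3: 14, 4: 8, 5: 15, 6: 9, 7: 50}   # Data table: i -> val
centroid = {1: 1.0, 2: 14.0, 3: 9.0}                   # seed centroids: j -> val


def assign(data, centroid):
    """Steps (b) and (c): distances to each centroid, then the nearest cluster id."""
    cvcd = {}
    for i, val in data.items():
        eucl = {j: abs(val - c) for j, c in centroid.items()}
        cvcd[i] = min(sorted(eucl), key=lambda j: eucl[j])   # ties go to the lowest id
    return cvcd


def recompute(data, cvcd, centroid):
    """Step (a): mean of each cluster; 0.0 for an empty cluster."""
    new = {}
    for j in centroid:
        members = [data[i] for i in data if cvcd[i] == j]
        new[j] = sum(members) / len(members) if members else 0.0
    return new


cvcd = assign(data, centroid)            # initialization
after_each_iteration = []
for q in range(2):                       # Q = 2 is enough for this tiny example
    centroid = recompute(data, cvcd, centroid)
    cvcd = assign(data, centroid)
    after_each_iteration.append(dict(cvcd))

print(after_each_iteration[0])           # CVCD after the first iteration
print(centroid)                          # centroids used in the second iteration
```

The second iteration leaves every pixel in the cluster it had after the first one, which agrees with the observation above that further iterations do not shift any element.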
Fig. 3.11 Segmentation by applying the Fast SQL K-Means algorithm: (a) original image,
(b) segmented image with Q = 6
As an example of segmentation with the Fast SQL based K-Means algorithm, Figure 3.11
shows (a) the original echo image in the apical four-chamber view of a normal patient and
(b) its segmented image.
3.5.2 THEORETICAL TIME COMPLEXITY
In a clustering problem, given n points in $\mathbb{R}^d$ and an integer k that denotes the number of
clusters, a partition $S = (S_1, S_2, \ldots, S_k)$ of the given points into disjoint non-empty sets is
constructed and corresponding centers $C = (c_1, c_2, \ldots, c_k)$ are chosen such that
$\sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - c_i \rVert^2$ is minimized. Given the clusters, the centers must be the
centroids. Similarly, given the centers, the clusters are formed by assigning each point in
the set to its nearest center. However, if neither is given, the problem is NP-complete
[Leonard, 2009] even for k = 2. This is denoted as
.
The most common clustering algorithm used is the Lloyd's algorithm. The algorithm starts
with some k clusters, then computes centroids of those clusters and does a reassignment of
points based on the following. For each point, the nearest centroid is designated and the point is
said to belong to the corresponding cluster. This is repeated until the point assignment
stabilizes; that is, there is no more rearrangement of points. It can be observed that the
algorithm gives rise to a local optimum. For this process a theoretical upper bound has to be
determined.
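The iteration just described can be sketched for one-dimensional data as follows. This is a minimal Python illustration under stated assumptions, not the implementation used in this work: the function name lloyd, the max_iter safeguard, and the rule that an empty cluster keeps its previous center are choices made for the example. The sketch also records the sum of squared distances after every update, which shows that the value never increases from one iteration to the next.

```python
def lloyd(points, centers, max_iter=100):
    """Repeat assignment and centroid update until the assignment stabilizes."""
    centers = list(centers)
    assignment = None
    history = []                                   # sum of squared distances per iteration
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:           # no point changed cluster: stop
            break
        assignment = new_assignment
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                            # an empty cluster keeps its center
                centers[j] = sum(members) / len(members)
        history.append(sum((p - centers[a]) ** 2 for p, a in zip(points, assignment)))
    return assignment, centers, history


assignment, centers, history = lloyd([1, 2, 14, 8, 15, 9, 50], [1, 14, 9])
print(assignment, centers, history)
```

The result is a local optimum for the given seeds; other seeds may lead to a different final partition.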
There is a trivial upper bound of $O(k^n)$ iterations, since no partition of the points into
clusters is ever repeated during the course of the algorithm. In d-dimensional space, this bound
was slightly improved by Inaba et al. to $O(n^{kd})$ by counting the number of distinct Voronoi
partitions on n points [Leonard, 2009].
In some cases the running time of the traditional K-Means algorithm is stated as
O(m k d n), where n is the number of data objects, d is the number of dimensions, k is the
number of clusters, and m is the number of iterations. In this algorithm most of the time is spent
in computing the vector distances. Even a linear algorithm can be quite slow if one of the
arguments of O(...) is large, and d is usually large. In echo image segmentation, however,
the value of d is just 1 for grayscale images and 3 for color images.
In the SQL implementation, the time complexity of the algorithm shown in Figure 3.8 can
be written as follows. The operations that contribute to the complexity analysis are:
for i ← 1 to Q do
(a) Centroid ⋈ CVCD
(b) Data × Centroid(j=1) × Centroid(j=2) × Centroid(j=3)
(c) Data ⋈ Eucl
end for
The complexity of these three steps can be estimated as follows:
Step a: Since the CVCD table is indexed, the time complexity is O(log n), assuming the
size of Centroid is small (in this case, it is 3).
Step b: In the worst case, all pixels may be in a single cluster, and the Data table is
indexed. Hence, the upper bound is O(log n).
Step c: Both the Data and Eucl tables are of size n. Assuming the Data table is indexed, the
upper bound for this step is O(n log n).
Since Q is a constant that does not depend on n, the total running time can be written as
T(n) = O(log n) + O(log n) + O(n log n)
= O(n log n)
This result suggests that the proposed algorithm with proper indexing of the tables should
give better performance than the traditional algorithm.
3.6 QUICK K-MEANS: AN IMPROVED VERSION
The Fast SQL K-Means was proposed as an alternative to the conventional K-Means algorithm.
Its shortcoming, however, is the use of two tables, Eucl and CVCD, which must be truncated
and repopulated in each iteration; this is still a time-consuming process. Instead, the two tables
can be denormalized into a single table that accomplishes the entire task. This was observed to
give an improvement of 40%-50% compared to the Fast K-Means algorithm.
3.6.1 SQL QUERY
The design is based on the SQL WITH clause, which makes it possible to write a single
query that creates virtual tables and carries out the desired operations in one pass. The query is:
Insert into Eucl2 (i, d1, d2, d3, val, j)
Select * From (
  With dist as (
    Select i,
      sqrt(power((e2.val - (SELECT NVL(val, 0) as val FROM Centroid2 WHERE j = 1)), 2)) as d1,
      sqrt(power((e2.val - (SELECT NVL(val, 0) as val FROM Centroid2 WHERE j = 2)), 2)) as d2,
      sqrt(power((e2.val - (SELECT NVL(val, 0) as val FROM Centroid2 WHERE j = 3)), 2)) as d3,
      e2.val
    From Data2 e2 )
  Select i, dist.d1, dist.d2, dist.d3, dist.val,
    Case when dist.d1 <= dist.d2 and dist.d1 <= dist.d3 then 1
         when dist.d2 <= dist.d3 and dist.d2 <= dist.d1 then 2
         when dist.d3 <= dist.d2 and dist.d3 <= dist.d1 then 3
    End as j
  From dist );
The schema of Eucl2 is extended with two more attributes, val and j. Here, j is the cluster
number and val is the grayscale value, both of which were part of the CVCD table in the earlier
design. The virtual table created with the WITH clause is named “dist”. The second Select
statement in the Insert clause does the same job as in Fast K-Means, with dist as the table
name. The advantage is that no separate table is required and everything is accomplished in a
single SQL query. This avoids deleting a huge number of tuples from two tables in each
iteration, thus speeding up the process.
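The benefit of the denormalized layout can be seen in a small Python sketch: a single pass produces rows that already carry the distances, the grayscale value, and the cluster id, so the next centroid update reads only these rows and needs no join with a separate CVCD table. The sketch is illustrative only; the function names quick_pass and centroids_from are assumptions.

```python
def quick_pass(data2, centroid2):
    """One pass over (i, val) rows; returns Eucl2-style rows (i, d1, d2, d3, val, j)."""
    c = [0 if centroid2[j] is None else centroid2[j] for j in (1, 2, 3)]   # NVL(val, 0)
    rows = []
    for i, val in data2:
        d = [abs(val - cj) for cj in c]
        j = d.index(min(d)) + 1            # first minimum wins, like the ordered CASE
        rows.append((i, d[0], d[1], d[2], val, j))
    return rows


def centroids_from(rows):
    """Next centroids, computed from the denormalized rows alone."""
    out = {}
    for j in (1, 2, 3):
        vals = [r[4] for r in rows if r[5] == j]
        out[j] = sum(vals) / len(vals) if vals else None
    return out


data2 = [(1, 1), (2, 2), (3, 14), (4, 8), (5, 15), (6, 9), (7, 50)]
rows = quick_pass(data2, {1: 1.5, 2: 79 / 3, 3: 8.5})
print(centroids_from(rows))
```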
3.7 K-MEANS ALGORITHM USING PL/SQL STORED PROCEDURE
A stored procedure is a named set of PL/SQL statements designed to perform an action.
Stored procedures are stored inside the database. They define a programming interface for the
database rather than allowing the client application to interact with database objects directly. As
objects such as stored procedures and triggers become popular, more application code will
move away from external programs and into the database engine. In a database, applications
can be stored and deployed with all of the process logic stored in either Java or PL/SQL blocks.
The aim of this section is to describe a method to implement the K-Means algorithm using
stored procedures.
3.7.1 ADVANTAGES OF PL/SQL STORED PROCEDURES
There are many compelling benefits to putting all Oracle SQL inside stored
procedures. These include:
- Better performance: Stored procedures are loaded once into the SGA and remain there
  unless they become paged out. Subsequent executions of the stored procedure are far
  faster than external code. Also, stored procedures are executed on the database server,
  which is likely to be more powerful than the clients, which in turn means that stored
  procedures should run faster.
- Coupling of data with behavior: Relational tables can be coupled with the behaviors
  that are associated with them by using naming conventions.
- Isolation of code: Since all SQL is moved out of the external programs and into stored
  procedures, the application programs become nothing more than calls to stored
  procedures. As such, it becomes very simple to swap out one database and swap in
  another.
- The code is stored in a compiled form, which means that it is syntactically valid and
  does not need to be compiled at run time, thereby saving resources.
- Less network traffic, which improves scalability.
The use of packages to combine related objects (procedures, functions and variables) into
one physical unit enhances these advantages.
3.7.2 THE DESIGN
The overall design of the stored procedure approach has two fundamental steps. First, a
set of procedures is executed to initialize the various database tables. Second, a procedure is
executed to update these tables.
CREATE OR REPLACE PROCEDURE INIT_CENTROID_3 (N IN PLS_INTEGER)
IS
  Rand1 PLS_INTEGER;
  Rand2 PLS_INTEGER;
  Rand3 PLS_INTEGER;
BEGIN
  -- dbms_random.value(low, high) returns a value >= low and < high,
  -- so the upper limit N + 1 is needed for floor() to yield 1..N.
  Select floor(dbms_random.value(1, N + 1)) Into Rand1 From Dual;
  Select floor(dbms_random.value(1, N + 1)) Into Rand2 From Dual;
  Select floor(dbms_random.value(1, N + 1)) Into Rand3 From Dual;
  -- Copy the three randomly chosen pixels into Centroid as the seeds of clusters 1 to 3.
  Insert into Centroid ( Select 1, X, Y, val From Data Where Data.i = Rand1 );
  Insert into Centroid ( Select 2, X, Y, val From Data Where Data.i = Rand2 );
  Insert into Centroid ( Select 3, X, Y, val From Data Where Data.i = Rand3 );
  Commit;
END;
/
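The seeding step can also be sketched in Python. DBMS_RANDOM.VALUE(low, high) returns a value greater than or equal to low and less than high, so flooring it yields integers from low to high − 1; the sketch below instead uses random.randint, whose bounds are both inclusive, so that every pixel index from 1 to N can be drawn. The function name init_centroids and the dictionary layout are assumptions for illustration. As in the procedure, the seeds are drawn independently, so two clusters may start from the same pixel.

```python
import random


def init_centroids(data, k=3, seed=None):
    """data: dict i -> (x, y, val) with i = 1..N; returns dict j -> (x, y, val)."""
    rng = random.Random(seed)              # a fixed seed makes the run repeatable
    n = len(data)
    centroid = {}
    for j in range(1, k + 1):
        i = rng.randint(1, n)              # inclusive on both ends: 1..N
        centroid[j] = data[i]
    return centroid


pixels = {i: (0, i, 10 * i) for i in range(1, 8)}   # hypothetical 7-pixel Data table
seeds = init_centroids(pixels, seed=1)
print(seeds)
```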
Another important addition to this new design is the /*+APPEND*/ hint, which speeds up
the insert DML operations on the Eucl and CVCD tables. The code above shows the PL/SQL
stored procedure for initializing the Centroid table, and the two procedures below initialize the
Eucl and CVCD tables, as explained under the Fast K-Means algorithm earlier in this chapter.
The INIT_CENTROID_3 procedure generates 3 random numbers (since k = 3) from 1 to N,
where N is the total number of pixels, passed as a parameter. As in Fast K-Means, the other two
tables are also initialized as shown below:
CREATE OR REPLACE PROCEDURE INIT_EUCL_3 IS
BEGIN
  -- Distance of every pixel to each of the three centroids
  Insert /*+APPEND*/ into Eucl (
    Select i, sqrt(power((data.val-c1.val),2)) as d1,
              sqrt(power((data.val-c2.val),2)) as d2,
              sqrt(power((data.val-c3.val),2)) as d3
    From Data, (Select * From Centroid where j = 1) c1,
               (Select * From Centroid where j = 2) c2,
               (Select * From Centroid where j = 3) c3)
  Order by i;
  -- A direct-path (APPEND) insert must be committed before the table
  -- can be read again in the same transaction.
  Commit;
END;
/
CREATE OR REPLACE PROCEDURE INIT_CVCD_3 IS
BEGIN
  -- Assign each pixel to the cluster with the smallest distance
  Insert /*+APPEND*/ into CVCD (
    Select Data.i,
      Case when d1 <= d2 and d1 <= d3 then 1
           when d2 <= d3 and d2 <= d1 then 2
           when d3 <= d2 and d3 <= d1 then 3
      End as j, Data.val
    From Eucl, Data
    Where Eucl.i = Data.i);
  Commit;
END;
/
Oracle's APPEND hint bypasses the partially empty data blocks on the freelist chain.
Instead, Oracle 10g extends the table and uses brand-new, completely empty data blocks for
the inserts. This results in more rows per I/O, speeding up insert performance. The next
PL/SQL procedure, UPDATE_TABLES_3, updates the three tables Centroid, Eucl, and CVCD;
the Eucl and CVCD tables are repopulated by INSERT statements combined with the APPEND
hint. The code is given below:
CREATE OR REPLACE PROCEDURE UPDATE_TABLES_3 IS
BEGIN
  Execute immediate 'Truncate Table Eucl';
  -- Step (a): recompute the mean of each cluster
  Update Centroid c3
  Set (j, val) =
    ( Select c1.j, c2.val
      From Centroid c1, ( Select j, Avg(val) as val
                          From CVCD Group by j ) c2
      Where c1.j = c2.j(+) and c1.j = c3.j );
  -- Step (b): distances of every pixel to the three centroids
  Insert /*+APPEND*/ into Eucl (i, d1, d2, d3)
    ( Select i, sqrt(power((e2.val-c1.val), 2)) as d1,
                sqrt(power((e2.val-c2.val), 2)) as d2,
                sqrt(power((e2.val-c3.val), 2)) as d3
      From Data e2, (Select j, x, y, NVL(val,0) as val From Centroid where j = 1) c1,
                    (Select j, x, y, NVL(val,0) as val From Centroid where j = 2) c2,
                    (Select j, x, y, NVL(val,0) as val From Centroid where j = 3) c3
    );
  Execute immediate 'Truncate Table CVCD';
  -- Step (c): assign each pixel to its nearest cluster
  Insert /*+APPEND*/ into CVCD (i, j, val)
    ( Select Data.i,
        Case when d1 <= d2 and d1 <= d3 then 1
             when d2 <= d3 and d2 <= d1 then 2
             when d3 <= d2 and d3 <= d1 then 3
        End as j, Data.val
      From Eucl, Data
      Where Eucl.i = Data.i );
  Commit;
END;
/
There are other ways to enhance the performance of bulk inserts. For example, Oracle 11g
Release 2 offers the APPEND_VALUES hint for use with the FORALL statement, which is many
times faster than row-by-row inserts. With these procedures compiled and stored in the database
server, the application program (C#.NET) invokes them as follows:
// Run the initialization procedure.
using (OracleConnection conn = new OracleConnection("Data Source=ORCL; User ID=scott; Password=tiger"))
{
    OracleCommand sqlCommandEnable = new OracleCommand();
    sqlCommandEnable.CommandType = CommandType.StoredProcedure;
    sqlCommandEnable.Connection = conn;
    sqlCommandEnable.CommandText = "INIT_TABLES_3";
    conn.Open();
    sqlCommandEnable.ExecuteNonQuery();
}
// Run one update pass (centroids, distances, cluster assignment).
using (OracleConnection conn = new OracleConnection("Data Source=ORCL; User ID=scott; Password=tiger"))
{
    OracleCommand sqlCommandEnable = new OracleCommand();
    sqlCommandEnable.CommandType = CommandType.StoredProcedure;
    sqlCommandEnable.Connection = conn;
    sqlCommandEnable.CommandText = "UPDATE_TABLES_3";
    conn.Open();
    sqlCommandEnable.ExecuteNonQuery();
}
The only difference between the two blocks in the above code is the procedure name
assigned to the CommandText property.
3.8 SEGMENTATION USING K-MEANS ALGORITHM AS
EXTERNAL PROCEDURE (KMEP)
Any database query based design must access each record of the specified table for
manipulation. Without a proper choice of layered design, applications may not run fast.
Designing a database application for optimal performance can seem a daunting challenge.
There are many choices to make (development tools, database design, application structure,
query design, choice of interface), and the "right" choice in each of these areas depends on the
unique application requirements and on the skill set of the designer. A few basic principles and
trade-offs of database application development help considerably in obtaining optimum
performance. This section describes a simple but effective three-layered architecture that
achieves a better running time for the K-Means clustering algorithm than the previous designs.
The main concept is to implement the K-Means algorithm in C, build it as a .dll file, and
call it as an external procedure from the Oracle PL/SQL code. This module is invoked for each
patient’s echo image for segmentation.
3.8.1 EXTERNAL PROCEDURE
An external procedure, sometimes also referred to as an external routine, is a procedure
stored in a dynamic link library (DLL). For instance, a complete C program unit can be
compiled into a .dll file so that it can be called dynamically whenever needed from an
Oracle 10g PL/SQL program unit. These procedures participate fully in the current transaction
and can call back to the database to perform SQL operations. The procedures are loaded only
when necessary, so memory is conserved. Because the call specification is decoupled from its
implementation body, the procedures can be enhanced without affecting the calling programs.
External procedures have the following advantages:
- Isolate execution of client applications and processes from the database instance to
  ensure that any problem on the client side does not adversely impact the database.
- Move computation-bound programs from client to server, where they execute faster
  (because they avoid the round trips of network communication).
- Interface the database server with external systems and data sources.
- Extend the functionality of the database server itself.
These are important to this research work because the CPU-bound work of the K-Means
algorithm is separated from the database yet remains available as an external procedure, so that
both speed and security can be achieved.
3.8.2 ORACLE EXTERNAL PROCEDURE ARCHITECTURE
As shown in Figure 3.12, the process flow starts with a PL/SQL application that calls a
special PL/SQL "module body"; in this example, it is the K-Means program. PL/SQL then
looks for a special Net8 listener process, which is already running in the background. Net8 is
the name of what was formerly the Oracle SQL*Net product. A Net8 listener is a background
process, typically configured by the DBA, that enables other processes to connect to a given
service such as the Oracle server. At this point, the listener spawns an executable program
called extproc. This process loads the dynamic library, invokes the desired routine in the
shared library, and returns the results to PL/SQL.
Fig. 3.12 External Procedure (.dll) model in Oracle 10g
To limit overhead, only one extproc process needs to run for a given Oracle session; this
process starts with the first external procedure call and terminates when the session exits. For
each distinct external procedure, this extproc process loads the associated shared library, but
only if it hasn't already been loaded.
Calling a dynamically linked routine simply maps the shared code pages into the address
space of the "user" process. Then, when that process touches one of the pages of the shared
library, the page is brought into physical memory if it is not already there. The resident pages
of the mapped shared library file are shared automatically between users of that library. In
practical terms, this means that heavy concurrent use of the external procedure often requires
far less processing power and memory than, say, the more primitive approach that one might
take with database pipes.
3.8.3 CREATING EXTERNAL PROCEDURES
In order to create an external procedure for the K-Means algorithm, all the necessary
functions are written in C and stored in a .c or .cpp file. Since a callable DLL has to be created,
this program must not have a main() function. Along with this file, a module definition (.def)
file, containing one or more module statements that describe various attributes of the DLL, is
also created as part of the syntax rules.
This K-Means external DLL file is normally stored in the home path of the database.
External procedures require a listener to be set up as part of the Oracle Net. To call the
K-Means external function, a PL/SQL wrapper is created which maps names, parameter types,
and return types for the C program to the SQL types. For segmenting the echo images, another
PL/SQL procedure is written such that the image data and other parameters are sent to the
external procedure which returns the segmented data back to the database thereby achieving the
tightly-coupled design. A detailed step-by-step procedure for creating external procedures is
given in Appendix A.
3.8.4 DESIGN OF KMEP
The implementation of K-Means with External Procedures (KMEP) is a C implementation
of the conventional K-Means algorithm. However, the implementation uses an array as the
main data structure to store the image data and the cluster id. Consider a dataset with n objects
and d dimensions (attb1..attbd) stored as a simple C/C++ array, as shown in Figure 3.13(a).
These data objects are to be grouped into k clusters. The (d+1)-th column represents the
cluster id, which is initially set to 0 for all data objects.
Data Array
id   attb1   attb2   attb3   attb4   attbi   attbd   cluster id
1    0       0       23      89      0       125     0
2    0       1       45      12      0       255     0
3    0       2       112     67      9       0       0
..   ..      ..      ..      ..      ..      ..      ..
n    400     250     ..      ..      ..      45      0
Centroid Array
id   attb1   attb2   attb3   attb4   attbi   attbd   Mean
1    0       0       23      89      0       125     0
2    0       1       45      12      0       255     0
..   ..      ..      ..      ..      ..      ..      ..
k    400     250     ..      ..      ..      45      0
Fig. 3.13 (a) Structure of the Data Array for a general case (n data points, d dimensions, and k clusters)
(b) Centroid Array
First, the centroid array is initialized by selecting k rows from the data array at random.
These rows act as the seed values. Next, the first data object is taken from the data array and
the first row from the centroid array, and the Euclidean distance between these two objects is
computed. This process is repeated for all cluster entries in the centroid array. Thus, k distances
are computed, which are stored in a temporary one-dimensional array, dist[1..k].
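The array layout and one complete pass over it can be sketched in Python as follows. The KMEP module itself is written in C; this sketch only illustrates the data structure for the grayscale case (d = 1), and it also includes the nearest-centroid assignment and the centroid update that complete a pass. The function name kmep_pass and the row layouts are assumptions for this example.

```python
def kmep_pass(data, centroid):
    """One in-place pass. data rows: [id, x, y, val, cluster_id]; centroid rows: [id, mean]."""
    k = len(centroid)
    for row in data:
        dist = [abs(row[3] - centroid[c][1]) for c in range(k)]   # temporary dist[1..k]
        row[4] = dist.index(min(dist)) + 1      # position of the minimum is the cluster id
    for c in range(k):
        members = [row[3] for row in data if row[4] == c + 1]
        if members:                             # an empty cluster keeps its mean
            centroid[c][1] = sum(members) / len(members)


# Seven hypothetical pixels; the last column (cluster id) starts at 0.
data = [[1, 0, 0, 1, 0], [2, 0, 1, 2, 0], [3, 0, 2, 14, 0], [4, 1, 0, 8, 0],
        [5, 1, 1, 15, 0], [6, 2, 1, 9, 0], [7, 2, 2, 50, 0]]
centroid = [[1, 1.0], [2, 14.0], [3, 9.0]]
for _ in range(2):
    kmep_pass(data, centroid)
print([row[4] for row in data], centroid)
```

Because the cluster id lives in the data array itself, no second structure of size n is needed, and every update is a direct index access.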
Data Array
id       X     Y     val   cluster id
1        0     0     125   2
2        0     1     255   1
3        0     2     0     0
..       ..    ..    ..    ..
100000   400   250   45    0
Centroid Array
id   X     Y    val   mean
1    34    78   125   46
2    0     89   250   250
3    145   56   125   134
dist[1..k]
1      2       3
23.6   123.8   0
Fig. 3.14 Arrays showing the pixel data and centroids for an image size of 400×250, k = 3.
dist[1..k] is a temporary array.
The position of the minimum distance is obtained; this gives the cluster id, which in turn
is written to the data array entry of the corresponding data object. This process is repeated for
all data objects. Figure 3.14 shows this for a typical grayscale echo image with k = 3, d = 1, and
n = 100,000.
The centroid array is updated during each iteration. The advantage of this method is that all
operations are carried out on in-memory data. Because the data array itself is used, with one
extra column to store the cluster id, memory usage is reduced. Since updates are done through
the array index, a considerable speed-up can be achieved. Hence, this method provides a
suitable model for a K-Means implementation that is efficient in both space and time.
3.9 CONSTRAINT-BASED K-MEANS CLUSTERING ALGORITHM
(CKM)
In the previous section, a PL/SQL external procedure written in C was designed for the
K-Means algorithm to speed up the segmentation/clustering process. This approach achieves a
considerable speed-up: an echo image of size 400×250 is segmented in less than 0.5 seconds.
However, this can be improved further. Here a novel technique, the “Constraint-based
K-Means” (CKM) algorithm, is proposed. Experiments show that this approach segments the
same image in approximately half the time taken by the regular method.
Fig. 3.15 Typical echo image with black pixels shown in yellow contours
The regular K-Means algorithm, unfortunately, cannot handle the constraint-based approach
to clustering [Anthony, 2001]. The heuristic used in the clustering algorithm is based on a
property of the echo image. Consider a typical echo image as shown in Figure 3.15, where at
least 30% of the pixels are black. The regions outlined in yellow in Figure 3.15, i.e. the pixels
inside the LV, RV, etc., are all black pixels that indicate the blood pool. These pixels need not
be considered for clustering. This observation helps to formulate the constraint, which is
application specific and results in approximately 50% improvement in execution speed.
3.9.1 CONSTRAINED CLUSTERING (CC) PROBLEM
Constrained clustering (CC), i.e. finding clusters that satisfy user-specified constraints, is
highly desirable in many applications. It leads to effective and fruitful data mining by
capturing application-specific semantics. Formally, constrained clustering can be defined as
follows:
Definition: Given a set D with n data points, a distance function $df : D \times D \to \mathbb{R}$, a positive
integer k, and a set of constraints C, find a k-clustering $\{Cl_1, Cl_2, \ldots, Cl_k\}$ such that
$DISP = \sum_{i=1}^{k} disp(Cl_i, rep_i)$ is minimized and each cluster satisfies the constraints C,
denoted as $Cl_i \models C$.
Here, the “dispersion” of cluster Cli, disp(Cli, repi), measures the total distance between
each object in Cli and the representative repi. In the case of k-means, the representative object
is normally the centroid of the cluster. Depending on the nature of the constraints and the
application, the CC problem can be categorized into, for example, constraints on individual
objects and obstacle objects as constraints.
The constraint used here restricts clustering to a subset of the pixels rather than all of
them. It can be enforced by first executing a SQL query that selects only the pixels satisfying
the constraint (i.e. retrieves only non-black pixels for clustering), while the black pixels are
retained in their original spatial locations. The reduced data can then be used with the
unconstrained k-means algorithm. The number of black pixels in an echo image is not constant;
however, in the samples collected, 90% of the images have 30% pure black pixels.
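The constraint can be sketched in Python as a filter applied before ordinary k-means. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name constrained_kmeans is hypothetical, a pixel counts as black when its value is exactly 0, and black pixels are given the label 0 to mark them as left out of clustering.

```python
def constrained_kmeans(pixels, centers, iterations=6):
    """Cluster only the non-black pixels; black pixels keep label 0 and their position."""
    selected = [i for i, v in enumerate(pixels) if v != 0]    # the constraint
    centers = list(centers)
    labels = [0] * len(pixels)
    for _ in range(iterations):
        for i in selected:
            d = [abs(pixels[i] - c) for c in centers]
            labels[i] = d.index(min(d)) + 1
        for j in range(len(centers)):
            members = [pixels[i] for i in selected if labels[i] == j + 1]
            if members:
                centers[j] = sum(members) / len(members)
    return labels, centers


# Hypothetical scan line in which 4 of 10 pixels are black.
pixels = [0, 0, 0, 40, 45, 200, 210, 120, 0, 125]
labels, centers = constrained_kmeans(pixels, [40, 120, 200])
print(labels, centers)
```

Each iteration touches only the selected pixels, so the work per iteration shrinks in proportion to the share of black pixels in the image.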
3.10 RESULTS AND DISCUSSIONS
This section presents some important experimental results of executing the proposed
algorithms. The objective is to show, through experiments, how the proposed SQL based
K-Means algorithms compare with other K-Means methods reported in the literature [Carlos,
2006a] [Tapas, 2002] [Velmurugan, 2010]. To establish practical efficiency, the algorithms
were implemented using C/C++, C#.NET, and Oracle 10g SQL, together with other software
tools. The input consists of 2D echo images of various resolutions.
3.10.1 RESULTS OF SEGMENTATION
Figure 3.16 shows the segmented output of four 2D echo images. It can be noticed that the
cardiac objects are clearly visible and the region has no gradients.
Fig. 3.16 Two-Dimensional original and segmented echo images in different views (400x250);
each of the four rows (Img1 to Img4) pairs an input image with its segmented image
3.10.2 COMPARISON WITH RESULTS OF OTHER AUTHORS
A comparison between the SQL, PL/SQL based K-Means algorithm and the OptKM,
IncrKM [Carlos, 2006a], Filtering algorithm [Tapas, 2002], and an Enhanced K-Means
[Velmurugan, 2010] is presented in Table 3.2.
Table 3.2 Running time comparison of Fast SQL K-Means with other variants
Method | Settings | Running time (secs)
Proposed K-Means | n = 100K, k = 3, Q = 4 | FKM: 10.1; KMEP: 1.407
Carlos Ordonez [Carlos, 2006a], 4 CPUs, 40 AMPs | n = 100K, k = 4, d = 8, Q = 4 | OptKM: 17.0; IncrKM: 44
Java based [Velmurugan, 2010] | n = 1000 | 7.6
Tapas Kanungo [Tapas, 2002] | n = 262144, dataset ‘Israel’, k = 8, d = 4 | 140.4
Tian Zhang: BIRCH [Tian, 1996], pp. 112-113 | n = 100K, dataset DS1 | 47.1 (increasing k increases T)
According to these results, the proposed modified K-Means algorithms perform better than
the other variants. Further, the machine configurations and test scenarios of these variants are
comparable to those used for the proposed algorithms.
3.10.3 CONSOLIDATED PERFORMANCE OF PROPOSED K-MEANS ALGORITHMS
Figure 3.17 compares all the SQL and PL/SQL based K-Means algorithms: FKM (SQL),
QKM (SQL), FKM (SP), and PL/SQL (EP). Here, d = 1 and k = 3 are used, and the running
time is shown in seconds for n varying from 10,000 to 1,000,000 for one stage of operation.
The PL/SQL (EP) algorithm is faster than all the other algorithms. Comparing FKM (SP) with
QKM, for n greater than 600,000 data points Quick K-Means (QKM) outperforms the FKM (SP)
algorithm. This is because QKM uses a denormalized table, whereas the FKM (SP) algorithm
uses several tables. The difference is significant for large values of n rather than for small ones.
Fig. 3.17 Running time of SQL and PL/SQL (EP) and PL/SQL (SP) K-Means algorithms, d = 1, k = 3.
(Plot of running time in seconds, 0 to 80, against data size n × 1000, 0 to 1000; series: FKM (SQL),
QKM (SQL), FKM (SP), PL/SQL (EP).)
Fig. 3.18 Running time of proposed algorithms with [Carlos, 2006a]
(a) n = 0 to 100,000 (b) n = 0 to 1,000,000
(Both panels plot running time in seconds against data size n × 1000; series: FKM (SQL), FKM (SP),
PL/SQL (EP), Carlos (C++), Carlos (SQL).)
Figure 3.18 compares the proposed algorithms with the variants proposed by
Carlos [Carlos, 2006a], using C++ and SQL implementations. Here k = 8 and d = 8 are used
to match the experimental setup of Carlos. The results show a considerable improvement
by the proposed algorithms over the other variants proposed in the literature. A detailed
discussion of various other results is given in Chapter 10.
3.11 K-MEANS VERSUS OTHER SEGMENTATION FRAMEWORKS
K-Means clustering is often suitable for biomedical image segmentation since the number
of clusters (k) is usually known for images of particular regions of human anatomy. In
biomedical applications, the spatially varying intensity change of a biomedical structure is
usually caused by inhomogeneity in the process of image acquisition, such as the
inhomogeneous distribution of the contrast agent in CT imaging or inhomogeneous distribution
of the magnetic field gradient in MR imaging [Chang Wen, 1998].
Many researchers have described general segmentation methods in the literature, typically
applying a particular method to one or two ultrasound images without specific reference to
ultrasound image formation or context. The focus of this work, however, is specific to echo
images, from which regions of interest such as the LV and RV are extracted. A number of
segmentation algorithms are mentioned in the literature: watershed, fuzzy entropy based
segmentation, Delaunay triangulation, fractals, edge flow, etc. What such an algorithm can
segment is only regions, not objects. To obtain a high-level object, which is desirable in image
analysis and retrieval, human assistance is needed. This is carried out with the help of the
active contour model, where an initial contour is specified by the operator.
Several algorithms have been tried in practice on echo images, including the Otsu threshold
method, Markov random fields, morphology-based methods, and the anisotropic diffusion
model. However, all of these algorithms spoil the shape and/or the edges of the cardiac cavity
regions. The basic requirement in this research is to obtain regions free from uneven pixel
intensity so that the contour can move faster.
Another possibility is a hierarchical clustering algorithm, but this is normally used when the
number of clusters is unknown. In echo image clustering the value of k is always 3, because
medically only three regions are of importance: endocardium, myocardium, and pericardium.
Further, the K-Means algorithm has been widely used by many researchers for medical image
segmentation [Bhagwati Charan, 2010], [Chang Wen, 1998], [Ng, 2006], [Ridho, 2010],
[Muthukannan, 2010], [AKJain, press]. In [Luo, 2011], blood vessel segmentation is done based
on k-means clustering and morphological thinning.
Another reason for choosing the K-Means clustering approach is that all the algorithms used
in this research work are data mining oriented. K-Means clustering is a popular technique that
is used for various applications, including medical image segmentation [Han, 2010].
3.12 SUMMARY
Segmentation is a vital aspect of medical imaging. It aids in the visualization of medical
data and diagnostics of various diseases. Five variants of K-Means algorithm are proposed to
reduce the time complexity and speedup the query processing.
1. SQL based K-Means Algorithm using UPDATE statement
2. Fast SQL based K-Means using TRUNCATE-INSERT statements
3. Quick K-Means
4. K-Means with Stored Procedures
5. K-Means with External Procedures
These algorithms are implemented in a tightly-coupled database environment so that
movement of the patient image data is avoided and the entire analysis is done in the DB address
space. According to the results obtained, the proposed modified K-Means algorithms perform
better than other variants. Further, their clustering quality is adequate for the boundary detection
phase. Other image segmentation methods, such as thresholding, the Otsu algorithm, anisotropic
diffusion, fuzzy k-means, and watershed, were also tried on echo images experimentally, and
K-Means was found to be better in terms of speed-up and quality of segmentation.