
Chapter 1

Fusion of Manual and Non-Manual Information in American Sign Language Recognition

Sudeep Sarkar (1), Barbara Loeding (2), and Ayush S. Parashar (1)

(1) Computer Science and Engineering, University of South Florida, Tampa, Florida 33647
(2) Special Education, University of South Florida, Lakeland, Florida 33803
[email protected] [email protected]

We present a bottom-up approach to continuous American Sign Language recognition without wearable aids, using simple low-level processes that operate on images and build realistic representations, which are fed into intermediate-level processes to form sign hypotheses. At the intermediate level, we construct representations for both manual and non-manual aspects, such as hand movements, facial expressions, and head nods. The manual aspects are represented using relational distributions that capture the statistical distribution of the relationships among the low-level primitives from the body parts. These relational distributions, which can be constructed without part-level tracking, are efficiently represented as points in the Space of Probability Functions (SoPF). Manual dynamics are thus represented as tracks in this space. The dynamics of facial expressions accompanying a sign are also represented as tracks, but in the expression subspace, constructed using principal component analysis (PCA). Head motions are represented as 2D image tracks. The integration of manual with non-manual information is sequential, with non-manual information refining the hypothesis set generated from manual information. We show that with just image-based manual information, the correct detection rate is around 88%. With the addition of facial information, accuracy increases to 92%; the face thus contributes valuable information towards ASL recognition. ‘Negation’ in sentences is correctly detected in 90% of the cases using just 2D head motion information.

1.1. Introduction

While speech recognition has made rapid advances, sign language recognition is lagging behind. With the gradual shift to speech-based I/O devices, there is a great danger that persons who rely solely on sign languages for communication will be deprived of access to state-of-the-art technology unless there are significant advances in the automated recognition of sign languages.

Reviews of prior work in sign language recognition appear in [1] and [2]. From these reviews, we can see that work in sign language recognition initially focused on the recognition of static gestures, e.g. [3–5], and isolated signs, e.g. [6]. Starner and Pentland [7] were the first to seriously consider continuous sign recognition∗. Using HMM-based representations, they achieved near-perfect recognition with sentences of fixed structure, i.e. containing a personal pronoun, verb, noun, adjective, and personal pronoun, in that order. Vogler and Metaxas [8–10] were instrumental in significantly pushing the state of the art in automated ASL recognition using HMMs. In terms of the basic HMM formalism, they have explored many variations, such as context-dependent HMMs, HMMs coupled with partially segmented sign streams, and parallel HMMs. The wide use of HMMs is also seen in foreign sign language recognizers [1]. While HMM-based methods perform very well with limited vocabulary, they are haunted by scalability issues and requirements for large amounts of training data.

Much work in continuous sign language recognition has avoided the very basic problem of segmenting and tracking the hands by using wearable devices, such as colored gloves, data gloves, or magnetic markers, to directly obtain location features. For example, Vogler and Metaxas [8–10] used a 3D magnetic tracking system, Starner and Pentland [7] used colored gloves, while Ma et al. [11, 12] used Cybergloves. However, since such devices are unnatural for signers, in our research we restrict ourselves to plain color images, without the use of any augmenting wearable devices.

Non-manual information, which refers to information from facial expressions, head motion, or torso movement, conveys linguistic information in ASL [10, 13]. Much work in sign language recognition has concentrated on using only hand motion, i.e. manual information, although some work on the automated understanding of sign language facial expressions is under way [14–17]. Non-manual information can provide vital cues. For example, head motion can be used to detect whether an ASL sentence contains ‘Negation’. For instance, the sentence ‘I don’t understand’ is manually signed exactly the same as ‘I understand’, except that there is a distinct ‘head shake’ indicating ‘Negation’ in the sentence ‘I don’t understand’.† There has been some work on detecting ‘head shakes’ and ‘nods’ [15, 18, 19], but there are no results in continuous sign language recognition. In this paper, we use non-manual information to decrease insertion and deletion errors, and to determine whether there is ‘Negation’ in a sentence using the motion trajectories of the head.

We concentrate on the problem of recognizing continuous sign language, i.e. signs in sentences, not isolated signs or finger-spelled signs. We adopt a bottom-up approach with simple low-level processes operating on images to build realistic representations that are fed into intermediate-level processes integrating manual and non-manual information. Drawing on the emerging wisdom in computer vision that simple methods are usually found to be the most robust ones when tested on large data sets, we use simple components. The low-level processes are fairly simple ones involving the detection of skin and motion pixels, and face detection by correlation with an eye template. This works for signing against simple backgrounds, which is the most commonly considered scenario; for more complex backgrounds, alternative strategies, such as those described in [20–22], could be considered. The intermediate level consists of modeling the hand motion using relational distributions, which are efficiently represented as points in the Space of Probability Functions (SoPF). This captures the placement of the hands with respect to the body, but does not capture hand shape accurately. Many signs can be recognized based on just this global information; for signs that are strongly hand-shape dependent, alternative methods, such as that described in [23], can be used. The expression subspace, derived using PCA, is used to represent the dynamics of facial expression. This level also deals with integrating non-manual with manual information to reduce the deletion and insertion errors. The third or topmost level, which we do not explore in this paper, would consist of using context and grammatical information from ASL (American Sign Language) phonology to constrain and prune the hypothesis set generated by the intermediate-level processes.

∗ Note that Sign Language is different from Signed English; the latter is an artificial construct that employs signs but uses English grammatical structure.
† We use the following ASL conventions in this paper. Text in italics indicates a sentence in English, for example ‘I can lipread’. Capitalized text indicates an ASL gloss, for example ‘LIPREAD CAN I’; the ASL gloss for the sign ‘lipread’ is ‘LIPREAD’. Negation in a sentence signed using non-manual markers is indicated by N̂OT or ‘Negation’. A multiword gloss for a single sign in ASL is indicated by a hyphen; for example, ‘DONT-KNOW’ is a multiword gloss for a single sign in ASL.

The primary contribution of this work is the demonstration that, even with fairly simple 2D representations, the use of non-manual information can improve ASL recognition. The integration of facial expression with manual information is confounded by the fact that expression events may not coincide exactly with the manual events. This work also restricts itself to purely image-based inputs and does not require external wearable aids, such as gloves, to enhance the low-level primitives.

1.2. Data Set

Fig. 1.1. Sample images in the dataset. (a) shows an image of the face taken for the sign ‘PHONE’, as captured by camera ‘A’, and (b) shows a synchronous image of the upper body as captured by camera ‘B’.


One of the issues in ASL recognition research is the data set used in the study. We constrained the domain to sentences that would be used while communicating with deaf people at airports. An ASL interpreter signed for the collection and helped with creating the ground truth. Two digital video cameras were used for collecting data; one captured images of the upper body, and the other synchronously captured face images. Fig. 1.1 shows a sample view from each camera.

Some statistics of the data set are as follows. The dataset includes 5 instances of 25 distinct sentences, for a total of 125 sentence samples spanning 325 instances of ASL signs. There are 39 distinct ASL signs. Each sentence has 1 to 5 signs; on average, 2.7 signs are present per sentence. The number of frames in a sentence varies from 51 to 163: the longest sentence, ‘AIRPLANE POSTPONE AGAIN, MAD I’, comprises 163 frames, while the shortest, ‘YES’, comprises 51 frames. The average number of frames in a sentence is about 90. Sign length varies from 4 frames for the sign ‘CANNOT’ to 71 frames for the sign ‘LUGGAGE-HEAVY’. On average, a sign spans 18 frames, or about 0.6 seconds.

There are significant variations among the 5 instances of some of the sentences. For example, the sentence ‘If the plane is delayed, I’ll be mad’ was signed both as ‘AIRPLANE POSTPONE AGAIN, MAD I’ and as ‘AIRPLANE AGAIN POSTPONE, MAD I’. Also, in one instance of the sentence ‘I packed my suitcase’, the ASL sign ‘I’ was not present; this was also true for some other sentences. The reason, as given by the signer, is that signs like ‘I’ are implicit while conversing in ASL and hence can be excluded. For some sentences, ‘Negation’ is conveyed only through ‘head shakes’. For example, for the sentences ‘I understand’ and ‘I don’t understand’, the ASL gloss is the same (‘I UNDERSTAND’); the only difference is that in the sentence ‘I don’t understand’ there is a ‘head shake’, i.e. a non-manual expression, to convey the presence of ‘Negation’.

1.3. Low Level Processing

In much previous work in continuous ASL, the detection and tracking of hands have been simplified using colored gloves [7] or magnetic markers [8]. Even foreign sign language recognizers have used colored gloves or data gloves. There have been recent efforts to extract information and to track directly from color images, without the use of special devices [6, 20–22], but with added computational complexity. Our intermediate-level representation, as we shall see later, does not require the tracking of hands; we just need to segment the hands and the face in each frame. Since segmentation is not the focus of this work, we have used fairly simple ideas based on skin color to detect the hands and face. We cluster the skin pixels using the Expectation-Maximization (EM) algorithm with a Gaussian model for the clusters in the “Lab” color space. We use the 2-class version of the EM algorithm twice: first to separate the background, and a second time to separate the clothing of the signer from the skin pixels. Fig. 1.2(b) shows the segmentation after EM clustering of (a), where the skin and clothing pixels are separated from the background. Fig. 1.2(c) shows the output after the second application of EM to separate skin color from the clothing. Blobs of size greater than 200 pixels are kept; this helps remove pixels that are close to skin color but do not form blobs big enough to be part of a hand or the face. Fig. 1.2(d) shows an example of the final blobs. These blobs, along with the color values and the edge pixels detected within them, comprise the low-level primitives.

Fig. 1.2. Segmentation of skin pixels using EM: (a) original color frame; (b) pixels obtained after the first application of EM; (c) skin pixels obtained after the second application of EM; (d) final blobs corresponding to the hands and face.
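To make the two-pass EM clustering concrete, the following is a minimal sketch in Python, assuming OpenCV and scikit-learn are available. The GaussianMixture model stands in for the EM step described above, and the 200-pixel blob threshold follows the text; the cluster-selection heuristics (smaller cluster taken as the signer, larger mean a* value taken as skin) and the function names are our illustrative choices, not the authors' implementation.

# Sketch of the two-pass EM skin segmentation (assumptions noted above).
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

def em_split(lab_pixels):
    # Two-class EM clustering with a Gaussian model per cluster.
    gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
    labels = gmm.fit_predict(lab_pixels)
    return labels, gmm

def segment_skin(bgr_frame, min_blob_area=200):
    lab = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2LAB).reshape(-1, 3).astype(np.float64)

    # Pass 1: separate the signer (skin + clothing) from the background.
    labels1, _ = em_split(lab)
    fg_class = np.argmin(np.bincount(labels1))      # heuristic: the backdrop dominates the frame
    fg_idx = np.flatnonzero(labels1 == fg_class)

    # Pass 2: separate skin from clothing, using only the foreground pixels.
    labels2, gmm2 = em_split(lab[fg_idx])
    skin_class = np.argmax(gmm2.means_[:, 1])       # heuristic: skin has the larger mean a* value
    skin_idx = fg_idx[labels2 == skin_class]

    mask = np.zeros(bgr_frame.shape[:2], np.uint8)
    mask.flat[skin_idx] = 255

    # Keep only blobs larger than about 200 pixels, as in the text.
    n, cc, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    big = [i for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] >= min_blob_area]
    return np.isin(cc, big).astype(np.uint8) * 255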

1.4. Intermediate Level Representations

We have separate representations for manual movement (or hold), facial expression, and head motion. We would like these representations, in particular the manual motion, to be somewhat robust with respect to low-level errors. The manual motion (including no movement) representation does not require tracking of hands or fingers and emphasizes the 2D spatial relationships of the hands and face. The facial expression representation is a 2D view-based one, and the head motion model takes into account only projected 2D motion. Even without the use of 3D information, we demonstrate robust ASL recognition possibilities.


1.4.1. Manual Movement

Grounded in the observation that the organization, structure, and relationships among low-level primitives are more important than the primitives themselves, we focus on the statistical distribution of the relational attributes observed in the image, which we refer to as relational distributions. Such a statistical representation also alleviates the need for primitive-level correspondence, tracking, or registration across frames. Such representations have been successfully used for modeling periodic motion in the context of identifying a person from gait [24]. Here, we use them to model the aperiodic motion in ASL signs. Primitive-level statistical distributions, such as orientation histograms, have been used for gesture recognition [25]. However, the only use of relational histograms that we are aware of is by Huet and Hancock [26], who used them to model line distributions in the context of image database indexing. The novelty of relational distributions lies in offering a strategy for incorporating dynamic aspects.

We refer the reader to [24] for details of the representation; here we just sketch the essentials. Let F = {f_1, ..., f_N} represent the set of N primitives in an image; for us, these are the Canny edge pixels inside the low-level skin blobs described earlier. Let F^k represent a random k-tuple of primitives, and let the relationship among these k primitives be denoted by R^k. Let the relationships R^k be characterized by a set of M attributes A^k = {A^k_1, ..., A^k_M}. For ASL, we use the distances between two edge pixels in the horizontal and vertical directions, (dx, dy), as the attributes. We normalize the distance between the pixels by a distance D, which is inversely related to the distance from the camera. The shape of the pattern can then be represented by the joint probability function P(A^k = a^k), also denoted P(a^k_1, ..., a^k_M) or P(a^k), where a^k_i is the (discretized, in practice) value taken by the relational attribute A^k_i. We term these probabilities the relational distributions.
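As an illustration of how one such distribution can be built from the low-level primitives, here is a minimal Python sketch for k = 2: it samples pairs of edge pixels inside the skin blobs and histograms their normalized (dx, dy) offsets. The 30x30 binning matches the size quoted later for the trained SoPF; the pair sample size, the use of absolute offsets, and the assumption that D scales offsets into [0, 1] are our choices, not the authors'.

import numpy as np

def relational_distribution(edge_xy, D, bins=30, max_pairs=20000, seed=0):
    # edge_xy: (N, 2) array of (x, y) coordinates of Canny edge pixels inside the skin blobs.
    # D: normalizing distance (inversely related to the distance from the camera).
    rng = np.random.default_rng(seed)
    n = len(edge_xy)
    i = rng.integers(0, n, size=max_pairs)
    j = rng.integers(0, n, size=max_pairs)
    d = np.abs(edge_xy[i] - edge_xy[j]) / float(D)            # normalized (|dx|, |dy|) per sampled pair
    hist, _, _ = np.histogram2d(d[:, 0], d[:, 1], bins=bins, range=[[0, 1], [0, 1]])
    return hist / max(hist.sum(), 1.0)                        # joint probability P(dx, dy)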

One interpretation of these distributions is the following: given an image, if you randomly pick a k-tuple of primitives, what is the probability that it will exhibit the relational attributes a^k, i.e. what is P(A^k = a^k)? Given that these relational distributions exhibit complicated shapes that do not readily afford modeling by a combination of simply shaped distributions, we adopt a non-parametric, histogram-based representation. However, to reduce the size associated with a histogram-based representation, we use the Space of Probability Functions (SoPF).

As the hands of the signer move, the relational distributions change; the motion of the hands introduces non-stationarity in them. Figure 1.3 shows examples of the 2-ary relational distributions for the sign ‘CAN’. Notice the change in the distributions as the hands come down: the change in the vertical direction is clearly visible, while there is comparatively less change in the other direction.


Fig. 1.3. Variations in relational distributions with motion. The left column shows the image frames of the sign ‘CAN’, the middle column shows the edge pixels in the skin blobs, and the right column shows the relational distributions.

Let P(a^k, t) represent the relational distribution at time t. Let

\sqrt{P(a^k, t)} = \sum_{i=1}^{n} c_i(t)\,\Phi_i(a^k) + \mu(a^k) + \eta(a^k)    (1.1)

describe the square root of each relational distribution as a linear combination of orthogonal basis functions, where the \Phi_i(a^k) are orthonormal functions, \mu(a^k) is a mean function defined over the attribute space, and \eta(a^k) is a function capturing small random noise variations with zero mean and small variance. We refer to this space as the Space of Probability Functions (SoPF).

We use the square root so that we arrive at a space where the distances are not arbitrary but are related to the Bhattacharyya distance between the relational distributions, which is an appropriate distance measure for probability distributions. More details about the derivation of this property can be found in [24].
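Concretely, the connection is the following standard identity (our restatement, not reproduced from [24]): for two relational distributions P and Q over the discretized attribute bins,

\left\lVert \sqrt{P} - \sqrt{Q} \right\rVert^2 = \sum_{a^k} \left( \sqrt{P(a^k)} - \sqrt{Q(a^k)} \right)^2 = 2\left( 1 - \sum_{a^k} \sqrt{P(a^k)\,Q(a^k)} \right),

since each distribution sums to one. The right-hand side is, up to the factor of 2, one minus the Bhattacharyya coefficient, so Euclidean distances between square-rooted relational distributions (and, approximately, between their SoPF coordinates) track the Bhattacharyya distance.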

Given a set of relational distributions {P(a^k, t_i) | i = 1, ..., T}, the SoPF can be arrived at by principal component analysis (PCA). In practice, we consider the subspace spanned by a few (N << n) dominant eigenvectors associated with the largest eigenvalues. Thus, a relational distribution can be represented using these N coordinates (the c_i(t)’s), which is a more compact representation than a normalized histogram. ASL sentences form traces in this Space of Probability Functions.
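A minimal sketch of this construction, assuming scikit-learn's PCA and the per-frame relational distributions computed above; the function names are ours, and the choice of N = 15 components follows the observation below that the 15 largest eigenvalues capture most of the variation.

import numpy as np
from sklearn.decomposition import PCA

def train_sopf(training_rel_dists, n_components=15):
    # Stack the square-rooted relational distributions (e.g. 30x30 each) and learn the SoPF by PCA.
    X = np.sqrt(np.stack([rd.ravel() for rd in training_rel_dists]))
    return PCA(n_components=n_components).fit(X)   # stores the mean mu(a^k) and the Phi_i(a^k)

def sopf_track(sopf, sentence_rel_dists):
    # Project each frame; a sentence becomes a track of N-dimensional coordinate vectors c(t).
    X = np.sqrt(np.stack([rd.ravel() for rd in sentence_rel_dists]))
    return sopf.transform(X)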

The eigenvectors of the SoPF associated with the largest eigenvalues are shown in Figure 1.4. The space was trained on 39 distinct signs, and each relational distribution is of size 30×30. The vertical axes of the images plot the distance attribute dy, the distance attribute dx is along the horizontal axes, and brightness is proportional to the component magnitude. The first eigenvector shows 3 modes. The bright spot in the second eigenvector emphasizes differences in the attribute dx between the two features. The third eigenvector is radially symmetric, emphasizing differences in both attributes. Most of the energy of the variation is captured by the 15 largest eigenvalues.

Fig. 1.4. Dominant dimensions of the learned SoPF, modeling the manual motion.


1.4.2. Non-manual: Facial Expression

The first step is the localization of the face in each frame. There are various sophisticated approaches to detecting faces [27–29]; here we adopt a very simple approach that relies on eye localization using eye template matching. The eye template, a rectangular region enclosing the two eyes, is the average image from 4 persons different from the ASL signer, captured with imaging geometry similar to that used for the ASL signs. The correlation is calculated over the whole image only for the first frame of the sentence. For subsequent frames, the center of the rectangular box bounding the eyes is found by correlating around the neighborhood of the center found in the previous frame; a window of 10 pixels in width and height is used for this neighborhood search. After the eyes are detected, we demarcate the face with an elliptical structure. We use the golden ratio for the face [30] to mask the face with two elliptical structures, one for the top part and the other for the bottom. Fig. 1.5 shows example outputs of the eye detection and facial demarcation steps.

Fig. 1.5. (a) Output of eye detection. (b) Extracted elliptical facial region.
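A minimal sketch of the eye localization, assuming OpenCV: the first frame is searched globally and subsequent frames only within a window of about 10 pixels around the previous center, as described above. The use of cv2.matchTemplate with a normalized correlation score, the omission of image-border handling, and the function name are our simplifications.

import cv2

def locate_eyes(gray_frame, eye_template, prev_center=None, search_half=10):
    # Returns the (x, y) center of the best match of the eye template.
    th, tw = eye_template.shape[:2]
    if prev_center is None:                      # first frame of the sentence: search the whole image
        region, off = gray_frame, (0, 0)
    else:                                        # later frames: search near the previous eye center
        cx, cy = prev_center
        x0 = max(cx - tw // 2 - search_half, 0)
        y0 = max(cy - th // 2 - search_half, 0)
        region = gray_frame[y0:y0 + th + 2 * search_half, x0:x0 + tw + 2 * search_half]
        off = (x0, y0)
    score = cv2.matchTemplate(region, eye_template, cv2.TM_CCOEFF_NORMED)
    _, _, _, (bx, by) = cv2.minMaxLoc(score)     # location of the maximum correlation
    return (off[0] + bx + tw // 2, off[1] + by + th // 2)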

We adopt a view-based representation of expression, modeled as traces in an expression subspace, which is computed using principal component analysis (PCA) of 4 expression examples for each of the 39 signs. We have found that the 20 largest eigenvalues capture most of the energy in the expression variations, at least in this dataset. It is interesting to note that various aspects of facial expression that are important to ASL, such as eyebrow motion, cheek puffing, lip movement, and nose wrinkles [13], are captured by the dominant eigenvectors, some of which are shown in Figure 1.6. Lip movements are emphasized by most of the eigenvectors because the interpreter was also mouthing the English equivalents of the signs; this might not be true for native signers.
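The expression subspace is built with the same PCA machinery as the SoPF. As a sketch (our function names; the 20 components follow the observation above), the elliptically masked face crops are vectorized, a subspace is learned from the training examples, and each sentence's frames are projected to form an expression track:

import numpy as np
from sklearn.decomposition import PCA

def train_expression_subspace(training_faces, n_components=20):
    # training_faces: equal-sized, elliptically masked face crops (4 examples x 39 signs here).
    X = np.stack([f.astype(np.float64).ravel() for f in training_faces])
    return PCA(n_components=n_components).fit(X)

def expression_track(subspace, sentence_faces):
    # Project each frame's masked face; a sentence becomes a track of 20-dimensional coefficients.
    X = np.stack([f.astype(np.float64).ravel() for f in sentence_faces])
    return subspace.transform(X)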


1.4.3. Non-manual: Head Motion

Head motion is represented by the sequence of the averages of the two eye locations, which are detected during face localization. The 2D trajectories are defined with respect to the location in the starting frame. Fig. 1.7 shows some example head trajectories. Figs. 1.7(a), (b), and (c) clearly show the presence of ‘Negation’ in the sentences; Fig. 1.7(d) shows the vertical motion of the face indicating a ‘head nod’; and Figs. 1.7(e) and (f) show the motion trajectories for sentences that convey neither positive nor negative meaning.

1.5. Combination of Manual and Non-Manual

We use the facial expression information to reduce the deletion and insertion errors, while the head motion information is used to determine whether a sentence contains ‘Negation’. The combination of non-manual with manual information is not trivial because (i) the non-manuals are not time-synchronized with the manuals; this is not a video synchronization issue, but rather the facial event might lag or lead the manual event; and (ii) the presence of a strong non-manual indicating ‘Assertion’ or ‘Negation’ in a sentence makes it hard to extract facial information for some frames.

Fig. 1.8 shows the information flow architecture. We process manual information, facial expressions, and head motion as independent channels, which are then combined as follows (a code sketch of this pruning is given below):

(1) Find the n signs with the least distances to the sentence using manual information.
(2) Find the distances for the same n signs found in Step 1, using non-manual information.
(3) Sort these signs in ascending order of the distances obtained from non-manual information.
(4) Discard the α signs having the maximum distances from the sorted list obtained in Step 3.
(5) Keep the remaining n − α signs from Step 1.

Finally, head motion is used to detect whether the sentence contains ‘Negation’. The selection of n and α is a function of the number of signs in a sentence and the computational cost of the high-level processes.
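The pruning in Steps (1) to (5) can be written down directly. This is a minimal Python sketch, assuming the per-sign distances to the sentence have already been computed for both channels; the function name and dictionary inputs are ours, and n = 8, α = 2 are the values used in the experiments of Sec. 1.6.2.

def prune_hypotheses(manual_dist, nonmanual_dist, n=8, alpha=2):
    # manual_dist, nonmanual_dist: dicts mapping a sign label to its best-match distance
    # to the sentence (SoPF traces for manual, expression traces for non-manual).
    top_n = sorted(manual_dist, key=manual_dist.get)[:n]                         # Step 1
    worst = set(sorted(top_n, key=nonmanual_dist.get, reverse=True)[:alpha])     # Steps 2-4
    return [s for s in top_n if s not in worst]                                  # Step 5: keep n - alpha signs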

Distances between SoPF traces quantify the motion involved in the manual information, while distances in the face space quantify changes in facial expression. In this work, we adopt a simple distance measure between two traces to find a sign in a sentence using manual and non-manual information. For the manual information, we cross-correlate the SoPF trace of the training sign with the trace of the given sentence and pick the value and shift that result in the minimum (Euclidean) distance. For the non-manual information, we correlate the trace of the trained sign in the time neighborhood where the smallest distance for the manual information has been found.
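A minimal sketch of this trace matching follows; the accumulated per-frame Euclidean distance is our concrete reading of the cross-correlation described above, and the function name is ours.

import numpy as np

def best_match(sign_trace, sentence_trace):
    # sign_trace: (m, d) SoPF trace of a training sign; sentence_trace: (T, d) trace of the sentence.
    # Slide the sign trace along the sentence and return (minimum distance, best shift).
    m, T = len(sign_trace), len(sentence_trace)
    best = (np.inf, 0)
    for shift in range(T - m + 1):
        window = sentence_trace[shift:shift + m]
        dist = np.linalg.norm(window - sign_trace, axis=1).sum()
        best = min(best, (dist, shift))
    return best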

Let us look at some examples of correlating signs with sentences. Fig. 1.9(a) plots the correlation of the manual information of ‘LIPREAD’ with the sentence ‘LIPREAD CAN I’. Lower distance values indicate matches, which in this case occur around frame 12. Similarly, Figs. 1.9(b) and 1.10(c) show the correlation of the manual information of the signs ‘CAN’ and ‘I’ with the same sentence. The actual positions of the signs can be seen in Fig. 1.10(d). The epenthesis movements [31], indicated by E, can also be clearly seen between the signs.

1.6. Experiments

In this section, we show results to demonstrate the efficacy of the proposed approach using the data described earlier. Given that we have 5 instances of 25 distinct sentences, we use 5-fold cross validation to evaluate the effectiveness of (a) manual motion modeling, (b) the integration of facial expression with manual information, and (c) the detection of negation in sentences. Four instances of each sentence are used for training and one is used for testing. Note that some signs occur multiple times in the training data in different arrangements with other signs. There are 65 sign instances, making up the 25 test sentences, to be recognized. Before we present results, a few words about the performance measures are in order.

1.6.1. Quantifying Performance

Since we are considering the output of an intermediate-level process that would typically be further refined using grammar constraints, the performance measures should reflect the tentative nature of the output. We sort the signs on the basis of the minimum distance of each sign to the sentence and then choose the n signs with the least distances. If a sign is part of a sentence but is not present among these n signs, a deletion error has occurred. The number of deletion errors depends on n; as n increases, errors go down, but the cost of high-level processing increases since more possibilities must be considered. Since the maximum number of signs in a sentence is 5, we report results with n = 6, which is also 10% of the number of possible sign instances in the test set. The correct detection rate, or accuracy, is 100% minus the deletion rate. A sign that is not part of a sentence but has a distance less than that of the last correctly detected sign is declared to be wrongly inserted in the sentence; this is an insertion error. Insertion errors can be reduced using context knowledge or by grouping signs that are very similar; here we reduce them using the facial expression information in the sentence. It is harder to recover from deletion errors.
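For clarity, the bookkeeping for one test sentence can be sketched as follows; this helper is ours and simply transcribes the definitions above, with ranked_signs being the candidate list sorted by increasing distance.

def sentence_errors(ranked_signs, true_signs, n=6):
    # true_signs: set of signs actually present in the sentence (ground truth).
    top = ranked_signs[:n]
    hit_positions = [i for i, s in enumerate(top) if s in true_signs]
    deletions = len(true_signs) - len(hit_positions)        # true signs missing from the top n
    last_hit = hit_positions[-1] if hit_positions else -1
    insertions = sum(1 for s in top[:last_hit + 1] if s not in true_signs)
    return deletions, insertions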


Table 1.1. Deletion and insertion error rates with 5-fold validation: [range over folds], average.

Error       Manual               Manual + Facial
Deletion    [9% to 15%], 12%     [4% to 11%], 8%
Insertion   [12% to 19%], 15%    [7% to 15%], 11%

1.6.2. Use of Non-Manual Information

To study the effect of using non-manual information, we start with the top 8 signs, as determined by manual information, and then prune out 2 signs (α = 2) based on facial information, so as to arrive at 6 hypothesized signs per sentence. We compare the resulting insertion and deletion error rates with those obtained by hypothesizing 6 possible signs per sentence based on manual information alone.

The deletion error rate for the top n = 6 matches based on just manual information as captured by the SoPF traces, with 5-fold validation, ranges from 9% to 15% with an average of about 12%. Thus, the average correct detection rate from just manual information is 88%. Table 1.1 shows the improvement to about 92% correct detection when face information is added. We also see a corresponding reduction in the insertion error rates.

The percentage of sentences that are perfectly recognized, i.e. for which all the correct signs are among the topmost ranks (zero deletion and insertion errors), is around 46%. This sentence-level measure is a very strict one: a sentence can be misunderstood even if only one sign is not correctly recognized. We contend that it is important to also report this number, even if it is low. Note that with the use of grammatical constraints, this performance can be further improved.

1.6.3. Head Motion to Find ‘Negation’

‘Negation’ in a sentence is indicated by a ‘head shake’. We use the aspect ratio of the aggregate 2D track over the whole sentence, i.e. the ratio of the width to the height of the entire trajectory, as the feature for recognizing the presence of ‘Negation’. We consider sentences whose motion trajectories have an aspect ratio greater than 1.25, a width greater than 40 pixels, and a height less than 50 pixels to contain ‘Negation’. Using this logic, out of the 30 sentences in the database that contain ‘Negation’, 27 were correctly recognized, while there were 18 false alarms among the remaining 95 sentences. The false alarms were mainly due to sentences like ‘GATE WHERE’, which also involve horizontal motion of the face. The missed detections occur in sentences that contain the sign ‘ME’ (for example, ‘You don’t understand me’), because the sign ‘ME’ involves a natural downward movement of the face; the height of the motion trajectory therefore increases considerably, causing the aspect ratio to decrease. In the future, this performance can be improved by looking for the portions of a sentence where the negation occurs rather than using the entire trajectory over the whole sentence.
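The whole-sentence rule quoted above can be sketched directly; the thresholds are the ones stated in the text, while the helper name and the use of the eye-midpoint sequence from Sec. 1.4.3 as input are ours.

import numpy as np

def has_negation(eye_centers):
    # eye_centers: (T, 2) per-frame (x, y) midpoints of the two detected eyes.
    xy = np.asarray(eye_centers, dtype=float)
    xy = xy - xy[0]                               # trajectory relative to the starting frame
    width = xy[:, 0].max() - xy[:, 0].min()
    height = xy[:, 1].max() - xy[:, 1].min()
    aspect = width / max(height, 1e-6)
    return aspect > 1.25 and width > 40 and height < 50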


1.7. Conclusion

We presented a framework for continuous sign language recognition that combines non-manual information from faces with manual information to decrease both the deletion and insertion errors. Unlike most previous approaches, which are top-down and HMM based, ours is a bottom-up approach that relies on simple low-level processes feeding into intermediate-level processes that hypothesize the signs present in a sentence. The approach also does not bypass the segmentation problem, but relies on simple, yet robust, low-level representations. The manual dynamics were modeled by capturing the statistics of the relationships among the low-level features via the concept of relational distributions embedded in the Space of Probability Functions. Facial dynamics were captured using an expression subspace, computed using PCA.

Even with fairly simple vision processes, embedded in a bottom-up approach, we were able to achieve good performance from purely image-based inputs. Using 5-fold cross-validation on a data set of 125 sentences containing 325 sign instances, we showed that the accuracy of individual sign recognition was about 88% with just manual information. The use of non-manual information increased the accuracy from 88% to 92%. We were also able to correctly detect negations in sentences 90% of the time.

1.8. Acknowledgment

This work was supported in part by the National Science Foundation under grant IIS 0312993. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.

References

1. B. Loeding, S. Sarkar, A. Parashar, and A. Karshmer. Progress in automated computer recognition of sign language. In Lecture Notes in Computer Science, vol. 3118, pp. 1079–1087 (2004).
2. S. C. W. Ong and S. Ranganath, Automatic sign language analysis: A survey and the future beyond lexical meaning, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6), 873–891 (Jun. 2005).
3. Y. Cui and J. Weng, Appearance-based hand sign recognition from intensity image sequences, Computer Vision and Image Understanding 78(2), 157–176 (May 2000).
4. M. Zhao and F. K. H. Quek, RIEVL: Recursive induction learning in hand gesture recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11), 1174–1185 (1998).
5. J. Triesch and C. von der Malsburg. Robust classification of hand postures against complex backgrounds. In International Conference on Automatic Face and Gesture Recognition, pp. 170–175 (1996).
6. M. H. Yang, N. Ahuja, and M. Tabb, Extraction of 2D motion trajectories and its application to hand gesture recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 168–185 (Aug. 2002).
7. T. Starner and A. Pentland. Real-time American Sign Language recognition from video using hidden Markov models. In Symposium on Computer Vision, pp. 265–270 (1995).
8. C. Vogler and D. Metaxas. ASL recognition based on a coupling between HMMs and 3D motion analysis. In International Conference on Computer Vision, pp. 363–369 (1998).
9. C. Vogler and D. Metaxas. Parallel hidden Markov models for American Sign Language recognition. In International Conference on Computer Vision, pp. 116–122 (1999).
10. C. Vogler and D. Metaxas, A framework for recognizing the simultaneous aspects of American Sign Language, Computer Vision and Image Understanding 81, 358–384 (2001).
11. J. Ma, W. Gao, C. Wang, and J. Wu. A continuous Chinese sign language recognition system. In International Conference on Automatic Face and Gesture Recognition, pp. 428–433 (2000).
12. C. Wang, W. Gao, and S. Shan. An approach based on phonemes to large vocabulary Chinese sign language recognition. In International Conference on Automatic Face and Gesture Recognition, pp. 393–398 (2002).
13. B. Bahan and C. Neidle. Non-manual realization of agreement in American Sign Language. Master's thesis, Boston University (1996).
14. R. Wilbur and A. Martinez, Physical correlates of prosodic structure in American Sign Language, Meeting of the Chicago Linguistics Society, April, pp. 25–27 (2002).
15. U. M. Erdem and S. Sclaroff. Automatic detection of relevant head gestures in American Sign Language communication. In International Conference on Pattern Recognition, pp. 460–463 (2002).
16. C. Vogler and S. Goldenstein, Facial movement analysis in ASL, Universal Access in the Information Society 6(4), 363–374 (2008).
17. U. Canzler and T. Dziurzyk. Extraction of non-manual features for video-based sign language recognition. In IAPR Workshop on Machine Vision Applications (MVA 2002), pp. 11–13 (2002).
18. M. La Cascia, S. Sclaroff, and V. Athitsos, Fast, reliable head tracking under varying illumination, IEEE Transactions on Pattern Analysis and Machine Intelligence 21(6) (June 1999).
19. A. Kapoor and R. W. Picard. A real-time head nod and shake detector. In Workshop on Perceptive User Interfaces (Nov. 2001).
20. R. Yang, S. Sarkar, and B. Loeding. Enhanced level building algorithm for the movement epenthesis problem in sign language recognition. In Computer Vision and Pattern Recognition (2007).
21. R. Yang and S. Sarkar, Coupled grouping and matching for sign and gesture recognition, Computer Vision and Image Understanding (2008).
22. J. Alon, V. Athitsos, Q. Yuan, and S. Sclaroff. Simultaneous localization and recognition of dynamic hand gestures. In IEEE Workshop on Motion and Video Computing, vol. 2, pp. 254–260 (2005).
23. L. Ding and A. Martinez, Modelling and recognition of the linguistic components in American Sign Language, Image and Vision Computing (2009).
24. I. Robledo and S. Sarkar, Representation of the evolution of feature relationship statistics: Human gait-based recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (Feb. 2003).
25. W. Freeman and M. Roth. Orientation histograms for hand and gesture recognition. In International Workshop on Face and Gesture Recognition, pp. 296–301 (1995).
26. A. Huet and E. Hancock, Line pattern retrieval using relational histograms, IEEE Transactions on Pattern Analysis and Machine Intelligence 12(13), 1363–1370 (1999).
27. H. A. Rowley, S. Baluja, and T. Kanade, Neural network-based face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 23–38 (1998).
28. A. Colmenarez and T. Huang, Face detection with information-based maximum discrimination, in Computer Vision and Pattern Recognition, pp. 782–787 (1997).
29. K. K. Sung and T. Poggio, Example-based learning for view-based human face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 39–51 (1998).
30. L. G. Farkas and I. R. Munro, Anthropometric Facial Proportions in Medicine (Charles C. Thomas, Springfield, IL, 1987).
31. S. K. Liddell and R. E. Johnson, American Sign Language: The phonological base, Sign Language Studies 64, 195–277 (1989).


Fig. 1.6. Dominant dimensions of the learned facial expressions over the 39 signs.


Fig. 1.7. Head motion trajectories for (a) ‘DONT-KNOW I’, (b) ‘I NOT HAVE KEY’, (c) ‘NO’, (d) ‘YES’, (e) ‘YOU UNDERSTAND ME’, and (f) ‘SUITCASE I PACK FINISH’. (a), (b), and (c) show the motion trajectories for sentences with negation; (d) shows the motion trajectory for the sign ‘YES’; (e) and (f) show motion trajectories for sentences that do not convey negative meaning.


Fig. 1.8. A bottom-up architecture for fusing information from facial expressions and head motion with manual information to prune the set of possible ASL sign hypotheses.


Fig. 1.9. Cross-correlation of signs with a sentence: (a) distance of the sign ‘LIPREAD’ and (b) distance of the sign ‘CAN’, plotted over the frames of the sentence ‘LIPREAD CAN I’. Lower values indicate closer matches.


Fig. 1.10. Cross-correlation of signs with a sentence (contd.): (c) distance of the sign ‘I’ plotted over the frames of the sentence ‘LIPREAD CAN I’; lower values indicate closer matches. (d) Ground-truth positions of the signs LIPREAD, CAN, and I in the sentence; E indicates the epenthesis movements present between consecutive signs.