May 17, 2009 15:12 World Scientific Review Volume - 9.75in x 6.5in ASL-BookChapter
Chapter 1
Fusion of Manual and Non-Manual Information in American Sign
Language Recognition
Sudeep Sarkar1, Barbara Loeding2, and Ayush S. Parashar1
1Computer Science and Engineering, University of South Florida, Tampa, Florida 33647
2Special Education, University of South Florida, Lakeland, Florida
[email protected], [email protected]
We present a bottom-up approach to continuous American Sign Language recognition without wearable aids, in which simple low-level processes operate on images and build realistic representations that are fed into intermediate-level processes to form sign hypotheses. At the intermediate level, we construct representations for both manual and non-manual aspects, such as hand movements, facial expressions, and head nods. The manual aspects are represented using Relational Distributions that capture the statistical distribution of the relationships among the low-level primitives from the body parts. These relational distributions, which can be constructed without the need for part-level tracking, are efficiently represented as points in the Space of Probability Functions (SoPF). Manual dynamics are thus represented as tracks in this space. The dynamics of facial expressions along with a sign are also represented as tracks, but in the expression subspace, constructed using principal component analysis (PCA). Head motions are represented as 2D image tracks. The integration of manual with non-manual information is sequential, with non-manual information refining the hypothesis set based on manual information. We show that with just image-based manual information, the correct detection rate is around 88%. With the use of facial information, accuracy increases to 92%; thus the face contributes valuable information towards ASL recognition. ‘Negation’ in sentences is correctly detected in 90% of the cases using just 2D head motion information.
1.1. Introduction
While speech recognition has made rapid advances, sign language recognition is
lagging behind. With the gradual shift to speech-based I/O devices, there is a great
danger that persons who rely solely on sign languages for communication will be
deprived access to state-of-the-art technology unless there are significant advances
in automated recognition of sign languages.
Reviews of prior work in sign language recognition appear elsewhere.1,2 From these
reviews, we can see that work in sign language recognition initially focused on the
recognition of static gestures, e.g.,3–5 and isolated signs, e.g.6 Starner and Pentland7
were the first to seriously consider continuous sign recognition∗. Using HMM based
representations, they achieved near perfect recognition with sentences of fixed struc-
ture, i.e. containing personal pronoun, verb, noun, adjective, personal pronoun in
that order. Vogler and Metaxas8–10 were instrumental in significantly pushing the
state-of-the-art in automated ASL recognition using HMMs. In terms of the basic
HMM formalism, they have explored many variations, such as context dependent
HMMs, HMMs coupled with partially segmented sign streams, and parallel HMMs.
The wide use of HMM is also seen in foreign sign language recognizers.1 While
HMM-based methods perform very well with limited vocabularies, they are haunted
by scalability issues and the requirement for large amounts of training data.
Much work in continuous sign language recognition has avoided the very basic
problem of segmentation and tracking of hands by using wearable devices, such
as colored gloves, data gloves, or magnetic markers, to directly get the location
features. For example, Vogler and Metaxas8–10 used a 3D magnetic tracking
system, Starner and Pentland7 used colored gloves, while Ma et al.11,12 used
Cybergloves. However, since such devices are unnatural for signers, in our research we
restrict ourselves to plain color images, without the use of any augmenting wearable
devices.
Non-manual information, which refers to information from facial expressions,
head motion, or torso movement, conveys linguistic information in ASL.10,13 Much
work in sign language recognition has concentrated on using only hand motion,
i.e. manual information, although some work on automated understanding of
sign language facial expressions is under way.14–17 Non-manual information can
provide vital cues. For example, head motion can be used to detect whether the
ASL sentence has any ‘Negation’. For instance, the sentence ‘I don’t understand’ is
manually signed exactly the same as ‘I understand’, except that there is a distinct ‘head
shake’ indicating ‘Negation’ in the sentence ‘I don’t understand’ †. There has been
some work on detecting ‘head shakes’ and ‘nods’,15,18,19 but there are no results in
continuous sign language recognition. In this paper, we use non-manual information
to decrease insertion and deletion errors, and to find whether there is ‘Negation’ in
the sentence using the motion trajectories of the head.
We concentrate on the problem of recognition of continuous sign language, i.e.
signs in sentences and not isolated signs or finger-spelled signs. We adopt a bottom-
up approach with simple low-level processes operating on images to build realistic
representations that are fed into intermediate level processes integrating manual
and non-manual information. Drawing on the emerging wisdom in computer vision
∗Note that Sign Language is different from Signed English; the latter is an artificial construct that employs signs but uses English language grammatical structure.
†We use the following ASL conventions in the paper. Text in italics indicates a sentence in English, for example ‘I can lipread’. Text in capitalized letters indicates an ASL gloss, for example ‘LIPREAD CAN I’; that is, the ASL gloss for the sign ‘lipread’ is ‘LIPREAD’. Negation in a sentence signed using non-manual markers is indicated by N̂OT or ‘Negation’. A multiword gloss for a single sign in ASL is indicated by a hyphen; for example, ‘DONT-KNOW’ is a multiword gloss for a single sign in ASL.
that simple methods are usually the most robust ones when tested on
large data sets, we use simple components. The low-level processes are fairly simple,
involving detection of skin and motion pixels, and face detection by correlation with
an eye template. This works for signs against simple backgrounds, which is
the most commonly considered scenario. For more complex backgrounds,
alternative strategies, such as those described elsewhere,20–22 could be considered. The inter-
mediate level consists of modeling the hand motion using relational distributions,
which are efficiently represented as points in the Space of Probability Functions
(SoPF). This captures the placement of the hands with respect to the body, but
does not capture hand shape accurately. Many signs can be recognized based on just
this global information. For signs that are strongly hand-shape dependent, alternative
methods, such as those described elsewhere,23 can be used. The expression subspace, derived us-
ing PCA, is used to represent the dynamics of facial expression. This level also deals
with integrating non-manual with manual information to reduce the deletion and
insertion errors. The third or topmost level, which we do not explore in this paper,
would consist of using context and grammatical information from ASL (American
Sign Language) phonology, to constrain and to prune the hypotheses set generated
by the intermediate level processes.
The primary contribution of this work is the demonstration that, even with
fairly simple 2D representations, the use of non-manual information can improve
ASL recognition. The integration of facial expression with manual information is
confounded by the fact that expression events may not coincide exactly with the
manual events. This work also constrains itself in relying on pure image-based
inputs and does not require external wearable aids, such as gloves, to enhance the
low-level primitives.
1.2. Data Set
Fig. 1.1. Sample images in the dataset. (a) shows an image of the face taken for the sign ‘PHONE’, as captured by camera ‘A’, and (b) shows a synchronous image of the upper body as captured by camera ‘B’.
One of the issues of ASL recognition research is the data set used in the study.
We constrained the domain to sentences that would be used while communicating
with deaf people at airports. An ASL interpreter signed for the collection and helped
with creating the ground truth. Two digital video cameras were used for collecting
data; one captured images of the upper body, and the other synchronously captured
face images. Fig. 1.1 shows one sample view.
Some statistics of the data set are as follows. The dataset includes 5 instances of
25 distinct sentences. In total there are 125 sentence samples spanning 325 instances
of ASL signs. There are 39 distinct ASL signs. Each sentence has 1 to 5 signs, with
an average of 2.7 signs per sentence. The number of frames in a sentence
varies from 51 to 163. The longest sentence, ‘AIRPLANE POSTPONE AGAIN,
MAD I’, comprises 163 frames; the shortest sentence, ‘YES’, comprises 51
frames. The average number of frames in a sentence is about 90. Sign length varies
from 4 frames for the sign ‘CANNOT’ to 71 frames for the sign ‘LUGGAGE-HEAVY’.
On average, a sign has 18 frames, spanning about 0.6 second.
There are significant variations among the 5 instances of some of the sentences.
For example, the sentence ‘If the plane is delayed, I’ll be mad’ was signed as ‘AIR-
PLANE POSTPONE AGAIN, MAD I’ as well as ‘AIRPLANE AGAIN POST-
PONE, MAD I’. Also, in one of the instances of the sentence ‘I packed my suitcase’,
the ASL sign ‘I’ was not present. This was also true for some other sentences. The
reason, as given by the signer, was that signs like ‘I’ are implicit while conversing in
ASL and hence can be excluded. For some sentences, ‘Negation’ is conveyed only
through ‘head shakes’. For example, for the sentences ‘I understand’ and ‘I don’t
understand’, the ASL gloss is the same (‘I UNDERSTAND’). The only difference
is that in the sentence ‘I don’t understand’ there is a ‘head shake’, i.e. a
non-manual expression, conveying the presence of ‘Negation’ in the sentence.
1.3. Low Level Processing
In much previous work on continuous ASL, detection and tracking of hands have
been simplified using colored gloves7 or magnetic markers.8 Even foreign sign lan-
guage recognizers have used colored gloves or data gloves. There have been recent
efforts to extract information and to track directly from color images, without the
use of special devices,6,20–22 but with added computational complexity. Our in-
termediate level representation, as we shall see later, does not require the tracking
of hands. We just need to segment the hands and the face in each frame. Since
segmentation is not the focus of this work, we have used fairly simple ideas based
on skin color to detect the hands and face. We cluster the skin-pixels by using
the Expectation-Maximization (EM) algorithm based on a Gaussian model for the
clusters in the “Lab” color space. We use the 2-class version of the EM algorithm
twice: first to separate the background and second time to separate the clothing of
the signer from the skin pixels. Fig. 1.2(b) shows the segmentation after EM clustering
of (a), where the skin and clothing pixels are separated from the background.
Fig. 1.2 (c) shows the output after the second application of the EM to separate
skin color from the clothing. Blobs of size greater than 200 pixels are kept. This
helps to remove pixels that are close to skin color but do not form blobs big
enough to be part of a hand or the face. Fig. 1.2(d) shows an example of the final blobs.
These blobs, along with the color values and the edge pixels detected within them,
comprise the low-level primitives.
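The two-stage 2-class EM segmentation and the blob-size filter can be sketched as follows. This is a minimal illustration, not the chapter's implementation: a small diagonal-covariance EM in place of a full Gaussian-mixture fit over Lab pixel triples, with a deterministic min/max initialization chosen here for reproducibility (an assumption).

```python
import numpy as np
from scipy import ndimage

def em_two_class(X, iters=50):
    """2-class EM with diagonal-covariance Gaussians over the rows of X.
    Returns a 0/1 label per row. Deterministic min/max init (assumption)."""
    n, d = X.shape
    mu = np.stack([X.min(axis=0), X.max(axis=0)]).astype(float)
    var = np.tile(X.var(axis=0) + 1e-6, (2, 1))
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: per-class log-likelihood up to an additive constant
        logp = np.stack(
            [np.log(pi[k]) - 0.5 * (((X - mu[k]) ** 2 / var[k])
                                    + np.log(var[k])).sum(axis=1)
             for k in range(2)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update priors, means, and variances
        nk = r.sum(axis=0) + 1e-9
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X ** 2) / nk[:, None] - mu ** 2 + 1e-6
    return r.argmax(axis=1)

def keep_large_blobs(mask, min_size=200):
    """Keep only connected components of `mask` with >= min_size pixels,
    mirroring the 200-pixel blob filter in the text."""
    lbl, n = ndimage.label(mask)
    sizes = np.asarray(ndimage.sum(mask, lbl, index=range(1, n + 1)))
    good = 1 + np.flatnonzero(sizes >= min_size)
    return np.isin(lbl, good)
```

In the chapter's pipeline, `em_two_class` would run twice on the (L, a, b) pixel rows: once to separate the background, then again on the foreground pixels to split clothing from skin, with `keep_large_blobs` pruning the small regions.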
Fig. 1.2. Segmentation of skin pixels using EM: (a) shows the original color frame, (b) shows the pixels obtained after the first application of EM, (c) shows the skin pixels obtained after the second application of EM, and (d) the final blobs corresponding to the hands and face.
1.4. Intermediate Level Representations
We have separate representations for manual movement (or hold), facial expression,
and head motion. We would like these representations, in particular the manual
motion, to be somewhat robust with respect to low-level errors. The manual motion
(including no movement) representation does not require tracking of hands
or fingers and emphasizes the 2D spatial relationships of the hands and face. The facial
expression representation is a 2D view-based one, and the head motion model
takes into account only projected 2D motion. Even without the use of 3D information,
we demonstrate robust ASL recognition possibilities.
1.4.1. Manual Movement
Grounded on the observation that the organization, structure, and relationships
among low-level primitives are more important than the primitives themselves, we
focus on the statistical distribution of the relational attributes observed in the image,
which we refer to as relational distributions. Such a statistical representation
also alleviates the need for primitive-level correspondence, tracking, or registration
across frames. Such representations have been successfully used for modeling
periodic motion in the context of identification of a person from gait.24 Here, we use
them to model aperiodic motion in ASL signs. Primitive-level statistical distributions,
such as orientation histograms, have been used for gesture recognition.25 However,
the only use of relational histograms that we are aware of is by Huet and Hancock,26
who used them to model line distributions in the context of image database
indexing. The novelty of relational distributions lies in offering a strategy for
incorporating dynamic aspects.
We refer the reader to24 for details of the representation. Here we just sketch the
essentials. Let F = {f1, · · · , fN} represent the set of N primitives in an image. For
us these are the Canny edge pixels inside the low-level skin-blobs described earlier.
Let Fk represent a random k-tuple of primitives, and let the relationship among these
k-tuple primitives be denoted by Rk. Let the relationships Rk be characterized by a
set of M attributes Ak = {Ak1, · · · , AkM}. For ASL, we use the distances between two
edge pixels in the vertical and horizontal directions (dx, dy) as the attributes. We
normalize the distances between the pixels by a distance D, which is inversely related
to the distance from the camera. The shape of the pattern can be represented by
joint probability functions: P (Ak = ak), also denoted by P (ak1, · · · , akM ) or P (ak),
where aki is the (discretized, in practice) value taken by the relational attribute Aki.
We term these probabilities the Relational Distributions.
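As a concrete sketch, a 2-ary relational distribution with attributes (dx, dy) can be built as a normalized 2D histogram over pixel pairs. The 30×30 bin count follows the SoPF training description later in this section; the fixed ±D clipping range is an assumption of this sketch.

```python
import numpy as np

def relational_distribution(points, D, bins=30):
    """2-ary relational distribution: the joint histogram P(dx, dy) of the
    horizontal and vertical offsets between pairs of edge pixels, with the
    offsets normalized by the body-scale distance D. `points` is an (N, 2)
    array of (x, y) coordinates; offsets beyond +/-D are ignored (assumption)."""
    pts = np.asarray(points, dtype=float)
    dx = (pts[:, 0][:, None] - pts[:, 0][None, :]) / D
    dy = (pts[:, 1][:, None] - pts[:, 1][None, :]) / D
    off = ~np.eye(len(pts), dtype=bool)          # exclude self-pairs
    H, _, _ = np.histogram2d(dx[off], dy[off], bins=bins,
                             range=[[-1, 1], [-1, 1]])
    return H / H.sum()                           # discrete P(A2 = a2)
```

Because all ordered pairs are used, the histogram is symmetric about the origin; as the hands move relative to the face, mass in this histogram shifts from frame to frame.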
One interpretation of these distributions is: given an image, if you randomly
pick a k-tuple of primitives, what is the probability that it will exhibit the relational
attributes ak, i.e. what is P (Ak = ak)? Given that these relational distributions
exhibit complicated shapes that do not readily afford modeling using a combination
of simply shaped distributions, we adopt a non-parametric histogram-based repre-
sentation. However, to reduce the size associated with a histogram-based
representation, we use the Space of Probability Functions (SoPF).
As the hands of the signer move, the relational distributions will change. Motion
of hands will introduce non-stationarity in the relational distributions. Figure 1.3
shows some more examples of the 2-ary relational distributions for the sign ‘CAN’.
Notice the change in the distributions as the hands come down. The change in
the vertical direction in relational distributions can be seen clearly as the hands are
coming down, while there is comparatively less change in the relational distributions
in the other direction.
Fig. 1.3. Variations in relational distributions with motion. The left column shows the image frames in the sign ‘CAN’. The middle column shows the edge pixels in the skin-blobs. The right column shows the relational distributions.
Let P (ak, t) represent the relational distribution at time t. Let

√P (ak, t) = ∑_{i=1}^{n} ci(t) Φi(ak) + µ(ak) + η(ak)        (1.1)
describe the square root of each relational distribution as a linear combination of
orthogonal basis functions, where Φi(ak)’s are orthonormal functions, the function
µ(ak) is a mean function defined over the attribute space, and η(ak) is a function
capturing small random noise variations with zero mean and small variance. We
refer to this space as the Space of Probability Functions (SoPF).
We use the square root function so that we arrive at a space where the distances
are not arbitrary ones but are related to the Bhattacharya distance between the
relational distributions, which is an appropriate distance measure for probability
distributions. More details about the derivation of this property can be found in.24
Given a set of relational distributions, {P (ak, ti)|i = 1, · · · , T}, the Space of
Probability Functions (SoPF) can be arrived at by principal component analysis
(PCA). In practice, we can consider the subspace spanned by a few (N << n)
dominant vectors associated with the large eigenvalues. Thus, a relational distribu-
tion can be represented using these N coordinates (the ci(t)s), which is a more compact
representation than a normalized histogram-based representation. The ASL sen-
tences form traces in this Space of Probability Functions (SoPF).
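The construction of the SoPF and the projection of a relational distribution onto it can be sketched with PCA on the square-rooted histograms. This is a minimal SVD-based illustration; the chapter's exact training procedure may differ.

```python
import numpy as np

def build_sopf(train_dists, N=15):
    """Learn the SoPF: PCA on the square roots of flattened training
    relational distributions. Returns the mean function mu and the N
    dominant orthonormal basis functions Phi_i (as rows)."""
    X = np.sqrt(np.asarray(train_dists, dtype=float))
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:N]

def sopf_coords(dist, mu, basis):
    """Coordinates c_i of one relational distribution in the SoPF; a
    sentence becomes a trace of such coordinate vectors over time."""
    return basis @ (np.sqrt(np.ravel(dist)) - mu)
```

The square root means Euclidean distances between coordinate vectors relate to the Bhattacharya distance between the underlying distributions, which is why plain Euclidean trace matching is meaningful in this space.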
The eigenvectors of the SoPF associated with the largest eigenvalues are shown in
Figure 1.4. The space was trained with 39 distinct signs. The size of each relational
distribution is 30×30. The vertical axes of the images plot the distance attribute dy,
and the distance attribute dx is along the horizontal axes. Brightness is proportional
to the component magnitude. The first eigenvector shows 3 modes in it. The
bright spot in the second eigenvector emphasizes the differences in the attribute dx
between the two features. The third eigenvector is radially symmetric, emphasizing
the differences in both the attributes. Most of the energy of the variation is captured
by the 15 largest eigenvalues.
Fig. 1.4. Dominant dimensions of the learned SoPF, modeling the manual motion.
1.4.2. Non-manual: Facial Expression
The first step is the localization of the face in each frame. There are various sophis-
ticated approaches to detecting faces.27–29 Here we adopt a very simple approach
that relies on eye localization using eye template matching. The eye template, which
is a rectangular region enclosing the two eyes, is the average image from 4 persons,
different from the ASL signer, but with imaging geometry similar to that used in the ASL
signs. The correlation is calculated over the whole image only for the first image
frame of the sentence. For subsequent images, the center of the rectangular box
bounding the eyes is found by correlating around the neighborhood of the center found
in the previous image. A window of 10 pixels in width and height is considered for the
neighborhood search. After the detection of eyes, we demarcate the face with an
elliptical structure. We use the golden ratio for face,30 to mask the face with two
elliptical structures, one for the top part and the other for the bottom. Fig. 1.5
shows example outputs of the eye detection and facial demarcation steps.
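The two-phase eye search (full-image search on the first frame, then a small neighborhood search around the previous center) can be sketched as follows. For simplicity this sketch minimizes a sum-of-squared-differences score rather than maximizing correlation, and the golden-ratio elliptical mask is not reproduced; both are deviations from the chapter's method.

```python
import numpy as np

def locate_eyes(frame, template, prev=None, radius=10):
    """Find the top-left corner of the best-matching eye-template window.
    On the first frame (prev is None) the whole image is scanned; on later
    frames only a +/-radius neighborhood of the previous location."""
    th, tw = template.shape
    H, W = frame.shape
    if prev is None:
        rows, cols = range(H - th + 1), range(W - tw + 1)
    else:
        r0, c0 = prev
        rows = range(max(0, r0 - radius), min(H - th, r0 + radius) + 1)
        cols = range(max(0, c0 - radius), min(W - tw, c0 + radius) + 1)
    best, where = np.inf, None
    for r in rows:
        for c in cols:
            diff = frame[r:r + th, c:c + tw] - template
            score = float((diff * diff).sum())
            if score < best:
                best, where = score, (r, c)
    return where
```

The restricted neighborhood scan keeps per-frame cost low and also provides the eye-center sequence reused later as the head-motion track.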
Fig. 1.5. (a) Output of eye detection. (b) Extracted elliptical facial region.
We adopt a view-based representation of expression, modeled as traces in ex-
pression sub-space, which is computed using Principal Component Analysis (PCA)
of 4 expression examples for each of the 39 signs. We have found that the 20 largest
eigenvalues capture most of the energy in the expression variations, at least in this
dataset. It is interesting to note that various aspects of facial expression that are
important to ASL, such as motion of eye brows, cheek puffing, lip movement, and
nose wrinkles,13 are captured by the dominant eigenvectors, some of which are
shown in Figure 1.6. It is observed that lip movements are emphasized by most of
the eigenvectors because the interpreter was also mouthing the English equivalent
of the signs. This might not be true for native signers.
1.4.3. Non-manual: Head Motion
Head motion is represented by the sequence of the average of the two eye locations,
which are detected during face localization. The 2D trajectories are defined with
respect to the location in the starting frame. Fig. 1.7 shows some example
head trajectories. Figs. 1.7(a), (b), and (c) clearly show the presence of ‘Negation’
in the sentences; Fig. 1.7(d) shows the vertical motion of the face, indicating a ‘head
nod’; while Figs. 1.7(e) and (f) show the motion trajectories for sentences that
convey no positive or negative meaning.
1.5. Combination of Manual and Non-Manual
We have used the facial expression information to reduce the deletion and insertion
errors while head motion information is used to find whether the sentence contains
‘Negation’ or not. The combination of the non-manual with manual information
is not trivial because (i) the non-manuals are not time-synchronized with the manuals
(this is not a video synchronization issue; the facial event might lag or lead the
manual event), and (ii) the presence of a strong non-manual indicating
‘Assertion’ or ‘Negation’ in the sentence makes it hard to extract facial information
for some frames.
Fig. 1.8 shows the information flow architecture. We process manual informa-
tion, facial expressions, and head motion as independent channels, which are then
combined as follows:
(1) Find the n signs with the least distances in the sentence using manual information.
(2) Find the distances for the same n signs found in Step 1, using non-manual information.
(3) Sort these signs in ascending order of the distances obtained from non-manual information.
(4) Discard the α signs having the maximum distances from the sorted list obtained in Step 3.
(5) Keep the remaining n − α signs from Step 1.
Finally, head motion is used to detect whether the sentence has ‘Negation’ in
it. The selection of n and α is a function of the number of signs in a sentence and
the computational costs of high level processes.
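The pruning steps above reduce to a few lines. In this sketch the two distance tables are ordinary dictionaries mapping sign labels to their best match distances against the sentence, a hypothetical structure chosen for clarity.

```python
def prune_hypotheses(manual_dist, nonmanual_dist, n=8, alpha=2):
    """Steps 1-5: keep the n best signs by manual distance, rerank them by
    non-manual distance, and discard the alpha worst of the reranked list."""
    top_n = sorted(manual_dist, key=manual_dist.get)[:n]          # step 1
    reranked = sorted(top_n, key=lambda s: nonmanual_dist[s])     # steps 2-3
    return reranked[:n - alpha]                                   # steps 4-5
```

For example, with n = 4 and α = 1, a sign that ranks well manually but poorly on facial distance is the one discarded, which is exactly how insertion errors are suppressed.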
Distances between SoPF traces quantify motion involved in the manual infor-
mation, while the distances in the face space quantify changes in facial expression.
In this work, we adopt a simple distance measure between two traces to find a
sign in the sentence using manual and non-manual information. For the manual
information, we cross-correlate the SoPF trace of the training sign with the trace
of the given sentence and pick the value and shift that result in the minimum
(Euclidean) distance. For non-manual information, we correlate the trace of the
trained sign in the time neighborhood where the smallest distance for manual information has
been found.
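The trace matching can be sketched as a sliding-window search: the training sign's SoPF trace is slid along the sentence trace, and the shift with the minimum mean Euclidean frame distance is reported. This is a simplified stand-in for the cross-correlation described above; no time warping is attempted.

```python
import numpy as np

def min_trace_distance(sign_trace, sentence_trace):
    """Slide `sign_trace` (L x N coordinates) along `sentence_trace`
    (S x N, S >= L) and return (best distance, best shift)."""
    L = len(sign_trace)
    best, shift = np.inf, 0
    for t in range(len(sentence_trace) - L + 1):
        window = sentence_trace[t:t + L]
        d = np.linalg.norm(window - sign_trace, axis=1).mean()
        if d < best:
            best, shift = float(d), t
    return best, shift
```

The returned shift localizes the sign within the sentence, which is what anchors the subsequent neighborhood search over the facial-expression trace.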
Let us look at some examples of correlation of signs with sentences. Fig. 1.9(a)
plots the correlation of the manual information of ‘LIPREAD’ with the sentence
‘LIPREAD CAN I’. Lower values of distance indicate matches, which in this case
occur around frame 12. Similarly, Fig. 1.9(b) and Fig. 1.10(c) show the correlations
of the manual information of the signs ‘CAN’ and ‘I’ with the same sentence. The
actual positions of the signs can be seen in Fig. 1.10(d). The epenthesis movements,31
indicated by E, can also be clearly seen between the signs.
1.6. Experiments
In this section, we show results to demonstrate the efficacy of the proposed approach
using the data described earlier. Given that we have 5 instances of 25 distinct
sentences, we use 5-fold cross validation to evaluate the effectiveness of (a) manual
motion modeling, (b) the integration of facial expression with manual information,
and (c) the detection of negation in sentences. Four instances of each sentence are
used for training and one is used for testing. Note that some signs occur multiple
times in the training data in different arrangements with other signs. There are 65
sign instances, making up the 25 test sentences, to be recognized. Before we present
results, a few words about the performance measures are in order.
1.6.1. Quantifying Performance
Since we are considering the output of an intermediate level process that would
typically be further refined using grammar constraints, the performance measures
should reflect the tentative nature of the output. We sort the signs on the basis
of the minimum distance of each sign to the sentence. Then we choose the n signs with the
least distances. If a sign is part of a sentence, but not present in these n signs, then
a deletion error has occurred. The number of deletion errors depends on n; as n
increases, errors go down but the cost of high level processing increases since it has
to consider more possibilities. Since the maximum number of words in a sentence
is 5, we report results with n = 6, which is also 10% of the number of possible sign
instances in the test set. The correct detection rate, or accuracy, is 100 minus the
deletion rate. A sign that is not part of the sentence but has a distance less than
that of the last correctly detected sign is declared to be wrongly inserted in the
sentence; this constitutes an insertion error. Insertion errors can be reduced using
context knowledge or by grouping signs that are very similar. Here we have reduced
the insertion errors using facial expression information in the sentence. It is harder
to recover from deletion errors.
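With these definitions, per-sentence deletion and insertion counts can be computed directly. Here `ranked` is a list of sign labels sorted by ascending distance to the sentence; the names are hypothetical.

```python
def count_errors(ranked, true_signs, n=6):
    """Deletions: true signs absent from the top-n list. Insertions: wrong
    signs ranked above the last correctly detected true sign."""
    top = ranked[:n]
    deletions = sum(1 for s in true_signs if s not in top)
    hits = [i for i, s in enumerate(top) if s in true_signs]
    last = max(hits) if hits else 0
    insertions = sum(1 for s in top[:last] if s not in true_signs)
    return deletions, insertions
```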
Table 1.1. Deletion and insertion error rates: [range over the 5 folds], average.

Error       Manual              Manual + Facial
Deletion    [9% to 15%], 12%    [4% to 11%], 8%
Insertion   [12% to 19%], 15%   [7% to 15%], 11%
1.6.2. Use of Non-Manual Information
To study the effect of the use of non-manual information, we start with the top 8 signs,
as determined by manual information, and then prune out 2 signs (α = 2) based on
facial information, so as to finally arrive at 6 hypothesized signs per sentence. We
compare the final insertion and deletion error rates with hypothesizing 6 possible
signs per sentence based on just manual information.
The deletion error rate for the top n = 6 matches based on just manual infor-
mation as captured by the SoPF traces, with 5-fold validation, ranges from 9% to
15% with an average of about 12%. Thus, the average correct detection rate from
just manual information is 88%. Table 1.1 shows the improvement in rates to about
92% correct detection when face information is added. We also see a corresponding
reduction in the insertion error rates.
The percentage of sentences that are perfectly recognized, i.e. with all the correct
signs among the topmost ranks (zero deletion and insertion errors), is around
46%. This sentence-level measure is very strict: a sentence can be
misunderstood even if only one sign is not correctly recognized. We contend
that it is important to also report this number, even if it is low. Note that with the
use of grammatical constraints, this performance can be further improved.
1.6.3. Head Motion to Find ‘Negation’
‘Negation’ in a sentence is indicated by a ‘head shake’. We use the aspect ratio
of the aggregate 2D track of the whole sentence, i.e. the ratio of the width to the height
of the entire trajectory, as the feature to recognize the presence of ‘Negation’ in
the sentence. We consider sentences whose motion trajectories have an aspect ratio
greater than 1.25, a width greater than 40 pixels, and a height less than 50 pixels,
to have ‘Negation’ in them. Using this rule, out of the 30 sentences in the database
that have ‘Negation’ in them, 27 were correctly recognized, while there were 18 false
alarms out of the remaining 95 sentences. The false alarms were mainly caused by
sentences like ‘GATE WHERE’, which also involve motion of the face in the horizontal
direction. The missed detections are in sentences that specifically have the sign
‘ME’ in them (example: ‘You don’t understand me’ ). This is because in the sign
‘ME’ there is a natural downward movement of the face; hence the height of the motion
trajectory increases considerably, causing the aspect ratio to decrease. In the future,
this performance can be improved by looking for portions of the sentences where
the negation occurs, rather than using the entire trajectory over the whole sentence.
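The aspect-ratio rule amounts to three thresholds on the bounding box of the eye-center track; a direct sketch with the thresholds from the text (pixel units):

```python
import numpy as np

def has_negation(track, min_aspect=1.25, min_width=40, max_height=50):
    """Classify a sentence-level head track, a (T, 2) array of (x, y)
    eye-center positions, as containing 'Negation' when its trajectory is
    wide, short, and horizontally elongated."""
    t = np.asarray(track, dtype=float)
    width = float(np.ptp(t[:, 0]))    # horizontal extent of the track
    height = float(np.ptp(t[:, 1]))   # vertical extent of the track
    aspect = width / max(height, 1e-9)
    return aspect > min_aspect and width > min_width and height < max_height
```

A horizontal ‘head shake’ yields a wide, flat bounding box and passes all three tests, while a vertical ‘head nod’ fails the aspect and height tests.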
1.7. Conclusion
We presented a framework for continuous sign language recognition that combined
non-manual information from faces with manual information to decrease both the
deletion and insertion errors. Unlike most previous approaches that are top-down
HMM based, ours is a bottom-up approach that relies on simple low-level processes
feeding into intermediate level processes that hypothesize signs that are present
in a sentence. The approach also does not bypass the segmentation problem, but
relies on simple, yet robust, low-level representations. The manual dynamics were
modeled by capturing the statistics of the relationships among the low-level fea-
tures via the concept of relational distributions embedded in the Space of Probability
Functions. Facial dynamics were captured using an expression subspace, computed
using PCA.
Even with fairly simple vision processes, embedded in a bottom-up approach, we
were able to achieve good performance from purely image-based inputs. Using 5-fold
cross-validation on a data set of 125 sentences over 325 sign instances, we showed
that the accuracy of individual sign recognition was about 88% with just manual
information. The use of non-manual information increased the accuracy from 88%
to 92%. We were also able to correctly detect negations in sentences 90% of the
time.
1.8. Acknowledgment
This work was supported in part by the National Science Foundation under grant
IIS 0312993. Any opinions, findings and conclusions or recommendations expressed
in this material are those of the author(s) and do not necessarily reflect those of the
National Science Foundation.
Fig. 1.6. Dominant dimensions of the learned facial expressions over the 39 signs.
Fig. 1.7. Head motion trajectories: (a) ‘DONT-KNOW I’, (b) ‘I NOT HAVE KEY’, (c) ‘NO’, (d) ‘YES’, (e) ‘YOU UNDERSTAND ME’, (f) ‘SUITCASE I PACK FINISH’. (a), (b), and (c) show the motion trajectories for sentences with negation. (d) shows the motion trajectory for the sign ‘YES’. (e) and (f) show motion trajectories for sentences that do not convey negative meaning.
Fig. 1.8. A bottom-up architecture for fusing information from facial expressions and head motion with manual information to prune the set of possible ASL sign hypotheses.
[Figure 1.9 plots: (a) distribution of the sign ‘LIPREAD’ and (b) distribution of the sign ‘CAN’ over the frames of the sentence ‘LIPREAD CAN I’; x-axis: frame number, y-axis: distance.]
Fig. 1.9. Cross-correlation of signs with sentences. (a) and (b) show the correlation of the signs ‘LIPREAD’ and ‘CAN’ with the sentence ‘LIPREAD CAN I’. Lower values indicate closer matches.
[Figure 1.10 plots: (c) distribution of the sign ‘I’ over the frames of the sentence ‘LIPREAD CAN I’ (x-axis: frame number, y-axis: distance); (d) ground-truth positions of the signs ‘LIPREAD’, ‘CAN’, and ‘I’ in the sentence, with epenthesis movements (E) between them.]
Fig. 1.10. Cross-correlation of signs with sentences (contd.). (c) shows the correlation of the sign ‘I’ with the sentence ‘LIPREAD CAN I’; lower values indicate closer matches. (d) shows the ground-truth position of the signs in the sentence; E indicates the epenthesis movements present between two signs.