Face Detection in Low-resolution Color Images

Jun Zheng, Geovany A. Ramirez, and Olac Fuentes
Computer Science Department, University of Texas at El Paso
El Paso, Texas, 79968, U.S.A.

[email protected], [email protected], [email protected]


Abstract. In low-resolution images, people at large distances appear very small, and we may be interested in detecting the subject's face for recognition or analysis. However, recent face detection methods usually detect face images of about 20 × 20 pixels or 24 × 24 pixels; face detection in low-resolution images has not been explicitly studied. In this work, we studied the relationship between resolution and the automatic face detection rate with the Modified Census Transform, and propose a new 12-bit Modified Census Transform that works better than the original in low-resolution color images for object detection. The experiments show that our method attains better results than other methods when detecting faces in low-resolution color images.

1 Introduction

Face detection is an important first step for applications in computer vision, including human-computer interaction, tracking, object recognition and scene reconstruction. Face detection is a difficult task, due to factors such as varying size, orientation, pose, facial expression, occlusion and lighting conditions [1]. In recent years, numerous methods for detecting faces efficiently under these various conditions have been proposed [1]. Those methods usually detect face images larger than 20 × 20 pixels or 24 × 24 pixels. Face detection in low-resolution images has not been explicitly studied [2].

However, in surveillance systems, the regions of interest are often impoverished or blurred due to the large distance between the camera and the objects, or the low spatial resolution of the devices. Figure 1 illustrates an image collected from a surveillance video. In this low-resolution image, people appear very small, and we may be interested in detecting a subject's face for recognition or analysis. However, the resolution of each face is only about 8 × 8 pixels. Conventional face detection approaches barely work in such low-resolution images [2].

Fig. 1. Faces in surveillance images

Torralba et al. [3] first studied psychologically how humans detect faces in low-resolution images. They investigated how face detection changes as a function of available image resolution, whether the inclusion of local context (a local area surrounding the face) improves face detection performance, and how contrast negation and changes in image orientation affect face detection. Their results suggest that the internal facial features become rather indistinct and lose their effectiveness as good predictors of whether a pattern is a face, so that for humans, using upper-body images is better than using only face images when recognizing faces in low-resolution images.

Kruppa and Schiele [4] built on Torralba's experiments and applied a local context detector, trained with instances that contain a person's entire head, neck and part of the upper body, for automatic face detection in low-resolution images. They applied wavelet decomposition to capture most of the upper body's contours, as well as the collar of the shirt and the boundary between forehead and hair, while facial parts such as the eyes and mouth are hardly discernible in the wavelet transform of low-resolution images. In their experiments on two large data sets, they found that using local context could significantly improve the detection rate, particularly in low-resolution images.

Hayashi and Hasegawa [2] proposed a new face detector, alongside a conventional AdaBoost-based classifier, for low-resolution images, which improved the face detection rate from 39% to 71% for 6 × 6 pixel faces of the MIT+CMU frontal face test set. The new detector combines four techniques to detect faces in low-resolution images: using upper-body images, expansion of the input image, frequency-band limitation, and the combination of two detectors.

In this work, we studied the relationship between resolution and the automatic face detection rate with the Modified Census Transform, and propose a new modified census transform feature that works better than the original in low-resolution color images for object detection. We present experimental results showing the application of our method to the Georgia Tech color frontal face database. The experiments show that our method attains better results than other methods when detecting faces in low-resolution color images.

2 Related work

One of the milestones in face detection was the work of Rowley et al., who developed a frontal face detection system that scanned every possible region and scale of an image using a window of 20 × 20 pixels [5]. Each window is pre-processed to correct for varying lighting; then a retinally connected neural network processes the pixel intensity levels of the window to determine whether it contains a face. In later work, they provided invariance to rotation in the image plane by means of another neural network that determined the rotation angle of a region; the region was then rotated by the negative of that angle and given to the original neural network for classification [6].

Convolutional neural networks, which are highly modular multilayer feedforward neural networks invariant to certain transformations, were originally proposed in [7] with the goal of performing handwritten character recognition. Years later, a generic object recognition system based on them was proposed [8], and they have also been shown to provide good results in face recognition [9].

Schneiderman and Kanade detected faces and cars from different viewpoints using specialized detectors [10]. For faces, they used 3 specialized detectors for frontal, left profile, and right profile views. For cars, they used 8 specialized detectors. Each specialized detector is based on histograms that represent the wavelet coefficients and the position of the possible object, after which a statistical decision rule is used to eliminate false negatives.

Jesorsky et al. based their face detection system on edge images [11]. They used coarse-to-fine detection based on the Hausdorff distance between a hand-drawn model and the edge image of a possible face. In [12], the face model used by Jesorsky et al. was optimized using genetic algorithms, slightly increasing the correct detection rate.

Viola et al. [13] used Haar features in their face detection system. They first introduced the integral image to compute Haar features rapidly. They also proposed an efficient modified version of the AdaBoost algorithm that selects a small number of critical visual features in face images of 24 × 24 pixels, and introduced a cascade of classifiers that allows background regions of the image to be quickly discarded while spending more computation on promising regions. Based on their work, we use a variation of the cascade of strong classifiers, with the modified census transform rather than Haar features.
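The integral-image idea can be illustrated in a few lines of NumPy. This is a generic sketch of the technique, not Viola's implementation; the names `integral_image` and `rect_sum` are our own:

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, padded with a zero border:
    ii[r, c] holds the sum of all pixels in img[:r, :c], so any
    rectangle sum needs only four table lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of the h-by-w rectangle whose top-left corner is (r, c)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]
```

A two-rectangle Haar feature is then just the difference of two `rect_sum` calls, each in constant time regardless of rectangle size.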

Sung [14] first proposed a simple lighting model followed by histogram equalization. Using a database of face and non-face window patterns, they constructed a distribution-based model of face patterns in a masked 19 × 19 normalized image vector space. For each new window pattern to be classified, they compute a vector of distances from the new window pattern to the window pattern prototypes in the masked 19 × 19 pixel image feature space. Based on this vector of distance measurements, they train a multi-layer perceptron (MLP) to classify the new window as face or non-face. Schneiderman [15] chose a functional form of the posterior probability function that captures the joint statistics of local appearance and position on the object, as well as the statistics of local appearance in the visual world at large. Viola [13] applied a simpler normalization to zero mean and unit variance on the analysis window.

Wu et al. detected frontal and profile faces with arbitrary in-plane rotation and up to 90-degree out-of-plane rotation [16]. They used Haar features and a look-up table to build strong classifiers. To create a cascade of strong classifiers they used Real AdaBoost, an extension of the conventional AdaBoost. They built a specialized detector for each of 60 different face poses. To simplify the training process, they took advantage of the fact that Haar features can be efficiently rotated by 90 degrees or reversed, so they only needed to train 8 detectors; the other 52 can be obtained by rotating or inverting the Haar features.

However, these methods are often computationally expensive, so Froba and Ernst [17] used inherently illumination-invariant local structure features for real-time face detection. They used the Modified Census Transform (MCT), a non-parametric local transform, for efficient computation. The modified census transform captures all the 3 × 3 local structure kernels (see Figure 2), while the original Census Transform (CT), first proposed by Zabih and Woodfill [18], does not capture the local image structure correctly in some cases. They also introduced an efficient four-stage cascade classifier for rapid detection, whereas Viola used some thirty stages. Using these local structure features and the efficient four-stage classifier, they obtained results comparable to the best systems presented to date.
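The 9-bit MCT of a single 3 × 3 neighborhood can be sketched as follows. This is an illustrative reimplementation from the description above, not Froba and Ernst's code; the row-major bit ordering is our assumption:

```python
import numpy as np

def mct9(patch):
    """9-bit Modified Census Transform of a 3x3 grayscale patch:
    each pixel is compared against the neighborhood mean, and the
    nine resulting bits are packed into an integer in [0, 511]."""
    assert patch.shape == (3, 3)
    mean = patch.mean()
    bits = (patch.ravel() > mean).astype(int)  # 1 where pixel exceeds the mean
    index = 0
    for b in bits:                             # pack bits in row-major order
        index = (index << 1) | b
    return index
```

Unlike the original CT, which compares each neighbor against the center pixel, comparing against the mean lets the transform distinguish all 511 non-trivial local structure kernels.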

Fig. 2. A randomly chosen subset of Local Structure Kernels

Also based on local structures, Dalal [19] introduced grids of locally normalized Histograms of Oriented Gradients (HOG) as descriptors for object detection in static images. The HOG descriptors are computed over dense and overlapping grids of spatial blocks, with image gradient orientation features extracted at fixed resolution and gathered into a high-dimensional feature vector. They are designed to be robust to small changes in image contour locations and directions, and to significant changes in image illumination and color, while remaining highly discriminative of overall visual form.

3 12-bit Modified Census Transform for face detection on low-resolution color images

3.1 Features

The 9-bit modified census transform is defined on grayscale images. It works well when detecting faces of 24 × 24 pixels or larger, but does not work on low-resolution face images such as 8 × 8 pixels because of the lack of information.

For color images, we add 3 bits describing the RGB color information of each pixel, and propose a 12-bit modified census transform for face detection on low-resolution color images. Let bR(x), bG(x), bB(x) be the 3 additional bits of the modified census transform of the pixel at position x, and µR(x), µG(x), µB(x) be the mean intensities of the red, green and blue layers over the neighborhood N′(x). Let Im(y) be the mean intensity over all layers of the pixel at position y, and Īm(x) be the mean intensity over all layers of N′(x). The modified census transform in color space can then be defined as follows:


µ = (µR(x) + µG(x) + µB(x)) / 3

bR(x) = δ(µR(x), µ)
bG(x) = δ(µG(x), µ)
bB(x) = δ(µB(x), µ)

Ω(x) = Ψ( ⊕_{y ∈ N′(x)} ξm(x, y) ⊕ bR(x) ⊕ bG(x) ⊕ bB(x) )

ξm(x, y) = 0 if Īm(x) ≥ Im(y);  1 if Īm(x) < Im(y)

where ⊕ denotes concatenation, δ(x, y) is a comparison function that takes the value 1 if x < y and 0 otherwise, Ψ denotes a function converting a binary bit string to a decimal number, and Ω(x) is the decimal value of the 12-bit modified census transform bit string.
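Under the definitions above, the 12-bit transform of a single 3 × 3 RGB patch can be sketched as follows. This is a plausible reading rather than the authors' code: the bit ordering, and the use of the full 3 × 3 window as N′(x), are our assumptions:

```python
import numpy as np

def mct12(patch_rgb):
    """12-bit color MCT of a 3x3 RGB patch (shape (3, 3, 3)).
    The first 9 bits compare each pixel's channel-mean intensity Im(y)
    with the neighborhood mean; the last 3 bits are delta(mu_C, mu),
    comparing the per-channel means mu_R, mu_G, mu_B with their average mu."""
    gray = patch_rgb.mean(axis=2)        # Im(y): per-pixel mean over channels
    nbhd_mean = gray.mean()              # mean intensity over N'(x)
    bits = list((gray.ravel() > nbhd_mean).astype(int))
    mu_rgb = patch_rgb.reshape(-1, 3).mean(axis=0)   # mu_R, mu_G, mu_B
    mu = mu_rgb.mean()
    bits += [int(ch < mu) for ch in mu_rgb]          # delta(mu_C, mu) = 1 if mu_C < mu
    index = 0
    for b in bits:                       # pack all 12 bits into one integer
        index = (index << 1) | b
    return index                         # Omega(x), a value in [0, 4095]
```

The three color bits cost nothing extra to store in the feature index, but multiply the number of distinguishable local structures by eight, which is what gives the transform its extra discriminative power at very low resolutions.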

3.2 Training of classifiers

This section describes the algorithm for constructing a cascade of classifiers. We create a cascade of strong classifiers using a variation of the AdaBoost algorithm used by Viola et al., where we use only three stages for low-resolution face detection, as shown in Figure 3. Because it uses the modified census transform and fewer stages, our method is as powerful as Viola's but more efficient.

Fig. 3. The cascade has three stages of increasing complexity. Each stage can either reject the current analysis window as background or pass it on to the next stage.

Stages in the cascade are constructed by training classifiers using a version of boosting similar to AdaBoost; the pseudo-code is given in Table 1. Boosting terminates when the minimum detection rate and the maximum false positive rate per stage are attained. If the target false positive rate is achieved, the algorithm ends. Otherwise, all correctly classified negative examples are eliminated and the training set is re-balanced by adding negative examples through bootstrapping; the pseudo-code of bootstrapping is given in Table 2. With the updated training set, all the weak classifiers are retrained.
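The early-rejection behavior of the cascade can be paraphrased in a few lines. This is a sketch of the control flow only; the stage interface (a classifier score and a threshold per stage) is our assumption:

```python
def cascade_detect(window, stages):
    """Pass an analysis window through stages of increasing complexity.
    `stages` is a list of (score_fn, threshold) pairs ordered from
    cheapest to most expensive; any stage may reject immediately."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False        # rejected as background: stop early
    return True                 # survived all stages: report a face
```

Because most windows in an image are background, the vast majority are discarded by the first, cheapest stage, which is what makes the cascade fast in practice.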


Table 1. Pseudo-code of the boosting algorithm

• Given training examples (Ω1, l1), ..., (Ωn, ln), where li = 0, 1 for negative and positive examples respectively.
• Initialize weights ω1,i = 1/(2m), 1/(2l) for li = 0, 1 respectively, where m and l are the numbers of negatives and positives.
• For k = 1, ..., K:
  1. Normalize the weights: ωk,i = ωk,i / Σ_{j=1}^{n} ωk,j
  2. Generate a weak classifier ck for the single feature k, with error εk = Σ_i ωi |ck(Ωi) − li|.
  3. Compute αk = (1/2) ln((1 − εk)/εk).
  4. Update the weights: ωk+1,i = ωk,i × e^{−αk} if ck(Ωi) = li, and ωk+1,i = ωk,i otherwise.
• The final strong classifier is:
  C(Ω) = 1 if Σ_{k=1}^{K} αk ck(Ω(k)) > (1/2) Σ_{k=1}^{K} αk, and 0 otherwise,
where K is the total number of features.
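The loop in Table 1 translates almost line for line into Python. This is a sketch, not the authors' code; the weak-classifier trainer is passed in as a callable that returns a classifier and its weighted error εk:

```python
import math
import numpy as np

def boost(features, labels, K, train_weak):
    """AdaBoost-style loop mirroring Table 1. `features[i]` is the
    feature vector Omega_i, `labels[i]` is l_i in {0, 1}, and
    `train_weak(features, labels, w)` returns (classifier, error).
    Returns the stage's weights alpha_k and weak classifiers."""
    n = len(labels)
    m = sum(1 for l in labels if l == 0)          # number of negatives
    p = n - m                                     # number of positives
    w = np.array([1 / (2 * m) if l == 0 else 1 / (2 * p) for l in labels])
    alphas, weaks = [], []
    for k in range(K):
        w = w / w.sum()                           # 1. normalize the weights
        ck, err = train_weak(features, labels, w) # 2. weak classifier + error
        err = max(err, 1e-10)                     # guard against log of zero
        alpha = 0.5 * math.log((1 - err) / err)   # 3. classifier weight
        correct = np.array([ck(f) == l for f, l in zip(features, labels)])
        w = np.where(correct, w * math.exp(-alpha), w)  # 4. reweight
        alphas.append(alpha)
        weaks.append(ck)
    return alphas, weaks
```

A window Ω is then accepted by the stage when Σ αk ck(Ω) exceeds half of Σ αk, exactly as in the table.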

A few weak classifiers are combined to form a final strong classifier. The weak classifiers consist of histograms g^p_k and g^n_k for each feature k; each histogram holds a weight for each feature value. To build a weak classifier, we first count the kernel index statistics at each position. The resulting histograms determine whether a single feature belongs to a face or a non-face. The single-feature weak classifier at position k with the lowest boosting error εk is chosen in every boosting loop. The maximal number of features at each stage is limited according to the resolution of the analysis window. The histograms are defined as follows:

g^p_k(r) = Σ_i I(Ωi(k) = r) I(li = 1)
g^n_k(r) = Σ_i I(Ωi(k) = r) I(li = 0),   k = 1, ..., K,  r = 1, ..., R

where I(·) is the indicator function, taking the value 1 if its argument is true and 0 otherwise. The weak classifier for feature k is:

ck(Ωi) = 1 if g^p_k(Ωi(k)) > g^n_k(Ωi(k)), and 0 otherwise

where Ωi(k) is the kth feature of the ith face image and ck is the weak classifier for the kth feature. The final stage classifier C(Ω) combines the weak classifiers of all chosen features:

C(Ω) = 1 if Σ_{k=1}^{K} αk ck(Ω(k)) > (1/2) Σ_{k=1}^{K} αk, and 0 otherwise

where ck is the weak classifier, C(Ω) is the strong classifier, and αk = (1/2) ln((1 − εk)/εk).
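The histogram-based weak classifier above can be sketched as follows. We assume weighted counts (so the histograms reflect the current boosting weights); whether the original uses weighted or raw counts is not stated:

```python
import numpy as np

def train_weak_histogram(omega_k, labels, weights, R):
    """Histogram weak classifier for one feature position k, following
    the g^p_k / g^n_k construction. `omega_k[i]` is the MCT value
    Omega_i(k) in [0, R), `labels[i]` is l_i in {0, 1}."""
    gp = np.zeros(R)   # weighted counts of each MCT value on faces
    gn = np.zeros(R)   # weighted counts on non-faces
    for val, l, w in zip(omega_k, labels, weights):
        if l == 1:
            gp[val] += w
        else:
            gn[val] += w
    def classify(val):
        return 1 if gp[val] > gn[val] else 0
    error = sum(w * abs(classify(v) - l)
                for v, l, w in zip(omega_k, labels, weights))
    return classify, error
```

For the 12-bit MCT, R = 4096, so each weak classifier is just a 4096-entry lookup table, which keeps detection extremely fast.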


Table 2. Pseudo-code of the bootstrapping algorithm

• Set the minimum true positive rate, Tmin, for each boosting iteration.
• Set the maximum detection error on the negative dataset, Ineg, for each bootstrap iteration.
• P = set of positive training examples.
• N = set of negative training examples.
• K = the total number of features.
• While Ierr > Ineg:
  - While Ttpr < Tmin:
      For k = 1 to K:
        Use P and N to train a classifier for a single feature.
        Update the weights.
      Test the classifier with 10-fold cross-validation to determine Ttpr.
  - Evaluate the classifier on the negative set to determine Ierr, and put any false detections into the set N.
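The inner bootstrapping step of Table 2 amounts to mining hard negatives: scanning face-free images and feeding every false detection back into N. This sketch assumes a `sample_windows` callable that yields candidate windows from an image; it is not the authors' code:

```python
def bootstrap_negatives(detector, negative_images, N, sample_windows):
    """Scan images known to contain no faces; every window the detector
    accepts is a false detection, so append it to the negative set N.
    Returns the detection error I_err on the negative data."""
    errors = 0
    for img in negative_images:
        for window in sample_windows(img):
            if detector(window):          # any detection here is false
                N.append(window)
                errors += 1
    return errors                         # I_err for the outer loop of Table 2
```

Retraining on these mined negatives concentrates the classifier's capacity on exactly the background patterns it currently confuses with faces.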

4 Experimental results

The training data set consists of 6000 faces and 6000 randomly cropped non-faces. Both the faces and non-faces are down-sampled to 24 × 24, 16 × 16, 8 × 8, and 6 × 6 pixels to train detectors for the different resolutions.

To test the detector, we use the Georgia Tech face database, which contains images of 50 people. Each person is represented by 15 color JPEG images with cluttered backgrounds, taken at a resolution of 640 × 480 pixels. The average size of the faces in these images is 150 × 150 pixels. The pictures show frontal and tilted faces with different facial expressions, lighting conditions and scales.

We use a cascade of three strong classifiers trained with a variation of the AdaBoost algorithm used by Viola et al. The number of boosting iterations is not fixed: boosting continues until the minimum detection rate is reached. For detecting faces of 24 × 24 pixels, the analysis window is of size 22 × 22, and the maximal numbers of features for the three stages are 20, 300, and 484 respectively. For faces of 16 × 16 pixels, the analysis window is 14 × 14, with at most 20, 150, and 196 features per stage. For faces of 8 × 8 pixels, the analysis window is 6 × 6, with at most 20, 30, and 36 features per stage. For faces of 6 × 6 pixels, the analysis window is 4 × 4, with at most 5, 10, and 16 features per stage.

The experimental results are as follows. Table 3 presents the results of using the 12-bit MCT at different resolutions; as the resolution of the faces in the test images decreases, the false alarms increase and the detection rate drops. Table 4 presents the results of using the 9-bit MCT at different resolutions, showing the same trend. Comparing Table 3 and Table 4, we conclude that using the 12-bit MCT on color images yields a much better detection rate and fewer false alarms in low-resolution images than using the 9-bit MCT.


In Figures 4 through 7, we show some representative results on the Georgia Tech face database. Figure 4 presents detection results on 24 × 24 pixel face images, Figure 5 on 16 × 16 pixel faces, Figure 6 on 8 × 8 pixel faces, and Figure 7 on 6 × 6 pixel faces.

Table 3. Face detection using the 12-bit modified census transform on the Georgia Tech face database

Resolution | Detection rate | False alarms
24 × 24    | 99.5%          | 1
16 × 16    | 97.2%          | 10
8 × 8      | 95.0%          | 136
6 × 6      | 80.0%          | 149

Table 4. Face detection using the 9-bit modified census transform on the Georgia Tech face database

Resolution | Detection rate | False alarms
24 × 24    | 99.2%          | 296
16 × 16    | 98.4%          | 653
8 × 8      | 93.5%          | 697
6 × 6      | 68.8%          | 474

Fig. 4. Sample detection results on 24 × 24 pixel face images

Fig. 5. Sample detection results on 16 × 16 pixel face images

Fig. 6. Sample detection results on 8 × 8 pixel face images

5 Conclusion

In this paper, we presented a 12-bit MCT that works better than the original 9-bit MCT in low-resolution color images for object detection. According to the experiments, our method attains better results than the conventional 9-bit MCT when detecting faces in low-resolution color images.

For future work, we will extend our system to other object detection problems such as car detection, road detection, and hand gesture detection. In addition, we would like to perform experiments with other boosting algorithms, such as FloatBoost and Real AdaBoost, to further improve the performance of our system in low-resolution object detection.

References

1. Yang, M.H., Kriegman, D.J., Ahuja, N.: Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 34–58

2. Hayashi, S., Hasegawa, O.: Robust face detection for low-resolution images. Journal of Advanced Computational Intelligence and Intelligent Informatics 10 (2006) 93–101

3. Torralba, A., Sinha, P.: Detecting faces in impoverished images. Technical Report 028, MIT AI Lab, Cambridge, MA (2001)


Fig. 7. Sample detection results on 6× 6 pixel face images

4. Kruppa, H., Schiele, B.: Using local context to improve face detection. In: Proceedings of the British Machine Vision Conference, Norwich, England (2003) 3–12

5. Rowley, H.A., Baluja, S., Kanade, T.: Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 23–38

6. Rowley, H.A., Baluja, S., Kanade, T.: Rotation invariant neural network-based face detection. In: Proceedings of 1998 IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA (1998) 38–44

7. LeCun, Y., Haffner, P., Bottou, L., Bengio, Y.: Object recognition with gradient-based learning. In Forsyth, D., ed.: Shape, Contour and Grouping in Computer Vision. Springer (1989)

8. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R., Hubbard, W., Jackel, L.D.: Object recognition with gradient-based learning. In Forsyth, D., ed.: Shape, Contour and Grouping in Computer Vision. Springer (1999)

9. Lawrence, S., Giles, C.L., Tsoi, A.C., Back, A.D.: Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks 8 (1997) 98–113

10. Schneiderman, H., Kanade, T.: A statistical model for 3-D object detection applied to faces and cars. In: IEEE Conference on Computer Vision and Pattern Recognition, IEEE (2000)

11. Jesorsky, O., Kirchberg, K., Frischholz, R.W.: Robust face detection using the Hausdorff distance. In: Third International Conference on Audio- and Video-based Biometric Person Authentication. Lecture Notes in Computer Science, Springer (2001) 90–95

12. Kirchberg, K.J., Jesorsky, O., Frischholz, R.W.: Genetic model optimization for Hausdorff distance-based face localization. In: International Workshop on Biometric Authentication, Springer (2002) 103–111

13. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of 2001 IEEE International Conference on Computer Vision and Pattern Recognition (2001) 511–518

14. Sung, K.K.: Learning and Example Selection for Object and Pattern Detection. PhD thesis, Massachusetts Institute of Technology (1996)

15. Schneiderman, H., Kanade, T.: Probabilistic modeling of local appearance and spatial relationship for object recognition. In: International Conference on Computer Vision and Pattern Recognition, IEEE (1998)

16. Wu, B., Ai, H., Huang, C., Lao, S.: Fast rotation invariant multi-view face detection based on Real AdaBoost. In: Sixth IEEE International Conference on Automatic Face and Gesture Recognition (2004)

17. Froba, B., Ernst, A.: Face detection with the modified census transform. In: Sixth IEEE International Conference on Automatic Face and Gesture Recognition, Erlangen, Germany (2004) 91–96

18. Zabih, R., Woodfill, J.: A non-parametric approach to visual correspondence. IEEE Transactions on Pattern Analysis and Machine Intelligence (1996)

19. Dalal, N.: Finding People in Images and Videos. PhD thesis, Institut National Polytechnique de Grenoble (2006)