Handwritten Word Recognition: A New CAPTCHA Challenge
Amalia Rusu and Venu Govindaraju
CEDAR
University at Buffalo
CAPTCHA
Completely Automated Public Turing test to tell Computers and Humans Apart
An automated test that humans can pass but current computer programs fail – beyond the state-of-the-art
Exploits the difference in abilities between humans and machines (e.g., recognition of text, speech, or facial features)
A new formulation of Alan Turing's test - "Can machines think?"
Objective
Example of interface and handwritten CAPTCHA to confirm registration.
Please enter the handwritten word as it is shown below:
If you cannot read this image click here
User Authentication Steps using HCAPTCHA
i. Initialization
ii. Handwritten CAPTCHA Challenge
iii. User Response
iv. Verification
Automatic Authentication Session for Web Services.
[Figure: the user and the authentication server exchange a challenge and a response over the Internet.]
The user initiates the dialog and has to be authenticated by the server.
Desirable Properties
CAPTCHA should be automatically generated and graded
Test can be taken quickly and easily by human users
Test will accept virtually all human users and reject software agents
Test will resist automatic attack for many years despite technology advances and prior knowledge of algorithms
Previous Work
First CAPTCHA designed in 1997 (for the AltaVista website URL filter)
CMU: Gimpy, EZ-Gimpy, Gimpy-R, Bongo, Pix, Eco
PARC: BaffleText
UCB & PARC: PessimalPrint
Microsoft: ARTiFACIAL
Bell Labs: Reverse Turing test using speech
GIT: Character morphing
CAPTCHA Tests
AltaVista URL filter uses isolated random characters and digits on a cluttered background.
PessimalPrint uses a degradation model simulating physical defects caused by copying and scanning of printed text.
BaffleText uses pronounceable character strings that are not in the English dictionary, renders the character string into an image using a font (without physics-based degradations), and then generates a mask image.
CAPTCHA Tests
Gimpy asks the user to type 3 different English words appearing in the picture.
EZ-Gimpy uses real English words.
Gimpy-R uses nonsense words.
Character morphing algorithm that transforms a string into its graphical form.
Why Handwritten CAPTCHA?
No handwritten-text-based CAPTCHA exists so far!
Several machine-printed-text-based CAPTCHAs have already been broken
o Greg Mori and Jitendra Malik of UCB have written a program that can solve EZ-Gimpy with 83% accuracy
o Thayananthan, Stenger, Torr, and Cipolla of the Cambridge vision group have written a program that can achieve a 93% correct recognition rate against EZ-Gimpy
o Gabriel Moy, Nathan Jones, Curt Harkless, and Randy Potter of Areté Associates have written a program that can achieve 78% accuracy against Gimpy-R
Machine recognition of handwriting is more difficult than that of printed text
Handwriting recognition is a task that humans perform easily and reliably
Research is in the early stages - a promising field
Handwritten CAPTCHAs will challenge the KBCS community!
State-of-the-art
Speed and accuracy of a handwriting recognizer (HR). Feature extraction time is excluded. Testing platform is an Ultra-SPARC. Accuracy (%) is given for the Top 1 and Top 2 choices.

                Lexicon Driven                     Grapheme Model
Lexicon size    time (secs)   Top 1    Top 2       time (secs)   Top 1    Top 2
10              0.027         96.53    98.73       0.021         96.56    98.77
100             0.044         89.22    94.13       0.031         89.12    94.06
1000            0.144         75.38    86.29       0.089         75.38    86.29
20000           1.827         58.14    66.56       0.994         58.14    66.49
Source of Errors for HW Recognizers
Image quality: background noise, printing surface, writing styles
Image features: variable stroke width, slope, rotations, stretching, compressing
Segmentation errors: over-segmentation, merging, fragmentation, ligatures, scrawls
Recognition errors: confusion with similar lexicon entries, large lexicons
Creating H-CAPTCHAs
Use handwritten word images that current recognizers cannot read
Controlled "distortion" of existing handwritten word images
Create handwritten images by concatenating handwritten character images
Use handwritten US city name images (4,000 from CEDAR CDROM)
Character images were discretely printed to begin with
Character images are automatically segmented out of handwritten word images
Use set of 20,000 handwritten character images (extracted by program)
Synthesize sentence images by gluing together isolated upper and lower case handwritten characters or word images
H-CAPTCHA Generation Algorithm
Input.
Original (random) handwritten image (existing US city name image, synthetic word image with a length of 5 to 8 characters, or meaningful sentence).
Lexicon containing the image's truth word.
Output.
H-CAPTCHA image.
Method.
Randomly choose a number of transformations.
Randomly select that many transformations from: add lines, circles, grids, arcs, background noise (multiplicative or impulse), random convolution masks, blur, wave, spread, median filters, thickening or thinning of characters vertically or horizontally, etc.
An a priori order is assigned to each transformation based on experimental results. Sort the list of chosen transformations by this priority order and apply them in sequence, so that the effect is cumulative.
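The method above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: images are plain lists of rows with 255 for background and 0 for ink, only three of the listed transformations are implemented, and the priority values in `TRANSFORMS` are made up for the example (the slides say the real order comes from experiments).

```python
import random

def add_line(img, rng):
    """Draw one horizontal black line at a random row."""
    out = [row[:] for row in img]
    y = rng.randrange(len(out))
    out[y] = [0] * len(out[y])
    return out

def add_impulse_noise(img, rng, rate=0.05):
    """Set a random fraction of pixels to black or white (impulse noise)."""
    return [[rng.choice((0, 255)) if rng.random() < rate else p for p in row]
            for row in img]

def thicken(img, rng):
    """Thicken strokes horizontally: a pixel turns black if it or a horizontal neighbour is black."""
    return [[0 if 0 in row[max(0, x - 1):x + 2] else 255 for x in range(len(row))]
            for row in img]

# (priority, function); lower priority runs first. Values are illustrative only.
TRANSFORMS = [(1, thicken), (2, add_line), (3, add_impulse_noise)]

def generate_hcaptcha(img, rng=None):
    """Return the transformed image and the names of the applied transformations."""
    if rng is None:
        rng = random.Random()
    k = rng.randint(1, len(TRANSFORMS))   # randomly choose how many transformations
    chosen = rng.sample(TRANSFORMS, k)    # randomly choose which ones
    chosen.sort(key=lambda t: t[0])       # sort by the a priori priority
    for _, fn in chosen:                  # apply in sequence; effects accumulate
        img = fn(img, rng)
    return img, [fn.__name__ for _, fn in chosen]
```

Sorting before applying matters because the transformations do not commute: thickening after adding noise, for example, would also thicken the noise pixels.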
Handwritten text images
Examples of handwritten characters used to generate random words.
Examples of handwritten US city name images used as a base for transformations.
Examples of synthetic handwritten sentence images.
H-CAPTCHA by Image Quality Transforms
Add lines, grids, arcs, background noise, convolution masks and special filters
H-CAPTCHA by Image Features Transforms
Variable stroke width, slope, rotations, stretching, compressing
H-CAPTCHA by Segmentation Transform
Delete ligatures, use touching letters/digits, and merge characters to cause over-segmentation or make segmentation impossible
H-CAPTCHA by Lexicon Transform
Lexicon challenges: size, density, availability
Recognizer results for each truth word (top choice first):

Truth: Orlando
WMR: ovlando ovlavdo onlando orlanolo orlaudo oviando orlahdo arlando orlando ovlanao
Accuscript: ollando ovlando orlanolo orlando ovlanao ovlavdo onlando oviando orlanda arlando

Truth: Lackawanna
WMR: lackaevana lackawawa lackawaua lackowana lackawana lackawanna lackawarna lackawanra lackamama lactawana
Accuscript: lackawarna lactawana lackawarra lackawawa lackawana lackawaua lackawanna lackowana locrawara lackawanra

Truth: Clarence
WMR: clarlncl clarlnce clarencl cearence clarence cbarence clorence clahence aarence clawce
Accuscript: claience clarence clatence clarlnce cearence clavence clarenxe clasence clorence claiexce

Truth: Buffalo
WMR: buffaio buffalo butfalo buifalo buffrio ruffalo bulfalo bufialo buefaio bullalo
Accuscript: ruffalo buffalo buffrlo buffaio buffrio bulfalo buifalo butfalo buefalo bufialo
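One way to make lexicon density concrete is to count how many entries lie within a small edit distance of the truth word. The sketch below uses Levenshtein distance for this; the slides do not say how density was measured, so the measure, the function names, and the `max_dist` threshold are assumptions. The sample entries are taken from the recognizer outputs listed above.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def neighbours(truth, lexicon, max_dist=1):
    """Lexicon entries within max_dist edits of the truth, excluding the truth itself."""
    return [w for w in lexicon if w != truth and edit_distance(truth, w) <= max_dist]

lexicon = ["orlando", "ovlando", "onlando", "arlando", "buffalo", "ruffalo"]
print(neighbours("orlando", lexicon))  # ['ovlando', 'onlando', 'arlando']
```

The more neighbours a truth word has, the more easily a recognizer confuses it with a similar entry, which is the "confusion with similar lexicon entries" error source listed earlier.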
H-CAPTCHA Evaluation
No risk of image repetition
Image generation completely automated: words, images and distortions chosen at random
The transformed images cannot be easily normalized or rendered noise free by present computer programs, although original images must be public knowledge
Deformed images do not pose problems to humans
Human subjects succeeded on our test images
Test against state-of-the-art: WMR, Accuscript
CAPTCHAs unbroken by CEDAR recognizers
H-CAPTCHAs
Handwritten US city name images that defeat both WMR and Accuscript recognizers.
H-CAPTCHA Challenge
Low accuracy of handwriting recognizers. The lexicons are created so as to contain all the truths of test images. Total number of tested images is 4,127 (and so is the lexicon size).

Word recognizer    Number of recognized images    Accuracy
WMR                383                            9.28%
Accuscript         182                            4.41%

Low accuracy of handwriting recognizers vs. humans on a subset of test images.

Number of students    Number of test images    Humans accuracy    WMR accuracy    Accuscript accuracy
12                    15                       82%                0%              0%
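The recognizer accuracies follow directly from the counts: recognized images divided by the 4,127 tested images. A short check, with `accuracy_percent` as a helper name introduced only for this example:

```python
TOTAL_IMAGES = 4127  # tested images (also the lexicon size), as stated on the slide

def accuracy_percent(recognized, total=TOTAL_IMAGES):
    """Share of correctly recognized images, in percent, rounded to two decimals."""
    return round(100.0 * recognized / total, 2)

print(accuracy_percent(383))  # WMR: 9.28
print(accuracy_percent(182))  # Accuscript: 4.41
```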
CAPTCHA using Gestalt Psychology
Gestalt psychology is based on the observation that we often experience things that are not a part of our simple sensations
What we are seeing is an effect of the whole event, not contained in the sum of the parts (holistic approach)
Organizing principles - Gestalt laws:
law of closure
law of similarity
law of proximity
law of symmetry
law of continuity
law of familiarity
figure and ground
Not restricted to perception; they also apply to memory
OXXXXXX
XOXXXXX
XXOXXXX
XXXOXXX
XXXXOXX
XXXXXOX
XXXXXXO
**********
**********
**********
[ ] [ ] [ ]
H-CAPTCHA based on Gestalt Laws
Gestalt laws: law of proximity, symmetry, familiarity, continuity
Methods: create horizontal or vertical overlaps
o for same words: smaller distance overlaps
o for different words: bigger distance overlaps
H-CAPTCHA based on Gestalt Laws
Gestalt laws: law of closure, proximity, continuity
Methods: create occlusions by circles, rectangles, lines with random angles
H-CAPTCHA based on Gestalt Laws
Gestalt laws: law of closure, proximity, continuity
Methods: add occlusions by waves running from left to right across the entire image, with various amplitudes and wavelengths, or rotate the waves by an angle
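A wave occlusion of this kind can be sketched by drawing a sine curve over a copy of the image. This is a minimal illustration, not the authors' implementation: the image is assumed to be a list of rows with 255 for background and 0 for ink, the parameter names are invented for the example, and rotation of the wave is not covered.

```python
import math

def occlude_with_wave(img, amplitude=2.0, wavelength=8.0, thickness=1):
    """Draw a black sine wave from left to right across a copy of the image."""
    height = len(img)
    out = [row[:] for row in img]
    mid = (height - 1) / 2.0  # the wave oscillates around the middle row
    for x in range(len(img[0])):
        y = round(mid + amplitude * math.sin(2 * math.pi * x / wavelength))
        for dy in range(thickness):
            if 0 <= y + dy < height:  # skip parts of the wave outside the image
                out[y + dy][x] = 0
    return out
```

Varying `amplitude` and `wavelength` at random per image changes which strokes are covered, while a human reader can still complete the word through closure and continuity.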
H-CAPTCHA based on Gestalt Laws
Gestalt laws: law of closure, proximity, continuity, background
Methods: use empty letters, broken letters, edgy contour, fragmentation
H-CAPTCHA based on Gestalt Laws
flip-flop – OK for humans!!
vertical mirror – difficult for humans
horizontal mirror – difficult for humans
Gestalt laws: memory, internal metrics, familiarity of letters
Methods: change the orientation of the entire word, or of a few letters only
Gestalt H-CAPTCHA Results
Recognizer accuracy by type of transformation. The number of tested images is 4,127 for each type of transformation.

Transformation                WMR       Accuscript
Horizontal Overlap (Small)    24.35%    2.93%
Horizontal Overlap (Large)    12.93%    2.42%
Vertical Overlap              27.88%    12.64%
Occlusion by waves            15.43%    10.56%
Occlusion by circles          35.93%    32.34%
Empty Letters                 0.89%     0.06%
Less Fragmentation            0%        0.18%
More Fragmentation            0.48%     0%
Old Transforms                9.28%     4.41%
Future Work
Personalizing Email Addresses
Create transformed alias e-mail addresses to prevent mining by software agents
Original Email Address -> Apply Image Transformation -> Transformed Email Address
Future Work
A few methods to differentiate between adults and children
o Asking a question whose answer is in the handwritten sentence
o Giving an incomplete handwritten sentence and asking the user to infer the missing word
o Comparing the handwritten text with a standard word list
o Using longer, more complicated handwritten sentences on advanced topics from technical fields such as math, physics, or finance
Useful for Internet services, given the growth of websites harmful to minors
Adult vs. Child vs. Machine
Reading abilities delimitation:
Machine vs. 1st grade child
Adult vs. 7th grade child
Future Work
HCAPTCHA based on Handwritten Sentence Reading and Understanding
Incorporate and adjust the image complexity factor as a parameter of error
Try out more image transformations and compare results against human performance
Cognitive aspects of HCAPTCHA for adult vs. child protocol
HCAPTCHA as a Challenge Response Protocol for Security Systems
Online-Handwriting CAPTCHA
HCAPTCHA as a Biometric?
HCAPTCHA normalization concerns based on future technology development
Thank You