
Twitter A11y: A Browser Extension to Describe Images

Christina Low 1*, Emma McCamey 2*, Cole Gleason 3, Patrick Carrington 3, Jeffrey P. Bigham 3, Amy Pavel 3
1 Stony Brook University, Stony Brook, USA, [email protected]
2 Virginia Commonwealth University, Richmond, USA, [email protected]
3 Carnegie Mellon University, Pittsburgh, USA, {cgleason, pcarring, jbigham, apavel}@cs.cmu.edu
* First and second authors contributed equally.

ABSTRACT
Twitter is integral to many people’s lives for news, entertainment, and communication. While people increasingly post images to Twitter, a large majority of images remain inaccessible to people with vision impairments due to a lack of image descriptions (i.e., alternative text). We present Twitter A11y (pronounced ally), a browser extension to make images accessible through a set of strategies tailored to the platform. For example, screenshots of text that exceed the Twitter character limit are common, so we detect textual images and automatically add alternative text using optical character recognition. Tweet images apart from screenshots and link previews receive descriptions from crowd workers. Based on an evaluation of the timelines of 50 self-identified blind Twitter users, Twitter A11y increases automatic alt text coverage from 2.6% to 25.6%, before crowdsourcing the remaining images.

ACM Classification Keywords
H.5.m Information interfaces and presentation (e.g., HCI): Miscellaneous

Author Keywords
Screen reader; Twitter; social media; accessibility.

INTRODUCTION AND RELATED WORK
Twitter has been a popular platform for people with vision impairments because screen reader software supported its original simple text interface [12]. However, people increasingly post images to social media platforms [8], and an overwhelming majority of these images remain inaccessible. For instance, only 0.1% of images on Twitter include alternative text, despite Twitter providing the option to add alternative text [5].

Figure 1. Twitter A11y describes images posted to Twitter depending on the image type: (A) user-posted images, (B) external link previews, and (C) text-based screenshots. It generates descriptions for text-based screenshots (C) via optical character recognition and for external link previews (B) by crawling the linked webpage for corresponding alt text. Remaining images without alt text (A) receive descriptions via crowdsourcing.

As the lack of alternative text is a perennial problem on the web, prior work obtains alternative text for web images by creating new alt text using computer vision and crowdsourcing, or by retrieving existing alt text posted elsewhere. While computer vision approaches such as Microsoft Seeing AI [10] and Facebook’s automated alt text [13] efficiently create a large quantity of image descriptions, the generated alt text remains simplistic and less trustworthy than human descriptions [9]. On the other hand, prior crowdsourcing approaches [4, 12] generate high-quality alt text for images but are expensive at social media scale. Caption Crawler instead retrieves existing alt text by searching for prior instances of the same image posted elsewhere with alt text [8]. While Caption Crawler succeeds on traditional sites, this method often fails on social media sites where users post original images and recent images not yet indexed by search engines. Similarly, WebInSight [1] retrieves alt text for traditional webpage images through a variety of techniques, but the primary automatic technique relies on closely related HTML tags (e.g., captions, titles of linked pages) that typically do not describe Twitter images.

In order to provide high-quality descriptions for images on Twitter, we present Twitter A11y, an end-to-end system that displays, generates, and retrieves alt text for images. Twitter A11y automatically provides alt text for frequent classes of Twitter images (e.g., text-based screenshots and external link previews), and generates descriptions for remaining images using crowdsourcing. Twitter A11y’s browser extension then loads the alt text for images as a user scrolls through Twitter.

We performed a preliminary evaluation on the timelines of 50 blind Twitter users. Of the 5,000 tweets collected, 1,119 contained media (images or external link previews). By scraping websites through external links and using optical character recognition, Twitter A11y increased the automatic alt text coverage from 2.6% to 25.6%. We evaluated crowdsourcing as an additional method, finding that 69% of the resulting descriptions were high-quality (as defined in [5]).


Figure 2. Flowchart of the Twitter A11y process: scrolling triggers image processing; the extension sends a request with image URLs; the server selects a description method (URL following for link previews, OCR for images classified as text, and crowdsourcing for remaining images); and the page HTML is updated with the new alt text. Rounded rectangles depict steps that happen in the browser extension, and rectangles with square corners depict steps on the server.

SYSTEM
Twitter A11y uses a Chrome extension (Figure 2, red) and a backend server (Figure 2, blue) to distinguish tweets containing three image classes (user-posted images, external link previews, and text-based screenshots) and to choose a suitable method to generate alt text to insert back into the tweet HTML.

Requesting alternative text
As a user browses Twitter, the Chrome extension (Figure 2, red) parses the tweets in the user’s timeline. For each tweet that contains at least one image, the extension extracts the image URL, any existing alt text, and the context of the tweet (e.g., tweet ID, user ID, tweet text). If the tweet contains a clickable link to an external website, the extension also records the linked URL. With each new image URL extracted, the extension sends a request to the server, which categorizes the image for alt text retrieval or generation.

Obtaining alternative text
The Flask [11] server receives tweet data with image URLs that need alt text added or revised, and uses three methods, as applicable, to fetch the alt text descriptions.
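To make the dispatch concrete, the following is a minimal sketch of what such a server endpoint could look like. The route name, payload fields, and helper functions are illustrative assumptions (the helpers are sketched in the subsections below); the paper does not specify this interface.

```python
# Hypothetical sketch of the server-side dispatch; the route, payload
# fields, and helpers are assumptions, not the system's actual code.
from flask import Flask, request, jsonify

app = Flask(__name__)

def describe_link_preview(image_url, linked_url): ...  # sketched below
def looks_like_text(image_url): ...                    # sketched below
def ocr_text(image_url): ...                           # sketched below
def enqueue_crowd_task(tweet): ...                     # sketched below

@app.route("/describe", methods=["POST"])
def describe():
    tweet = request.get_json()            # posted by the browser extension
    image_url = tweet["image_url"]
    existing_alt = tweet.get("alt_text")
    linked_url = tweet.get("linked_url")

    if existing_alt:                      # author already supplied alt text
        return jsonify(alt_text=existing_alt, source="author")
    if linked_url:                        # external link preview image
        alt = describe_link_preview(image_url, linked_url)
        if alt:
            return jsonify(alt_text=alt, source="link-following")
    if looks_like_text(image_url):        # text-based screenshot
        return jsonify(alt_text=ocr_text(image_url), source="ocr")
    enqueue_crowd_task(tweet)             # everything else goes to the crowd
    return jsonify(alt_text=None, source="crowd-pending")
```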

External link previews. Sites such as news organizations share external articles on Twitter that contain described images, but they do not include alt text for the preview images that appear in such tweets. To address external link preview images (Figure 1B), the system parses the page at the given external URL to extract all images. We compare the color histogram of the tweet preview image with the color histograms of all images in the external page (~3–10 images total for news articles), and record any alt text for the closest matching image.
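Below is a sketch of what this matching step could look like, assuming requests, BeautifulSoup, Pillow, and NumPy; the paper names neither its libraries nor its histogram distance, so both are illustrative choices.

```python
# Illustrative sketch of the link-following strategy; library choices and
# the Euclidean histogram distance are assumptions, not the paper's code.
from io import BytesIO
from urllib.parse import urljoin

import numpy as np
import requests
from bs4 import BeautifulSoup
from PIL import Image

def color_histogram(url, bins=8):
    """Normalized RGB color histogram of the image at `url`."""
    img = Image.open(BytesIO(requests.get(url, timeout=10).content))
    pixels = np.asarray(img.convert("RGB").resize((128, 128))).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def describe_link_preview(preview_url, linked_url):
    """Reuse the alt text of the page image closest to the preview image."""
    page = BeautifulSoup(requests.get(linked_url, timeout=10).text, "html.parser")
    target = color_histogram(preview_url)
    best_alt, best_dist = None, float("inf")
    for tag in page.find_all("img", src=True):
        if not tag.get("alt"):
            continue                      # only images carrying alt text help
        try:
            hist = color_histogram(urljoin(linked_url, tag["src"]))
        except Exception:
            continue                      # skip images that fail to load or decode
        dist = np.linalg.norm(hist - target)
        if dist < best_dist:
            best_alt, best_dist = tag["alt"], dist
    return best_alt
```

As the evaluation later notes, unconditionally taking the closest match is error-prone; adding a maximum-distance threshold would let this step abstain when no page image truly matches the preview.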

Text-based screenshots. Many tweets feature screenshots of text (e.g., text that exceeds the Twitter character limit, text messages, screenshots of tweets). We determine whether a tweeted image depicts primarily text (Figure 1C) using whole-image labelling via the Google Cloud Vision API [7]. If the confidence of the “text” label exceeds 0.8, we record the resulting OCR text (also via the Google Cloud Vision API) as the alt text. We selected the threshold of 0.8 empirically to achieve high precision (at lower recall) such that users are unlikely to receive inaccurate alt text.
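A minimal sketch of this classify-then-OCR step with the Google Cloud Vision Python client follows; the 0.8 threshold comes from the paper, but the exact client calls and label handling are assumptions and may differ across client-library versions.

```python
# Sketch of the text-screenshot strategy; the threshold is from the paper,
# the client usage is an assumption and may vary between library versions.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
TEXT_LABEL_THRESHOLD = 0.8   # chosen for high precision over recall

def _remote_image(image_url):
    return vision.Image(source=vision.ImageSource(image_uri=image_url))

def looks_like_text(image_url):
    """True if whole-image labelling says the image is primarily text."""
    labels = client.label_detection(image=_remote_image(image_url)).label_annotations
    return any(
        label.description.lower() == "text" and label.score >= TEXT_LABEL_THRESHOLD
        for label in labels
    )

def ocr_text(image_url):
    """Full OCR transcript of the image, used directly as its alt text."""
    annotations = client.text_detection(image=_remote_image(image_url)).text_annotations
    return annotations[0].description if annotations else ""
```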

All other images. For remaining images not handled by the prior two methods, we post a crowd task ($0.35 per image, or around $10 per hour) to obtain image descriptions on Amazon Mechanical Turk. The task asks workers to generate image alt text using the guidelines proposed by Salisbury et al. [12]. Crowdsourcing carries the benefit that humans may be able to best describe characteristics, such as humor, that automatic methods miss.
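Such a task could be posted with the boto3 MTurk client roughly as sketched below; the reward comes from the paper, while the question markup, task wording, and HIT parameters are illustrative assumptions.

```python
# Sketch of posting a description task to Mechanical Turk via boto3; only
# the $0.35 reward is from the paper, the rest is an illustrative guess.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

QUESTION_TEMPLATE = """\
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><body>
      <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
      <img src="{image_url}" style="max-width:500px"/>
      <p>Describe this image for someone who cannot see it.</p>
      <crowd-form><crowd-input name="alt_text" required></crowd-input></crowd-form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>600</FrameHeight>
</HTMLQuestion>
"""

def enqueue_crowd_task(tweet):
    """Create one HIT asking a worker to write alt text for a tweet image."""
    return mturk.create_hit(
        Title="Describe an image posted to Twitter",
        Description="Write a short alternative-text description of one image.",
        Keywords="image, description, accessibility, alt text",
        Reward="0.35",                         # per-image payment from the paper
        MaxAssignments=1,
        LifetimeInSeconds=24 * 60 * 60,        # keep the task available for a day
        AssignmentDurationInSeconds=10 * 60,   # ten minutes per description
        Question=QUESTION_TEMPLATE.format(image_url=tweet["image_url"]),
    )
```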

Displaying alternative text
From the server, the extension receives either the corresponding existing alt text in the database, newly generated alt text, or the status of an uncompleted alt text description. The extension dynamically inserts the description into the alt attribute of the image. A screen reader can then read the newly generated alt text for an image when it focuses on the tweet. While automatically generated alt text appears soon after viewing, crowd-generated alt text takes longer (on the scale of minutes), such that the user can view the text upon revisiting the tweet (e.g., by scrolling back in their timeline or revisiting a user’s page where the tweet appeared).

PRELIMINARY EVALUATION
We evaluated Twitter A11y by recreating the timelines of self-identified blind users. We collected 100 tweets that would be included in the timelines of each of 50 users who stated they were blind in their profile description (5,000 total tweets). Of this sample, 161 (3%) contained links to external sites with image previews, and 958 (19%) contained other images. We then applied the described strategies and rated the quality of the resulting descriptions. Using Gleason et al.’s method [5], we rated a description as high-quality if it described some or most of the image, and as low-quality if it was irrelevant or only somewhat relevant to the image.

Link following: Of the 161 external link previews, 18 (11%) already contained alt text. The link-following strategy added an additional 64 descriptions (44%). Upon rating, only 15% of the alt text collected from the external sites was high-quality. This method could be improved by adding a threshold to determine whether a match is correct, instead of always taking the closest match.

Optical character recognition: Of the 958 tweets containing at least one image, only 12 (1.3%) already contained alt text. The OCR strategy added alt text to 193 (20%), with 156 (81%) being high-quality descriptions that covered most or all of the image.

Crowdsourcing: This left 753 images remaining without alt text. We selected a subset of 100 to crowdsource and rate for quality. We found that 69 were high-quality and 31 were low-quality. Low-quality descriptions were typically subjective comments about the image that did not describe it, or unrelated paragraphs copied from reverse image search results.

CONCLUSION
We present Twitter A11y, a system that addresses the low accessibility of images on Twitter by combining multiple strategies tailored to the style of images on the platform. While we selected our automatic methods to be high-precision, we aim to expand this work to achieve higher coverage of automated alt text by adding new strategies. For instance, memes can be described in richer detail using prepared templates and OCR [6]. While paid crowdsourcing is a common approach, we also want to evaluate social micro-volunteering [2] and friendsourcing [3] as alternate methods. In addition, we plan to deploy and iteratively refine Twitter A11y’s interface and crowdsourcing strategy to ensure real-time, high-quality descriptions.


REFERENCES
1. Jeffrey P. Bigham, Ryan S. Kaminsky, Richard E. Ladner, Oscar M. Danielsson, and Gordon L. Hempton. 2006. WebInSight: Making web images accessible. In Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility. ACM, 181–188.

    2. Erin Brady, Meredith Ringel Morris, and Jeffrey P. Bigham. 2015. Gauging Receptiveness to Social Microvolunteering. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). ACM, New York, NY, USA, 1055–1064. DOI: http://dx.doi.org/10.1145/2702123.2702329

3. Erin L. Brady, Yu Zhong, Meredith Ringel Morris, and Jeffrey P. Bigham. 2013. Investigating the Appropriateness of Social Network Question Asking As a Resource for Blind Users. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work (CSCW ’13). ACM, New York, NY, USA, 1225–1236. DOI: http://dx.doi.org/10.1145/2441776.2441915

    4. Daniel Dardailler. 1997. The ALT-server (“An eye for an alt”).

5. Cole Gleason, Patrick Carrington, Cameron Cassidy, Meredith Ringel Morris, Kris M. Kitani, and Jeffrey P. Bigham. 2019a. “It’s almost like they’re trying to hide it”: How User-Provided Image Descriptions Have Failed to Make Twitter Accessible. In The World Wide Web Conference. ACM, 549–559.

    6. Cole Gleason, Amy Pavel, Xingyu Liu, Patrick Carrington, Lydia B. Chilton, and Jeffrey P. Bigham. 2019b. Making Memes Accessible. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS ’19). ACM, New York, NY, USA. DOI: http://dx.doi.org/10.1145/3308561.3353792

    7. Google. 2016. Cloud Vision API. https://cloud.google.com/vision/docs/. (2016). Accessed 2019-07-10.

    8. Darren Guinness, Edward Cutrell, and Meredith Ringel Morris. 2018. Caption crawler: Enabling reusable alternative text descriptions using reverse image search. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 518.

    9. Haley MacLeod, Cynthia L. Bennett, Meredith Ringel Morris, and Edward Cutrell. 2017. Understanding Blind People’s Experiences with Computer-Generated Captions of Social Media Images. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17). ACM, New York, NY, USA, 5988–5999. DOI: http://dx.doi.org/10.1145/3025453.3025814

    10. Microsoft. 2017. Seeing AI | Talking camera app for those with a visual impairment. https://www.microsoft.com/en-us/seeing-ai/. (2017).

    11. Armin Ronacher. 2019. Flask web framework. http://flask.pocoo.org/. (2019).

    12. Elliot Salisbury, Ece Kamar, and Meredith Ringel Morris. 2017. Toward scalable social alt text: Conversational crowdsourcing as a tool for refining vision-to-language technology for the blind. In Fifth AAAI Conference on Human Computation and Crowdsourcing.

13. Shaomei Wu, Jeffrey Wieland, Omid Farivar, and Julie Schiller. 2017. Automatic Alt-text: Computer-generated Image Descriptions for Blind Users on a Social Network Service. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW ’17). ACM, New York, NY, USA, 1180–1192. DOI: http://dx.doi.org/10.1145/2998181.2998364

