Crowdsourcing the transcription of archival data


Kimberly A. Jameson1, Sean Tauber1, Prutha S. Deshpande2, Stephanie M. Chang3, and Sergio Gago3

1Institute for Mathematical Behavioral Sciences, 2Cognitive Sciences, and 3Calit2

University of California, Irvine


Crowdsourcing the transcription of archival data

UCI ColCat Project Collaborators:

Funding and Support for the archive project: Calit2 at UCI. University of California Pacific Rim Research Program, 2010-2015 (K.A. Jameson, PI). National Science Foundation 2014-2017 (#SMA-1416907, K.A. Jameson, PI). UCI's UROP Program Awards. IRB Approvals HS#2013-9921 and 2015-9047.

Prutha Deshpande

Sean Tauber

Stephanie Chang

Sergio Gago

Nathan Benjamin

Yang Jiao

Brian Huynh

Han Ke

Ram Bhakta

Zhimin Xiang

Ian Harris

Prutha S. Deshpande (CogSci)

Sean Tauber (IMBS)

Sergio Gago (Calit2)

Stephanie M. Chang (Calit2)

Talk Overview

• Background on an important problem in Cognitive Science.

• The domain under consideration: Color categorization.

• Creating a new database using internet-based procedures.

• Features of the internet-based research problem and solution approaches that may generalize elsewhere.

• Modeling the problem and developing appropriate analyses.

• Preliminary results from empirical tests.

• Summary.

Research on how concepts are represented across linguistic groups

✶ Individual concept formation and the sharing and transmission of concepts within and across groups. E.g., kinship terminology.

Concept formation across language groups

E.g., kinship terminology: https://en.wikipedia.org/wiki/Kinship


In what ways are representations of concepts similar across individuals and language groups?

And what are the various ways concepts vary across individuals and language groups?

How do the world’s languages map the color appearances we all see in our environments?

Basic Color Terms (1969)

Brent Berlin and Paul Kay

Basic color terms are described as "the smallest set of simple words with which the speaker can name any color."

Image credit: Lindsey & Brown (2006). PNAS, 102.

Basic Color Terms (1969)

(1) Found that all languages tested had systems of 11 or fewer basic color words (e.g., English): red, yellow, green, blue, orange, purple, pink, brown, grey, black and white. (Terms such as crimson, blonde and royal blue are not considered to be basic.)

(2) Provided a sequence by which languages adopted subsets of the 11 basic color categories.


Color concept universals like this were made popular by Berlin & Kay, and by several other investigators; still, there are instances where different societies have evolved different conventions for color naming ...

Image credit: Lindsey & Brown (2006). PNAS, 102.

(Figure: Berinmo, 5 words. Image credit: Kay & Regier (2007). Cognition, 102.)

Different numbers of color terms: systems with n = 3, 4, 5 and 6 terms. (Figures from T. Regier et al., PNAS 104, 2007.)

The World Color Survey

✶ 110 languages; 25 speakers per language.

✶ Data collection ended in 1980.

✶ Digitizing the hand-coded data took more than 23 years.

✶ A very valuable site of unembellished ASCII data files: http://www.icsi.berkeley.edu/wcs/data.html

World Color Survey Data — Uses a Generic Format

The existing World Color Survey (WCS) database

✶ Beginning ~2003 the WCS database was made publicly available.

✶ Has been very widely cited in the last few years.

http://www.icsi.berkeley.edu/wcs/data.html


E.g., focus selection task: shown the chart, speakers pinpoint the "best example" of each term root they volunteered while naming.

Datafile example "foci.txt": each row records the color chip selected as a category best-exemplar. (WCS datafiles do not include headers.) The columns are: Language Number, Speaker Number, Focus Number, Term Abbrv., and Coordinates of focus selection. A parsing sketch follows below.
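For readers who want to work with these files, here is a minimal Python sketch of a reader. The five field names follow the column labels above; the tab-delimited, headerless layout is our assumption about the published files rather than a documented spec.

```python
import csv
from collections import Counter

# Field names follow the column labels on the slide above; WCS datafiles
# ship without headers, so we supply them ourselves (an assumption about
# the layout, not a documented spec).
FIELDS = ["language_num", "speaker_num", "focus_num",
          "term_abbrev", "chip_coord"]

def read_foci(path):
    """Yield one dict per focus selection from a headerless foci.txt."""
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) == len(FIELDS):
                yield dict(zip(FIELDS, row))

# Example use: count focus selections per language.
# counts = Counter(rec["language_num"] for rec in read_foci("foci.txt"))
```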

Focus selections in two languages: English and Korean.

Deshpande, P. S. (under review). Investigating Color Categorization Behaviors in Korean-English Bilinguals. UCI Undergraduate Research Journal (submitted June 2015).

The WCS data is awesome, but a platform with a GUI for empirically investigating and analyzing such data would be even better, and a site with rigorous on-board research tools would also be a big plus.

We were given a chance to do this…

Jameson, K. A., Benjamin, N. A., Chang, S.M., Deshpande, P. S., Gago, S., Harris, I. G., Jiao, Y., and Tauber, S. (2015). Mesoamerican Color Survey Digital Archive. In Encyclopedia of Color Science and Technology, (Ronnier Luo, Ed.). Springer: Berlin / Heidelberg. ISBN: 978-3-642-27851-8 (Online). DOI 10.1007/978-3-642-27851-8.

Nathan: See poster, An Affordance Based Approach to Large Data-Set Navigation.

The Robert E. MacLaury Archive

✶ ~23,000 pages of raw color categorization data that includes:

✶ 116 dialects from indigenous Mesoamerican societies (261 surveys), and

✶ ~130 additional surveys from a variety of languages (across Africa, Asia, the Americas and Europe).

R. E. MacLaury's dissertation: Color in MesoAmerica, Vol. I: A Theory of Composite Categorization (1986).

Book: Color and Cognition in Mesoamerica: Constructing Categories as Vantages (1997).

The Mesoamerican portion of the REM archive:

37 within Oaxaca

30 within Guatemala

33 within Mexico City

Jameson et al. (2015). ECST.

Chinantec language diversity in the MCS

(Figure: language vitality categories shown are Developing, Vigorous, and Endangered.)

Jameson et al. (2015). ECST.

Features of our transcription problem that may be general:

✶ The data has a constrained structure and format (unlike typical historical-records transcription tasks).

✶ It's a perceptual identification/reproduction problem: e.g., identify handwritten characters/symbols in a standardized template or form and reproduce them via keyboard input.

✶ Transcription of large blocks of data can be broken into small tasks and transcribed by OCR or crowdsourcing methods.

Yang: See poster, Optical Character Recognition of Handwritten Tabular Data.

Focus selection task: shown the chart, speakers pinpoint the "best example" of each term root they volunteered while naming.

Problem: convert THIS into a data-addressable file.

(Figure: a raw American English data chart; rows of handwritten responses continue up to chip 330.)

Challenges of our transcription job:

• Concepts, and how they apply everywhere.

• There's a classic example: color.

• There's an existing database.

• There's a chance to do better.

• Crowdsourcing can help greatly.

• Why OCR doesn't work: handwriting that is not prose.

• The reason is that it's a perceptual problem.

• Crowdsourcing lets us break the problem into pieces and solve it piecewise.


Features of our problem and approach that may apply elsewhere:

✶ The perceptual nature of our tasks differs from general information surveys or opinion-poll data; e.g., response bias is likely to be item-based rather than the usual informant-based form, perhaps allowing more than one possible decision strategy.

✶ In large-scale efforts there's a need to automate quantification and evaluation of the "goodness" of the transcribed product.

✶ Minimizing response bias by partitioning larger tasks into smaller, distributed tasks that are answered by several subjects and reassembled into a whole lends itself to crowdsourced approaches (see the sketch after this list).

✶ While crowdsourcing makes Big Data possible, an intelligent model of data aggregation (like CCT) may permit trading off "smarter" data for "bigger" data, giving a more economical approach to accurately deriving robust results using internet-based crowdsourcing methods.

National Science Foundation 2014-2017 (#SMA-1416907, K.A. Jameson, PI).
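To make the partition-and-reassemble idea concrete, here is a hypothetical Python sketch: it splits one survey page into per-cell micro-tasks, assigns each cell to several workers, and reassembles a page-level transcription by modal agreement. All names are illustrative; this is not the production ColCat code.

```python
import random
from collections import Counter

def make_microtasks(page_cells, worker_pool, workers_per_cell=3):
    """Split one survey page into small, redundant micro-tasks.

    page_cells: dict mapping a cell id (e.g., chip 1..330) to its image
    crop. Each cell goes to several workers so responses can later be
    cross-checked against each other.
    """
    tasks = []
    for cell_id, crop in page_cells.items():
        for worker in random.sample(worker_pool, workers_per_cell):
            tasks.append({"cell": cell_id, "worker": worker, "crop": crop})
    return tasks

def reassemble(responses):
    """Rebuild a page transcription from distributed responses.

    responses: iterable of (cell_id, transcribed_text) pairs. Here we
    take the modal answer per cell; a CCT-based aggregation (sketched
    later in the talk) would replace this naive step.
    """
    by_cell = {}
    for cell_id, text in responses:
        by_cell.setdefault(cell_id, []).append(text)
    return {cell: Counter(texts).most_common(1)[0][0]
            for cell, texts in by_cell.items()}
```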

Jameson & Romney (1990). Consensus on Semiotic Models of Alphabetic Systems. J. of Quant. Anthro.

Batchelder & Romney (1988). Test theory without an answer key. Psychometrika.

Cultural consensus analyses of a cognitive-perceptual task

✶ For tasks evaluating new characters designed to extend the 26 letters of the English alphabet, consensus analyses objectively identified expert typeface designers with higher “competence” compared to college undergraduates.

Automating archive transcription: Task and Judgments

Design 1: OCR verification (pattern recognition) - 2-AFC yes/no
Design 2: OCR verification (training data) - free response
Design 3: Crowdsource verification - 2-AFC "match/no-match"
Design 4: Naming ranges 1 - free response + confidence
Design 5: Naming ranges 2 - N-AFC + confidence
Design 6: Focus transcription 1 - free response + confidence
Design 7: Focus transcription 2 - free response

"free response" = a "reCAPTCHA" task.

Stephanie: See poster, Designing Crowdsourcing Methods for the Transcription of Handwritten Documents.


E.g., internet-based transcription task:

http://colcat.calit2.uci.edu

Cultural Consensus Theory (CCT) to aggregate the data

Deshpande, Tauber, Chang, Gago & Jameson (in preparation). Digitizing a large corpus of handwritten documents using crowdsourcing and cultural consensus theory.

Prutha: See poster, A Cultural Consensus Theory Analysis of Crowdsourced Transcription Data.

— Automate piece-wise crowdsourced transcription designs for analysis with CCT to derive the correct transcription.

— Enrich the model underlying the dichotomous Bayesian form of CCT (Oravecz et al., 2014) to handle N-alternative forced-choice data formats.

— As a result, employ smarter analyses of smaller samples, using CCT's formal process model, that produce solutions as robust as those from large amounts of "averaged" data.

Results: Task 4 (n=30).

Inferring the true transcription

• Mode?

• (Bayesian) Cultural Consensus Theory (CCT) (Oravecz, Vandekerckhove & Batchelder, 2014; Batchelder & Romney, 1988)

Cultural Consensus Theory (CCT)

• “Test theory without an answer key” (Batchelder & Romney, 1988)

• Allows us to infer:

• shared latent cultural knowledge (true transcription)

• individual ability

• item difficulty

• response bias
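As a rough sketch of the machinery behind these inferences, the dichotomous General Condorcet Model underlying CCT can be written down in a few lines. The parameterization follows the forms in Batchelder & Romney (1988) and Oravecz et al. (2014), though the variable names here are ours.

```python
def knowledge_prob(theta_i, delta_k):
    """Probability that subject i actually knows item k, combining
    competence theta_i with item difficulty delta_k (both in (0, 1))."""
    num = theta_i * (1.0 - delta_k)
    return num / (num + delta_k * (1.0 - theta_i))

def p_respond_true(z_k, theta_i, delta_k, g_i):
    """Probability that subject i marks item k 'true'.

    z_k is the latent consensus answer (0 or 1) and g_i is subject i's
    guessing bias toward 'true': a subject who knows the item reports
    z_k; otherwise they guess 'true' with probability g_i.
    """
    d = knowledge_prob(theta_i, delta_k)
    return d * z_k + (1.0 - d) * g_i
```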

Cultural Consensus Theory (CCT)

• Usually applied to dichotomous (true/false) data.

• Other formats have been explored within the Bayesian framework, but not multiple choice / free response (to our knowledge).

• Not typically applied to perceptual identification (although see Jameson & Romney, 1990).

(Model diagrams: Dichotomous CCT vs. Multiple Choice CCT, showing the observed data, the latent parameters, and the subject-wise bias term.)

Examples of perceptually confusable stimuli

Response bias: individuals or items? (Subject-wise bias vs. item-wise bias; a sketch of the two choices follows below.)
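One way to picture this modeling choice: in an N-alternative task the guessing distribution can be indexed either by subject or by item. A hedged sketch of the response rule follows (our illustration, not necessarily the exact model in the poster).

```python
import numpy as np

def response_probs(z_k, d_ik, guess_probs):
    """Distribution over N alternatives for one subject-item pair.

    z_k: index of the latent true transcription; d_ik: probability the
    subject knows this item; guess_probs: a length-N guessing
    distribution. Under subject-wise bias it comes from a per-subject
    table g[i]; under item-wise bias (perceptually confusable
    alternatives) it comes from a per-item table g[k].
    """
    p = (1.0 - d_ik) * np.asarray(guess_probs, dtype=float)
    p[z_k] += d_ik
    return p

# subject-wise bias: response_probs(z, d, g_subject[i])
# item-wise bias:    response_probs(z, d, g_item[k])
```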

CCT Answer Key: Task 4

(Figures: subject-wise posteriors for Answer 4 (Z4), Answer 16 (Z16), Answer 125 (Z125) and Subject 0 bias (g0); item-wise posteriors for the same answers with Item 4 bias (g4), Item 16 bias (g16), and Item 125 bias (g125).)

(Figures: subject-wise vs. item-wise model predictions for Task 4 and Task 7.)

✶ CCT was designed to work on the small subject samples (6-10) typical of anthropological studies.

Would the patterns of results reported for Task 4 be possible with a sample smaller than 30 participants?

Method                    Answer Key Estimate %-correct   Mean Competence   Mean Item Difficulty
Trial 1 - 8 participants  100%                            0.929             0.466
Trial 2 - 8 participants  100%                            0.937             0.460
Trial 3 - 8 participants  100%                            0.914             0.459
Trial 4 - 8 participants  100%                            0.942             0.464
Trial 5 - 8 participants  100%                            0.935             0.464
30 participants           100%                            0.917             0.366

Can we use fewer informants? …

Preliminary trends suggest that 8 participants may be as informative as 30 (a resampling sketch follows below).
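One offline way to probe this question is to resample small panels from the full sample and check how often the aggregated answer key matches the full-sample key. A minimal sketch, with the modal answer standing in for the full Bayesian CCT fit:

```python
import random
from collections import Counter

def answer_key(data, subjects):
    """Modal answer per item over a panel of subjects.
    data: {subject_id: {item_id: answer}}."""
    items = next(iter(data.values())).keys()
    return {item: Counter(data[s][item] for s in subjects).most_common(1)[0][0]
            for item in items}

def panel_agreement(data, panel_size=8, trials=5, seed=0):
    """Average fraction of items on which a random small panel's key
    matches the key derived from all subjects."""
    rng = random.Random(seed)
    subjects = list(data)
    full = answer_key(data, subjects)
    scores = []
    for _ in range(trials):
        panel = rng.sample(subjects, panel_size)
        key = answer_key(data, panel)
        scores.append(sum(key[i] == full[i] for i in full) / len(full))
    return sum(scores) / len(scores)
```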

Discussion points

• Two (or more) response-strategy "subcultures"?

• Confidence data can help CCT results.

• Quantitative model evaluation.

• Item + individual bias component?

• Automation and integration with other server-side processes (Python module vs. R, Matlab).

Results Summary:

✶ These preliminary results suggest that two novel approaches, piece-wise crowdsourcing and CCT data handling, can be used to accurately transcribe a large corpus of ethnographic data.

✶ By using internet-based methods, it appears we can avoid a 20+ year manual transcription job and derive an accurate and unbiased database of great value to investigations of concept formation across language groups.

✶ The economical way in which we modeled this perceptually-based transcription problem seems likely to generalize to other internet-based tasks that require extraction and evaluation of targets embedded in distracting information, and our novel use of CCT analyses seems promising for intelligently aggregating smaller subsets of crowdsourced responses to address large data-handling problems.

Thanks for Listening!!

Funding and Support for the archive project: Calit2 at UCI. University of California Pacific Rim Research Program, 2010-2015 (K.A. Jameson, PI). National Science Foundation 2014-2017 (#SMA-1416907, K.A. Jameson, PI). UCI's UROP Program Awards. IRB Approvals HS#2013-9921 and 2015-9047.
