Accurate Crowd-Labeling using Item Response Theory

Faiza Khan Khattak 1, Ansaf Salleb-Aouissi 1, Anita Raja 2
1 Columbia University, New York   2 The Cooper Union, New York
This work is partially funded by the National Science Foundation, award numbers IIS-1454855 and IIS-1454814.


Conclusion

•  A framework for crowd-labeling with detailed parameters.
•  Results show better and more stable performance compared to the state-of-the-art.
•  Plan to make the approach more fine-grained by adding variability in labeler ability [1,2,13].

Results

Simulated data:
•  Generated using fixed values of all parameters except the labeler log-odds π (Table I).
•  5,000 instances, four crowd labels per instance, 20 expert-labeled instances (an illustrative simulation of this setup is sketched after this list).
•  Accuracy (Table II).

RTE dataset:
•  Five crowd labels per instance, 20 expert-labeled instances.
•  Labeler error rate (Table III).
•  Accuracy (Table IV).
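A minimal, hypothetical sketch of this kind of simulated-data setup is shown below in Python. The 50-labeler pool, the normal ability distribution, and the 2PL-style correctness probability are assumptions for illustration; the authors' actual generator and the parameter values listed in Table I of the poster may differ.

```python
# Hypothetical simulation of IRT-style crowd labels (illustration only; the
# authors' generator and the parameter values in Table I may differ).
import numpy as np

rng = np.random.default_rng(0)
n_instances, labels_per_instance, n_labelers = 5000, 4, 50   # 50 labelers is an assumption

true_labels = rng.integers(0, 2, size=n_instances)        # hidden ground truth
pi = rng.normal(1.0, 1.0, size=n_labelers)                 # labeler log-odds ability (the varied parameter)
beta = np.zeros(n_instances)                               # instance difficulty (held fixed)
delta = np.ones(n_instances)                               # instance discrimination (held fixed)

crowd = np.full((n_instances, n_labelers), -1, dtype=int)  # -1 means "not labeled by this labeler"
for i in range(n_instances):
    for j in rng.choice(n_labelers, size=labels_per_instance, replace=False):
        p_correct = 1.0 / (1.0 + np.exp(-delta[i] * (pi[j] - beta[i])))
        is_correct = rng.random() < p_correct
        crowd[i, j] = true_labels[i] if is_correct else 1 - true_labels[i]
```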


Contact

Faiza Khan Khattak: [email protected]
Ansaf Salleb-Aouissi: [email protected]
Anita Raja: [email protected]

Our Approach

•  A Bayesian approach to crowd-labeling inspired by IRT.
•  New parameters, plus refinements of the usual IRT parameters, to fit the crowd-labeling scenario.
•  Crowd Labeling Using Bayesian Statistics (CLUBS).
•  Parameter estimation:
   §  Since true labels are unknown, expert-labeled instances (ground truth) are used for a small percentage of the data (usually 0.1%–10%) to estimate the parameters.
   §  The estimated parameters are then used to aggregate the multiple crowd labels for the rest of the dataset, where no ground truth is available.
   §  Difficulty (βi) and discrimination level (δi) of the instances without expert labels are calculated with formulas shown as figures in the poster (not reproduced here).
•  Label aggregation: Final label = (formula shown as a figure in the poster; an illustrative IRT-style rule is sketched after this list).
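As a purely illustrative sketch, assuming a 2PL-style correctness model with labeler log-odds πj, instance difficulty βi, discrimination δi, and crowd labels ℓij ∈ {−1, +1}, a log-odds-weighted aggregation rule could take the following form. This is not the CLUBS estimator, whose formulas appear only as figures in the original poster.

```latex
% Hypothetical 2PL-style correctness probability and log-odds-weighted vote.
% Illustration only; not the exact CLUBS formulas from the poster figures.
\[
  P(\ell_{ij} = y_i \mid \pi_j, \beta_i, \delta_i)
    = \frac{1}{1 + e^{-\delta_i (\pi_j - \beta_i)}},
  \qquad
  \hat{y}_i = \mathrm{sign}\Big( \sum_{j \in J_i} \delta_i (\pi_j - \beta_i)\, \ell_{ij} \Big).
\]
```

Under such a rule each label is scaled by the log-odds that the labeler gets this instance right; labelers whose ability falls below the instance difficulty receive negative weight, so their votes are effectively flipped.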

Problem

•  Crowd-labeling: non-expert humans label a large dataset; the task is tedious, time-consuming, and expensive if accomplished by experts alone.
•  Because crowd-labelers are non-experts, multiple labels are collected per instance for quality assurance and then combined into one final label.
•  Many research papers address this problem [4,5,8], but challenges remain unresolved.
•  Challenges:
   §  Getting an accurate final label when the crowd is of heterogeneous quality.
   §  Identifying the best ways to evaluate labelers.
   §  Choosing between per-class labeler ability and overall labeler ability.
   §  Can quantifying the prevalence of the class and/or the clarity of a labeling task/question [9] improve accuracy?

Our Hypothesis

•  Crowd-labeling ≈ test taking.
•  Item Response Theory (IRT) [12] is used to model student ability (e.g., in the GRE and GMAT).
•  IRT model: given in the poster as an equation figure (a standard form is sketched after this list).
•  The IRT model is a compelling framework for crowd labeling.
•  Similarity:
   •  The IRT model infers student- and question-related parameters, and the probability that answers are correct.
   •  The crowd-labeling process infers labeler- and data-instance-related parameters, and the probability that labels are correct.
•  Difference:
   •  For the IRT model, the correct answers are known.
   •  For crowd labeling, the ground truth must be inferred.
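For reference, the standard two-parameter logistic (2PL) IRT model gives the probability that student j answers item i correctly as a function of ability θj, item difficulty bi, and discrimination ai. The poster's own equation figure may use a different parameterization.

```latex
% Standard 2PL IRT model (the poster's equation figure is not reproduced here).
\[
  P(X_{ij} = 1 \mid \theta_j, a_i, b_i) = \frac{1}{1 + e^{-a_i (\theta_j - b_i)}}
\]
```

In the crowd-labeling analogue described above, θj corresponds to the labeler's log-odds ability π, while bi and ai correspond to the instance difficulty βi and discrimination δi.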

Experiments

•  Implementation: Stan programming language [12].
•  State-of-the-art methods: we compared our method to majority voting, Dawid and Skene [3], Expectation Maximization (EM), Karger's iterative method (KOS) [7], the Mean Field algorithm (MF), and Belief Propagation (BP) [10] (a toy contrast between majority voting and an ability-weighted vote is sketched below).
•  Datasets: (a) simulated dataset, (b) Recognizing Textual Entailment (RTE).
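To illustrate why ability-aware aggregation can differ from the simplest baseline, here is a minimal, hypothetical example. The ability values are made-up log-odds used only for illustration, not estimates produced by CLUBS or any of the methods above.

```python
# Toy example: majority vote vs. ability-weighted vote on a single instance.
# The ability values are hypothetical log-odds, used only for illustration.
import numpy as np

crowd_labels = np.array([1, 1, 0, 0, 0])        # five crowd labels for one instance
ability = np.array([2.0, 1.5, 0.2, 0.1, 0.1])   # assumed labeler log-odds abilities

majority_label = int(crowd_labels.mean() >= 0.5)      # 3 of 5 say 0 -> 0
signed_labels = 2 * crowd_labels - 1                  # map {0, 1} -> {-1, +1}
weighted_label = int(ability @ signed_labels > 0)     # the two reliable labelers say 1 -> 1

print(majority_label, weighted_label)                 # prints: 0 1
```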

References

[1] Maarten A. S. Boksem, Theo F. Meijman, and Monicque M. Lorist. 2005. Effects of mental fatigue on attention: An ERP study. Cognitive Brain Research 25, 1 (Sept. 2005), 107–116. DOI: http://dx.doi.org/10.1016/j.cogbrainres.2005.04.011
[2] Árpád Csathó, Dimitri van der Linden, Istvan Hernádi, Peter Buzas, and Agnes Kalmar. 2012. Effects of mental fatigue on the capacity limits of visual attention. Journal of Cognitive Psychology 24, 5 (Aug. 2012), 511–524.
[3] A. P. Dawid and A. M. Skene. 1979. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Applied Statistics, Vol. 28, 20–28.
Ofer Dekel and Ohad Shamir. 2009. Good learners for evil teachers. In International Conference on Machine Learning (ICML). 30.
[4] Pinar Donmez and Jaime G. Carbonell. 2008. Proactive learning: cost-sensitive active learning with multiple imperfect oracles. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08). ACM, New York, NY, USA, 619–628.
[5] Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning Whom to Trust with MACE. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Atlanta, Georgia.
[6] David Karger, Sewoong Oh, and Devavrat Shah. 2011. Iterative learning for reliable crowdsourcing systems. In Neural Information Processing Systems (NIPS), Granada, Spain.
[7] David R. Karger, Sewoong Oh, and Devavrat Shah. 2014. Budget-Optimal Task Allocation for Reliable Crowdsourcing Systems. Operations Research 62, 1 (2014), 1–24.
[8] Faiza Khan Khattak and Ansaf Salleb-Aouissi. 2013. Robust Crowd Labeling using Little Expertise. In Sixteenth International Conference on Discovery Science, Singapore.
[9] Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008. Crowdsourcing user studies with Mechanical Turk. In Proc. CHI 2008. ACM Press, 453–456.
[10] Qiang Liu, Jian Peng, and Alex Ihler. 2012. Variational Inference for Crowdsourcing. In Advances in Neural Information Processing Systems (NIPS).
[11] P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). 701–709. http://books.nips.cc/papers/files/nips25/NIPS2012_0328.pdf
[12] F. Lord. 1952. A Theory of Test Scores. Psychometric Monograph No. 7.
Stan Development Team. 2014. Stan Modeling Language Users Guide and Reference Manual, Version 2.5.0. http://mc-stan.org/
[13] Heikki Topi, Joseph S. Valacich, and Jeffrey A. Hoffer. 2005. The effects of task complexity and time availability limitations on human performance in database query tasks. International Journal of Human-Computer Studies 62, 3 (2005), 349–379.
[14] Jacob Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. 2009. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Neural Information Processing Systems (NIPS). 2035–2043.