Accurate Crowd-Labeling using Item Response Theory

Faiza Khan Khattak 1, Ansaf Salleb-Aouissi 1, Anita Raja 2
1 Columbia University, New York   2 The Cooper Union, New York
This work is partially funded by the National Science Foundation, award numbers IIS-1454855 and IIS-1454814.


Conclusion

•  A framework for crowd-labeling with detailed parameters.
•  Results show better and more stable performance compared to the state-of-the-art.
•  Plan to make the approach more fine-grained by adding variability in labeler ability [1,2,13].

Results

Simulated data:
•  Generated using fixed values of all parameters except the labeler log-odds π (Table I).
•  5,000 instances, four crowd labels per instance, 20 expert-labeled instances (an illustrative simulation of this setup is sketched after this list).
•  Accuracy (Table II).

RTE dataset:
•  Five crowd labels per instance, 20 expert-labeled instances.
•  Labeler error rate (Table III).
•  Accuracy (Table IV).
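A minimal, hypothetical sketch of this kind of simulated-data setup is shown below in Python. The 50-labeler pool, the normal ability distribution, and the 2PL-style correctness probability are assumptions for illustration; the authors' actual generator and the parameter values listed in Table I of the poster may differ.

```python
# Hypothetical simulation of IRT-style crowd labels (illustration only; the
# authors' generator and the parameter values in Table I may differ).
import numpy as np

rng = np.random.default_rng(0)
n_instances, labels_per_instance, n_labelers = 5000, 4, 50   # 50 labelers is an assumption

true_labels = rng.integers(0, 2, size=n_instances)        # hidden ground truth
pi = rng.normal(1.0, 1.0, size=n_labelers)                 # labeler log-odds ability (the varied parameter)
beta = np.zeros(n_instances)                               # instance difficulty (held fixed)
delta = np.ones(n_instances)                               # instance discrimination (held fixed)

crowd = np.full((n_instances, n_labelers), -1, dtype=int)  # -1 means "not labeled by this labeler"
for i in range(n_instances):
    for j in rng.choice(n_labelers, size=labels_per_instance, replace=False):
        p_correct = 1.0 / (1.0 + np.exp(-delta[i] * (pi[j] - beta[i])))
        is_correct = rng.random() < p_correct
        crowd[i, j] = true_labels[i] if is_correct else 1 - true_labels[i]
```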


Contact

Faiza Khan Khattak: [email protected]
Ansaf Salleb-Aouissi: [email protected]
Anita Raja: [email protected]

Our Approach

•  A Bayesian approach to crowd-labeling inspired by IRT.
•  New parameters, plus refinements of the usual IRT parameters, to fit the crowd-labeling scenario.
•  Crowd Labeling Using Bayesian Statistics (CLUBS).
•  Parameter estimation:
   §  Since true labels are unknown, expert-labeled instances (ground truth) are used for a small percentage of the data (usually 0.1%–10%) to estimate the parameters.
   §  The estimated parameters are then used to aggregate the multiple crowd labels for the rest of the dataset, where no ground truth is available.
   §  Difficulty (βi) and discrimination level (δi) of the instances without expert labels are calculated with formulas shown as figures in the poster (not reproduced here).
•  Label aggregation: Final label = (formula shown as a figure in the poster; an illustrative IRT-style rule is sketched after this list).
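As a purely illustrative sketch, assuming a 2PL-style correctness model with labeler log-odds πj, instance difficulty βi, discrimination δi, and crowd labels ℓij ∈ {−1, +1}, a log-odds-weighted aggregation rule could take the following form. This is not the CLUBS estimator, whose formulas appear only as figures in the original poster.

```latex
% Hypothetical 2PL-style correctness probability and log-odds-weighted vote.
% Illustration only; not the exact CLUBS formulas from the poster figures.
\[
  P(\ell_{ij} = y_i \mid \pi_j, \beta_i, \delta_i)
    = \frac{1}{1 + e^{-\delta_i (\pi_j - \beta_i)}},
  \qquad
  \hat{y}_i = \mathrm{sign}\Big( \sum_{j \in J_i} \delta_i (\pi_j - \beta_i)\, \ell_{ij} \Big).
\]
```

Under such a rule each label is scaled by the log-odds that the labeler gets this instance right; labelers whose ability falls below the instance difficulty receive negative weight, so their votes are effectively flipped.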

Problem

•  Crowd-labeling: non-expert humans label a large dataset; the task is tedious, time-consuming, and expensive if accomplished by experts alone.
•  Because crowd-labelers are non-experts, multiple labels are collected per instance for quality assurance and then combined into one final label.
•  Many research papers address this problem [4,5,8], but challenges remain unresolved.
•  Challenges:
   §  Getting an accurate final label when the crowd is of heterogeneous quality.
   §  Identifying the best ways to evaluate labelers.
   §  Choosing between per-class labeler ability and overall labeler ability.
   §  Can quantifying the prevalence of the class and/or the clarity of a labeling task/question [9] improve accuracy?

Our Hypothesis

•  Crowd-labeling ≈ test taking.
•  Item Response Theory (IRT) [12] is used to model student ability (e.g., in the GRE and GMAT).
•  IRT model: given in the poster as an equation figure (a standard form is sketched after this list).
•  The IRT model is a compelling framework for crowd labeling.
•  Similarity:
   •  The IRT model infers student- and question-related parameters, and the probability that answers are correct.
   •  The crowd-labeling process infers labeler- and data-instance-related parameters, and the probability that labels are correct.
•  Difference:
   •  For the IRT model, the correct answers are known.
   •  For crowd labeling, the ground truth must be inferred.
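For reference, the standard two-parameter logistic (2PL) IRT model gives the probability that student j answers item i correctly as a function of ability θj, item difficulty bi, and discrimination ai. The poster's own equation figure may use a different parameterization.

```latex
% Standard 2PL IRT model (the poster's equation figure is not reproduced here).
\[
  P(X_{ij} = 1 \mid \theta_j, a_i, b_i) = \frac{1}{1 + e^{-a_i (\theta_j - b_i)}}
\]
```

In the crowd-labeling analogue described above, θj corresponds to the labeler's log-odds ability π, while bi and ai correspond to the instance difficulty βi and discrimination δi.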

Experiments

•  Implementation: Stan programming language [12].
•  State-of-the-art methods: we compared our method to majority voting, Dawid and Skene [3], Expectation Maximization (EM), Karger's iterative method (KOS) [7], the Mean Field algorithm (MF), and Belief Propagation (BP) [10] (a toy contrast between majority voting and an ability-weighted vote is sketched below).
•  Datasets: (a) simulated dataset, (b) Recognizing Textual Entailment (RTE).
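To illustrate why ability-aware aggregation can differ from the simplest baseline, here is a minimal, hypothetical example. The ability values are made-up log-odds used only for illustration, not estimates produced by CLUBS or any of the methods above.

```python
# Toy example: majority vote vs. ability-weighted vote on a single instance.
# The ability values are hypothetical log-odds, used only for illustration.
import numpy as np

crowd_labels = np.array([1, 1, 0, 0, 0])        # five crowd labels for one instance
ability = np.array([2.0, 1.5, 0.2, 0.1, 0.1])   # assumed labeler log-odds abilities

majority_label = int(crowd_labels.mean() >= 0.5)      # 3 of 5 say 0 -> 0
signed_labels = 2 * crowd_labels - 1                  # map {0, 1} -> {-1, +1}
weighted_label = int(ability @ signed_labels > 0)     # the two reliable labelers say 1 -> 1

print(majority_label, weighted_label)                 # prints: 0 1
```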

References

[1] Maarten A. S. Boksem, Theo F. Meijman, and Monicque M. Lorist. 2005. Effects of mental fatigue on attention: An ERP study. Cognitive Brain Research 25, 1 (Sept. 2005), 107–116. DOI: http://dx.doi.org/10.1016/j.cogbrainres.2005.04.011
[2] Árpád Csathó, Dimitri van der Linden, Istvan Hernádi, Peter Buzas, and Agnes Kalmar. 2012. Effects of mental fatigue on the capacity limits of visual attention. Journal of Cognitive Psychology 24, 5 (Aug. 2012), 511–524.
[3] A. P. Dawid and A. M. Skene. 1979. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Applied Statistics, Vol. 28, 20–28.
Ofer Dekel and Ohad Shamir. 2009. Good learners for evil teachers. In International Conference on Machine Learning (ICML). 30.
[4] Pinar Donmez and Jaime G. Carbonell. 2008. Proactive learning: cost-sensitive active learning with multiple imperfect oracles. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08). ACM, New York, NY, USA, 619–628.
[5] Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning Whom to Trust with MACE. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Atlanta, Georgia.
[6] David Karger, Sewoong Oh, and Devavrat Shah. 2011. Iterative learning for reliable crowdsourcing systems. In Neural Information Processing Systems (NIPS), Granada, Spain.
[7] David R. Karger, Sewoong Oh, and Devavrat Shah. 2014. Budget-Optimal Task Allocation for Reliable Crowdsourcing Systems. Operations Research 62, 1 (2014), 1–24.
[8] Faiza Khan Khattak and Ansaf Salleb-Aouissi. 2013. Robust Crowd Labeling using Little Expertise. In Sixteenth International Conference on Discovery Science, Singapore.
[9] Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008. Crowdsourcing user studies with Mechanical Turk. In Proc. CHI 2008. ACM Press, 453–456.
[10] Qiang Liu, Jian Peng, and Alex Ihler. 2012. Variational Inference for Crowdsourcing. In Advances in Neural Information Processing Systems (NIPS).
[11] P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). 701–709. http://books.nips.cc/papers/files/nips25/NIPS2012_0328.pdf
[12] F. Lord. 1952. A Theory of Test Scores. Psychometric Monograph No. 7.
Stan Development Team. 2014. Stan Modeling Language Users Guide and Reference Manual, Version 2.5.0. http://mc-stan.org/
[13] Heikki Topi, Joseph S. Valacich, and Jeffrey A. Hoffer. 2005. The effects of task complexity and time availability limitations on human performance in database query tasks. International Journal of Human-Computer Studies 62, 3 (2005), 349–379.
[14] Jacob Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. 2009. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Neural Information Processing Systems (NIPS). 2035–2043.