
  • No. 5 Crowdsourcing

Yukino Baba (National Institute of Informatics), July 20, 2014, WWW 2014 reading group

  • 2

This session: quality-control methods for crowdsourcing, focusing in particular on how to select workers.

● Crowdsourcing in this session: a mechanism that presents multiple-choice questions (whose answers we want to know) to humans and collects their responses.

● Common challenge: some people give wrong answers, and we want to obtain as many correct answers as possible.
● Two families of methods have been proposed: aggregating the answers when the same question is asked of multiple people, and selecting workers who are likely to answer correctly (a minimal sketch of both ingredients follows below).
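For readers unfamiliar with these two ingredients, here is a minimal, illustrative sketch: plain majority voting over redundant answers, plus an accuracy-weighted variant. None of this code is from the papers, and all names are mine.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """answers: list of labels given by different workers for one question."""
    return Counter(answers).most_common(1)[0][0]

def weighted_vote(answers, accuracy):
    """answers: {worker_id: label}; accuracy: {worker_id: estimated accuracy}."""
    scores = defaultdict(float)
    for worker, label in answers.items():
        scores[label] += accuracy.get(worker, 0.5)  # unknown worker ~ coin flip
    return max(scores, key=scores.get)

print(majority_vote(["Bird", "Bird", "Fish"]))               # -> Bird
print(weighted_vote({"w1": "Bird", "w2": "Fish", "w3": "Fish"},
                    {"w1": 0.9, "w2": 0.6, "w3": 0.6}))      # -> Fish
```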

  • 3

Overview of paper 1: embed quiz-style tasks in search ads. Use the ad-serving machinery to assign tasks to the right people.

    • “Quizz: Targeted Crowdsourcing with a Billion (Potential) Users”

● Carried out by P. Ipeirotis (NYU) during his time at Google. ● Motivation: collect knowledge from humans to enrich the Knowledge Graph.

● Challenge: for specialized knowledge, only a limited set of people can answer (e.g., “What are the symptoms of Morgellons disease?”).

● Approach: embed the tasks in Google ads.

  • 4

Overview of paper 1: embed quiz-style tasks in search ads. Use the ad-serving machinery to assign tasks to the right people.


    Figure 3: Example ad to attract users

how to create engaging and viral crowdsourcing applications in a replicable manner. The emergence of paid crowdsourcing (e.g., Amazon Mechanical Turk) allows direct engagement of users in exchange for monetary rewards. However, the population of users who participate due to extrinsic rewards is typically different from the users who participate because of their intrinsic motivation. Quizz uses online advertising to attract unpaid users to

contribute. By running ads, we get into the middle ground between paid and unpaid crowdsourcing. Users who arrive at our site through an ad are not getting paid, and if they choose to participate they obviously do so because of their intrinsic motivation. This removes some of the wrong incentives and tends to alleviate concerns about indifferent users that “spam” the results just to get paid, or about workers that are trying to do the minimum work necessary in order to get paid. Thanks to the sheer reach of modern advertising platforms, the population of unpaid users can potentially be orders of magnitude larger than that in paid marketplaces. There are billions of users reachable through advertising, while even the biggest crowdsourcing platforms have at most a million users, many of them inactive [19, 18]. Therefore, if the need arises (and subject to budgetary constraints), our approach can elastically scale up to reach almost arbitrarily large populations of users, by simply increasing the budget allocated to the advertising campaign. At the same time, we show in Section 6 that our approach allows efficient use of the advertising budget (which is our only expenditure), and our overall costs are the same or lower than those in paid crowdsourcing installations. A significant additional benefit of using an advertising

system is its ability to target users with expertise in specific topics. For example, if we are looking for users possessing medical knowledge, we can run a simple ad like the one in Figure 3. To do so, we select keywords that describe the topic of interest and ask the advertising platform to place the ad in relevant contexts. In this study, we used Google AdWords2, and opted into both search and display ads, while in principle we can use any other publicly available advertising system.

Selecting appropriate keywords for an ad campaign is a challenging topic in itself [13, 1, 20]. However, we believe that trying to optimize the campaign only through manually fine-tuning its keywords is of limited utility. Instead, we propose to automatically optimize the campaign by quantifying the behavior of the users that clicked on the ad. A user who clicks on the ad but does not participate in the crowdsourcing application is effectively “wasting” our advertising budget; using the advertising terminology, such a user has not “converted.” Since we are not just interested in attracting any users but are interested in attracting users who contribute, we use Google Analytics3 to track user conversions. Every

2: https://adwords.google.com
3: http://www.google.com/analytics

time a user clicks on the ad and then participates in a quiz, we record a conversion event, and send this signal back to the advertising system. This way, we are effectively asking the system to optimize the advertising campaign for maximizing the number of conversions and thus increasing our contribution yield, instead of the default optimization for the number of clicks. Although optimizing for conversions is useful, it is even

better to attract competent users (as opposed to, say, users who just go through the quiz without being knowledgeable about the topic). That is, we want to identify users who are both willing to participate and possess the relevant knowledge. In order to give this refined type of feedback to the advertising system, we need to measure both the quantity and the quality of user contributions, and for each conversion event report the true “value” of the conversion. To achieve this aim, we set up Google Analytics to treat our site as an e-commerce website, and for each conversion we also report its value. Section 3 describes in detail our approach to quantifying the values of conversions.
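A hedged sketch of that feedback loop (the paper configures Google Analytics in e-commerce mode; the client class below is a hypothetical stand-in, not a real API):

```python
# Hypothetical sketch of the conversion-value feedback described above.
# `AnalyticsClient` is a stand-in, not a real Google Analytics API.
class AnalyticsClient:
    def report_conversion(self, user_id: str, value: float) -> None:
        # In the paper, this signal goes to Google Analytics configured
        # in e-commerce mode; here we only log it.
        print(f"conversion user={user_id} value={value:.3f}")

def conversion_value(num_answers: int, info_gain_per_answer: float) -> float:
    # Quantity times quality: the total information contributed (Section 3).
    return num_answers * info_gain_per_answer

AnalyticsClient().report_conversion("user-42", conversion_value(20, 0.35))
```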

When the advertising system receives fine-grained feedback about conversions and their value, it can improve the ad placement and display the ad to users who are more likely to participate and contribute high-quality answers. (In our experiments, in Section 6, this optimization led to an increase in conversion rate from 20% to over 50%, within a period of one month, for a campaign that was already well-optimized.) For example, consider medical quizzes. We initially believed that identifying users with medical expertise who are willing to participate in our system would be an impossible task. However, thanks to tracking conversions and modeling the value of user contributions, AdWords started displaying our ad on websites such as Mayo Clinic and HealthLine. These websites are not frequented by medical professionals but by prosumers. These users are both competent and are much more likely than professionals to participate in a quiz that assesses their medical knowledge; often, this is exactly the type of users that a crowdsourcing application is looking for.

3. MEASURING USER CONTRIBUTIONS

In order to understand the contributions of a user for each quiz, we first need to define a measurement strategy. Measuring the user contribution using just the number of answers is problematic, as it does not consider the quality of the submissions. Similarly, if we just measure the quality of the submitted answers, we do not incentivize participation. Intuitively, we want users to contribute high-quality answers, and also contribute many answers. Thus, we need a metric that increases as both quality and volume increase.

Information Gain: To combine both quality and quantity into a single, principled metric, we adopt an information-theoretic approach [36, 31]. We treat each user as a “noisy channel,” and measure the total information “transmitted” by the user during her participation. The information is measured as the information gain contributed for each answer, multiplied by the total number of answers submitted by the user; this is the total information submitted by the user. More formally, assume that we know the probability q that the user correctly answers a randomly chosen question of the quiz. Then, the information gain IG(q, n) is defined as:

IG(q, n) = H(1/n, n) − H(q, n)    (1)
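Equation (1) is easy to compute once H is fixed. The sketch below assumes the natural form for a quiz setting: H(q, n) is the entropy when the correct option of an n-choice question is chosen with probability q and the remaining mass is spread uniformly over the other n − 1 options. The excerpt does not reproduce this definition, so treat it as an assumption.

```python
from math import log2

def H(q, n):
    """Entropy of an n-choice answer: correct option with prob. q,
    remaining mass uniform over the other n - 1 options (assumed form)."""
    if q == 1.0:
        return 0.0
    if q == 0.0:
        return log2(n - 1)
    return -(q * log2(q) + (1 - q) * log2((1 - q) / (n - 1)))

def information_gain(q, n):
    # Equation (1): how much better the user does than random guessing.
    return H(1.0 / n, n) - H(q, n)

def total_contribution(q, n, num_answers):
    # Per-answer gain times volume, as described above.
    return num_answers * information_gain(q, n)

print(information_gain(0.25, 4))  # random guesser on 4 choices -> 0.0
print(information_gain(0.90, 4))  # competent user -> ~1.37 bits per answer
```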


    Figure 1: Screenshot of the Quizz system.

advertiser. In our case, we initiate the process with simple advertising campaigns but also integrate the ad campaign with the crowdsourcing application, and provide feedback to the advertising system for each ad click: The feedback indicates whether the user who clicked on the ad “converted” and the total contributions of the crowdsourcing effort. This allows the advertising platform to naturally identify websites with user communities that are good matches for the given task. For example, in our experiments with acquiring medical knowledge, we initially believed that “regular” Internet users would not have the necessary expertise. However, the advertising system automatically identified sites such as Mayo Clinic and HealthLine, which are frequented by knowledgeable consumers of health information who ended up contributing significant amounts of high-quality medical knowledge. Our idea is inspired by Hoffman et al. [17], who used advertising to attract users to a Wikipedia-editing experiment, although they did not attempt to target users nor attempted to optimize the ad campaign by providing feedback to the advertising platform. Once users arrive at our site, we need to engage them to

contribute useful information. Our crowdsourcing platform, Quizz, invites users to test their knowledge in a variety of domains and see how they fare against other users. Figure 1 shows an example question. Our quizzes include two kinds of questions: Calibration questions have known answers, and are used to assess the expertise and reliability of the users. On the other hand, collection questions have no known answers and actually serve to collect new information, and our platform identifies the correct answers based on the answers provided by the (competent) participants. To optimize how often to test the user, and how often to present a question with an unknown answer, we use a Markov Decision Process [29], which formalizes the exploration/exploitation framework and selects the optimal strategy at each point. As our analysis shows, a key component for the success

of the crowdsourcing effort is not just getting users to participate, but also to keep the good users participating for long, while gently discouraging low-quality users from participating. In a series of controlled experiments, involving tens of thousands of users, we show that a key advantage

Figure 2: An overview of the Quizz system. [Figure: an advertising campaign reaches Internet users through display ads and sponsored-search ads; each user click feeds back conversion and contribution signals; a user-contribution measurement module decides whether to serve a calibration question (known answer) or a collection question (uncertain answer).]

of attracting unpaid users through advertising is the strong self-selection of high-quality users to continue contributing, while low-quality users self-select to drop out. Furthermore, our experimental comparison with paid crowdsourcing (both paid hourly and paid piecemeal) shows that our approach dominates paid crowdsourcing both in terms of the quality of users and in terms of the total monetary cost required to complete the task. The contributions of this paper are fourfold. First, we

formulate the notion of targeted crowdsourcing, which allows one to identify crowds of users with desired expertise. We then describe a practical approach to find such users at scale by leveraging existing advertising systems. Second, we show how to optimally ask questions to the users, to leverage their knowledge. Third, we evaluate the utility of a host of different engagement mechanisms, which incentivize users to contribute more high-quality answers via the introduction of short-term goals and rewards. Finally, our empirical results confirm that the proposed approach allows us to collect and curate knowledge with accuracy that is superior to that of paid crowdsourcing mechanisms at the same or lower cost.

Figure 2 shows the overview of the system, and the various components that we discuss in the paper. Section 2 describes the use of advertising to target promising users, and how we set up the campaigns to allow for continuous, automatic optimization of the results over time. Section 3 shows the details of our information-theoretic scheme for measuring the expertise of the participants, while Section 4 gives the details of our exploration-exploitation scheme. Section 5 discusses our experiments on how to keep users engaged, and Section 6 gives the details of our experimental results. Finally, Section 7 describes related work, while Section 8 concludes.

2. ADVERTISING FOR TARGETING USERS

A key problem of every crowdsourcing effort is soliciting users to participate. At a fundamental level, it is always preferable to attract users that have an inherent motivation for participation. Unfortunately, repeating the successes of efforts such as Wikipedia, TripAdvisor, and Yelp seems more of an art than a science, and we do not yet fully understand


(Fig. 3 of the original paper) (Fig. 2 of the original paper)

A link to a quiz matched to the search query is embedded in the ad. This serves questions that match a worker's interests, and leaves it up to the worker whether to answer (self-selection).

  • 5

Method: use an MDP to decide actions, balancing exploration for good workers against exploitation.

● We want to obtain answers from workers whose ability is as high as possible.
● Two kinds of questions are prepared:
  Questions with known answers: used to estimate worker ability (exploration).
  Questions with unknown answers: used to collect knowledge from workers (exploitation).
● Based on each worker's answering record so far, an MDP decides whether to serve an exploration or an exploitation task (a rough sketch of this decision follows below).
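The paper formalizes this choice as a Markov Decision Process. The sketch below is only a one-step heuristic stand-in for that policy, under an assumed Beta posterior over the worker's accuracy: explore (calibration) while the ability estimate is still uncertain, exploit (collection) once it is pinned down.

```python
def choose_question(correct: int, wrong: int, std_threshold: float = 0.15) -> str:
    """correct/wrong: this worker's record on calibration questions so far.
    A heuristic stand-in for the paper's MDP policy, not the real thing."""
    a, b = correct + 1, wrong + 1                  # Beta(a, b) posterior
    var = a * b / ((a + b) ** 2 * (a + b + 1))     # posterior variance
    if var ** 0.5 > std_threshold:
        return "calibration"   # ability still uncertain: explore
    return "collection"        # ability pinned down: exploit

print(choose_question(0, 0))    # new worker        -> calibration
print(choose_question(12, 2))   # well-known worker -> collection
```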

  • 6

Results: correct answers obtained with 99% probability, at $0.16 per question.

● Workers' answers were compared against gold answers prepared in advance, evaluating estimated cost and accuracy ($0.1 per answer; multiple workers are asked until a fixed confidence level is reached). → Obtaining a correct answer with 99% probability costs $0.16 per question.

● Targeting (serving questions matched to the search query) was compared with untargeted serving. → Targeting produced higher-quality answers.

● Trend observed: the higher a worker's ability, the more answers they contribute. → Self-selection yielded high-quality results.

  • 7

Overview of paper 2: estimate worker ability using ability archetypes, and use the estimates for answer aggregation.

• “Community-Based Bayesian Aggregation Models for Crowdsourcing”
● Problem: aggregating answers when the same task is given to multiple workers.
● Prior work: aggregate answers taking each worker's ability into account, alternating between estimating the correct answers (= aggregation) and estimating abilities.
● This paper: use ability archetypes when estimating abilities.

Mammal  Bird  Bird  Bird  Fish  Bird

    ??

Aggregated result (the estimated correct answer)

  • 8

Method: model each worker's ability as generated from an ability archetype.

● Prior work: represent each worker's ability with a confusion matrix.

● This paper: group multiple workers by ability archetype.
  One community corresponds to each archetype.
  Each worker belongs to one of the communities.
  Each worker's confusion matrix is generated by adding noise to the confusion matrix of the community the worker belongs to (see the generative sketch below).

Probability of answering “+1” when the correct answer is “+1”

(Figure from http://crowdresearch.org/blog/?p=8971)
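As a concrete (and deliberately simplified) reading of that generative story, the sketch below draws a worker's confusion matrix around a community prototype using Dirichlet noise; the parameterization is an assumption of mine, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Community prototype: rows = true class, columns = answered class.
community_cm = np.array([[0.8, 0.1, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.2, 0.2, 0.6]])

def sample_worker_cm(community_cm, concentration=50.0):
    # Each row of the worker's confusion matrix is Dirichlet noise around
    # the community row; higher concentration = closer to the prototype.
    return np.vstack([rng.dirichlet(concentration * row) for row in community_cm])

worker_cm = sample_worker_cm(community_cm)
print(worker_cm.round(2))   # worker_cm[1, 1]: P(answer "+1" | truth "+1"), etc.
```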

  • 9

Results: the proposed method, which exploits ability archetypes, is effective especially when few answers are available.

● Experiments on four IR/NLP tasks: comparing query-expansion results, comparing search results, sentiment analysis, and judging whether a query is about adult content.

● Compared against majority voting and a method that ignores archetypes. → The proposed method is especially effective when few answers have been collected.

● The learned archetypes were easy to interpret. Example: a “conservative” community that uses only the 4s and 5s of a 5-point scale.

  • 10

Overview of paper 3: discover “good workers” from worker attributes.

• “The Wisdom of Minority: Discovering and Targeting the Right Group of Workers for Crowdsourcing”
● A worker's ability on a given task often depends on the worker's attributes (age, gender, etc.). Example: for a task about cosmetics for high-school girls, workers with “age = teens” and “gender = female” are likely to have high ability.

● The aim is to obtain correct answers by selecting good workers based on their attributes.

  • 11

Two approaches: estimate ability from a worker's attributes, or estimate the attributes that high-ability workers satisfy.

● The abilities of some workers have already been estimated using tasks with known answers.
● Two worker-selection approaches are proposed.
  Bottom-up: estimate ability from worker attributes.
    o A model in which ability is a linear combination of the attribute vector.
    o The weight of each dimension is learned.

  Top-down: estimate the attributes that high-ability workers satisfy.
    o Analysis of variance (ANOVA) is used to narrow down the important attributes.

…lab2. After fitting this fixed effect model to data gathered during the probing stage, we will learn the coefficients β̂ and a threshold τ0 on the predicted effect for a worker to be in the target group. For a worker with features (X1, X2, ..., Xt) in the targeting stage, we evaluate the effect of the subgroup he/she belongs to with

    τ̂ ← β̂0 + ∑_{k=1}^{t} β̂k Xk.

The worker will be qualified for the task if τ̂ > τ0.

Note that the accessibility parameter λ reflects roughly how many workers in the crowd satisfy the criterion, i.e., have a predicted effect greater than τ0. It thus controls how accessible the target group will be. For example, if λ is close to 0, then τ0 will be relatively large, and there might be only very few qualifying workers. In contrast, if λ is close to 1, τ0 will be small, and most workers will qualify.
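A compact sketch of this bottom-up rule (my own NumPy rendering; the paper fits the model with Matlab's linearmodel.fit): estimate β̂ by least squares on the probing data, then set τ0 to the (1 − λ) quantile of the predicted effects so that roughly a λ fraction of workers qualify.

```python
import numpy as np

def fit_bottom_up(X, tau, lam):
    """X: (M, t) worker feature matrix; tau: (M,) effects; lam in (0, 1)."""
    Xb = np.hstack([np.ones((len(tau), 1)), X])       # prepend intercept
    beta, *_ = np.linalg.lstsq(Xb, tau, rcond=None)   # beta-hat via least squares
    tau0 = np.quantile(Xb @ beta, 1.0 - lam)          # top-lambda cut-off
    return beta, tau0

def qualifies(x, beta, tau0):
    # tau-hat = beta0 + sum_k beta_k * x_k; qualified iff tau-hat > tau0.
    return beta[0] + x @ beta[1:] > tau0

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(100, 3)).astype(float)   # binary worker features
tau = 0.5 * X[:, 0] + rng.normal(0.0, 0.1, size=100)  # feature 0 drives ability
beta, tau0 = fit_bottom_up(X, tau, lam=0.3)
print(qualifies(np.array([0.0, 1.0, 1.0]), beta, tau0))  # -> False: low predicted effect
```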

4.2 Top-down Discovery Algorithm

The Bottom-up Discovery Algorithm directly solves the problem of predicting effect values for each worker group. However, it does not consider whether each feature is significant enough to affect worker reliability, so the fitted model may not reveal the true association between worker features and effects. For example, if in the probing stage there is one attribute that only very few workers have, e.g., {Education=PhD}, the bottom-up approach will still try to connect such a feature to the effect, which may not be stable. Therefore, we want to find a method that can generate more stable and interpretable results. The Top-down Discovery Algorithm described in this section is one such approach.

The general idea is that we should choose subgroups based on features that are significantly associated with worker effects. ANOVA (ANalysis Of VAriance) [23] based on the fixed effect model (6) is an appropriate tool for testing feature significance.

One remaining issue is that when there are multiple features in F, and each feature has multiple levels (multiple possible values), the number of workers that share the same features might be too small, especially when M is already very small (typically 100 or 200 in real crowdsourcing settings). Multiple-way ANOVA will be unstable in such cases. More importantly, for achieving interpretability and reducing the risk of over-fitting, we also hope that the output worker subgroups are not too many.

Based on the intuitions above, we propose to run one-way ANOVA sequentially on each feature and obtain the p-value pk for Fk based on the fixed effect model:

    τi ∼ β0 + β1 Xi^(k) + ϵ,  ∀i ∈ [M], Fk ∈ F.    (7)

Since not every feature will be strongly associated with the worker effect, we have to use a significance threshold psig to control the significance of each test, i.e., the p-value. It is common to choose 0.10 or 0.05 as the significance threshold, and we use psig = 0.10 as the default in our method.

Similar to the Bottom-up Discovery Algorithm, we need an accessibility parameter λ to ensure that the size of the target group is more than 100λ% of the crowd.

Algorithm 2 gives the detailed description of the Top-down Discovery Algorithm. The general steps are: (1) sequentially testing whether features are significantly associated with worker effects, (2) splitting the crowd using the most significant feature, and (3) picking the subgroup with the highest average effect, then going back to (1) to check whether the group should be further partitioned.

2: http://www.mathworks.com/help/stats/linearmodel.fit.html

Algorithm 2 Top-down Discovery Algorithm

Input: Feature pool F = {F1, ..., Ft}; M workers with effects {τ1, ..., τM} and feature vectors {X1, ..., XM}, where Xi = (Xi^(1), ..., Xi^(t)); accessibility parameter λ ∈ (0, 1); significance level psig, with default value 0.1.

1: Initialization: current feature pool Fcurrent ← F and current crowd Scurrent ← [M]; Fout ← ∅ and Lout ← ∅.
2: repeat
3:   for feature Fk in Fcurrent do
4:     Compute one-way ANOVA on feature Fk with τi ∼ β0 + β1 Xi^(k) + ϵ, and obtain the p-value pk.
5:   end for
6:   k* ← argmin { pk | Fk ∈ Fcurrent }.
7:   Suppose Fk* has n levels {L1^(k*), ..., Ln^(k*)}, which partition Scurrent into {S1, ..., Sn}. Then, ∀l ∈ [n], compute the average effect El ← (1/|Sl|) ∑_{i∈Sl} τi.
8:   l* ← argmax_{l∈[n]} { El }; Scurrent ← { i | Xi^(k*) = L_{l*}^(k*) }.
9:   if |Scurrent|/M > λ or pk* ≤ psig then
10:    Fout ← Fout ∪ {Fk*} and Lout ← Lout ∪ {L_{l*}^(k*)}.
11:    Fcurrent ← Fcurrent \ {Fk*}.
12:  else
13:    Stop and return Fout and Lout.
14:  end if
15: until Fcurrent = ∅

Output: Target feature pool Fout and feature levels Lout.
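For concreteness, here is a compact sketch of Algorithm 2 (my own rendering, with scipy's f_oneway standing in for the one-way ANOVA of model (7); the line-9 stopping condition is transcribed from the excerpt as-is):

```python
import numpy as np
from scipy.stats import f_oneway

def top_down(features, tau, lam=0.2, p_sig=0.1):
    """features: {name: length-M array of levels}; tau: length-M effects."""
    M = len(tau)
    keep = np.ones(M, dtype=bool)        # S_current as a boolean mask
    remaining = set(features)            # F_current
    out = {}                             # F_out / L_out as {feature: level}
    while remaining:
        # Lines 3-5: one-way ANOVA per remaining feature on the current crowd.
        pvals = {}
        for name in remaining:
            levels = np.unique(features[name][keep])
            groups = [tau[keep & (features[name] == lvl)] for lvl in levels]
            if len(groups) >= 2:
                pvals[name] = f_oneway(*groups).pvalue
        if not pvals:
            break
        best = min(pvals, key=pvals.get)              # line 6
        # Lines 7-8: keep the level with the highest average effect.
        levels = np.unique(features[best][keep])
        means = {lvl: tau[keep & (features[best] == lvl)].mean() for lvl in levels}
        lvl_star = max(means, key=means.get)
        new_keep = keep & (features[best] == lvl_star)
        # Lines 9-14: stopping rule as in the excerpt.
        if new_keep.sum() / M > lam or pvals[best] <= p_sig:
            out[best] = lvl_star
            keep = new_keep
            remaining.discard(best)
        else:
            break
    return out

rng = np.random.default_rng(2)
feats = {"Major": np.array(["Science"] * 60 + ["Arts"] * 40),
         "Gender": np.array(["F", "M"] * 50)}
tau = np.where(feats["Major"] == "Science", 0.8, 0.5) + rng.normal(0, 0.05, 100)
print(top_down(feats, tau))   # e.g. {'Major': 'Science', 'Gender': ...}
```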

Figure 3: Illustration of the group discovery algorithm on a running example. Suppose we have 100 workers and only two features, “Gender” and “Major”. Each circle represents a group of workers; the integer in each circle is the number of workers in the group, and the real number is the group's average worker effect. Since the feature “Major” has a smaller p-value than “Gender” (it is more significant), the algorithm splits the crowd on “Major” and chooses the worker group with the highest average effect, “Science”. It then continues with the rest of the features. In the end, the algorithm outputs “Major = Science” and “Gender = Female” as the target group.


As an illustration, Figure 3 shows a running example of this algorithm. Suppose we have hired 100 workers in the probing stage, the feature pool we choose to run the Top-down Discovery Algorithm on is F = {Major, Gender}, and the feature “Major” has two levels, {“Science”, “Arts”}. We choose the accessibility parameter λ = 0.20 and the default


(Fig. 3 of the original paper)

In this example, “Major=Science, Gender=Female” are identified as the important attributes.

  • 12

Results: selecting workers by attributes and then aggregating their answers achieves high accuracy on multiple tasks.

● Experiments on three tasks: quiz questions, word-sense judgment, and textual-entailment judgment.
● Proposed method: keep only the answers from the “good workers,” then apply an ability-aware answer-aggregation method (a pipeline sketch follows below).

● Majority voting, aggregation without worker selection, and the proposed method were compared. → The proposed method is the most accurate on every task; whether Bottom-up or Top-down is better depends on the task.

● Estimated important attributes: “Major=Science” for the quiz task; “Major=Science” and “Major=Engineering” for the word-sense task. “Education” was estimated to be unimportant in all tasks.
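To close, a sketch of the pipeline this slide summarizes (all names are mine, and the simple weighted vote stands in for the paper's more sophisticated ability-aware aggregation): drop answers from workers outside the discovered target group, then aggregate what remains.

```python
from collections import defaultdict

def filter_then_aggregate(answers, profiles, target, accuracy):
    """answers: {worker_id: label}; profiles: {worker_id: {feature: level}};
    target: {feature: level} from group discovery; accuracy: {worker_id: q}."""
    kept = {w: a for w, a in answers.items()
            if all(profiles[w].get(f) == lvl for f, lvl in target.items())}
    scores = defaultdict(float)
    for w, label in kept.items():            # ability-weighted vote
        scores[label] += accuracy.get(w, 0.5)
    return max(scores, key=scores.get) if scores else None

print(filter_then_aggregate(
    {"w1": "yes", "w2": "no", "w3": "no"},
    {"w1": {"Major": "Science"}, "w2": {"Major": "Science"}, "w3": {"Major": "Arts"}},
    {"Major": "Science"},
    {"w1": 0.9, "w2": 0.7}))   # w3 filtered out; w1 outweighs w2 -> "yes"
```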