How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

1. How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval. @julian_urbano (University Carlos III of Madrid), J. Stephen Downie (University of Illinois at Urbana-Champaign), Brian McFee (University of California at San Diego), Markus Schedl (Johannes Kepler University Linz). ISMIR 2012, Porto, Portugal, October 9th. Picture by Humberto Santos.

2. Let's review two papers.

3-4. Statistically significant. Paper A: +0.14*. Paper B: +0.21. Which one should get published? a.k.a. which research line should we follow?

5-6. Paper A: +0.14*. Paper B: +0.14*. Which one should get published? a.k.a. which research line should we follow?

7. Goal of Comparing Systems: find out the effectiveness difference Δ (for an arbitrary query and an arbitrary user). Impossible! It requires running the systems for the universe of all queries. [plot: effectiveness scale from -1 to 1]

8-12. What evaluations can do: estimate Δ with the average over a sample of queries. There is always random error, so we need a measure of confidence. [plot: effectiveness scale from -1 to 1]

13-14. The Significance Drill. Test these hypotheses: H0: Δ = 0 vs. H1: Δ ≠ 0. Result of the test: p-value = P(a difference at least as large as the observed one | H0). Interpretation of the test: if the p-value is very small, reject H0; otherwise, accept H0.

15. The Significance Drill. Test these hypotheses: H0: Δ = 0 vs. H1: Δ ≠ 0. We accept/reject H0 (based on the p-value and α), not the test!

16. Usual (wrong) conclusions: A is substantially better than B. A is much better than B. The difference is important. The difference is significant.

17-18. What does it mean? That there is a difference (unlikely to be due to chance/random error). We don't need fancy statistics for that: we already know they are different!

19. H0: Δ = 0 is false by definition, because systems A and B are different to begin with.

20-21. What is really important? The effect-size: the magnitude of Δ. This is what predicts user satisfaction, not p-values. Δ = +0.6 is a huge improvement; Δ = +0.0001 is irrelevant, and yet it can easily be statistically significant.

22-25. Example: t-test. The larger the statistic t = d̄ / (s_d/√n), where d̄ is the mean per-query difference, s_d its standard deviation and n the number of queries, the smaller the p-value. How to achieve statistical significance? a) Reduce the variance. b) Further improve the system. c) Evaluate with more queries! (See the simulation sketch after slide 29 below.)

26. Statistical significance is eventually meaningless: all you have to do is use enough queries.

27. Practical significance: the effect-size, i.e. effectiveness / satisfaction. Statistical significance: the p-value, i.e. confidence. An improvement may be statistically significant, but that doesn't mean it's important!

28. The real importance of an improvement.

29. Purpose of Evaluation: How good is my system? Is system A better than system B? We measure system effectiveness. [plots: effectiveness scales from 0 to 1 and from -1 to 1]
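To make the point of slides 22-26 concrete, here is a minimal sketch (not the authors' code, and using simulated data rather than MIREX judgments): per-query score differences are drawn with a fixed, practically irrelevant true improvement, and the p-value of a paired (one-sample) t-test on those differences tends to shrink as the number of queries grows.

```python
# Hypothetical illustration: a tiny, fixed effect becomes "statistically
# significant" once enough queries are evaluated, even though the effect
# size itself never changes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_delta = 0.001      # assumed true improvement: far too small to matter to users
per_query_sd = 0.05     # assumed per-query variability of the difference

for n_queries in (50, 500, 5000, 50000):
    # d[i] = score_A(query i) - score_B(query i)
    d = rng.normal(loc=true_delta, scale=per_query_sd, size=n_queries)
    t_stat, p_value = stats.ttest_1samp(d, popmean=0.0)
    print(f"n={n_queries:6d}  mean diff={d.mean():+.4f}  p={p_value:.4f}")
```

Because t = d̄ / (s_d/√n), the statistic grows roughly with √n for a fixed effect and variance, so with enough queries the p-value eventually drops below any α.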
30-35. Assumption: system effectiveness corresponds to user satisfaction. This is our ultimate goal! Does it? How well? [plot: user satisfaction vs. system effectiveness]

36. How we measure system effectiveness. Similarity scale (we normalize to [0, 1]): Broad: 0, 1 or 2; Fine: 0, 1, 2, ..., 100. Effectiveness measure: AG@5 ignores the ranking; nDCG@5 discounts by rank. What correlates better with user satisfaction? (A minimal sketch of both measures appears after slide 66 below.)

37-41. Experiment. [figure builds: known effectiveness, user preference, non-preference]

42. What can we infer? Preference (difference noticed by the user): positive means the user agrees with the evaluation; negative means the user disagrees with the evaluation. Non-preference (difference not noticed by the user): good means both systems are satisfying; bad means both systems are unsatisfying.

43. Data: clips and similarity judgments from MIREX 2011 Audio Music Similarity. Random and artificial examples. Query: selected randomly. System outputs: random lists of 5 documents. 2200 examples for 73 unique queries, 2869 unique lists with 3031 unique clips; a balanced and complete design.

44. Subjects: crowdsourcing, a cheap, fast and diverse pool of subjects. 2200 examples, $0.03 per example. Quality control: trap examples (known answers) and a worker pool.

45. Results: 6895 total answers from 881 workers in 62 countries. 3393 accepted answers (41%) from 100 workers (87% of workers rejected!); 95% average quality when accepted.

46-51. How good is my system? 884 non-preferences (40%). What do we expect? A linear mapping between effectiveness and satisfaction. What do we have? [plots] There is room for ~20% improvement with personalization.

52-60. Is system A better than B? 1316 preferences (60%). What do we expect? That users always notice the difference, regardless of how large it is. What do we have? [plots] Differences of >0.3 and >0.4 are needed for >50% of users to agree; the Fine scale is closer to the ideal 100%. Do users prefer the (supposedly) worse system?

61. Statistical significance has nothing to do with this.

62. [Picture by Ronny Welter]

63-64. Reporting Results: confidence intervals / variance. Report 0.584 ± 0.023, not just 0.584. This is an indicator of evaluation error and gives a better understanding of expected user satisfaction.

65-66. Reporting Results: actual p-values. Report +0.037 ± 0.031 (p=0.02), not +0.037 ± 0.031*. Statistical significance is relative: α=0.05 and α=0.01 are completely arbitrary; it depends on the context, the cost of Type I errors and of implementation, etc.
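For the "Reporting Results" slides (63-66), here is a rough sketch of what a fuller report could look like in code, assuming the per-query differences are at hand; the numbers below are made up, not taken from the MIREX evaluation.

```python
# Sketch of a fuller report: effect size, 95% confidence interval and the
# actual p-value, instead of a bare asterisk at an arbitrary alpha.
import numpy as np
from scipy import stats

# Hypothetical per-query effectiveness differences between two systems.
d = np.array([0.10, -0.02, 0.05, 0.08, 0.00, 0.12, -0.04, 0.07, 0.03, 0.06])

mean = d.mean()
sem = stats.sem(d)                                        # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(d) - 1, loc=mean, scale=sem)
t_stat, p_value = stats.ttest_1samp(d, popmean=0.0)

print(f"delta = {mean:+.3f} ± {(ci_high - ci_low) / 2:.3f} (p = {p_value:.3f})")
```

Readers can then judge both the practical importance (the effect size and its interval) and the confidence (the exact p-value) for themselves, under whatever α their context justifies.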
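And going back to slide 36, the contrast between AG@5 and nDCG@5 can be sketched as follows. The exact MIREX formulation is not given on the slides, so the log2 discount and the normalization used here are common choices rather than the official definition.

```python
# AG@5 vs nDCG@5 on the same judged list: AG ignores the ranking, nDCG
# discounts gains by rank (one common formulation; the official MIREX
# definition may differ in details).
import math

def ag_at_k(gains, k=5):
    """Average gain of the top-k results, order ignored."""
    return sum(gains[:k]) / k

def dcg_at_k(gains, k=5):
    # Rank 1 undiscounted, rank r discounted by log2(r + 1).
    return sum(g / math.log2(r + 1) for r, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k=5):
    """DCG normalized by the DCG of the ideal (re-sorted) ranking."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Fine-scale judgments (0..100) normalized to [0, 1], as on the slides.
gains = [j / 100 for j in (80, 20, 95, 10, 60)]
print(ag_at_k(gains), ndcg_at_k(gains))  # same gains, different sensitivity to rank
```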
67. Let's review two papers (again).

68. Paper A: +0.14*. Paper B: +0.21. Which one should get published? a.k.a. which research line should we follow?

69-70. Paper A (500 queries): +0.14 ± 0.03 (p=0.048). Paper B (50 queries): +0.21 ± 0.02 (p=0.052). Which one should get published? a.k.a. which research line should we follow?

71. Paper A: +0.14*. Paper B: +0.14*. Which one should get published? a.k.a. which research line should we follow?

72-73. Paper A (cost=$500,000): +0.14 ± 0.01 (p=0.004). Paper B (cost=$50): +0.14 ± 0.03 (p=0.043). Which one should get published? a.k.a. which research line should we follow?

74. Effect-sizes are indicators of user satisfaction: we need to personalize results, and small differences are not noticed. p-values are indicators of confidence: beware of the collection size, and provide full reports.

75. "The difference between 'significant' and 'not significant' is not itself statistically significant." A. Gelman & H. Stern
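As a closing back-of-envelope for the paper A / paper B comparisons above, this sketch shows how the same mean improvement with the same per-query spread (both assumed here, not the slides' actual figures) translates into very different p-values depending only on how many queries were evaluated.

```python
# Same effect size, same per-query variability: the p-value (and hence the
# asterisk) is driven largely by the number of queries. Numbers are assumed
# for illustration.
import math
from scipy import stats

def two_sided_p(mean_diff, per_query_sd, n_queries):
    """p-value of a one-sample (paired) t-test from summary statistics."""
    se = per_query_sd / math.sqrt(n_queries)
    t = mean_diff / se
    return 2 * stats.t.sf(abs(t), df=n_queries - 1)

for n in (10, 50, 500):
    p = two_sided_p(mean_diff=0.14, per_query_sd=0.5, n_queries=n)
    print(f"n={n:3d}  delta=+0.14  p={p:.3f}")
```

Which is exactly why a bare asterisk says more about the collection size and the variance than about the improvement a user would actually notice.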