Exploring Our World With Freebase

DESCRIPTION

I gave this talk on Oct 2 at the Semantic Technology and Business conference. In it I discuss how I process Freebase data with the open source Infovore framework, which handles Freebase and other RDF data quickly by using Hadoop, Map/Reduce, and Amazon Web Services.

Text of Exploring Our World With Freebase

  • 1. Exploring Our World With Freebase Paul Houle paul@ontology2.com

  • 2. Generic Databases
  • 3. Where does the data come from? Copyright 2009 CC-BY by Richard Heaven. Robot Image Copyright 2007 CC-BY by Crispin Summers
  • 4. Google Knowledge Graph
  • 5. The Wikipedia Data Ecosystem
  • 6. API, RDF Dereferencing, Quad Dump, Simple Topic Dump, Type Tables
  • 7. MQL
    { "status": "200 OK", "code": "/api/status/ok", "result": { "type": "/music/artist", "name": "The Police", "album": [ "Outlandos d'Amour", "Reggatta de Blanc", "Zenyatta Mondatta", "Ghost in the Machine", "Synchronicity" ] } }
  • 8-10. My path to the semantic web
  • 11. Infovore 1: Quad Dump, Simple Topic Dump, :BaseKB Pro, :BaseKB Lite
  • 12. Spring 2012
  • 13. Fall 2012: Quad Dump, Official RDF Dump. Infovore 1.0 released as open source under Apache License
  • 14. 13+ million Invalid Facts. Image cc-by from arj03
  • 15. Infovore 1.0: Quad Dump -> RDF. Infovore 1.1: General RDF Cleanup & Filtering. Millipede framework: Map/Reduce on a single computer
  • 16. Infovore 2
  • 17. What does Freebase cover?
  • 18. Is it a bibliographic database?
  • 19. Ahead of their time? Reading Room, Library of Congress
  • 20. MARC: in electronic form since 1969! First standard data format with variable length fields & I18N.
  • 21. Now everybody has a bibliographic database
  • 22. Or, do documents annotate the world?
  • 23. Social Semantic Systems: Linked Data, User-Generated Content
  • 24. The dominant paradigm: Triple store
  • 25. How to break your triple store http://gen5.info/q/2009/02/25/putting-freebase-in-a-star-schema/
  • 26. The RDF data warehouse: ETL, warehouse, operations, development, science
  • 27. The RDF data warehouse II: warehouse, Operations tools, Science Tools
  • 28. Latency: low is not low enough
  • 29. operations, development, science
  • 30. Tools Popular With :BaseKB Users (bar chart, axis 0 to 60): Freebase; DBpedia; any relational database; machine learning; Jena; Amazon Web Services; PHP; map/reduce frameworks (ex. Hadoop); MongoDB; Sesame; Virtuoso OpenLink; other NoSQL database; Solid State Drives (SSD); other cloud computing service; Neo4J; Ruby; Drupal; alternative JVM languages (ex. Scala or Clojure); other triple store; any key/value store (ex. JDBM or Berkeley DB); OWLIM; Allegrograph; 4store; Factual; dotNetRDF; Stardog; Kasabi/Talis Platform; Oracle Spatial RDF
  • 31. Map/Reduce: Inputs, Mappers, Shuffle, Sort, Reducers, Output
  • 32. RDF: Reduction on Subject
    :Goat :Bear :Alligator :Iguana :Dog :Elephant :Cat :Horse :Fox
    :Alligator :Dog :Goat :Bear :Elephant :Horse :Cat :Fox :Iguana
  • 33. Jena Framework: SDB (relational db-based triple store), TDB (native disk-based triple store), Model (in-memory triple store). "We use Jena Models like PHP programmers use hashtables" -- Kendall Clark, Clark and Parsia
  • 34. Hadoop Physical Architecture: Namenode, Jobtracker, Datanodes & Tasktrackers, HDFS
  • 35. My development cluster: Namenode/JobTracker
  • 36. Hadoop tolerates Hardware failures
  • 37. My other computer is
  • 38. Amazon Elastic Map/Reduce, Amazon S3 (Permanent Storage)
  • 39. "It's harder to make up names for things than to invent them" - Tom Swift, Fictional American Inventor
  • 40. Infovore modules: bakemono, haruhi, centipede, chopper
  • 41. Bakemono: Super JAR
  • 42. Bakemono: Super JAR. Contains applications like freebaseRDFPrefilter, pse3, ranSample, sieve3. Named after Japanese word for monsters
  • 43. Haruhi: (1) Japanese religious word for "Full of Spirit"; (2) a very dominant person
  • 44. Unpacking the Freebase RDF Dump. Photograph Copyright 2010 Ian Munroe CC-BY SA
  • 45. Eliminate Bulk Up Front: BIG DATA
  • 46. Eliminate Bulk Up Front: DATA
  • 47. Inputs, Mappers
  • 48. freebaseRDFPrefilter removes Wasteful Facts: 120M+ copies of the a predicate; 60M+ access control predicates. Violent and Dangerous facts:
    ns:common.topic ns:type.type.instance ?o .
    is repeated 30M times, and if you group on ?s and keep them in memory
  • 49. uneven bin distribution (bins 330, 331, 332, 333, 334, 335)
  • 50. Prefiltering stops memory exhaustion before it happens!
  • 51. Parallel Super Eyeball: triples -> valid triples, junk. Currently, 250,000 or so triples in Freebase are rejected by PSE3
  • 52. Parallel Super Eyeball 3
  • 53. Sieve3
    literal facts (ex. ?s ?p 55. )
    ?s :a ?p .
    ?s ?p ns:some_topic .
    ?s rdfs:label ?o .
  • 54. Horizontal Decomposition of Freebase
  • 55. Percentage of gz compressed size: a 5%, description 18%, key 11%, keyNs 13%, label 6%, name 6%, notability 0%, nfp 0%, text 8%, web 6%, links 20%, other 7%
  • 56. Percentage of facts: a 16%, description 1%, key 9%, keyNs 11%, label 6%, name 6%, notability 2%, nfp 2%, text 0%, web 5%, links 32%, other 10%
  • 57. Percentage of uncompressed size: a 15%, description 7%, key 8%, keyNs 9%, label 4%, name 4%, notability 2%, nfp 1%, text 3%, web 6%, links 30%, other 11%
  • 58. rdf:type aka a: 16% facts, 15% bytes, 5% compressed bytes
    ns:m.02qvftw rdf:type ns:business.employer .
  • 59. RDFS Inference: :a :Actor ?
  • 60. RDFS Inference: Jesse Plemons, Todd
  • 61. :a :Actor . Jesse Plemons, Todd: implies
  • 62. Descriptions: 1% facts, 18% bytes, 7% compressed
  • 63-65. Descriptions
    ns:m.010bfy ns:common.topic.description "Riverside \u00E9 uma cidade localizada no estado norte-americano de Texas, no Condado de Walker."@pt .
    ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en .
    This does not compute!
  • 66. Labels and Names
    ns:american_football.football_division rdfs:label "American football division"@en .
    ns:american_football.football_conference rdfs:label "Grupper inom amerikansk fotboll"@sv .
    ns:american_football.football_player ns:type.object.name "Football-Spieler"@de .
    ns:american_football.football_team ns:type.object.name "American football-team"@nl .
  • 67. Freebase Labels Are Not Unique
  • 68. DBpedia Labels Are Unique
  • 69. https://github.com/paulhoule/infovore/wiki https://groups.google.com/forum/#!forum/infovore-basekb
  • 70. Keys in the Freebase dump: Most objects represented by mid identifiers
  • 71. Keys in the Freebase dump: Schema objects have friendly identifiers
  • 72. Keys in the Freebase dump
  • 73. Examples
    ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en .
    ns:american_football.football_division rdfs:label "American football division"@en .
    Freebase always uses the same key in the ?s, ?p, and ?o fields, but...
  • 74. It wasn't always this way: the old quad dump used mids in the subject field, but others in the destination field
  • 75. Turtle0, Turtle1, Turtle2, Turtle3: Extract namespace graph; Convert all identifiers to mids; Extract type information from schema; Convert to RDF types; :BaseKB 2012
  • 76. Freebase Knows Many Keys
    ns:g.11vk55hmr ns:type.object.key "/base/dspl/us_census/population/place" .
    ns:m.010004m ns:type.object.key "/authority/musicbrainz/339a2897-9ba4-4820-a2a8-f234c22608a4" .
    ns:Lm.01003_ ns:type.object.key "/wikipedia/de/Krum_$0028Texas$0029" .
    ns:m.01010d ns:type.object.key "/wikipedia/en_id/135860" .
    ns:m.0100_b ns:type.object.key "/authority/gnis/1352653" .
    ns:m.0100l2 ns:type.object.key "/authority/hud/countyplace/4814101390" .
    ns:m.01031l ns:type.object.key "/en/chandler_texas" .
    ns:m.015g9m ns:type.object.key "/en/aliens_from_space" .
    ns:m.015gdl ns:type.object.key "/en/self-publishing" .
    ns:m.015gjr ns:type.object.key "/authority/nndb/231$002F000085973" .
    and type.object.key spells them out
  • 77. A directed acyclic graph: /m/01 root; /m/019s wikipedia; /m/047w32v authority; /m/0gt9 en; /m/05x_rjr Geoff_Simmons. /wikipedia/en/Geoff_Simmons = /authority/wikipedia/en/Geoff_Simmons
  • 78. key: namespace encodes the graph
    ns:m.010005 key:wikipedia.pt "Corinth_$0028Texas$0029" .
    ns:m.010005h key:authority.musicbrainz "ab0b82ce-d1be-4641-b0d1-838896a25887" .
  • 79. Useful external keys
  • 80. Music
  • 81. http://www.freebase.com/authority/musicbrainz/e217a1e9-9ec8-4e88-aebc-7d6b720384c1
  • 82. Musical Composition, Recording. Recording appears on Album as track #
  • 83. Functional Requirements For Bibliographic Records (FRBR)
  • 84. Nick Hexum, Rap Rock, 311, Omaha, NE, Los Angeles, CA
  • 85. Unique data in DBpedia
  • 86. Wikipedia Categories
  • 87. Wikipedia Page Links
  • 88. Smushing
    dbpedia:Striated_Heron :linksTo dbpedia:Heron .
    dbpedia:Striated_Heron owl:sameAs ns:m.01v7dp .
    dbpedia:Heron owl:sameAs ns:m.01jgnh .
    ns:m.01v7dp :linksTo ns:m.01jgnh .
  • 89. Duck Types: ?a performed on music track ?b -> ?a is a musician
  • 90. Duck Types: ?a employed ?b -> ?a is an employer
  • 91. Duck Types: Book ?a was written about ?b -> ?b is a book subject
  • 92. The Problem of Notability
  • 93. ns:m.0100007 ns:common.topic.notable_types ns:m.0kpv11.
    ns:m.01000_r ns:common.topic.notable_types ns:m.0kpv11.
    ns:m.01000dh ns:common.topic.notable_types ns:m.09jd9nh.
    ns:m.01000pp ns:common.topic.notable_types ns:m.09jd9nh.
    ns:m.01000px ns:common.topic.notable_types ns:m.0kpv11.
    ns:m.01000w ns:common.topic.notable_types ns:m.01m9.
    ns:m.01000yk ns:common.topic.notable_types ns:m.0kpv11.
    ns:m.010012t ns:common.topic.notable_types ns:m.0kpv11.
    ns:m.010014_ ns:c