BigData and Modern XML Jim Fuller email: [email protected] twitter: @xquery Senior Engineer, Europe 19/09/12

XML Amsterdam 2012 Keynote

DESCRIPTION

Defining Modern XML that works with Bigdata

  • 1. BigData and Modern XML. Jim Fuller, email: [email protected], twitter: @xquery. Senior Engineer, Europe. 19/09/12

  • 2. Senior engineer; http://jim.fuller.name; http://exslt.org; @xquery; XSLT UK 2001; http://www.xmlprague.cz; @perl6; Perlmonks Pilgrim
  • 3. Kickoff: XML current status; Modern XML & BigData
  • 4. "Ontogeny recapitulates phylogeny", or A (very) Brief History of ML: Late 1950s: Noam Chomsky, generative grammars; 1969: Charles Goldfarb (w/ Ed Mosher and Ray Lorie) created GML; 1986: SGML formalized; 1998: XML 1.0 W3C recommendation; 1998-2012: a lot of stuff happened; Future: XML 2.0, microXML?
  • 5. RDBMS Goliath vs XML David: Back then, XML was the proto-NoSQL, the X in AJAX; Now many Davids, AJAJ
  • 6. Documents: Back then, it wasn't unusual for a vendor to say "tough luck with your data" (pay up); Now, most office documents are in XML
  • 7. The long tail of XML Vocabularies: Back then, vocabularies were built with proprietary approaches; Today, 1000s of vocabularies are based on XML; "2012 U.S. GAAP Taxonomy Adopted by SEC; FASB Webcast April 3"
  • 8. Anyone heard of Shipdex?
  • 9. Back then, XML/Markup Conferences: Software Development '99 East, November 8-13, 1999, Washington D.C.; XML One Fall '99, November 8-11, 1999, Santa Clara, CA; XML '99, December 6-9, 1999, Philadelphia, PA; Markup Technologies '99 Conference, December 5-9, 1999, Philadelphia; Web Design 2000, February 7-9, 2000, Atlanta; XTech 2000, February 27-March 2, San Jose; Software Development 2000 West, March 20-24, 2000, San Jose; Sixteenth International Unicode Conference, Boston, March 27-30, 2000; The Ninth International World Wide Web Conference, May 15-19, 2000, Amsterdam; DL 2000: Fifth ACM Conference on Digital Libraries, June 3-6, 2000, Texas; XML Europe 2000, June 12-16, Paris; Web Design World 2000, July 17-21, 2000, Seattle, Washington; MetaStructures, August 14-16, 2000, Montreal, Quebec, Canada; XML Developers Conference, August 17-18, 2000, Montreal, Quebec; Internet World Expo, October 25-27, 2000, New York City; XML 2000/Markup Technologies 2000, December 3-7, Washington; ... Even a Geek Cruises XML Excursion, January 2001
  • 10.
Today, XML/Markup Conferences: The XML parallelogram; Balisage; XML Summer School; XML Prague; XML Amsterdam; Xtech*; markupForum; XATA; MarkLogic World (600 ppl); databaseX (London, November 2013?)
  • 11. Other important good stuff: Evolution of the Operating System; Unix is the operating system for text; Windows tried to be the operating system for binaries, then adopted XML ... mixed bag; Java (VM) has a strong XML stack; The web changed everything to text-based markup; Cheap RAM/Disk/CPU; Virtualization = scale out
  • 12. Other important good stuff: http://googleblog.blogspot.cz/2012/02/unicode-over-60-percent-of-web.html
  • 13. Unfair to point out failure? Namespaces; XLINK; WS* astronautics; Draconian error checking; XML SCHEMA; XFORMS; XSLT 1.0 (or any XML) in the browser; XHTML vs HTML5; Too many specs (modularity good, complexity bad)
  • 14. "Winning isn't everything. There should be no conceit in victory and no despair in defeat." - Matt Busby. 2001: I was the RDBMS serial killer ("kill RDBMS"); Define successful? Adoption? Cheaper? Faster? Better?
  • 15. Drill-down distraction: why is XQuery successful / productive? Choose my most successful (ad hoc stories, visible success); Functional, dynamic; Work with structure, text and values; Stored proc + query lang; XPATH^; Is it possible to qualify/quantify XQuery productivity?
  • 16. Programming Language Productivity: Data compiled from studies by Prechelt and Garret of a particular string processing problem - public domain 2006.
  • 17. Programming Language Productivity: Data compiled from studies by Prechelt and Garret of a particular string processing problem - public domain 2006.
  • 18. Java vs XQuery: SimpleDB: 2905 vs 572; S3: 8589 vs 1469; SNS: 2309 vs 455; Total: 13803 vs 2496. * 28msec 2011 http://www.28msec.com/html/home
  • 19. Developing an Enterprise Web Application in XQuery - 2009, Martin Kaufmann, Donald Kossmann. Java/J2EE vs XQuery: Model: 3100 vs 240; View: 4100 vs 1500; Controller: 900 vs 1180; Total: 8100 (?) vs 2920 (3490)
  • 20. Nooooo!
The problem with LOC: correlation of failure with very high LOC is the only certain fact with LOC. That's about it.
  • 21. An empirical comparison of C, C++, Java, Perl, Python, Rexx, and Tcl for a search/string-processing program. Lutz Prechelt ([email protected]), Fakultät für Informatik, Universität Karlsruhe. #loc per Function Point by language: C 91; C++ 53; Java 54; Perl 21. * Designing and writing programs using dynamic languages tended to take half as long as well as resulting in half the code.
  • 22. Function Point Method: Nooooo! #loc per FP = Lines of code Per Function Point
  • 23. Project Uncertainty Principle. * Dilbert Comic 2003 United Features Syndicate Inc
  • 24. Reviewed 11 projects: FP Analysis; Calc FP inputs/outputs; Calc VAF (0.65 + [Σ(Ci) / 100]); AVP = VAF * sum(FP); #loc using cloc; #loc / AVP = #loc per FP. * FP overview - http://www.softwaremetrics.com/fpafund.htm
  • 25. #loc per Function Point by language: Perl 21; Eiffel 21; SQL 13-30; XQuery 27-33; Haskell 38; Erlang 40; Python 42-47; Java 50-80; Javascript 50-55; Scheme 53; C++ 59-80; C 128-140. http://www.qsm.com/resources/function-point-languages-table
  • 26. XQuery 2011 Survey
  • 27. Preferred Programming Language: 73%, 55%, 45%, 32%, 22%
  • 28. Which data formats do you use the most? 95%, 40%, 39%, 32%, 27%, 18%, 15%
  • 29. Do you think XQuery makes you a more productive programmer? 67%, 14%, 10%, 8%
  • 30. Is XQuery more productive than (with???) Java in developing web based data applications? 58%, 22%, 12%, 8%
  • 31. Time to bust one myth: "XML is too slow and bloated". http://www.navioo.com/ajax/ajax_json_xml_Benchmarking.php In data-orientated AJAJ scenarios, JSON at best is 30% faster in most benchmarks today, with less load (so more with fewer resources)
  • 32. mongodb * http://www.linkedin.com/skills/skill
  • 33. Javascript * http://www.linkedin.com/skills/skill
  • 34. XQuery * http://www.linkedin.com/skills/skill
  • 35. XSLT * http://www.linkedin.com/skills/skill
  • 36. hadoop * http://www.linkedin.com/skills/skill
  • 37. Java * http://www.linkedin.com/skills/skill
  • 38. JSON * http://www.linkedin.com/skills/skill
  • 39.
XML * http://www.linkedin.com/skills/skill
  • 40. Back When SQL Was Invented
  • 41. born in the 90s
  • 42. XML?
  • 43. Might even be
  • 44. Channel effect of Aging in Technology: "Average age of @guardian Facebook audience is 29. Website is 37, print paper 44. Amazing channel effect, really. #newsrw"; Baby boomers, Gen X, Y and Z; I feel a bit uneasy framing generational arguments
  • 45. Death of the XML Child: Overachieving Child Prodigies grow up
  • 46. Let's not get distracted.
  • 47. Don't mention the war
  • 48. XML Hard Core - XML Hype cycle: 2002, 2006, 2012, 1998, 2007; XML's reported death -> 2009
  • 49. REST of the World - XML Hype cycle: 2006, 2002, 2009, 1998, 2012; XML's reported death ->
  • 50. hype cycle. * 2012 Gartner Hype Cycle http://www.gartner.com
  • 51. 2001 Edd Dumbill, xml.com, "Stop the XML hype, I want to get off": "As editor of XML.com, I welcome the massive success XML has had. But things prized by the XML community, openness and interoperability, are getting swallowed up in a blaze of marketing hype. Is this the price of success, or something we can avoid?" Source: Edd Dumbill (March 2001)
  • 52. 2012 Edd Dumbill, g+ post: "For many years I was the editor of XML.com, and the chair of the XML Europe conference. Today, it seems that XML's mission to be a web language is mostly dead. I'm not saying XML is useless: it has proved itself as a more easily-used SGML, but I'm not sure it's expanded too far outside of that." Source: Edd Dumbill (March 2012)
  • 53. Current Status: XML is dead: XML fought too many battles (RDBMS, NoSQL, web developers, HTML5); Age channeling and Hype curve in effect; But XML technology stack is embracing JSON etc; No room for sentimentality in technology
  • 54. XML is dead boring
  • 55. Halftime Break
  • 56. Big Data & Modern XML
  • 57. What's the problem?
  • 58. Is XML Applicable to Big Data? We know it is, that's why I am here; Some of you already know; Need to dig into the detail; But we first need to simplify things
  • 59. http://kensall.com/big-picture/bigpix22.html
  • 60.
BigData Opportunity. * http://gigaom.com/cloud/big-data-equals-big-opportunities-for-businesses-infographic/
  • 61. BigData Opportunity. * http://gigaom.com/cloud/big-data-equals-big-opportunities-for-businesses-infographic/
  • 62. Managing data variability, volume & velocity is hard. You need to be a (data) scientist to build this rocket ship.
  • 63. So what's the problem again? #1: How to apply Modern XML to your BigData problems?; #1a: XML Milieu too complicated, need to identify what is successful as Modern XML; #1b: BigData is a huge opportunity; #1c: BigData has a huge learning curve and high risks
  • 64. Solving #1 - Defining Modern XML: Identify the technologies; Identify and classify the scenarios
  • 65. Modern XML Technology analysis: Internal survey of ML customer projects & external survey of projects (w/ pref towards Big/Complex projects); Informal survey (polldaddy); Qualitative and quantitative
  • 66. Eisenhower: "What is important is seldom urgent and what is urgent is seldom important." Important and urgent: Critical; Important and not urgent: Goals; Not important and urgent: Interruptions; Not important and not urgent: Distractions
  • 67. Survey Interpretations: XML 1.0, Namespaces is important now; XProc, XHTML important now; XSLT 2 and XQuery 1 very important now; XSLT 2 and XQuery 2 in the browser near future; XQuery 3.0 important near future; SAX/DOM now, XOM possible future; XML Schema 1.0 now, 1.1 for the near future; Schematron surprising; Semweb is for the future; SVG and MathML due to web browser support; XML vocabulary has a very long tail
  • 68. Modern XML Technology Candidates: Core: XML 1.0, Namespaces; Other; Transform: XSLT 2.0, XQuery 1.0; Processing: SAX, DOM; Schema: Schematron, XML Schema 1.0; Semantics: RDF, OWL; Vocabularies: Office Doc ML, SVG. These technologies trended highly across all analysis. Bold: could be trending due to browser impl/historical dep
  • 69.
Modern XML Tier 1: Core: XML 1.0, Namespaces; Other: XProc; Transform: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0; Processing: SAX, DOM; Schema: Schematron, XML Schema 1.0 / 1.1; Semantics: RDF, OWL; Vocabularies: Office Doc ML, SVG. These technologies trended highly across all analysis. Bold: could be trending due to browser impl/historical dep. Italic: strong signal, early usage, interest of unproven spec/tech
  • 70. Modern XML, Tier 1 | Tier 2: Core: XML 1.0, Namespaces | XML Canonicalization, xml:id; Other: XProc | XHTML*; Transform: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0 | XSLT 1.0; Processing: SAX, DOM | XOM, STAX; Schema: Schematron, XML Schema 1.0 / 1.1 | RELAX-NG; Semantics: RDF, OWL | SPARQL; Vocabularies: Office Doc ML, SVG | MathML, Docbook, SOAP*, DITA, EPUB
  • 71. Modern XML, Tier 1 | Tier 2: Core: XML 1.0, Namespaces | XML Canonicalization, xml:id, XML infoset; Other: XProc | XHTML*; Transform: XSLT 2.0 / 3.0 / browser, XQuery 1.0 / 3.0 | XSLT 1.0; Processing: SAX, DOM | XOM, STAX; Schema: Schematron, XML Schema 1.0 / 1.1 | RELAX-NG; Semantics: RDF, OWL | SPARQL; Vocabularies: Office Doc ML, SVG | MathML, Docbook, SOAP, DITA, EPUB; Data Formats: XML, text, binary, JSON
  • 72. The technology triggers: XML Database, reduce the complexity/risk of BigData: MarkLogic, eXist, Zorba, Sedna, BaseX, others (Oracle!); XQuery, rapid prototyping; Avoid purist architectures, embrace heterogeneity
  • 73. Modern XML / BigData Scenarios. Classic Scenarios: Document (XML) Database; Aggregation; Enterprise Search; Heterogeneous Content store; Publishing. BigData Scenarios: BigData classic; Extreme personalisation; Predictive analytics; Financial analysis; Realtime analysis (management/financial); Actionable intelligence. Semantic Web: too early to categorize but it's for real
  • 74. Solving Problem #2 - Focus on the Practicalities: What type of Big Data problem do you have? The urgent, important ones you know about; The urgent, important ones you don't know about; Create a dedicated team (analytics, problem domain experts) to identify the latter; Assess data maturity (Data Audit); With power comes responsibility: Ethical Analytics
  • 75.
BigData Tech Advice: Start using an XML database asap!; Don't get distracted by the zoo, start hadooping right away; Data outlives code, spend more time on the data: clean abstractions, cogent, opening it up
  • 76. Size appropriately: Volume: will be relative to your current capability, if the requirement is a magnitude greater past current infrastructure scaling; Velocity: Updates versus reads? High volatility with realtime queries?; Variety: managing versioning?; Complexity: multiples, complex processes
  • 77. Size Appropriately: Are you a Facebook (Google, Yahoo)? 2.5 billion content items shared per day (status updates + wall posts + photos + videos + comments); 2.7 billion Likes per day; 300 million photos uploaded per day; 100+ petabytes of disk space in one of FB's largest Hadoop (HDFS) clusters; 105 terabytes of data scanned via Hive, Facebook's Hadoop query language, every 30 minutes; 70,000 queries executed on these databases per day; 500+ terabytes of new data ingested into the databases every day; Are you planning to scale out to ~180,900 servers? ~18,000 database servers ingesting 500+ terabytes of data through a guesstimated 50+ billion calls. A day! http://www.datacenterknowledge.com/the-facebook-data-center-faq/
  • 78. Solving Problem #3 - Understanding the risks: Biggest mistakes seen with BigData adoption. "data scientists themselves don't have much of intuition either... and that is a problem. I saw an estimate recently that said 70 to 80 percent of the results that are found in the machine learning literature, which is a key Big Data scientific field, are probably wrong because the researchers didn't understand that they were overfitting the data." - Alex Pentland, MIT's Big Data guy
  • 79. Summary: We reviewed some aspects of XML's current status in the dataverse; Identified a subset of the XML Milieu, calling it Modern XML; Identified the scenarios where Modern XML is being brought to bear with BigData; Reviewed common mistakes and risks with BigData
  • 80.
Final Thesis: Modern XML provides a great foundation today; Great for classic scenarios; Great technical positioning for addressing challenges of BigData; Great technical positioning for semweb; Adopting an XML database mitigates risk; Knowing BigData/Modern XML scenarios helps us mitigate risks; There is a big prize if you get BigData right
  • 81. Avoid stereotypes: I'm a RDBMS; I'm a Protocol Buffer; I'm a JSON; I'm an XML
  • 82. Jeni Tennison, XML Prague 2012 talk: JSON, XML, RDF, HTML
  • 83. Be wary of Paradigm Shifts: RedMonk - Language divergence; Andreessen - Software is eating the world; 128bit and beyond current von Neumann/Harvard arch?; Power Wall (at server farms/mobile devices); The web revolution is not done yet (http://www.firebase.com/index.html)
  • 84. Embrace change
  • 85. "Form is temporary. Class is permanent": XML is emerging from its Trough of Disillusionment, because it's useful, productive and reacting to new requirements; Modern XML is successful on many different measures, mature and dead boring; Modern XML can help solve your BigData problems
  • 86. Pull the Technology Trigger - Try an XML Database Today! MarkLogic 6: Web dev surface area, work with JSON; REST API; Java API; Work across different data. Zorba; eXist; BaseX; Sedna
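
The function point arithmetic from slide 24 (VAF, adjusted count, then #loc per FP) can be sketched roughly as follows. This is a minimal Python sketch, not the method used on the 11 reviewed projects: it assumes the usual function point convention of 14 general system characteristics, each rated 0 to 5, and the function names and sample numbers are hypothetical.

```python
def value_adjustment_factor(characteristics):
    """VAF = 0.65 + sum(Ci) / 100, the formula shown on slide 24.

    `characteristics` holds the Ci ratings (assumed here: 14 values, each 0-5).
    """
    return 0.65 + sum(characteristics) / 100


def loc_per_function_point(loc, unadjusted_fp, characteristics):
    """Divide measured lines of code by the adjusted function point count.

    `loc` would come from a counter such as cloc; `unadjusted_fp` is the list
    of function point counts for the inputs/outputs of the system.
    """
    vaf = value_adjustment_factor(characteristics)
    adjusted_fp = vaf * sum(unadjusted_fp)  # written "AVP" on the slide
    return loc / adjusted_fp


# Hypothetical project, for illustration only.
ratings = [3] * 14          # sum(Ci) = 42, so VAF is about 1.07
fp_counts = [40, 35, 25]    # 100 unadjusted function points
measured_loc = 3210

print(round(loc_per_function_point(measured_loc, fp_counts, ratings), 1))
```

With these made-up inputs the result is about 30 lines of code per function point, which happens to fall inside the 27-33 range that slide 25 lists for XQuery.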