SDEC2011 Essentials of Pig

  • 1. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.
    Essentials of Pig
    Mastering Hadoop Map-reduce for Data Analysis
    Shashank Tiwari
    blog: shanky.org | twitter: @tshanky
    st@treasuryofideas.com
  • 2. Session Agenda
      • What is Pig and why should you use it?
      • Installing & Setting up Pig
      • Pig's Components
      • Using Pig with Hadoop MapReduce
      • Summary & Conclusion
  • 3. What is Pig?
      • Higher-level abstraction for Hadoop MapReduce
      • An infrastructure for data analysis using a scripting language named Pig Latin
  • 4. Why should you use Pig?
    Hadoop MapReduce:
      • Requires you to be a programmer
      • Forces you to design all your algorithms in terms of the map and reduce primitives
  • 5. Installing & Setting Up Pig -- Required Software
    Required Software:
      • Java 1.6.x
      • Hadoop 0.20.x
      • Ant 1.7+ (for builds)
      • JUnit 4.5 (for tests)
      • Cygwin (on Windows)
  • 6. Download
      • Source: http://pig.apache.org/
      • Version: 0.8.1 -- current stable
  • 7. Install & Configure
      • Extract: tar zxvf pig-0.8.1.tar.gz
      • Move & Create Symbolic Link: ln -s pig-0.8.1 pig
      • Edit: bin/pig
        export PIG_CLASSPATH=$HADOOP_HOME/conf
  • 8. Verify Installation
    Verify: (remember to start Hadoop first.)
      • bin/pig -help (command options)
      • bin/pig (run the grunt shell)
  • 9. Running Pig
    Run Mode:
      • Local Mode -- single machine
      • MapReduce Mode -- needs a Hadoop cluster (with HDFS)
    Run via:
      • grunt shell
      • pig scripts
      • embedded programs
  • 10. Pig IDE
    PigPen, an Eclipse-based IDE:
      • graphical data flow definition
      • can show example data flow
  • 11. Pig Components
      • Pig Latin
      • Pig Engine
          • execution engine on top of Hadoop
          • includes default optimal configurations
  • 12. A client for your cluster
      • Pig does not run on a Hadoop cluster
      • It connects to one
  • 13. Pig Latin
      • Data flow language (not declarative like SQL)
      • Increases productivity (fewer lines do more)
      • Includes standard operations like join, filter, group, sort
      • User code and existing binaries can be included
      • Supports nested data types
      • Does not require metadata
  • 14. Pig Latin Example
      • Will leverage the tutorial that comes with the distribution
      • Check the tutorial folder in the distribution
  • 15. Start Grunt Shell
      cd $PIG_HOME
      bin/pig -x local
  • 16. Aggregate Data
      grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, timestamp, query);
      (alternate delimiters can be used, and de-serializers like PigJsonLoader can be leveraged)
      grunt> grouped = GROUP log BY user;
      grunt> counted = FOREACH grouped GENERATE group, COUNT(log);
      grunt> STORE counted INTO 'output';
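To make the aggregation above concrete, here is a minimal plain-Python sketch of what the load, group, and count steps compute. It is an illustration of the semantics only, not Pig code and not how Pig executes the job. The sample rows and the helper names `load` and `count_by_user` are hypothetical; the tab delimiter is an assumption based on Pig's default loader splitting fields on tabs.

```python
from collections import Counter

# Hypothetical sample rows in the (user, timestamp, query) layout;
# assumed tab-delimited, matching the default delimiter of Pig's loader.
raw_lines = [
    "u1\t970916105432\tyahoo chat",
    "u2\t970916025458\tsoccer scores",
    "u1\t970916090700\thawaii hotels",
]

def load(lines, delimiter="\t"):
    """Split each line into a (user, timestamp, query) tuple, like LOAD ... AS."""
    return [tuple(line.split(delimiter)) for line in lines]

def count_by_user(records):
    """Mimic GROUP log BY user followed by GENERATE group, COUNT(log)."""
    return Counter(user for user, _timestamp, _query in records)

counted = count_by_user(load(raw_lines))
print(sorted(counted.items()))
```

Passing a different `delimiter` to `load` plays the role of the alternate delimiters mentioned on the slide.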
  • 17. Group Data
      grunt> grouped = GROUP log BY user;
      • In Pig, a group operation generates (key, collection) pairs, where each collection is itself a collection of tuples.
      • Every tuple in a collection carries the same key value as the (key, collection) pair that contains it.
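The shape of a grouped relation can be sketched in plain Python as a list of (key, collection) pairs. This is an illustration under assumptions, not Pig's implementation: the records are made up, and `group_by` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical (user, time, query) records.
records = [
    ("u1", "t1", "pig latin"),
    ("u2", "t2", "hadoop"),
    ("u1", "t3", "hdfs"),
]

def group_by(rows, key_index):
    """Return (key, collection) pairs; each collection holds the full tuples."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key_index]].append(row)
    # Sorted only to make the printed result deterministic.
    return sorted(grouped.items())

grouped = group_by(records, 0)
for key, bag in grouped:
    print(key, bag)
```

Note that the grouping key still appears inside every tuple of its collection, which is the point the slide makes about the key being repeated.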
  • 18. Filter Data
      grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
      grunt> grouped = GROUP log BY user;
      grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;
      grunt> filtered = FILTER counted BY cnt > 75;
      grunt> STORE filtered INTO 'output1';
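The filter step keeps only the rows whose count passes the condition. A minimal plain-Python sketch of that behavior, using made-up (user, cnt) rows and a hypothetical helper `filter_by_count`:

```python
# Hypothetical (user, cnt) rows, as produced by the counting step.
counted = [("u1", 120), ("u2", 40), ("u3", 76), ("u4", 75)]

def filter_by_count(rows, threshold):
    """Keep rows whose count is strictly greater than the threshold."""
    return [(user, cnt) for user, cnt in rows if cnt > threshold]

filtered = filter_by_count(counted, 75)
print(filtered)
```

Because the comparison is a strict `>`, a user with exactly 75 queries is dropped.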
  • 19. Order Data
      grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
      grunt> grouped = GROUP log BY user;
      grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;
      grunt> filtered = FILTER counted BY cnt > 50;
      grunt> sorted = ORDER filtered BY cnt;
      grunt> STORE sorted INTO 'output2';
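The ordering step sorts rows by the count field, ascending unless stated otherwise. A minimal plain-Python sketch with made-up (user, cnt) rows; the variable names are hypothetical:

```python
# Hypothetical (user, cnt) rows that already passed the cnt > 50 filter.
filtered = [("u1", 120), ("u5", 51), ("u3", 76)]

# Ascending sort on the count field, like ORDER filtered BY cnt.
sorted_rows = sorted(filtered, key=lambda row: row[1])

# Descending variant, corresponding to adding DESC in Pig Latin.
sorted_desc = sorted(filtered, key=lambda row: row[1], reverse=True)

print(sorted_rows)
print(sorted_desc)
```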
  • 20. Join Data Example
      • Words appearing in Adventures of Huckleberry Finn by Mark Twain
        http://www.gutenberg.org/ebooks/76
      • Words appearing in The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
        http://www.gutenberg.org/ebooks/1661
  • 21. Loading & Counting Huckleberry Finn Data
      grunt> A = load 'pg76.txt';
      grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
      grunt> C = filter B by word matches '\\w+';
      grunt> D = group C by word;
      grunt> E = foreach D generate COUNT(C), group;
      store E in