Reproducible Environment for Scientific Applications (Lab session) Tak-Lon (Stephen) Wu

Overview

• Introduction
• VirtualBox Prepackaged Image
• Example 1: Sandbox Hadoop WordCount
• Example 2: Cloud Twister WordCount
• Exercises
  – Sandbox Hadoop/Twister Kmeans
  – Cloud Hadoop/Twister Kmeans

Motivations

• Background knowledge
  – Environment setting
  – Different cloud infrastructure tools
  – Software dependencies
  – Long learning path
• Can these complicated steps be automated?
• Solution: Salsa Dynamic Provisioning Infrastructure (SalsaDPI)
  – A batch-like program

What is SalsaDPI? (Sandbox)

[Figure: sandbox architecture. Components shown: User Configuration, SalsaDPI Jar, chef-solo, OS, S/W, Applications.]

1. Read a conf. file and execute the software run-list
2. Install software
3. Run apps

What is SalsaDPI? (Cloud)

[Figure: cloud architecture. Components shown: User Conf., SalsaDPI Jar, Chef Server, and several VMs, each with an OS, a Chef client, S/W, and Apps.]

1. Bootstrap VMs with a conf. file
2. Retrieve conf. info and request authentication and authorization
3. Authenticated and authorized to execute the software run-list
4. VM(s) information
5. Submit application commands
6. Obtain result

* Chef architecture: http://wiki.opscode.com/display/chef/Architecture+Introduction

What is SalsaDPI? (Cont.)

• Chef features
  – Installs software on demand when VMs start
  – Monitors software installation progress
  – Easy to use
• SalsaDPI features
  – Provides a configurable interface
  – Automates Hadoop/Twister/other binary execution

* Chef official website: http://www.opscode.com/chef/

Hands-on Session


Online Tutorial page

• http://salsahpc.indiana.edu/ScienceCloud/reproduce-intro.html


Prerequisites

• Install VirtualBox on your laptop, then download and import the prepackaged image
• Set up the FutureGrid Eucalyptus environment
• Make sure the shared folder between the host and guest machines is set up correctly

# log in to the FutureGrid India head node i136
$ ssh -i ~/fg_private_key.pem [email protected]

• Replace ~/fg_private_key.pem with the name of your own private key file

About Pre-packaged Image

• It has the following software installed and configured under /root/software/:
  – Java JDK
  – Chef
  – Hadoop
  – Twister and ActiveMQ
  – HBase
  – Pig
  – salsaDPI (/root/salsaDPI/)

Important Notes

• If activemq.log and kahadb exist in the directory /root/software/apache-activemq-5.4.2/, please remove them. Otherwise, they will cause errors when running sandbox Twister applications.

$ cd /root/software/apache-activemq-5.4.2/
$ ls activemq.log kahadb
$ rm -rf activemq.log kahadb
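The same cleanup can be scripted so it runs before every sandbox Twister job. The following is a minimal Python sketch, not part of SalsaDPI; the function name `remove_stale_activemq_state` is hypothetical, and only the two file names come from the note above.

```python
import shutil
from pathlib import Path

# Names taken from the note above; everything else here is illustrative.
STALE_NAMES = ("activemq.log", "kahadb")


def remove_stale_activemq_state(activemq_home):
    """Delete activemq.log and kahadb under activemq_home; return the names removed."""
    removed = []
    for name in STALE_NAMES:
        target = Path(activemq_home) / name
        if target.is_dir():
            shutil.rmtree(target)      # kahadb may be a directory, hence rm -rf in the slide
        elif target.exists():
            target.unlink()
        else:
            continue                   # nothing to clean for this name
        removed.append(name)
    return removed
```

On the prepackaged image the call would be `remove_stale_activemq_state("/root/software/apache-activemq-5.4.2")`; calling it when nothing is left simply returns an empty list.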

Examples

• Example 1: Sandbox Hadoop WordCount
• Example 2: Cloud Twister WordCount
• Goals
  – Learn and modify the SalsaDPI JSON configuration file
  – Run the SalsaDPI Java executable, passing it the configuration file
  – http://salsahpc.indiana.edu/ScienceCloud/handson1_chef_sandbox.html

* JSON metadata format example: http://json.org/example.html

Step 1. Open the Conf. File

• Locate and open the configuration file.
  – /root/salsaDPI/sandbox/templates/sandbox_hadoopTemplate.json
  – /root/salsaDPI/sandbox/templates/sandbox_twisterTemplate.json

Step 2. Modify Conf. File

'applicationParameters': {
  'applicationType':'Hadoop',
  'localPathOfProgramBinary':'/root/salsaDPI/apps/hadoopWordCount.jar',
  'localPathOfProgramInput':'/root/salsaDPI/input/hadoopWordCountInput.txt',
  'localPathOfBinaryDependency':'',
  'programExecuteLocation':'',
  'programArgs':'bin/hadoop jar #_JAR_# #_HDFS_INPUTDIR_# #_HDFS_OUTPUTDIR_#'
}
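The programArgs value uses #_NAME_# variables such as #_JAR_# and #_HDFS_INPUTDIR_#. How SalsaDPI expands them internally is not shown in these slides; the Python sketch below only illustrates the idea of filling such variables from a lookup table. The function name `expand_program_args` and the example values are hypothetical.

```python
import re

# Matches tokens of the form #_NAME_# where NAME is upper-case letters and underscores.
VARIABLE = re.compile(r"#_([A-Z_]+?)_#")


def expand_program_args(template, values):
    """Replace each #_NAME_# with values[NAME]; unknown variables are left untouched."""
    def substitute(match):
        return values.get(match.group(1), match.group(0))
    return VARIABLE.sub(substitute, template)


if __name__ == "__main__":
    args = "bin/hadoop jar #_JAR_# #_HDFS_INPUTDIR_# #_HDFS_OUTPUTDIR_#"
    # Illustrative values only; real values would be chosen by the tool at run time.
    print(expand_program_args(args, {
        "JAR": "hadoopWordCount.jar",
        "HDFS_INPUTDIR": "/input",
        "HDFS_OUTPUTDIR": "/output",
    }))
```

Leaving unknown variables in place, rather than raising an error, makes it easy to spot which ones were never supplied.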

• A detailed description is available here:
  – http://salsahpc.indiana.edu/ScienceCloud/handson1_chef_sandbox.html

applicationParameters: A JSON object that contains the user-defined application's information
applicationType: Type of the user-defined application; options are Hadoop or Twister
localPathOfProgramBinary: Full path of the user-defined Hadoop or Twister compiled jar executable on the working machine
localPathOfProgramInput: Full path of the user-defined input file on the working machine, normally a plaintext or a *.tar.gz file
localPathOfBinaryDependency: Full path of the user-defined program dependency file on the working machine, such as the Twister Kmeans initial cluster file
programExecuteLocation: Path to the Twister program execution script within the Twister package, such as samples/wordcount/bin or samples/kmeans/bin
twisterInputFilesPreFix: Twister input files prefix. Refer to the provided package: for Twister WordCount the prefix is wc_data, for Twister Kmeans it is km_data.
programArgs: User-defined program execution command
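Before running a job, it can help to sanity-check the applicationParameters object against the descriptions above. The Python sketch below is an assumption-laden illustration: the rules it applies are inferred from the examples in this deck, not from SalsaDPI's own validation, and the function name `check_application_parameters` is hypothetical.

```python
KNOWN_TYPES = ("Hadoop", "Twister")


def check_application_parameters(params):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    app_type = params.get("applicationType")
    if app_type not in KNOWN_TYPES:
        problems.append("applicationType must be 'Hadoop' or 'Twister'")
    # Assumed to be needed by every job, since every example in the slides sets them.
    for key in ("localPathOfProgramBinary", "programArgs"):
        if not params.get(key):
            problems.append(f"{key} is empty or missing")
    # The Twister examples also set the script location and the input prefix.
    if app_type == "Twister":
        for key in ("programExecuteLocation", "twisterInputFilesPreFix"):
            if not params.get(key):
                problems.append(f"{key} is needed for Twister jobs")
    return problems
```

For the Hadoop WordCount parameters shown in Step 2 the function returns an empty list; a Twister entry without 'twisterInputFilesPreFix' would be reported.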

Sandbox Hadoop WordCount

{
  // Useful general variables of programArgs for the applicationParameters object:
  // #_JAR_#, #_JOB_ID_#,
  // #_HDFS_INPUTDIR_#, #_HDFS_OUTPUTDIR_#,
  // #_TWISTER_INPUTDIR_#, #_TWISTER_OUTPUTDIR_#, #_TWISTER_PARTITION_FILE_#, #_BINARY_DEPENDENCY_#

  // 'mode':'sandbox', | 'mode':'cloud',
  'mode':'sandbox',

  // chef-solo related parameters
  'chef':{'chefSoloRecipeUrls':'http://129.79.49.248/chef-solo.tar.gz',
          'chefSoloConfFilePath':'/root/salsaDPI/solo.rb'},

  // ssh passwordless related parameters
  'ssh':{'SSHLoginUsername':'root', 'SSHPrivateKeyPath':'/root/.ssh/id_rsa'},

  // runtime software such as recipe[hadoopSandbox] or recipe[twisterSandbox]
  'softwareRecipes':['recipe[hadoopSandbox]'], // please don't change this line

  // user-defined application parameters
  'applicationParameters':{
    'applicationType':'Hadoop',
    'localPathOfProgramBinary':'/root/salsaDPI/apps/hadoopWordCount.jar',
    'localPathOfProgramInput':'/root/salsaDPI/input/hadoopWordCountInput.txt',
    'localPathOfBinaryDependency':'',
    'programExecuteLocation':'',
    'programArgs':'bin/hadoop jar #_JAR_# #_HDFS_INPUTDIR_# #_HDFS_OUTPUTDIR_#'}
}
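The template uses single-quoted strings and // comments, which Python's standard json module does not accept. For anyone who wants to inspect or pre-check such a file from a script, the following is a minimal sketch of one possible approach (not how SalsaDPI itself reads the file): strip the // comments outside of quoted strings, then parse the rest as a Python literal. The function names are hypothetical, and the approach assumes the file holds only strings, numbers, lists, and objects, as in the examples here.

```python
import ast


def strip_line_comments(text):
    """Remove // comments, leaving '//' inside quoted strings (e.g. URLs) intact."""
    out = []
    quote = None          # the quote character of the string we are inside, if any
    i = 0
    while i < len(text):
        ch = text[i]
        if quote:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                out.append(text[i + 1])   # keep escaped character verbatim
                i += 1
            elif ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
            out.append(ch)
        elif text.startswith("//", i):
            while i < len(text) and text[i] != "\n":
                i += 1                    # skip to end of line; newline is kept
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def load_conf(text):
    """Parse a SalsaDPI-style conf text into a Python dict (sketch, see assumptions above)."""
    return ast.literal_eval(strip_line_comments(text).strip())
```

Tracking the quote state is what keeps the 'http://...' recipe URL from being mistaken for a comment.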

Step 3. Execute SalsaDPI with Conf.

• Execute SalsaDPI with the command:

$ cd ~/salsaDPI
$ java -cp salsaDPI.jar cgl.salsa.salsadpi.Driver <path_to_conf_file>

• The output will be stored at <workingDir>/salsaDPI_output/<job_uuid>/output/*.
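When the run is driven from a script, the same command and output location can be expressed as below. This is a small Python sketch under stated assumptions: the helper names are hypothetical, only the java command line and the output path layout come from the slide, and the job UUID must be supplied by the caller because the slides do not say how to obtain it.

```python
import subprocess
from pathlib import Path


def build_command(conf_path):
    """Command line from the slide, as an argument list for subprocess."""
    return ["java", "-cp", "salsaDPI.jar", "cgl.salsa.salsadpi.Driver", str(conf_path)]


def output_dir(working_dir, job_uuid):
    """<workingDir>/salsaDPI_output/<job_uuid>/output, per the slide."""
    return Path(working_dir) / "salsaDPI_output" / job_uuid / "output"


def run_salsadpi(conf_path, salsadpi_home):
    """Run the driver from salsadpi_home; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(conf_path), cwd=salsadpi_home, check=True)
```

Running from `salsadpi_home` mirrors the `cd ~/salsaDPI` step, so the relative `salsaDPI.jar` on the classpath resolves.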

Demo

• Demo video
  – Video hands-on 1: Sandbox Hadoop WordCount
  – YouTube link (1080p)

Examples

• Example 1: Sandbox Hadoop WordCount
• Example 2: Cloud Twister WordCount
• Goals
  – Make sure FutureGrid Eucalyptus is set up and the required files are downloaded correctly
  – Learn and modify the SalsaDPI JSON configuration file
  – Run the SalsaDPI Java executable, passing it the configuration file
  – http://salsahpc.indiana.edu/ScienceCloud/handson2_chef_cloud.html
  – For live testing, please make sure your name is here

* JSON metadata format example: http://json.org/example.html

Step 1. Open the Conf. File

• Locate and open the configuration file.
  – /root/salsaDPI/cloud/templates/cloud_hadoopTemplate.json
  – /root/salsaDPI/cloud/templates/cloud_twisterTemplate.json

Step 2. Modify Conf. File

'eucaInfo':{
  'eucarcFilePath':'#_FullPath_to_eucarc_File_#',
  'eucaImageEmi':'emi-A8F63C29',
  'eucaSSHPublicKey':'#_Euca_Keypair_PublicKeyName_#',
  'eucaVmType':'m1.small',
  'amountOfInstances':2},

Step 2. Modify Conf. File (Cont.)

'ssh': {
  'SSHLoginUsername':'root',
  'SSHPrivateKeyPath':'/root/#_yourPrivatekey_FileName_#' },

Step 2. Modify Conf. File (Cont.)

'applicationParameters': {
  'applicationType':'Twister',
  'localPathOfProgramBinary':'/root/salsaDPI/apps/Twister-WordCount-0.9.jar',
  'localPathOfProgramInput':'/root/salsaDPI/input/twisterWordCountInput.tar.gz',
  'localPathOfBinaryDependency':'',
  'programExecuteLocation':'samples/wordcount/bin',
  'twisterInputFilesPreFix':'wc_data',
  'programArgs':'./run_wc.sh #_TWISTER_PARTITION_FILE_# #_TWISTER_OUTPUTDIR_#/wc.out 4 1'
}

• A detailed description is available here:
  – http://salsahpc.indiana.edu/ScienceCloud/handson2_chef_cloud.html

eucaInfo: A JSON object that contains the cloud-mode Eucalyptus information: 'eucarcFilePath', 'eucaImageEmi', 'eucaSSHPublicKey', 'eucaVmType', and 'amountOfInstances'
eucarcFilePath: Full path to the downloaded eucarc file
eucaImageEmi: Eucalyptus VM image registered on FutureGrid, e.g. emi-52C93AC2
eucaSSHPublicKey: Eucalyptus public key name (which you set up during the FutureGrid Eucalyptus setup)
eucaVmType: Eucalyptus VM type, e.g. c1.medium
amountOfInstances: Number of instances for this job, e.g. 2
ssh: A JSON object that contains the ssh information: SSHLoginUsername and SSHPrivateKeyPath
SSHLoginUsername: SSH login username; for cloud mode, it must be root.
SSHPrivateKeyPath: Full path to the ssh private key used to log in to the VM.
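The cloud templates ship with fill-in markers such as '#_FullPath_to_eucarc_File_#' that must be replaced by hand, while run-time variables such as #_JAR_# are meant to stay. The Python sketch below shows one way to list the markers that are still unfilled in an already-parsed configuration. It is an illustration only: the function name `unfilled_placeholders` is hypothetical, and the set of run-time variables is copied from the comment at the top of the example conf files.

```python
import re

# Run-time variables listed in the conf file comments; these are expected to remain.
RUNTIME_VARIABLES = {
    "JAR", "JOB_ID", "HDFS_INPUTDIR", "HDFS_OUTPUTDIR",
    "TWISTER_INPUTDIR", "TWISTER_OUTPUTDIR",
    "TWISTER_PARTITION_FILE", "BINARY_DEPENDENCY",
}
TOKEN = re.compile(r"#_(\w+?)_#")


def unfilled_placeholders(value, path=""):
    """Walk a parsed conf and return (path, marker) pairs for markers still to be filled in."""
    found = []
    if isinstance(value, dict):
        for key, item in value.items():
            found += unfilled_placeholders(item, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, item in enumerate(value):
            found += unfilled_placeholders(item, f"{path}[{index}]")
    elif isinstance(value, str):
        for name in TOKEN.findall(value):
            if name not in RUNTIME_VARIABLES:
                found.append((path, name))
    return found
```

An empty result suggests that every template marker has been replaced; it does not verify that the paths or key names actually exist.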

Step 3. Execute SalsaDPI with Conf.

• Execute SalsaDPI with the command:

$ cd ~/salsaDPI
$ java -cp salsaDPI.jar cgl.salsa.salsadpi.Driver <path_to_conf_file>

• The output will be stored at <workingDir>/salsaDPI_output/<job_uuid>/output/*.

Cloud Twister WordCount

{
  // Useful general variables of programArgs for the applicationParameters object:
  // #_JAR_#, #_JOB_ID_#,
  // #_HDFS_INPUTDIR_#, #_HDFS_OUTPUTDIR_#,
  // #_TWISTER_INPUTDIR_#, #_TWISTER_OUTPUTDIR_#, #_TWISTER_PARTITION_FILE_#, #_BINARY_DEPENDENCY_#

  // 'mode':'sandbox', | 'mode':'cloud',
  'mode':'cloud',

  // euca cloud parameters
  'eucaInfo':{'eucarcFilePath':'/root/eucarc',
              'eucaImageEmi':'emi-A8F63C29',
              'eucaSSHPublicKey':'stephen',
              'eucaVmType':'m1.small',
              'amountOfInstances':2},

  // ssh passwordless related parameters
  'ssh':{'SSHLoginUsername':'root', 'SSHPrivateKeyPath':'/root/stephen.pem'},

  // runtime software such as recipe[hadoopSandbox], recipe[twisterSandbox],
  // recipe[hadoopCloud], and recipe[twisterCloud]
  'softwareRecipes':['recipe[twisterCloud]'],

  // user-defined application parameters
  'applicationParameters':{
    'applicationType':'Twister',
    'localPathOfProgramBinary':'/root/salsaDPI/apps/Twister-WordCount-0.9.jar',
    'localPathOfProgramInput':'/root/salsaDPI/input/twisterWordCountInput.tar.gz',
    'localPathOfBinaryDependency':'',
    'programExecuteLocation':'samples/wordcount/bin',
    'twisterInputFilesPreFix':'wc_data',
    'programArgs':'./run_wc.sh #_TWISTER_PARTITION_FILE_# #_TWISTER_OUTPUTDIR_#/wc.out 4 1'}
}

Demo

• Demo video
  – Video hands-on 2: Cloud Twister WordCount
  – YouTube link (1080p)

Twister Kmeans

• Modify a Sandbox/Cloud conf. file for Twister Kmeans.
• Below are hints for the Twister Kmeans conf. file.

'applicationParameters':{
  'applicationType':'Twister',
  'localPathOfProgramBinary':'#_FullPath_To_TwisterKmeans_JAR_#',
  'localPathOfProgramInput':'#_FullPath_To_TwisterKmeans_Inputs_GZ_File_#',
  'localPathOfBinaryDependency':'#_FullPath_To_TwisterKmeans_InitClusterFile_#',
  'programExecuteLocation':'samples/kmeans/bin',
  'twisterInputFilesPreFix':'km_data',
  'programArgs':'./run_kmeans.sh #_BINARY_DEPENDENCY_# 80 #_TWISTER_PARTITION_FILE_# > #_TWISTER_OUTPUTDIR_#/#_JOB_ID_#.txt'}

Hadoop Kmeans

• Modify a Sandbox/Cloud conf. file for Hadoop Kmeans.
• The snippet below provides hints for the Kmeans programArgs.

'applicationParameters': {
  'applicationType':'Hadoop',
  'localPathOfProgramBinary':'#_Path_HadoopKmeans_Jar_#',
  'localPathOfProgramInput':'',
  'localPathOfProgramDB':'',
  'localPathOfBinaryDependency':'',
  'programExecuteLocation':'',
  'programArgs':'bin/hadoop jar #_JAR_# 500 10 8 3 #_JOB_ID_# > ~/#_JOB_ID_#/#_JOB_ID_#.txt'}

Thank you

Cloud Hadoop WordCount

{
  // mode = 'cloud'
  'mode':'cloud',

  // euca cloud parameters
  'eucaInfo':{'eucarcFilePath':'/root/eucarc',
              'eucaImageEmi':'emi-A8F63C29',
              'eucaSSHPublicKey':'stephen', // replace stephen with your public key name
              'eucaVmType':'m1.small',
              'amountOfInstances':2},

  'ssh':{'SSHLoginUsername':'root',
         'SSHPrivateKeyPath':'/root/stephen.pem'}, // replace stephen.pem with your private key

  'softwareRecipes':['recipe[hadoopCloud]'],

  'applicationParameters':{
    'applicationType':'Hadoop',
    'localPathOfProgramBinary':'/root/salsaDPI/apps/hadoopWordCount.jar',
    'localPathOfProgramInput':'/root/salsaDPI/input/hadoopWordCountInput.txt',
    'localPathOfProgramDB':'',
    'programExecuteLocation':'',
    'programArgs':'bin/hadoop jar #_JAR_# #_HDFS_INPUTDIR_# #_HDFS_OUTPUTDIR_#'}
}

Sandbox Twister WordCount

{
  // mode = 'sandbox'
  'mode':'sandbox',

  // chef solo parameters
  'chef':{'chefSoloRecipeUrls':'http://129.79.49.248/chef-solo.tar.gz',
          'chefSoloConfFilePath':'/root/solo.rb'},

  'ssh':{'SSHLoginUsername':'root', 'SSHPrivateKeyPath':'/root/.ssh/id_rsa'},

  'softwareRecipes':['recipe[twisterSandbox]'],

  'applicationParameters':{
    'applicationType':'Twister',
    'localPathOfProgramBinary':'/root/salsaDPI/apps/Twister-WordCount-0.9.jar',
    'localPathOfProgramInput':'/root/salsaDPI/input/twisterWordCountInput.tar.gz',
    'localPathOfBinaryDependency':'',
    'localPathOfProgramDB':'',
    'programExecuteLocation':'samples/wordcount/bin',
    'twisterInputFilesPreFix':'wc_data',
    'programArgs':'./run_wc.sh #_TWISTER_PARTITION_FILE_# #_TWISTER_OUTPUTDIR_#/wc.out 4 1'}
}