
RHive tutorials - Basic functions


This document explains how to use the basic functions in RHive. It was last updated on 5 March 2012.


This tutorial explains how to load the RHive library and use its basic functions.

Loading RHive

Load RHive the same way you load any other R package:

library(RHive)  

Before loading RHive, make sure the HADOOP_HOME and HIVE_HOME environment variables are configured. HADOOP_HOME is the home directory where Hadoop is installed, and HIVE_HOME is the home directory where Hive is installed. If they are not set, you can set them temporarily before loading the library, as follows. See "RHive tutorial - RHive installation and setting" for details on the environment variables.

Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
library(RHive)
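As a minimal sketch of the step above, the following base R code sets each variable only when it is not already defined, so values exported in the shell are left alone. The helper name set_env_if_unset is hypothetical and not part of RHive; the paths are the ones used in this tutorial and will differ on other installations.

```r
# Hypothetical helper (not part of RHive): set an environment variable
# only if it is currently unset or empty, and return its final value.
set_env_if_unset <- function(name, default) {
  if (!nzchar(Sys.getenv(name))) {
    # Sys.setenv() takes named arguments, so build the call dynamically.
    do.call(Sys.setenv, setNames(list(default), name))
  }
  invisible(Sys.getenv(name))
}

# Paths taken from this tutorial; adjust them to your own installation.
set_env_if_unset("HIVE_HOME", "/service/hive-0.7.1")
set_env_if_unset("HADOOP_HOME", "/service/hadoop-0.20.203.0")

# Both variables should be non-empty before library(RHive) is called.
stopifnot(nzchar(Sys.getenv("HIVE_HOME")), nzchar(Sys.getenv("HADOOP_HOME")))
```

After this check passes, library(RHive) can be called as shown above.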

rhive.init

rhive.init initializes RHive internally. If the environment variables were set correctly before RHive was loaded, it runs automatically. If they were not set when RHive was loaded via library(RHive), calling an RHive function produces the following error message.

rhive.connect()
Error in .jcall("java/lang/Class", "Ljava/lang/Class;", "forName", cl,  :
  No running JVM detected. Maybe .jinit() would help.
Error in .jfindClass(as.character(class)) :
  No running JVM detected. Maybe .jinit() would help.

In that case, set HIVE_HOME and HADOOP_HOME and call rhive.init() as shown below, or exit R, configure the environment variables, and restart R.

Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
rhive.init()

Or:

close R
export HIVE_HOME="/service/hive-0.7.1"
export HADOOP_HOME="/service/hadoop-0.20.203.0"
open R

rhive.connect

RHive functions work only after a connection to the Hive server has been established. If you call other RHive functions before connecting with rhive.connect, they fail with the following error.

Error in .jcast(hiveclient[[1]], new.class = "org/apache/hadoop/hive/service/HiveClient",  :
  cannot cast anything but Java objects

Connecting to the Hive server is simple:

rhive.connect()

rhive.connect also accepts arguments. If your Hive server runs on a different machine from the one where RHive is installed, pass the server's address to connect remotely:

rhiveConnection <- rhive.connect("10.1.1.1")

If you have multiple Hadoop and Hive clusters configured for RHive and want to switch between them, keep the connection objects, much as you would with a database client such as MySQL, and pass the appropriate connection to each RHive function as an argument to select it explicitly.

rhive.query

If you have used Hive before, you probably know that it supports an SQL-like syntax for working with data in HDFS through Map/Reduce. rhive.query sends an SQL statement to Hive and receives the results. Users who know SQL will recognize the following example.

rhive.query("SELECT * FROM usarrests")

Running the example above prints the contents of the table named 'usarrests' on the screen. Instead of only printing the returned result, you can also assign it to a data.frame object.

resultDF <- rhive.query("SELECT * FROM usarrests")

Be aware that if the data returned by rhive.query is larger than the memory available on the machine running RHive (a server or a laptop), memory is exhausted and an error results. Do not load data of that size into an R object. Instead, create a temporary table first and store the results of the SQL statement in it, as follows.

rhive.query("
CREATE TABLE new_usarrests (
  rowname string,
  murder double,
  assault int,
  urbanpop int,
  rape double
)")

rhive.query("INSERT OVERWRITE TABLE new_usarrests SELECT * FROM usarrests")

Consult the Hive documentation for details on how to use Hive SQL.
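The create-then-insert pattern for large results can be sketched as a small helper that assembles the two statements as strings. This is only an illustration: build_copy_queries is a hypothetical function, not part of RHive, and it only builds text. With a live connection, each string could then be passed to rhive.query.

```r
# Hypothetical helper (not part of RHive): build the CREATE TABLE and
# INSERT OVERWRITE statements used to keep a large result inside Hive.
# `columns` is a named character vector: names are column names,
# values are Hive data types.
build_copy_queries <- function(source, target, columns) {
  column_defs <- paste0("  ", names(columns), " ", columns, collapse = ",\n")
  list(
    create = sprintf("CREATE TABLE %s (\n%s\n)", target, column_defs),
    insert = sprintf("INSERT OVERWRITE TABLE %s SELECT * FROM %s", target, source)
  )
}

queries <- build_copy_queries(
  source  = "usarrests",
  target  = "new_usarrests",
  columns = c(rowname = "string", murder = "double", assault = "int",
              urbanpop = "int", rape = "double")
)

# Inspect the generated statements before sending them to Hive.
cat(queries$create, "\n")
cat(queries$insert, "\n")
```

Keeping the statements as strings makes it easy to review them before they run against the cluster.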

rhive.close

When you have finished using Hive and no longer need the RHive functions, use rhive.close to terminate the connection.

rhive.close()

Alternatively, you can close a specific connection by passing it as an argument.

conn <- rhive.connect()
rhive.close(conn)

rhive.list.tables

The rhive.list.tables function returns the list of tables in Hive.

rhive.list.tables()
       tab_name
1         aids2
2 new_usarrests
3     usarrests

This is effectively identical to this:

rhive.query("SHOW TABLES")

rhive.desc.table

The rhive.desc.table function shows the description of the chosen table.

rhive.desc.table("usarrests")
  col_name data_type comment
1  rowname    string
2   murder    double
3  assault       int
4 urbanpop       int
5     rape    double

This is effectively identical to this:

rhive.query("DESC usarrests")

rhive.load.table

The rhive.load.table function loads the contents of a Hive table into an R data.frame object.

df1 <- rhive.load.table("usarrests")
df1

This is effectively identical to this:

df1 <- rhive.query("SELECT * FROM usarrests")
df1

rhive.write.table

The rhive.write.table function is the counterpart of rhive.load.table, and is often the more useful of the two. Normally, to add data to Hive you must first create a table. rhive.write.table needs no such preparation: it creates a Hive table from an R data frame and inserts all of the data.

head(UScrime)
    M So  Ed Po1 Po2  LF  M.F Pop  NW  U1 U2 GDP Ineq     Prob    Time    y
1 151  1  91  58  56 510  950  33 301 108 41 394  261 0.084602 26.2011  791
2 143  0 113 103  95 583 1012  13 102  96 36 557  194 0.029599 25.2999 1635
3 142  1  89  45  44 533  969  18 219  94 33 318  250 0.083401 24.3006  578
4 136  0 121 149 141 577  994 157  80 102 39 673  167 0.015801 29.9012 1969
5 141  0 121 109 101 591  985  18  30  91 20 578  174 0.041399 21.2998 1234
6 121  0 110 118 115 547  964  25  44  84 29 689  126 0.034201 20.9995  682

rhive.write.table(UScrime)
[1] "UScrime"

rhive.list.tables()
       tab_name
1         aids2
2 new_usarrests
3     usarrests
4       uscrime

rhive.query("SELECT * FROM uscrime LIMIT 10")
   rowname   m so  ed po1 po2  lf   mf pop  nw  u1 u2 gdp ineq     prob    time
1        1 151  1  91  58  56 510  950  33 301 108 41 394  261 0.084602 26.2011
2        2 143  0 113 103  95 583 1012  13 102  96 36 557  194 0.029599 25.2999
3        3 142  1  89  45  44 533  969  18 219  94 33 318  250 0.083401 24.3006
4        4 136  0 121 149 141 577  994 157  80 102 39 673  167 0.015801 29.9012
5        5 141  0 121 109 101 591  985  18  30  91 20 578  174 0.041399 21.2998
6        6 121  0 110 118 115 547  964  25  44  84 29 689  126 0.034201 20.9995
7        7 127  1 111  82  79 519  982   4 139  97 38 620  168 0.042100 20.6993
8        8 131  1 109 115 109 542  969  50 179  79 35 472  206 0.040099 24.5988
9        9 157  1  90  65  62 553  955  39 286  81 28 421  239 0.071697 29.4001
10      10 140  0 118  71  68 632 1029   7  15 100 24 526  174 0.044498 19.5994
      y
1   791
2  1635
3   578
4  1969
5  1234
6   682
7   963
8  1555
9   856
10  705

The rhive.write.table function raises an error and does nothing if the target table already exists in Hive. If a table with the same name as the data frame already exists in Hive, delete it before calling rhive.write.table.

if (rhive.exist.table("uscrime")) {
  rhive.query("DROP TABLE uscrime")
}

rhive.write.table(UScrime)

RHive - alias functions

The names of RHive's functions resemble the naming convention of S3 generics, but most of them are not generic functions. The names are kept in this form for S3 generic functions that RHive may or may not support in the future. For users who find it confusing that functions containing "." in their names are not generics, RHive provides aliases with different names and identical behavior:

hiveConnect: the same as rhive.connect.

hiveQuery: the same as rhive.query.

hiveClose: the same as rhive.close.

hiveListTables: the same as rhive.list.tables.

hiveDescTable: the same as rhive.desc.table.

hiveLoadTable: the same as rhive.load.table.

rhive.basic.cut

rhive.basic.cut converts one numerical column of a table into a factorized column. The range of the numerical column is divided into intervals, and each value is replaced by the interval into which it falls. rhive.basic.cut takes the following six arguments: tablename (a table name), col (the name of a numerical column), breaks, right, summary, and forcedRef. breaks gives the numerical cut points for the column. right indicates which ends of the intervals are closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, they are closed on the left and open on the right. With summary = TRUE, the function returns the count of values falling into each interval. With summary = FALSE, it returns the name of a new table containing the factorized column. forcedRef = TRUE forces rhive.basic.cut to return a table name; with forcedRef = FALSE a data frame is returned instead. The defaults of right, summary, and forcedRef are TRUE, FALSE, and TRUE respectively.

Example for summary = FALSE

> table_name = rhive.basic.cut(tablename = "iris", col = "sepallength", breaks = seq(0, 5, 0.5), right = FALSE, summary = FALSE, forcedRef = TRUE)
> table_name
[1] "rhive_result_1330382904"
attr(,"result:size")
[1] 4296
> results = rhive.query("select * from rhive_result_1330382904")
> head(results)
  rowname sepalwidth petallength petalwidth species sepallength
1       1        3.5         1.4        0.2  setosa        NULL
2       2        3.0         1.4        0.2  setosa   [4.5,5.0)
3       3        3.2         1.3        0.2  setosa   [4.5,5.0)
4       4        3.1         1.5        0.2  setosa   [4.5,5.0)
5       5        3.6         1.4        0.2  setosa        NULL
6       6        3.9         1.7        0.4  setosa        NULL

Example for summary = TRUE

> summary = rhive.basic.cut(tablename = "iris", col = "sepallength", breaks = seq(0, 5, 0.5), right = FALSE, summary = TRUE, forcedRef = TRUE)
> summary
     NULL [4.0,4.5) [4.5,5.0)
      128         4        18
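For comparison, the same interval logic can be reproduced locally with base R's cut function on the built-in iris data. This is only a sketch of the semantics, not of how RHive computes the result in Hive. Base R labels the intervals slightly differently (for example "[4,4.5)" instead of "[4.0,4.5)") and marks values outside the breaks as NA where the Hive result shows NULL.

```r
# Base R analogue of the rhive.basic.cut example, using the built-in iris data.
breaks <- seq(0, 5, 0.5)

# right = FALSE gives intervals closed on the left and open on the right: [a, b).
sepal_cut <- cut(iris$Sepal.Length, breaks = breaks, right = FALSE)

# Values outside the breaks (here, sepal lengths of 5.0 or more) become NA,
# which corresponds to the NULL entries in the Hive result.
counts <- table(sepal_cut, useNA = "ifany")

# Show only the intervals that actually contain observations.
print(counts[counts > 0])
```

The non-empty counts should agree with the summary = TRUE output above: 4 values in [4.0,4.5), 18 in [4.5,5.0), and 128 outside the breaks.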

rhive.basic.cut2

rhive.basic.cut2 converts two numerical columns of a table into two factorized columns. The range of each numerical column is divided into intervals, and each value is replaced by the interval into which it falls. rhive.basic.cut2 takes the following eight arguments: tablename (a table name), col1 and col2 (two column names), breaks1, breaks2, right, keepCol, and forcedRef. breaks1 and breaks2 give the numerical cut points for the two columns. right indicates which ends of the intervals are closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, they are closed on the left and open on the right. With keepCol = TRUE, the two numerical columns are kept after the conversion; otherwise, the factorized columns replace the original numerical columns. forcedRef = TRUE forces rhive.basic.cut2 to return a table name; with forcedRef = FALSE a data frame is returned instead. The defaults of right, keepCol, and forcedRef are TRUE, FALSE, and TRUE respectively.

Example for right = TRUE and keepCol = FALSE

> table_name = rhive.basic.cut2(tablename = "iris", col1 = "sepallength", col2 = "petallength", breaks1 = seq(0, 5, 0.5), breaks2 = seq(0, 5, 0.5), right = TRUE, keepCol = FALSE, forcedRef = TRUE)
> table_name
[1] "rhive_result_1330385833"
attr(,"result:size")
[1] 5272
> results = rhive.query("select * from rhive_result_1330385833")
> head(results)
  rowname sepalwidth petalwidth species sepallength petallength rep
1       1        3.5        0.2  setosa        NULL   (1.0,1.5]   1
2       2        3.0        0.2  setosa   (4.5,5.0]   (1.0,1.5]   1
3       3        3.2        0.2  setosa   (4.5,5.0]   (1.0,1.5]   1
4       4        3.1        0.2  setosa   (4.5,5.0]   (1.0,1.5]   1
5       5        3.6        0.2  setosa   (4.5,5.0]   (1.0,1.5]   1
6       6        3.9        0.4  setosa        NULL   (1.5,2.0]   1

Example for right = FALSE and keepCol = TRUE

> table_name = rhive.basic.cut2(tablename = "iris", col1 = "sepallength", col2 = "petallength", breaks1 = seq(0, 5, 0.5), breaks2 = seq(0, 5, 0.5), right = FALSE, keepCol = TRUE, forcedRef = TRUE)
> table_name
[1] "rhive_result_1330315663"
attr(,"result:size")
[1] 6374
> results = rhive.query("select * from rhive_result_1330315663")
> head(results)
  rowname sepalwidth petalwidth species sepallength sepallength_cut petallength petallength_cut rep
1       1        3.5        0.2  setosa         5.1            NULL         1.4       [1.0,1.5)   1
2       2        3.0        0.2  setosa         4.9       [4.5,5.0)         1.4       [1.0,1.5)   1
3       3        3.2        0.2  setosa         4.7       [4.5,5.0)         1.3       [1.0,1.5)   1
4       4        3.1        0.2  setosa         4.6       [4.5,5.0)         1.5       [1.5,2.0)   1
5       5        3.6        0.2  setosa         5.0            NULL         1.4       [1.0,1.5)   1
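The effect of right and keepCol can be explored locally with base R before running the job on Hive. The sketch below is an assumption-laden analogue, not RHive's implementation: it cuts two columns of the built-in iris data with right = TRUE and keeps the original columns next to the factorized ones, using the same "_cut" suffix that appears in the Hive output above.

```r
# Base R analogue of rhive.basic.cut2 on the built-in iris data.
breaks <- seq(0, 5, 0.5)

# right = TRUE gives intervals open on the left and closed on the right: (a, b].
cut_two <- data.frame(
  sepallength     = iris$Sepal.Length,
  sepallength_cut = cut(iris$Sepal.Length, breaks = breaks, right = TRUE),
  petallength     = iris$Petal.Length,
  petallength_cut = cut(iris$Petal.Length, breaks = breaks, right = TRUE)
)

# Keeping all four columns mimics keepCol = TRUE.
print(head(cut_two, 5))

# Dropping the numerical columns mimics keepCol = FALSE.
cut_only <- cut_two[, c("sepallength_cut", "petallength_cut")]
print(head(cut_only, 5))
```

With right = TRUE a sepal length of exactly 5.0 falls into (4.5,5], whereas with right = FALSE it lies outside the breaks, which matches row 5 in the two Hive examples above.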

rhive.basic.xtabs

rhive.basic.xtabs builds a contingency table from cross-classifying factors. It takes a formula object and a table name as input arguments and returns a contingency table in matrix format based on the formula. For instance, in the formula "ncontrols ~ agegp + alcgp", the two columns agegp and alcgp are the cross-classifying factors, and the values of the column ncontrols are summed for each combination of those factors.

Example for esoph data

> xtab_formula = as.formula(paste("ncontrols","~", "agegp", "+","alcgp",sep =""))
> xtab_formula
ncontrols ~ agegp + alcgp
> table_result = rhive.basic.xtabs(formula = xtab_formula, tablename = "esoph")
> head(table_result)
       alcgp
agegp   0-39g/day 120+ 40-79 80-119
  25-34        61    5    45      5
  35-44        89   10    80     20
  45-54        78   15    81     39
  55-64        89   26    84     43
  65-74        71    8    53     29
  75+          27    3    12      2
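The same formula can be tried locally with base R's xtabs on the esoph data set that ships with R, which is a convenient way to check a formula before running it against Hive. This is a sketch for comparison only. The counts may differ from the Hive output above, because the copy of esoph bundled with R can vary between R versions, and base R orders the alcgp columns by factor level rather than by string order.

```r
# Base R analogue of the rhive.basic.xtabs example.
xtab_formula <- ncontrols ~ agegp + alcgp

# Sum ncontrols for every combination of age group and alcohol group.
local_result <- xtabs(xtab_formula, data = esoph)
print(local_result)

# Every control is counted exactly once across the cells of the table.
stopifnot(sum(local_result) == sum(esoph$ncontrols))
```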

rhive.basic.t.test

The rhive.basic.t.test function runs Welch's t-test on two samples. It tests the difference between the two sample means against the alternative hypothesis that the difference is not 0, so a two-sided test is performed.

The following example tests the difference between the mean sepal length and the mean petal length of the irises. Note how the function is called with the "sepallength" and "petallength" columns.

> rhive.basic.t.test("iris", "sepallength", "iris", "petallength")
[1] "t = 13.1422338118038, df = 211.542688378717, p-value = 0, mean of x : 5.84333333333333, mean of y : 3.758"
$statistic
       t
13.14223

$parameter
      df
211.5427

$p.value
[1] 0

$estimate
$estimate[[1]]
mean of x
 5.843333

$estimate[[2]]
mean of y
    3.758

The p-value is reported as 0, which indicates a difference between the mean sepal length and the mean petal length. The resulting statistics are returned as an R list object, and a string summarizing them is printed to the console.

The iris data, provided with R, contains 150 observations. Running R's t.test on the same data gives a slightly different t-statistic of 13.0984. The difference arises because t.test uses the sample variance to compute the t-statistic, while rhive.basic.t.test uses the population variance. With little data, as in this example, the two t-statistics can differ, but the difference shrinks as the data grows. Because rhive.basic.t.test was designed with massive data in mind, it uses the population variance for faster calculation.
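The two t-statistics quoted above can be reproduced in base R by computing Welch's statistic once with the sample variance (denominator n - 1) and once with the population variance (denominator n). The function welch_t below is a hypothetical helper written for this illustration; it is not how RHive computes the statistic inside Hive.

```r
# Welch's t-statistic with either the sample or the population variance.
welch_t <- function(x, y, population = FALSE) {
  vx <- var(x)  # var() divides by n - 1 (sample variance)
  vy <- var(y)
  if (population) {
    # Rescale by (n - 1) / n to obtain the population variance.
    vx <- vx * (length(x) - 1) / length(x)
    vy <- vy * (length(y) - 1) / length(y)
  }
  (mean(x) - mean(y)) / sqrt(vx / length(x) + vy / length(y))
}

x <- iris$Sepal.Length
y <- iris$Petal.Length

t_sample     <- welch_t(x, y)                     # what t.test() reports
t_population <- welch_t(x, y, population = TRUE)  # matches the RHive output

print(c(sample = t_sample, population = t_population))
```

With equal sample sizes n, the two statistics differ by the factor sqrt(n / (n - 1)), which approaches 1 as n grows. This is why the discrepancy matters little for large tables.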

rhive.block.sample

The percent argument is optional and sets the percentage of the total data to extract. Its default value is 0.01, which means 0.01% of the total data. However, percent is not the ratio of sampled records to total records; it is closer to the ratio of sampled blocks to total blocks, because rhive.block.sample samples by block.

As a result, rhive.block.sample may return the entire data set when applied to a small Hive table. This happens when the data is smaller than the block size configured in Hive.

The seed argument specifies the random seed used for block sampling in Hive. With identical seeds, Hive's block sampling returns the same result. To obtain a different random sample on each run, pass a seed generated with R's sample function.

The subset argument is optional and specifies a condition that the data extracted from the target Hive table must satisfy. It is a character value that corresponds to the WHERE clause in Hive's HQL, so it must follow the syntax of an HQL WHERE clause.

rhive.block.sample returns, as a character value, the name of the Hive table that holds the sampling result. That is, it automatically creates a temporary Hive table from the sampled blocks and returns that table's name. The following example samples 0.01% of the Hive table called listvirtualmachines, using R's sample function to generate the random seed.

seedNumber <- sample(1:2^16, 1)

rhive.block.sample("listvirtualmachines", seed=seedNumber)

[1] "rhive_sblk_1330404552"

In this example, a Hive table named "rhive_sblk_1330404552", holding 0.01% of the data from the Hive table "listvirtualmachines", has been created.

rhive.basic.scale

The rhive.basic.scale function standardizes a numerical column to mean 0 and standard deviation 1. Pass the table name as the first argument and the name of the column to scale as the second.

The returned list holds, as a string, the name of a Hive table to which a "scaled_<column name>" column has been added. That table can be accessed and modified through RHive like any other Hive table.

scaled <- rhive.basic.scale("iris", "sepallength")
attr(scaled, "scaled:center")
# [1] 5.843333
attr(scaled, "scaled:scale")
# [1] 0.8253013
rhive.desc.table(scaled[[1]])
#             col_name data_type comment
# 1            rowname    string
# 2         sepalwidth    double
# 3        petallength    double
# 4         petalwidth    double
# 5            species    string
# 6        sepallength    double
# 7 sacled_sepallength    double
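The center and scale attributes shown above can be checked against base R. The reported scale value of 0.8253013 is consistent with the population standard deviation (denominator n) rather than the sample standard deviation returned by sd (denominator n - 1). That reading is inferred from the numbers in this tutorial, not from RHive's source code. A minimal local sketch:

```r
x <- iris$Sepal.Length

center    <- mean(x)
pop_sd    <- sqrt(mean((x - center)^2))  # population SD, divides by n
sample_sd <- sd(x)                       # sample SD, divides by n - 1

# Standardize with the population SD, as the reported attribute suggests.
scaled_x <- (x - center) / pop_sd

print(c(center = center, pop_sd = pop_sd, sample_sd = sample_sd))
```

The standardized values have mean 0, and the population standard deviation of scaled_x is 1 by construction.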

rhive.basic.by

The rhive.basic.by function runs a group-by on a specified column. The code below groups by the "species" column and applies the sum function to "sepallength". The result contains the sum of sepallength for each species.

rhive.basic.by("iris", "species", "sum", "sepallength")
#      species   sum
# 1     setosa 250.3
# 2 versicolor 296.8
# 3  virginica 329.4
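For a table small enough to fit in memory, the same grouped sum can be computed with base R's aggregate function, which is a handy cross-check of the Hive result. This sketch uses the built-in iris data, whose column names (Species, Sepal.Length) differ from the lowercase names in the Hive table.

```r
# Group by Species and sum Sepal.Length, mirroring
# rhive.basic.by("iris", "species", "sum", "sepallength").
sums <- aggregate(Sepal.Length ~ Species, data = iris, FUN = sum)
print(sums)
```

The three sums should agree with the Hive output above.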

rhive.basic.merge

rhive.basic.merge creates a new data set by merging two tables on rows whose key columns match.

# checking data
rhive.query('select * from iris limit 5')
  rowname sepallength sepalwidth petallength petalwidth species
1       1         5.1        3.5         1.4        0.2  setosa
2       2         4.9        3.0         1.4        0.2  setosa
3       3         4.7        3.2         1.3        0.2  setosa
4       4         4.6        3.1         1.5        0.2  setosa
5       5         5.0        3.6         1.4        0.2  setosa

rhive.query('select * from usarrests limit 5')
     rowname murder assault urbanpop rape
1    Alabama   13.2     236       58 21.2
2     Alaska   10.0     263       48 44.5
3    Arizona    8.1     294       80 31.0
4   Arkansas    8.8     190       50 19.5
5 California    9.0     276       91 40.6

## rhive.basic.merge
rhive.basic.merge('iris','usarrests',by.x='sepallength',by.y='murder')
  sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
1         4.3        3.0         1.1        0.1  setosa     102       62 16.5      14
2         4.4        2.9         1.4        0.2  setosa     149       85 16.3       9
3         4.4        3.0         1.3        0.2  setosa     149       85 16.3      39
4         4.4        3.2         1.3        0.2  setosa     149       85 16.3      43
5         4.9        3.1         1.5        0.1  setosa     159       67 29.3      10

Merge is similar to a join in SQL. The following query produces the same result.

# Use a join to select the join key and all columns other than the key.
# Where column names overlap, only the column from the first table is output.

rhive.big.query('select a.sepallength,a.sepalwidth,a.petallength,a.petalwidth,a.species,b.assault,b.urbanpop,b.rape,a.rowname from iris a join usarrests b on a.sepallength = b.murder')
  sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
1         4.3        3.0         1.1        0.1  setosa     102       62 16.5      14
2         4.4        2.9         1.4        0.2  setosa     149       85 16.3       9
3         4.4        3.0         1.3        0.2  setosa     149       85 16.3      39
4         4.4        3.2         1.3        0.2  setosa     149       85 16.3      43
5         4.9        3.1         1.5        0.1  setosa     159       67 29.3      10
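The by.x and by.y arguments follow the same idea as base R's merge function, so the join can be prototyped locally on the built-in iris and USArrests data. The sketch below is an analogue under assumptions: the data frames iris_tbl and arrests_tbl and their column names are made up here to resemble the Hive tables, and only a few columns are carried over.

```r
# Local stand-ins for the two Hive tables (column names are assumptions).
iris_tbl <- data.frame(
  rowname     = seq_len(nrow(iris)),
  sepallength = iris$Sepal.Length,
  species     = iris$Species
)
arrests_tbl <- data.frame(
  state   = rownames(USArrests),
  murder  = USArrests$Murder,
  assault = USArrests$Assault
)

# Inner join on iris_tbl$sepallength == arrests_tbl$murder.
# The key column keeps the name given in by.x; the by.y column is dropped.
merged <- merge(iris_tbl, arrests_tbl, by.x = "sepallength", by.y = "murder")
print(head(merged))
```

Joining on floating-point columns relies on exact equality of the values. It works here because both data sets store one-decimal literals, but a key of integer or string type is generally a safer choice.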

rhive.basic.mode

rhive.basic.mode returns the mode and its frequency for a specified column of the Hive table.

rhive.basic.mode('iris', 'sepallength')
  sepallength freq
1           5   10

rhive.basic.range

rhive.basic.range returns the lowest and greatest values of the specified numerical column of the Hive table.

rhive.basic.range('iris', 'sepallength')
[1] 4.3 7.9
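Both summaries, the mode from rhive.basic.mode and the range from rhive.basic.range, can be cross-checked in base R on the built-in iris data. Base R has no built-in function for the statistical mode, so this sketch derives it from a frequency table. The variable names are chosen for illustration only.

```r
# Frequency of each distinct sepal length.
freq <- table(iris$Sepal.Length)

# The mode is the value with the highest frequency.
mode_value <- as.numeric(names(freq)[which.max(freq)])
mode_freq  <- max(freq)

# Lowest and greatest values of the column.
value_range <- range(iris$Sepal.Length)

print(data.frame(sepallength = mode_value, freq = mode_freq))
print(value_range)
```

If several values share the highest frequency, which.max returns only the first of them.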