Advanced Hive, NoSQL Database
(HBase)
HiveQL: Data Manipulation
Loading Data into Managed Tables
• Hive has no row-level insert, update, or delete operations.
• Data can be loaded into tables only through "bulk" load operations.
LOAD DATA LOCAL INPATH '/usr/hive/warehouse/california-employees'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
Inserting Data into Tables from Queries
• The INSERT statement loads data into a table from a query.
• With OVERWRITE, any previous contents of the partition (or of the whole table, if not partitioned) are replaced.
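A minimal sketch of such an insert (the staging_employees source table and its columns are assumptions for illustration):

```sql
-- Replace the US/OR partition of employees with rows selected
-- from an assumed staging table
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT name, salary, address
FROM staging_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';
```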
Dynamic Partition Inserts
• Hive supports a dynamic partition feature that infers the partitions to create from the values selected in the query.
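A sketch of a dynamic partition insert (column names are assumptions; the nonstrict mode setting is needed when all partition columns are dynamic):

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The last two selected columns feed the country and state partitions
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT name, salary, address, cnty, st
FROM staging_employees;
```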
Creating Tables and Loading Them in One Query
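This is done with CREATE TABLE ... AS SELECT (CTAS); a sketch (source table and columns assumed):

```sql
-- The new table's schema is derived from the SELECT clause
CREATE TABLE ca_employees
AS SELECT name, salary, address
FROM employees
WHERE state = 'CA';
```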
Exporting Data
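Data can be exported by writing query results to a directory; a sketch (the output path is an assumption):

```sql
-- Writes the query result as files under the given local directory
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE state = 'CA';
```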
User Defined Functions
• Hive can use User Defined Functions (UDFs) written in Java to perform computations that would otherwise be difficult (or impossible) using the built-in Hive functions and SQL commands.
• To invoke a UDF from within a Hive script, you must:
1. Register the JAR file that contains the UDF class, and
2. Define an alias for the function using the CREATE TEMPORARY FUNCTION command.
Example UDF
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFZodiacSign extends UDF {
  private final SimpleDateFormat df;

  public UDFZodiacSign() {
    df = new SimpleDateFormat("MM-dd-yyyy");
  }

  public String evaluate(Date bday) {
    // Date.getMonth() is zero-based; Date.getDate() is the day of the month
    // (getDay(), used in the original, returns the day of the week)
    return this.evaluate(bday.getMonth() + 1, bday.getDate());
  }

  public String evaluate(String bday) {
    Date date = null;
    try {
      date = df.parse(bday);
    } catch (Exception ex) {
      return null; // unparsable input yields NULL in the query result
    }
    return this.evaluate(date.getMonth() + 1, date.getDate());
  }

  public String evaluate(Integer month, Integer day) {
    if (month == 1) {
      return (day < 20) ? "Capricorn" : "Aquarius";
    }
    if (month == 2) {
      return (day < 19) ? "Aquarius" : "Pisces";
    }
    return null; // remaining signs omitted in the original example
  }
}
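Registering and invoking the UDF above from a Hive session could look like this (the JAR path and the zodiac alias are assumptions):

```sql
ADD JAR /path/to/zodiac-udf.jar;
CREATE TEMPORARY FUNCTION zodiac AS 'UDFZodiacSign';

SELECT name, zodiac(bday) FROM employees;
```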
Custom Map/Reduce in Hive
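Custom map/reduce logic is plugged into a query with the TRANSFORM clause; a sketch (the script name and columns are assumptions):

```sql
ADD FILE /path/to/my_mapper.py;

-- Each selected row is streamed through the external script,
-- whose tab-separated output is parsed back into the AS columns
SELECT TRANSFORM (name, salary)
USING 'python my_mapper.py'
AS (name, tier)
FROM employees;
```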
HBase: Introduction to HBase
• HBase is a distributed column-oriented data store built on top of HDFS.
• HBase is an Apache open source project whose goal is to provide Bigtable-like storage for the Hadoop distributed computing environment.
• Data is logically organized into tables, rows, and columns.
HBase vs. HDFS
• HDFS is good for batch processing (scans over big files), but:
  • Not good for record lookup
  • Not good for incremental addition of small batches
  • Not good for updates
• HBase is designed to efficiently address the above points:
  • Fast record lookup
  • Support for record-level insertion
  • Support for updates (not in place)
Tables, Rows, Column Family
• Table: HBase organizes data into tables. Table names are Strings composed of characters that are safe for use in a file system path.
• Row: Within a table, data is stored according to its row. Rows are identified uniquely by their row key. Row keys do not have a data type and are always treated as a byte[ ] (byte array).
• Column Family: Data within a row is grouped by column family. Column families also impact the physical arrangement of data stored in HBase. For this reason, they must be defined up front and are not easily modified. Every row in a table has the same column families, although a row need not store data in all its families. Column families are Strings and composed of characters that are safe for use in a file system path.
• Column Qualifier: Data within a column family is addressed via its column qualifier, or simply, column. Column qualifiers need not be specified in advance. Column qualifiers need not be consistent between rows. Like row keys, column qualifiers do not have a data type and are always treated as a byte[ ].
• Cell: A combination of row key, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell’s value. Values also do not have a data type and are always treated as a byte[ ].
• Timestamp: Values within a cell are versioned. Versions are identified by their version number, which by default is the timestamp of when the cell was written. If a timestamp is not specified during a write, the current timestamp is used. If the timestamp is not specified for a read, the latest one is returned. The number of cell value versions retained by HBase is configured for each column family. The default number of cell versions is three.
Column, Cell, Timestamp
Pictorial Representation
Representation as a Multi Dimensional Map
SortedMap<RowKey, SortedMap<Column, SortedMap<Timestamp, Value>>>
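This nested-map view can be sketched with plain java.util.TreeMap. The class below is a hypothetical in-memory simplification (row keys, columns, and values modeled as Strings), not the HBase client API:

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

public class NestedMapSketch {
    // row key -> (column -> (timestamp, newest first -> value))
    static final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table =
            new TreeMap<>();

    static void put(String row, String column, long ts, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    static String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> cols = table.get(row);
        if (cols == null) return null;
        NavigableMap<Long, String> versions = cols.get(column);
        // the newest timestamp sorts first, mirroring HBase's default read behaviour
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        put("row1", "cf:name", 100L, "alice");
        put("row1", "cf:name", 200L, "bob"); // the newer version shadows the older one
        System.out.println(get("row1", "cf:name")); // prints "bob"
    }
}
```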
HBase Table as Key-Value Store
HBase Architecture
Client API: Administrative API
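A sketch of table creation through the administrative API, using the classic (pre-1.0) client classes; the table and family names are assumptions, and the code needs a running HBase cluster plus its client JARs, so it is illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Column families must be declared up front, per the notes above
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("users"));
    HColumnDescriptor family = new HColumnDescriptor("info");
    family.setMaxVersions(3); // matches the default of three cell versions
    desc.addFamily(family);

    admin.createTable(desc);
    admin.close();
  }
}
```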
Client API: CRUD Operations put()
Client API: CRUD Operations get()
Client API: CRUD Operations delete()
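The three CRUD calls above can be sketched together with the classic client API (table, family, and qualifier names are assumptions; requires a live cluster, so illustrative only):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CrudExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");

    // put(): insert a cell addressed by row key, column family, and qualifier
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
    table.put(put);

    // get(): read back the latest version of the cell
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));

    // delete(): remove the whole row
    Delete delete = new Delete(Bytes.toBytes("row1"));
    table.delete(delete);

    table.close();
  }
}
```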
HBase Clients
• Java Client
  • Useful when the interacting application is written in Java.
• REST and Thrift
  • HBase ships with REST and Thrift interfaces. These are useful when the interacting application is written in a language other than Java.
HBase MapReduce Integration

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SimpleRowCounter extends Configured implements Tool {

  static class RowCounterMapper extends TableMapper<ImmutableBytesWritable, Result> {
    public static enum Counters { ROWS }

    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context) {
      context.getCounter(Counters.ROWS).increment(1);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("Usage: SimpleRowCounter <tablename>");
      return -1;
    }
    String tableName = args[0];
    // FirstKeyOnlyFilter returns only the first cell of each row,
    // which is enough for counting and keeps the scan cheap
    Scan scan = new Scan();
    scan.setFilter(new FirstKeyOnlyFilter());

    Job job = new Job(getConf(), getClass().getSimpleName());
    job.setJarByClass(getClass());
    TableMapReduceUtil.initTableMapperJob(tableName, scan,
        RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0); // map-only job: counting needs no reduce phase
    job.setOutputFormatClass(NullOutputFormat.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(HBaseConfiguration.create(),
        new SimpleRowCounter(), args);
    System.exit(exitCode);
  }
}