1. Apache Hive Sheetal Sharma Intern at IBM Innovation Centre
2. Apache Hive Apache Hive is a tool built on top of Hadoop for
analyzing large, unstructured data sets using a SQL-like syntax,
thus making Hadoop accessible to legions of existing BI and
corporate analytics researchers. Hive is fundamentally an
operational data store that's also suitable for analyzing large,
relatively static data sets where query time is not important.
3. Apache Hive Hive makes an excellent addition to an existing
data warehouse, but it is not a replacement. Instead, using Hive to
augment a data warehouse is a great way to leverage existing
investments while keeping up with the data deluge. The Hive data store
brings together vast amounts of unstructured data -- such as log
files, customer tweets, email messages, geo-data, and CRM
interactions -- and stores them in their raw, unstructured form on
cheap commodity hardware.
4. Apache Hive Hive allows analysts to project a database-like
structure onto this data, so that it resembles traditional tables,
columns, and rows, and to write SQL-like queries over it. This means
that different schemas may be projected over the same data sets,
depending on the nature of the query, allowing the user to ask
questions that weren't envisioned when the data was gathered.
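A minimal HiveQL sketch of schema projection (the table name, columns, delimiter, and HDFS path are all illustrative, not from the slides): an external table lays a structure over files that already exist, and a different table definition could be projected over the same files.

```sql
-- Hypothetical example: project a tabular schema onto raw log
-- files already sitting in HDFS; the data itself is not moved.
CREATE EXTERNAL TABLE web_logs (
  ip      STRING,
  ts      STRING,
  request STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/web_logs';
```

Dropping an external table removes only the schema and leaves the underlying files untouched, which is what makes projecting a different schema over the same data cheap.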
5. Apache Hive Hive queries traditionally had high latency, and
even small queries could take some time to run because they were
transformed into map-reduce jobs and submitted to the cluster to run
in batch mode. Long-running queries were inconvenient and
troublesome in a multi-user environment, where a single job
could dominate the cluster.
6. Apache Hive multi-user environment
7. Apache Hive HiveQL, the query language, is based on SQL-92,
but it differs from SQL in some important ways because it runs on
top of Hadoop. For instance, DDL (Data Definition Language)
commands need to account for the fact that tables exist in a
multi-user file system that supports multiple storage formats.
Nevertheless, SQL users will find HiveQL familiar and
should not have any problems adapting to it.
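As an illustration of such DDL (every name and path below is hypothetical): a HiveQL table can declare its on-disk file format and its location in the shared file system, concerns that have no direct SQL-92 counterpart.

```sql
-- Sketch: Hadoop-specific clauses in a HiveQL CREATE TABLE.
CREATE EXTERNAL TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING)  -- partitions map to HDFS directories
STORED AS SEQUENCEFILE             -- one of several supported formats
LOCATION '/data/shared/page_views';
```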
8. Hive platform architecture
9. Hive platform architecture From the top down, Hive looks
much like any other relational database. Users write SQL queries
and submit them for processing, either through a command-line tool
that interacts directly with the database engine or through
third-party tools that communicate with the database via JDBC or
ODBC. Using the JDBC and ODBC drivers, available for Mac and
Windows, data workers can connect their favorite SQL client to Hive
to browse, query, and create tables.
10. Working with Hive HiveQL was designed to ease the
transition from SQL and to get data analysts up and running on
Hadoop right away. Most BI and SQL developer tools can connect to
Hive as easily as to any other database. Using the ODBC connector,
users can import data and use tools like PowerPivot for Excel to
explore and analyze data, making big data accessible across the
organization.
11. Differences between HiveQL and standard SQL Hive 0.13 was
designed to perform full-table scans across petabyte-scale data
sets using the YARN and Tez infrastructure, so some features
normally found in a relational database aren't available to the
Hive user. These include transactions, cursors, prepared
statements, row-level updates and deletes, and the ability to
cancel a running query. The absence of these features won't
significantly affect data analysis, but it might affect your
ability to use existing SQL queries on a Hive cluster.
12. Differences between HiveQL and standard SQL In a traditional
database environment, the database engine controls all reads and
writes to the database. In Hive, the database tables are stored as
files in the Hadoop Distributed File System (HDFS), where other
applications may have modified them. Although this can be a good
thing, it means that Hive can never be certain whether the data
being read matches the schema.
13. Aspects of Data Storage: File Formats and Compression Tuning
Hive queries can involve making the underlying map-reduce jobs run
more efficiently by optimizing the number, type, and size of the
files backing the database tables. Hive's default storage format is
text, which has the advantage of being usable by other tools. The
disadvantage, however, is that queries over raw text files can't be
easily optimized.
14. Hive can read and write several file formats and decompress
many of them on the fly. Storage requirements and query efficiency
can differ dramatically among these file formats, as can be seen in
the figure below (courtesy of Hortonworks). File formats are an
active area of research in the Hadoop community. Efficient file
formats both reduce storage costs and increase query
efficiency.
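One way to take advantage of a more efficient format is to rewrite an existing text-backed table; a sketch (the table names are illustrative, while ORC and the orc.compress table property are actual Hive 0.13 features):

```sql
-- Copy a text-format table into a compressed, columnar ORC table.
CREATE TABLE web_logs_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB')
AS SELECT * FROM web_logs;  -- web_logs is a hypothetical text table
```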
15. For Example For example, let's say you want to run a query
that requires logic not available in Hive's built-in SQL. Without a
UDF (user-defined function), you would have to dump a temporary
table to disk, run a second tool (such as Pig or Java) to apply your
custom logic, and possibly produce a third table in HDFS that would
then be analyzed by Hive.
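With a UDF, that round trip collapses into a single query. A sketch of registering and calling one (the jar path, class name, table, and column are all hypothetical):

```sql
ADD JAR /tmp/my-udfs.jar;                 -- illustrative jar path
CREATE TEMPORARY FUNCTION clean_url
  AS 'com.example.hive.udf.CleanUrl';     -- illustrative UDF class
SELECT clean_url(request) FROM web_logs;  -- illustrative table/column
```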
16. Hive Query Performance Hive 0.13 is the final piece in the
Stinger initiative, a community effort to improve the performance
of Hive. The most significant feature of 0.13 is the ability to run
queries on the new Tez execution framework. Query times drop by
roughly half when run on Tez, and on queries that could be cached,
times dropped another 30 percent. On larger data sets, the speedup
was even more dramatic, making it possible to execute
petabyte-scale queries to refine and cleanse data for later
incorporation into data warehouse analytics.
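The engine is selected per session; a minimal sketch (the query and table are illustrative, while hive.execution.engine is the actual property that switches Hive onto Tez):

```sql
SET hive.execution.engine=tez;  -- default is mr (classic MapReduce)
SELECT status, COUNT(*) AS hits
FROM web_logs                   -- hypothetical table
GROUP BY status;
```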
17. Hive Query Performance Hadoop and Hive can also be used
in the reverse scenario: to off-load data summaries that would
otherwise need to be stored in the data warehouse at much greater
cost. Organizations or departments without a data warehouse can
start with Hive to get a feel for the value of data analytics. It
makes a great, low-cost, large-scale operational data store with a
fair set of analytics tools. Hive offers near-linear scalability in
query processing and an order-of-magnitude better price/performance
ratio than traditional enterprise data warehouses.