Apache AVRO (Boston HUG, Jan 19, 2010)

Apache AVRO: What's new?

Philip Zeyliger, Cloudera (AVRO committer)

Boston HUG, January 19, 2010

What's AVRO?

A data serialization system. Includes:
 * A schema language
 * A compact serialized form
 * An RPC framework
 * A handful of APIs, in a handful of languages

Goals:
 * Cross-language
 * Support for dynamic access
 * Simple but expressive schema evolution

Same "space" as Apache Thrift, Google Protocol Buffers, Binary JSON, and XDR. Subtle differences with all of them.

AVRO Protocols & Schemas

    @namespace("org.apache.avro.demo")
    protocol CurrencyConversion {
      enum Currency { USD, GBP, EUR, JPY }

      record Money {
        Currency currency;
        int amount;
      }

      error UnknownRateError {
        Currency currency;
      }

      Money convert(Money input, Currency targetCurrency) throws UnknownRateError;
      double rate(Currency input, Currency output) throws UnknownRateError;
    }

"genavro" IDL (AVRO-258)

    $ java -jar avro-tools-1.2.0-dev.jar genavro < demo.genavro
    {
      "protocol" : "CurrencyConversion",
      "namespace" : "org.apache.avro.demo",
      "types" : [ {
        "type" : "enum",
        "name" : "Currency",
        "symbols" : [ "USD", "GBP", "EUR", "JPY" ]
      }, {
        "type" : "record",
        "name" : "Money",
        "fields" : [ {
          "name" : "currency",
          "type" : "Currency"
        }, {
          "name" : "amount",
          "type" : "int"
        } ]
      }, {
        "type" : "error",
        "name" : "UnknownRateError",
        "fields" : [ {
          "name" : "currency",
          "type" : "Currency"
        } ]
      } ],
      "messages" : {
        "convert" : {
          "request" : [ {
            "name" : "input",
            "type" : "Money"
          }, {
            "name" : "targetCurrency",
            "type" : "Currency"
          } ],
          "response" : "Money",
          "errors" : [ "UnknownRateError" ]
        },
        "rate" : {
          "request" : [ {
            "name" : "input",
            "type" : "Currency"
          }, {
            "name" : "output",
            "type" : "Currency"
          } ],
          "response" : "double",
          "errors" : [ "UnknownRateError" ]
        }
      }
    }

JSON Representation of Protocol and Schemas
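Because the protocol is plain JSON, any language with a JSON parser can inspect it without generated code. The following is a minimal sketch in Python using only the standard library; the helper name message_signatures is made up for illustration and is not part of any AVRO API, and the embedded JSON is an abbreviated copy of the "messages" section above.

```python
import json

# Abbreviated copy of the protocol JSON shown above (only the parts used here).
PROTOCOL_JSON = """
{
  "protocol": "CurrencyConversion",
  "namespace": "org.apache.avro.demo",
  "messages": {
    "convert": {
      "request": [
        {"name": "input", "type": "Money"},
        {"name": "targetCurrency", "type": "Currency"}
      ],
      "response": "Money",
      "errors": ["UnknownRateError"]
    },
    "rate": {
      "request": [
        {"name": "input", "type": "Currency"},
        {"name": "output", "type": "Currency"}
      ],
      "response": "double",
      "errors": ["UnknownRateError"]
    }
  }
}
"""


def message_signatures(protocol):
    """Hypothetical helper: render each message as an IDL-like signature."""
    signatures = []
    for name, message in sorted(protocol["messages"].items()):
        params = ", ".join(
            f'{param["type"]} {param["name"]}' for param in message["request"]
        )
        signature = f'{message["response"]} {name}({params})'
        errors = ", ".join(message.get("errors", []))
        if errors:
            signature += f" throws {errors}"
        signatures.append(signature)
    return signatures


protocol = json.loads(PROTOCOL_JSON)
for line in message_signatures(protocol):
    print(line)
```

The printed lines mirror the two message declarations in the genavro source, which is a quick way to check that the IDL and its JSON form agree.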

Types

primitive
 * string
 * bytes
 * int & long
 * float & double
 * boolean
 * null

complex
 * record
 * array
 * map: string -> T
 * union
 * fixed<N>
 * enum

Schema Evolution & Projection

AVRO binary data never travels without its schema. This allows dynamic tooling.
Writer's Schema and Reader's Schema may be different.

Writer's Schema:

    {
      "type" : "record",
      "name" : "Person",
      "fields" : [
        { "name" : "first", "type" : "string" },
        { "name" : "sport", "type" : "string" }
      ]
    }

Serialized Data:

"Alice", "Ultimate Frisbee"

Reader's Schema:

    {
      "type" : "record",
      "name" : "Person",
      "fields" : [
        { "name" : "first", "type" : "string" },
        { "name" : "age", "type" : "int", "default" : 0 }
      ]
    }

Data presented to application:

"Alice", 0

APIs

Python
 * Dynamic

Java
 * Specific (generated code)
 * Generic (container-based)
 * Reflection (induces schemas from classes)

C

C++

Ruby

C API

    char buf[64];
    avro_writer_t writer = avro_writer_memory(buf, sizeof(buf));
    avro_schema_t writers_schema = avro_schema_string();
    avro_datum_t datum = avro_string("Hello, world!");
    avro_write_data(writer, writers_schema, datum);

    avro_reader_t reader = avro_reader_memory(buf, sizeof(buf));
    avro_schema_t readers_schema = avro_schema_string();
    avro_datum_t read_datum;
    avro_read_data(reader, writers_schema, readers_schema, &read_datum);

Data File Format (AVRO-160)

Features:
 * Splittable (important for Hadoop!)
 * Append only with same schema.
 * Compression
 * Arbitrary metadata
 * Simple

Hadoop Integration

Users
 * AvroInputFormat/AvroOutputFormat (MR-815)
 * Using AVRO in the shuffle (MR-1126)

Note that AVRO schemas let you specify sort order; hand-written binary comparators are a thing of the past.

 * Many Writables can be AVRO+Reflection instead
 * AVRO sort order leaves hand-writing RawComparators in the past; for Streaming, you now get fast comparators for free!

Framework
 * AVRO for Hadoop RPC (e.g., HDFS-982)

Goals
 * Open up protocols for cross-language use
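To illustrate the sort-order point, here is a rough Python sketch of a schema-driven comparator. It assumes the per-field "order" attribute ("ascending", "descending", or "ignore") and compares fields in schema order. It works on already-decoded dicts for readability, whereas the fast comparators mentioned above operate on the binary form. The schema, the sample records, and the name compare_records are invented for this example.

```python
from functools import cmp_to_key

# Illustrative schema: sort by first name, then by age descending;
# "sport" does not take part in ordering.
SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "first", "type": "string"},
        {"name": "age", "type": "int", "order": "descending"},
        {"name": "sport", "type": "string", "order": "ignore"},
    ],
}


def compare_records(schema, a, b):
    """Hypothetical helper: return -1, 0, or 1 using the schema's field order."""
    for field in schema["fields"]:
        order = field.get("order", "ascending")
        if order == "ignore":
            continue
        x, y = a[field["name"]], b[field["name"]]
        result = (x > y) - (x < y)
        if result != 0:
            return -result if order == "descending" else result
    return 0


people = [
    {"first": "Bob", "age": 30, "sport": "Chess"},
    {"first": "Alice", "age": 25, "sport": "Go"},
    {"first": "Alice", "age": 40, "sport": "Ultimate Frisbee"},
]

ordered = sorted(people, key=cmp_to_key(lambda a, b: compare_records(SCHEMA, a, b)))
for person in ordered:
    print(person["first"], person["age"])
```

The design point is that the comparator is derived entirely from the schema, so no per-type comparison code has to be written by hand.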

avro-tools

Available tools:
    compile     Generates Java code for the given schema.
    fragtojson  Renders a binary-encoded Avro datum as JSON.
    fromjson    Reads JSON records and writes an Avro data file.
    genavro     Generates a JSON schema from a GenAvro file.
    getschema   Prints out schema of an Avro data file.
    induce      Induce a schema/protocol from Java class/interface.
    jsontofrag  Renders a JSON-encoded Avro datum as binary.
    rpcreceive  Opens an HTTP RPC Server and listens for one message.
    rpcsend     Sends a single RPC message.
    tojson      Dumps an Avro data file as JSON, one record per line.

1.3 to be released soon...

Good time to try it out!

What's evolving?
 * Trying not to evolve the serialized format.
 * APIs are evolving.
 * Transports are evolving.

Obligatory Links

Web page: http://hadoop.apache.org/avro/
Mailing list: avro-user-subscribe@hadoop.apache.org
Source repository: http://svn.apache.org/repos/asf/hadoop/avro/

Thanks!

Questions?

Philip Zeyliger
philip@cloudera.com
