Introduction to Apache Avro
Doug Cutting, 21 July 2010
Avro is...
● data serialization
● file format
● RPC format
Existing Serialization Systems: Protocol Buffers & Thrift
● expressive
● efficient (small & fast)
● but not very dynamic
● cannot browse arbitrary data
● viewing a new datatype
  – requires code generation & load
● writing a new datatype
  – requires generating schema text
  – plus code generation & load
Avro Serialization
● specifies a serialization format
● schema language is in JSON
  ● each language already has a JSON parser
● each language implements a data reader & writer
  ● in normal code
● code generation is optional (see the sketch below)
  ● sometimes useful in statically typed languages
● data is untagged
  ● schema required to read & write
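A minimal sketch of this generic, code-generation-free path, assuming the modern Avro Java API (org.apache.avro); the schema text and values are illustrative:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class GenericRoundTrip {
  public static void main(String[] args) throws Exception {
    // parse the schema from its JSON text; no classes are generated
    Schema schema = new Schema.Parser().parse(
        "{\"name\": \"Block\", \"type\": \"record\", \"fields\": ["
        + "{\"name\": \"id\", \"type\": \"string\"},"
        + "{\"name\": \"length\", \"type\": \"int\"}]}");

    // build a record generically, field by field
    GenericRecord block = new GenericData.Record(schema);
    block.put("id", "blk-1");
    block.put("length", 64);

    // write: the binary encoding is untagged, driven entirely by the schema
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(block, enc);
    enc.flush();

    // read: the same schema is required to decode the bytes
    BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord copy = new GenericDatumReader<GenericRecord>(schema).read(null, dec);
    System.out.println(copy);
  }
}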
Avro Schema Evolution
● writer's schema always provided to reader
● so reader can compare:
  ● the schema used to write with
  ● the schema expected by the application
● fields that match (name & type) are read
● fields written that don't match are skipped
● expected fields not written can be identified
● same features as provided by numeric field ids (resolution sketched below)
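A hedged sketch of this resolution with the generic Java API, passing both schemas to one reader; the schemas and the encodedBytes variable are illustrative:

// writer's schema: what the data was encoded with
Schema writer = new Schema.Parser().parse(
    "{\"name\": \"Block\", \"type\": \"record\", \"fields\": ["
    + "{\"name\": \"id\", \"type\": \"string\"},"
    + "{\"name\": \"length\", \"type\": \"int\"}]}");

// reader's schema: what the application expects; "length" was written
// but is skipped, and "owner" was not written, so its default is used
Schema reader = new Schema.Parser().parse(
    "{\"name\": \"Block\", \"type\": \"record\", \"fields\": ["
    + "{\"name\": \"id\", \"type\": \"string\"},"
    + "{\"name\": \"owner\", \"type\": \"string\", \"default\": \"unknown\"}]}");

GenericDatumReader<GenericRecord> resolver =
    new GenericDatumReader<GenericRecord>(writer, reader);
GenericRecord rec = resolver.read(null,
    DecoderFactory.get().binaryDecoder(encodedBytes, null)); // encodedBytes: assumed input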
Avro JSON Schemas
// a simple three-element record
{"name": "Block",
 "type": "record",
 "fields": [
   {"name": "id",     "type": "string"},
   {"name": "length", "type": "int"},
   {"name": "hosts",  "type": {"type": "array", "items": "string"}}
 ]}

// a linked list of strings or ints
{"name": "MyList",
 "type": "record",
 "fields": [
   {"name": "value", "type": ["string", "int"]},
   {"name": "next",  "type": ["MyList", "null"]}
 ]}
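A small sketch of instantiating the recursive MyList schema with the generic Java API; MYLIST_JSON stands for the schema text above:

Schema listSchema = new Schema.Parser().parse(MYLIST_JSON);

GenericRecord tail = new GenericData.Record(listSchema);
tail.put("value", 2);     // selects the "int" branch of the union
tail.put("next", null);   // "null" branch: end of the list

GenericRecord head = new GenericData.Record(listSchema);
head.put("value", "one"); // "string" branch
head.put("next", tail);   // recursive MyList reference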
Avro IDL Schemas
// a simple three-element record
record Block {
  string id;
  int length;
  array<string> hosts;
}

// a linked list of strings or ints
record MyList {
  union { string, int } value;
  union { MyList, null } next;
}
Hadoop Data Formats
● Today, primarily:
  ● text
    – pro: interoperable
    – con: not expressive, inefficient
  ● Java Writable
    – pro: expressive, efficient
    – con: platform-specific, fragile
Avro Data
● expressive
● small & fast
● dynamic
● schema stored with data
  – but factored out of instances
● APIs permit reading & creating
  – new datatypes without generating & loading code
Avro Data
● includes a file format
  ● replacement for SequenceFile (file API sketched below)
● includes a textual encoding
● handles versioning
  ● if schema changes, can still process data
● hope Hadoop apps will
  ● upgrade from text; standardize on Avro for data
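A hedged sketch of the container-file API (org.apache.avro.file); the file name is illustrative, and `schema`/`block` are the record schema and instance from the earlier sketch:

// write: the schema is stored once in the file header,
// factored out of the individual instances
DataFileWriter<GenericRecord> writer =
    new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
writer.create(schema, new File("blocks.avro"));
writer.append(block);
writer.close();

// read: the reader resolves against the schema stored in the file
DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(
    new File("blocks.avro"), new GenericDatumReader<GenericRecord>());
for (GenericRecord rec : fileReader)
  System.out.println(rec);
fileReader.close();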
Avro MapReduce API
● Single-valued inputs and outputs
  ● key/value pairs only required for intermediate data
● map(IN, Collector<OUT>)
  ● map-only jobs never need to create k/v pairs
● map(IN, Collector<Pair<K,V>>)
● reduce(K, Iterable<V>, Collector<OUT>)
● if IN and OUT are pairs, default is sort
● In Avro trunk today, built on Hadoop 0.20 APIs
  ● in the Avro 1.4.0 release next month
Avro MapReduce Example
public void map(Utf8 text, AvroCollector<Pair<Utf8,Long>> c,
                Reporter r) throws IOException {
  // emit a (word, 1) pair for each token in the input line
  StringTokenizer i = new StringTokenizer(text.toString());
  while (i.hasMoreTokens())
    c.collect(new Pair<Utf8,Long>(new Utf8(i.nextToken()), 1L));
}

public void reduce(Utf8 word, Iterable<Long> counts,
                   AvroCollector<Pair<Utf8,Long>> c,
                   Reporter r) throws IOException {
  // sum the per-word counts and emit a single (word, total) pair
  long sum = 0;
  for (long count : counts)
    sum += count;
  c.collect(new Pair<Utf8,Long>(word, sum));
}
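A hedged driver sketch for the word count above, assuming the org.apache.avro.mapred API on Hadoop 0.20; WordCount, WordCountMapper, WordCountReducer, and the paths are hypothetical names:

JobConf job = new JobConf(WordCount.class); // hypothetical driver class
job.setJobName("avro-wordcount");

// single-valued string input; Pair<string, long> output
AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
AvroJob.setOutputSchema(job,
    Pair.getPairSchema(Schema.create(Schema.Type.STRING),
                       Schema.create(Schema.Type.LONG)));

AvroJob.setMapperClass(job, WordCountMapper.class);   // extends AvroMapper
AvroJob.setReducerClass(job, WordCountReducer.class); // extends AvroReducer

FileInputFormat.setInputPaths(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
JobClient.runJob(job);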
Avro RPC
● leverage versioning support
  ● permit different versions of services to interoperate
● for Hadoop, will
  ● let apps talk to clusters running different versions
  ● provide cross-language access
Avro IDL Protocol
@namespace("org.apache.avro.test")
protocol HelloWorld {
  record Greeting { string who; string what; }

  Greeting hello(Greeting greeting);
}
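A hedged client sketch using the generic IPC classes (org.apache.avro.ipc); it assumes the IDL above has been compiled to a JSON protocol file, helloworld.avpr, and that a server is listening at the URL shown:

Protocol p = Protocol.parse(new File("helloworld.avpr"));
Transceiver t = new HttpTransceiver(new URL("http://localhost:8080/")); // assumed endpoint
GenericRequestor requestor = new GenericRequestor(p, t);

// build the Greeting argument generically
GenericRecord greeting =
    new GenericData.Record(p.getType("org.apache.avro.test.Greeting"));
greeting.put("who", "client");
greeting.put("what", "hi");

// the request is a record of the message's parameters
GenericRecord params =
    new GenericData.Record(p.getMessages().get("hello").getRequest());
params.put("greeting", greeting);

Object reply = requestor.request("hello", params); // returns a Greeting record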
Avro Status
● Current
  ● C, C++, Java, Python & Ruby APIs
  ● Interoperable RPC and data
  ● MapReduce API for Java
● Upcoming
  ● MapReduce APIs for other languages
    – efficient, rich data
  ● RPC used in Flume, HBase, Cassandra, Hadoop, etc.
    – inter-version compatibility
    – non-Java clients
Questions?