bryan-bende
© Hortonworks Inc. 2011–2016. All Rights Reserved
Background
- FlowFile
  – Unit of work that moves through the data flow
  – Made up of attributes + content
- Attributes are a map of key/value pairs
  – Available in memory as strings
  – Accessible from expression language
  – Useful for quick decision-making/routing
- Content is arbitrary bytes
  – FlowFile is a pointer to the content in the content repository
  – Content is only accessed if the processor needs to operate on it
  – Could pass through many processors without ever accessing the content
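The attribute/content split can be illustrated with a minimal Python sketch. This is not NiFi's Java API: the names `ContentRepository`, `FlowFile`, and `route_on_attribute` are hypothetical and exist only to show that a routing decision can be made from attributes without reading the content.

```python
from dataclasses import dataclass, field
from typing import Dict


class ContentRepository:
    """Hypothetical stand-in for the content repository: claim id -> bytes."""

    def __init__(self) -> None:
        self._claims: Dict[str, bytes] = {}
        self.reads = 0  # counts how often content was actually fetched

    def put(self, claim_id: str, data: bytes) -> None:
        self._claims[claim_id] = data

    def read(self, claim_id: str) -> bytes:
        self.reads += 1
        return self._claims[claim_id]


@dataclass
class FlowFile:
    """Attributes are in-memory strings; content is only a pointer (claim id)."""
    content_claim: str
    attributes: Dict[str, str] = field(default_factory=dict)


def route_on_attribute(flow_file: FlowFile) -> str:
    # The decision uses attributes alone; the content is never touched.
    if flow_file.attributes.get("mime.type") == "application/json":
        return "json"
    return "other"


repo = ContentRepository()
repo.put("claim-1", b'{"first_name": "John"}')
ff = FlowFile(content_claim="claim-1", attributes={"mime.type": "application/json"})
print(route_on_attribute(ff), repo.reads)  # prints: json 0
```

Only a processor that needs the bytes would call `repo.read(ff.content_claim)`, which is the point of keeping content behind a pointer.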
The Problem
- Specialized processors to operate on different data types
  – SplitJson, EvaluateJsonPath, ConvertJsonToAvro
  – SplitAvro, ExtractAvroMetadata, ConvertAvroToJson
  – SplitText, ExtractText, RouteText
- Sometimes missing conversions
  – No ConvertCsvToJson, so ConvertCsvToAvro then ConvertAvroToJson
- Sometimes missing a specific function for a data type
  – No EvaluateAvroPath, so ConvertAvroToJson then EvaluateJsonPath
- Sometimes implemented with different libraries, causing inconsistencies
  – Some Avro processors implemented with Kite, others with Apache Avro libraries
  – Each library may have different features/error handling
The Solution
- Introduce the concept of a "record"
- Centralize the logic for reading/writing records into controller services
- Provide standard processors that operate on records
- Can still handle arbitrary data, but process records when appropriate
Record Readers & Writers
- Readers
  – AvroReader
  – CsvReader
  – GrokReader
  – JsonPathReader
  – JsonTreeReader
  – ScriptedReader
- Writers
  – AvroRecordSetWriter
  – CsvRecordSetWriter
  – JsonRecordSetWriter
  – FreeFormTextRecordSetWriter
  – ScriptedRecordSetWriter
But how is data turned into a record?
- A record has fields, and fields have information like a name and type
- Schemas define the fields of a record and give meaning to the data
- Apache Avro already utilizes schemas and is widely used & supported by many tools
- We can use Avro schemas to define a schema for any type of data
- Each reader & writer needs a way to obtain a schema
Schema Access Strategy
- Schema Name
  – Provide the name of a schema to look up in a Schema Registry; can use EL to obtain the name
- Schema Text
  – Provide the text of a schema in the reader/writer; can use EL to obtain the text
- HWX Content-Encoded Schema Reference
  – Content of the FlowFile contains a special header referencing a schema in a Schema Registry
- HWX Schema Reference Attributes
  – FlowFile contains three attributes that will be used to look up a schema from the configured Schema Registry: 'schema.identifier', 'schema.version', and 'schema.protocol.version'
- Readers & writers may have additional options specific to the data type
  – Ex: CsvReader can make a schema on the fly from the column names
  – Ex: AvroReader can use the schema embedded in the Avro data file
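As a rough illustration of how a reader or writer might pick a schema, the following Python sketch dispatches on two of the strategies above. It is an assumption-laden toy, not NiFi code: the strategy strings `"schema-name"` and `"schema-text"`, the `REGISTRY` dict, and the function `resolve_schema` are all hypothetical.

```python
import json
from typing import Dict, Optional

# Hypothetical in-memory registry: schema name -> schema text.
REGISTRY: Dict[str, str] = {
    "person": '{"type": "record", "name": "person", "fields": []}',
}


def resolve_schema(strategy: str, attributes: Dict[str, str],
                   schema_text: Optional[str] = None) -> dict:
    """Return a parsed schema according to the chosen access strategy."""
    if strategy == "schema-name":
        # Look the schema up by the name carried in a FlowFile attribute.
        return json.loads(REGISTRY[attributes["schema.name"]])
    if strategy == "schema-text":
        # The schema text is configured directly on the reader/writer.
        if schema_text is None:
            raise ValueError("schema_text is required for 'schema-text'")
        return json.loads(schema_text)
    raise ValueError(f"unsupported strategy: {strategy}")


print(resolve_schema("schema-name", {"schema.name": "person"})["name"])  # prints: person
```

The two HWX strategies would follow the same pattern, taking the identifier and version from attributes or from a content header instead of a name.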
Schema Registries
- AvroSchemaRegistry
  – Access schema by name
  – Only accessible within NiFi
- Hortonworks Schema Registry
  – Access schema by name and/or version
  – Accessible across systems in the enterprise
  – https://github.com/hortonworks/registry
- Confluent Schema Registry
  – Access schema by name and/or version
  – Accessible across systems in the enterprise
  – https://github.com/confluentinc/schema-registry
  – Not in an official Apache NiFi release yet, available in master branch (1.4.0-snapshot)
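The difference between "by name" and "by name and/or version" can be sketched with a small in-memory registry in Python. The class `InMemorySchemaRegistry` and its methods are hypothetical and do not correspond to any of the registry services listed above; versions are simply assumed to be numbered from 1 in registration order.

```python
from typing import Dict, List, Optional


class InMemorySchemaRegistry:
    """Hypothetical registry keeping an ordered list of versions per name."""

    def __init__(self) -> None:
        self._schemas: Dict[str, List[str]] = {}

    def register(self, name: str, schema_text: str) -> int:
        versions = self._schemas.setdefault(name, [])
        versions.append(schema_text)
        return len(versions)  # versions are numbered from 1

    def get(self, name: str, version: Optional[int] = None) -> str:
        versions = self._schemas[name]
        if version is None:
            return versions[-1]  # no version given: return the latest
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1]


registry = InMemorySchemaRegistry()
registry.register("person", "schema text v1")
registry.register("person", "schema text v2")
print(registry.get("person"))             # prints: schema text v2
print(registry.get("person", version=1))  # prints: schema text v1
```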
Full Picture
[Class diagram]
- SchemaRegistryService extends AbstractControllerService and uses a SchemaRegistry
- Readers (AvroReader, CsvReader, GrokReader, JsonPathReader, JsonReader) extend SchemaRegistryService and implement RecordReaderFactory
- Writers (AvroRecordSetWriter, CsvRecordSetWriter, JsonRecordSetWriter, FreeFormTextWriter) extend SchemaRegistryService and implement RecordSetWriterFactory
- AvroSchemaRegistry, HWXSchemaRegistry, and ConfluentSchemaRegistry implement SchemaRegistry
RecordPath
- Domain-specific language (DSL) for specifying/accessing fields of a record
- Similar to JSONPath or XPath
- Examples:
  – Child: /details/address/zip
  – Descendant: //zip
  – Arrays: /addresses[1]
  – Maps: /details/address['zip']
  – Predicates: /*[./state != 'NY']
- More info…
  – https://nifi.apache.org/docs/nifi-docs/html/record-path-guide.html
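To give a feel for what the child and descendant operators select, here is a small Python sketch over nested dicts. It is not the NiFi RecordPath implementation and covers only those two operators; the functions `child` and `descendants` and the sample `person` record are made up for illustration.

```python
from typing import Any, Iterator


def child(record: Any, path: str) -> Any:
    """Follow a child-only path such as /details/address/zip."""
    value = record
    for name in path.strip("/").split("/"):
        value = value[name]
    return value


def descendants(record: Any, name: str) -> Iterator[Any]:
    """Yield every value stored under `name` at any depth, like //name."""
    if isinstance(record, dict):
        for key, value in record.items():
            if key == name:
                yield value
            yield from descendants(value, name)
    elif isinstance(record, list):
        for item in record:
            yield from descendants(item, name)


person = {"name": "John", "details": {"address": {"zip": "10001", "state": "NY"}}}
print(child(person, "/details/address/zip"))  # prints: 10001
print(list(descendants(person, "zip")))       # prints: ['10001']
```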
Record Processors
- Many processors for operating on records
  – ConvertRecord
  – LookupRecord
  – PartitionRecord
  – QueryRecord
  – SplitRecord
  – UpdateRecord
  – ConsumeKafkaRecord_0_10
  – PublishKafkaRecord_0_10
- Goal is to keep many records per flow file and avoid splitting if possible
- Check the latest docs for usage details and other record processors
  – https://nifi.apache.org/docs.html
Example - CSV to JSON
- Incoming CSV that looks like:
first_name, last_name
John, Smith
Mike, Jones
- Want JSON that looks like:
[
{"first_name" : "John", "last_name" : "Smith"},
{"first_name" : "Mike", "last_name" : "Jones"}
]
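Outside of NiFi, the same transformation can be sketched in a few lines of Python with the standard `csv` and `json` modules. This only illustrates the input and output shapes; it says nothing about how ConvertRecord works internally.

```python
import csv
import io
import json

csv_text = "first_name, last_name\nJohn, Smith\nMike, Jones\n"

# skipinitialspace drops the blank that follows each comma in the sample data.
reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
records = [dict(row) for row in reader]

json_text = json.dumps(records)
print(json_text)
```

Each CSV row becomes one JSON object keyed by the header names, and all rows stay together in a single array rather than being split into separate outputs.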
Step 1 – Define an Avro Schema
{
"name": "person",
"namespace": "nifi",
"type": "record",
"fields": [
{ "name": "first_name", "type": "string" },
{ "name": "last_name", "type": "string" }
]
}
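To show what the schema buys us, the following Python sketch checks whether a record matches it, using only the standard library. It is a simplification rather than Avro validation: the `PYTHON_TYPES` mapping and the `conforms` function are hypothetical and handle only the `string` type used in this example.

```python
import json

SCHEMA_TEXT = """
{
  "name": "person",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    { "name": "first_name", "type": "string" },
    { "name": "last_name", "type": "string" }
  ]
}
"""

# Only the primitive type used in this example is mapped; Avro has many more.
PYTHON_TYPES = {"string": str}


def conforms(record: dict, schema: dict) -> bool:
    """True if the record has exactly the schema's fields with matching types."""
    fields = {f["name"]: PYTHON_TYPES[f["type"]] for f in schema["fields"]}
    if set(record) != set(fields):
        return False
    return all(isinstance(record[name], typ) for name, typ in fields.items())


schema = json.loads(SCHEMA_TEXT)
print(conforms({"first_name": "John", "last_name": "Smith"}, schema))  # prints: True
print(conforms({"first_name": "John"}, schema))                        # prints: False
```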
Step 5 – GenerateFlowFile Processor
- Set Run Schedule to something like 10 seconds
- Put example CSV data in the Custom Text property
- The reader & writer had their 'Schema Name' set to ${schema.name}
- Add a property called 'schema.name' with the value of 'person', since this is the name in the schema registry
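The way `${schema.name}` picks up the attribute value can be approximated with a short Python sketch. NiFi's expression language is far richer than this; the `resolve` function below is hypothetical, handles only plain `${attribute}` references, and replacing unknown attributes with an empty string is an assumption of the sketch.

```python
import re
from typing import Dict

_REFERENCE = re.compile(r"\$\{([^}]+)\}")


def resolve(expression: str, attributes: Dict[str, str]) -> str:
    """Replace each ${name} with the attribute value; unknown names become ''."""
    return _REFERENCE.sub(lambda m: attributes.get(m.group(1), ""), expression)


flow_file_attributes = {"schema.name": "person"}
print(resolve("${schema.name}", flow_file_attributes))  # prints: person
```

With the attribute set on the flow file, the reader and writer both end up asking the registry for the schema named 'person'.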
Step 6 – ConvertRecord Processor
- Select the appropriate reader and writer
Step 9 – Check nifi-app.log for JSON
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Thu Aug 31 13:28:02 EDT 2017'
Key: 'lineageStartDate' Value: 'Thu Aug 31 13:28:02 EDT 2017'
Key: 'fileSize' Value: '137'
FlowFile Attribute Map Content
Key: 'filename' Value: '326844487150210'
Key: 'mime.type' Value: 'application/json'
Key: 'path' Value: './'
Key: 'record.count' Value: '2'
Key: 'schema.name' Value: 'person'
Key: 'uuid' Value: 'e9198166-0cff-400b-a39d-9c8c9c565f85'
--------------------------------------------------
[{"first_name":"John","last_name":"Smith"},{"first_name":"Mike","last_name":"Jones"}]
Step 1 – Run the Hortonworks Schema Registry
- Download the latest release
  – https://github.com/hortonworks/registry/releases/download/v0.2.1/hortonworks-registry-0.2.1.tar.gz
- Extract the tar and run the application
  – tar xzvf hortonworks-registry-0.2.1.tar.gz
  – cd hortonworks-registry-0.2.1
  – ./bin/registry-server-start.sh conf/registry-dev.yaml
- Navigate to the registry UI in your browser
  – http://localhost:9090
Step 6 – Run the same flow with same results
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Thu Aug 31 13:28:02 EDT 2017'
Key: 'lineageStartDate' Value: 'Thu Aug 31 13:28:02 EDT 2017'
Key: 'fileSize' Value: '137'
FlowFile Attribute Map Content
Key: 'filename' Value: '326844487150210'
Key: 'mime.type' Value: 'application/json'
Key: 'path' Value: './'
Key: 'record.count' Value: '2'
Key: 'schema.name' Value: 'person'
Key: 'uuid' Value: 'e9198166-0cff-400b-a39d-9c8c9c565f85'
--------------------------------------------------
[{"first_name":"John","last_name":"Smith"},{"first_name":"Mike","last_name":"Jones"}]
Specifying a Schema Version
- Previous example used "Schema Name" for "Schema Access Strategy"
  – NiFi retrieved the latest version of the schema for the name
  – Cached the schema based on configuration in the controller service
- We can also use "HWX Schema Reference Attributes" to be more specific
  – schema.identifier
  – schema.version
  – schema.protocol.version
Obtaining Identifier, Version, Protocol
- We can get these values from the schema registry REST API
  – http://localhost:9090/api/v1/schemaregistry/schemas/person
  – http://localhost:9090/api/v1/schemaregistry/schemas/person/versions
  – Protocol Version is always '1' for now
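A small Python sketch for querying those two endpoints from a script is shown below. The URLs come from the slide; the helper names `schema_url`, `versions_url`, and `fetch_json` are hypothetical, and the shape of the JSON responses is deliberately not assumed, so inspect the output to find the identifier and version fields.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:9090/api/v1/schemaregistry"


def schema_url(name: str) -> str:
    """URL of the schema metadata for a schema name."""
    return f"{BASE_URL}/schemas/{quote(name, safe='')}"


def versions_url(name: str) -> str:
    """URL listing the versions of a schema name."""
    return f"{schema_url(name)}/versions"


def fetch_json(url: str):
    """GET a URL and parse the body as JSON (needs a running registry)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


print(schema_url("person"))
print(versions_url("person"))
```

With the registry from Step 1 running locally, `fetch_json(schema_url("person"))` and `fetch_json(versions_url("person"))` would return the parsed responses.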
Update Flow to Specify Attributes
- Remove schema.name and add additional attributes in GenerateFlowFile
Update JsonRecordSetWriter with new Schema Access Strategy
Run the Flow Again
- Using v2 of the schema we should only see first_name:
Key: 'schema.identifier' Value: '1'
Key: 'schema.name' Value: 'person'
Key: 'schema.protocol.version' Value: '1'
Key: 'schema.version' Value: '2'
Key: 'uuid' Value: '34407f4e-3bf1-46d5-a6d4-6da5ba197eb8'
--------------------------------------------------
[{"first_name":"John"},{"first_name":"Mike"}]
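Why the output shrinks to `first_name` can be sketched in Python: a writer emits only the fields its schema names. The content of version 2 of the schema is not shown in the slides, so `SCHEMA_V2` below is an assumption inferred from the output, and `write_with_schema` is a hypothetical helper rather than NiFi code.

```python
from typing import Dict, List

# Assumed shape of version 2 of the 'person' schema: only first_name remains.
SCHEMA_V2 = {
    "name": "person",
    "type": "record",
    "fields": [{"name": "first_name", "type": "string"}],
}


def write_with_schema(records: List[Dict[str, str]], schema: dict) -> List[Dict[str, str]]:
    """Keep only the fields named in the writer's schema, in schema order."""
    names = [f["name"] for f in schema["fields"]]
    return [{name: record[name] for name in names} for record in records]


records = [
    {"first_name": "John", "last_name": "Smith"},
    {"first_name": "Mike", "last_name": "Jones"},
]
print(write_with_schema(records, SCHEMA_V2))
# prints: [{'first_name': 'John'}, {'first_name': 'Mike'}]
```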
Publishing
- PublishKafkaRecord_0_10
  – Streams incoming flow file as records using configured Record Reader
  – Serializes each record to bytes using configured Record Set Writer
- Generally don't want to publish schema on every message
  – "Schema Write Strategy" of Record Set Writer controls where schema ends up
  – "HWX Content-Encoded Schema Reference" encodes schema info at beginning of content
  – Single record published as encoded schema reference + bytes of a record
Protocol (1 byte) | Identifier (8 bytes) | Version (4 bytes) | Record Bytes
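The header layout can be sketched with Python's `struct` module. This assumes the layout given in the NiFi documentation for this strategy: a 1-byte protocol version, an 8-byte schema identifier, and a 4-byte schema version, here taken to be big-endian, for 13 bytes in total. The function names `encode` and `decode` are hypothetical.

```python
import struct

# Big-endian, no padding: 1-byte protocol, 8-byte identifier, 4-byte version.
HEADER = struct.Struct(">bqi")


def encode(protocol: int, identifier: int, version: int, record: bytes) -> bytes:
    """Prefix the serialized record with the encoded schema reference."""
    return HEADER.pack(protocol, identifier, version) + record


def decode(message: bytes):
    """Split a message into (protocol, identifier, version, record bytes)."""
    protocol, identifier, version = HEADER.unpack_from(message)
    return protocol, identifier, version, message[HEADER.size:]


message = encode(1, 1, 2, b'{"first_name":"John"}')
print(HEADER.size)      # prints: 13
print(decode(message))  # prints: (1, 1, 2, b'{"first_name":"John"}')
```

A consumer that reads the same header can ask the registry for exactly the schema version the publisher used, without the schema itself travelling in every message.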
Consuming
- ConsumeKafkaRecord_0_10
  – Reads messages from Kafka into records using configured Record Reader
  – Writes records to a flow file using configured Record Set Writer
- If the publisher used "HWX Content-Encoded Schema Reference" as the Schema Write Strategy, then the consumer needs to use "HWX Content-Encoded Schema Reference" as the Schema Access Strategy
Publish & Consume
[Diagram: PublishKafkaRecord_0_10, Kafka, ConsumeKafkaRecord_0_10, and HWX Schema Registry; each Kafka message is [schema ref][record]]
1. Publish
2. Consume
3. Read encoded schema info from message
4. Retrieve schema for encoded protocol, id, and version
Additional Resources
- https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
- https://blogs.apache.org/nifi/entry/real-time-sql-on-event
- https://community.hortonworks.com/content/kbentry/119766/installing-a-local-hortonworks-registry-to-use-wit.html
- https://community.hortonworks.com/articles/131320/using-partitionrecord-grokreaderjsonwriter-to-pars.html
- https://community.hortonworks.com/articles/115311/convert-csv-to-json-avro-xml-using-convertrecord-p.html