B4M36DS2, BE4M36DS2: Database Systems 2
http://www.ksi.mff.cuni.cz/~svoboda/courses/181-B4M36DS2/
Lecture 2
Data Formats
Martin Svoboda
mar [email protected]
8. 10. 2018
Charles University, Faculty of Mathematics and Physics
Czech Technical University in Prague, Faculty of Electrical Engineering

Lecture Outline
Data formats
• XML – Extensible Markup Language
• JSON – JavaScript Object Notation
• BSON – Binary JSON
• RDF – Resource Description Framework
• CSV – Comma-Separated Values
• Protocol Buffers
B4M36DS2, BE4M36DS2: Database Systems 2 | Lecture 2: Data Formats | 8. 10. 2018 2
XML
Extensible Markup Language

Introduction
XML = Extensible Markup Language
• Representation and interchange of semi-structured data
  + a family of related technologies, languages, specifications, …
• Derived from SGML, developed by W3C, started in 1996
• Design goals
  Simplicity, generality and usability across the Internet
• File extension: *.xml, content type: text/xml
• Versions: 1.0 and 1.1
• W3C recommendation
  http://www.w3.org/TR/xml11/
Example

<?xml version="1.1" encoding="UTF-8"?>
<movie year="2007">
  <title>Medvídek</title>
  <actors>
    <actor>
      <firstname>Jiří</firstname>
      <lastname>Macháček</lastname>
    </actor>
    <actor>
      <firstname>Ivan</firstname>
      <lastname>Trojan</lastname>
    </actor>
  </actors>
  <director>
    <firstname>Jan</firstname>
    <lastname>Hřebejk</lastname>
  </director>
</movie>
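As an illustration of how such a document is consumed, the following minimal sketch reads the movie example with Python's standard `xml.etree.ElementTree` module. The names `MOVIE_XML` and `summarize_movie` are made up for this example, and the XML declaration is left out because the text is passed in as an already decoded Python string.

```python
import xml.etree.ElementTree as ET

# The movie document from the example slide, without the XML declaration.
MOVIE_XML = """
<movie year="2007">
  <title>Medvídek</title>
  <actors>
    <actor><firstname>Jiří</firstname><lastname>Macháček</lastname></actor>
    <actor><firstname>Ivan</firstname><lastname>Trojan</lastname></actor>
  </actors>
  <director><firstname>Jan</firstname><lastname>Hřebejk</lastname></director>
</movie>
"""


def summarize_movie(xml_text):
    """Return a small dictionary built from the root element (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    return {
        "title": root.findtext("title"),   # text content of a nested element
        "year": int(root.get("year")),     # attribute value of the root element
        "actors": [
            actor.findtext("firstname") + " " + actor.findtext("lastname")
            for actor in root.iter("actor")
        ],
        "director": root.findtext("director/lastname"),
    }


summary = summarize_movie(MOVIE_XML)
```

Note that the year is stored as an attribute while the title is a nested element, so the two are read through different calls.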
Document Structure
Document
• Prolog: XML version + some other stuff
• Exactly one root element
  Contains other nested elements and/or other content

Syntax diagram: <?xml version = "version" ... ?> ... element

Example
<?xml version="1.1" encoding="UTF-8"?>
<movie>
  ...
</movie>
Constructs
Element
• Marked using opening and closing tags
  … or just an abbreviated tag in case of empty elements
• Each element can be associated with a set of attributes

Syntax diagram: < name attribute ... > element content </ name >
Syntax diagram (empty element): < name attribute ... />

Examples
<title>...</title>
<actors/>
Constructs
Types of element content
• Empty content
• Text content
• Element content
  Sequence of nested elements
• Mixed content
  Elements arbitrarily interleaved with text values

Syntax diagram (element content): any sequence of text and element items
Constructs
Attribute
• Name-value pair

Syntax diagram: name = "value"

Escaping sequences (predefined entities)
• Used within values of attributes or text content of elements
• E.g.:
  &lt; for <
  &gt; for >
  &quot; for "
  …
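The predefined entities can be tried out with the standard `xml.sax.saxutils` helpers. This is only a sketch: the sample string is invented, and the double quote has to be requested explicitly because `escape` handles only `&`, `<` and `>` by default.

```python
from xml.sax.saxutils import escape, unescape

text = 'a < b > c "d"'

# Replace the special characters by predefined entities.
escaped = escape(text, {'"': "&quot;"})

# Reverse the replacement to get the original text back.
restored = unescape(escaped, {"&quot;": '"'})
```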
XML Conclusion

XML constructs
• Basic: element, attribute, text
• Additional: comment, processing instruction, …
Schema languages
• DTD, XSD (XML Schema), RELAX NG, Schematron
Query languages
• XPath, XQuery, XSLT
XML formats = particular languages
• XSD, XSLT, XHTML, DocBook, ePUB, SVG, RSS, SOAP, …
JSON
JavaScript Object Notation

Introduction
JSON = JavaScript Object Notation
• Open standard for data interchange
• Design goals
  Simplicity: text-based, easy to read and write
  Universality: object and array data structures
    – Supported by the majority of modern programming languages
    – Based on conventions of the C-family of languages
      (C, C++, C#, Java, JavaScript, Perl, Python, …)
• Derived from JavaScript (but language independent)
• Started in 2002
• File extension: *.json
• Content type: application/json
• http://www.json.org/
Example

{
  "title" : "Medvídek",
  "year" : 2007,
  "actors" : [
    {
      "firstname" : "Jiří",
      "lastname" : "Macháček"
    },
    {
      "firstname" : "Ivan",
      "lastname" : "Trojan"
    }
  ],
  "director" : {
    "firstname" : "Jan",
    "lastname" : "Hřebejk"
  }
}
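A minimal sketch of reading this document with Python's standard `json` module follows. The variable names are chosen for illustration only; objects become dictionaries and arrays become lists.

```python
import json

# The movie document from the example slide.
MOVIE_JSON = """
{
  "title": "Medvídek",
  "year": 2007,
  "actors": [
    {"firstname": "Jiří", "lastname": "Macháček"},
    {"firstname": "Ivan", "lastname": "Trojan"}
  ],
  "director": {"firstname": "Jan", "lastname": "Hřebejk"}
}
"""

movie = json.loads(MOVIE_JSON)

# Navigate the nested structure with ordinary indexing.
actor_names = [f'{a["firstname"]} {a["lastname"]}' for a in movie["actors"]]

# Serialize again; ensure_ascii=False keeps the Czech characters readable.
round_trip = json.dumps(movie, ensure_ascii=False)
```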
Data Structure
Object
• Unordered collection of name-value pairs (properties)
  Corresponds to structures such as objects, records, structs, dictionaries, hash tables, keyed lists, associative arrays, …
• Values can be of different types, names should be unique

Syntax diagram: { string : value , ... }

Examples
• { "name" : "Ivan Trojan", "year" : 1964 }
• { }
Data Structure
Array
• Ordered collection of values
  Corresponds to structures such as arrays, vectors, lists, sequences, …
• Values can be of different types, duplicate values are allowed

Syntax diagram: [ value , ... ]

Examples
• [ 2, 7, 7, 5 ]
• [ "Ivan Trojan", 1964, -5.6 ]
• [ ]
Data Structure
Value
• Unicode string
  Enclosed with double quotes
  Backslash escaping sequences
  Example: "a \n b \" c \\ d"
• Number
  Decimal integers or floats
  Examples: 1, -0.5, 1.5e3
• Nested object
• Nested array
• Boolean value: true, false
• Missing information: null

Syntax diagram: value = string | number | object | array | true | false | null
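To see how the individual kinds of values are decoded, the following sketch parses one array containing every kind with Python's standard `json` module. The sample array is made up for illustration; the string reuses the escaping example from the slide.

```python
import json

# A raw string keeps the backslashes exactly as they appear in the JSON text.
SAMPLE = r'["a \n b \" c \\ d", 1, -0.5, 1.5e3, {}, [], true, false, null]'

values = json.loads(SAMPLE)

# Python type chosen by the decoder for each JSON value.
type_names = [type(v).__name__ for v in values]
```

The escaping sequences are resolved during parsing, so the decoded string contains a real line break, a double quote and a single backslash.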
JSON Conclusion

JSON constructs
• Collections: object, array
• Scalar values: string, number, boolean, null
Schema languages
• JSON Schema
Query languages
• JSONiq, JMESPath, JAQL, …
BSON
Binary JSON

Introduction
BSON = Binary JSON
• Binary-encoded serialization of JSON documents
  Extends the set of basic data types of values offered by JSON (such as a string, …) with a few new specific ones
• Design characteristics: lightweight, traversable, efficient
• Used by MongoDB
  Document NoSQL database for JSON documents
  Data storage and network transfer format
• File extension: *.bson
• http://bsonspec.org/
Example
JSON

{
  "title" : "Medvídek",
  "year" : 2007
}

BSON
24 00 00 00
02 74 69 74 6C 65 00 0A 00 00 00 4D 65 64 76 C3 AD 64 65 6B 00
10 79 65 61 72 00 D7 07 00 00
00
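The byte sequence above can be reproduced by hand. The following is a minimal sketch, not a full BSON implementation: it assumes little-endian 32-bit integers, covers only strings, 32-bit integers and booleans, and all function names are hypothetical.

```python
import struct


def encode_cstring(name):
    """Property name: UTF-8 bytes terminated by a 00 byte."""
    return name.encode("utf-8") + b"\x00"


def encode_element(name, value):
    """Encode one property as type selector + name + value."""
    if isinstance(value, bool):  # checked before int, since bool is a subclass of int
        return b"\x08" + encode_cstring(name) + (b"\x01" if value else b"\x00")
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # String value: int32 length (including the terminator) + bytes
        return b"\x02" + encode_cstring(name) + struct.pack("<i", len(data)) + data
    if isinstance(value, int):
        # Only the 32-bit integer case is handled in this sketch
        return b"\x10" + encode_cstring(name) + struct.pack("<i", value)
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def encode_document(obj):
    """Document: int32 total size + elements + terminating 00 byte."""
    body = b"".join(encode_element(k, v) for k, v in obj.items())
    total_size = 4 + len(body) + 1
    return struct.pack("<i", total_size) + body + b"\x00"


encoded = encode_document({"title": "Medvídek", "year": 2007})
hex_dump = " ".join(f"{byte:02X}" for byte in encoded)
```

The leading 24 is the total size of 36 bytes, and the string length 0A counts the nine UTF-8 bytes of "Medvídek" plus the terminating zero byte.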
Document Structure
Document = serialization of one JSON object or array
• JSON object is serialized directly
• JSON array is first transformed to a JSON object
  Property names derived from numbers of positions
  E.g.: [ "Trojan", "Svěrák" ] → { "0" : "Trojan", "1" : "Svěrák" }
• Structure
  Document size (total number of bytes)
  Sequence of elements (encoded JSON properties)
  Terminating hexadecimal 00 byte

Syntax diagram: int32 element ... 00
Document Structure
Element = serialization of one JSON property

Syntax diagram (element alternatives):
  02 name int32 byte ... 00
  01 name double
  10 name int32
  12 name int64
  03 name document
  04 name document
  08 name ( 00 | 01 )
  0A name
  09 name int64
  11 name int64
  ...
Document Structure
Element = serialization of one JSON property
• Structure
  Type selector
    – 02 (string)
    – 01 (double), 10 (32-bit integer), 12 (64-bit integer)
    – 03 (object), 04 (array)
    – 08 (boolean)
    – 0A (null)
    – 09 (datetime), 11 (timestamp)
    – …
  Property name
    – Unicode string terminated by 00
      Syntax diagram: byte ... 00
  Property value
RDF
Resource Description Framework

Introduction
RDF = Resource Description Framework
• Language for representing information about resources in the World Wide Web
  + a family of related technologies, languages, specifications, …
  Used in the context of the Semantic Web, Linked Data, …
• Developed by W3C
• Started in 1997
• Versions: 1.0 and 1.1
• W3C recommendations
  https://www.w3.org/TR/rdf11-concepts/
    – Concepts and Abstract Syntax
  https://www.w3.org/TR/rdf11-mt/
    – Semantics
Statements
Resource
• Any real-world entity
  Referents = resources identified by IRI
    – E.g. physical things, documents, abstract concepts, …
  Values = resources for literals
    – E.g. numbers, strings, …
Statement about resources = one RDF triple
• Three components: subject, predicate, and object

Examples
<http://db.cz/movies/medvidek> <http://db.cz/terms#actor> <http://db.cz/actors/trojan> .
<http://db.cz/movies/medvidek> <http://db.cz/terms#year> "2007" .
Statements
Triple components
• Subject
  Describes a resource the given statement is about
  IRI or blank node identifier
• Predicate
  Describes the property or characteristic of the subject
  IRI
• Object
  Describes the value of that property
  IRI or blank node identifier or literal
Although triples are inspired by natural languages, they have nothing to do with processing of natural languages
Example

<http://db.cz/movies/medvidek> <http://db.cz/terms#actor> <http://db.cz/actors/machacek> .
<http://db.cz/movies/medvidek> <http://db.cz/terms#actor> <http://db.cz/actors/trojan> .
<http://db.cz/movies/medvidek> <http://db.cz/terms#year> "2007" .
<http://db.cz/movies/medvidek> <http://db.cz/terms#director> _:n18 .
_:n18 <http://db.cz/terms#firstname> "Jan" .
_:n18 <http://db.cz/terms#lastname> "Hřebejk" .
Identifiers and Literals
IRI = Internationalized Resource Identifier
• Absolute (not relative) IRIs with optional fragment identifiers
• RFC 3987
• Unicode characters
• Examples
  http://db.cz/movies/medvidek
  http://db.cz/terms#actor
  mailto:[email protected]
  urn:issn:0167-6423
• URLs are often used in practice → information about given resources is then intended to be published / retrieved via standard HTTP
Identifiers and Literals
Literals
• Plain values
  E.g.: "Medvídek", "2007"
• Typed values
  E.g.: "Medvídek"^^xs:string, "2007"^^xs:integer
  XML Schema simple data types are adopted and used
• Strings with language tags
  E.g.: "Medvídek"@cs
• Types and language tags cannot be mutually combined
Identifiers and Literals
Blank node identifiers
• Blank nodes (anonymous resources)
  Allow expressing statements about resources without explicitly naming (identifying) them
• Blank node identifiers only have local scope of validity
  E.g. within a given file, query expression, …
• Particular syntax depends on a serialization format
  E.g.: _:node18
Data Model
Directed labeled multigraph
• Vertices
  One vertex for each IRI or literal value
• Edges
  One edge for each individual triple
  Edges are directed: subject → object, labeled by the predicate
  Property names (predicate IRIs) are used as edge labels
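The graph view can be sketched in a few lines of Python. This is only an illustration under simple assumptions: triples are plain tuples of strings, the blank node `_:n18` is treated as an ordinary vertex, and `build_graph` is a hypothetical helper rather than part of any RDF library.

```python
from collections import defaultdict

MOVIE = "http://db.cz/movies/medvidek"
TERMS = "http://db.cz/terms#"

# The six statements from the earlier example, as (subject, predicate, object).
TRIPLES = [
    (MOVIE, TERMS + "actor", "http://db.cz/actors/machacek"),
    (MOVIE, TERMS + "actor", "http://db.cz/actors/trojan"),
    (MOVIE, TERMS + "year", '"2007"'),
    (MOVIE, TERMS + "director", "_:n18"),
    ("_:n18", TERMS + "firstname", '"Jan"'),
    ("_:n18", TERMS + "lastname", '"Hřebejk"'),
]


def build_graph(triples):
    """Return the vertex set and, per subject, a list of (label, object) edges."""
    vertices = set()
    edges = defaultdict(list)
    for subject, predicate, obj in triples:
        vertices.update((subject, obj))         # subjects and objects become vertices
        edges[subject].append((predicate, obj))  # one directed, labeled edge per triple
    return vertices, edges


vertices, edges = build_graph(TRIPLES)
movie_labels = [label for label, _ in edges[MOVIE]]
```

Two edges leaving the movie carry the same `actor` label, which is why the model is a multigraph rather than a simple graph.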
Serialization
Available approaches
• N-Triples notation
  https://www.w3.org/TR/n-triples/
• Turtle notation (Terse RDF Triple Language)
  https://www.w3.org/TR/turtle/
• RDF/XML notation
  XML syntax for RDF
  https://www.w3.org/TR/rdf-syntax-grammar/
• JSON-LD notation
  JSON-based serialization for Linked Data
  https://www.w3.org/TR/json-ld/
• …
N-Triples Notation
RDF N-Triples notation = a line-based syntax for an RDF graph
• Simple, line-based, plain text format
• File extension: *.nt
• https://www.w3.org/TR/n-triples/
Example
• Already presented…
Document
• Statements are terminated by dots, delimited by EOL

Syntax diagram: triple ( 0D 0A triple )*
N-Triples Notation
Statement
• Individual triple components are delimited by spaces

Syntax diagram: subject predicate object .

Triple components: subject, predicate, object
  subject = IRI reference | blank node id
  predicate = IRI reference
  object = IRI reference | blank node id | literal
N-Triples Notation
IRI reference
• IRIs are enclosed in angle brackets

Syntax diagram: < IRI >

Blank node identifier

Syntax diagram: _: label

Literal
• Literals are enclosed in double quotes

Syntax diagram: " value " followed by an optional ^^ IRI reference or @ language tag
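The statement structure described above can be approximated with a regular expression. The following sketch is deliberately simplified and is not the official N-Triples grammar: blank node labels are restricted to word characters, escape sequences inside IRIs are ignored, and `parse_line` is a hypothetical helper.

```python
import re

IRI = r"<[^<>\s]*>"                    # IRI enclosed in angle brackets
BLANK = r"_:\w+"                       # simplified blank node label
LITERAL = (
    r'"(?:[^"\\]|\\.)*"'               # quoted value with backslash escapes
    r"(?:\^\^<[^<>\s]*>|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?"  # optional type or language tag
)

TRIPLE = re.compile(
    rf"^\s*(?P<s>{IRI}|{BLANK})\s+(?P<p>{IRI})\s+(?P<o>{IRI}|{BLANK}|{LITERAL})\s*\.\s*$"
)


def parse_line(line):
    """Split one statement into its (subject, predicate, object) terms."""
    match = TRIPLE.match(line)
    if match is None:
        raise ValueError(f"not a simple N-Triples statement: {line!r}")
    return match.group("s"), match.group("p"), match.group("o")


year = parse_line('<http://db.cz/movies/medvidek> <http://db.cz/terms#year> "2007" .')
name = parse_line('_:n18 <http://db.cz/terms#lastname> "Hřebejk" .')
```

The pattern also encodes the position rules: a literal is accepted only as an object, and a blank node is never accepted as a predicate.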
Turtle Notation
Turtle = Terse RDF Triple Language
• Compact text format, various abbreviations for common usage patterns
• File extension: *.ttl
• Content type: text/turtle
• https://www.w3.org/TR/turtle/

Example
@prefix i: <http://db.cz/terms#> .
@prefix m: <http://db.cz/movies/> .
@prefix a: <http://db.cz/actors/> .
m:medvidek
  i:actor a:machacek , a:trojan ;
  i:year "2007" ;
  i:director [ i:firstname "Jan" ; i:lastname "Hřebejk" ] .
Turtle Notation
Document
• Contains a sequence of triples and/or declarations
• Prefix declarations
  Prefixed names can then be used instead of full IRI references
• Groups of triples
  Individual groups are terminated by dots

Syntax diagram (document items, repeated in any order):
  @prefix prefix : IRI reference .
  triples
Turtle Notation
Triples
• Triples sharing the same subject and predicate, or at least the same subject, can be grouped together
  object list for a shared subject and predicate
  predicate-object list for a shared subject
• Brackets can be used to define blank nodes

Syntax diagram (triples):
  subject predicate object ( , object )* ( ; predicate object ( , object )* )* .
  [ predicate object ( , object )* ( ; predicate object ( , object )* )* ] ( predicate-object list )? .
Turtle Notation
Triple components: subject, predicate, object
  subject = IRI reference | prefixed name | blank node id
  predicate = IRI reference | prefixed name
  object = IRI reference | prefixed name | blank node id | literal | [ predicate-object list ]

IRI reference / prefixed name
  IRI reference = < IRI >
  prefixed name = prefix : local name
Turtle Notation
Literal
• Traditional literals + new abbreviated forms of numeric and boolean literals

Syntax diagram (literal alternatives):
  " value " followed by an optional ^^ IRI reference, ^^ prefixed name, or @ language tag
  true | false
  integer | decimal | double
Example
Example revisited

@prefix i: <http://db.cz/terms#> .
@prefix m: <http://db.cz/movies/> .
@prefix a: <http://db.cz/actors/> .
m:medvidek
  i:actor a:machacek , a:trojan ;
  i:year "2007" ;
  i:director [ i:firstname "Jan" ; i:lastname "Hřebejk" ] .
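To make the abbreviations concrete, the sketch below expands prefixed names and a predicate-object list back into individual N-Triples statements. It does not parse Turtle text: the group is passed in as already separated Python values, the blank node for the director is left out for brevity, and both helper functions are hypothetical.

```python
# Prefix declarations from the example.
PREFIXES = {
    "i": "http://db.cz/terms#",
    "m": "http://db.cz/movies/",
    "a": "http://db.cz/actors/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn prefix:local into a full IRI reference in angle brackets."""
    prefix, local = prefixed_name.split(":", 1)
    return f"<{prefixes[prefix]}{local}>"


def expand_group(subject, predicate_object_list):
    """Expand one group of triples sharing a subject into N-Triples lines.

    predicate_object_list holds (predicate, [objects]) pairs; objects are
    prefixed names or literals that start with a double quote.
    """
    lines = []
    for predicate, objects in predicate_object_list:  # ';' separates these pairs
        for obj in objects:                            # ',' separates these objects
            term = obj if obj.startswith('"') else expand(obj)
            lines.append(f"{expand(subject)} {expand(predicate)} {term} .")
    return lines


ntriples = expand_group(
    "m:medvidek",
    [("i:actor", ["a:machacek", "a:trojan"]), ("i:year", ['"2007"'])],
)
```

One Turtle group with a two-item object list and a second predicate therefore stands for three separate statements.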
RDF Conclusion

RDF statements
• Subject, predicate, and object components
Schema languages
• RDFS (RDF Schema)
• OWL (Web Ontology Language)
Query languages
• SPARQL (SPARQL Protocol and RDF Query Language)
CSV
Comma-Separated Values

Introduction
CSV = Comma-Separated Values
• Unfortunately not fully standardized
  Different field separators (commas, semicolons)
  Different escaping sequences
  No encoding information
• RFC 4180, RFC 7111
• File extension: *.csv
• Content type: text/csv
Example

firstname,lastname,year
Ivan,Trojan,1964
Jiří,Macháček,1966
Jitka,Schneiderová,1973
Zdeněk,Svěrák,1936
Anna,Geislerová,1976
Document Structure
Document
• Optional header + list of records

Syntax diagram: ( name , name ... 0D 0A )? record ( 0D 0A record )*

Record
• Comma separated list of fields

Syntax diagram: value ( , value )*
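A minimal sketch of reading such a document with Python's standard `csv` module follows. It assumes that a header line is present and that commas are the separator; `read_people` is a hypothetical helper. Every field arrives as a string, so the year is converted explicitly.

```python
import csv
import io

# The example document, with records delimited by 0D 0A (CR LF).
CSV_TEXT = "\r\n".join([
    "firstname,lastname,year",
    "Ivan,Trojan,1964",
    "Jiří,Macháček,1966",
    "Jitka,Schneiderová,1973",
    "Zdeněk,Svěrák,1936",
    "Anna,Geislerová,1976",
])


def read_people(text):
    """Parse the header plus records into a list of dictionaries."""
    # newline="" leaves line endings untouched so the csv module can handle them
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [{**row, "year": int(row["year"])} for row in reader]


people = read_people(CSV_TEXT)
```

For files that use semicolons instead, the separator can be passed to `csv.DictReader` through its `delimiter` argument.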
Protocol Buffers

Introduction
Protocol Buffers
• Extensible mechanism for serializing structured data
  Used in communication protocols, data storage, …
• Design goals
  Language-neutral, platform-neutral
  Small, fast, simple
• Developed (and widely used) by Google
• Started in 2001 internally, publicly released in 2008
• Versions: proto2, proto3
• File extension: *.proto
• https://developers.google.com/protocol-buffers/
• Real-world usage: RiakKV, HBase
Introduction
Intended usage
• Schema creation → automatic source code generation → sending messages between applications
Components
• Interface description language
• Source code generator (protoc compiler)
  Supported languages
    – Official: C++, C#, Java, Python, Ruby, …
    – 3rd party: Perl, PHP, Scala, …
• Binary serialization format
  Compact, not self-describing
Example

syntax = "proto3";
message Actor {
  string firstname = 1;
  string lastname = 2;
}
message Movie {
  string title = 1;
  int32 year = 16;
  repeated Actor actors = 17;
  enum Genre {
    UNKNOWN = 0;
    COMEDY = 1;
    FAMILY = 2;
  }
  repeated Genre genres = 2048;
}
Schema Structure
Schema
• One schema may contain multiple message descriptions
  Other constructs are allowed as well, e.g. enumerations

Syntax diagram: ... ( message | enum ) ...

Message
• Represents a small logical record of information
  Defines a set of uniquely numbered fields
  Nested messages or enumerations are allowed too

Syntax diagram: message name { ( field | message | enum ) ... }
Schema Structure
Field
• Describes one data value

Syntax diagram: ( repeated )? type name = tag ;

• Rule – allowed number of value occurrences
  Default = 0 or 1 value
  repeated = 0 or more values (i.e. an arbitrary number)
    – The order of individual values is preserved
• …
Schema Structure
Field
• Type
  Atomic: int32, int64, double, string, bool, bytes, …
    – Mappings to data types of particular programming languages as well as default values are introduced
  Composed: messages, enumerations, …
• Name – name of a given field
• Tag – internal integer identifier
  Used to identify individual fields of a message in a binary format
  Frequently used fields should be assigned lower tags
    – Since fewer bytes will then be needed
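Why lower tags need fewer bytes can be checked with a short sketch. It relies on the encoding described in the Protocol Buffers documentation rather than on anything stated in the slides: each field is prefixed by a key `(tag << 3) | wire_type` stored as a variable-length integer with 7 payload bits per byte. The function names are hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative integer, 7 bits per byte, lowest bits first."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)         # last byte has the continuation bit cleared
            return bytes(out)


def encode_key(tag, wire_type):
    """Field key written in front of every encoded field value."""
    return encode_varint((tag << 3) | wire_type)


VARINT = 0            # wire type used for int32, bool, enum, ...
LENGTH_DELIMITED = 2  # wire type used for string, bytes, nested messages

# Key size in bytes for the tags used in the Movie example and their neighbours.
key_sizes = {tag: len(encode_key(tag, VARINT)) for tag in (1, 15, 16, 2047, 2048)}
```

Tags 1 to 15 fit into a single byte, tags 16 to 2047 take two bytes, and tag 2048 from the example already needs three.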
Schema Structure
Enumeration
• Description of a predefined list of values
• The first item is considered to be the default value and its value must be equal to 0

Syntax diagram: enum name { ( item = constant ; ) ... }

A few other constructs are available too (e.g. maps)
Lecture Conclusion
Data formats
• Tree: XML, JSON
• Graph: RDF
• Relational: CSV
Binary serializations
• BSON, Protocol Buffers