Ghislain Fourny
Big Data for Engineers, Spring 2018
11. Data Models
CSV (Comma separated values)
This is syntax:
ID,Last name,First name,Theory
1,Einstein,Albert,"General, Special Relativity"
2,Gödel,Kurt,"""Incompleteness"" Theorem"
This is a data model:
ID   Last name   First name   Theory
1    Einstein    Albert       General, Special Relativity
2    Gödel       Kurt         "Incompleteness" Theorem
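The step from syntax to data model can be sketched in Python: the CSV text is the syntax, and the rows produced by a parser are the data model. This is a minimal sketch using the standard `csv` module; the variable names are illustrative, and the line breaks in the CSV text are assumed to follow the rows of the table.

```python
import csv
import io

# Syntax: the raw CSV text (physical view).
csv_text = (
    'ID,Last name,First name,Theory\n'
    '1,Einstein,Albert,"General, Special Relativity"\n'
    '2,Gödel,Kurt,"""Incompleteness"" Theorem"\n'
)

# Data model: a table of rows with named columns (logical view).
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row["ID"], row["Last name"], row["First name"], row["Theory"], sep=" | ")
```

The quotes and doubled quotes belong to the syntax only: after parsing, the Theory values contain a plain comma and plain quotation marks.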
Syntax vs. Data Models
Physical view (syntax):
ID,Last name,First name,Theory
1,Einstein,Albert,"General, Special Relativity"
2,Gödel,Kurt,"""Incompleteness"" Theorem"
Logical view (data model):
ID   Last name   First name   Theory
1    Einstein    Albert       General, Special Relativity
2    Gödel       Kurt         "Incompleteness" Theorem
Syntax vs. Data Models
Physical view (syntax):
<a>
  <d e="f"/>
  <c>This is <b>text</b>.</c>
</a>
Logical view (data model), a tree:
a
  d (attribute e = f)
  c
    "This is "
    b
      "text"
    "."
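The same physical-to-logical step can be sketched for XML. The example below parses the document with Python's standard `xml.etree.ElementTree` and prints the resulting tree; the `outline` helper is hypothetical and only serves the illustration. ElementTree is one possible in-memory model, not the Infoset or XDM themselves.

```python
import xml.etree.ElementTree as ET

# Syntax: the XML text (physical view).
xml_text = '<a><d e="f"/><c>This is <b>text</b>.</c>\n</a>'

# Data model: a tree of elements, attributes, and text (logical view).
root = ET.fromstring(xml_text)

def outline(element, depth=0):
    """Return one line per element, attribute, and text chunk, indented by depth."""
    pad = "  " * depth
    lines = [f"{pad}element {element.tag}"]
    for name, value in element.attrib.items():
        lines.append(f"{pad}  attribute {name} = {value}")
    if element.text and element.text.strip():
        lines.append(f"{pad}  text {element.text!r}")
    for child in element:
        lines.extend(outline(child, depth + 1))
        # Text that follows a child is stored on that child as its "tail".
        if child.tail and child.tail.strip():
            lines.append(f"{pad}  text {child.tail!r}")
    return lines

tree_lines = outline(root)
print("\n".join(tree_lines))
```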
Edge vs. Node labeling
[Figure: two trees using the labels foo, bar, and foobar. In the first, labels are on the edges; in the second, labels are on the nodes.]
XML Data models
Information Set (Infoset): http://www.w3.org/TR/xml-infoset/
Post Schema-Validation Infoset (PSVI): http://www.w3.org/TR/xmlschema11-1/
XQuery and XPath Data Model (XDM): http://www.w3.org/TR/xpath-datamodel/
JSON Data Models
"original" (implicit) JSON Data Model: http://www.json.org/
JSON Schema Data Model: https://www.ietf.org/archive/id/draft-wright-json-schema-01.txt
JSONiq Data Model (JDM): http://www.jsoniq.org/docs/JSONiqExtensionToXQuery/html/section-jsoniq-data-model.html
HTML/XML Data model
Document Object Model (DOM): http://www.w3.org/TR/REC-DOM-Level-1/
XML Information Set
Information Set
<?xml version="1.0" encoding="UTF-8"?>
<dc:metadata xmlns:dc="http://www.systems.ethz.ch">
  <title xml:lang="en" year="2008">Systems Group</title>
  <publisher>ETH Zurich</publisher>
</dc:metadata>
The 11 XML Information Items
Document
Element
Attribute
Processing Instruction
Character
Comment
Namespace
Unexpanded Entity Reference
DTD
Unparsed Entity
Notation
Document Information Items
Document Information Item
[children] Element Information Item
[document element] Element Information Item metadata
[notations] <empty>
[unparsed entities] <empty>
[base URI] file:///Users/bigdata/Documents/info.xml
[character encoding scheme] UTF-8
[standalone] <no value>
[version] 1.0
[Figure: the doc node with its child metadata]
Element Information Items
Element Information Item
[namespace name] http://www.systems.ethz.ch
[local name] metadata
[prefix] dc
[children] Element Information Items
[attributes] <empty>
[namespace attributes] Attribute Information Item xmlns:dc
[in-scope namespaces] Namespace Information Items
[base URI] file:///Users/bigdata/Documents/info.xml
[parent] Document Information Item
[Figure: the metadata node, with parent doc and children title and publisher]
Attribute Information Items
Attribute Information Item
[namespace name] empty
[local name] year
[prefix] empty
[normalized value] 2008
[specified] true
[attribute type] <no value>
[references] unknown
[owner element] Element Information Item
[Figure: the year attribute attached to the title element]
XML Infoset - the tree
doc
  metadata (namespace attribute xmlns:dc; in-scope namespace dc->systems)
    title (attributes lang, year; in-scope namespace dc->systems)
      Systems Group
    publisher (in-scope namespace dc->systems)
      ETH Zurich
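Several of the Infoset properties listed above can be inspected with a namespace-aware parser. The sketch below uses Python's standard `xml.dom.minidom`, which exposes a DOM rather than the Infoset itself, so the mapping from DOM fields to Infoset properties shown in the comments is an approximation made for illustration.

```python
from xml.dom import minidom

xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<dc:metadata xmlns:dc="http://www.systems.ethz.ch">'
    '<title xml:lang="en" year="2008">Systems Group</title>'
    '<publisher>ETH Zurich</publisher>'
    '</dc:metadata>'
).encode("utf-8")

doc = minidom.parseString(xml_bytes)
root = doc.documentElement  # roughly the [document element] property

# Approximate view of the Element Information Item for dc:metadata.
element_view = {
    "namespace name": root.namespaceURI,
    "local name": root.localName,
    "prefix": root.prefix,
    "children": [n.localName for n in root.childNodes
                 if n.nodeType == n.ELEMENT_NODE],
}

# Approximate view of the Attribute Information Item for year.
title = root.getElementsByTagName("title")[0]
year = title.getAttributeNode("year")
attribute_view = {
    "namespace name": year.namespaceURI,
    "local name": year.localName,
    "normalized value": year.value,
    "owner element": year.ownerElement.localName,
}

print(element_view)
print(attribute_view)
```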
Post-Schema-Validation Infoset
Post-Schema-Validation Infoset (PSVI) = Infoset + Types
XPath and XQuery Data Model
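XDM: Sequences of Items
[Figure: a sequence of six items, written ( item, item, item, item, item, item )]
XDM: Sequence of one item
An item is the same as the sequence containing only that item: item = ( item )
XDM: Sequences are flat
Nested sequences flatten: (( item, item ), item ) = ( item, item, item )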
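The two sequence rules (a single item equals the one-item sequence, and sequences never nest) can be mimicked in a few lines of Python. This is only an analogy: Python tuples stand in for XDM sequences, and the function names `flatten` and `as_sequence` are hypothetical.

```python
def flatten(sequence):
    """Flatten nested tuples into one flat tuple, mimicking XDM sequences."""
    flat = []
    for item in sequence:
        if isinstance(item, tuple):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return tuple(flat)

def as_sequence(value):
    """Treat a single item and the sequence containing only that item alike."""
    return flatten(value) if isinstance(value, tuple) else (value,)

nested = (("a", "b"), "c")
print(flatten(nested))
print(as_sequence("a") == as_sequence(("a",)))
```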
XDM: Items
An item is either an atomic value or a node.
XDM: Seven Kinds of XML Nodes
§ Document node
§ Element node
§ Attribute node
§ Text node
§ Comment node
§ Processing instruction node
§ Namespace node
XDM vs. Infoset
[Figure: an XDM instance built from an Infoset; without schema validation, elements carry the type annotation xs:untyped]
XDM: New Items in 3.0 and 3.1
Functions
Maps
Arrays
XDM and Querying
[Figure: expressions operate on XDM values. Keywords and operators shown: for, let, where, order by, return, if then else, while, any, every, exit with, =, +]
Types
Type Systems
Almost all type systems (Java, SQL, PSVI, JDM, Protocol buffers, Avro, Parquet, and so on) share the following properties:
- Distinction between atomic types and structured types
- Same categories of atomic types
- Lists and maps as structured types
- Sequence type cardinalities
Types (General)
Atomic Types vs. Structured Types
Atomic Types
Strings
Strings (character sequences with monoid structure)
"foo" = f o o
"Zurich" = Z u r i c h
"Ilsebill salzte nach." = I l s e b i l l ␣ s a l z t e ␣ n a c h .
Atomic Types
Strings, Numbers
Interval-based integer types (exist as signed and unsigned)
8-bit (Java's byte)
16-bit (Java's short)
32-bit (Java's int)
64-bit (Java's long)
Arbitrary precision decimals (and integers)
Any precision and scale
3141592653.5897932384626433832795
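A short Python sketch of what arbitrary precision means in practice, using the standard `decimal` module. The literal is the one from the slide; the comparison with a binary double is added for illustration.

```python
from decimal import Decimal

literal = "3141592653.5897932384626433832795"

# A decimal built from a string keeps every digit of the literal.
exact = Decimal(literal)
digits = exact.as_tuple().digits
precision = len(digits)             # total number of significant digits
scale = -exact.as_tuple().exponent  # number of digits after the decimal point

# A binary double cannot hold that many digits.
as_double = float(literal)

print(precision, scale)
print(Decimal(as_double) == exact)
```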
Float and Double (IEEE 754 standard)
             single precision     double precision
Size         32 bits              64 bits
Precision    ca. 7 digits         ca. 15 digits
Example      3141592000           3141592653.58979
Range        10^-37 to 10^37      10^-307 to 10^308
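The difference in precision can be observed with Python's standard `struct` module, which can round a value to single precision. This is a minimal sketch; `to_single` is a hypothetical helper name, and Python's own `float` is assumed to be an IEEE 754 double, as it is on common platforms.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

value = 3141592653.58979
single = to_single(value)

print("bits:", struct.calcsize("<f") * 8, struct.calcsize("<d") * 8)
print("double:", repr(value))
print("single:", repr(single))
```

Only about the first seven digits of the single-precision result agree with the original value.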
Atomic Types
Strings, Numbers, Booleans
Booleans
True: TRUE, t, true, y, yes, on, 1
False: FALSE, f, false, n, no, off, 0
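The many spellings above all denote one of just two values. A minimal Python sketch of such a mapping follows; the function name `parse_boolean` is hypothetical, and which spellings a given system actually accepts depends on that system.

```python
TRUE_FORMS = {"TRUE", "t", "true", "y", "yes", "on", "1"}
FALSE_FORMS = {"FALSE", "f", "false", "n", "no", "off", "0"}

def parse_boolean(lexical):
    """Map one of the accepted spellings to the boolean value it denotes."""
    if lexical in TRUE_FORMS:
        return True
    if lexical in FALSE_FORMS:
        return False
    raise ValueError(f"not a boolean: {lexical!r}")

print([parse_boolean(s) for s in ("yes", "0", "t", "off")])
```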
Atomic Types
Strings, Numbers, Booleans, Dates and Times
Dates and times
Date
Time
Timestamp
Duration
Dates (Gregorian calendar)
Year + Month + Day
2017 August 1st (AD)
Times
Hours + Minutes + Seconds
10 : 31 : 15.109378
Timestamps
Year + Month + Day + Hours + Minutes + Seconds
2017 August 1st (AD) 10 : 31 : 15.109378
Atomic Types
Strings, Numbers, Booleans, Dates and Times, Time Intervals
Duration kinds
Year Month Day Hour Minute Second
Example: 2 years and 4 months
Example: 3 hours and 14 minutes
Atomic Types
Strings, Numbers, Booleans, Dates and Times, Time Intervals, Binaries, Null
Lexical space vs. value space
[Figure: several strings in the lexical space map to a single value in the value space.]
"1", "01", ... (lexical space) all denote the same value in the value space.
"4", "04", "100b", ... (lexical space) all denote the same value in the value space.
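The mapping from lexical space to value space can be sketched in Python. The helper `to_value` is hypothetical, and reading a trailing "b" as a binary marker is an assumption made here to interpret the "100b" example.

```python
lexical_forms = ["4", "04", "100b"]

def to_value(lexical):
    """Map a lexical form to its integer value; a trailing 'b' is read as binary."""
    if lexical.endswith("b"):
        return int(lexical[:-1], 2)
    return int(lexical)

# Three different strings in the lexical space, one value in the value space.
values = {to_value(form) for form in lexical_forms}
print(values)
```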
Subtypes
[Figure: the subtype's value space is contained in the supertype's value space.]
Structured Types
Data Structure                      Examples
Associative Arrays (a.k.a. maps)    JSON Object, Protobuf Message, Set of XML Attributes
Ordered Lists                       JSON Array, XML Element, Protobuf repeated field
Cardinality
How many?       Common sign    Common adjective
One                            required
Zero or more    *
Zero or one     ?              optional
One or more     +
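The four cardinalities amount to lower and upper bounds on a count. A minimal Python sketch follows, with an empty string standing for the case that has no sign; the names `BOUNDS` and `satisfies` are hypothetical.

```python
# sign -> (minimum, maximum); None means unbounded.
BOUNDS = {
    "": (1, 1),      # exactly one (required)
    "*": (0, None),  # zero or more
    "?": (0, 1),     # zero or one (optional)
    "+": (1, None),  # one or more
}

def satisfies(count, sign):
    """Check whether a number of occurrences is allowed by a cardinality sign."""
    low, high = BOUNDS[sign]
    return count >= low and (high is None or count <= high)

for sign in BOUNDS:
    print(repr(sign), [n for n in range(4) if satisfies(n, sign)])
```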
JSON Data Model
JSON Values
Atomic values
  Strings
  Numbers
  Booleans
  Null
Structured values
  Objects: String-to-Value map
  Arrays: List of values
Recursion: the values inside objects and arrays can themselves be structured values.
Tree-based visual model
{
  "foo" : true,
  "bar" : [
    { "foobar" : "foo" },
    null
  ]
}
object
  foo: true
  bar: array
    object
      foobar: foo
    null
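The tree view can be reproduced by parsing the JSON text and walking the result recursively. This is a sketch using Python's standard `json` module; `kind` and `outline` are hypothetical helper names, and Python dicts and lists stand in for JSON objects and arrays.

```python
import json

text = '{"foo" : true, "bar" : [{"foobar" : "foo"}, null]}'
value = json.loads(text)

def kind(v):
    """Name the JSON kind of an atomic Python value."""
    if v is None:
        return "null"
    if isinstance(v, bool):
        return "boolean"
    if isinstance(v, (int, float)):
        return "number"
    return "string"

def outline(v, label="", depth=0):
    """Return the tree as indented lines; edges are labeled with object keys."""
    pad = "  " * depth
    prefix = f"{label}: " if label else ""
    if isinstance(v, dict):
        lines = [f"{pad}{prefix}object"]
        for key, child in v.items():
            lines.extend(outline(child, key, depth + 1))
    elif isinstance(v, list):
        lines = [f"{pad}{prefix}array"]
        for child in v:
            lines.extend(outline(child, "", depth + 1))
    else:
        lines = [f"{pad}{prefix}{kind(v)} {json.dumps(v)}"]
    return lines

tree_lines = outline(value)
print("\n".join(tree_lines))
```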
Validation
Validation: The Pipeline
Document -> Well-Formedness -> Validation
On the oXygen Cheat Sheet
Validity, Well-Formedness
Validation vs. Annotation
Validation
Annotation
XML Schema
Empty Schema
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
   xmlns:xs="http://www.w3.org/2001/XMLSchema">
</xs:schema>
Simple Scenario
Schema (schema.xsd):
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foo" type="xs:string"/>
</xs:schema>
Instance (file.xml):
<?xml version="1.0" encoding="UTF-8"?>
<foo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="schema.xsd">
  This is text.
</foo>
Simple Scenario
Schema:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foo" type="xs:integer"/>
</xs:schema>
Instance:
<?xml version="1.0" encoding="UTF-8"?>
<foo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="schema.xsd">
  142857
</foo>
Simple Types: Built-in
Strings: string, anyURI, QName
Numbers: decimal, integer, float, double, long, int, short, byte, positiveInteger, nonNegativeInteger, ..., unsignedLong, unsignedInt, ...
Booleans: boolean
Dates and Times: dateTime, time, date, gYearMonth, gMonthDay, gYear, gMonth, gDay, dateTimeStamp
Time Intervals: duration, yearMonthDuration, dayTimeDuration
Binaries: hexBinary, base64Binary
Null: -
Dates
2014-12-02
2014-12-02T10:15:00Z
01:15:00-08:00
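The three lexical forms above (a date, a timestamp in UTC, and a time with a timezone offset) follow ISO 8601, so they can be read with Python's standard `datetime` module. This is a sketch only: Python's classes approximate, but do not exactly match, the value spaces of the XML Schema types.

```python
from datetime import date, datetime, time, timedelta, timezone

d = date.fromisoformat("2014-12-02")

# Python versions before 3.11 do not accept the "Z" suffix,
# so it is rewritten as the equivalent offset +00:00.
ts = datetime.fromisoformat("2014-12-02T10:15:00Z".replace("Z", "+00:00"))

t = time.fromisoformat("01:15:00-08:00")

print(d, ts, t, sep="\n")
print(t.utcoffset())
```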
Durations
P1Y2MT3H
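The duration above reads as 1 year, 2 months, and 3 hours: "P" starts the duration and "T" separates the date part from the time part. Python's standard library has no parser for this format, so the sketch below uses a regular expression; `parse_duration` is a hypothetical helper that covers the common cases, not a complete implementation of the XML Schema duration type.

```python
import re

DURATION = re.compile(
    r"^(?P<sign>-)?P"
    r"(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)

def parse_duration(lexical):
    """Split a duration such as P1Y2MT3H into its named components."""
    match = DURATION.match(lexical)
    # Reject forms with no component at all, or with a dangling "T".
    if match is None or lexical.endswith(("P", "T")):
        raise ValueError(f"not a duration: {lexical!r}")
    parts = match.groupdict()
    sign = -1 if parts.pop("sign") else 1
    return {
        name: sign * (float(text) if name == "seconds" else int(text))
        for name, text in parts.items()
        if text is not None
    }

print(parse_duration("P1Y2MT3H"))
```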
User-defined types
Restriction
Union (not atomic)
List (not atomic)
Restriction
Schema:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
   xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="myFixedLengthString">
    <xs:restriction base="xs:string">
      <xs:length value="3"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="foo" type="myFixedLengthString"/>
</xs:schema>
Instance:
<?xml version="1.0" encoding="UTF-8"?>
<foo
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:noNamespaceSchemaLocation="schema.xsd">ZRH</foo>
List
Schema:
<xs:simpleType name="myList">
  <xs:list itemType="xs:string"/>
</xs:simpleType>
Instance:
<foo>foo bar foobar</foo>
Union
Schema:
<xs:simpleType name="myUnion">
  <xs:union memberTypes="xs:integer xs:boolean"/>
</xs:simpleType>
Instance:
<foo>true</foo>
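What the three kinds of user-defined simple types accept can be sketched in plain Python. The functions below are hypothetical stand-ins for a schema validator: they assume list items are separated by whitespace and union member types are tried in the order they are declared, and they ignore many details of the real XML Schema lexical rules.

```python
def check_fixed_length_string(lexical, length=3):
    """Restriction: a string whose length facet is fixed (3 in the example)."""
    return len(lexical) == length

def parse_list(lexical):
    """List: a whitespace-separated sequence of items of the item type."""
    return lexical.split()

def parse_union(lexical):
    """Union of integer and boolean: try the member types in order."""
    text = lexical.strip()
    try:
        return int(text)
    except ValueError:
        pass
    if text in ("true", "false"):
        return text == "true"
    raise ValueError(f"neither integer nor boolean: {lexical!r}")

print(check_fixed_length_string("ZRH"))
print(parse_list("foo bar foobar"))
print(parse_union("true"), parse_union("42"))
```

Because the integer member is tried first, the form "1" is read as an integer here rather than as a boolean.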
Complex Types
Empty: <foo/>
Simple Content: <foo>text</foo>
Complex Content: <foo><a/><b/></foo>
Mixed Content: <foo>Text<a/>Text<b/></foo>
Complex content
Schema:
<xs:complexType name="complexContent">
  <xs:sequence>
    <xs:element name="twotofour" type="xs:string"
                minOccurs="2" maxOccurs="4"/>
    <xs:element name="zeroorone" type="xs:boolean"
                minOccurs="0" maxOccurs="1"/>
  </xs:sequence>
</xs:complexType>
Instance:
<foo>
  <twotofour>foobar</twotofour>
  <twotofour>foobar</twotofour>
  <twotofour>foobar</twotofour>
  <zeroorone>true</zeroorone>
</foo>
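The effect of minOccurs and maxOccurs inside a sequence can be sketched without a schema processor. The Python below checks only the order and number of child elements against a hand-written copy of the content model; it does not check the types of the element contents, and `CONTENT_MODEL` and `matches_sequence` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# (name, minOccurs, maxOccurs), in the order required by xs:sequence.
CONTENT_MODEL = [("twotofour", 2, 4), ("zeroorone", 0, 1)]

def matches_sequence(element, model=CONTENT_MODEL):
    """Check that the children appear in model order with allowed counts."""
    names = [child.tag for child in element]
    position = 0
    for name, low, high in model:
        count = 0
        while position < len(names) and names[position] == name and count < high:
            position += 1
            count += 1
        if count < low:
            return False
    # Any child left over is not allowed by the content model.
    return position == len(names)

instance = ET.fromstring(
    "<foo>"
    "<twotofour>foobar</twotofour>"
    "<twotofour>foobar</twotofour>"
    "<twotofour>foobar</twotofour>"
    "<zeroorone>true</zeroorone>"
    "</foo>"
)
print(matches_sequence(instance))
```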
Empty content
Schema:
<xs:complexType name="emptyType">
  <xs:sequence/>
</xs:complexType>
Instance:
<foo/>
Simple content
Schema:
<xs:complexType name="dateCountry">
  <xs:simpleContent>
    <xs:extension base="xs:date">
      <xs:attribute name="country" type="xs:string"/>
    </xs:extension>
  </xs:simpleContent>
</xs:complexType>
Instance:
<foo country="Switzerland">2014-12-02</foo>
Mixed content
Schema:
<xs:complexType name="mixedContent" mixed="true">
  <xs:sequence>
    <xs:element name="b" type="xs:string"
                minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
Instance:
<foo>Some text and some <b>bold</b> text.</foo>
Simple type on attributes
Schema:
<xs:complexType name="withAttribute">
  <xs:sequence/>
  <xs:attribute name="country"
                type="xs:string"
                default="Switzerland"/>
</xs:complexType>
Instance:
<foo country="Switzerland"/>
Named Types
Schema:
<xs:complexType name="empty">
  <xs:sequence/>
</xs:complexType>
<xs:element name="c" type="empty">
</xs:element>
Instance:
<c/>
Anonymous Types
Schema:
<xs:element name="c">
  <xs:complexType>
    <xs:sequence/>
  </xs:complexType>
</xs:element>
Instance:
<c/>
No namespaces
Schema:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
   xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foo" type="xs:string"/>
</xs:schema>
Instance:
<?xml version="1.0" encoding="UTF-8"?>
<foo
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:noNamespaceSchemaLocation="schema.xsd">
   This is text.
</foo>
With namespaces
Schema:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com/bigdata"
           xmlns:big="http://www.example.com/bigdata">
  <xs:element name="foo" type="xs:string"/>
</xs:schema>
Instance:
<?xml version="1.0" encoding="UTF-8"?>
<big:foo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.example.com/bigdata schema.xsd"
         xmlns:big="http://www.example.com/bigdata">
  This is text.
</big:foo>
Warning: named types with namespaces
Schema:
<xs:complexType name="empty">
  <xs:sequence/>
</xs:complexType>
<xs:element name="c" type="big:empty">
</xs:element>
Instance:
<big:c/>
@name: implicitly in the target namespace, always unprefixed
@type: a reference to a qualified name, so it needs the prefix (big:empty)
Bonus material: The Schema of Schemas
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.w3.org/2001/XMLSchema">
  <xs:element name="schema" id="schema">
    <xs:complexType>
      <xs:complexContent>
        ..
      </xs:complexContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="element" type="xs:topLevelElement" id="element"/>
  <xs:element name="simpleType" type="xs:topLevelSimpleType" id="simpleType"/>
  <xs:element name="complexType" type="xs:topLevelComplexType" id="complexType"/>
  <xs:complexType name="element" abstract="true">
    <xs:complexContent>
      ..
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
Alternate data models and validation formats
Protocol buffers
message Person {
  required string last_name = 1;
  repeated string first_name = 2;
  optional Title title = 3;
  optional Person boss = 4;
}
Avro
fields      values
name        ETH
canton      ZH
students    20,000
{
  "type" : "map",
  "name" : "university",
  "fields" : [
    { "name" : "name", "type" : "string" },
    { "name" : "canton", "type" : "cantonal-code" },
    { "name" : "students", "type" : "long" }
  ]
}