Basic Technologies (Unicode, URIs, Namespaces, XML)

Camilo Thorne
Room 00.012
Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart
+49 (0) 711 685-81369
[email protected]

Semantic Web, SS 2017 (based on slides by W. Kessler)

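The Semantic Web Stack [W3C, Tim Berners-Lee]

[Figure: the Semantic Web layer stack. Layers from bottom to top: URI and Unicode/UTF-8; XML, XML Schema, Namespaces; RDF; RDFS and SPARQL; Ontology, OWL; Logic, Rules; Proof; Trust; User Interface, Software Agents. Encryption and Digital Signatures are drawn alongside the layers.]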

Outline

1 Recap

2 Unicode: One Character Set to Represent Them All

3 URIs: Uniform Resource Identifiers

4 XML: eXtensible Markup Language

5 XML Namespaces

6 XML Schema: Defining XML in XML

7 Summary

8 References

Recap on Modeling Basics

Pinpoint the entities, concepts, relations, states of affairs and constraints mentioned in the following text, and build a formal representation:

Frames were proposed by Marvin Minsky in the paper “A Framework for Representing Knowledge.” Frames consist of slots and values. Frames are the primary data structure used in AI frame languages. Frames are similar to class hierarchies in object-oriented languages, but their design goals are different.

Unicode

The first computers only “spoke” English and stored characters in 7 bits, with the first bit of each byte set to 0 → ASCII: A is 01000001

With the first bit set to 1, we can encode “other” characters → e.g., in Latin-1: A is 01000001, ä is 11100100

You have to know the encoding to display a text correctly, but it is often not specified anywhere. This is madness!

Since 1987, there have been attempts to create one character set for every existing writing system

In 1991, the first Unicode standard was published

Unicode maps each character to an abstract code point, written in hexadecimal: A is U+0041, ä is U+00E4
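The difference between a code point and a stored byte can be checked with a few lines of Python. This is only an illustrative sketch: the language choice and the helper names `code_point` and `bits` are assumptions, not part of the lecture material.

```python
# Code points are abstract numbers; an encoding turns them into bytes.

def code_point(ch: str) -> str:
    """Return the Unicode code point of one character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

def bits(data: bytes) -> str:
    """Render bytes as space-separated groups of 8 bits."""
    return " ".join(f"{b:08b}" for b in data)

a_umlaut = "\u00e4"  # the character ä, written as an escape to avoid ambiguity

print(code_point("A"), bits("A".encode("ascii")))            # U+0041 01000001
print(code_point(a_umlaut), bits(a_umlaut.encode("latin-1")))  # U+00E4 11100100
```

In Latin-1 the byte value happens to equal the code point, which is why ä is stored as 11100100 (hexadecimal E4).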

UTF-8: An Encoding for Unicode

A code point by itself does not say how a character is stored in bits/bytes; that is the job of an encoding

There are several encodings for Unicode; the most widely used is UTF-8

UTF-8 is a variable-length encoding and stores each Unicode code point in one to four bytes (the original design allowed up to six bytes; RFC 3629 restricts it to four)

Code points 0-127 are stored in one byte, so that text using only English characters looks the same in ASCII and UTF-8

Examples:

Character   Unicode   UTF-8
A           U+0041    01000001
ä           U+00E4    11000011 10100100
€           U+20AC    11100010 10000010 10101100
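The rows of the table can be reproduced with Python's built-in UTF-8 codec. This is a minimal sketch; the helper name `utf8_bits` is hypothetical, and the fourth character (U+1F600) is an added illustration of a four-byte sequence that does not appear in the slides.

```python
def utf8_bits(ch: str) -> str:
    """Encode one character as UTF-8 and show each byte as 8 bits."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))

# A, ä and € from the table, plus one code point outside the 16-bit range.
for ch in ["A", "\u00e4", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), "byte(s):", utf8_bits(ch))
```

The number of bytes grows with the code point: one byte for ASCII, two for ä, three for €, and four for the last example.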

Quiz: Unicode

Which of these statements are true?

A) Unicode is an encoding

B) UTF-8 is an encoding

C) One character uses at most 2 bytes in UTF-8 encoding

D) There are Unicode code points for Egyptian Hieroglyphs

E) Everybody uses UTF-8 encoding by default today

F) Documents you hand in during this course should use UTF-8

Uniform Resource Identifiers (URIs)

“Everything has a URI”

A URI is a unique identifier for a specific resource, i.e., no two resources can have the same URI in the same domain

One resource can have several URIs, e.g., I have a URI that refers to me as a teacher and one that refers to me as a singer

A URI can take many forms: it can be a URL (Uniform Resource Locator, or Web address), but not all URIs are URLs

A URI does not necessarily enable access to a resource
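To make the URI/URL distinction concrete, the sketch below splits three URIs with Python's standard `urllib.parse`. The concrete URIs are illustrative assumptions; only the first one names a network location and is therefore a URL in the sense used above.

```python
from urllib.parse import urlparse

# Illustrative URIs (assumed for this example).
examples = [
    "http://www.example.org/#JohnSmith",  # a URL: scheme plus host
    "urn:isbn:0451450523",                # a URN: identifies a book, no host
    "mailto:someone@example.org",         # identifies a mailbox, no host
]

for uri in examples:
    parts = urlparse(uri)
    kind = "has a network location" if parts.netloc else "no network location"
    print(f"{uri}: scheme={parts.scheme!r}, {kind}")
```

All three strings are syntactically valid URIs, but only the first can be looked up with a Web browser.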

URI Examples

For us, URIs will always look like URLs, e.g., http://www.example.org/#JohnSmith.

URIs have two parts:

Namespace    http://www.example.org/#
Local name   JohnSmith

We can define prefixes for namespaces and abbreviate URIs with prefix:LocalName.

We will define ex as prefix for the example namespace, so http://www.example.org/#JohnSmith is abbreviated as ex:JohnSmith
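The abbreviation scheme amounts to simple string concatenation. The following sketch shows one possible way to expand and abbreviate names; the functions `expand` and `abbreviate` and the `PREFIXES` table are hypothetical helpers written for this illustration.

```python
# Prefix table: the ex prefix is the one defined on the slide.
PREFIXES = {"ex": "http://www.example.org/#"}

def expand(qname: str, prefixes: dict = PREFIXES) -> str:
    """Turn prefix:LocalName into a full URI (namespace + local name)."""
    prefix, local = qname.split(":", 1)
    return prefixes[prefix] + local

def abbreviate(uri: str, prefixes: dict = PREFIXES) -> str:
    """Turn a full URI into prefix:LocalName if a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no matching namespace: leave the URI unchanged

print(expand("ex:JohnSmith"))                           # http://www.example.org/#JohnSmith
print(abbreviate("http://www.example.org/#JohnSmith"))  # ex:JohnSmith
```

The prefix itself carries no meaning; only the namespace it stands for does.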

Quiz: URIs

Which of these statements are true?

A) Two different URIs can never refer to the same object.

B) Two different objects can have the same URI.

C) All URIs are URLs.

D) INwOXOz96UQOU is a valid URI.

E) URIs must be assigned by the W3C to be valid.

XML: eXtensible Markup Language

W3C Recommendation since 1998 (first draft 1996).

Markup language based on tags.

XML separates content from formatting.

XML documents are meant to be understood by both humans and computers.

XML as a format for exchanging data became far more common than originally intended.

XML vs. HTML

Both are markup languages based on tags.

In both languages tags may be nested.

In XML all tags must be closed (every opening tag <tag> must have a closing tag </tag>). HTML allows tags that are not closed.

In XML users define their own tags; HTML has predefined tags.

XML separates content from formatting; HTML does not.
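The closing-tag rule can be observed with any XML parser. The sketch below uses `xml.etree.ElementTree` from the Python standard library; the helper `is_well_formed` and the two sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

closed = "<p>line one<br></br>line two</p>"  # every tag is closed
unclosed = "<p>line one<br>line two</p>"     # <br> is never closed (fine in HTML)

print(is_well_formed(closed))    # True
print(is_well_formed(unclosed))  # False
```

An HTML parser would accept the second string, whereas an XML parser must reject it.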

XML Syntax – Prologue and Root Element

<?xml version="1.0" encoding="UTF-8"?>
<pets>
  ...
</pets>

The XML declaration specifies the XML version and the character encoding. It is optional in XML 1.0, but if present it must be the first line of the file.

There is exactly one outermost element in the document (called the root element).
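Both points can be tried out with a standard parser. This is a small sketch using Python's `xml.etree.ElementTree`; the second document with a `customers` element is a made-up counterexample.

```python
import xml.etree.ElementTree as ET

# Bytes input lets the parser honour the encoding named in the declaration.
one_root = b'<?xml version="1.0" encoding="UTF-8"?>\n<pets>\n</pets>\n'
root = ET.fromstring(one_root)
print(root.tag)  # pets

# A second outermost element violates the single-root rule.
two_roots = b"<pets></pets><customers></customers>"
try:
    ET.fromstring(two_roots)
except ET.ParseError as err:
    print("not well-formed:", err)
```

The parser stops at the second outermost element, because a document may only contain one root.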

XML Syntax – Elements

<pet>
  <name>Fifi</name>
  <petType>Dog</petType>
  <dateOfBirth></dateOfBirth>
</pet>

Elements represent the “things” the document talks about.

An element consists of an opening tag with its name, a closing tag and the element’s content between the tags.

The content may be text, other elements, or nothing.

If there is no content, then the element is called empty and can be abbreviated like this: <dateOfBirth />.
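That the two spellings of an empty element are equivalent can be checked by parsing both. The sketch below assumes Python's `xml.etree.ElementTree`; the variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<pet><name>Fifi</name><dateOfBirth></dateOfBirth></pet>")
short_form = ET.fromstring("<pet><name>Fifi</name><dateOfBirth /></pet>")

for pet in (long_form, short_form):
    dob = pet.find("dateOfBirth")
    # An empty element has no child elements and no text content.
    print(pet.find("name").text, "| children:", len(dob), "| text:", repr(dob.text or ""))
```

Both documents yield the same tree: a dateOfBirth element with neither children nor text.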

XML Syntax – Example

<?xml version="1.0" encoding="UTF-8"?>
<pets>
  <!-- This is a comment -->
  <pet>
    <name>Fifi</name>
    <petType>Dog</petType>
    <dateOfBirth></dateOfBirth>
    <owner>
      <name>Jane Doe</name>
      <city>Heretown</city>
    </owner>
  </pet>
</pets>

XML as a Tree

<?xml version="1.0" encoding="UTF-8"?>
<pets>
  <!-- This is a comment -->
  <pet>
    <name>Fifi</name>
    <petType>Dog</petType>
    <dateOfBirth></dateOfBirth>
    <owner>
      <name>Jane Doe</name>
      <city>Heretown</city>
    </owner>
  </pet>
  <pet>
    <name>Fluffy</name>
    ...
  </pet>
</pets>

The corresponding tree:

pets
  pet
    name: Fifi
    petType: Dog
    dateOfBirth: –
    owner
      name: Jane Doe
      city: Heretown
  pet
    name: Fluffy
    ...
  ...
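The tree view can be generated from the document by a recursive walk over the parsed elements. The following is a minimal sketch with Python's `xml.etree.ElementTree`; the function name `outline` is hypothetical, and only the fully specified first pet is included because the second one is elided on the slide.

```python
import xml.etree.ElementTree as ET

doc = """<pets>
  <pet>
    <name>Fifi</name>
    <petType>Dog</petType>
    <dateOfBirth></dateOfBirth>
    <owner>
      <name>Jane Doe</name>
      <city>Heretown</city>
    </owner>
  </pet>
</pets>"""

def outline(element, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    text = (element.text or "").strip()
    label = f"{element.tag}: {text}" if text else element.tag
    lines = ["  " * depth + label]
    for child in element:  # iterating an element yields its child elements
        lines.extend(outline(child, depth + 1))
    return lines

print("\n".join(outline(ET.fromstring(doc))))
```

Nesting in the markup becomes the parent/child relation in the tree, and text content ends up at the leaves.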

Quiz: XML

Find the errors in this XML document:

<?xml version="1.0" encoding="UTF-8"?>
<fruits>
  <fruit>
    <fruit name>Orange</fruit name>
    <price>3.15</pricePerKilo>
    <amount>0.570<amount/>
    <priceTotal>1.57</priceTotal>
    <origin>Germany<producer>Bioland Mauck</origin></producer>
  </fruit>
</fruits>
<customer>
  <name>John</name><name>Smith</name>
  <customerid>7271</customerID>
</customer>

Combining XML Documents

Doc 1 describes books, <title> refers to the book title:

<book>
  <title>Max und Moritz</title>
  <author>Wilhelm Busch</author>
</book>

Doc 2 describes people, <title> refers to the academic degree:

<person>
  <title>Prof. Dr. med.</title>
  <name>Friedrich Busch</name>
</person>

XML documents can import things from various sources → “name clashes” are inevitable.

XML Namespaces

Namespaces define a set of element names used in one document and serve to disambiguate elements with the same name from different sources

Namespaces are used in tags by qualifying the element name with a prefix: <prefix:elementName>

Using the namespace books for document 1 and people for document 2 solves the disambiguation problem:

<books:title>Max und Moritz</books:title>

is clearly different from

<people:title>Prof. Dr. med.</people:title>

Example with Namespaces

<?xml version="1.0" encoding="UTF-8"?>
<ex:pets
    xmlns:ex="http://www.example.org/#"
    xmlns:cust="http://www.examplecustomers.org/people/#">
  <!-- This is a comment -->
  <ex:pet>
    <ex:name>Fifi</ex:name>
    <ex:petType>Dog</ex:petType>
    <ex:dateOfBirth></ex:dateOfBirth>
    <ex:owner>
      <cust:name>Jane Doe</cust:name>
      <cust:city>Heretown</cust:city>
    </ex:owner>
  </ex:pet>
</ex:pets>
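A namespace-aware parser resolves each prefix to its namespace URI, so the two name elements above are distinct even though their local names match. The sketch below shows this with Python's `xml.etree.ElementTree` on a shortened copy of the example; the lookup prefixes `a` and `b` are deliberately different from those in the document and are assumptions of this illustration.

```python
import xml.etree.ElementTree as ET

doc = """<ex:pets
    xmlns:ex="http://www.example.org/#"
    xmlns:cust="http://www.examplecustomers.org/people/#">
  <ex:pet>
    <ex:name>Fifi</ex:name>
    <ex:owner>
      <cust:name>Jane Doe</cust:name>
    </ex:owner>
  </ex:pet>
</ex:pets>"""

root = ET.fromstring(doc)
# ElementTree stores qualified names as {namespace-URI}localName.
print(root.tag)  # {http://www.example.org/#}pets

# Prefixes are local shorthand: any prefix bound to the same URI matches.
ns = {
    "a": "http://www.example.org/#",
    "b": "http://www.examplecustomers.org/people/#",
}
print(root.find("a:pet/a:name", ns).text)          # Fifi
print(root.find("a:pet/a:owner/b:name", ns).text)  # Jane Doe
```

What identifies an element is the pair of namespace URI and local name, not the prefix that happens to be written in the file.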

Quiz: Namespaces

Given the following namespace prefixes:

PREFIX ex: <http://www.example.org/#>

PREFIX bsp: <http://www.example.org/#>

PREFIX ims: <http://www.ims.uni-stuttgart.de/#>

Which URIs refer to http://www.example.org/#SemanticWeb?

A) http://www.example.org/SemanticWeb

B) ex:SemanticWeb

C) bsp:SemanticWeb

D) ex#SemanticWeb

E) ims:SemanticWeb

XML Schema: Defining XML in XML

XML Schema offers a language for defining the syntactic structure of XML documents (an older, less powerful method for doing this is to use DTDs)

This means an XML Schema defines which types of elements and attributes are allowed in which places inside an XML document

XML Schema provides a set of predefined data types that are widely used

XML Schema Data Types

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

Some predefined data types:

Text       xsd:string, ...
Numbers    xsd:int, xsd:integer, xsd:nonNegativeInteger, xsd:positiveInteger, ...
Decimals   xsd:decimal, xsd:float, xsd:double, ...
Dates      xsd:date, xsd:dateTime, ...
Boolean    xsd:boolean
URIs       xsd:anyURI

These data types are used to type elements
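Typing an element means that its text content must be a valid literal of the given data type. The sketch below approximates this idea in Python for a handful of types. It is not a schema validator: the `PARSERS` table and the function names are hypothetical, the sample literals are invented, and the Python converters are slightly more permissive than the XML Schema definitions (for example, `Decimal` accepts exponent notation that xsd:decimal does not).

```python
from datetime import date
from decimal import Decimal

def parse_boolean(text: str) -> bool:
    """xsd:boolean allows exactly the literals true, false, 1 and 0."""
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"not an xsd:boolean literal: {text!r}")

# Rough mapping from a few XML Schema types to Python converters.
PARSERS = {
    "xsd:string": str,
    "xsd:integer": int,
    "xsd:decimal": Decimal,
    "xsd:date": date.fromisoformat,  # handles plain YYYY-MM-DD only
    "xsd:boolean": parse_boolean,
}

def parse_typed(text: str, xsd_type: str):
    """Convert element text to a Python value according to its declared type."""
    return PARSERS[xsd_type](text)

print(parse_typed("7271", "xsd:integer"))     # 7271
print(parse_typed("3.15", "xsd:decimal"))     # 3.15
print(parse_typed("2017-04-24", "xsd:date"))  # 2017-04-24
print(parse_typed("true", "xsd:boolean"))     # True
```

A literal that does not fit its declared type, such as `parse_typed("Dog", "xsd:integer")`, raises a `ValueError`, which is the kind of error a schema validator reports.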

Example XML Schema

<?xml version="1.0" encoding="UTF-8"?>
<!-- HelloWorld Example -->
<xsd:schema
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:hello="http://example.com/HelloWorld/#"
    targetNamespace="http://example.com/HelloWorld/#">
  <xsd:complexType name="employee">
    <xsd:sequence>
      <!-- name: string content plus an email attribute -->
      <xsd:element name="name">
        <xsd:complexType>
          <xsd:simpleContent>
            <xsd:extension base="xsd:string">
              <xsd:attribute name="email" type="xsd:string"/>
            </xsd:extension>
          </xsd:simpleContent>
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="id" type="xsd:integer"/>
      <xsd:element name="income" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

Example XML File

<?xml version="1.0" encoding="UTF-8"?>
<!-- HelloWorld Example, continued -->
<hello:employee
    xmlns:hello="http://example.com/HelloWorld/#">
  <hello:name hello:email="[email protected]">
    John Doe
  </hello:name>
  <hello:id/>
  <hello:income>28,000,000</hello:income>
</hello:employee>

Summary: Basic Semantic Web Technologies

Unicode is a mapping from the characters of any writing system to abstract code points.

UTF-8 is an encoding for Unicode code points.

A URI is a unique identifier for a specific resource.

XML is a format for exchanging data between applications.

XML has no predefined tags; to exchange data, applications have to agree on a common vocabulary (a set of tags with specified meaning).

Namespaces serve to disambiguate XML elements from different sources and make URIs more readable.

XML Schema describes the syntax of XML documents and provides a set of common data types.

Suggested Reading

Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph and York Sure. Semantic Web. Grundlagen. Springer textbook, 2008. (Chapter 2)

Pascal Hitzler, Markus Krötzsch and Sebastian Rudolph. Foundations of Semantic Web Technologies. Chapman & Hall/CRC, 2009. (Appendix A)
