CS317 - File and Database Systems
Lecture 5, Part-2
September 24, 2018
Sam Siewert

http://www.ibmbigdatahub.com/video/ibm-big-data-minute-drowning-petabytes
http://bigdata-madesimple.com/dilberts-20-funniest-cartoons-on-big-data/

“Drowning in Data - Every 2 days the world generates as much data as it had through all of history up to the year 2003. ....” - IBM Big Data & Analytics Hub

mercury.pr.erau.edu/~siewerts/cs317/documents/Lectures/... · 2018-09-27




Survey Says …
Flip a few classes (SQL practice, teams) to engage
Only 4 Quizzes total (1 make-up at the end for take-away)


SQL Theory and Standards

DBMS Design

Big Data


For Discussion…

Big Data – Velocity, volume, variety, veracity [2014]

1. Daily – 2.5 quintillion bytes (2,500,000,000,000,000,000), or roughly 2 Exabytes (2.5 EB decimal, about 2.17 EiB), or 46,566,128 50GB Blu-Ray Discs, IBM Estimate

2. Annually – 7.5 billion in global population, produce/consume 2.25 unique Blu-Rays per Year, or 23 DVDs (assuming even distribution – unlikely)

3. Annually – If produced/consumed by US population alone – 53 Blu-Rays per Year or 564 DVDs per person

4. Data in Total is 40 trillion gigabytes or 800 billion Blu-Rays for just over 100 (unique) Blu-Rays per person globally

5. Data by Powers of 10 and 2 – 2^64 bytes is 16 Exabytes (EiB) of Addressable Data [PC limit]

6. Max Data Velocity – 100 Gbps on the Fastest Ethernet [with 8b/10b encoding, 10 billion bytes per second]

7. How much is Truly Unique Data vs. Duplicated

8. What is the Quality (Veracity) of this Data?
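The figures in items 1, 2, 4 and 5 can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a "50GB" Blu-ray is counted as 50 × 2^30 bytes (the assumption that reproduces the 46,566,128 figure) and a world population of 7.5 billion as stated above; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the slide's estimates.
# Assumption: one "50GB" Blu-ray holds 50 * 2**30 bytes.
DAILY_BYTES = 2.5e18            # IBM estimate quoted on the slide
BLU_RAY_BYTES = 50 * 2**30      # assumed binary gigabytes
WORLD_POPULATION = 7.5e9        # from the slide

# Item 1: Blu-rays generated per day.
blu_rays_per_day = DAILY_BYTES / BLU_RAY_BYTES

# Item 2: Blu-rays per person per year, assuming even distribution.
blu_rays_per_person_per_year = blu_rays_per_day * 365 / WORLD_POPULATION

# Item 4: 40 trillion gigabytes in total, at 50 GB per disc.
total_blu_rays = 40e12 / 50
total_blu_rays_per_person = total_blu_rays / WORLD_POPULATION

# Item 5: bytes addressable with 64 bits, expressed in EiB (2**60 bytes).
addressable_eib = 2**64 / 2**60

print(f"Blu-rays per day:             {int(blu_rays_per_day):,}")
print(f"Blu-rays per person per year: {blu_rays_per_person_per_year:.2f}")
print(f"Total Blu-rays:               {total_blu_rays:,.0f}")
print(f"Total Blu-rays per person:    {total_blu_rays_per_person:.1f}")
print(f"64-bit addressable data:      {addressable_eib:.0f} EiB")
```

The per-person result comes out slightly above the 2.25 quoted in item 2, which is within rounding for an estimate of this kind.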


Data Archives - Digital Tape vs. VTL

Tape still competitive - Roadmap 2015-2025

Disk areal density > tape, but total capacity less - e.g. MIT Tech Review on HAMR

IEEE Spectrum on Mag Tape

LTO-8 (12TB), < $200 on Amazon, $0.016/GB; E.g. Spectra Logic (640PB in 5 42U Racks), Nathan T.

Seagate Exos (12TB), < $400 on Amazon, $0.032/GB, HAMR HDD; E.g. DDN Exascaler (35PB in 4 42U Racks)

Tape is ½ cost, and 14x uncompressed storage density (1U = 1.75 inches, 42U is just over 6 foot tall)
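The "½ cost" and "14x" claims follow directly from the numbers quoted above. The sketch below recomputes them, assuming the listed prices are treated as exact (the slide gives them as upper bounds) and that 12TB means 12,000 GB; the variable names are illustrative.

```python
# Cost per GB, using the quoted prices as if exact (the slide says "< $200" and "< $400").
TAPE_PRICE_USD, TAPE_CAPACITY_GB = 200, 12_000   # LTO-8 cartridge
DISK_PRICE_USD, DISK_CAPACITY_GB = 400, 12_000   # Seagate Exos drive

tape_cost_per_gb = TAPE_PRICE_USD / TAPE_CAPACITY_GB
disk_cost_per_gb = DISK_PRICE_USD / DISK_CAPACITY_GB
cost_ratio = disk_cost_per_gb / tape_cost_per_gb

# Storage density per 42U rack for the two example systems.
tape_pb_per_rack = 640 / 5    # Spectra Logic: 640 PB in 5 racks
disk_pb_per_rack = 35 / 4     # DDN Exascaler: 35 PB in 4 racks
density_ratio = tape_pb_per_rack / disk_pb_per_rack

# Rack height: 42 rack units at 1.75 inches each.
rack_height_inches = 42 * 1.75
rack_height_feet = rack_height_inches / 12

print(f"Tape: ${tape_cost_per_gb:.4f}/GB, disk: ${disk_cost_per_gb:.4f}/GB, disk/tape = {cost_ratio:.1f}x")
print(f"Tape: {tape_pb_per_rack:.1f} PB/rack, disk: {disk_pb_per_rack:.2f} PB/rack, tape/disk = {density_ratio:.1f}x")
print(f"42U rack height: {rack_height_inches} in ({rack_height_feet:.3f} ft)")
```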


Data Centers

Large - to host an Exabyte (large room, e.g. 30+ person classroom)
Thermal and power challenge
120 6+ foot, 19” wide, 28.5” deep Racks!

https://news.microsoft.com/features/under-the-sea-microsoft-tests-a-datacenter-thats-quick-to-deploy-could-provide-internet-connectivity-for-years/

“The Project Natick data center has 12 racks containing a total of 864 servers and associated cooling system infrastructure.” - Microsoft AI & Research
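The rack count and floor space can be sanity-checked in a few lines. This is a rough sketch, not the slide's own derivation: it assumes 1 Exabyte = 1000 PB and borrows the per-rack densities from the tape-versus-disk examples (35 PB in 4 racks for disk, 640 PB in 5 racks for tape). The footprint counts only the racks themselves, with no aisles or cooling.

```python
import math

EXABYTE_PB = 1000                 # assumption: decimal units
DISK_PB_PER_RACK = 35 / 4         # from the DDN Exascaler example
TAPE_PB_PER_RACK = 640 / 5        # from the Spectra Logic example

# Racks needed to hold one exabyte at each density.
disk_racks_for_exabyte = math.ceil(EXABYTE_PB / DISK_PB_PER_RACK)
tape_racks_for_exabyte = math.ceil(EXABYTE_PB / TAPE_PB_PER_RACK)

# Bare footprint of 120 racks, each 19 in wide and 28.5 in deep.
RACKS = 120
rack_footprint_sq_ft = (19 * 28.5) / 144
total_footprint_sq_ft = RACKS * rack_footprint_sq_ft

# Project Natick: 864 servers in 12 racks.
natick_servers_per_rack = 864 // 12

print(f"Disk racks for 1 EB: {disk_racks_for_exabyte}")
print(f"Tape racks for 1 EB: {tape_racks_for_exabyte}")
print(f"Footprint of {RACKS} racks: {total_footprint_sq_ft:.2f} sq ft")
print(f"Natick servers per rack: {natick_servers_per_rack}")
```

At the disk density the estimate lands close to the 120 racks quoted on the slide.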


Big Data

Volume and Velocity Can Be Estimated as Shown
– Disk drives shipped and in use
– Online data only, or removable and archive media as well?
– Bit-rot (media eventually fails, limited storage lifetime)

Variety, Depends on Level of Data Duplication
– Enterprise Storage System Deduplication – E.g. EMC Deduplication
– Internet Archive [petabytes] and Wayback Machine
– http://www.loc.gov/about/general-information/ [traditional volumes]
– Stanford Digital Repository, National Archives, National A/V Conservation

Veracity, perhaps Most Challenging Part
– Is the Data Correct – Not Corrupted
– Is it Valid – From a Known, Trusted Source, Corresponding to Metadata Description
– Has the Data Been Processed and if so, How?
– Is it Raw Data (from a sensor, user, other)?
– Veracity is difficult – E.g. http://berkeleyearth.org/about-data-set


Why NoSQL instead of MySQL?

MongoDB - Linux install (SE Workstation for projects)

C++ with Persistence (STLplus C++ library)

Redis - “in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs and geospatial indexes…”

Cassandra - “a free and open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure” - Wikipedia


Semi-structured


Unstructured Data

BLOBs - Binary Large Objects
– Images
– Digital Video and Audio – Digital Media
– Binary Data (Documents and Code), Perhaps Proprietary
– Moose-to-Skeleton.png
– Sled-Dogs.jpg
– korean-air-profile.jpg

CLOBs – Character Large Objects
– Log files and Traces (IT)
– Transaction Logs

Semi-Structured (Self-describing)
– XML, HTML, XDS, etc. [Web documents typically via HTTP, HTTPS]
– JSON
– NoSQL
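To make the BLOB versus CLOB distinction concrete, the sketch below stores one of each in an in-memory SQLite database through Python's `sqlite3` module. SQLite is only a convenient stand-in here: it has no dedicated CLOB type, so a `TEXT` column plays that role. The table names are hypothetical, and the eight bytes stored are just the PNG file signature, not the actual image named on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (name TEXT PRIMARY KEY, content BLOB)")  # binary large objects
conn.execute("CREATE TABLE logs (name TEXT PRIMARY KEY, content TEXT)")   # character large objects

# Placeholder binary payload: the 8-byte PNG signature, not a real image.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
conn.execute("INSERT INTO media VALUES (?, ?)", ("Moose-to-Skeleton.png", png_signature))

# Placeholder character payload standing in for a transaction log.
log_text = "BEGIN\nUPDATE account SET balance = balance - 10\nCOMMIT\n"
conn.execute("INSERT INTO logs VALUES (?, ?)", ("transaction.log", log_text))

blob = conn.execute("SELECT content FROM media").fetchone()[0]
clob = conn.execute("SELECT content FROM logs").fetchone()[0]

# The database returns raw bytes for the BLOB and a string for the text column;
# in both cases it stores the value without interpreting its internal structure.
print(type(blob).__name__, len(blob))
print(type(clob).__name__, len(clob.splitlines()))
conn.close()
```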


Semi-Structured Data

HTML - Web pages

XML - Extensible Markup Language

JSON - “JavaScript Object Notation, is an open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is used primarily to transmit data between a server and web application, as an alternative to XML.” -Wikipedia
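The "attribute–value pairs" in the definition above are easy to see in a round trip through Python's standard `json` module. This is a small illustrative sketch; the record's field names are made up for the example, with values taken from this lecture's title.

```python
import json

# A self-describing record: every value travels with its attribute name.
record = {
    "course": "CS317",
    "title": "File and Database Systems",
    "lecture": {"number": 5, "part": 2},        # nested object
    "topics": ["SQL", "NoSQL", "Big Data"],     # array of values
}

# Serialize to human-readable text, as a server would before transmitting.
text = json.dumps(record, indent=2)
print(text)

# Parse the text back into native data structures, as a client would.
restored = json.loads(text)
print(restored["lecture"]["part"], restored["topics"][1])
```

No schema is declared anywhere: the structure is carried by the document itself, which is what makes the format semi-structured.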

OO Schemas

OO Concepts – “Real World”

OOA – Object Oriented Analysis
– Define Class Hierarchies (Abstract Classes with Attributes) and Interfaces (Public, Private) and Methods (Operations)
– Inheritance and Multiple Inheritance

OOD – OO Design
– Encapsulation of Methods with Data (Attributes) for Abstract and Derived Classes
– Instantiation and Use of Objects [Use Cases]

OOP – Object Oriented Programming (Java, C++, …)
– Programming Language – Direct Implementation of OOD
– Implementation of Re-useable OO Code Libraries
    • Boost - http://www.boost.org/
    • OpenCV [C++ version]
    • Many More … in other OOPLs


Classes Useful in Real World

E.g. Biology – Kingdom, Phylum, Class, Order, Family, Genus, Species [Multiple Inheritance Examples], Proven Use

Parts – Components compose Sub-system(s) compose System(s) compose System of Systems

Supports Re-Use of Objects Instantiated from Class Hierarchy

Multiple Inheritance – Odd?

Can be Abstract, Derived and Concrete
– E.g. Mathematical, Data Structures, Image Processing
– Organization of Information (Classes in the Web Ontology Language, OWL)
– Simulation of Physical Systems
– Most Often Software Libraries


http://en.wikipedia.org/wiki/Platypus#mediaviewer/File:Wild_Platypus_4.jpg

https://www.youtube.com/watch?v=kDay5OWDPn4#t=26
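The platypus linked on this slide is a natural case for multiple inheritance: it is a mammal, yet it lays eggs. The sketch below models that in Python, which supports multiple inheritance directly. The class and method names are hypothetical and the "biology" is deliberately simplified.

```python
class Animal:
    def describe(self) -> str:
        return "animal"


class Mammal(Animal):
    def feeds_young(self) -> str:
        return "milk"


class EggLayer(Animal):
    def reproduces(self) -> str:
        return "lays eggs"


class Platypus(Mammal, EggLayer):
    """Inherits behavior from both parent classes."""


platypus = Platypus()
print(platypus.feeds_young(), "/", platypus.reproduces())

# Python linearizes the hierarchy so the shared base class appears only once.
mro_names = [cls.__name__ for cls in Platypus.__mro__]
print(mro_names)
```

The shared `Animal` base is the classic "diamond" that makes multiple inheritance feel odd; languages differ in how they resolve it, and Python's method resolution order is only one answer.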


Quick Review of OO [not just C++]

Encapsulation of Data and Methods in an Instantiated Object

Objects are Instances from a Class Hierarchy
– Classes Define Encapsulated Data and Methods
    • Virtual Functions can Be Refined
    • Pure Virtual Functions Defined in Abstract Classes must be Refined
– Can Inherit Data and Methods from Parent Classes
– Can In Fact Have Multiple Inheritance
– Instantiated Objects Call Dynamically Bound Methods [Determined at Runtime]

Enables Semantic Overload [Can be Done without OO too]
– Overloaded Functions (Methods), Resolved by Type Signatures or Subtype/Subclass
– Overloaded Operators (E.g. math operators work not only on integers and real numbers, but also vectors, matrices, and complex numbers)
– Derived Data Types from Base types

Polymorphism
– Parametric – Re-useable Templates (E.g. Ada and Java Generic, C++ Template)
– Functional Semantic Overloading
– Dynamic or Subtype or Subclass Polymorphism using Late Binding

OOPLs – Smalltalk to more current Java, C++, Ada95, … CLOS
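Several of the points above (encapsulation, abstract classes whose pure virtual functions must be refined, and dynamically bound methods) fit in one short example. This is a sketch in Python rather than C++, using the standard `abc` module, where an abstract method plays the role of a pure virtual function. The `Shape` hierarchy is a hypothetical illustration.

```python
import math
from abc import ABC, abstractmethod


class Shape(ABC):
    """Abstract class: encapsulates a name and declares an operation."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def area(self) -> float:
        """Counterpart of a pure virtual function: subclasses must refine it."""

    def describe(self) -> str:
        # self.area() is bound at runtime to the subclass implementation.
        return f"{self.name} with area {self.area():.2f}"


class Circle(Shape):
    def __init__(self, radius: float) -> None:
        super().__init__("circle")
        self.radius = radius

    def area(self) -> float:
        return math.pi * self.radius ** 2


class Rectangle(Shape):
    def __init__(self, width: float, height: float) -> None:
        super().__init__("rectangle")
        self.width, self.height = width, height

    def area(self) -> float:
        return self.width * self.height


# Subtype polymorphism: the same call works on any Shape.
shapes = [Circle(1.0), Rectangle(2.0, 3.0)]
descriptions = [shape.describe() for shape in shapes]
print(descriptions)

# An abstract class cannot be instantiated until area() is refined.
try:
    Shape("abstract")
    abstract_rejected = False
except TypeError:
    abstract_rejected = True
print("abstract instantiation rejected:", abstract_rejected)
```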


Operator and Function Overloading

What is Required to Be OO?

Common Consensus is – Encapsulation, Class Hierarchy, Polymorphism (Parametric & Subtype or Subclass with Late Binding), Inheritance

Operator Overloading Not Required (E.g. Java Frowns Upon, No Support)

Some PLs have OO Features, but not All

http://en.wikipedia.org/wiki/Operator_overloading
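As an illustration of both kinds of overloading named in the slide title, the sketch below uses Python, which supports operator overloading through special methods such as `__add__`, and approximates function overloading by argument type with `functools.singledispatch`. The `Vector2` class and `magnitude` function are hypothetical examples, not part of any library.

```python
import math
from dataclasses import dataclass
from functools import singledispatch


@dataclass(frozen=True)
class Vector2:
    x: float
    y: float

    def __add__(self, other: "Vector2") -> "Vector2":
        # Overloads "+" so it works on vectors as well as numbers.
        return Vector2(self.x + other.x, self.y + other.y)

    def __mul__(self, factor: float) -> "Vector2":
        # Overloads "*" for scaling a vector by a number.
        return Vector2(self.x * factor, self.y * factor)


@singledispatch
def magnitude(value) -> float:
    # Fallback when no implementation is registered for the argument type.
    raise TypeError(f"unsupported type: {type(value).__name__}")


@magnitude.register(int)
@magnitude.register(float)
def _(value) -> float:
    return abs(float(value))


@magnitude.register(Vector2)
def _(value) -> float:
    return math.hypot(value.x, value.y)


total = Vector2(1, 2) + Vector2(3, 4)
scaled = Vector2(3, 4) * 2
print(total, scaled)
print(magnitude(-7), magnitude(Vector2(3, 4)))
```

`singledispatch` selects an implementation from the runtime type of the first argument only, so it is a narrower mechanism than signature-based overloading in C++.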


Storing Objects in Relational Databases

One approach to achieving persistence with an OOPL is to use an RDBMS as the underlying storage engine.
– O2 – merged with Informix and acquired by IBM
– ObjectStore - http://www.objectstore.com/
– Objectivity - http://www.objectivity.com/products/objectivitydb
– Versant - http://www.actian.com/products/operational-databases/

Requires mapping class instances (i.e. objects) to one or more tuples distributed over one or more relations.

To handle class hierarchy, have two basic tasks to perform:
(1) design relations to represent class hierarchy;
(2) design how objects will be accessed.
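The two tasks can be sketched with a tiny hierarchy. The example below picks one of several possible mappings, one relation per class with the subclass relation sharing the superclass key, and is only an illustration of the idea. The `Staff` and `Manager` classes, the table layout, and the `save` and `load` helpers are all hypothetical, and SQLite (via Python's `sqlite3`) stands in for the RDBMS.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Staff:
    staff_no: str
    name: str


@dataclass
class Manager(Staff):
    bonus: float


conn = sqlite3.connect(":memory:")
# Task 1: relations that represent the class hierarchy.
conn.execute("CREATE TABLE staff (staff_no TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE manager ("
    "staff_no TEXT PRIMARY KEY REFERENCES staff(staff_no), "
    "bonus REAL NOT NULL)"
)


def save(obj: Staff) -> None:
    """Spread one object over one or more tuples in one or more relations."""
    conn.execute("INSERT INTO staff VALUES (?, ?)", (obj.staff_no, obj.name))
    if isinstance(obj, Manager):
        conn.execute("INSERT INTO manager VALUES (?, ?)", (obj.staff_no, obj.bonus))


def load(staff_no: str):
    """Task 2: access path that rebuilds an object of the right class."""
    row = conn.execute(
        "SELECT s.staff_no, s.name, m.staff_no, m.bonus "
        "FROM staff s LEFT JOIN manager m ON m.staff_no = s.staff_no "
        "WHERE s.staff_no = ?",
        (staff_no,),
    ).fetchone()
    if row is None:
        return None
    number, name, manager_key, bonus = row
    if manager_key is not None:
        return Manager(number, name, bonus)
    return Staff(number, name)


save(Manager("S1", "Ann", 500.0))
save(Staff("S2", "Bob"))
print(load("S1"))
print(load("S2"))
```

Every load of a subclass instance costs a join, which is the kind of overhead that motivates dedicated object stores.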

Pearson Education © 2009

Stonebraker’s View

Pearson Education © 2014

SQL:2011 - New OO Features

Type constructors for row types and reference types.

User-defined types (distinct types and structured types) that can participate in supertype/subtype relationships.

User-defined procedures, functions, methods, and operators.

Type constructors for collection types (arrays, sets, lists, and multisets).

Support for large objects – BLOBs and CLOBs.

Recursion.
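Of the features listed, recursion is the easiest to try out, since recursive queries are commonly written as a recursive common table expression (`WITH RECURSIVE`). The sketch below runs one against SQLite through Python's `sqlite3`, assuming a reasonably recent bundled SQLite. SQLite is only a stand-in and does not implement the user-defined types or collection types listed above. The `part` table is hypothetical, echoing the components, sub-systems and systems hierarchy from the earlier slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE part (name TEXT PRIMARY KEY, parent TEXT REFERENCES part(name))"
)
conn.executemany(
    "INSERT INTO part VALUES (?, ?)",
    [
        ("system", None),
        ("subsystem", "system"),
        ("component", "subsystem"),
        ("sensor", "subsystem"),
    ],
)

# Start from one part, then repeatedly add the parts whose parent is already in the result.
query = """
WITH RECURSIVE contained(name, depth) AS (
    SELECT name, 0 FROM part WHERE name = ?
    UNION ALL
    SELECT p.name, c.depth + 1
    FROM part p JOIN contained c ON p.parent = c.name
)
SELECT name, depth FROM contained ORDER BY depth, name
"""
contained_parts = conn.execute(query, ("system",)).fetchall()
for name, depth in contained_parts:
    print("  " * depth + name)
conn.close()
```

Without recursion, a query like this needs one self-join per level and therefore a depth fixed in advance.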

Pearson Education © 2014