Databases:
Organizing Data and
Information
Files and Databases
Data Resource Management: a managerial activity that applies information
systems technology and management tools to manage the company’s data resources
Database Administrator (DBA): responsible for the development and management of the organization’s database
DBAs work with programmers and systems analysts to design and implement the database
DBAs also work with users and managers of the firm to establish policies for managing an organization’s database
Why is there a demand for more and better data?
Competitive environment
Powerful workstations that can handle data quickly
More computer-literate personnel
Equivalents
DBMS Office
Database File Cabinet
File File Drawer
Record File Folder
Record Attributes
Report
Example of a Data Elements
Hierarchy of Data
Database > Files > Records > Attributes > Characters > Bits
Database: integrated collection of logically related data elements (e.g., Personnel File, Payroll File, Department File)
File (e.g., personnel file):
098-40-1370 Portillo, Talynn 1-20-2010
594-39-3948 Baker, Derek 3-24-2009
234-34-3483 Fletcher, Kari 2-02-2009
457-74-3854 Brown, Peyton 1-23-2009
Record (containing SSN, last and first name, hire date): 594-41-3955 Bartholomew, Courtney 4-24-07
Attribute (last name): Bartholomew
Characters (bytes): 01000010 (letter “B” in binary code)
Bits: 0 or 1
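The step from characters to bits can be sketched in a few lines of Python. This is only an illustration and assumes plain ASCII text; the helper name to_bits is hypothetical.

```python
def to_bits(text: str) -> list[str]:
    """Return the 8-bit binary code of each character (ASCII assumed)."""
    return [format(byte, "08b") for byte in text.encode("ascii")]


# Upper- and lower-case letters have different codes.
print(to_bits("B"))  # ['01000010']
print(to_bits("b"))  # ['01100010']

# An attribute value is a sequence of characters, one byte each.
last_name_bits = to_bits("Bartholomew")
print(len(last_name_bits), "bytes =", 8 * len(last_name_bits), "bits")
```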
Flat File
A table storing all database information in one large two-dimensional table
Data Management Approaches
Traditional Approach Database Approach
Traditional File Processing
Traditional Approach
• data are stored, organized and processed in independent files of data records
• Problems:
– data duplication
– lack of integration
– data dependence
Database Management Approach
Database Approach
• files are consolidated into a common pool of records available to many different application programs
• Requires the use of a database management system (DBMS)
Database Approach Advantages
Minimal data redundancy (data duplication minimized)
Data integration (no lack of integration)
Program-data independence (no data dependence)
Three models to represent how data can be related or linked
Hierarchical (Tree) Model
Network Model
Relational Model
Hierarchical Model
• data is organized in a top-down (inverted tree) structure
• parent-child relationships
• one-to-many relationship (one parent, many children)
• only one access path to any particular data element
Network Model
• an extension of the hierarchical model
• owner-member relationships
• a member may have several owners
• more than one path of access
Relational Model
Relational Structure
• Most widely used structure
• Data elements are viewed as being stored in tables
• Row represents record
• Column represents field
• Can relate data in one file with data in another file if both files share a common data element
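Relating two tables through a shared data element can be sketched with Python's built-in sqlite3 module. The table layout, the dept_id column, and the department assignments are assumptions made for illustration; only the employee names and SSNs come from the personnel file example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (dept_id TEXT PRIMARY KEY, dept_name TEXT);
CREATE TABLE employee (ssn TEXT PRIMARY KEY, last_name TEXT, dept_id TEXT);
INSERT INTO department VALUES ('D1', 'Payroll'), ('D2', 'Personnel');
INSERT INTO employee VALUES
    ('098-40-1370', 'Portillo', 'D1'),
    ('594-39-3948', 'Baker', 'D2');
""")

# dept_id is the common data element that links the two tables.
rows = conn.execute("""
SELECT e.last_name, d.dept_name
FROM employee AS e
JOIN department AS d ON e.dept_id = d.dept_id
ORDER BY e.last_name
""").fetchall()
print(rows)
```

Each row of the result combines fields from both tables, which is possible only because both tables carry dept_id.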
Database Development: Data Model
A data model is a map or diagram of entities and their relationships.
The entity-relationship diagram is an abstract and conceptual representation of data.
Data modeling usually involves understanding a specific business problem to be solved and analyzing the data and information needed to deliver a solution.
Database Development: E-R Diagrams
• Entity-relationship (ER) diagrams:
– Commonly used when designing databases
– Graphical descriptions of entities and the relationships between entities
– Help in the design and management of data
Data in a Database - Entities
Entity is something you collect data about. For example an item, a person, a place, or a thing.
Entities can be thought of as nouns. For example, computer, artist, employee, player.
Entities are represented by rectangles.
Data in a Database - Relationships
Relationships capture how entities are related to one another. For example, the entities “artist” and “song” can be related by the relationship “performs”. Therefore, artist-performs-song.
Relationships can be thought of as verbs and are represented by diamonds. Examples are “performs” “owns” “houses” “supervises”
Entity #1 - Relationship - Entity #2
Database Development: E-R Diagrams
Entity #1 - Relationship - Entity #2
Noun #1 Verb Noun #2
player plays team
Database Development: E-R Diagrams
• Types of Relationships include:
– one-to-one (1:1)
– one-to-many (1:N)
– many-to-many (N:M)
Database Development: E-R Diagrams
Entity #1 (1) - Relationship - (1) Entity #2
Entity #1 (1) - Relationship - (M) Entity #2
Entity #1 (N) - Relationship - (M) Entity #2
Database Development: E-R Diagrams
Team houses Home_stadium
Football team and stadium: Big 12
Team has Player
Football team and games: Big 12
player plays game
Football players and games: Big 12
Data in a Database - Representation
Entities and Relationships are represented in a Relational Database in Tables
Each column in a table is called an attribute, which would be a characteristic of an entity (field name)
Each row in a table is called a record or a tuple. Each record consists of many attributes.
Key: a field in a record that is used to uniquely identify the record.
Data in a Database (cont.)
Entity: Employee
Attributes (fields): SSN (key field), hire date, last name, first name
005-10-6322  12-13-09  Gray  Rachel
534-34-6321  2-23-10  Hughey  Abigail
776-54-4525  11-10-09  Morris  Kyle
342-43-6378  01-15-09  Cobb  Cade
player plays team
Table: player (QB)
001  Brees
002  Romo
003  Rodgers
004  Kaepernick
Table: team
001  49ers
002  Cowboys
003  Saints
004  Packers
005  Jets
Table: plays
Team_ID  Player_ID
001  001
002  003
003  004
004  002
Football Team: Entities
Team
Player
Home_stadium
Games
Conference
Football Team - ER example
[ER diagram: Team - has - Player (1:N); Team - house - Home_stadium (1:1); Team - participate - Game (2:M); Team - partake - Conference (M:1)]
Data in a Database - Representation
Entities and relationships have attributes, which are represented as ovals and are connected with a line to their respective entity
Primary keys (attributes that are unique identifiers) are underlined
Customer
SSN
Name
Football Conference Example
Team (M) - partake - (1) Conference
Data in a Database (cont.)
Table: Team
Team_ID (key field)  Team
001  Texas
002  Texas Tech
003  USC
004  Auburn
005  Georgia
Table: Conference
Conf_ID (key field)  Conference
001  Big 10
002  Big 12
003  SEC
004  PAC 10
Table: Partake
Team_ID  Conf_ID
001  001
002  001
003  004
004  003
005  003
Video Rental - ER example
Customer
Movie
Actors
ISBN
SSN
DueDate
Video Rental - ER example
[ER diagram: Customer (SSN, Name, Gender) - Rents (N:M) - Movie (ISBN, Title); Movie - Has (N:M) - Actors (Actor_ID, Name); the Has relationship carries ISBN and Actor_ID]
Table: Customer
SSN  Last name  First name  Gender
100  Keane  Stanton  M
200  Splawn  Reid  M
300  Urias  Andrea  F
400  Kitten  Kayleen  F
600  Towles  Keely  F
700  Seal  Samantha  F
Table: Rents
SSN  ISBN  DueDate
100  1-101-1  04/10/15
200  1-101-2  04/12/15
300  1-607-9  10/11/14
400  1-702-5  11/28/14
700  1-699-4  11/26/14
Table: Movie
ISBN  Title
1-101-1  Rudy
1-101-2  Chariots of Fire
1-607-9  Space Jam
1-699-4  Navy Seals
1-702-5  The Cat in the Hat
1-799-2  Dumb and Dumber
Video Rental - ER example
[ER diagram: Movie (ISBN, Title) - Has (N:M) - Actors (Actor_ID, Name); the Has relationship carries ISBN and Actor_ID]
Table: Actor
Actor_ID  Last name  First name
0101  Carrey  Jim
0102  Diaz  Cameron
0103  Dunst  Kirsten
0104  Maguire  Tobey
0105  Stiller  Ben
0106  Sandler  Adam
Table: Has
Movie_ID  Actor_ID
0001  0101
0001  0102
0002  0102
0002  0105
0003  0103
0004  0105
0005  0106
0005  0105
0006  0103
0006  0104
0007  0102
0008  0104
Table: Movie
Movie_ID  Title  Year
0001  The Mask  1994
0002  Something About Mary  1998
0003  Bring it On!  2000
0004  Dodge Ball  2004
0005  Happy Gilmore  1996
0006  Spider-Man  2002
0007  Vanilla Sky  2001
DBMS
Problem with the last tables?
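One way to look for a problem in the Actor, Has, and Movie tables is to check every Has row against the key columns of the other two tables. The sketch below is a plain-Python check, not a DBMS feature; the IDs are copied from the slide tables and the variable names are hypothetical.

```python
# Key values copied from the Movie and Actor tables.
movie_ids = {"0001", "0002", "0003", "0004", "0005", "0006", "0007"}
actor_ids = {"0101", "0102", "0103", "0104", "0105", "0106"}

# (Movie_ID, Actor_ID) pairs copied from the Has table.
has_rows = [
    ("0001", "0101"), ("0001", "0102"), ("0002", "0102"), ("0002", "0105"),
    ("0003", "0103"), ("0004", "0105"), ("0005", "0106"), ("0005", "0105"),
    ("0006", "0103"), ("0006", "0104"), ("0007", "0102"), ("0008", "0104"),
]

# A row is an orphan when it points at a movie or actor that does not exist.
orphans = [
    (movie_id, actor_id)
    for movie_id, actor_id in has_rows
    if movie_id not in movie_ids or actor_id not in actor_ids
]
print(orphans)  # [('0008', '0104')]
```

Under this reading, the row for movie 0008 refers to a movie that the Movie table never defines.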
DBMS
• DBMS: software that provides for the creation, implementation, usage, and updating of a database.
• schema: a general description of the entire database that shows all of the record types and their relationships to each other.
• QBE – Query By Example. Used for Query Design
• SQL – Structured Query Language
• Natural Language Query
Source: Courtesy of Microsoft Corp.
SELECT [Customers].[Company Name], [Customers].[Contact Name]
FROM [Customers]
WHERE NOT EXISTS (SELECT [Ship Name] FROM [Orders]
WHERE Month([Order Date])=1 AND Year([Order Date])=2008 AND [Customers].[Customer ID]=[Orders].[Customer ID])
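The query above asks for customers that have no order in January 2008. A minimal sketch of the same idea, adapted to SQLite through Python's sqlite3 module, is shown below. The snake_case table and column names and the sample rows are assumptions, and SQLite's strftime stands in for the Month and Year functions of Microsoft Access.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id TEXT PRIMARY KEY, company_name TEXT, contact_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id TEXT, order_date TEXT);
INSERT INTO customers VALUES ('A', 'Alpha Co', 'Ann'), ('B', 'Beta Co', 'Bob');
INSERT INTO orders VALUES (1, 'A', '2008-01-15'), (2, 'B', '2008-02-03');
""")

query = """
SELECT c.company_name, c.contact_name
FROM customers AS c
WHERE NOT EXISTS (
    SELECT 1 FROM orders AS o
    WHERE strftime('%m', o.order_date) = '01'
      AND strftime('%Y', o.order_date) = '2008'
      AND o.customer_id = c.customer_id
)
"""
no_january_orders = conn.execute(query).fetchall()
print(no_january_orders)  # [('Beta Co', 'Bob')]
```

The subquery is correlated: it is evaluated once per customer row, and the customer is kept only when the subquery returns nothing.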
• Non-relational databases: “NoSQL”
– More flexible data model
– Data sets stored across distributed machines
– Easier to scale
– Handle large volumes of unstructured and structured data (Web, social media, graphics)
• Databases in the cloud
– Typically, less functionality than on-premises DBs
– Amazon Relational Database Service, Microsoft SQL Azure
– Private clouds
The Database Approach to Data Management
• Capabilities of database management systems
– Data definition capability: Specifies structure of database content, used to create tables and define characteristics of fields
– Data dictionary: Automated or manual file storing definitions of data elements and their characteristics
– Data manipulation language: Used to add, change, delete, retrieve data from database
• Structured Query Language (SQL)
• Microsoft Access user tools for generating SQL
– Many DBMS have report generation capabilities for creating polished reports (Crystal Reports)
The Database Approach to Data Management
Microsoft Access has a rudimentary data dictionary capability that displays information about the size, format, and other characteristics of each field in a database. Displayed here is the information maintained in the SUPPLIER table. The small key icon to the left of Supplier_Number indicates that it is a key field.
FIGURE 6-6
MICROSOFT ACCESS DATA DICTIONARY FEATURES
Illustrated here are the SQL statements for a query to select suppliers for parts 137 or 150. They produce a list with the same results as Figure 6-5.
FIGURE 6-7
EXAMPLE OF AN SQL QUERY
Illustrated here is how the query in Figure 6-7 would be constructed using Microsoft Access query building
tools. It shows the tables, fields, and selection criteria used for the query.
FIGURE 6-8
AN ACCESS QUERY
• Designing Databases
– Conceptual (logical) design: abstract model from business perspective
– Physical design: How database is arranged on direct-access storage devices
• Design process identifies:
– Relationships among data elements, redundant database elements
– Most efficient way to group data elements to meet business requirements, needs of application programs
• Normalization
– Streamlining complex groupings of data to minimize redundant data elements and awkward many-to-many relationships
The Database Approach to Data Management
An unnormalized relation contains repeating groups. For example, there can be many parts and suppliers for each order. There is only a one-to-one correspondence between Order_Number and Order_Date.
FIGURE 6-9
AN UNNORMALIZED RELATION FOR ORDER
After normalization, the original relation ORDER has been broken down into four smaller relations. The relation ORDER is left with only two attributes and the relation LINE_ITEM has a combined, or concatenated, key consisting of Order_Number and Part_Number.
FIGURE 6-10
NORMALIZED TABLES CREATED FROM ORDER
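The decomposition described in the two captions can be sketched in Python. This is a hedged illustration, not the textbook figure: the sample rows, dates, and part and supplier names are invented, and each relation is reduced to the few attributes needed to show the idea.

```python
# Unnormalized ORDER rows: order data repeats for every part on the order.
# (order_number, order_date, part_number, part_name, supplier_number, supplier_name)
unnormalized = [
    (1001, "2024-01-05", 137, "Part 137", 1, "Supplier A"),
    (1001, "2024-01-05", 150, "Part 150", 2, "Supplier B"),
    (1002, "2024-01-06", 137, "Part 137", 1, "Supplier A"),
]

# ORDER keeps only two attributes.
orders = {(o, d) for o, d, *_ in unnormalized}
# LINE_ITEM is identified by the concatenated key (order_number, part_number).
line_items = {(o, p) for o, _, p, *_ in unnormalized}
# PART and SUPPLIER each store their facts once.
parts = {(p, name, s) for _, _, p, name, s, _ in unnormalized}
suppliers = {(s, sname) for *_, s, sname in unnormalized}

print(sorted(orders))
print(sorted(line_items))
print(sorted(parts))
print(sorted(suppliers))
```

After the split, the name of part 137 and of its supplier is stored once instead of once per order line.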
• Referential integrity rules
– Used by RDBMS to ensure relationships between tables remain consistent
• Entity-relationship diagram
– Used by database designers to document the data model
– Illustrates relationships between entities
– Caution: If a business doesn’t get data model right, system won’t be able to serve business well
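How a relational DBMS enforces a referential integrity rule can be sketched with SQLite through Python's sqlite3 module. The supplier and part tables, their columns, and the sample rows are assumptions for illustration; note that SQLite enforces foreign keys only after the PRAGMA shown here is set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.execute(
    "CREATE TABLE supplier (supplier_number INTEGER PRIMARY KEY, supplier_name TEXT)"
)
conn.execute("""
CREATE TABLE part (
    part_number INTEGER PRIMARY KEY,
    part_name TEXT,
    supplier_number INTEGER NOT NULL REFERENCES supplier(supplier_number)
)
""")
conn.execute("INSERT INTO supplier VALUES (1, 'Supplier A')")


def try_insert_part(row):
    """Return True if the DBMS accepts the row, False if it rejects it."""
    try:
        conn.execute("INSERT INTO part VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


ok_existing_supplier = try_insert_part((137, "Part 137", 1))
ok_missing_supplier = try_insert_part((150, "Part 150", 99))
print(ok_existing_supplier, ok_missing_supplier)  # True False
```

The second insert fails because supplier 99 does not exist, so the part table can never point at a missing supplier.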
The Database Approach to Data Management
This diagram shows the relationships between the entities SUPPLIER, PART, LINE_ITEM, and ORDER that might be used to model the database in Figure 6-10.
FIGURE 6-11
AN ENTITY-RELATIONSHIP DIAGRAM
• Big data
– Massive sets of unstructured/semi-structured data from Web traffic, social media, sensors, and so on
– Petabytes, exabytes of data
– Volumes too great for typical DBMS
– Can reveal more patterns and anomalies
Using Databases to Improve Business Performance and Decision Making
• Business intelligence infrastructure
– Today includes an array of tools for separate systems, and big data
• Contemporary tools:
– Data warehouses
– Data marts
– Hadoop
– In-memory computing
– Analytical platforms
Using Databases to Improve Business Performance and Decision Making
• Data warehouse:
– Stores current and historical data from many core operational transaction systems
– Consolidates and standardizes information for use across enterprise, but data cannot be altered
– Provides analysis and reporting tools
• Data marts:
– Subset of data warehouse
– Summarized or focused portion of data for use by specific population of users
– Typically focuses on single subject or line of business
Using Databases to Improve Business Performance and Decision Making
A contemporary business intelligence infrastructure features capabilities and tools to manage and
analyze large quantities and different types of data from multiple sources. Easy-to-use query and
reporting tools for casual business users and more sophisticated analytical toolsets for power users
are included.
FIGURE 6-12
COMPONENTS OF A DATA WAREHOUSE
• Hadoop
– Enables distributed parallel processing of big data across inexpensive computers
– Key services
○ Hadoop Distributed File System (HDFS): data storage
○ MapReduce: breaks data into clusters for work
○ HBase: NoSQL database
– Used by Facebook, Yahoo, NextBio
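The MapReduce idea can be sketched as a word count in plain Python. This runs on one machine and does not use any Hadoop API; the function names and the sample text chunks are hypothetical. In a real cluster, each chunk would be mapped on a different computer.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


chunks = ["big data big rewards", "data warehouse and data mart"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts["data"], counts["big"])  # 3 2
```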
Using Databases to Improve Business Performance and Decision Making
• In-memory computing
– Used in big data analysis
– Uses the computer’s main memory (RAM) for data storage to avoid delays in retrieving data from disk storage
– Can reduce hours/days of processing to seconds
– Requires optimized hardware
• Analytic platforms
– High-speed platforms using both relational and non-relational tools optimized for large datasets
Using Databases to Improve Business Performance and Decision Making
• Analytical tools: Relationships, patterns, trends
– Tools for consolidating, analyzing, and providing access to vast amounts of data to help users make better business decisions
• Multidimensional data analysis (OLAP)
• Data mining
• Text mining
• Web mining
Using Databases to Improve Business Performance and Decision Making
• Online analytical processing (OLAP)
– Supports multidimensional data analysis
• Viewing data using multiple dimensions
• Each aspect of information (product, pricing, cost, region, time period) is different dimension
• Example: How many washers sold in East in June compared with other regions?
– OLAP enables rapid, online answers to ad hoc queries
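The washer question above can be sketched as a small multidimensional roll-up in Python. The sales figures are invented for illustration, and the roll_up helper is a hypothetical stand-in for what an OLAP tool does at much larger scale.

```python
from collections import defaultdict

# Fact rows: (product, region, month, units_sold). Values are illustrative only.
sales = [
    ("washer", "East", "June", 120),
    ("washer", "West", "June", 95),
    ("washer", "East", "May", 80),
    ("dryer", "East", "June", 60),
    ("washer", "South", "June", 70),
]
DIMENSIONS = {"product": 0, "region": 1, "month": 2}


def roll_up(facts, dims):
    """Total units sold for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[DIMENSIONS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)


cube = roll_up(sales, ("product", "region", "month"))
washers_in_june = {
    region: units
    for (product, region, month), units in cube.items()
    if product == "washer" and month == "June"
}
print(washers_in_june)  # {'East': 120, 'West': 95, 'South': 70}
print(roll_up(sales, ("region",)))
```

Changing the tuple of dimensions passed to roll_up gives a different view of the same facts, which is the core of multidimensional analysis.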
Using Databases to Improve Business Performance and Decision Making
• Data mining:
– Finds hidden patterns, relationships in datasets
○ Example: customer buying patterns
– Infers rules to predict future behavior
– Types of information obtainable from data mining:
○ Associations
○ Sequences
○ Classification
○ Clustering
○ Forecasting
Using Databases to Improve Business Performance and Decision Making
• Text mining
– Extracts key elements from large unstructured data sets
○ Stored e-mails
○ Call center transcripts
○ Legal cases
○ Patent descriptions
○ Service reports, and so on
– Sentiment analysis software
○ Mines e-mails, blogs, social media to detect opinions
Using Databases to Improve Business Performance and Decision Making
• Web mining
– Discovery and analysis of useful patterns and information from Web
– Understand customer behavior
– Evaluate effectiveness of Web site, and so on
– Web content mining
• Mines content of Web pages
– Web structure mining
• Analyzes links to and from Web page
– Web usage mining
• Mines user interaction data recorded by Web server
Using Databases to Improve Business Performance and Decision Making
Read the Interactive Session and discuss the following questions
Interactive Session: Technology
Describe the kinds of big data collected by the organizations described in this case.
List and describe the business intelligence technologies described in this case.
Why did the companies described in this case need to maintain and analyze big data? What business benefits did they obtain?
Identify three decisions that were improved by using big data.
What kinds of organizations are most likely to need big data management and analytical tools?
Big Data, Big Rewards
• Databases and the Web
– Many companies use Web to make some internal databases available to customers or partners
– Advantages of using Web for database access:
• Ease of use of browser software
• Web interface requires few or no changes to database
• Inexpensive to add Web interface to system
Using Databases to Improve Business Performance and Decision Making
Users access an organization’s internal database through the Web using their desktop PCs and Web browser software.
FIGURE 6-14
LINKING INTERNAL DATABASES TO THE WEB
• Establishing an information policy
– Firm’s rules, procedures, roles for sharing, managing, standardizing data
– Data administration
○ Establishes policies and procedures to manage data
– Data governance
○ Deals with policies and processes for managing availability, usability, integrity, and security of data, especially regarding government regulations
– Database administration
○ Creating and maintaining database
Managing Data Resources
• Ensuring data quality
– More than 25% of critical data in Fortune 1000 company databases are inaccurate or incomplete
– Redundant data
– Inconsistent data
– Faulty input
– Before new database in place, need to:
• Identify and correct faulty data
• Establish better routines for editing data once database in operation
Managing Data Resources
• Data quality audit:
– Structured survey of the accuracy and level of completeness of the data in an information system
• Survey samples from data files, or
• Survey end users for perceptions of quality
• Data cleansing
– Software to detect and correct data that are incorrect, incomplete, improperly formatted, or redundant
– Enforces consistency among different sets of data from separate information systems
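A data cleansing pass can be sketched in a few lines of Python. The rules below (name casing, five-digit ZIP codes, duplicate removal) and the sample records are assumptions chosen for illustration; commercial cleansing software applies far richer rules.

```python
import re


def cleanse(records):
    """Standardize names and ZIP codes, then drop redundant rows."""
    cleaned, seen = [], set()
    for rec in records:
        # Fix improper formatting: collapse extra spaces, normalize case.
        name = " ".join(rec["name"].split()).title()
        # Keep digits only; flag incomplete ZIP codes as missing.
        digits = re.sub(r"\D", "", rec["zip"])
        zip_code = digits if len(digits) == 5 else None
        key = (name, zip_code)
        if key in seen:  # redundant record
            continue
        seen.add(key)
        cleaned.append({"name": name, "zip": zip_code})
    return cleaned


raw = [
    {"name": "rachel  gray", "zip": "79409"},
    {"name": "Rachel Gray", "zip": " 79409 "},
    {"name": "KYLE MORRIS", "zip": "7940"},
]
result = cleanse(raw)
print(result)
```

The two spellings of the first customer collapse into one record, and the incomplete ZIP code is flagged rather than silently kept.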
Managing Data Resources
Google, UPS and Visa are using new kinds of data and new tools to improve their operations
Google: analyzes connections between web pages
UPS: uses regression to read handwritten zip codes from envelopes
Visa: uses anomaly detection to identify fraud (they now look at all credit card data instead of sampling)
BIG DATA: How Business Intelligence is Transforming the World
Billions of computers (smartphones, tablets, laptops, supercomputers, etc.) are part of our lives.
BIG DATA: How Business Intelligence is Transforming the World
In 2010 the global volume of digital data stored and managed was over one trillion gigabytes (one zettabyte, or roughly 150 gigabytes per person in the whole world)
By 2020 the number is estimated to be 40 trillion gigabytes.
Therefore, the way that data was analyzed traditionally is no longer valid.
Big Data forces us to come up with new methods to analyze huge data sets
BIG DATA
Amazon and Netflix gather millions of ratings that customers use.
Data is transformed and used in the areas of politics, sports, health care, finance, entertainment, science, industry, etc.
Collecting data like never before creates new opportunities and challenges.
BIG DATA
Volume, Velocity, and Variety
BIG DATA: 3 V’s
Volume: Can all the works of Shakespeare be stored on a DVD?
All the works of Shakespeare = 10 MB or 10 million bytes
A DVD can hold 4 GB or 4 billion bytes
Answer: A DVD can hold the equivalent of 400 complete works of Shakespeare
BIG DATA - Volume
Data is accessed multiple times
Google alone processes 20 petabytes every day. 20 PB = 20,000 TB = 20,000,000 GB.
At that rate, Google processes 1 exabyte every 50 days.
All the words ever spoken by mankind amount to 5 exabytes
Bank of America manages petabytes of data for advanced analytics
BIG DATA – Volume (continued)
Eight bits = 1 byte. Ten bytes = average word
1 KB = 1,000 bytes = short paragraph
1 MB = 1,000 KB = short novel
1 GB = 1,000 MB = 7 minutes of HD video
DVD = 4 GB and Blu-ray = 50 GB
1 TB = 1,000 GB and 10 TB = all text info in the US Library of Congress
400 TB = all books ever written
1 PB = 1,000 TB
1 exabyte (EB) = 1,000 PB = 1,000,000 TB. All words ever spoken = 5 EB
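The unit arithmetic on these slides can be checked with a short Python sketch. It assumes decimal units (each step is a factor of 1,000), as the list above does, and uses the slide figures for Shakespeare, the DVD, and Google's daily volume.

```python
# Decimal storage units expressed in bytes.
KB, MB, GB, TB, PB, EB = (1000 ** n for n in range(1, 7))

shakespeare = 10 * MB   # complete works, per the slide
dvd = 4 * GB            # DVD capacity, per the slide
copies_per_dvd = dvd // shakespeare
print(copies_per_dvd)   # 400

google_per_day = 20 * PB
days_per_exabyte = EB / google_per_day
print(days_per_exabyte)  # 50.0

print(EB // TB)          # 1000000
```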
BIG DATA – Volume (continued)
YouTube users upload 72 hours of new video content per minute
There are 100,000 credit card transactions per minute in the USA alone
Google receives 2 million search queries per minute
200 million emails are sent per minute in the USA alone
Global banks handle trillions of messages in a single day’s trading.
BIG DATA – Velocity
Data today is very different than data from yesterday
Jet engines transmit data
Smartphones talk to us and answer us
GPS information from cameras, phones, tablets
BIG DATA – Variety
Data analysis gives you “an answer” not “the answer.” There is so much data that it’s impossible to optimize and get the absolute best answer.
However, data analysis of big data can predict a much better answer than not using it at all.
There is no best tool or method
You probably will not get the data in the way you need it. It may be incomplete, or incorporated in different locations (may need to be merged).
Data may not be easily available. May need to search for it.
BIG DATA – Misconceptions
What data is created when:
you rate songs?
you use a credit card?
you make a cell phone call?
you update your status on Facebook?
BIG DATA – Questions
Data Collection is for Big Companies and for Individuals
Medical data and History => Personalized medicine
Financial details
BIG DATA - Collecting
Collecting data of everyday life
Eating
Sleep
Activity levels
Moods
Habits
Communications
BIG DATA - Collecting
Digital devices
Exercise data: how far you ran
How far you biked
How fast you swam
Break down information by mile or minute
How many calories you burned
Idle time during the day
Whether you are climbing
Heart rate
BIG DATA - Collecting
Does Big Data mean insight? Not necessarily
“Haystacks without Needles” (Darian Shirazi) = not knowing what you are looking for and thinking that Big Data will solve the problem
Having a clear goal or question to be answered will lead to creativity when collecting data
BIG DATA - Collecting
List of contacts (addresses, phone numbers, email addresses), recipes, etc.
20% of all data
Computer generated
Human generated
Combination (doctor enters information into system and appears in combination with scanned data)
BIG DATA – Structured Data
Billy Beane’s method of data analysis led the Oakland A’s to victory while cutting costs
2011 Moneyball
Building rosters with conventional wisdom vs. thinking outside the box
With Data Analysis, Beane bought players with the lowest payroll in baseball
BIG DATA – Sports
$\text{Win \%} = \dfrac{\text{Runs}^2}{\text{Runs}^2 + \text{Runs allowed}^2}$
Example: the 2002 Oakland A’s scored 800 runs and allowed 654 => $800^2 / (800^2 + 654^2) = 0.5994$, so they were expected to win 59.94 percent of the time
Out of 162 games, 59.94% = 97.1 games (they won 103 games)
To win 99 games with 645 runs allowed => $99/162 = \text{Runs}^2 / (\text{Runs}^2 + 645^2)$
Solving for Runs gives about 808 runs needed
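The Pythagorean expectation can be worked through in Python as a check on the arithmetic. The sketch assumes a 162-game regular season and uses the runs figures from the slide; the function names are hypothetical.

```python
import math


def win_pct(runs_scored, runs_allowed):
    """Pythagorean expectation: expected share of games won."""
    return runs_scored ** 2 / (runs_scored ** 2 + runs_allowed ** 2)


def runs_needed(target_wins, games, runs_allowed):
    """Invert the formula: runs required to reach a target win total."""
    p = target_wins / games
    return runs_allowed * math.sqrt(p / (1 - p))


GAMES = 162  # assumed length of the regular season

expected = win_pct(800, 654)
print(round(expected * 100, 2))   # 59.94
print(round(expected * GAMES, 1)) # 97.1

needed = runs_needed(99, GAMES, 645)
print(round(needed, 1))
```

Solving p = R^2 / (R^2 + A^2) for R gives R = A * sqrt(p / (1 - p)), which is what runs_needed computes.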
BIG DATA – Pythagorean Expectation
Two-thirds of tech firms keep more than one year’s worth of data online
43% have more than three years’ worth of data online
BIG DATA – Usage
Amazon burst into the headlines in December 2011 by offering iPhone and Android users $5 off for sharing in-store prices while shopping.
Result: increased use of Amazon’s Price Check app (a bar-code scanning application) and collection of comparative intelligence on store prices.
Using the Amazon app, shoppers scan a barcode, take a picture of the item, or conduct a text search to find the lowest price.
BIG DATA – Usage: Amazon
Measuring influence across the social web by storing, processing, and analyzing real time social media data streams.
BIG DATA – Usage: Social Media
Recommending complementary products based on predictive analytics for cross selling.
Result: increase an average order size
BIG DATA – Usage: Cross-selling
Using behavioral patterns, credit card companies are detecting fraud (including tax and claims) in online systems in real time
BIG DATA – Usage: Fraud
Can collect tweets from millions of users and analyze terms to identify who to follow to measure conversation, news, interest, and activity.
BIG DATA – Usage: Human Behavior
Collect information on keywords that work best to influence surfers into shopping.
They can collect information on their Tweets or Facebook posts in addition to the key word entered on the site search window.
BIG DATA – Usage: Keyword Campaigns
Management Information Systems by Laudon and Laudon
Big Data: How Data Analytics is Transforming the World by Chartier
For Big Data Analytics There’s No Such Things as Too Big
Bibliography