Creating and Using Attribute Databases

In this lesson you will learn:

• concept of the attribute database as a table

• database elements: variables, observations, data, labels, data dictionary, aliases, indexes

• data types and formats

• basic database operations

• attribute queries

• attribute statistics

• attribute data graphs

The attribute database as a table

Database elements

Portion of the data dictionary for the Illinois Historic Tornado database.

Creating the database

Steps in creating the attribute database

1. identify the attributes to be captured

2. create attribute columns for each attribute; label each column

3. specify the data type for each attribute

4. specify validation rules for each attribute

5. specify the data format for each attribute
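The five steps above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table name, column names, and validation rules are hypothetical and are not taken from the Illinois Historic Tornado database.

```python
import sqlite3

# Steps 1-2: the attributes to capture become labeled columns.
# Step 3: each column gets a data type (TEXT, INTEGER, REAL).
# Step 4: CHECK constraints act as validation rules.
# Step 5: the length limit on name stands in for a text format.
# All names and rules below are hypothetical.
SCHEMA = """
CREATE TABLE parks (
    park_id   INTEGER PRIMARY KEY,
    name      TEXT    NOT NULL CHECK (length(name) <= 25),
    county    TEXT    NOT NULL,
    area_ha   REAL    CHECK (area_ha > 0),
    year_est  INTEGER CHECK (year_est BETWEEN 1800 AND 2100)
)
"""


def build_database():
    """Create an empty in-memory database with the schema above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def try_insert(conn, row):
    """Return True if the row passes every validation rule."""
    try:
        conn.execute(
            "INSERT INTO parks (name, county, area_ha, year_est) "
            "VALUES (?, ?, ?, ?)",
            row,
        )
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_database()
ok = try_insert(conn, ("Example Park", "Example County", 120.5, 1950))
bad = try_insert(conn, ("Other Park", "Example County", -3.0, 1950))  # area rule fails
print(ok, bad)
```

Putting the validation rules in the table definition means every later insert or edit is checked the same way, regardless of which program writes the data.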

Data types

Data formats

Data type          Formatting options

text               length of data field; i.e., maximum number of characters
                   Ex.: “Indian Shoals State Park ” has a length of 25 characters (the trailing space counts)

integer numeric    short vs. long integer
                   short integer: stores numbers from -32,768 to 32,767
                   long integer: stores numbers from -2,147,483,648 to 2,147,483,647

decimal numeric    precision and scale
                   precision = maximum length of the decimal number, including the decimal point and the digits to the left and right of the decimal point. Ex.: 176.8859 has a precision value of 8.
                   scale = maximum number of digits to the right of the decimal point. Ex.: 176.8859 has a scale value of 4.
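The format definitions above can be checked with a short Python sketch. The function names are hypothetical, and ignoring a leading sign when counting precision is an assumption, since the lesson does not say how signs are counted.

```python
def precision_and_scale(text):
    """Return (precision, scale) for a decimal number written as text.

    Precision counts every digit plus the decimal point; scale counts the
    digits to the right of the decimal point. A leading sign is not counted
    (an assumption; the lesson does not cover signs).
    """
    digits = text.strip().lstrip("+-")
    whole, point, fraction = digits.partition(".")
    precision = len(whole) + len(point) + len(fraction)
    scale = len(fraction)
    return precision, scale


def fits_short_integer(value):
    """True if value lies in the short integer range given above."""
    return -32_768 <= value <= 32_767


def fits_long_integer(value):
    """True if value lies in the long integer range given above."""
    return -2_147_483_648 <= value <= 2_147_483_647


print(precision_and_scale("176.8859"))   # (8, 4)
print(len("Indian Shoals State Park "))  # 25, counting the trailing space
print(fits_short_integer(40_000), fits_long_integer(40_000))  # False True
```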

Tabular database formats

Category                     Format                        File name

Database standards           MS-Access                     filename.mdb
                             Paradox                       filename.db
                             dBase II, III, IV, 5, 7       filename.dbf

Spreadsheet standards        MS-Excel                      filename.xls
                             Lotus 1-2-3                   filename.wks
                             Quattro Pro                   filename.wq1

Open Database Connectivity   ODBC-compliant applications   ..assorted..
                             (MS-Access, Visual FoxPro, SQL Server, Oracle, dBase, Paradox, DB2, Sybase, etc.)

Formatted “text” file        Delimited text                filename.csv
                             (comma delimited, tab         filename.txt
                             delimited, etc.)
                             Fixed-width text              filename.txt

Common database “exchange” formats

Comma-delimited text (filename.csv)

Tab-delimited text (filename.txt)

Fixed-width text (filename.txt)
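To make the three text exchange formats concrete, the sketch below writes the same two records as comma-delimited, tab-delimited, and fixed-width text using Python's standard csv module. The records are borrowed from the census-tract table used in the sorting section of this lesson; the helper names and the field width of 8 characters are arbitrary choices for illustration.

```python
import csv
import io

header = ["Obs", "Index", "Tract", "Popln", "AvgInc"]
rows = [
    [13, "A.013", 101, 2324, 44200],
    [147, "A.147", 103, 977, 57800],
]


def delimited(header, rows, delimiter):
    """Write the table as delimited text and return it as one string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def fixed_width(header, rows, width=8):
    """Pad every field to the same width; no delimiter character is used."""
    lines = []
    for record in [header] + rows:
        line = "".join(str(value).ljust(width) for value in record)
        lines.append(line.rstrip())
    return "\n".join(lines) + "\n"


comma_text = delimited(header, rows, ",")   # contents of filename.csv
tab_text = delimited(header, rows, "\t")    # contents of filename.txt
fixed_text = fixed_width(header, rows)      # contents of filename.txt

print(comma_text)
print(tab_text)
print(fixed_text)
```

Note that all three formats store every value as text, so the receiving program has to be told (or has to guess) the data type of each column.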

Basic database operations

a. data entry & editing

b. sorts

c. queries

d. data statistics

e. data graphs

Basic database operations: data maintenance

• add/delete observations

• add/delete attribute fields

• edit data

− spell check (text & memo fields)

− find/replace

− re-enter data

− append new observations

• restructure attribute data

− calculate new field based on existing fields

− modify format

− change data type
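A few of the maintenance operations listed above, sketched in plain Python on a small list of records. The values are borrowed from the census-tract table in the sorting section of this lesson; the derived field name AvgIncK is made up for the example. The work is done on a copy so that the original table survives any mistake.

```python
import copy

# Every value arrives as text, as it would from a raw text import.
original = [
    {"Tract": "101", "Popln": "2324", "AvgInc": "44200"},
    {"Tract": "103", "Popln": "977", "AvgInc": "57800"},
]

# Keep the original untouched and edit a working copy.
table = copy.deepcopy(original)

for record in table:
    # Change data type: text -> integer.
    record["Popln"] = int(record["Popln"])
    record["AvgInc"] = int(record["AvgInc"])
    # Calculate a new field from an existing field
    # (AvgIncK, income in thousands, is a hypothetical name).
    record["AvgIncK"] = record["AvgInc"] / 1000

# Append a new observation.
table.append({"Tract": "104", "Popln": 854, "AvgInc": 63400, "AvgIncK": 63.4})

print(original[0])
print(table[0])
print(len(table))
```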

Basic database operations: sorts

Obs.   Index   Tract   Popln   AvgInc
13     A.013   101     2324    44200
147    A.147   103     977     57800
419    B.219   104     854     63400
83     B.083   107     3842    33460
6      A.006   109     2771    50050
214    B.014   211     1644    38880
189    A.189   212     1897    40010
164    A.164   215     1330    39770
97     A.097   217     1018    40005
255    B.055   323     1226    47340
337    B.137   618     1897    30500
392    B.192   620     2170    30390
…      …       …       …       …

Single-column sort
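A single-column sort like the one above can be sketched in Python with the built-in sorted function. The five rows are taken from the table, listed here in shuffled order; the second sort (descending by AvgInc) is added only to show the reverse option.

```python
from operator import itemgetter

# (Obs, Index, Tract, Popln, AvgInc)
rows = [
    (83, "B.083", 107, 3842, 33460),
    (13, "A.013", 101, 2324, 44200),
    (6, "A.006", 109, 2771, 50050),
    (419, "B.219", 104, 854, 63400),
    (147, "A.147", 103, 977, 57800),
]

# Ascending sort on the Tract column (position 2).
by_tract = sorted(rows, key=itemgetter(2))

# Descending sort on the AvgInc column (position 4).
by_income_desc = sorted(rows, key=itemgetter(4), reverse=True)

print([r[2] for r in by_tract])        # [101, 103, 104, 107, 109]
print([r[0] for r in by_income_desc])  # [419, 147, 6, 13, 83]
```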

Obs.   City      Ward   CityID   Alderman   ResidLU
18     Decatur   001    121      “R”        0.92
19     Decatur   002    121      “I”        0.67
21     Decatur   004    121      “R”        0.89
24     Decatur   005    121      “R”        0.70
20     Decatur   007    121      “D”        0.74
22     Decatur   009    121      “I”        0.23
115    Dixon     001    144      “R”        0.88
111    Dixon     002    144      “I”        0.80
113    Dixon     003    144      “R”        0.54
114    Dixon     004    144      “D”        0.66
112    Dixon     005    144      “R”        0.45
79     Elgin     003    207      “D”        0.61
…      …         …      …        …          …

Multi-column sort
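When every sort column runs in the same direction, a multi-column sort can be sketched in Python by sorting on a tuple of columns. The rows below are a shuffled subset of the table above. Storing Ward as text is an assumption made here to preserve the leading zeros shown in the table.

```python
# (Obs, City, Ward, CityID, Alderman, ResidLU)
rows = [
    (112, "Dixon", "005", 144, "R", 0.45),
    (19, "Decatur", "002", 121, "I", 0.67),
    (115, "Dixon", "001", 144, "R", 0.88),
    (18, "Decatur", "001", 121, "R", 0.92),
    (79, "Elgin", "003", 207, "D", 0.61),
]

# Sort by City first, then by Ward within each city (both ascending).
by_city_ward = sorted(rows, key=lambda r: (r[1], r[2]))

print([r[0] for r in by_city_ward])  # [18, 19, 115, 112, 79]
```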

Basic database operations: hierarchical sorts

[Figure: a three-column table illustrating a hierarchical sort. Column (1) holds the values Aa and Ab, column (2) holds 01 and 02, and column (3) holds lowercase Roman numerals.]

Hierarchical sort: column 1 (ascending); column 2 (ascending); column 3 (descending)
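A hierarchical sort with mixed directions can be sketched in Python by relying on the stability of the built-in sort: sort on the least significant column first and the most significant column last. The rows are hypothetical, and column 3 uses integers in place of the Roman numerals in the figure.

```python
# Hypothetical rows: (column 1, column 2, column 3).
rows = [
    ("Ab", "01", 2), ("Aa", "02", 1), ("Aa", "01", 3),
    ("Ab", "02", 4), ("Aa", "01", 1), ("Ab", "01", 3),
]


def hierarchical_sort(rows, keys):
    """Sort rows on several columns.

    keys is a list of (column_index, descending) pairs, ordered from the
    most significant column to the least significant one.
    """
    result = list(rows)
    # Python's sort is stable, so sorting on the least significant column
    # first and the most significant column last gives the hierarchy.
    for column, descending in reversed(keys):
        result.sort(key=lambda r: r[column], reverse=descending)
    return result


# Column 1 ascending, column 2 ascending, column 3 descending.
ordered = hierarchical_sort(rows, [(0, False), (1, False), (2, True)])
for row in ordered:
    print(row)
```

Within each (column 1, column 2) group, the column 3 values now run from largest to smallest.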

Simple attribute queries

[Figure: Land-use percentage, by city ward. Categories: residential; commercial; industrial; transportation & utilities; parks & open space.]

Compound attribute queries

                        Color
Age        Black   Chestnut   Bay   Gray   Buckskin   White
0-1 yr
1-2 yrs
2-3 yrs
3-5 yrs
> 5 yrs

The contingency table view of compound attributes
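A contingency table of this kind can be sketched in Python by counting observations per (age class, color) pair. The horses are hypothetical, and treating each age class as including its upper bound is an assumption, since the table does not say where a horse aged exactly 1, 2, 3, or 5 years belongs.

```python
from collections import Counter

# (upper bound in years, label); anything older falls in "> 5 yrs".
AGE_CLASSES = [(1, "0-1 yr"), (2, "1-2 yrs"), (3, "2-3 yrs"), (5, "3-5 yrs")]


def age_class(age_years):
    """Map an age to its row label (upper bounds inclusive, by assumption)."""
    for upper, label in AGE_CLASSES:
        if age_years <= upper:
            return label
    return "> 5 yrs"


# Hypothetical observations: (color, age in years).
horses = [
    ("Black", 0.5), ("Bay", 4.0), ("Black", 7.0),
    ("Gray", 2.5), ("Black", 6.0), ("Bay", 4.5),
]

# Each cell of the table is the count for one (age class, color) pair.
table = Counter((age_class(age), color) for color, age in horses)

print(table[("> 5 yrs", "Black")])  # 2
print(table[("3-5 yrs", "Bay")])    # 2
print(table[("0-1 yr", "White")])   # 0
```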

Multi-attribute queries

                        Color
Age        Black   Chestnut   Bay   Gray   Buckskin   White
0-1 yr
1-2 yrs
2-3 yrs
3-5 yrs
> 5 yrs

Operator   Set action                 Logic outcome
NOT        set complement             Logical negation of the operand: true if the operand is false, false if it is true.
AND        intersection of two sets   True if both operands are true, false otherwise.
OR         union of two sets          True if either operand is true, or if both are true. False if both operands are false.
XOR        union less intersection    True if exactly one of the two operands is true. False if both are true or both are false.

Compound statements are written in the form: operand-1 LOGICAL OPERATOR operand-2; e.g., horse is black AND horse is 5 years of age or older

[Figure: Venn diagrams. Labels: the set of all horses; the set of Black horses; “NOT Black” horses; the set of horses ≥ 5 yrs old.]
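The set view of the four logical operators maps directly onto Python's set operators. The herd below is hypothetical; each query returns the names of the horses that satisfy it.

```python
# Hypothetical herd: name -> (color, age in years).
herd = {
    "Ace": ("Black", 6),
    "Belle": ("Bay", 7),
    "Cody": ("Black", 2),
    "Dot": ("Gray", 3),
}

all_horses = set(herd)
black = {name for name, (color, age) in herd.items() if color == "Black"}
five_plus = {name for name, (color, age) in herd.items() if age >= 5}

not_black = all_horses - black     # NOT: set complement
black_and_old = black & five_plus  # AND: intersection
black_or_old = black | five_plus   # OR: union
black_xor_old = black ^ five_plus  # XOR: union less intersection

print(sorted(not_black))      # ['Belle', 'Dot']
print(sorted(black_and_old))  # ['Ace']
print(sorted(black_or_old))   # ['Ace', 'Belle', 'Cody']
print(sorted(black_xor_old))  # ['Belle', 'Cody']
```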

Data statistics

Measurement scale (model), with properties, allowable operations, and examples

1. Nominal
   Properties: measures “categories”
   Allowable operations: count
   Examples: eye color, land use

2. Ordinal
   Properties: identifies order: most to least, smallest to largest
   Allowable operations: count, <, =, >
   Examples: class standing (fr, so, jr, sr), physiographic relief

3. Interval
   Properties: quantitative; no true zero, but preserves equal intervals
   Allowable operations: count, <, =, >, +, -; average, range, median, standard deviation, etc.
   Examples: °F, soil productivity rating

4. Ratio
   Properties: quantitative; has true zero, preserves ratios
   Allowable operations: count, <, =, >, +, -, ×, ÷, ln()…; average, range, median, standard deviation, etc.
   Examples: distance, population density, snow pack depth

Measures of central tendency

Median: center point of a data distribution

half of the observations have a data value ≤ the median and half have a data value ≥ the median

Mean: the average data value = 1/n × Σ (all data values), where n is the number of observations

the mean equals the median when the data are symmetrically distributed about the mean; in skewed data the two generally differ

[Figure: distribution of variable V2 (x-axis from -6 to 4), with the mean marked.]
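A small Python sketch of the two definitions, using made-up values. It shows the mean and median agreeing for a symmetric data set and separating once a single large value skews the data.

```python
def mean(values):
    """Average data value: 1/n times the sum of all data values."""
    return sum(values) / len(values)


def median(values):
    """Center point of the sorted data values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2:  # odd count: the single middle value
        return ordered[middle]
    # even count: average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


symmetric = [2, 4, 6, 8, 10]   # hypothetical, symmetric about 6
skewed = [2, 4, 6, 8, 100]     # same data with one extreme value

print(mean(symmetric), median(symmetric))  # mean 6, median 6
print(mean(skewed), median(skewed))        # mean 24, median 6
```

The median is unaffected by the extreme value, which is why comparing the two measures is a quick check for skew or data-entry errors.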

Measures of dispersion

Range: the span, or extent of data values

range = maximum data value – minimum data value

Variance: average squared distance of all observations from the mean

Standard Deviation: the square root of the variance, loosely interpreted as the typical distance of an observation from the mean.

for a Normal distribution, approximately 68.3% of all data values will lie within one standard deviation of the mean and 95.4% within 2 standard deviations of the mean
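The dispersion measures above can be sketched with Python's standard statistics module. The data values are made up. Because the lesson defines variance as the average squared distance from the mean, the sketch uses the population forms (pvariance, pstdev), which divide by n; the sample forms (variance, stdev) divide by n - 1 and give slightly larger values.

```python
from statistics import NormalDist, pstdev, pvariance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

data_range = max(values) - min(values)  # maximum - minimum
variance = pvariance(values)            # average squared distance from the mean
std_dev = pstdev(values)                # square root of the variance

print(data_range, variance, std_dev)

# Share of a Normal distribution within 1 and 2 standard deviations of the mean.
standard_normal = NormalDist(mu=0, sigma=1)
within_1 = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_2 = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(round(within_1, 3), round(within_2, 3))
```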

Data graphs

[Figure: two plots, with x-axes Bulk.Density (1.0 to 1.9) and wet.pH (4.5 to 8.5).]

Data graphs for visualizing the distribution of data

[Figure, four panels: Box-whisker plot; Quantile-Quantile plot, with Normal distribution reference line; Density plot; Histogram with density plot (Normal distribution). Axis names appearing in the panels: wet.pH, Normal Distribution, Bulk.Density, Elev.m., V2.]

Data graphs for visualizing data relations

[Figure: scatterplot with Calcium on the x-axis (3000 to 7000) and CEC on the y-axis (10 to 30).]

A bivariate scatterplot illustrating the relationship between soil Calcium and Cation Exchange Capacity in a northern Illinois soil.

What you have learned

In this lesson you learned:

• Tabular databases are organized as tables, with rows as observations, columns as attributes, and the data or information contained inside the table. A tabular database may also contain indexes, a data dictionary, and aliases.

• The data dictionary is vital to the proper interpretation and use of data. It should contain a description of each attribute’s measurement scale, how it was measured, when and where it was collected, by whom, and for what purpose.

• Database design includes deciding which attributes to capture and how they are labeled, what data type to use for each attribute, which data validation rules to apply, and which data storage format to use.

• Basic data types include text string or memo for text or qualitative information, and integer, decimal, and byte for numeric or quantitative information.

• Tabular databases can be created in database, spreadsheet, statistical analysis and other software and exchanged in standard database, spreadsheet, ODBC, and formatted text file formats.

• Nearly all database software has functional capabilities for data entry and editing, sorts, queries, data statistics, and data graphs.

• Save a copy of your database before performing any maintenance or segmentation! Be especially careful with editing operations involving find/replace, and any operation that changes data formats or type.

• Single- and multi-column sorts are useful for isolating more obvious data errors and as a starting point for segmenting the data into smaller databases, classifying observations, and creating indexes.

• Query operations can take the form of find queries, filter queries or subset queries, of which only the last effects permanent change to the content of the database.

• Compound queries utilize the logical operators NOT, AND, OR and XOR to join query operands.

• Measures of central tendency, measures of dispersion, data distribution graphs, and scatterplots are often useful in data verification, but their greatest value is in data segmentation.