Creating and Using Attribute Databases
In this lesson you will learn:
• concept of the attribute database as a table
• database elements: variables, observations, data, labels, data dictionary, aliases, indexes
• data types and formats
• basic database operations
• attribute queries
• attribute statistics
• attribute data graphs
The attribute database as a table
The attribute database as a table
Database elements
Portion of the data dictionary for the Illinois Historic Tornado database.
Creating the database
Steps in creating the attribute database
1. identify the attributes to be captured2. create attribute columns for each attribute; label each column3. specify the data type for each attribute4. specify validation rules for each attribute5. specify the data format for each attribute
Data types
Data formats
Data type Formatting Options
text length of data field; i.e., maximum number of charactersEx.: “Indian Shoals State Park “ has a length of 25 characters
integer numeric short vs. long integershort integer – stores numbers from –32,768 to 32,767
long integer – stores numbers from –2,147,483,648 to 2,147,483,647
decimal numeric precision and scale:precision = maximum length of the decimal number, including the decimal point and digits to the left and right of the decimal point. Ex.: 176.8859 has a precision value of 8.
scale =maximum number of digits to the right of the decimal place. Ex.: 176.8859 has a scale value of 4.
Tabular database formats
Database standards MS-Access filename.mdb
Paradox filename.db
dBase II, III, IV, 5, 7 filename.dbf
Spreadsheet standards MS-Excel filename.xls
Lotus 1-2-3 filename.wks
Quattro Pro filename.wq1
Open Database Connectivity ODBC-compliant applications: MS-Access, Visual FoxPro, SQL Server, Oracle, dBase, Paradox, DB2, Sybase, etc.
..assorted..
Formatted “text” file Delimited text (comma delimited, tab delimited, etc.)
filename.csv
filename.txt
Fixed-width text filename.txt
Common database “exchange” formats
Tabular database formats
Comma-delimited text (filename.csv)
Tab-delimited text (filename.txt)
Fixed-width text (filename.txt)
Basic database operations
a. data entry & editing
b. sorts
c. queries
d. data statistics
e. data graphs
Basic database operations: data maintenance
• add/delete observations
• add/delete attribute fields
• edit data
− spell check (text & memo fields)
− find/replace
− re-enter data
− append new observations
• restructure attribute data
− calculate new field based on existing fields
− modify format
− change data type
Basic database operations: sorts
Obs. Index Tract Popln AvgInc
13 A.013 101 2324 44200
147 A.147 103 977 57800
419 B.219 104 854 63400
83 B.083 107 3842 33460
6 A.006 109 2771 50050
214 B.014 211 1644 38880
189 A.189 212 1897 40010
164 A.164 215 1330 39770
97 A.097 217 1018 40005
255 B.055 323 1226 47340
337 B.137 618 1897 30500
392 B.192 620 2170 30390
… … … … …
Single-column sort
Obs. City Ward CityID Alderman ResidLU
18 Decatur 001 121 “R” 0.92
19 Decatur 002 121 “I” 0.67
21 Decatur 004 121 “R” 0.89
24 Decatur 005 121 “R” 0.70
20 Decatur 007 121 “D” 0.74
22 Decatur 009 121 “I” 0.23
115 Dixon 001 144 “R” 0.88
111 Dixon 002 144 “I” 0.80
113 Dixon 003 144 “R” 0.54
114 Dixon 004 144 “D” 0.66
112 Dixon 005 144 “R” 0.45
79 Elgin 003 207 “D” 0.61
… … … … … …
Multi-column sort
Basic database operations: hierarchical sorts
Aa
Aa
Aa
Aa
Aa
Aa
Ab
Ab
Ab
Ab
Ab
Ab
(1)
01
01
01
02
02
02
01
01
01
02
02
02
i
i
(3)(2)
ii
i
ii
ii
ii
iii
ii
ii
iv
Hierarchical sort: column 1 (ascending); column 2 (ascending); column 3 (descending)
Simple attribute queries
Simple attribute queries
Land-use percentage, by city ward
residential
commercial
industrial
transportation & utilities
parks & open space
Compound attribute queries
Color
AgeBlack Chestnut Bay Gray Buckskin White
0-1 yr
1-2 yrs
2-3 yrs
3-5 yrs
> 5 yrs
The contingency table view of compound attributes
Multi-attribute queries
Color
AgeBlack Chestnut Bay Gray Buckskin White
0-1 yr
1-2 yrs
2-3 yrs
3-5 yrs
> 5 yrs
Multi-attribute queries
Operator Set action Logic Outcome
NOT set complement Logical converse of the operand.
AND intersection of two sets True if both operands are true, false otherwise.
OR union of two sets True if either 1st or 2nd operand is true, or if both are true. False if both operands are false.
XOR union less intersection True if 1st operand is true or 2nd operand is true. False if both are true or both are false.
Compound statements are written in the form: operand-1 LOGICAL OPERATOR operand-2; i.e., horse = black AND horse = 5 years of age or older
The set of all horses
The set of Black horses
“NOT Black” horses The set of
Black horses
The set of horses ≥ 5 yrs old
Data statistics
Measurement scale (model)
Properties Allowable operations Examples
1. Nominalmeasures “categories”
count eye color, land use
2. Ordinalidentifies order: most to least, smallest to largest;
count, <, =, >class standing (fr, so, jr, sr), physiographic relief
3. Intervalquantitative: no true zero, but preserves equal intervals
count, <, =, >, +, -
average, range, median, standard deviation, etc.
°F, soil productivity rating
4. Ratio quantitative: has true zero, preserves ratios
count, <. =, >, +, -, ×,÷, ln()…
average, range, median, standard deviation, etc.
distance, population density, snow pack depth
Measures of central tendency
Median: center point of a data distribution
exactly 50% of the observations have a data value < the median and 50% have a data value > the median
Mean: the average data value = 1/n × Σ (all data values)
the mean = the median only if the data are unimodal and symmetrically distributed about the mean
-6 -4 -2 0 2 4
V2
0.0
0.1
0.2
0.3
0.4mean
Measures of dispersion
Range: the span, or extent of data values
range = maximum data value – minimum data value
Variance: average squared distance of all observations from the mean
Standard Deviation: the square root of the variance, interpreted as the average distance of all observations away from the mean.
for a unimodal symmetric distribution, approximately 68% of all data values will lie within one standard deviation of the mean and 95.4% within 2 standard deviations of the mean
Data graphs
1.0 1.1 1.2 1.2 1.3 1.4 1.5 1.5 1.6 1.7 1.8 1.8 1.9
Bulk.Density
0
10
20
30
40
4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5
wet.pH
0
40
80
120
Data graphs for visualizing the distribution of data
4.5 5.5 6.5 7.5 8.5
wet.pH-3 -2 -1 0 1 2 3
Normal Distribution
1.0
1.2
1.4
1.6
1.8
Bul
k.D
ens
ity
270 272 274 276 278 280 282
Elev.m.
0.00
0.05
0.10
0.15
0.20
0.25
-3.2 -2.6 -2.1 -1.6 -1.0 -0.5 0.0 0.5 1.1 1.6 2.1 2.7 3.2
V2
0.0
0.1
0.2
0.3
0.4
Box-whisker plot Quantile-Quantile plot, with Normal distribution reference line
Density plot Histogram with density plot (Normal distribution)
Data graphs for visualizing data relations
3000 4000 5000 6000 7000
Calcium
10
15
20
25
30
CE
C
A bivariate scatterplot illustrating the relationship between soil Calcium and Cation Exchange Capacity in a northern Illinois soil.
What you have learned
In this lesson you learned:• Tabular databases are organized as tables, with rows as observations, columns as attributes, and the data or information contained inside the table. It may also contain indexes, a data dictionary, and aliases.
• The data dictionary is vital to the proper interpretation and use of data. It should contain a description of each attribute’s measurement scale, how it was measured, when and where it was collected, by whom, and for what purpose.
• Database design includes: which attributes and how they are labeled, what data type to use for each attribute, data validation rules, and data storage format.
• Basic data types include text string or memo for text or qualitative information, and integer, decimal, and byte for numeric or quantitative information.
• Tabular databases can be created in database, spreadsheet, statistical analysis and other software and exchanged in standard database, spreadsheet, ODBC, and formatted text file formats.
• Nearly all database software has functional capabilities for data entry and editing, sorts, queries, data statistics, and data graphs.
• Save a copy of your database before performing any maintenance or segmentation! Be especially careful with editing operations involving find/replace, and any operation that changes data formats or type.
• Single- and multi-column sorts are useful for isolating more obvious data errors and as a starting point for segmenting the data into smaller databases, classifying observations, and creating indexes.
• Query operations can take the form of find queries, filter queries or subset queries, of which only the last effects permanent change to the content of the database.
• Compound queries utilize the logical operators NOT, AND, OR and XOR to join query operands.
• Measures of central tendency, measures of dispersion, data distribution graphs, and scatterplots are often useful in data verification, but their greatest value is in data segmentation.