MC0077 Assignment

1. List and explain the various Normal Forms. How does BCNF differ from the Third and Fourth Normal Forms?

ANS: Normal Forms: Relations are classified based upon the types of anomalies to which they're vulnerable. A database that's in the first normal form is vulnerable to all types of anomalies, while a database that's in the domain/key normal form has no modification anomalies. Normal forms are hierarchical in nature.

First Normal Form: Any table that qualifies as a relation is said to be in the first normal form. To be considered relational, the cells of the table must contain only single values; repeating groups or arrays are not allowed as values.

Second Normal Form: A relation is in the second normal form if it is in the first normal form and all of its non-key attributes are dependent on the entire key, not just on part of it.

Third Normal Form: A relation is in the third normal form if it meets the criteria for the second normal form and has no transitive dependencies, that is, no non-key attribute depends on the key only through another non-key attribute.

Boyce-Codd Normal Form: A relation that meets the third normal form criteria, and in which every determinant is a candidate key, is said to be in the Boyce-Codd Normal Form. This normal form removes the anomalies caused by functional dependencies whose determinant is not a candidate key.

Fourth Normal Form: Fourth Normal Form (4NF) is an extension of BCNF for functional and multi-valued dependencies. A schema is in 4NF if the left hand side of every non-trivial functional or multi-valued dependency is a super-key.

Domain/Key Normal Form: The domain/key normal form is the Holy Grail of relational database design, achieved when every constraint on the relation is a logical consequence of the definition of keys and domains, so that enforcing the key and domain constraints causes all constraints to be met.

BCNF differs from the Third and Fourth Normal Forms as follows: The third normal form requires that a table have no transitive dependencies, meaning that no non-key attribute depends on a key through another non-key attribute. 3NF still tolerates a functional dependency X -> A in which X is not a superkey, provided that A is part of some candidate key. BCNF extends 3NF by removing this exception: every non-trivial functional dependency must have a superkey (a candidate key or a superset of one) on its left-hand side. 4NF applies the same rule to multi-valued dependencies as well, so every relation in 4NF is in BCNF, and every relation in BCNF is in 3NF.
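The BCNF condition can be checked mechanically with attribute closures. The following is a minimal Python sketch, assuming a relation is given as a set of attribute names and its functional dependencies as (left, right) pairs. The helper names closure and bcnf_violations, and the student/course/teacher relation, are illustrative assumptions and not part of the course material.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for left, right in fds:
            if set(left) <= result and not set(right) <= result:
                result |= set(right)
                changed = True
    return result


def bcnf_violations(relation, fds):
    """List non-trivial dependencies whose left side is not a superkey."""
    violations = []
    for left, right in fds:
        trivial = set(right) <= set(left)
        is_superkey = closure(left, fds) == set(relation)
        if not trivial and not is_superkey:
            violations.append((left, right))
    return violations


# Hypothetical relation: a student takes a course with one teacher,
# and each teacher teaches exactly one course.
relation = {"student", "course", "teacher"}
fds = [
    (("student", "course"), ("teacher",)),
    (("teacher",), ("course",)),
]

violations = bcnf_violations(relation, fds)
print(violations)
```

In this sample the dependency teacher -> course is reported because teacher alone is not a superkey. The relation is still in 3NF, since course belongs to the candidate key (student, course).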

2. Describe the concepts of Structural Semantic Data Model (SSM).

ANS: The current version of SSM belongs to the class of Semantic Data Model types extended with concepts for specification of user defined data types and functions, UDT and UDF. It supports the modeling concepts defined in Table 4.4 and compared in Table 4. Figure 4.2 shows the concepts and graphic syntax of SSM, which include:

Three types of entity specifications: base (root), subclass, and weak

Four types of inter-entity relationships: n-ary associative, and 3 types of classification hierarchies

Four attribute types: atomic, multi-valued, composite, and derived

Domain type specifications in the graphic model, including standard data types, binary large objects (blob, text, image, ...), and user-defined types (UDT) and functions (UDF)

Cardinality specifications for entity to relationship-type connections and for multi-valued attribute types

Data value constraints

3. Describe the following with respect to object Oriented Databases:

a. Query Processing in Object-Oriented Database Systems (5)

b. Query Processing Architecture (5)

ANS: a. Query Processing in Object-Oriented Database Systems: Query processing covers both the optimization and the execution of OODBMS query languages. Query optimization techniques are dependent upon the query model and language. The query model, in turn, is based on the data (or object) model, since the latter defines the access primitives which are used by the query model.

Type System: Relational query languages operate on a simple type system consisting of a single aggregate type: relation. The closure property of relational languages implies that each relational operator takes one or more relations as operands and produces a relation as a result. In contrast, object systems have richer type systems.

Encapsulation: Relational query optimization depends on knowledge of the physical storage of data (access paths), which is readily available to the query optimizer. In an OODBMS, the encapsulation of methods with the data they operate on raises two issues. First, estimating the cost of executing methods is considerably more difficult than estimating the cost of accessing an attribute according to an access path. Second, encapsulation raises issues related to the accessibility of storage information by the query optimizer.

Complex Objects and Inheritance: Objects usually have complex structures where the state of an object references other objects. Accessing such complex objects involves path expressions. The optimization of path expressions is a difficult and central issue in object query languages.

Object Models: OODBMSs lack a universally accepted object model definition. Even though there is some consensus on the basic features that need to be supported by any object model (e.g., object identity, encapsulation of state and behavior, type inheritance, and typed collections), how these features are supported differs among models and systems.

b. Query Processing Architecture: In Query Processing Architecture, there are two architectural issues: the query processing methodology and the query optimizer architecture.

Query Processing Methodology: A query processing methodology similar to that of relational DBMSs, but modified to deal with the difficulties discussed in the previous section, can be followed in OODBMSs.

The steps of the methodology are as follows.

1. Queries are expressed in a declarative language

2. It requires no user knowledge of object implementations, access paths or processing strategies

3. The calculus expression is first

4. Calculus Optimization

5. Calculus Algebra Transformation

6. Type check

7. Algebra Optimization

8. Execution Plan Generation

9. Execution

Query Optimizer Architecture: Query optimization depends on knowledge of the physical storage of data (access paths), which in relational systems is readily available to the query optimizer. In an OODBMS, encapsulation raises issues related to the accessibility of storage information by the query optimizer.

4. Describe the Differences between Distributed & Centralized Databases.

ANS: A distributed database is a database that is under the control of a central database management system (DBMS) in which storage devices are not all attached to a common CPU. It may be stored in multiple computers located in the same physical location, or may be dispersed over a network of interconnected computers. Collections of data (e.g. in a database) can be distributed across multiple physical locations. A distributed database can reside on network servers on the Internet, on corporate intranets or extranets, or on other company networks. The replication and distribution of databases improves database performance at end-user worksites.

To ensure that the distributed databases are up to date and current, there are two processes: replication and duplication. Replication involves using specialized software that looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same. The replication process can be very complex and time consuming depending on the size and number of the distributed databases, and it can also require a lot of computer resources. Duplication is less complex: one database is identified as the master and is then copied to the other locations, and users may change only the master database. This is to ensure that local data will not be overwritten. Both of the processes can keep the data current in all distributed locations.

Besides distributed database replication and fragmentation, there are many other distributed database design technologies, for example local autonomy and synchronous and asynchronous distributed database technologies. How these technologies are implemented depends on the needs of the business and the sensitivity/confidentiality of the data to be stored in the database, and hence on the price the business is willing to pay to ensure data security, consistency and integrity.

A distributed database does not share main memory or disks. A centralized database, in contrast, keeps all its data in one place. Because all the data resides in one place, a bottleneck can occur, and data availability is lower than in a distributed database, where the data is held at different places.

5. Explain the following:

a. Query Optimization (5)

b. Text Retrieval Using SQL3/Text Retrieval (5)

ANS:

a. Query Optimization: The goal of any query processor is to execute each query as efficiently as possible. Efficiency here can be measured in both response time and correctness.

The traditional, relational DB approach to query optimization is to transform the query to an execution tree, and then execute query elements according to a sequence that reduces the search space as quickly as possible and delays execution of the most expensive (in time) elements as long as possible. A commonly used execution heuristic is:

1. Execute all select and project operations on single tables first, in order to eliminate unnecessary rows and columns from the result set.

2. Execute join operations to further reduce the result set.

3. Execute operations on media data, since these can be very time consuming.

4. Prepare the result set for presentation.
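The benefit of executing selections before joins can be seen in a small experiment. The following Python sketch is only an illustration of the heuristic, not an actual optimizer: the employees and departments tables, the salary predicate, and the nested-loop join helper are all assumptions made for the example. It counts how many row pairs the join has to compare under each plan.

```python
employees = [
    {"emp_id": 1, "dept_id": 10, "salary": 900},
    {"emp_id": 2, "dept_id": 20, "salary": 1500},
    {"emp_id": 3, "dept_id": 10, "salary": 2000},
]
departments = [
    {"dept_id": 10, "name": "Sales"},
    {"dept_id": 20, "name": "Research"},
]


def join(left, right, key):
    """Nested-loop join that also reports how many row pairs it compared."""
    rows, comparisons = [], 0
    for l_row in left:
        for r_row in right:
            comparisons += 1
            if l_row[key] == r_row[key]:
                rows.append({**l_row, **r_row})
    return rows, comparisons


# Plan A: join everything, then apply the selection.
joined, cost_a = join(employees, departments, "dept_id")
plan_a = [row for row in joined if row["salary"] > 1000]

# Plan B: apply the selection first, then join the smaller input.
selected = [row for row in employees if row["salary"] > 1000]
plan_b, cost_b = join(selected, departments, "dept_id")

print(cost_a, cost_b)
```

Both plans return the same rows, but the second plan compares fewer row pairs because the selection has already removed one employee before the join runs.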

b. Text Retrieval Using SQL3/Text Retrieval: SQL3 supports storage of multimedia data, such as text documents, in an object-relational database (or-database) using the blob/clob data types. However, the standard SQL3 specification does not include support for such media content processing functions as indexing or searching using elements of the media content.

The functionality added on top of standard SQL3 includes:

· Indexing Routines for the various types of media data, for example using:

Content terms for text data and

Color, shape, and texture features for image data.

· Selection Operators for the SQL3 WHERE clause for specification of selection criteria for media retrieval.

· Text Processing Sub-Systems for similarity evaluation and result ranking.

Unfortunately, the result of this 'independent' activity is a set of non-standard or-dbms/mm (multimedia) systems that differ in the functionality they include, which limits data retrieval across multiple or-dbms system types. Since the syntax of the SQL3 extensions varies between or-dbms/mm implementations, the examples used in the following are given in generic SQL3/TextRetrieval (or sql3/tr) statements.
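Because the SQL3 extension syntax is vendor specific, the idea behind term indexing and result ranking is sketched here in Python rather than in any particular dialect. This is a simplified illustration under stated assumptions: the sample documents, the build_index and search helpers, and the scoring rule (number of matching query terms) are all hypothetical and do not correspond to any or-dbms/mm product.

```python
from collections import defaultdict

documents = {
    1: "database systems store structured data",
    2: "text retrieval ranks documents by similarity",
    3: "multimedia database systems index text and image data",
}


def build_index(docs):
    """Map each content term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many query terms they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(documents)
ranked = search(index, "text database index")
print(ranked)
```

The index plays the role of the indexing routine, the search function stands in for a selection operator, and the sort stands in for similarity evaluation and result ranking.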

6. Describe the following:

a. Data Mining Functions (5)

b. Data Mining Techniques (5)

ANS:

a. Data Mining Functions: Data mining methods may be classified by the function they perform or according to the class of application they can be used in. Some of the main functions performed in data mining are:

Classification: Data mining tools have to infer a model from the database, and in the case of Supervised Learning this requires the user to define one or more classes.

Associations: Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items.
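Affinities between items are commonly expressed as the support and confidence of a rule. The following Python sketch shows the arithmetic on a handful of made-up shopping baskets; the transactions and the helper names support and confidence are assumptions for illustration only.

```python
# Hypothetical records, each containing some items from the collection.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    hits = sum(1 for record in records if itemset <= record)
    return hits / len(records)


def confidence(antecedent, consequent, records):
    """Of the records containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, records) / support(antecedent, records)


bread_milk_support = support({"bread", "milk"}, transactions)
bread_to_milk = confidence({"bread"}, {"milk"}, transactions)
print(bread_milk_support, bread_to_milk)
```

Here bread and milk occur together in two of the four records, and two of the three records containing bread also contain milk.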

Sequential/Temporal patterns: Sequential/temporal pattern functions analyze a collection of records over a period of time, for example to identify trends.

Clustering/Segmentation: Clustering and segmentation are the processes of creating a partition so that all the members of each set of the partition are similar according to some metric.

IBM – Market Basket Analysis example: IBM have used segmentation techniques in their Market Basket Analysis on POS transactions, where they separate a set of untagged input records into reasonable groups according to product revenue by market.

b. Data Mining Techniques: Data mining methods may also be classified by the technique they use. Some of the main techniques used in data mining are:

Cluster Analysis: In an unsupervised learning environment the system has to discover its own classes, and one way in which it does this is to cluster the data in the database.
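One widely used way to cluster data is k-means, which is named here as an example technique and is not prescribed by the text above. The following Python sketch runs it on one-dimensional values; the data, the starting centers, and the function name k_means_1d are assumptions chosen to keep the example short.

```python
def k_means_1d(values, centers, iterations=10):
    """Repeatedly assign values to the nearest center, then recompute centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous center.
        centers = [
            sum(cluster) / len(cluster) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = k_means_1d(values, centers=[0.0, 5.0])
print(centers, clusters)
```

No class labels are supplied; the two groups emerge from the distances between the values alone.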

Induction: A database is a store of information but more important is the information which can be inferred from it. There are two main inference techniques available i.e. deduction and induction.

Decision Trees: Decision trees are a simple knowledge representation that classifies examples into a finite number of classes. The nodes are labeled with attribute names.
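A decision tree of this kind can be represented as nested nodes and walked from the root to a leaf. The following Python sketch assumes a hand-built tree; the attributes outlook and windy, the class names, and the classify function are hypothetical and serve only to show the traversal, not how a tree is learned from data.

```python
# Inner nodes are labeled with an attribute name; leaves are class names.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {True: "stay in", False: "play"},
        },
        "rainy": "stay in",
        "overcast": "play",
    },
}


def classify(node, example):
    """Follow the branch matching the example's value until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node


print(classify(tree, {"outlook": "sunny", "windy": False}))
```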

Neural Networks: Neural networks are an approach to computing that involves developing mathematical structures with the ability to learn.

Structure of a neural network: The bottom layer represents the input layer, in this case with 5 inputs labeled X1 through X5. It is the hidden layer that performs much of the work of a network.
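The flow of values from the input layer through the hidden layer to the output can be sketched in a few lines of Python. This is a forward pass only, with no learning step. The three hidden units, the single output, the sigmoid activation, and every weight and bias below are arbitrary assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """Compute one layer: each unit applies sigmoid to its weighted sum plus bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


inputs = [0.5, 0.1, 0.9, 0.3, 0.7]  # X1 through X5

# Hypothetical weights: one row per hidden unit, one column per input.
hidden_weights = [
    [0.2, -0.4, 0.1, 0.5, -0.3],
    [-0.1, 0.3, 0.6, -0.2, 0.4],
    [0.05, 0.05, 0.05, 0.05, 0.05],
]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.5, 0.2]]
output_biases = [0.0]

hidden = layer(inputs, hidden_weights, hidden_biases)
output = layer(hidden, output_weights, output_biases)
print(hidden, output)
```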

OLAP Example: An example OLAP database may be comprised of sales data which has been aggregated by region, product type, and sales channel.
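Aggregation along chosen dimensions can be sketched in Python as follows. The sales rows, the amounts, and the roll_up helper are made up for illustration; a real OLAP system would precompute and store such aggregates rather than scan the rows each time.

```python
from collections import defaultdict

DIMENSIONS = ("region", "product_type", "channel")

# Hypothetical fact rows: (region, product type, sales channel, amount).
sales = [
    ("North", "Laptop", "Online", 1200),
    ("North", "Laptop", "Store", 800),
    ("South", "Phone", "Online", 500),
    ("North", "Phone", "Online", 300),
]


def roll_up(rows, keep):
    """Sum the amount over every dimension that is not listed in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]
    return dict(totals)


by_region = roll_up(sales, ["region"])
by_region_channel = roll_up(sales, ["region", "channel"])
print(by_region)
print(by_region_channel)
```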

Data Visualization: Data visualization makes it possible for the analyst to gain a deeper, more intuitive understanding of the data, and as such it can work well alongside data mining.
