
Query Processing


1. INTRODUCTION

1.1 OBJECTIVE:

• The main objective is to compute approximate answers, in less time, for systems facing dynamic failures.

• Query processing is carried out over a peer-to-peer (P2P) network.

1.2 EXISTING SYSTEM:

The existing system uses a structured P2P network with a Distributed Hash Table (DHT). Exact query processing is possible in the existing system, but it has several disadvantages:

A structured network is organized so that data items are located at specific nodes in the network, and those nodes maintain state information to enable efficient retrieval of the data.

Structured networks are not efficient or flexible enough for applications where nodes join or leave the network frequently.

A sequential algorithm is used, which increases the latency of query processing.

Since the nodes are selected sequentially, if any node gets disconnected the exact answer is not received, and identifying the disconnected node becomes tedious and sometimes impossible.


1.3 PROPOSED SYSTEM:

The proposed system uses an unstructured P2P network.

No assumptions are made in our project about the location of data items within the nodes.

Nodes can join at random times and depart without prior notification.

We use approximate query processing to reduce latency, which is the aim of our project.

The project can run while nodes are dynamically added and removed.
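The join/leave behaviour described above can be sketched as a minimal unstructured overlay, in which each node keeps a small neighbour list and may depart without notice. This is an illustrative sketch, not the project's implementation; all names are hypothetical:

```python
import random

class Overlay:
    """Minimal unstructured P2P overlay: no assumptions about data placement."""

    def __init__(self):
        self.neighbours = {}  # node id -> set of neighbour ids

    def join(self, node, fanout=2):
        # A joining node connects to a few random peers already in the network.
        peers = random.sample(sorted(self.neighbours),
                              min(fanout, len(self.neighbours)))
        self.neighbours[node] = set(peers)
        for p in peers:
            self.neighbours[p].add(node)

    def leave(self, node):
        # A node may depart at any time without prior notification;
        # its neighbours simply drop the connection.
        for p in self.neighbours.pop(node, set()):
            self.neighbours[p].discard(node)

net = Overlay()
for n in range(5):
    net.join(n)
net.leave(2)  # node 2 departs without informing anyone in advance
```

The remaining nodes stay usable because no node's data placement depended on node 2 being present.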


2. FEASIBILITY STUDY

2.1 SYSTEM ANALYSIS :

As P2P systems mature beyond file sharing applications and start

getting deployed in increasingly sophisticated e-business and

scientific environments, the vast amount of data within P2P databases

poses a different challenge that has not been adequately researched.

Aggregation queries have potential applications in

decision support, data analysis, and data mining. For example,

millions of peers across the world may be cooperating on a grand

experiment in astronomy, and astronomers may be interested in asking

queries that require the aggregation of vast amounts of data covering

thousands of peers.

There is real-world value for aggregation queries in Intrusion Detection Systems and in application-signature analysis in P2P networks.

2.2 SYSTEM REQUIREMENTS :

2.2.1 HARDWARE REQUIREMENTS :

Hard disk : 40 GB

RAM : 512 MB

Processor Speed : 3.00GHz

Processor : Pentium IV Processor


2.2.2 SOFTWARE REQUIREMENTS :

Front End : VS .NET 2005

Code Behind : C#.net

Back End : SQL SERVER 2000

2.3 FLOW CHART:

The data flow diagram (Fig. 2.1) explains the working of the project in detail. After the registration and login process, the query is issued from the query node. Internally, a node is selected at random, and the data is retrieved from it and stored.


FIGURE 2.1 The Initial Process. The flow is: Start → Login at the query node → Peer Lister → SQL Server connected peers → Two-phase sampling → Visited nodes (active peers) / Unvisited nodes (inactive peers) → Random walk over the active nodes → Passing aggregate rules → Select table (Product or Order Details) → Calculate probability of active node → Prove the result of the inactive node → Generate the report.


3. MODULE DESCRIPTION:

This project uses the following six modules:

Sign in

Peerlister

Activepeers

Aggregation

Viewtable

Report

Sign In:

User registration and password creation are done here. Then, using the login form, users can enter the query processing unit.

Peerlister:

Peerlister lists the peers that are connected to the query node.

FIGURE 3.1 Login and peer listing. The flow is: Login → (on error, retry; on success) → Peer Lister → Peer 1 … Peer 5.

Activepeers:


This module obtains all the peers that are connected to the SQL server.

All SQL Server connected peers form the visited group; the remaining peers are kept in the unvisited group.

After this, the two-phase sampling is carried out.
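The grouping into visited and unvisited peers can be sketched as a simple partition; the connectivity predicate below is a stand-in for the project's actual SQL Server check, and all names are illustrative:

```python
def partition_peers(peers, is_connected):
    """Split peers into the visited (reachable) and unvisited groups,
    as done before two-phase sampling."""
    visited = [p for p in peers if is_connected(p)]
    unvisited = [p for p in peers if not is_connected(p)]
    return visited, unvisited

peers = ["peer1", "peer2", "peer3", "peer4", "peer5"]
online = {"peer1", "peer3", "peer4"}  # stand-in for "SQL Server connected"
visited, unvisited = partition_peers(peers, online.__contains__)
```

Sampling then proceeds over the visited list only, since the unvisited peers cannot answer.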

FIGURE 3.2 SQL connected peers. The flow is: Peer Lister → SQL Server connected peers / disconnected peers.

FIGURE 3.3 Two-phase sampling. The flow is: connected peers → random walk from the nodes → segment into two phases.


Aggregation (Process):

The aggregate rules are passed to the table selected in the Northwind database, for the peers among the visited nodes.
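A sketch of passing an aggregate rule to a selected table. This is illustrative only: Python's sqlite3 stands in for SQL Server 2000, and a toy order-details table stands in for the Northwind data:

```python
import sqlite3

# In-memory stand-in for a peer's Northwind-style "order details" table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE order_details (productid INTEGER, quantity INTEGER)")
con.executemany("INSERT INTO order_details VALUES (?, ?)",
                [(1, 10), (1, 5), (2, 20)])

def apply_aggregate(con, table, rule, column):
    """Apply an aggregate rule (SUM, AVG, COUNT, MIN, MAX) to a column."""
    # Whitelist the rule so only known aggregates reach the SQL text.
    assert rule in {"SUM", "AVG", "COUNT", "MIN", "MAX"}
    row = con.execute(f"SELECT {rule}({column}) FROM order_details"
                      if table == "order_details"
                      else f"SELECT {rule}({column}) FROM {table}").fetchone()
    return row[0]

total = apply_aggregate(con, "order_details", "SUM", "quantity")  # 10 + 5 + 20 = 35
```

In the project, each visited peer would run such an aggregate locally and return only its partial result to the query node.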

Viewtable:

This module enables us to view the tables, and their respective fields, in any database.

Report module:

The report presents the two sampling phases and the visited and unvisited peers as a chart, together with the response time of each peer.


4. LITERATURE REVIEW

4.1 GENERAL

P2P systems are becoming very popular because they provide an

efficient mechanism for building large scalable systems.

Recent work has developed powerful techniques for employing

sampling in the database engine to approximate aggregation queries

and to estimate database statistics.

Recent techniques have focused on providing formal foundations and

algorithms for block-level sampling and are thus most relevant to our

work. The objective in block-level sampling is to derive a

representative sample by randomly selecting a set of disk blocks of a

relation.

4.2 GOAL OF THE PROJECT

Given an aggregation query at a query node, compute with “minimum

cost” an approximate answer to this query with least errors.
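As a concrete illustration of what "approximate answer with minimum cost" means (this is a standard sampling estimator, not necessarily the project's own), a SUM can be estimated from a uniform random sample of tuples by scaling the sample sum; the cost is the sample size rather than the full population:

```python
import random

def approximate_sum(values, sample_size, seed=0):
    """Estimate sum(values) by scaling the sum of a uniform random sample:
    estimate = (N / n) * sum(sample)."""
    random.seed(seed)  # fixed seed only so the sketch is reproducible
    sample = random.sample(values, sample_size)
    return (len(values) / sample_size) * sum(sample)

population = list(range(1, 1001))            # true sum = 500500
estimate = approximate_sum(population, 100)  # touches 100 tuples, not 1000
relative_error = abs(estimate - 500500) / 500500
```

The estimator is unbiased, and its error shrinks as the sample size grows, which is the cost/accuracy trade-off at the heart of the project.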

4.3 CHALLENGES FACED

Picking even a set of uniform random peers is a difficult problem, as

the query node does not have the Internet Protocol (IP) addresses of

all peers in the network. This is a well-known problem that other

researchers have tackled (in different contexts) by using random-walk

techniques on the P2P graph. That is, a Markovian random walk is

12

Page 13: Query Processing

initiated from the query node, picking adjacent peers to visit with
equal probability; under certain connectivity properties, the
random walk is expected to rapidly reach a stationary distribution. If

the graph is badly clustered with small cuts, then this affects the speed

at which the walk converges.
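A minimal sketch of such a Markovian random walk on a P2P graph, moving to a uniformly chosen neighbour at each step; the graph and step count are illustrative, not the project's code:

```python
import random

def random_walk(graph, start, steps, seed=0):
    """Walk the P2P graph, moving to a uniformly chosen neighbour each step;
    returns how often each node was visited."""
    random.seed(seed)  # fixed seed only so the sketch is reproducible
    visits = {node: 0 for node in graph}
    current = start
    for _ in range(steps):
        current = random.choice(sorted(graph[current]))
        visits[current] += 1
    return visits

# A well-connected toy graph: the walk mixes quickly, so visit counts
# approach the stationary distribution (here uniform, by symmetry).
graph = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
visits = random_walk(graph, start=0, steps=4000)
```

On a badly clustered graph with small cuts, the same walk would spend long stretches inside one cluster, which is exactly the slow-convergence problem described above.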

Even if we could select a peer (or a set of peers) uniformly at random,

it does not make the problem of selecting a uniform random set of

tuples much easier. This is because visiting a peer at random has an

associated overhead; thus, it makes sense to select multiple tuples at

random from this peer during the same visit. However, this may

compromise the quality of the final set of tuples retrieved, as the

tuples within the same peer are likely to be correlated.
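The visit-cost trade-off described here amounts to two-stage (cluster) sampling: pick peers at random, then take several tuples from each visited peer. The sketch below is illustrative; names and data are hypothetical:

```python
import random

def two_stage_sample(peer_tuples, n_peers, tuples_per_peer, seed=0):
    """Stage 1: visit n_peers peers chosen uniformly at random.
    Stage 2: take tuples_per_peer tuples from each visited peer,
    amortising the per-visit overhead (at the risk of intra-peer correlation)."""
    random.seed(seed)  # fixed seed only so the sketch is reproducible
    chosen = random.sample(sorted(peer_tuples), n_peers)
    sample = []
    for peer in chosen:
        rows = peer_tuples[peer]
        sample.extend(random.sample(rows, min(tuples_per_peer, len(rows))))
    return sample

# Six peers, five tuples each; tuples within a peer are close in value,
# i.e. correlated, which is what degrades the quality of the sample.
peer_tuples = {f"peer{i}": [i * 10 + j for j in range(5)] for i in range(6)}
sample = two_stage_sample(peer_tuples, n_peers=3, tuples_per_peer=2)
```

Taking more tuples per peer lowers the communication cost per sampled tuple but, because of the correlation, adds less statistical information than the same number of fully independent tuples would.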

4.4 THE PEER TO PEER MODEL

Each peer p is identified by the processor's IP address and a port
number (IP_p and port_p).

The peer p is also characterized by the capabilities of the processor on
which it is located, including its CPU speed (p_cpu), memory
bandwidth (p_mem), and disk space (p_disk).

The node also has a limited amount of bandwidth in the network, say
p_band.

In unstructured P2P networks, a node becomes a member of the

network by establishing a connection with at least one peer currently

in the network. Each node maintains a small number of connections

with its peers.
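The peer model above can be written down as a small record type; the field names follow the text's notation, and the values are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class Peer:
    """A peer p: identity (IP_p, port_p) plus its resource capabilities."""
    ip: str             # IP_p
    port: int           # port_p
    cpu_ghz: float      # p_cpu
    mem_mb: int         # p_mem
    disk_gb: int        # p_disk
    band_kbps: int      # p_band
    neighbours: tuple = ()  # the few connections an unstructured node maintains

# Example values mirror the hardware requirements listed in Section 2.2.1.
p = Peer(ip="10.0.0.5", port=6346, cpu_ghz=3.0, mem_mb=512,
         disk_gb=40, band_kbps=256, neighbours=("10.0.0.7:6346",))
```

Keeping the neighbour list small is what lets nodes join by connecting to just one existing peer.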


4.5 QUERY COST MEASURE

The primary cost measure that we consider is latency, which is

the time that it takes to propagate the query across multiple peers and receive

replies at the query node.


5. SYSTEM ENVIRONMENT:

5.1 FRONT END USED:

Microsoft Visual Studio .NET is used as the front-end tool, for the following reasons.

FEATURES OF MICROSOFT VISUAL STUDIO .NET:

Visual Studio .NET is flexible, allowing one or more languages
to interoperate to provide the solution. This cross-language
compatibility lets us complete projects at a faster rate.

Visual Studio .NET has the Common Language Runtime, which
allows all components to converge into one intermediate
format and then interact.

Visual Studio .NET provides excellent security when an
application is executed in the system.

Visual Studio .NET is also flexible in allowing us to configure the
working environment to best suit our individual style. We can
choose between single and multiple document interfaces, and
we can adjust the size and positioning of the various IDE
elements.

Visual Studio .NET has an IntelliSense feature that makes coding
easy, and its Dynamic Help greatly reduces coding time.


The working environment in Visual Studio .NET is often referred
to as an Integrated Development Environment (IDE) because it integrates
many different functions, such as design, editing, compiling, and
debugging, within a common environment.

After creating a Visual Studio .NET application, we can freely
distribute it to anyone who uses Microsoft Windows. We can distribute our
applications on disk, on CDs, across networks, or over an
intranet or the Internet.

Toolbars provide quick access to commonly used commands in
the programming environment. We click a button on the toolbar
once to carry out the action represented by that button. By
default, the standard toolbar is displayed when we start Visual
Basic .NET. Additional toolbars for editing, form design, and
debugging can be toggled on or off from the Toolbars command
on the View menu.

Many parts of Visual Studio are context sensitive, meaning we
can get help on these parts directly without
having to go through the Help menu. For example, to get help on
any keyword in the Visual Basic language, place the insertion
point on that keyword in the code window and press F1.

Visual Studio interprets our code as we enter it, catching and

highlighting most syntax or spelling errors on the fly. It’s almost

like having an expert watching over our shoulder as we enter our

code.

5.2 BACK END USED:

Microsoft SQL Server 2000 is used as the back-end tool. The reasons for selecting SQL Server 2000 as the back end are as follows:

FEATURES OF SQL SERVER 2000

The OLAP Services feature available in SQL Server version 7.0

is now called SQL Server 2000 Analysis Services. The term OLAP

Services has been replaced with the term Analysis Services. Analysis

Services also includes a new data mining component. The Repository

component available in SQL Server version 7.0 is now called Microsoft

SQL Server 2000 Meta Data Services. References to the component now

use the term Meta Data Services. The term repository is used only in

reference to the repository engine within Meta Data Services.

A SQL Server database consists of five types of objects:

1. TABLE

2. QUERY

3. FORM

4. REPORT

5. MACRO

1) TABLE:

A database is a collection of data about a specific topic.

We can view a table in two ways:

a) Design View

b) Datasheet View

a) Design View

To build or modify the structure of a table, we work in the table's
design view, where we can specify what kind of data it will hold.

b) Datasheet View

To add, edit, or analyse the data itself, we work in the table's
datasheet view.

2) QUERY:

A query is a question posed to obtain the required data.
Access gathers the data that answers the question from one or more tables.
The data that makes up the answer is either a dynaset (which can be edited) or a
snapshot (which cannot be edited). Each time we run a query, we get the latest
information in the dynaset. Access either displays the dynaset or snapshot
for us to view, or performs an action on it, such as deleting or updating.

3) FORMS:

A form is used to view and edit information in the database records. A
form displays only the information we want to see, in the way we want to
see it. Forms use familiar controls such as textboxes and checkboxes,
which makes viewing and entering data easy. We can work with forms in
several views; primarily there are two:


a) Design View

b) Form View

a) Design View

To build or modify the structure of a form, we work in the form's design
view. We can add controls to the form that are bound to fields in a table or
query, including textboxes, option buttons, graphs, and pictures.

b) Form View

The form view displays the whole design of the form.

4) REPORT:

A report is used to view and print information from the database.
A report can group records into many levels and compute totals and
averages by checking values across many records at once. Reports can also
be made attractive and distinctive, because we have control over their size
and appearance.

5) MACRO:

A macro is a set of actions. Each action in a macro does something,
such as opening a form or printing a report. We write macros to automate
common tasks, so that they run easily and save time.


6. SYSTEM TESTING & MAINTENANCE

6.1 TESTING :

6.1.1 SYSTEM TESTING:

Testing is done for each module. After testing all the modules,

the modules are integrated and testing of the final system is done with the

test data, specially designed to show that the system will operate

successfully in all conditions. Procedure-level testing is performed first: by
giving improper inputs, the errors that occur are noted and eliminated. Thus
system testing is a confirmation that everything is correct and an
opportunity to show the user that the system works. The final step involves
validation testing, which determines whether the software functions as the
user expected. The end user, rather than the system developer, conducts this
test.

Most software developers have a process called alpha and beta testing
to uncover the errors that only the end user seems able to find. This is the final
step in the system life cycle. Here we implement the tested, error-free system
in a real-life environment and make the necessary changes while it runs in an
online fashion. System maintenance is done every month or year, based
on company policies, and checks for errors such as runtime errors and
long-run errors, along with other maintenance such as table verification and reports.

6.1.2 UNIT TESTING:


Unit testing verifies the smallest unit of software design, the
module; it is also known as "module testing". The modules are tested
separately. This testing is carried out during the programming stage itself. In
these testing steps, each module is found to work satisfactorily with
regard to the expected output from the module.

6.1.3 INTEGRATION TESTING:

Integration testing is a systematic technique for constructing
tests to uncover errors associated with the interfaces. In this project, all the
modules are combined and then the entire program is tested as a whole. In
the integration testing step, all the errors uncovered are corrected before the
next testing steps.

6.1.4 VALIDATION TESTING:

The aim is to uncover functional errors, that is, to check whether
the functional characteristics conform to the specification.

6.2 SYSTEM MAINTENANCE:

The objective of this maintenance work is to make sure that the
system keeps working at all times, without any bugs. Provision must be made
for environmental changes which may affect the computer or software
system; this is called maintenance of the system. Nowadays there is
rapid change in the software world, and the system should be capable of
adapting to these changes, so maintenance plays a vital role. The system
should be designed to accommodate all new changes without affecting its
performance or accuracy.


7. CONCLUSION & FUTURE ENHANCEMENT

7.1 CONCLUSION :

Our approach requires a minimal number of communications over the

network and provides tunable parameters to maximize performance for

various network topologies.

Our approach provides a powerful technique for approximating
aggregates over various topologies and data clusterings, but it comes with
limitations arising from a given topology's structure and connectivity.

For topologies with very distinct clusters of peers, it becomes
increasingly difficult to accurately obtain random samples, due to the
inability of the random-walk process to quickly reach all clusters.

7.2 FUTURE ENHANCEMENT :

Approximate query processing may be enhanced to exact query
processing, which at present poses many difficulties because of the use of an
unstructured network instead of a structured one, and also because of
congestion, high latency, and the difficulties posed by nodes frequently
joining or leaving the network without prior information.

The approximate query processing technique used in this project
decreases the latency, which is the major consideration here, traded against
accuracy.

8. SNAPSHOTS

FIGURE 8.1 Main Form

FIGURE 8.2 Register

FIGURE 8.3 Login

FIGURE 8.4 PeerLister

FIGURE 8.5 SQL Connected Peers

FIGURE 8.6 Two Phase Sampling

FIGURE 8.7 Random Nodes

FIGURE 8.8 View Table & Fields

FIGURE 8.9 Aggregation Rules

FIGURE 8.10

FIGURE 8.11

FIGURE 8.12

FIGURE 8.13 Report

FIGURE 8.14

FIGURE 8.15 Error Rate

FIGURE 8.16

TABLES:

Table 1  Table register in aqp database

Column Name   Data Type   Length
uname         varchar     50
upwd          varchar     50

Table 2  Table peers in aqp database

Column Name   Data Type   Length
pid           int         4
peername      varchar     50

Table 3  Table visitpeers in aqp database

Column Name   Data Type   Length
vpid          int         4
npname        varchar     50

Table 4  Table unvisitpeers in aqp database

Column Name   Data Type   Length
vpid          int         4
npname        varchar     50

Table 5  Table revisit in aqp database

Column Name   Data Type   Length
vpid          int         4
vpname        varchar     50
res           varchar     50
stime         varchar     50
etime         varchar     50
resptime      varchar     50
err           varchar     50

Table 6  Table apValue in aqp database

Column Name   Data Type   Length
pid           int         4
vpname        varchar     50
prob          nvarchar    50
fname         varchar     50
aggregate     varchar     50

Table 7  Table errorrate in aqp database

Column Name   Data Type   Length
prob          varchar     50
sprob         varchar     50

REFERENCES:

[1] S. Acharya, P.B. Gibbons, and V. Poosala, "Aqua: A Fast Decision Support System Using Approximate Query Answers," Proc. 25th Int'l Conf. Very Large Data Bases (VLDB '99), 1999.

[2] L. Adamic, R. Lukose, A. Puniyani, and B. Huberman, "Search in Power-Law Networks," Physical Rev. E, 2001.

[3] B. Babcock, S. Chaudhuri, and G. Das, "Dynamic Sample Selection for Approximate Query Processing," Proc. 22nd ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '03), pp. 539-550, 2003.

[4] A.R. Bharambe, M. Agrawal, and S. Seshan, "Mercury: Supporting Scalable Multi-Attribute Range Queries," Proc. ACM Ann. Conf. Applications, Technologies, Architectures, and Protocols for Computer Comm. (SIGCOMM '04), 2004.

[5] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, "Analysis and Optimization of Randomized Gossip Algorithms," Proc. 43rd IEEE Conf. Decision and Control (CDC '04), 2004.

[6] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, "Gossip and Mixing Times of Random Walks on Random Graphs," Proc. IEEE INFOCOM '05, 2005.

[7] M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya, "Towards Estimation Error Guarantees for Distinct Values," Proc. 19th ACM Symp. Principles of Database Systems (PODS '00), 2000.

[8] S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. Narasayya, "Overcoming Limitations of Sampling for Aggregation Queries," Proc. 17th IEEE Int'l Conf. Data Eng. (ICDE '01), pp. 534-542, 2001.
