
GELIYOO.COM

Search Engine Development

Project Description


Buray Savas ANIL

3/1/2014


Table of Contents

Introduction
Phases of Development
1. Component Selection
   Objective
   Components Considered
      1. Apache Nutch
      2. Hadoop
      3. Apache Hortonworks
      4. MongoDB
      5. HBase
      6. Cassandra
      7. Elastic Search
      8. Apache Solr
      9. Spring MVC
      10. Easy-Cassandra
   Conclusion
2. Architecture Design
   Objective
   System Design
      1. Web Server for Geliyoo.com
      2. Web Server for hosting the WebService API
      3. ElasticSearch Cluster
      4. Hadoop Cluster
      5. Cassandra Cluster
      6. Horizontal Web Server Clustering
   Conclusion
3. Component Configuration
   Objective
   Configuration Parameters
      Hortonworks Configuration
      Installing and running Ambari Server
      Nutch configuration on the Hadoop Master Node
      Cassandra Configuration
      ElasticSearch Configuration
      Run Nutch jobs on Hadoop Master Node
   Conclusion
4. Development
   Objective
   Development Completed
      Prototype
      Implementation
   Future Development
      Content Searching
      Semantic Search
      Prototypes
      Video Search


Geliyoo Search Engine Project Documentation

Introduction:

We are developing a semantic search engine that will be available to general users for searching the internet. The product can also be customized for installation on a company's intranet, where it will help users search the documents and images that individuals and the company as a whole have made available for general access.

The objective is to create a semantic search engine. Searching requires data from many different websites, so we need to crawl a large number of sites to collect data. That data is stored in a large data store and must be indexed before it can be searched. Each of these tasks requires a different component.

The semantic search engine development process has the following three major components:

1. Crawling

Web crawling is harvesting web content by visiting each website and finding all of its outlinks so that their content can be fetched as well. Crawling is a continuous process that fetches web content down to the Nth depth of a website, and it is restricted by robots.txt on many sites. Web content means all text content, images, documents, etc. In short, crawling fetches all the content that is available on a website. We need a tool that fetches all of this content, parses it by MIME type, and extracts the outlinks.

Crawlers are rather dumb processes that fetch content supplied by web servers answering (HTTP) requests for the requested URIs. Crawlers get their URIs from a crawling engine that is fed from different sources, including links extracted from previously crawled web documents.

2. Indexing

Indexing means making sense of the retrieved content and storing the processing results in a document index. All harvested data must be indexed so that it can be searched.

3. Searching

We need a component that returns efficient results for a search query. Searching over a large indexed data set must be fast and must return all relevant results.


Phases of Development:

Search engine development follows these phases:

1. Component Selection: Select the components that are useful for the implementation of the search engine.

2. Architecture Design: Design the architecture of the system so that it supports both internet-based and intranet-based search.

3. Component Configuration: Configure the selected components as per our requirements to support the searching process.

4. Development: Develop the search engine web application and the remaining parts of the system that are not provided by the selected components.

5. Future Development: The tasks that still need development are listed here.


1. Component Selection

Objective

There are many open source components available that can help us develop the search engine. Instead of creating everything from scratch, we planned to use some open source components and customize and extend them as per our requirements. This saves us a lot of the time and money it would take to recreate what has already been developed. To do this we need to identify the right components that match our requirements.


Components Considered

We went through many tools to achieve this project's objective.

1. Apache Nutch

The first component we evaluated was Apache Nutch, for crawling website links.

Apache Nutch is an open source web crawler written in Java. Using it, we can find webpage hyperlinks in an automated manner, reduce a lot of maintenance work (for example, checking for broken links), and create a copy of all the visited pages for searching over.

It is a highly scalable and relatively feature-rich crawler. It can easily crawl a large number of web pages and find their outlinks so they can be crawled as well. It provides easy integration with Hadoop, ElasticSearch and Apache Cassandra.

Fig 1. Basic Workflow of Apache Nutch

List of Nutch Jobs

1. Inject

The nutch inject command adds a list of seed URLs to the database for your crawl. It takes seed URL files as input. URL validation rules can be defined in Nutch; they are checked during the inject and parse operations, and URLs that do not validate are rejected while the rest are inserted into the database.

Command: bin/nutch inject <url_dir>


Example: bin/nutch inject urls

2. Generate

The nutch generate command takes the list of outlinks generated in a previous cycle, promotes them to the fetch list, and returns a batch ID for this cycle. You will need this batch ID for subsequent calls in this cycle. The number of top-scoring URLs to select can be passed as an argument to this operation.

Command: bin/nutch generate <batch_id> or -all

Example: bin/nutch generate -all

3. Fetch

The nutch fetch command crawls the pages listed in the column family and writes out their contents into new columns. We need to pass in the batch ID from the previous step. We can also pass the value -all instead of a batch ID if we want to fetch all URLs.

4. Parse

The nutch parse command loops through all the pages, analyzes the page content to find outgoing links, and writes them out into another column family.

5. Updatedb

The nutch updatedb command takes the URL values from the previous stage and places them into another column family, so they can be fetched in the next crawl cycle.
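Taken together, one crawl cycle can be run from the command line as the sequence below; it mirrors the invocations used later in the "Run Nutch jobs on Hadoop Master Node" section, and the urls directory and topN value are examples only.

bin/nutch inject urls
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb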

Features

● Fetching and parsing are done separately by default, which reduces the risk of an error corrupting the fetch or parse stage of a crawl with Nutch.
● Plugins have been overhauled as a direct result of removing the legacy Lucene dependency for indexing and search.
● Easily configurable and portable.
● We can create or add extra plugins to extend its functionality.
● Validation rules are available to restrict unwanted websites or content.
● A Tika parser plugin is available for parsing all content types.
● The OPIC scoring plugin or the LinkRank plugin is used to calculate webpage rank with Nutch.


2. Hadoop

Hadoop refers to the overall system that runs jobs on one or more machines in parallel, distributes tasks (pieces of those jobs), and stores data in a parallel and distributed fashion.

A Hadoop cluster has multiple processing nodes, including master nodes and slave nodes. It has its own file system, called HDFS. HDFS is managed through a dedicated NameNode server that hosts the file system index, and a secondary NameNode that can generate snapshots of the NameNode's memory structures. HDFS manages replication across one or more machines, so if data is lost on one node it can be recovered from another node.

Hadoop integrates easily with Apache Nutch. All Nutch crawling and indexing processes are then performed in parallel on different nodes, which decreases processing time. Nutch submits its jobs to Hadoop, Hadoop performs them, and the results are returned to Nutch.

Fig: HDFS UI screen, showing the HDFS file system browser and HDFS storage information.

Fig: Nutch running job information.

3. Apache Hortonworks

Hortonworks Data Platform (HDP) is an open source, fully tested and certified Apache Hadoop data platform.

HDP is designed to facilitate integrating Apache Hadoop with an enterprise's existing data architecture. In short, HDP bundles all the components needed to provide reliable Hadoop clustering.

The Apache Ambari project aims to make Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its REST APIs.

HDP saves us time in managing the Hadoop cluster by providing an attractive web UI. We can easily scale the Hadoop cluster from the web application, and we can analyse the performance and health of Hadoop jobs and the cluster with various graphs, such as memory usage, network usage, cluster load and CPU usage.

Page 13: Arama Motoru Geliyoo'nun düzenlemesi hakkında

13 of 56

4. MongoDB

MongoDB is an open-source document database that provides high performance, high

availability, and automatic scaling.

A record in MongoDB is a document, which is a data structure composed of field and value

pairs. MongoDB documents are similar to JSON objects. The values of fields may include other

documents, arrays, and arrays of documents.

The advantages of using documents are:

● Documents (i.e. objects) correspond to native data types in many programming languages.
● Embedded documents and arrays reduce the need for expensive joins.
● Dynamic schemas support fluent polymorphism.

Features:

1. High Performance

MongoDB provides high-performance data persistence. In particular, support for embedded data models reduces I/O activity on the database system, and indexes support faster queries and can include keys from embedded documents and arrays.

2. High Availability

To provide high availability, MongoDB's replication facility, called replica sets, provides automatic failover and data redundancy. A replica set is a group of MongoDB servers that maintain the same data set, providing redundancy and increasing data availability.


5. HBase

HBase is a column-oriented database that is an open source implementation of Google's BigTable storage architecture. It can manage structured and semi-structured data and has built-in features such as scalability, versioning, compression and garbage collection. Since it uses write-ahead logging and a distributed configuration, it can provide fault tolerance and quick recovery from individual server failures. HBase is built on top of Hadoop/HDFS, and the data stored in HBase can be manipulated using Hadoop's MapReduce capabilities.

HBase Architecture:

The HBase physical architecture consists of servers in a master-slave relationship. Typically, the HBase cluster has one master node, called the HMaster, and multiple region servers, called HRegionServers. Each region server contains multiple regions. Just as in a relational database, data in HBase is stored in tables, and these tables are stored in regions. When a table becomes too big, it is partitioned into multiple regions, which are assigned to region servers across the cluster.

HBase Components

1. HMaster

Performing Administration

Managing and Monitoring the Cluster

Assigning Regions to the Region Servers

Controlling the Load Balancing and Failover

2. HRegionServer

Hosting and managing Regions

Splitting the Regions automatically

Handling the read/write requests

Communicating with the Clients directly

Features

Linear and modular scalability.

Strictly consistent reads and writes.

Automatic and configurable sharding of tables

Automatic failover support between RegionServers.

Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase

tables.

Easy to use Java API for client access.

Block cache and Bloom Filters for real-time queries.

Query predicate push down via server side Filters

Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary

data encoding options.


6. Cassandra

Cassandra is an open source distributed database system designed for storing and managing large amounts of data across commodity servers. Cassandra is designed with peer-to-peer symmetric nodes, instead of master or named nodes, to ensure there can never be a single point of failure. Cassandra automatically partitions data across all the nodes in the database cluster, and any number of nodes can be added to the cluster.

Features

1. Decentralized

Every node in the cluster has the same role. There is no single point of failure. Data is

distributed across the cluster (so each node contains different data), but there is no master as

every node can service any request.

2. Supports replication and multi data center replication

Replication strategies are configurable. Cassandra is designed as a distributed system, for

deployment of large numbers of nodes across multiple data centers. Key features of

Cassandra’s distributed architecture are specifically tailored for multiple-data center

deployment, for redundancy, for failover and disaster recovery.

3. Scalability

Read and write throughput both increase linearly as new machines are added, with no

downtime or interruption to applications.

4. Fault-tolerant

Data is automatically replicated to multiple nodes for fault-tolerance. Replication across

multiple data centers is supported. Failed nodes can be replaced with no downtime.

5. MapReduce support

Cassandra has Hadoop integration, with MapReduce support.

6. Query language

CQL (Cassandra Query Language) was introduced, a SQL-like alternative to the traditional RPC

interface. Language drivers are available for Java (JDBC).

Replication in Cassandra

Replication is the process of storing copies of data on multiple nodes to ensure reliability and

fault tolerance. When you create a keyspace in Cassandra, you must decide the replica

placement strategy: the number of replicas and how those replicas are distributed across

nodes in the cluster. The replication strategy relies on the cluster-configured snitch to help it

determine the physical location of nodes and their proximity to each other.


Replication Strategies:

1. Simple Strategy:

Simple Strategy is the default replica placement strategy when creating a keyspace using

Cassandra CLI. Simple Strategy places the first replica on a node determined by the

partitioner. Additional replicas are placed on the next nodes clockwise in the ring without

considering rack or data center location.

Fig: Simple Strategy diagram

2. Network Topology Strategy:

As the name indicates, this strategy is aware of the network topology (the location of nodes in racks, data centers, etc.) and is much more intelligent than Simple Strategy. This strategy is a must if your Cassandra cluster spans multiple data centers, and it lets you specify how many replicas you want per data center. It tries to distribute data among racks to minimize failures; that is, when choosing nodes to store replicas, it will try to find a node on a different rack.
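As a concrete illustration, the replica placement strategy is chosen when a keyspace is created. The CQL statements below are a minimal sketch; the keyspace names, data center names and replica counts are assumptions, not values from this project.

CREATE KEYSPACE geliyoo_simple
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE KEYSPACE geliyoo_multi_dc
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};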


7. Elastic Search

Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

It integrates easily with Apache Nutch, which uses it for indexing web pages. The indexed data is stored on its file system.

ElasticSearch is distributed, which means that indices can be divided into shards and each shard can have zero or more replicas. Each node hosts one or more shards and acts as a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done automatically.

A set of distinct ElasticSearch instances (nodes) works in a coordinated manner without much administrative intervention at all. Clustering ElasticSearch nodes provides data redundancy as well as data availability.

Indexed data is stored in the file system of the nodes within the cluster. Elasticsearch provides a full query DSL based on JSON to define queries. In general, there are basic queries such as term or prefix. There are also compound queries like the bool query. Queries can also have filters associated with them, such as the filtered or constant_score queries, with specific filter queries. A query is passed to the ElasticSearch cluster, which matches the query parameters and returns the results.
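For example, a filtered query of the kind described above is expressed in JSON roughly as follows. This is a minimal sketch for the 0.19.x-era query DSL used in this project; the field names "content" and "lang" are assumptions.

{
  "query": {
    "filtered": {
      "query":  { "query_string": { "query": "geliyoo", "default_field": "content" } },
      "filter": { "term": { "lang": "tr" } }
    }
  }
}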

Features

First, because it has a rich RESTful HTTP API, it is trivial to query ElasticSearch with Ajax. (ElasticSearch further supports JavaScript developers with cross-origin resource sharing by sending an Access-Control-Allow-Origin header to browsers.)

Second, since ElasticSearch stores schema-free documents serialized as JSON (JavaScript Object Notation, a native entity in JavaScript code), it can be used not only as a search engine but also as a persistence engine.


8. Apache Solr

Apache Solr is an open source search platform built upon a Java library called Lucene. Solr is a popular search platform for websites because it can index and search multiple sites and return recommendations for related content based on the search query's taxonomy. Solr is also a popular platform for enterprise search because it can be used to index and search documents and email attachments.

Solr works with Hypertext Transfer Protocol (HTTP) and Extensible Markup Language (XML). It offers application program interfaces (APIs) for JavaScript Object Notation (JSON), Python, and Ruby. According to the Apache Lucene project, Solr offers capabilities that have made it popular with administrators, including:

o Indexing in near real time

o Automated index replication

o Server statistics logging

o Automated failover and recovery

o Rich document parsing and indexing

o Multiple search indexes

o User-extensible caching

o Design for high-volume traffic

o Scalability, flexibility and extensibility

o Advanced full-text searching

o Geospatial searching

o Load-balanced querying


9. Spring MVC

Spring MVC is the web component of the Spring Framework. The Spring Framework is a Java platform that provides comprehensive infrastructure support for developing Java applications. Spring handles the infrastructure so that developers can focus on their application. It provides rich functionality for building robust web applications, and the Spring MVC framework is architected and designed in such a way that every piece of logic and functionality is highly configurable.

The following is the request processing lifecycle of Spring 3.0 MVC.

* Here, the user needs to define a handler mapping such as BeanNameUrlHandlerMapping or SimpleUrlHandlerMapping that implements the HandlerMapping interface.

** Here, you can define multiple controllers such as SimpleFormController or MultiActionController that ultimately implement the Controller interface.

Features

Spring enables developers to build enterprise-class applications using POJOs. The benefit of using only POJOs is that you do not need an EJB container product such as an application server; a robust servlet container such as Tomcat or a commercial product is enough.

Spring is organized in a modular fashion. Even though the number of packages and classes is substantial, you only need to worry about the ones you need and can ignore the rest.

Spring does not reinvent the wheel; instead, it makes use of existing technologies such as several ORM frameworks, logging frameworks, JEE, Quartz and JDK timers, and other view technologies.

Testing an application written with Spring is simple because environment-dependent code is moved into the framework. Furthermore, by using JavaBean-style POJOs, it becomes easier to use dependency injection for injecting test data.

Spring's web framework is a well-designed web MVC framework, which provides a great alternative to web frameworks such as Struts or other over-engineered or less popular web frameworks.

Spring provides a convenient API to translate technology-specific exceptions (thrown by JDBC, Hibernate, or JDO, for example) into consistent, unchecked exceptions.

Spring's IoC containers tend to be lightweight, especially when compared to EJB containers. This is beneficial for developing and deploying applications on computers with limited memory and CPU resources.

Spring provides a consistent transaction management interface that can scale down to a local transaction (using a single database, for example) and scale up to global transactions (using JTA, for example).

Spring has the @Async annotation. Using this annotation, the necessary processes can be run asynchronously. This feature is very useful for the Geliyoo Search Engine to minimize the search time.
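As a minimal sketch of how @Async can be applied here (assuming Spring 3.1+ with @EnableAsync; in Spring 3.0 the equivalent is <task:annotation-driven/> in the XML configuration, and the service and method names below are illustrative, not taken from the project):

import java.util.concurrent.Future;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.AsyncResult;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.stereotype.Service;

@Configuration
@EnableAsync
class AsyncConfig {
    // Enables detection of @Async methods on Spring-managed beans.
}

@Service
class AsyncSearchService {

    // Runs the (hypothetical) backend search call on a separate thread,
    // so the web layer is not blocked while waiting for results.
    @Async
    public Future<String> search(String query) {
        String resultJson = callSearchBackend(query);
        return new AsyncResult<String>(resultJson);
    }

    private String callSearchBackend(String query) {
        // Placeholder for the actual call to the search backend.
        return "{}";
    }
}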


10. Easy-Cassandra

We use Cassandra, which is a NoSQL database, to save and retrieve data, so we need an integration between Spring MVC and Cassandra. For that we use the Easy-Cassandra API.

Easy-Cassandra is an ORM framework and high-level client for Apache Cassandra in Java. Using it, it is possible to persist information from Java objects in an easy way: to persist information, you add annotations to certain fields and classes. It works as an abstraction tier over Thrift, making the calls to Cassandra. Easy-Cassandra uses the Thrift implementation, and its main objective is to be a simple ORM (object-relational mapper).

Features

An ORM that is easy to use with Cassandra.
You only need to add some annotations to a class to persist information.
Persists many Java objects in an extremely easy way (e.g. all primitive types, java.lang.String, java.math.BigDecimal, java.io.File, etc.).
Compatible with CQL 3.0.
Released under the Apache 2.0 license.
Supports JPA 2.0 annotations.
Works with multiple nodes.
Complex row keys (a key with two or more key columns).
Maps some collections (java.util.List, java.util.Set, java.util.Map).
Automatically finds the other nodes that are part of the same cluster.
Can use multiple keyspaces simultaneously.
Integrates with Spring.
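Since Easy-Cassandra supports JPA 2.0 annotations, a persisted class is essentially an annotated POJO. The sketch below is only an illustration of that idea; the entity, column family and field names are hypothetical, and the exact annotation set depends on the Easy-Cassandra version in use.

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;

// Hypothetical entity mapped to a "search_query" column family.
@Entity(name = "search_query")
public class SearchQuery {

    @Id
    private String id;          // row key

    @Column(name = "text")
    private String text;        // the query text entered by the user

    @Column(name = "country")
    private String country;     // used later for semantic search

    // getters and setters omitted for brevity
}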


Conclusion

We had different options for the NoSQL database, and we compared them based on the features we need for the development of this project. The following is the component feature compatibility table (compared across HBase, MongoDB and Cassandra):

Hortonworks support: HBase 0.96.4 / MongoDB no support / Cassandra no support

Development language: HBase Java / MongoDB C++ / Cassandra Java

Best used: HBase - Hadoop is probably still the best way to run MapReduce jobs on huge datasets; best if you use the Hadoop/HDFS stack already. MongoDB - if you need dynamic queries, prefer to define indexes rather than map/reduce functions, need good performance on a big database, or wanted CouchDB but your data changes too much, filling up disks. Cassandra - when you write more than you read (logging), or if every component of the system must be in Java ("no one gets fired for choosing Apache's stuff").

Main point: HBase - billions of rows x millions of columns. MongoDB - retains some friendly properties of SQL (query, index). Cassandra - best of BigTable and Dynamo.

Server-side scripts: HBase yes / MongoDB JavaScript / Cassandra no

Replication methods: HBase selectable replication factor / MongoDB master-slave replication / Cassandra selectable replication factor

Consistency concepts: HBase immediate consistency / MongoDB eventual or immediate consistency / Cassandra eventual or immediate consistency

Nutch support: HBase 0.90.4 / MongoDB 2.22 / Cassandra 2.2

Hadoop support: HBase 1.2.1 / MongoDB 1.1.x / Cassandra 2.2


Apache Solr vs. ElasticSearch

ElasticSearch was released specifically to make up for the lacking distributed features of Solr. For this reason, it can be easier and more intuitive to start up an ElasticSearch cluster than a SolrCloud cluster.

ElasticSearch will automatically load-balance and move shards to new nodes in the cluster. This automatic shard rebalancing behavior does not exist in Solr.

There was an issue with Solr + Nutch when making Solr distributed, so we chose ElasticSearch for its strong distribution and query features.

Final Component Selection:

We went through all of the above components and, based on our requirements and their respective features, finalized the following:

Parallel processing: Apache Hadoop
Crawling: Apache Nutch
NoSQL database: Cassandra
Searching: ElasticSearch
MVC: Spring MVC
ORM: EasyCassandra


2. Architecture Design

Objective

To identify an architecture that meets all the project requirements for the Geliyoo search engine development. The design is based on the components that we selected and on the configurable items that they provide. We also need to consider non-functional factors such as the number of requests per second, the number of active users, etc. Since there is a fair chance that this site will be under heavy load, we need to design the architecture with that load in mind.


System Design


1. Web Server for Geliyoo.com:

There are three parts of the web application that we propose to develop.

1. Super Admin Panel:

This panel allows the super user to manage various settings of the system and to perform functions such as adding URLs, scheduling the crawling and indexing of those URLs, managing users, etc.

2. User Admin Panel:

The admin panel allows registered administrator users to add the sites they propose to crawl and index, and to see their results.

3. General Users:

A general user is a user who is allowed to search the various sites indexed by the Geliyoo search engine. They are given an interface to search the web.

Since we expect a heavy load on this server, we will have a cluster of web servers for load balancing and high availability.


2. Web Server for hosting the WebService API:

This web server hosts the Web Service API for searching and related functionality. We have separated it from the admin panel functionality in order to manage the load for searching. The web services call the ElasticSearch cluster's API to get the search results.

3. ElasticSearch Cluster:

Fig 2.3: Searching

Indexed data is stored in the file system of the nodes within the cluster. Elasticsearch provides a full query DSL based on JSON to define queries. In general, there are basic queries such as term or prefix. There are also compound queries like the bool query. Queries can also have filters associated with them, such as the filtered or constant_score queries, with specific filter queries. A query is passed to the ElasticSearch cluster, which matches the query parameters and returns the results.

4. Hadoop Cluster:

This is the most important part of the system. It hosts all the services related to crawling and indexing. It also hosts the web services for the functionality provided to admin and super admin users.

Fig: Hadoop Cluster Diagram

Nutch (Crawling & Indexing):

For crawling and indexing we will use Nutch. The following is the current architecture of the Nutch crawler.

Fig: Nutch basic components and flow


Two procedures take place in the overall system:

1. Crawling:

Crawling is a continuous process. Injection is done only once, when the seed URLs are injected, but all other operations except inject are performed repeatedly until the desired crawl depth is reached.

All of these operations hand their jobs to Hadoop, and Hadoop performs the tasks in parallel by distributing them among the nodes.

The following operations are performed for crawling with Nutch:

Inject

The nutch inject command adds a list of seed URLs to the database for your crawl. It takes seed URL files from an HDFS directory. URL validation rules can be defined in Nutch; they are checked during the inject and parse operations, and URLs that do not validate are rejected while the rest are inserted into the database.

Generate

The nutch generate command takes the list of outlinks generated in a previous cycle, promotes them to the fetch list, and returns a batch ID for this cycle. You will need this batch ID for subsequent calls in this cycle. The number of top-scoring URLs to select can be passed as an argument to this operation.


Fetch

The nutch fetch command crawls the pages listed in the column family and writes out their contents into new columns. We need to pass in the batch ID from the previous step. We can also pass the value -all instead of a batch ID if we want to fetch all URLs.

Parse

The nutch parse command loops through all the pages, analyzes the page content to find outgoing links, and writes them out into another column family.

Updatedb

The nutch updatedb command takes the URL values from the previous stage and places them into another column family, so they can be fetched in the next crawl cycle.

2. Indexing

Figure 2.2

Indexing is done by ElasticSearch, which is configured with Nutch; Nutch is responsible for triggering the indexing operation.

The Elastic index command takes two mandatory arguments:

● The first argument is the cluster name.
● The second is either a batch ID (obtained from the previous Nutch operations), "all" (for all non-indexed data) or "reindex" (to re-index all data).

After executing the command, Nutch hands the job to Hadoop, and Hadoop divides the job into smaller tasks. Each task stores indexed data on the file system of the ElasticSearch cluster in a distributed manner.
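For reference, in deploy mode this corresponds to an invocation of the following form (the cluster name "geliyoosearch" is an example, not the project's actual cluster name):

bin/nutch elasticindex geliyoosearch -all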


Hadoop:

A small Hadoop cluster includes a single master and multiple worker nodes. The master

node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker

node acts as both a DataNode and TaskTracker, though it is possible to have data-only

worker nodes and compute-only worker nodes.

In a larger cluster, the HDFS is managed through a dedicated NameNode server to host

the file system index, and a secondary NameNode that can generate snapshots of the

NameNode's memory structures, thus preventing file-system corruption and reducing loss

of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters

where the Hadoop MapReduce engine is deployed against an alternate file system, the

NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the

file-system-specific equivalent.


5. Cassandra Cluster

A Cassandra cluster contains one or more data centers, and each data center has a number of nodes. Cassandra stores the crawled data in a distributed manner, resulting in good load balancing. Key features of Cassandra's distributed architecture are specifically tailored for multiple-data-center deployment, for redundancy, for failover and for disaster recovery.


6. Horizontal Web Server Clustering

Objective:

Circumstances may occur in which the machine on which the Geliyoo Search API or the Geliyoo web application is deployed goes down or becomes slow because of heavy traffic. To cope with this, we need Tomcat server clustering, in which our API and application are deployed on multiple machines (at least more than one), so that if one server in the cluster goes down, the other servers in the cluster can take over, as transparently to the end user as possible.

Process:

Under horizontal clustering there can be any number of systems, and on each system we have one Tomcat server running. To build the horizontal Tomcat cluster, we use the Apache HTTP Server. The Apache httpd server runs on only one of the systems and controls all the Tomcats running on the other systems, including the one installed on the same system. We also use mod_jk as the load balancer. mod_jk is an Apache module used to connect the Tomcat servlet container with web servers such as Apache.


The Apache HTTP Server and mod_jk can be used to balance server load across multiple Tomcat instances, or to divide Tomcat instances into various namespaces managed by the Apache HTTP Server.

Requests hit the Apache server in front and are distributed to the backend Tomcat containers depending on load and availability. The clients know of only one IP (Apache's), but the requests are distributed over multiple containers. This is useful when you deploy a distributed web application and need it to be robust.

By using Apache HTTP as a front end, you let it act as a front door to your content across multiple Apache Tomcat instances. If one of your Tomcats fails, Apache HTTP ignores it. The Tomcats can each be placed in a protected area and, from a security point of view, you only need to worry about the Apache HTTP server. Essentially, Apache becomes a smart proxy server: you can load-balance multiple instances of your application behind Apache. This allows you to handle more volume and increases stability in the event one of your instances goes down. Apache Tomcat uses Connector components to allow communication between a Tomcat instance and another party, such as a browser, a server, or another Tomcat instance that is part of the same network.

Configuring this involves enabling mod_jk in Apache, configuring an AJP connector in your application server, and directing Apache to forward certain paths to the application server via mod_jk, as sketched below.

The mod_jk connector allows httpd to communicate with Apache Tomcat instances over the AJP protocol. AJP, an acronym for Apache JServ Protocol, is a wire protocol. It is an optimized version of the HTTP protocol that allows a standalone web server such as Apache to talk to Tomcat. The idea is to let Apache serve static content when possible, but proxy requests to Tomcat for Tomcat-related content.
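A minimal configuration sketch follows; all hostnames, ports, paths and worker names are assumptions for illustration, not the project's actual values.

workers.properties (read by mod_jk on the Apache host):

worker.list=loadbalancer
worker.tomcat1.type=ajp13
worker.tomcat1.host=192.168.1.11
worker.tomcat1.port=8009
worker.tomcat2.type=ajp13
worker.tomcat2.host=192.168.1.12
worker.tomcat2.port=8009
worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=tomcat1,tomcat2

httpd.conf (enable mod_jk and forward application paths to the balancer):

LoadModule jk_module modules/mod_jk.so
JkWorkersFile conf/workers.properties
JkLogFile logs/mod_jk.log
JkMount /* loadbalancer

server.xml on each Tomcat (the AJP connector that mod_jk talks to):

<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />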

Conclusion

We have tested the current environment with different combinations of URLs and cluster nodes. For these test combinations we measured HDFS_BYTES_READ (bytes), virtual memory (bytes) and physical memory (bytes) of the cluster.


3. Component Configuration

Objective

We are using a lot of open source components for the purpose of creating this search engine. Components such as Hadoop, Nutch and Cassandra need to be configured to achieve what is required for developing the search engine.

After analysis, we decided to configure the best combination of clusters on an OVH dedicated server and also on the development environment. We decided to implement the following:

o one Hadoop master node,

o four Hadoop slave nodes, and

o one Cassandra node.


Configuration Parameters

Hortonworks Configuration

1. Minimum Requirement

● Operating System

○ Red Hat Enterprise Linux (RHEL) v5.x or 6.x (64-bit)

○ CentOS v5.x or 6.x (64-bit)

○ Oracle Linux v5.x or 6.x (64-bit)

○ SUSE Linux Enterprise Server (SLES) 11, SP1 (64-bit)

● Browser Requirements

○ Windows (Vista, 7)

○ Internet Explorer 9.0 and higher (for Vista + Windows 7)

○ Firefox latest stable release

○ Safari latest stable release

○ Google Chrome latest stable release

○ Mac OS X (10.6 or later)

Firefox latest stable release

Safari latest stable release

Google Chrome latest stable release

○ Linux (RHEL, CentOS, SLES, Oracle Linux)

Firefox latest stable release

Google Chrome latest stable release

● Software Requirements

○ yum

○ rpm

○ scp

○ curl

○ php_curl

○ wget

○ JDK Requirement

Oracle JDK 1.6.0_31 64-bit

Oracle JDK 1.7 64-bit

Open JDK 7 64-bit

2. Set Up Password-less SSH

Generate public and private SSH keys on the Ambari Server host.

o ssh-keygen

Copy the SSH Public Key (.ssh/id_rsa.pub) to the root account on your target

o scp /root/.ssh/id_rsa.pub <username>@<hostname>:/root/.ssh


Add the SSH Public Key to the authorized_keys file on your target hosts.

o cat id_rsa.pub >> authorized_keys

Set permissions on the .ssh directory (to 700) and the authorized_keys file in that directory (to 600) on the target hosts.

o chmod 700 ~/.ssh

o chmod 600 ~/.ssh/authorized_keys

From the Ambari Server, make sure you can connect to each host in the cluster using

SSH.

o ssh root@{remote.target.host}

3. Enable NTP

If it is not installed, install and enable it:

o yum install ntp

o chkconfig ntpd on

o ntpdate 0.centos.pool.ntp.org

o service ntpd start

4. Check DNS

Edit Host file

o Open host file on every host in your cluster

vi /etc/hosts

o Add a line for each host in your cluster. The line should consist of the IP address

and the FQDN. For example:

1.2.3.4 fully.qualified.domain.name

Set Hostname

o Use the "hostname" command to set the hostname on each host in your cluster.

For example:

hostname fully.qualified.domain.name

o Confirm that the hostname is set by running the following command:

hostname -f

Edit the Network Configuration File

o Using a text editor, open the network configuration file on every host. This file is

used to set the desired network configuration for each host. For example:

vi /etc/sysconfig/network

Modify the HOSTNAME property to set the fully.qualified.domain.name.

NETWORKING=yes

NETWORKING_IPV6=yes

HOSTNAME=fully.qualified.domain.name

5. Configuring Iptables

Temporarily disable iptables:

chkconfig iptables off

/etc/init.d/iptables stop

Note: You can restart iptables after setup is complete.

6. Disable SELinux and PackageKit and check the umask Value

● SELinux must be temporarily disabled for the Ambari setup to function. Run the following

command on each host in your cluster:

o setenforce 0

● On the RHEL/CentOS installation host, if PackageKit is installed, open

/etc/yum/pluginconf.d/refresh-packagekit.conf with a text editor and make this change:

o enabled=0

● Make sure umask is set to 022.


Installing and running Ambari Server

1. Log into the machine that will serve as the Ambari Server as root. You may log in and sudo as su if this is what your environment requires. This machine is the main installation host.

2. Download the Ambari repository file and copy it to your repos.d directory.

RHEL, CentOS, and Oracle Linux 5:
wget http://public-repo-1.hortonworks.com/ambari/centos5/1.x/updates/1.4.1.61/ambari.repo
cp ambari.repo /etc/yum.repos.d

RHEL, CentOS and Oracle Linux 6:
wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.4.1.61/ambari.repo
cp ambari.repo /etc/yum.repos.d

SLES 11:
wget http://public-repo-1.hortonworks.com/ambari/suse11/1.x/updates/1.4.1.61/ambari.repo
cp ambari.repo /etc/yum.repos.d

Table I.2.1. Download the repo

3. Install ambari server on master

yum install ambari-server

4. Set up the Master Server

ambari-server setup

o If you have not temporarily disabled SELinux, you may get a warning. Enter ‘y’ to

continue.

o By default, Ambari Server runs under root. If you want to create a different user to

run the Ambari Server instead, or to assign a previously created user, select y at

Customize user account for ambari-server daemon and give the prompt the

username you want to use.

o If you have not temporarily disabled iptables you may get a warning. Enter y to

continue. See Configuring Ports for (2.x) or (1.x) for more information on the ports

that must be open and accessible.

o Agree to the Oracle JDK license when asked. You must accept this license to be

able to download the necessary JDK from Oracle. The JDK is installed during the

deploy phase.

Note: By default, Ambari Server setup will download and install Oracle JDK 1.6. If you

plan to download this JDK and install on all your hosts, or plan to use a different

version of the JDK, skip this step and see Setup Options for more information

o At Enter advanced database configuration:


To use the default PostgreSQL database, named ambari, with the default

username and password (ambari/bigdata), enter n.

To use an existing Oracle 11g r2 instance or to select your own database

name, username and password for either database, enter y.

Select the database you want to use and provide any information required

by the prompts, including hostname, port, Service Name or SID,

username, and password.

o Setup completes

5. Start the Ambari Server

1) To start the Ambari Server:

o ambari-server start

2) To check the Ambari Server processes:

o ps -ef | grep Ambari

3) To stop the Ambari Server:

o ambari-server stop

6. Installing, Configuring and deploying cluster

1) Step 1: Point your browser to http://{main.install.hostname}:8080.

2) Step 2: Log in to the Ambari Server using the default username/password:

admin/admin.

3) Step 3: At welcome screen, type a name for the cluster you want to create in the text

box. No white spaces or special characters can be used in the name.

Select version of hdp and click on next.

4) Step 4: At Install option:

o Use the Target Hosts text box to enter your list of host names, one per line. You

can use ranges inside brackets to indicate larger sets of hosts. For example, for

host01.domain through host10.domain use host[01-10].domain

o If you want to let Ambari automatically install the Ambari Agent on all your hosts

using SSH, select Provide your SSH Private Key and either use the Choose File

button in the Host Registration Information section to find the private key file that

matches the public key you installed earlier on all your hosts or cut and paste the

key into the text box manually.

o Fill in the username for the SSH key you have selected. If you do not want to use

root, you must provide the username for an account that can execute sudo

without entering a password

o If you do not want Ambari to automatically install the Ambari Agents, select

Perform manual registration. See Appendix: Installing Ambari Agents Manually for

more information.

o Advanced Options


a) If you want to use a local software repository (for example, if your installation

does not have access to the Internet), check Use a Local Software Repository.

For more information on using a local repository see Optional: Configure the

Local Repositories

b) Click the Register and Confirm button to continue.

5) Step 5: Confirm hosts

If any host shows a warning, click "Click here to see the warnings" to see a list of what was checked and what caused the warning. On the same page you can get access to a Python script that can help you clear any issues you may encounter, and you can then run Rerun Checks.

Python script for clearing a host:

python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py

6) When you are satisfied with the list of hosts, click Next.

7) Step 7: Choose services

8) Step 8: Assign masters

9) Step 9: Assign slaves and clients

10) Step 10: Customize Services

o Add the property hbase.data.umask.enable = true in the custom hbase-site.xml.

o Add the Nagios password and email address for notifications.

11) Step 11: Review the configuration and install.

Nutch configuration on the Hadoop Master Node

● Download Nutch

○ wget http://www.eu.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz

● Untar the Nutch tar file

○ tar -vxf apache-nutch-2.2.1-src.tar.gz

● Export the Nutch class path

○ export NUTCH_HOME=/home/hadoop/webcrawer/apache-nutch-2.2.1

○ export PATH=$NUTCH_HOME/runtime/deploy/bin:$PATH

● Change $NUTCH_HOME/conf as below

○ Add the following properties to the nutch-site.xml file (storage.data.store.class selects the Gora Cassandra store):

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.cassandra.store.CassandraStore</value>
</property>
<property>
  <name>http.agent.name</name>
  <value>GeliyooBot</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>GeliyooBot.*</value>
</property>

○ Add the following properties to the gora.properties file:

gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
gora.cassandrastore.servers=localhost:9160

○ Add the dependency in $NUTCH_HOME/ivy/ivy.xml:

<dependency org="org.apache.gora" name="gora-cassandra" rev="0.3" conf="*->default" />

● Go to the Nutch installation folder ($NUTCH_HOME) and run:

ant clean
ant runtime


Cassandra Configuration

● Download the DataStax Community tarball

curl -L http://downloads.datastax.com/community/dsc.tar.gz | tar xz

● Go to the install directory:

○ $ cd dsc-cassandra-2.0.x

● Start Cassandra Server

○ $ sudo bin/cassandra

● Verify that DataStax Community is running. From the install directory:

○ $ bin/nodetool status

Install GUI Client for Cassandra

● Download WSO2 Carbon Server

○ wget https://www.dropbox.com/s/m00uodj1ymkpdzb/wso2carbon-4.0.0-

SNAPSHOT.zip

● Extract zip File

● Start WSO2 Carbon Server

○ Go to $WSO2_HOME/bin

○ sh wso2server.sh -Ddisable.cassandra.server.startup=true

and log in with the default username and password (admin, admin).

Fig: List of keyspaces.


ElasticSearch Configuration

● Download ElasticSearch

○ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.19.4.tar.gz

● Untar the ElasticSearch file

○ tar -vxf elasticsearch-0.19.4.tar.gz

● Start the ElasticSearch server in the foreground

○ bin/elasticsearch -f

● User interface of ElasticSearch

○ Fig: Index information

○ Fig: Index data


Run Nutch jobs on Hadoop Master Node

● Create a directory in HDFS to upload the seed URLs ("urls" is the HDFS directory name).

○ hadoop dfs -mkdir urls

● Create a text file with the seed URLs for the crawl and upload it to that directory.

○ hadoop dfs -put seed.txt urls

● Run the inject job

○ nutch inject urls

● Run the generate job

○ nutch generate -topN N

● Run the nutch fetch job

○ nutch fetch -all

● Run the nutch parse job

○ nutch parse -all

● Run the nutch updatedb job

○ nutch updatedb

Conclusion

After configuring all of these frameworks, we achieved basic crawling and basic text search. We are now ready to crawl billions of URLs and index them. After indexing this content into ElasticSearch, we can get text results in JSON format. We used curl to fetch data from ElasticSearch: when we pass search parameters using curl, we get a JSON result containing fields such as the content, url, content type and digest.
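A minimal example of such a curl request against the ElasticSearch HTTP API is sketched below; the host, port and search term are assumptions, and the set of returned fields depends on the Nutch indexing configuration.

curl -XGET 'http://localhost:9200/_search?pretty' -d '{
  "query": { "query_string": { "query": "geliyoo" } }
}'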


4. Development

Objective

The main goal of this development phase is to implement an intermediate API. This API communicates with the Geliyoo search engine: when a user submits a query to the Geliyoo search UI, the UI passes the query to GeliyooSearchApi. Based on this query, the API gets the results from ElasticSearch and returns them to the user. A sketch of such an endpoint follows.
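The sketch below shows roughly what such an intermediate endpoint looks like in Spring MVC; the controller name, URL mapping, parameters and SearchService interface are all hypothetical and not taken from the project's code.

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.ResponseBody;

// Hypothetical service that builds the ElasticSearch query and returns the JSON result.
interface SearchService {
    String search(String query, int page);
}

@Controller
public class SearchApiController {

    @Autowired
    private SearchService searchService;

    // Receives the query from the Geliyoo search UI and returns the search result as JSON.
    @RequestMapping("/api/search")
    @ResponseBody
    public String search(@RequestParam("q") String query,
                         @RequestParam(value = "page", defaultValue = "1") int page) {
        return searchService.search(query, page);
    }
}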


Development Completed

Prototype

We focused first on the user side of the application, i.e. the basic search engine, and hence decided to start with prototype development for it. For that we made the following two prototypes for this web application:

● Search Page
● Result Page

We will make more prototypes as development continues.

Implementation:

The implementation has four main parts:

1) Configuration of the selected components, which we covered in the previous topic,
2) The Web API development,
3) The extension of these components to allow extended searching (i.e. searching that is not provided by the components out of the box), and
4) The web application.

The Web Application:

With the above prototypes, we implemented basic search as described below.


To search for any word or text, the user needs to enter that text in the search box, as shown in the image. The search starts as soon as the user enters a single character in the search box, and the results are displayed as in the image above.

The following things should be noticed in the image:

● Titles: Titles are links pointing to the URLs containing information about the searched word.
● Highlighted word: The words or text searched by the user are highlighted in the results.
● Pagination: At the bottom of the screen there is pagination. Each page shows 10 results, so the user can easily navigate between pages and does not need to scroll much on a single page.
● Search Box: At the top of the screen there is a search box. The user can edit the text they searched for, or search for a new word or text, without going back to the search page.


If there is no information for the word or text searched by the user, we display a message as shown above.

REST API

When a user makes a request for searching, crawling or indexing, it is a call to the web server API for crawling, which is deployed on another server; that server hosts the Hadoop master and connects to all the Hadoop slaves. Nutch manages this Hadoop cluster by submitting jobs for crawling and indexing. This part still remains to be done and is covered under future development.


Currently we are working on the part of the overall architecture shown in the figure above. When a user submits a query, the web application calls a RESTful API. The API is responsible for the web search based on the query: it builds the query and calls the ElasticSearch cluster for results using the Jest client. For web searching, the user enters keywords as a query and gets back a list of web URLs, each with a small piece of the site's content containing the keywords, with highlighting. Each query is stored in the Cassandra database for the semantic search functionality.

We are also working on image searching based on keywords. For this we need to crawl and index all web images. Apache Nutch 2.2 is unable to crawl images out of the box, so we tried adding several parser plugins for parsing images, went through the Tika parser, and modified the crawling code to fetch images and parse them in order to create indexes.

Jest is a Java HTTP REST client for ElasticSearch. As mentioned in the section above, ElasticSearch is an open source (Apache 2), distributed, RESTful search engine built on top of Apache Lucene. ElasticSearch already has a Java API, which is also used by ElasticSearch internally, but Jest fills a gap: it is the missing client for the ElasticSearch HTTP REST interface. The Jest client sends the request to the ElasticSearch cluster, a JSON response is returned from the API, and it is then forwarded to the web application from where the request was initiated.
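A minimal sketch of a Jest search call is shown below; it assumes a Jest version that provides HttpClientConfig and Search.Builder, and the ElasticSearch host, index name "webpage" and query text are assumptions for illustration only.

import io.searchbox.client.JestClient;
import io.searchbox.client.JestClientFactory;
import io.searchbox.client.JestResult;
import io.searchbox.client.config.HttpClientConfig;
import io.searchbox.core.Search;

public class JestSearchExample {

    public static void main(String[] args) throws Exception {
        // Connect to the ElasticSearch HTTP endpoint.
        JestClientFactory factory = new JestClientFactory();
        factory.setHttpClientConfig(
                new HttpClientConfig.Builder("http://localhost:9200").multiThreaded(true).build());
        JestClient client = factory.getObject();

        // Query string search with highlighting on the "content" field.
        String query = "{ \"query\": { \"query_string\": { \"query\": \"geliyoo\" } },"
                     + "  \"highlight\": { \"fields\": { \"content\": {} } } }";
        Search search = new Search.Builder(query).addIndex("webpage").build();
        JestResult result = client.execute(search);

        // The raw JSON is what would be forwarded to the web application.
        System.out.println(result.getJsonString());
    }
}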

As a result of this search, the API returns the total number of pages, the total number of results, and the list of all matching web sites with their content and titles.

We deployed the web application and the web API on different servers to balance the request load.

Future Development

Content Searching

Currently we are working on the basic search functionality. For that we crawl websites and save their text content. When a user searches for any text or word, we use this content to get results, so the results are limited to the text content of websites. Users can already search for any text or word in any language with this functionality. Once we fully achieve this basic search functionality, we plan to work on functionality that makes all the information about a piece of content searchable (such as its name, text and metadata), and the user will also be allowed to perform specific searches in the following categories:

Image search

Video search

News search

Sports search

Audio search


Forum search

Blog search

Wiki search

Pdf search

Ebay, Amazon, Twitter, iTunes search

Using these functionalities, the user can make more specific searches and get the desired results faster. For that we will crawl whole websites including images, videos, news, etc. and save information such as name, URL, metadata and content type. When the user searches for any text or word, we will use this information to get the search results. This means that, with these functionalities, every piece of information about a content item becomes searchable and the user gets better results. When the user wants to make a specific search, we will filter on the content type of the saved information. For example, if the user wants to search images only, we will restrict the content type to images and go through our saved information for images only. It is important to note that we search for images by matching the entered text against the images' names, URLs, metadata, etc.

Semantic Search

We will use the "semantic search" concept to improve our search functionality so that users get the desired results faster. Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable data space. Semantic search systems consider various signals, including the context of the search, location, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries, to provide relevant search results. We will save the user's URL, country, browser, time and similar information, along with the text they searched for. When the user searches for information, we will use their past searches and history to produce more user-specific results.


Prototypes

Prototypes for some of the future development may be as below:

Image search

Video search