DS_CS_Neustar_JMR_0


    Efficient crawling of web pages (URLs crawled, parsed, and classified per day were on the order of millions).

    Solution

    We implemented the solution using several popular open source big data components such as:

    Hive (data retrieval and exploratory analysis).

    Nutch (web crawling and parsing).

    Mahout and R (machine learning).

    These platforms were a good fit because they can handle large volumes of unstructured data. Here's a high-level overview of the workflow:

    Accessing and processing data from multiple databases in multiple formats.

    Crawler customization for efficient and fast crawling (contributed back to open source as a patch).

    Language identification and multilingual classification of web pages (a routing sketch follows this list).

    Machine learning techniques such as clustering and matrix decomposition for customer segmentation and text mining.

    Geo-locating IPs to identify user patterns from different regions.

    Analyzing the persistence of cookies.
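    As noted in the list above, language identification routes each crawled page to the right language model before classification. Below is a minimal Python sketch of that routing step; it assumes the langdetect package and toy sample pages, whereas the production pipeline ran on the Hadoop/Nutch stack, so this is illustrative only.

```python
# Minimal sketch: detect a page's language so English and Spanish documents
# can be routed to their respective classifiers. The sample pages are
# hypothetical; the real pipeline handled this inside the crawl/classify flow.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make detection deterministic across runs

def route_by_language(pages):
    """Split crawled page texts into English, Spanish, and other buckets."""
    buckets = {"en": [], "es": [], "other": []}
    for page in pages:
        try:
            lang = detect(page)
        except Exception:  # langdetect raises on empty or undetectable text
            lang = "other"
        buckets[lang if lang in ("en", "es") else "other"].append(page)
    return buckets

pages = [
    "Football scores and match highlights from last night",
    "Noticias de deportes y resultados de los partidos",
]
print({lang: len(docs) for lang, docs in route_by_language(pages).items()})
```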

    Highlights


    Here's how it works:

    The list of URLs from the DNS data is passed to a web crawler.

    The web crawler fetches the text contents.
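    The fetch step itself can be pictured with the short sketch below; Nutch handled crawling and parsing in the actual solution, so the requests/BeautifulSoup calls and the example URL here are stand-ins for illustration.

```python
# Minimal sketch of the fetch step: download a page and reduce it to the
# visible text that the classifier will consume. (Illustrative only; the
# production crawler was a customized Apache Nutch.)
import requests
from bs4 import BeautifulSoup

def fetch_text(url):
    """Fetch a URL and return its visible text, or None on failure."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    soup = BeautifulSoup(resp.text, "html.parser")
    return soup.get_text(separator=" ", strip=True)

print(fetch_text("https://example.com"))  # hypothetical URL
```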

    It then categorizes the text into pre-defined categories such as sports, arts & entertainment, family & parenting, etc., using a classifier based on matrix factorization techniques.

    This method models the latent structure of a given collection of texts and finds semantically related documents.

    The scope of this project required that we categorize Spanish-language documents as well. Thus, we built an additional model along the same lines.
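    A minimal sketch of such a matrix-factorization classifier is shown below: TF-IDF vectors are factorized with non-negative matrix factorization (NMF) to expose latent topics, and a new page is assigned to the predefined category whose seed example lies closest in that latent space. The actual solution used Mahout and R; the scikit-learn calls and the tiny category seed documents here are assumptions for illustration.

```python
# Sketch: matrix-factorization-based text categorization.
# TF-IDF -> NMF latent topics -> nearest predefined category by cosine similarity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical seed documents, one per predefined category.
seed_docs = {
    "sports": "live scores league match players team win goal",
    "arts & entertainment": "movie premiere music concert celebrity film show",
    "family & parenting": "baby toddler school parenting advice kids family",
}
labels = list(seed_docs)

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(seed_docs.values())

nmf = NMF(n_components=3, max_iter=500, random_state=0)
category_topics = nmf.fit_transform(tfidf)  # one latent vector per category seed

def classify(page_text):
    """Return the predefined category closest to the page in latent-topic space."""
    page_topics = nmf.transform(vectorizer.transform([page_text]))
    sims = cosine_similarity(page_topics, category_topics)[0]
    return labels[int(np.argmax(sims))]

print(classify("the team scored twice to win the final match"))  # expected: 'sports'
```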

    The above procedure yields the following:

    The number of times a user visits a particular category of sites within a given time span.

    The vectors (representing each user's behavior) are then passed to a clustering algorithm.

    The algorithm then clusters together the users with similar browsing behavior.
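    The clustering step can be pictured with the following sketch, which groups per-user category-visit vectors with k-means. Mahout was used in the actual solution; the scikit-learn KMeans call and the sample counts are assumptions made for illustration.

```python
# Sketch: segment users by clustering their per-category visit counts.
import numpy as np
from sklearn.cluster import KMeans

# Rows: users. Columns: visits per category (sports, arts & entertainment,
# family & parenting) within the chosen time span. Numbers are hypothetical.
user_vectors = np.array([
    [12, 1, 0],   # mostly sports
    [10, 2, 1],
    [0, 9, 8],    # entertainment and parenting
    [1, 11, 7],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(user_vectors)
print(kmeans.labels_)  # users with similar browsing behavior share a label
```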

    The system then leverages a dataset containing the user's IP, browser ID, and browser type, and drops the relevant ad into the user's browser.

    The system analyzes the co-occurrence pattern of the IP-Browser-ID pair to find the strength of association of an IP with the browser.
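    One simple way to picture this co-occurrence analysis is the sketch below, which scores each (IP, browser ID) pair by its share of that IP's total events; the scoring rule and the sample log records are assumptions for illustration, not the exact production method.

```python
# Sketch: score the association between an IP and a browser ID by how often
# the pair co-occurs relative to all events seen for that IP.
from collections import Counter, defaultdict

events = [  # hypothetical (ip, browser_id) log records
    ("203.0.113.7", "brw-aaa"),
    ("203.0.113.7", "brw-aaa"),
    ("203.0.113.7", "brw-bbb"),
    ("198.51.100.2", "brw-ccc"),
]

pair_counts = Counter(events)
ip_totals = defaultdict(int)
for (ip, _), count in pair_counts.items():
    ip_totals[ip] += count

association = {pair: count / ip_totals[pair[0]] for pair, count in pair_counts.items()}
for (ip, browser_id), score in sorted(association.items()):
    print(f"{ip} <-> {browser_id}: {score:.2f}")
```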

    A targeted ad is delivered to the end user based on their browsing pattern.

    © 2015 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies.

    June 2015

    About Impetus

    Impetus is focused on creating big business impact through Big Data Solutions for Fortune 1000 enterprises across multiple verticals. The company brings together a unique mix of software products, consulting services, Data Science capabilities and technology expertise. It offers full life-cycle services for Big Data implementations and real-time streaming analytics, including technology strategy, solution architecture, proof of concept, production implementation and on-going support to its clients.

    To learn more, visit www.impetus.com or write to us at [email protected].