7/24/2019 DS_CS_Neustar_JMR_0
Efficient crawling of web pages (the URLs crawled, parsed, and
classified per day numbered in the millions).
Solution
We implemented the solution using several popular open source big
data components such as:
Hive (data retrieval and exploratory analysis).
Nutch (web crawling and parsing).
Mahout and R (machine learning).
These platforms were a good fit because they can handle large
volumes of unstructured data. Here's a high-level overview of the
workflow:
Highlights
Accessing and processing data from multiple databases in multiple formats.
Crawler customization for efficient and fast crawling (open source contribution as a patch).
Language identification and multilingual classification of web pages.
Machine learning techniques such as clustering and matrix decomposition for customer segmentation and text mining.
Geo-locating IPs to identify user patterns from different regions.
Analysing the persistence of cookies.
Here's how it works:
The list of URLs from the DNS data passes onto a web crawler.
The web crawler fetches the text contents.
It then categorizes the text into pre-defined categories such as sports, arts & entertainment, family &
parenting, etc. using a classifier based on matrix factorization techniques.
This method models the latent structure of a given collection of texts and finds semantically related
documents.
The scope of this project required that we also categorize Spanish-language documents, so we built
an additional model along the same lines.
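The idea of modelling latent structure to find semantically related documents can be sketched as follows. The source does not name the exact factorization, so this example assumes a latent-semantic-analysis style approach (a truncated eigendecomposition of the document Gram matrix, which is equivalent to a truncated SVD of the term-document matrix). The sample documents, function names, and the choice of two latent dimensions are illustrative assumptions, not the production Mahout/R setup.

```python
import math
from collections import Counter


def term_document_rows(docs):
    """Return one term-count row per document over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([float(counts[word]) for word in vocab])
    return rows


def top_eigenpairs(matrix, k, iterations=300):
    """Power iteration with deflation on a symmetric positive semi-definite matrix."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        # Non-uniform start vector, so it is unlikely to be orthogonal to the target.
        vec = [float(i + 1) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue estimate for the converged vector.
        value = sum(vec[i] * work[i][j] * vec[j] for i in range(n) for j in range(n))
        pairs.append((value, vec))
        # Deflate: remove the component just found before searching for the next one.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def latent_document_vectors(docs, k=2):
    """Project each document onto the top-k latent dimensions (rows of U * Sigma)."""
    rows = term_document_rows(docs)
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(max(value, 0.0)) * vec[i] for value, vec in pairs]
            for i in range(len(docs))]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical page texts: two about sports, two about family & parenting.
docs = [
    "football match goal team",
    "team goal score football",
    "baby sleep parenting tips",
    "parenting tips toddler baby sleep",
]
vectors = latent_document_vectors(docs, k=2)
print("sports vs sports:   ", round(cosine(vectors[0], vectors[1]), 3))
print("sports vs parenting:", round(cosine(vectors[0], vectors[2]), 3))
```

In this toy setting, documents on the same topic end up close together in the latent space, which is the property a category classifier can then build on. A Spanish-language model would follow the same steps on a Spanish corpus.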
The crawling and classification procedure above yields the following:
The number of times a user visits a particular category of sites within a given time span.
The vectors (representing each user's behavior) are then passed to a clustering algorithm.
The algorithm then clusters together the users with similar browsing behavior.
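As a rough illustration of the two steps just described, the sketch below turns per-user visit logs into category-count vectors and groups them. The source only says "clustering algorithm", so k-means is an assumption here, as are the category list, the function names, and the sample visit logs; the actual system used Mahout and R rather than this standard-library code.

```python
from collections import Counter

# Hypothetical category list; the source names only a few examples.
CATEGORIES = ["sports", "arts & entertainment", "family & parenting"]


def visit_vector(visited_categories, categories=CATEGORIES):
    """Count how often a user visited each category within the time span."""
    counts = Counter(visited_categories)
    return [counts[category] for category in categories]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical visit logs for four users.
logs = {
    "user_a": ["sports"] * 5 + ["arts & entertainment"],
    "user_b": ["sports"] * 4,
    "user_c": ["family & parenting"] * 6 + ["arts & entertainment"],
    "user_d": ["family & parenting"] * 5,
}
vectors = [visit_vector(visits) for visits in logs.values()]
labels = kmeans(vectors, k=2)
print(dict(zip(logs, labels)))
```

Users with similar category counts receive the same cluster label, which is the grouping the ad-targeting step relies on.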
The system then leverages a dataset containing the user's IP, browser ID and browser type and drops the
relevant ad into the user's browser.
The system analyzes the co-occurrence pattern of the IP-Browser-ID pair to find the strength of association
of an IP with the browser.
A targeted ad is delivered to the end user based on their browsing pattern.
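The co-occurrence analysis of IP and browser-ID pairs mentioned above could look something like the sketch below. The source does not define "strength of association", so this example assumes one simple measure: the share of a browser's observations that came from a given IP. The function name and the sample records (which use reserved documentation IP ranges) are hypothetical.

```python
from collections import Counter


def association_strength(observations):
    """Map each (ip, browser_id) pair to the fraction of that browser's
    observations seen from that IP (an assumed measure, see lead-in)."""
    pair_counts = Counter(observations)
    browser_counts = Counter(browser_id for _, browser_id in observations)
    return {
        (ip, browser_id): count / browser_counts[browser_id]
        for (ip, browser_id), count in pair_counts.items()
    }


# Hypothetical (ip, browser_id) records.
observations = (
    [("192.0.2.10", "browser-1")] * 3
    + [("198.51.100.7", "browser-1")]
    + [("192.0.2.10", "browser-2")] * 2
)
for pair, strength in sorted(association_strength(observations).items()):
    print(pair, strength)
```

A pair with a high score suggests the browser is usually seen behind that IP, while a low score points to an occasional or shared connection.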
© 2015 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies.
June 2015
About Impetus
Impetus is focused on creating big business impact through Big Data Solutions for
Fortune 1000 enterprises across multiple verticals. The company brings together a
unique mix of software products, consulting services, Data Science capabilities and
technology expertise. It offers full life-cycle services for Big Data implementations and
real-time streaming analytics, including technology strategy, solution architecture, proof of
concept, production implementation and on-going support to its clients.
To learn more, visit www.impetus.com or write to us at [email protected].