DS_CS_Neustar_JMR_0


    Efficient crawling of web pages (URLs crawled, parsed, and classified per day were on the order of millions).

    Solution

    We implemented the solution using several popular open source big data components such as:

    Hive (data retrieval and exploratory analysis).

    Nutch (web crawling and parsing).

    Mahout and R (machine learning).

    These platforms were a good fit because they can handle large volumes of unstructured data. Here's a high-level overview of the workflow:

    Accessing and processing data from multiple databases in multiple formats.

    Crawler customization for efficient and fast crawling (contributed back to open source as a patch).

    Language identification and multilingual classification of web pages (a routing sketch follows this list).

    Machine learning techniques such as clustering and matrix decomposition for customer segmentation and text mining.

    Geo-locating IPs to identify user patterns from different regions.

    Analyzing the persistence of cookies.
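    As noted in the list above, language identification routes each crawled page to the right language model before classification. Below is a minimal Python sketch of that routing step; it assumes the langdetect package and toy sample pages, whereas the production pipeline ran on the Hadoop/Nutch stack, so this is illustrative only.

```python
# Minimal sketch: detect a page's language so English and Spanish documents
# can be routed to their respective classifiers. The sample pages are
# hypothetical; the real pipeline handled this inside the crawl/classify flow.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make detection deterministic across runs

def route_by_language(pages):
    """Split crawled page texts into English, Spanish, and other buckets."""
    buckets = {"en": [], "es": [], "other": []}
    for page in pages:
        try:
            lang = detect(page)
        except Exception:  # langdetect raises on empty or undetectable text
            lang = "other"
        buckets[lang if lang in ("en", "es") else "other"].append(page)
    return buckets

pages = [
    "Football scores and match highlights from last night",
    "Noticias de deportes y resultados de los partidos",
]
print({lang: len(docs) for lang, docs in route_by_language(pages).items()})
```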

    Highlights


    Here's how it works:

    The list of URLs from the DNS data is passed to a web crawler.

    The web crawler fetches the text contents.
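    The fetch step itself can be pictured with the short sketch below; Nutch handled crawling and parsing in the actual solution, so the requests/BeautifulSoup calls and the example URL here are stand-ins for illustration.

```python
# Minimal sketch of the fetch step: download a page and reduce it to the
# visible text that the classifier will consume. (Illustrative only; the
# production crawler was a customized Apache Nutch.)
import requests
from bs4 import BeautifulSoup

def fetch_text(url):
    """Fetch a URL and return its visible text, or None on failure."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    soup = BeautifulSoup(resp.text, "html.parser")
    return soup.get_text(separator=" ", strip=True)

print(fetch_text("https://example.com"))  # hypothetical URL
```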

    It then categorizes the text into pre-defined categories such as sports, arts & entertainment, family & parenting, etc., using a classifier based on matrix factorization techniques.

    This method models the latent structure of a given collection of texts and finds semantically related documents.

    The scope of this project required that we categorize Spanish-language documents as well. Thus, we built an additional model along the same lines.
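    A minimal sketch of such a matrix-factorization classifier is shown below: TF-IDF vectors are factorized with non-negative matrix factorization (NMF) to expose latent topics, and a new page is assigned to the predefined category whose seed example lies closest in that latent space. The actual solution used Mahout and R; the scikit-learn calls and the tiny category seed documents here are assumptions for illustration.

```python
# Sketch: matrix-factorization-based text categorization.
# TF-IDF -> NMF latent topics -> nearest predefined category by cosine similarity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical seed documents, one per predefined category.
seed_docs = {
    "sports": "live scores league match players team win goal",
    "arts & entertainment": "movie premiere music concert celebrity film show",
    "family & parenting": "baby toddler school parenting advice kids family",
}
labels = list(seed_docs)

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(seed_docs.values())

nmf = NMF(n_components=3, max_iter=500, random_state=0)
category_topics = nmf.fit_transform(tfidf)  # one latent vector per category seed

def classify(page_text):
    """Return the predefined category closest to the page in latent-topic space."""
    page_topics = nmf.transform(vectorizer.transform([page_text]))
    sims = cosine_similarity(page_topics, category_topics)[0]
    return labels[int(np.argmax(sims))]

print(classify("the team scored twice to win the final match"))  # expected: 'sports'
```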

    The above procedure yields the following:

    The number of times a user visits a particular category of sites within a given time span.

    The vectors (representing each user's behavior) are then passed to a clustering algorithm.

    The algorithm then clusters together the users with similar browsing behavior.
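    The clustering step can be pictured with the following sketch, which groups per-user category-visit vectors with k-means. Mahout was used in the actual solution; the scikit-learn KMeans call and the sample counts are assumptions made for illustration.

```python
# Sketch: segment users by clustering their per-category visit counts.
import numpy as np
from sklearn.cluster import KMeans

# Rows: users. Columns: visits per category (sports, arts & entertainment,
# family & parenting) within the chosen time span. Numbers are hypothetical.
user_vectors = np.array([
    [12, 1, 0],   # mostly sports
    [10, 2, 1],
    [0, 9, 8],    # entertainment and parenting
    [1, 11, 7],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(user_vectors)
print(kmeans.labels_)  # users with similar browsing behavior share a label
```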

    The system then leverages a dataset containing the user's IP, browser ID, and browser type, and drops the relevant ad into the user's browser.

    The system analyzes the co-occurrence pattern of the IP-Browser-ID pair to find the strength of association of an IP with the browser.
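    One simple way to picture this co-occurrence analysis is the sketch below, which scores each (IP, browser ID) pair by its share of that IP's total events; the scoring rule and the sample log records are assumptions for illustration, not the exact production method.

```python
# Sketch: score the association between an IP and a browser ID by how often
# the pair co-occurs relative to all events seen for that IP.
from collections import Counter, defaultdict

events = [  # hypothetical (ip, browser_id) log records
    ("203.0.113.7", "brw-aaa"),
    ("203.0.113.7", "brw-aaa"),
    ("203.0.113.7", "brw-bbb"),
    ("198.51.100.2", "brw-ccc"),
]

pair_counts = Counter(events)
ip_totals = defaultdict(int)
for (ip, _), count in pair_counts.items():
    ip_totals[ip] += count

association = {pair: count / ip_totals[pair[0]] for pair, count in pair_counts.items()}
for (ip, browser_id), score in sorted(association.items()):
    print(f"{ip} <-> {browser_id}: {score:.2f}")
```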

    A targeted ad is delivered to the end user based on their browsing pattern.

    © 2015 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies.

    June 2015

    About Impetus

    Impetus is focused on creating big business impact through Big Data Solutions for Fortune 1000 enterprises across multiple verticals. The company brings together a unique mix of software products, consulting services, Data Science capabilities and technology expertise. It offers full life-cycle services for Big Data implementations and real-time streaming analytics, including technology strategy, solution architecture, proof of concept, production implementation and on-going support to its clients.

    To learn more, visit www.impetus.com or write to us at [email protected].